build

package
v0.0.0-...-78728ec Latest
Warning

This package is not in the latest version of its module.

Published: Jun 16, 2026 License: AGPL-3.0 Imports: 34 Imported by: 0

Documentation

Overview

Package build is the indexer orchestrator: discover → parse → chunk → embed → store. It lives outside cmd/ckv so that it stays testable as a library (`ckv build` becomes a thin Cobra wrapper) and so that the future CKS integration can call Run() directly.

Index

Constants

This section is empty.

Variables

var ErrEmbedderMismatch = errors.New("reindex: embedder identity does not match manifest")

ErrEmbedderMismatch signals that the embedder passed to Reindex does not match the embedder recorded in the manifest. Proceeding would mix embeddings from two different models in the same store, which breaks retrieval. The caller must either use the original embedder or run a full `ckv build` to replace the index.

var ErrNoManifest = errors.New("reindex: no manifest at OutDir — run `ckv build` first")

ErrNoManifest signals that ReindexOptions.OutDir has no prior index. Reindex needs a baseline IndexedHead to diff against; the caller should run `ckv build` first.
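A caller would typically branch on these two sentinels with errors.Is. The sketch below is self-contained, so it declares local stand-ins (errNoManifest, errEmbedderMismatch) rather than importing the package; nextStep is a hypothetical helper, not part of this API.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for build.ErrNoManifest and build.ErrEmbedderMismatch so the
// sketch compiles on its own; real code would reference the package variables.
var (
	errNoManifest       = errors.New("reindex: no manifest at OutDir")
	errEmbedderMismatch = errors.New("reindex: embedder identity does not match manifest")
)

// nextStep is a hypothetical helper that maps a Reindex error to a recovery hint.
func nextStep(err error) string {
	switch {
	case err == nil:
		return "done"
	case errors.Is(err, errNoManifest):
		return "run a full build first"
	case errors.Is(err, errEmbedderMismatch):
		return "use the original embedder or run a full build"
	default:
		return "report error"
	}
}

func main() {
	// errors.Is walks the %w chain, so a wrapped sentinel still matches.
	wrapped := fmt.Errorf("nightly job: %w", errNoManifest)
	fmt.Println(nextStep(wrapped))
}
```

Using errors.Is instead of == keeps the check working whether or not the error is wrapped on its way up to the caller.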

Functions

func FetchMergedPRs

func FetchMergedPRs(ctx context.Context, srcRoot string, opts PRFetchOptions) ([]prdoc.PRMeta, error)

FetchMergedPRs calls `gh pr list` to get merged PRs, then `gh pr view` for each one to get its body and commits. It returns parsed PRMeta values ready for prdoc.Parse. Requires an authenticated `gh` CLI.

func PreCheckByEstimate

func PreCheckByEstimate(rawNeedMB uint64, w io.Writer) error

PreCheckByEstimate verifies host memory headroom against an already-known RAM estimate in MB. Intended for the CLI layer, where the embedder isn't constructed yet but the estimate is recoverable from model config (e.g. bgeonnx.EstimatedRAMMB(opts)).

rawNeedMB == 0 disables the check (the caller doesn't know the requirement, so the guard fails open). Setting CKV_MEM_GUARD=off or a missing MemProvider also disables it.
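The fail-open rules can be sketched as follows. This is not the package's implementation: preCheck and fixedMem are hypothetical names, the types are local copies of MemStat and MemProvider, and the "available < need" comparison and the handling of read errors are assumptions, since the documentation does not state the exact threshold.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// Local copies of the documented types so the sketch is self-contained.
type MemStat struct {
	TotalMB     uint64
	AvailableMB uint64
}

type MemProvider interface {
	Read() (MemStat, error)
}

// fixedMem is a test double that always reports the same snapshot.
type fixedMem MemStat

func (f fixedMem) Read() (MemStat, error) { return MemStat(f), nil }

// preCheck is a hypothetical stand-in for the guard. The three fail-open
// cases come from the documentation; the comparison itself is an assumption.
func preCheck(needMB uint64, p MemProvider, guardEnv string, w io.Writer) error {
	if needMB == 0 || guardEnv == "off" || p == nil {
		return nil // fail-open: nothing to check against
	}
	st, err := p.Read()
	if err != nil {
		// Assumption: an unreadable memory stat warns but does not block.
		fmt.Fprintf(w, "memory guard: cannot read host memory: %v\n", err)
		return nil
	}
	if st.AvailableMB < needMB {
		return fmt.Errorf("need ~%d MB, only %d MB available", needMB, st.AvailableMB)
	}
	return nil
}

func main() {
	p := fixedMem{TotalMB: 16384, AvailableMB: 2048}
	fmt.Println(preCheck(4096, p, os.Getenv("CKV_MEM_GUARD"), os.Stderr))
}
```

Passing the provider and environment value as arguments keeps the sketch testable; the real package reads them from defaultMemProvider and the process environment.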

Types

type MemProvider

type MemProvider interface {
	Read() (MemStat, error)
}

MemProvider is the OS-specific memory reader. memory_<goos>.go files set defaultMemProvider in init(); tests inject mocks via the same variable.

type MemStat

type MemStat struct {
	TotalMB     uint64
	AvailableMB uint64
}

MemStat is a snapshot of host memory in MB.

type Options

type Options struct {
	SrcRoot   string
	OutDir    string
	Embedder  types.Embedder // required
	CKVIgnore []string       // extra ignore patterns from --ckvignore CLI flag
	BatchSize int            // embedding batch size; 0 → 32
	Now       func() time.Time
	Footprint *footprint.Logger // optional; nil → no logging
	// ProgressOut receives human-readable per-file progress lines.
	// nil disables progress entirely (the library-mode default so
	// embedded callers don't get surprise stderr writes). The CLI
	// sets this to os.Stderr; tests can inject a bytes.Buffer.
	ProgressOut io.Writer

	// DisableContextualPrefix turns off the rule-based contextual prefix.
	// The default (zero value, prefix on) prepends a one-line
	// "language: X. file: Y. symbol: Z." sentence to each chunk's embed
	// text — improving recall@1 on natural-language queries at ~5%
	// throughput cost. Disable for A/B measurement against the raw-text
	// baseline. Chunk IDs and the stored Text are unaffected either way.
	DisableContextualPrefix bool

	// PR corpus. When non-nil, the build fetches merged PRs
	// via `gh` CLI and indexes their descriptions + commit messages
	// as additional chunks alongside the source code.
	PRFetch *PRFetchOptions

	// PolicyPath is the path to a policy yaml (e.g. policy/stablenet.yaml).
	// When set and the file exists, every emitted chunk is annotated with
	// Category + ModificationGuidance based on its path. Empty disables
	// classification — chunks ship with Category="" and Guidance=nil.
	PolicyPath string

	// DocsRoots are extra directories walked for markdown AFTER SrcRoot.
	// Files found here are tagged Category="domain" and cited by their
	// path relative to the docs root. Used to embed an out-of-tree curated
	// corpus (the cks domain-knowledge entries + authoritative docs) in
	// the same index. These roots are not git repos, so chunks carry no
	// commit hash.
	DocsRoots []string

	// CKGPath is the path to a CKG data directory (containing graph.db).
	// When set, the builder loads an in-memory (file_path, start_line)
	// index from ckg and resolves each emitted source chunk's CKGNodeID
	// via ckgalign.Lookup — the 1:1 alignment that cks composer uses to
	// disambiguate same-named symbols across packages. Empty disables
	// alignment (CKGNodeID stays ""). Docs-corpus chunks are NOT aligned
	// (ckg has no node for curated markdown). Open failures abort the
	// build with a clear error rather than silently skipping alignment.
	CKGPath string

	// FilesFromPath is the path to a JSON file with include/exclude glob
	// patterns. When set, only files whose repo-relative path passes the
	// filterlist.FilterList.Allow check are sent to the embedder — for
	// ALL languages (Go, Solidity, TypeScript, JavaScript, Markdown).
	// Empty string (the default) disables the allowlist: all discovered
	// files are eligible as before. See internal/filterlist for the JSON
	// schema: {"include": [...globs...], "exclude": [...globs...]}.
	FilesFromPath string
}

Options carries the CLI/programmatic configuration. SrcRoot, OutDir, and Embedder are required; everything else has a documented default.
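To make the DisableContextualPrefix field concrete, here is a minimal sketch of how a rule-based prefix could be applied. The chunk type and embedText function are hypothetical, and the newline between the prefix sentence and the body is an assumption; only the "language: X. file: Y. symbol: Z." shape comes from the field comment.

```go
package main

import "fmt"

// chunk is a hypothetical minimal chunk; the real type lives elsewhere in ckv.
type chunk struct {
	Language string
	File     string
	Symbol   string
	Text     string
}

// embedText returns the string that would be sent to the embedder.
// The stored Text is never modified; only the embed input changes.
func embedText(c chunk, disablePrefix bool) string {
	if disablePrefix {
		return c.Text // raw-text baseline for A/B measurement
	}
	return fmt.Sprintf("language: %s. file: %s. symbol: %s.\n%s",
		c.Language, c.File, c.Symbol, c.Text)
}

func main() {
	c := chunk{Language: "go", File: "build/run.go", Symbol: "Run", Text: "func Run() {}"}
	fmt.Println(embedText(c, false)) // prefixed embed input
	fmt.Println(embedText(c, true))  // identical to c.Text
}
```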

type PRFetchOptions

type PRFetchOptions struct {
	Repo  string    // "owner/repo"; inferred from git remote if empty
	Since time.Time // only PRs merged after this date
	Limit int       // max PRs to fetch; 0 → 100
}

PRFetchOptions controls which PRs to fetch for corpus indexing.
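One way these options could translate into a `gh pr list` invocation is sketched below. The exact arguments the package passes are not documented here, so ghListArgs, the chosen JSON fields, and the use of a `merged:>` search qualifier for Since are all assumptions; only the Limit default of 100 comes from the struct comment.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// Local copy of the documented struct so the sketch is self-contained.
type PRFetchOptions struct {
	Repo  string
	Since time.Time
	Limit int
}

// ghListArgs is a hypothetical helper that builds arguments for `gh pr list`.
func ghListArgs(o PRFetchOptions) []string {
	limit := o.Limit
	if limit == 0 {
		limit = 100 // documented default
	}
	args := []string{
		"pr", "list",
		"--state", "merged",
		"--limit", strconv.Itoa(limit),
		"--json", "number,title,mergedAt",
	}
	if o.Repo != "" {
		// When Repo is empty, gh infers the repository from the git remote.
		args = append(args, "--repo", o.Repo)
	}
	if !o.Since.IsZero() {
		// Assumption: filter by merge date with a GitHub search qualifier.
		args = append(args, "--search", "merged:>"+o.Since.Format("2006-01-02"))
	}
	return args
}

func main() {
	since := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	fmt.Println(ghListArgs(PRFetchOptions{Repo: "owner/repo", Since: since}))
}
```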

type PRTagger

type PRTagger interface {
	UpdateRecentPRs(ctx context.Context, filePRs map[string][]types.PRRef) (int, error)
}

PRTagger can tag source chunks with PR breadcrumbs. Implemented by sqlitevec.Store.

type ReindexOptions

type ReindexOptions struct {
	SrcRoot   string
	OutDir    string
	Embedder  types.Embedder // must match the embedder identity in the manifest
	CKVIgnore []string       // extra ignore patterns from --ckvignore CLI flag

	// Since is the commit the diff is computed against. Empty means
	// "use manifest.IndexedHead" (the common case). Pass a specific
	// revision to override (e.g., "main~5") for a catch-up reindex.
	Since string

	// Files, when non-empty, bypasses the git diff and forces reindex
	// of exactly these src-relative paths. Useful when the caller
	// already knows the change set (CI hook, fsnotify watcher) or when
	// reindexing files that aren't yet committed.
	Files []string

	BatchSize int               // 0 → defaultBatch (32)
	Now       func() time.Time  // nil → time.Now
	Footprint *footprint.Logger // optional; nil → no logging

	// ProgressOut receives human-readable per-file progress lines.
	// nil disables progress entirely (library-mode default).
	ProgressOut io.Writer

	// DisableContextualPrefix mirrors Options.DisableContextualPrefix
	// for the reindex path so partial rebuilds match what the original
	// build produced. Keep both at the same value across build+reindex
	// — mixing prefixed and raw embeddings in one store would degrade
	// retrieval.
	DisableContextualPrefix bool

	// PolicyPath mirrors Options.PolicyPath. Reindexed chunks pass
	// through the policy loader so category/guidance stay current with
	// the yaml even when only some files change.
	PolicyPath string
}

ReindexOptions configures a partial rebuild. SrcRoot, OutDir, and Embedder are required; everything else has a documented default.
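The documented defaults for BatchSize and Now can be pictured with the sketch below. reindexOpts is a hypothetical two-field subset of ReindexOptions, and applyDefaults and batches are illustrative helpers rather than functions exported by this package.

```go
package main

import (
	"fmt"
	"time"
)

const defaultBatch = 32 // documented default for BatchSize

// reindexOpts is a hypothetical subset of ReindexOptions.
type reindexOpts struct {
	BatchSize int
	Now       func() time.Time
}

// applyDefaults fills zero values with the documented defaults.
func (o *reindexOpts) applyDefaults() {
	if o.BatchSize == 0 {
		o.BatchSize = defaultBatch
	}
	if o.Now == nil {
		o.Now = time.Now
	}
}

// batches splits items into consecutive slices of at most size elements.
func batches(items []string, size int) [][]string {
	if size <= 0 {
		size = defaultBatch
	}
	var out [][]string
	for len(items) > 0 {
		n := size
		if len(items) < n {
			n = len(items)
		}
		out = append(out, items[:n])
		items = items[n:]
	}
	return out
}

func main() {
	// A pinned clock makes the BuiltAt timestamp deterministic in tests.
	o := reindexOpts{Now: func() time.Time { return time.Date(2026, 1, 2, 3, 4, 5, 0, time.UTC) }}
	o.applyDefaults()
	fmt.Println(o.BatchSize, o.Now().Format(time.RFC3339))
	fmt.Println(len(batches(make([]string, 70), o.BatchSize)))
}
```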

type ReindexResult

type ReindexResult struct {
	// FilesProcessed is the count of files actually re-embedded
	// (added + modified). Deletions don't count here.
	FilesProcessed int
	// FilesAdded / FilesModified / FilesDeleted partition the changed
	// set by git diff status so callers can report a useful summary.
	FilesAdded    int
	FilesModified int
	FilesDeleted  int
	// FilesSkipped is files in the diff that didn't match any parser
	// (e.g., changed README.txt is in the diff but ckv doesn't index
	// .txt today). Surfaced so users know the diff size != reindex size.
	FilesSkipped int
	// Chunks aggregates chunk.Stats across every file processed.
	Chunks chunk.Stats
	// PrevHead and NewHead bracket the reindex range.
	PrevHead string
	NewHead  string
	BuiltAt  string
	DBPath   string
}

ReindexResult is what Reindex returns to the caller.

func Reindex

func Reindex(ctx context.Context, o ReindexOptions) (*ReindexResult, error)

Reindex re-embeds only the files that changed between the manifest's IndexedHead (or ReindexOptions.Since) and the source tree's current git HEAD. Idempotent: re-running with no changes is a no-op apart from refreshing the manifest's BuiltAt timestamp.

Pipeline:

  1. Load manifest → get PrevHead + verify Embedder identity.
  2. Compute the change set:
     - if ReindexOptions.Files is set, use it verbatim;
     - else `git diff --name-status PrevHead..HEAD` partitions paths into added / modified / deleted (renames split into delete+add).
  3. For deletions: store.DeleteByFile.
  4. For adds + modifications: parse → chunk → DeleteByFile (idempotent for adds) → embed → upsert.
  5. Update manifest IndexedHead and BuiltAt.

Files outside the supported language set are skipped without error and counted in FilesSkipped, so a diff that touches both docs/ markdown and Go files works without manual filtering.
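Step 2 of the pipeline above can be sketched as a small parser for `git diff --name-status` output, which emits one tab-separated record per path, with renames carrying a score and two paths (for example `R100<TAB>old<TAB>new`). The changeSet type and partition function are hypothetical names, and ignoring other status letters such as C or T is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// changeSet is a hypothetical container for the partitioned paths.
type changeSet struct {
	Added    []string
	Modified []string
	Deleted  []string
}

// partition parses `git diff --name-status` output. A rename is split into
// a delete of the old path plus an add of the new one.
func partition(out string) changeSet {
	var cs changeSet
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		f := strings.Split(line, "\t")
		if len(f) < 2 || f[0] == "" {
			continue // blank or malformed record
		}
		switch f[0][0] {
		case 'A':
			cs.Added = append(cs.Added, f[1])
		case 'M':
			cs.Modified = append(cs.Modified, f[1])
		case 'D':
			cs.Deleted = append(cs.Deleted, f[1])
		case 'R':
			if len(f) >= 3 {
				cs.Deleted = append(cs.Deleted, f[1])
				cs.Added = append(cs.Added, f[2])
			}
		}
	}
	return cs
}

func main() {
	out := "A\tnew.go\nM\tmod.go\nD\told.go\nR100\ta.go\tb.go\n"
	fmt.Printf("%+v\n", partition(out))
}
```

Splitting renames this way means the old path's chunks are removed through the deletion step while the new path goes through the normal parse, chunk, and embed steps.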

type Result

type Result struct {
	FilesIndexed int
	Chunks       chunk.Stats
	IndexedHead  string
	BuiltAt      string
	DBPath       string
}

Result is what Run returns to the CLI for the summary log.

func Run

func Run(ctx context.Context, o Options) (*Result, error)

Run executes the full indexing pipeline once. Idempotent: re-running against the same OutDir updates chunks in place (Upsert semantics).

Pipeline:

  1. Detect git HEAD of SrcRoot (for citation.commit_hash).
  2. Walk SrcRoot via discover; skip non-source / oversized / ignored.
  3. For each discovered source file: parse → chunk → embed → upsert.
  4. Write manifest.json + DB-side manifest table.
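The idempotency claim rests on upsert semantics: writing a chunk under an ID that already exists replaces it instead of adding a duplicate. The sketch below illustrates this with a hypothetical in-memory store (memStore); the "path#index" chunk ID scheme is an assumption made for the example and is not how ckv necessarily derives IDs.

```go
package main

import "fmt"

// memStore is a hypothetical in-memory stand-in for the vector store,
// mapping chunk ID to chunk text.
type memStore map[string]string

// upsert inserts the chunk or overwrites an existing one with the same ID.
func (s memStore) upsert(id, text string) { s[id] = text }

// indexFile writes every chunk of one file under a stable ID.
func indexFile(s memStore, path string, chunks []string) {
	for i, c := range chunks {
		s.upsert(fmt.Sprintf("%s#%d", path, i), c)
	}
}

func main() {
	s := memStore{}
	chunks := []string{"func A() {}", "func B() {}"}
	indexFile(s, "a.go", chunks)
	indexFile(s, "a.go", chunks) // second run updates in place
	fmt.Println(len(s))
}
```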
