Documentation ¶
Overview ¶
Package build is the indexer orchestrator: discover → parse → chunk → embed → store. It lives outside cmd/ckv so that it stays testable as a library (`ckv build` becomes a thin Cobra wrapper) and so that the future CKS integration can call Run() directly.
Index ¶
- Variables
- func FetchMergedPRs(ctx context.Context, srcRoot string, opts PRFetchOptions) ([]prdoc.PRMeta, error)
- func PreCheckByEstimate(rawNeedMB uint64, w io.Writer) error
- type MemProvider
- type MemStat
- type Options
- type PRFetchOptions
- type PRTagger
- type ReindexOptions
- type ReindexResult
- type Result
Constants ¶
This section is empty.
Variables ¶
var ErrEmbedderMismatch = errors.New("reindex: embedder identity does not match manifest")
ErrEmbedderMismatch signals that the embedder passed to Reindex does not match the embedder recorded in the manifest. Proceeding would mix embeddings from two different models in the same store, which breaks retrieval. The caller must either use the original embedder or run a full `ckv build` to replace the index.
var ErrNoManifest = errors.New("reindex: no manifest at OutDir — run `ckv build` first")
ErrNoManifest signals that ReindexOptions.OutDir has no prior index. Reindex needs a baseline IndexedHead to diff against; the caller should run `ckv build` first.
Functions ¶
func FetchMergedPRs ¶
func FetchMergedPRs(ctx context.Context, srcRoot string, opts PRFetchOptions) ([]prdoc.PRMeta, error)
FetchMergedPRs calls `gh pr list` to get merged PRs, then `gh pr view` for each one to get its body and commits. It returns parsed PRMeta values ready for prdoc.Parse. Requires an authenticated `gh` CLI.
func PreCheckByEstimate ¶
func PreCheckByEstimate(rawNeedMB uint64, w io.Writer) error
PreCheckByEstimate verifies host memory headroom against an already-known RAM estimate in MB. Intended for the CLI layer, where the embedder isn't constructed yet but the estimate is recoverable from model config (e.g. bgeonnx.EstimatedRAMMB(opts)).
rawNeedMB == 0 disables the check (the caller doesn't know the estimate, so the guard fails open). Setting CKV_MEM_GUARD=off or having no MemProvider also disables it.
Types ¶
type MemProvider ¶
MemProvider is the OS-specific memory reader. memory_<goos>.go files set defaultMemProvider in init(); tests inject mocks via the same variable.
type Options ¶
type Options struct {
SrcRoot string
OutDir string
Embedder types.Embedder // required
CKVIgnore []string // extra ignore patterns from --ckvignore CLI flag
BatchSize int // embedding batch size; 0 → 32
Now func() time.Time
Footprint *footprint.Logger // optional; nil → no logging
// ProgressOut receives human-readable per-file progress lines.
// nil disables progress entirely (the library-mode default so
// embedded callers don't get surprise stderr writes). The CLI
// sets this to os.Stderr; tests can inject a bytes.Buffer.
ProgressOut io.Writer
// DisableContextualPrefix turns off the rule-based contextual prefix.
// The default (zero value, prefix on) prepends a one-line
// "language: X. file: Y. symbol: Z." sentence to each chunk's embed
// text — improving recall@1 on natural-language queries at ~5%
// throughput cost. Disable for A/B measurement against the raw-text
// baseline. Chunk IDs and the stored Text are unaffected either way.
DisableContextualPrefix bool
// PRFetch enables the PR corpus. When non-nil, the build fetches merged PRs
// via `gh` CLI and indexes their descriptions + commit messages
// as additional chunks alongside the source code.
PRFetch *PRFetchOptions
// PolicyPath is the path to a policy yaml (e.g. policy/stablenet.yaml).
// When set and the file exists, every emitted chunk is annotated with
// Category + ModificationGuidance based on its path. Empty disables
// classification — chunks ship with Category="" and Guidance=nil.
PolicyPath string
// DocsRoots are extra directories walked for markdown AFTER SrcRoot.
// Files found here are tagged Category="domain" and cited by their
// path relative to the docs root. Used to embed an out-of-tree curated
// corpus (the cks domain-knowledge entries + authoritative docs) in
// the same index. These roots are not git repos, so chunks carry no
// commit hash.
DocsRoots []string
// CKGPath is the path to a CKG data directory (containing graph.db).
// When set, the builder loads an in-memory (file_path, start_line)
// index from ckg and resolves each emitted source chunk's CKGNodeID
// via ckgalign.Lookup — the 1:1 alignment that cks composer uses to
// disambiguate same-named symbols across packages. Empty disables
// alignment (CKGNodeID stays ""). Docs-corpus chunks are NOT aligned
// (ckg has no node for curated markdown). Open failures abort the
// build with a clear error rather than silently skipping alignment.
CKGPath string
// FilesFromPath is the path to a JSON file with include/exclude glob
// patterns. When set, only files whose repo-relative path passes the
// filterlist.FilterList.Allow check are sent to the embedder — for
// ALL languages (Go, Solidity, TypeScript, JavaScript, Markdown).
// Empty string (the default) disables the allowlist: all discovered
// files are eligible as before. See internal/filterlist for the JSON
// schema: {"include": [...globs...], "exclude": [...globs...]}.
FilesFromPath string
}
Options carries the CLI/programmatic configuration. SrcRoot, OutDir, and Embedder are required; everything else has a documented default.
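To make the FilesFromPath schema concrete, here is a minimal sketch of an include/exclude allowlist over repo-relative paths. It is not the internal/filterlist implementation: the filterList type and allow method are hypothetical, matching uses the standard library's path.Match (which has no `**` support), and treating an empty include list as "allow everything" is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
)

// filterList mirrors the documented JSON shape:
// {"include": [...globs...], "exclude": [...globs...]}.
type filterList struct {
	Include []string `json:"include"`
	Exclude []string `json:"exclude"`
}

// matchAny reports whether rel matches at least one glob pattern.
// Malformed patterns are ignored rather than treated as matches.
func matchAny(patterns []string, rel string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, rel); err == nil && ok {
			return true
		}
	}
	return false
}

// allow applies exclude first, then include. An empty include list is
// treated as "everything not excluded" (an assumption of this sketch).
func (f filterList) allow(rel string) bool {
	if matchAny(f.Exclude, rel) {
		return false
	}
	return len(f.Include) == 0 || matchAny(f.Include, rel)
}

func main() {
	raw := []byte(`{"include": ["core/*.go", "docs/*.md"], "exclude": ["core/*_test.go"]}`)
	var f filterList
	if err := json.Unmarshal(raw, &f); err != nil {
		panic(err)
	}
	for _, p := range []string{"core/state.go", "core/state_test.go", "cmd/main.go"} {
		fmt.Println(p, f.allow(p))
	}
}
```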
type PRFetchOptions ¶
type PRFetchOptions struct {
Repo string // "owner/repo"; inferred from git remote if empty
Since time.Time // only PRs merged after this date
Limit int // max PRs to fetch; 0 → 100
}
PRFetchOptions controls which PRs to fetch for corpus indexing.
type PRTagger ¶
type PRTagger interface {
UpdateRecentPRs(ctx context.Context, filePRs map[string][]types.PRRef) (int, error)
}
PRTagger can tag source chunks with PR breadcrumbs. Implemented by sqlitevec.Store.
type ReindexOptions ¶
type ReindexOptions struct {
SrcRoot string
OutDir string
Embedder types.Embedder // must match the embedder identity in the manifest
CKVIgnore []string // extra ignore patterns from --ckvignore CLI flag
// Since is the revision the diff is computed against. Empty means
// "use manifest.IndexedHead" (the common case). Pass a specific
// revision to override (e.g., a SHA or "main~5") for catch-up reindex.
Since string
// Files, when non-empty, bypasses the git diff and forces reindex
// of exactly these src-relative paths. Useful when the caller
// already knows the change set (CI hook, fsnotify watcher) or when
// reindexing files that aren't yet committed.
Files []string
BatchSize int // 0 → defaultBatch (32)
Now func() time.Time // nil → time.Now
Footprint *footprint.Logger // optional; nil → no logging
// ProgressOut receives human-readable per-file progress lines.
// nil disables progress entirely (library-mode default).
ProgressOut io.Writer
// DisableContextualPrefix mirrors Options.DisableContextualPrefix
// for the reindex path so partial rebuilds match what the original
// build produced. Keep both at the same value across build+reindex
// — mixing prefixed and raw embeddings in one store would degrade
// retrieval.
DisableContextualPrefix bool
// PolicyPath mirrors Options.PolicyPath. Reindexed chunks pass
// through the policy loader so category/guidance stay current with
// the yaml even when only some files change.
PolicyPath string
}
ReindexOptions configures a partial rebuild. SrcRoot, OutDir, and Embedder are required; everything else has a documented default.
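The contextual prefix controlled by DisableContextualPrefix can be pictured with the sketch below. The sentence format comes from the Options field comment; the chunk type, the embedText helper, and the newline separator between prefix and body are assumptions. The key point is that only the text sent to the embedder changes, while the stored Text stays raw.

```go
package main

import "fmt"

// chunk is a hypothetical, reduced view of an indexed chunk.
type chunk struct {
	Language string
	File     string
	Symbol   string
	Text     string // stored text; never modified by the prefix
}

// embedText returns the string handed to the embedder. With the prefix
// enabled it prepends the one-line "language: X. file: Y. symbol: Z."
// sentence; with it disabled it returns the raw text.
func embedText(c chunk, disablePrefix bool) string {
	if disablePrefix {
		return c.Text
	}
	return fmt.Sprintf("language: %s. file: %s. symbol: %s.\n%s",
		c.Language, c.File, c.Symbol, c.Text)
}

func main() {
	c := chunk{Language: "go", File: "core/state.go", Symbol: "Commit", Text: "func Commit() error { return nil }"}
	fmt.Println(embedText(c, false))
	fmt.Println(embedText(c, true))
}
```

Because the two modes produce different embedder inputs for the same chunk, a build and a later reindex need the same setting to keep vectors comparable.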
type ReindexResult ¶
type ReindexResult struct {
// FilesProcessed is the count of files actually re-embedded
// (added + modified). Deletions don't count here.
FilesProcessed int
// FilesAdded / FilesModified / FilesDeleted partition the changed
// set by git diff status so callers can report a useful summary.
FilesAdded int
FilesModified int
FilesDeleted int
// FilesSkipped is files in the diff that didn't match any parser
// (e.g., changed README.txt is in the diff but ckv doesn't index
// .txt today). Surfaced so users know the diff size != reindex size.
FilesSkipped int
// Chunks aggregates chunk.Stats across every file processed.
Chunks chunk.Stats
// PrevHead and NewHead bracket the reindex range.
PrevHead string
NewHead string
BuiltAt string
DBPath string
}
ReindexResult is what Reindex returns to the caller.
func Reindex ¶
func Reindex(ctx context.Context, o ReindexOptions) (*ReindexResult, error)
Reindex re-embeds only the files that changed between the manifest's IndexedHead (or ReindexOptions.Since) and the source tree's current git HEAD. Idempotent: re-running with no changes is a no-op except for updating the manifest's BuiltAt timestamp.
Pipeline:
- Load manifest → get PrevHead + verify Embedder identity.
- Compute the change set:
  - if ReindexOptions.Files is set, use it verbatim;
  - else `git diff --name-status PrevHead..HEAD` partitions paths into added / modified / deleted (renames split into delete+add).
- For deletions: store.DeleteByFile.
- For adds + modifications: parse → chunk → DeleteByFile (idempotent for adds) → embed → upsert.
- Update manifest IndexedHead and BuiltAt.
Files that fall outside the supported language set are skipped without error (and counted in FilesSkipped), so a diff that touches docs/ markdown and Go files works without manual filtering.
type Result ¶
type Result struct {
FilesIndexed int
Chunks chunk.Stats
IndexedHead string
BuiltAt string
DBPath string
}
Result is what Run returns to the CLI for the summary log.
func Run ¶
Run executes the full indexing pipeline once. Idempotent: re-running against the same OutDir updates chunks in place (Upsert semantics).
Pipeline:
- Detect git HEAD of SrcRoot (for citation.commit_hash).
- Walk SrcRoot via discover; skip non-source / oversized / ignored.
- For each Go file: parse → chunk → embed → upsert.
- Write manifest.json + DB-side manifest table.
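The embed step of the pipeline above works in batches sized by Options.BatchSize, with 0 meaning 32. A minimal sketch of that batching follows; the batches helper is hypothetical, operates on plain strings rather than real chunk values, and treats negative sizes like 0, which is an assumption.

```go
package main

import "fmt"

// defaultBatch is the documented fallback when BatchSize is 0.
const defaultBatch = 32

// batches splits items into consecutive slices of at most size elements.
// The returned slices share the backing array of items.
func batches(items []string, size int) [][]string {
	if size <= 0 {
		size = defaultBatch
	}
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	texts := make([]string, 70)
	for i := range texts {
		texts[i] = fmt.Sprintf("chunk-%d", i)
	}
	for i, b := range batches(texts, 0) {
		fmt.Printf("batch %d: %d items\n", i, len(b))
	}
}
```

Each batch would then be passed to the embedder in one call and the resulting vectors upserted into the store.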