Documentation ¶
Overview ¶
BM25 additive scoring on top of the MinHash codegraph. Layered on the existing Jaccard ranking so both signals are returned per result and the caller (or future rank-fusion code) can reason about them independently.
BM25 addresses SPEC §8.2 Issue #2 (ubiquitous noise) at the scoring level rather than the token-filtering level: common tokens like "get" or "error" get low IDF weight and contribute little to the score even when they're shared between query and symbol. Rare tokens like "postgresql" or "kubernetes" dominate the score when they match.
Unlike the stopwords filter, BM25 is SYMMETRIC by design: a token's IDF is the same whether it shows up in a query or a symbol, so scoring is reciprocal and doesn't need asymmetric filtering logic.
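The effect of IDF on common versus rare tokens can be sketched as follows. This is a minimal illustration, not the package's code: it assumes the widely used "plus one" BM25 IDF variant, and idfSketch plus the document counts in main are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// idfSketch computes a BM25-style inverse document frequency.
// ASSUMPTION: this is the common ln(1 + (N - df + 0.5) / (df + 0.5))
// variant; the formula actually used by the package is not documented here.
func idfSketch(totalDocs, docFreq int) float64 {
	n := float64(totalDocs)
	df := float64(docFreq)
	return math.Log(1 + (n-df+0.5)/(df+0.5))
}

func main() {
	// Hypothetical corpus: 50k symbols, "get" appears in most of them,
	// "postgresql" in only a handful.
	fmt.Printf("idf(get)        = %.3f\n", idfSketch(50000, 40000))
	fmt.Printf("idf(postgresql) = %.3f\n", idfSketch(50000, 12))
}
```

Because the weight depends only on corpus-wide document frequency, the same value applies whether the token came from the query or from the symbol, which is the symmetry described above.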
TypeScript parser backed by tree-sitter. Replaces GenericParser for .ts and .tsx files when celeste is built with CGo enabled. The regex-based GenericParser produced symbols that matched identifier shapes but could not resolve call-graph edges through TypeScript's type-aware method dispatch, leaving most TS interfaces with edgeCount=0 in the codegraph (documented in SPEC §8.2 and surfaced by the Task 19 ⚠ zero-edge warning). An AST-based parser sees the real call sites and writes the edges that were previously missing.
Scope for v2.0.0: TypeScript (.ts and .tsx) only. Python and Rust stay on the regex GenericParser for now — they aren't the validation target for this task and they have no zero-edge warnings in the bundled benchmark corpus.
CGo caveat: this file and its dependencies pull in tree-sitter's C runtime and the bundled TypeScript/TSX grammars. The //go:build cgo constraint at the top gates compilation on CGO_ENABLED=1. When cross-building release binaries from a Linux host for darwin/windows, the Go toolchain disables CGo implicitly and the stub in parser_ts_stub.go takes over. Stub builds fall back to the regex GenericParser for TypeScript files; they still work, just without the tree-sitter edge-resolution improvement. Users who want the full experience must either build from source with CGo enabled or wait for the v2.1.0 release workflow, which will cross-compile against a proper C toolchain (zig cc or a matrix of native runners).
Structural rerank layer.
After the Jaccard + BM25 fused ranking produces a preliminary order, this layer applies a scalar rescore that incorporates features the fusion doesn't see: how many query tokens actually matched the symbol's shingle set, how well-connected the symbol is in the call graph, and what KIND of symbol it is. The goal is to bubble the "obviously relevant" results above close-score ties where the fused order alone is ambiguous.
Pure Go, zero dependencies, zero cloud calls. All signals come from fields already present on SearchResult, so this layer has no effect on indexing latency and only a tiny reorder cost at query time.
The Reranker interface below is a deliberate seam: the current StructuralReranker is hand-tuned feature engineering, but a future EmbeddingReranker (local llama.cpp bridge, xAI embeddings endpoint, ONNX sentence-transformers, ...) can drop in at the same call site without touching the search pipeline.
Stopwords runtime integration — embeds stopwords.json at build time and exposes parsed lookup sets for ShinglesForSymbol and SemanticSearch to consume.
The embedded file is produced by celeste-stopwords and licensed under CC BY 4.0 — see stopwords_NOTICE.md for attribution.
Index ¶
- Constants
- func BytesToSeeds(data []byte) ([]uint64, error)
- func ComputeBM25Score(queryTokens []string, docTokens map[string]int, docLength int, ...) float64
- func ComputeFusedRanking(jaccardRanks, bm25Ranks map[int64]int) []int64
- func DefaultIndexPath(projectRoot string) string
- func DetectLanguage(filename string) string
- func DetectProjectLanguage(dir string) string
- func FormatConfidenceLine(r SearchResult) string
- func IsDemotable(flags []PathFlag) bool
- func IsIndexableFile(filename string) bool
- func JaccardSimilarity(a, b MinHashSignature) float64
- func PathFlagStrings(flags []PathFlag) []string
- func SeedsToBytes(seeds []uint64) []byte
- func ShinglesForSymbol(sym Symbol, source []byte, lang string) []string
- func ShouldSkipPath(path string) bool
- type BM25CorpusStats
- type CodeSmell
- type CodeSmellKind
- type Edge
- type EdgeKind
- type FileEdge
- type FileRecord
- type FunctionEdgeInfo
- type GenericParser
- type GitignoreFilter
- type GoParser
- type Indexer
- func (idx *Indexer) Build() error
- func (idx *Indexer) Close() error
- func (idx *Indexer) FindCodeSmells(kinds []CodeSmellKind, maxResults int, includeTests bool) ([]CodeSmell, error)
- func (idx *Indexer) FindLazyRedirects(maxResults int, includeTests bool) ([]LazyRedirectResult, error)
- func (idx *Indexer) KeywordSearch(query string, limit int) ([]Symbol, error)
- func (idx *Indexer) PackageGraph() ([]PackageInfo, []PackageEdge, error)
- func (idx *Indexer) ProjectSummary() string
- func (idx *Indexer) SemanticSearch(query string, topK int) ([]SearchResult, error)
- func (idx *Indexer) SemanticSearchWithOptions(query string, opts SemanticSearchOptions) ([]SearchResult, error)
- func (idx *Indexer) Stats() (*StoreStats, error)
- func (idx *Indexer) Store() *Store
- func (idx *Indexer) Update() error
- type LazyRedirectCandidate
- type LazyRedirectResult
- type MinHashEntry
- type MinHashSignature
- type MinHasher
- type PackageEdge
- type PackageInfo
- type ParseResult
- type PathFlag
- type RawEdge
- type Reranker
- type SearchResult
- type SemanticSearchOptions
- type StopWords
- type Store
- func (s *Store) AddEdge(sourceID, targetID int64, kind EdgeKind) error
- func (s *Store) Close() error
- func (s *Store) DeleteFile(path string) error
- func (s *Store) DeleteFileSymbols(file string) error
- func (s *Store) FindAllFunctionsWithEdges() ([]FunctionEdgeInfo, error)
- func (s *Store) FindLazyRedirectCandidates(includeTests bool) ([]LazyRedirectCandidate, error)
- func (s *Store) FindStubs(includeTests bool) ([]StubResult, error)
- func (s *Store) GetAllFiles() ([]FileRecord, error)
- func (s *Store) GetAllMinHashes() ([]MinHashEntry, error)
- func (s *Store) GetEdgesFrom(sourceID int64) ([]Edge, error)
- func (s *Store) GetEdgesTo(targetID int64) ([]Edge, error)
- func (s *Store) GetFile(path string) (*FileRecord, error)
- func (s *Store) GetFileGraph() ([]FileEdge, error)
- func (s *Store) GetIDFs(tokens []string) (map[string]float64, error)
- func (s *Store) GetMeta(key string) ([]byte, error)
- func (s *Store) GetMinHash(symbolID int64) (MinHashSignature, error)
- func (s *Store) GetPackageGraph() ([]PackageInfo, []PackageEdge, error)
- func (s *Store) GetSymbol(id int64) (*Symbol, error)
- func (s *Store) GetSymbolIDByName(name string) (int64, bool)
- func (s *Store) GetSymbolTokens(symbolID int64) (map[string]int, int, error)
- func (s *Store) GetSymbolsByFile(file string) ([]Symbol, error)
- func (s *Store) GetSymbolsByPackage(pkg string) ([]Symbol, error)
- func (s *Store) ReadBM25Stats() (*BM25CorpusStats, error)
- func (s *Store) RebuildTokenStats() (*BM25CorpusStats, error)
- func (s *Store) SearchSymbolsByName(query string) ([]Symbol, error)
- func (s *Store) SetMeta(key string, value []byte) error
- func (s *Store) Stats() (*StoreStats, error)
- func (s *Store) UpdateMinHash(symbolID int64, sig MinHashSignature) error
- func (s *Store) UpsertFile(f FileRecord) error
- func (s *Store) UpsertSymbol(sym Symbol) (int64, error)
- func (s *Store) UpsertSymbolTokens(symbolID int64, tokens []string) error
- type StoreStats
- type StructuralReranker
- type StubResult
- type Symbol
- type SymbolKind
- type TSParser
- type TokenStat
Constants ¶
const (
	// WarnDemotedTest — result was demoted because its path matches a
	// test directory or test filename suffix.
	WarnDemotedTest = "demoted: test path"

	// WarnDemotedMock — result was demoted because its path is in a
	// mocks/, fixtures/, or stubs/ directory.
	WarnDemotedMock = "demoted: mock path"

	// WarnDemotedDeclaration — result was demoted because the symbol
	// lives in a .d.ts / .d.mts declaration-only file. Useful for TS
	// consumers: declaration files describe API surfaces but have no
	// runtime code, so a matching declaration probably isn't what you
	// want when looking for implementations.
	WarnDemotedDeclaration = "demoted: declaration-only file"

	// WarnDemotedVendored — result was demoted because the symbol is
	// in a vendored third-party directory (vendor/, node_modules/,
	// third_party/).
	WarnDemotedVendored = "demoted: vendored code"

	// WarnDemotedGenerated — result was demoted because the symbol is
	// in a build-output directory (dist/, build/, .next/, target/).
	WarnDemotedGenerated = "demoted: generated code"

	// WarnZeroEdge — the symbol has zero incoming AND zero outgoing
	// edges in the code graph. Two possible interpretations:
	//  1. Genuine dead code. Nothing calls it, it calls nothing.
	//  2. Parser limitation. The regex parser for TS/Python/Rust
	//     cannot resolve many call sites and edges for non-Go
	//     languages are systematically undercounted. An LLM should
	//     NOT conclude "dead code" from this warning alone — verify
	//     by reading the file.
	// SPEC §8.2 Issue #2 documents this ambiguity.
	WarnZeroEdge = "zero edges — may be dead code or parser limitation"

	// WarnLowConfidence — Jaccard similarity is below 0.10. Results
	// at this tier are right at the signal/noise boundary for MinHash
	// with 128 hash functions (pairwise-independent FNV variant). An
	// LLM should treat these as "maybe relevant" not "definitely
	// relevant" and verify by reading the source.
	WarnLowConfidence = "low confidence (jaccard < 0.10)"

	// WarnDeclarationOnlyType — symbol is a pure type/interface with
	// no body and zero edges. Common in TS type declaration files
	// and Go interface-only types. Probably not runtime code the
	// user wants to find.
	WarnDeclarationOnlyType = "type/interface declaration without references"
)
Confidence warning constants. These strings are stable across releases because callers (LLM tool users, UIs, scripts) may match on them directly. Add new ones freely but do NOT rename or remove existing ones without a version bump.
const DefaultNumHashes = 128
DefaultNumHashes is the number of hash functions used for MinHash signatures. 128 provides good accuracy with sub-10ms query time for 50k symbols.
Variables ¶
This section is empty.
Functions ¶
func BytesToSeeds ¶ added in v1.9.0
BytesToSeeds deserializes a byte slice back into a seed slice. Returns an error if the length is not a multiple of 8.
func ComputeBM25Score ¶ added in v1.9.0
func ComputeBM25Score(queryTokens []string, docTokens map[string]int, docLength int, idf map[string]float64, avgDocLen float64) float64
ComputeBM25Score computes the BM25 score for a single symbol against a query.
- queryTokens: deduplicated lowercase tokens from the query
- docTokens: map[token] = term frequency (TF) for this symbol
- docLength: total token count for the symbol
- idf: map[token] = precomputed IDF for each query token
- avgDocLen: average doc length across the corpus
Pure function, no store access. Callers resolve the inputs from stored data first, then call this repeatedly across the candidate set.
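The scoring step can be sketched as below, using the same parameter list as ComputeBM25Score. This is an illustration under stated assumptions rather than the package's implementation: bm25Sketch is a hypothetical name, and the k1 = 1.2 and b = 0.75 values are the conventional Okapi BM25 defaults, which the package may not use.

```go
package main

import "fmt"

// ASSUMPTION: conventional Okapi BM25 tuning constants. The values used by
// the real ComputeBM25Score are not stated in this documentation.
const (
	bm25K1 = 1.2
	bm25B  = 0.75
)

// bm25Sketch sums, over the query tokens present in the document, the
// token's IDF times a saturating, length-normalized term-frequency factor.
func bm25Sketch(queryTokens []string, docTokens map[string]int, docLength int, idf map[string]float64, avgDocLen float64) float64 {
	if avgDocLen <= 0 {
		return 0 // no corpus stats available, nothing to score against
	}
	// Length normalization: longer-than-average documents are penalized.
	norm := bm25K1 * (1 - bm25B + bm25B*float64(docLength)/avgDocLen)
	score := 0.0
	for _, tok := range queryTokens {
		tf := float64(docTokens[tok])
		if tf == 0 {
			continue // token not in this symbol, contributes nothing
		}
		score += idf[tok] * tf * (bm25K1 + 1) / (tf + norm)
	}
	return score
}

func main() {
	// Hypothetical inputs: a common token with low IDF and a rare one with high IDF.
	idf := map[string]float64{"get": 0.1, "postgresql": 4.0}
	doc := map[string]int{"get": 3, "postgresql": 1}
	fmt.Printf("%.3f\n", bm25Sketch([]string{"get", "postgresql"}, doc, 4, idf, 4))
}
```

Even though "get" occurs three times in the hypothetical symbol and "postgresql" only once, the rare token supplies nearly all of the score because of its IDF.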
func ComputeFusedRanking ¶ added in v1.9.0
ComputeFusedRanking combines two ranked lists into a single ranking using Reciprocal Rank Fusion: each entry's fused score is the sum of 1/(k + rank) across the lists it appears in. Higher fused score = better overall rank.
jaccardRanks and bm25Ranks each map a symbol ID to its 1-indexed position in the corresponding list. A symbol absent from a list contributes nothing for that list.
This is deliberately not a method on any type — it's a pure function over data structures so it's trivial to test in isolation.
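A minimal sketch of Reciprocal Rank Fusion over two rank maps follows. The function name fuseRanks, the constant k = 60 (a value often used for RRF), and the tie-break by ascending ID are assumptions for illustration; the package's actual constant and tie-breaking rule are not documented here.

```go
package main

import (
	"fmt"
	"sort"
)

// ASSUMPTION: k = 60 is a commonly used RRF constant, not necessarily the
// value ComputeFusedRanking uses.
const rrfK = 60

// fuseRanks adds 1/(k + rank) for every list a symbol appears in, then
// returns the symbol IDs ordered by descending fused score.
func fuseRanks(jaccardRanks, bm25Ranks map[int64]int) []int64 {
	scores := make(map[int64]float64)
	for id, rank := range jaccardRanks {
		scores[id] += 1.0 / float64(rrfK+rank)
	}
	for id, rank := range bm25Ranks {
		scores[id] += 1.0 / float64(rrfK+rank)
	}
	ids := make([]int64, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j] // hypothetical deterministic tie-break
	})
	return ids
}

func main() {
	jaccard := map[int64]int{10: 1, 20: 2, 30: 3}
	bm25 := map[int64]int{20: 1, 30: 2} // symbol 10 is absent from this list
	fmt.Println(fuseRanks(jaccard, bm25)) // prints [20 30 10]
}
```

Symbol 10 ranks first on Jaccard alone, but symbols 20 and 30 appear in both lists and so accumulate two reciprocal terms each, which moves them ahead.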
func DefaultIndexPath ¶ added in v1.8.3
DefaultIndexPath returns the path to the code graph database for a project. It stores the index under ~/.celeste/projects/<hash>/codegraph.db to avoid polluting the project directory.
func DetectLanguage ¶
DetectLanguage returns the language for a file based on its extension. Returns empty string if the language is not recognized.
func DetectProjectLanguage ¶
DetectProjectLanguage determines the primary language of a project by checking for manifest files in the given directory.
func FormatConfidenceLine ¶ added in v1.9.0
func FormatConfidenceLine(r SearchResult) string
FormatConfidenceLine returns a human-readable one-line summary of a SearchResult's confidence metadata, suitable for appending to CLI / tool output. Empty string if there's nothing notable.
Example output:
" ⚠ demoted: mock path; zero edges — may be dead code or parser limitation; edges=0"
" edges=12"
func IsDemotable ¶ added in v1.9.0
IsDemotable returns true if the flag set is non-empty — i.e., at least one demotion reason applies. Pure convenience helper.
func IsIndexableFile ¶
IsIndexableFile returns true if the file's language has parser support.
func JaccardSimilarity ¶
func JaccardSimilarity(a, b MinHashSignature) float64
JaccardSimilarity estimates the Jaccard similarity between two MinHash signatures. Returns a value between 0.0 (completely different) and 1.0 (identical sets).
IMPORTANT: both signatures must have been computed with the SAME hash seeds for this to produce meaningful results. Comparing signatures from different MinHashers (different seeds) yields noise.
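The standard MinHash estimator is the fraction of signature positions where the two signatures agree. The sketch below illustrates that idea with a local stand-in type; estimateJaccard is a hypothetical name, and returning 0 for empty or mismatched-length signatures is an assumption about edge-case handling.

```go
package main

import "fmt"

// signature is a local stand-in for MinHashSignature.
type signature []uint64

// estimateJaccard returns the share of positions at which both signatures
// hold the same minimum hash. Each position is one hash function, so both
// signatures must come from the same seeds for a match to mean anything.
func estimateJaccard(a, b signature) float64 {
	// ASSUMPTION: empty or length-mismatched signatures are treated as
	// "no similarity"; the real function may handle this differently.
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	matches := 0
	for i := range a {
		if a[i] == b[i] {
			matches++
		}
	}
	return float64(matches) / float64(len(a))
}

func main() {
	a := signature{1, 2, 3, 4}
	b := signature{1, 2, 9, 4}
	fmt.Println(estimateJaccard(a, b)) // prints 0.75
}
```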
func PathFlagStrings ¶ added in v1.9.0
PathFlagStrings converts a []PathFlag to a []string for serialization to JSON / API responses / logs.
func SeedsToBytes ¶ added in v1.9.0
SeedsToBytes serializes a seed slice to bytes for persistence. Layout is little-endian uint64 × N, for a total of 8*N bytes.
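The layout described for SeedsToBytes and BytesToSeeds (little-endian uint64 values, 8 bytes each, error on a length that is not a multiple of 8) can be sketched with encoding/binary. The lowercase function names and the error text are illustrative, not the package's own.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// seedsToBytes writes each seed as 8 little-endian bytes, 8*N bytes total.
func seedsToBytes(seeds []uint64) []byte {
	out := make([]byte, 8*len(seeds))
	for i, s := range seeds {
		binary.LittleEndian.PutUint64(out[i*8:], s)
	}
	return out
}

// bytesToSeeds reverses seedsToBytes and rejects blobs whose length is not
// a multiple of 8. The error message here is hypothetical.
func bytesToSeeds(data []byte) ([]uint64, error) {
	if len(data)%8 != 0 {
		return nil, errors.New("seed blob length is not a multiple of 8")
	}
	seeds := make([]uint64, len(data)/8)
	for i := range seeds {
		seeds[i] = binary.LittleEndian.Uint64(data[i*8:])
	}
	return seeds, nil
}

func main() {
	blob := seedsToBytes([]uint64{1, 2})
	fmt.Println(len(blob)) // prints 16
	seeds, err := bytesToSeeds(blob)
	fmt.Println(seeds, err) // prints [1 2] <nil>
}
```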
func ShinglesForSymbol ¶
ShinglesForSymbol generates enriched shingles for a symbol, used as input to MinHash for semantic similarity search. Each shingle is a lowercased token derived from the symbol's name, types, body references, package, and comments.
The final token list is filtered through the embedded stopwords.json (celeste-stopwords v1.0.0, CC BY 4.0) via stopWords.Filter. The filter applies the universal set plus the per-language set identified by lang. Pass "" for lang to apply only the universal set — this is the right choice for callers that don't know the file's language.
func ShouldSkipPath ¶
ShouldSkipPath returns true if the path should be excluded from indexing.
Types ¶
type BM25CorpusStats ¶ added in v1.9.0
BM25CorpusStats holds the corpus-wide statistics BM25 needs for scoring: total document count and average document length (in shingle tokens). Computed once at the end of Build() and cached via the meta table so query-time scoring is a single lookup.
type CodeSmell ¶ added in v1.8.3
type CodeSmell struct {
Kind CodeSmellKind `json:"kind"`
Name string `json:"name"`
File string `json:"file"`
Line int `json:"line"`
FuncKind string `json:"func_kind"`
OutEdges int `json:"outgoing_edges"`
InEdges int `json:"incoming_edges"`
Score float64 `json:"score"`
Reason string `json:"reason"`
Signature string `json:"signature,omitempty"`
Snippet string `json:"snippet,omitempty"`
}
CodeSmell represents a structurally detected code issue.
type CodeSmellKind ¶ added in v1.8.3
type CodeSmellKind string
CodeSmellKind categorizes the type of code smell detected.
const (
	SmellLazyRedirect CodeSmellKind = "LAZY_REDIRECT"
	SmellStub         CodeSmellKind = "STUB"
	SmellPlaceholder  CodeSmellKind = "PLACEHOLDER"
	SmellTodoFixme    CodeSmellKind = "TODO_FIXME"
	SmellEmptyHandler CodeSmellKind = "EMPTY_HANDLER"
	SmellHardcoded    CodeSmellKind = "HARDCODED"
)
type FileRecord ¶
type FileRecord struct {
Path string
Language string
Size int64
ContentHash string
IndexedAt int64
}
FileRecord tracks indexed files for incremental updates.
type FunctionEdgeInfo ¶ added in v1.8.3
type FunctionEdgeInfo struct {
Name string
File string
Line int
Kind string
Signature string
OutEdges int
InEdges int
}
FunctionEdgeInfo holds a function's identity and edge counts for analysis.
type GenericParser ¶
type GenericParser struct {
// contains filtered or unexported fields
}
GenericParser extracts symbols from non-Go source files using regex patterns. Covers Python, JavaScript, TypeScript, and Rust. No call graph (would need tree-sitter / CGo). Focuses on declarations: functions, classes, imports.
func NewGenericParser ¶
func NewGenericParser(language string) *GenericParser
NewGenericParser creates a parser for the given language.
func (*GenericParser) ParseFile ¶
func (p *GenericParser) ParseFile(path string) (*ParseResult, error)
ParseFile parses a source file and extracts symbols using regex.
type GitignoreFilter ¶ added in v1.8.3
type GitignoreFilter struct {
// contains filtered or unexported fields
}
GitignoreFilter holds compiled gitignore patterns for matching.
func LoadGitignore ¶ added in v1.8.3
func LoadGitignore(projectRoot string) *GitignoreFilter
LoadGitignore reads a .gitignore file and returns a filter. Returns nil (no filter) if the file doesn't exist or can't be read.
func (*GitignoreFilter) ShouldSkip ¶ added in v1.8.3
func (f *GitignoreFilter) ShouldSkip(relPath string, isDir bool) bool
ShouldSkip returns true if the given relative path should be ignored. isDir indicates whether the path is a directory.
type GoParser ¶
type GoParser struct{}
GoParser extracts symbols and edges from Go source files using go/ast.
type Indexer ¶
type Indexer struct {
// contains filtered or unexported fields
}
Indexer manages the code graph lifecycle: build, update, and query.
func NewIndexer ¶
NewIndexer creates an indexer for the given workspace, using the specified SQLite database path.
Reloads the MinHasher seeds from the store's meta table if present so stored signatures remain comparable across process invocations. If no seeds are stored (fresh index or pre-v1.9.0 index), generates fresh random seeds that will be persisted on the first Build().
func NewIndexerWithStore ¶ added in v1.8.3
NewIndexerWithStore creates an indexer using an existing store. This is useful for testing where the store is set up manually. Unlike NewIndexer, does NOT attempt to load seeds from the store — the caller is responsible for passing a store that either has no meta row yet or whose seeds are irrelevant for the test.
func (*Indexer) Build ¶
Build performs a full index of the workspace. Walks the file tree, parses source files, extracts symbols and edges, computes MinHash signatures, and stores everything in SQLite.
func (*Indexer) Close ¶
Close releases the underlying database connection and any native resources held by the tree-sitter TS parser.
func (*Indexer) FindCodeSmells ¶ added in v1.8.3
func (idx *Indexer) FindCodeSmells(kinds []CodeSmellKind, maxResults int, includeTests bool) ([]CodeSmell, error)
FindCodeSmells performs a single-pass structural analysis over all functions in the graph, detecting multiple code smell patterns simultaneously. This is more efficient than separate queries and more powerful than grep because it combines graph structure (edges, connectivity) with body analysis.
func (*Indexer) FindLazyRedirects ¶ added in v1.8.3
func (idx *Indexer) FindLazyRedirects(maxResults int, includeTests bool) ([]LazyRedirectResult, error)
FindLazyRedirects uses structural analysis to detect functions whose names imply complex behavior but whose graph structure shows they're trivially simple. This goes beyond grep-based detection by measuring the divergence between a function's semantic vocabulary (shingles) and its actual call graph connectivity.
Scoring factors:
- Name complexity: action verbs in name suggest the function should DO work
- Edge poverty: fewer outgoing edges = less actual work done
- Shingle richness: domain-specific vocabulary in body that doesn't connect to edges
Returns results sorted by divergence score (highest = most suspicious).
func (*Indexer) KeywordSearch ¶
KeywordSearch finds symbols matching a keyword query using SQL LIKE.
func (*Indexer) PackageGraph ¶ added in v1.8.3
func (idx *Indexer) PackageGraph() ([]PackageInfo, []PackageEdge, error)
PackageGraph returns package-level connectivity for visualization.
func (*Indexer) ProjectSummary ¶
ProjectSummary returns a brief summary suitable for the system prompt.
func (*Indexer) SemanticSearch ¶
func (idx *Indexer) SemanticSearch(query string, topK int) ([]SearchResult, error)
SemanticSearch finds symbols semantically similar to the query string. The query is split into shingles, MinHashed, then compared against all symbol signatures using brute-force Jaccard similarity.
Applies the path-based post-filter by default — test/mock/generated/vendored/declaration results are partitioned below clean-path results of comparable similarity. Use SemanticSearchWithOptions to disable.
func (*Indexer) SemanticSearchWithOptions ¶ added in v1.9.0
func (idx *Indexer) SemanticSearchWithOptions(query string, opts SemanticSearchOptions) ([]SearchResult, error)
SemanticSearchWithOptions is the full-options variant of SemanticSearch.
func (*Indexer) Stats ¶
func (idx *Indexer) Stats() (*StoreStats, error)
Stats returns aggregate stats for the indexed codebase.
type LazyRedirectCandidate ¶ added in v1.8.3
type LazyRedirectCandidate struct {
Name string
File string
Line int
Kind string
OutEdges int
InEdges int
Signature string
}
LazyRedirectCandidate represents a function whose name implies complex behavior but whose graph structure shows it's structurally trivial — a potential lazy redirect.
type LazyRedirectResult ¶ added in v1.8.3
type LazyRedirectResult struct {
Name string `json:"name"`
File string `json:"file"`
Line int `json:"line"`
Kind string `json:"kind"`
OutEdges int `json:"outgoing_edges"`
InEdges int `json:"incoming_edges"`
Score float64 `json:"divergence_score"`
Reason string `json:"reason"`
Signature string `json:"signature"`
}
LazyRedirectResult is a scored candidate for lazy redirect detection.
type MinHashEntry ¶
type MinHashEntry struct {
SymbolID int64
Signature MinHashSignature
}
MinHashEntry pairs a symbol ID with its MinHash signature for bulk queries.
type MinHashSignature ¶
type MinHashSignature []uint64
MinHashSignature is a fixed-length array of hash values for similarity search.
type MinHasher ¶
type MinHasher struct {
// contains filtered or unexported fields
}
MinHasher computes MinHash signatures for sets of shingles. Uses FNV-1a with different uint64 seeds to simulate N independent hash functions. Seeds are a fixed []uint64 so the hasher can be persisted to the codegraph store and restored across process invocations — essential for reliable cross-process semantic search.
func NewMinHasher ¶
NewMinHasher creates a MinHasher with the specified number of hash functions, generating fresh random seeds from crypto/rand. Use NewMinHasherFromSeeds when reloading a persisted hasher from the store.
func NewMinHasherFromSeeds ¶ added in v1.9.0
NewMinHasherFromSeeds creates a MinHasher with pre-determined seeds, typically reloaded from the codegraph store's meta table. This is the critical path for cross-process signature stability: a MinHash signature computed with seeds S can only be compared to another signature computed with the SAME seeds S. Persisting the seeds and restoring them on Open is what makes SemanticSearch work across process boundaries.
func (*MinHasher) Seeds ¶ added in v1.9.0
Seeds returns a copy of the hasher's seeds. Used by the indexer to persist them into the codegraph store's meta table at build time.
func (*MinHasher) Signature ¶
func (m *MinHasher) Signature(shingles []string) MinHashSignature
Signature computes the MinHash signature for a set of shingles. Each element of the returned slice is the minimum hash value across all shingles for that hash function.
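A minimal sketch of the signature computation: one seeded FNV-1a hash per seed, keeping the minimum over all shingles. How the real MinHasher mixes a seed into FNV-1a is not documented here; XORing the seed into the offset basis is an assumption, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Standard 64-bit FNV-1a parameters.
const (
	fnvOffset64 = 14695981039346656037
	fnvPrime64  = 1099511628211
)

// seededFNV1a hashes s with FNV-1a after perturbing the offset basis.
// ASSUMPTION: seed mixing by XOR is illustrative, not the package's scheme.
func seededFNV1a(seed uint64, s string) uint64 {
	h := uint64(fnvOffset64) ^ seed
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= fnvPrime64
	}
	return h
}

// signatureOf returns, for each seed, the smallest hash over all shingles.
func signatureOf(seeds []uint64, shingles []string) []uint64 {
	sig := make([]uint64, len(seeds))
	for i, seed := range seeds {
		lowest := uint64(math.MaxUint64)
		for _, sh := range shingles {
			if h := seededFNV1a(seed, sh); h < lowest {
				lowest = h
			}
		}
		sig[i] = lowest
	}
	return sig
}

func main() {
	seeds := []uint64{11, 22, 33, 44} // a real hasher would use DefaultNumHashes seeds
	a := signatureOf(seeds, []string{"store", "upsert", "symbol"})
	b := signatureOf(seeds, []string{"store", "upsert", "file"})
	fmt.Println(len(a), len(b))
}
```

Taking the minimum makes the result independent of shingle order, and reusing the same seeds is what keeps two signatures comparable position by position.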
type PackageEdge ¶ added in v1.8.3
PackageEdge represents a connection between two packages.
type PackageInfo ¶ added in v1.8.3
PackageInfo holds package-level stats for visualization.
type ParseResult ¶
type ParseResult struct {
Symbols []Symbol
Edges []RawEdge
Source []byte // raw file content for shingle generation
}
ParseResult holds the symbols and edges extracted from a single file.
type PathFlag ¶ added in v1.9.0
type PathFlag string
PathFlag is a machine-readable marker attached to a search result when the symbol's file path matches a known pattern that affects its interpretation — test fixture, mock, type declaration, vendored code, build output, etc.
Flags are computed at query time (not stored in the index), so adding new flag categories does not invalidate existing codegraph databases. Callers can read SearchResult.PathFlags to understand WHY a symbol was demoted from the "clean" ranking tier.
const (
	// FlagTest — symbol lives in a test file or test directory. These
	// are genuine test helpers: TestFoo functions in Go's _test.go files,
	// tests/*.py in Python, *.spec.ts / *.test.ts in TypeScript, and so on.
	FlagTest PathFlag = "test"

	// FlagMock — symbol is in a mocks/, fixtures/, or stubs/ directory.
	// Mock handlers, fake services, test doubles. These pollute queries
	// like "http request handler middleware" because they share
	// discriminative tokens with production middleware without BEING
	// production middleware. Q2 in the grafana A/B test was 100% mock
	// handlers for exactly this reason.
	FlagMock PathFlag = "mock"

	// FlagDeclaration — symbol is in a pure type declaration file
	// (e.g. TypeScript .d.ts). These describe an API surface but have
	// no runtime code. Usually undesirable as a semantic search match
	// because the user is looking for implementations, not declarations.
	// JQueryStatic lives in a .d.ts file and this flag would demote it
	// even without the splitCamelCase fix.
	FlagDeclaration PathFlag = "declaration"

	// FlagVendored — symbol is in a vendored dependency or third-party
	// package directory (vendor/, node_modules/, bower_components/).
	// These are external code the user didn't write. Usually irrelevant.
	FlagVendored PathFlag = "vendored"

	// FlagGenerated — symbol is in a generated-code output directory
	// (dist/, build/, .next/, out/, target/). Post-compile artifacts,
	// transpiled output, build caches. Never what the user wants.
	FlagGenerated PathFlag = "generated"
)
func ClassifyPath ¶ added in v1.9.0
ClassifyPath inspects a file path and returns the set of PathFlags that apply. Empty result means the symbol is in a "clean" path with no demotion warranted.
Deterministic, fast (O(path length)), and order-independent: the same path always produces the same flag set.
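The classification can be pictured as a handful of segment and suffix checks run in a fixed order. The sketch below uses only patterns named in the flag comments above; classifyPathSketch is hypothetical, it returns plain strings instead of PathFlag values, and the real ClassifyPath may match more patterns or match them differently.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyPathSketch returns flag names for a path. Checks run in a fixed
// order, so the same path always yields the same flag list.
func classifyPathSketch(path string) []string {
	segments := strings.Split(strings.ReplaceAll(path, "\\", "/"), "/")
	base := segments[len(segments)-1]
	dirs := segments[:len(segments)-1]

	// hasDir reports whether any directory segment equals one of names.
	hasDir := func(names ...string) bool {
		for _, seg := range dirs {
			for _, n := range names {
				if seg == n {
					return true
				}
			}
		}
		return false
	}

	var flags []string
	if strings.HasSuffix(base, "_test.go") || strings.HasSuffix(base, ".spec.ts") ||
		strings.HasSuffix(base, ".test.ts") || hasDir("tests") {
		flags = append(flags, "test")
	}
	if hasDir("mocks", "fixtures", "stubs") {
		flags = append(flags, "mock")
	}
	if strings.HasSuffix(base, ".d.ts") {
		flags = append(flags, "declaration")
	}
	if hasDir("vendor", "node_modules", "bower_components") {
		flags = append(flags, "vendored")
	}
	if hasDir("dist", "build", ".next", "out", "target") {
		flags = append(flags, "generated")
	}
	return flags
}

func main() {
	fmt.Println(classifyPathSketch("node_modules/jquery/index.d.ts")) // prints [declaration vendored]
	fmt.Println(classifyPathSketch("pkg/store.go"))                   // prints []
}
```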
type RawEdge ¶
RawEdge is an unresolved edge that uses symbol names instead of IDs. Resolved to Edge (with IDs) when inserted into the store.
type Reranker ¶ added in v1.9.0
type Reranker interface {
Rerank(results []SearchResult, queryTokenCount int) []SearchResult
}
Reranker rescores a set of SearchResult candidates using features beyond the baseline Jaccard + BM25 fusion. Implementations MUST be pure — no network, no file I/O beyond what was passed in. Callers are responsible for cloning the input if they need the original order preserved; Rerank is allowed to mutate the slice in place.
queryTokenCount is the number of distinct query shingles that made it through the stop-word filter. Rerankers that compute a matched-token ratio need this for normalization.
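To show how an alternative implementation could plug into this seam, here is a sketch of a reranker that orders candidates by matched-token ratio. It uses local stand-in types so it compiles on its own; searchResult, reranker, and matchRatioReranker are hypothetical, and this is not how StructuralReranker scores results.

```go
package main

import (
	"fmt"
	"sort"
)

// searchResult is a trimmed local stand-in for SearchResult.
type searchResult struct {
	Name          string
	Similarity    float64
	MatchedTokens []string
}

// reranker mirrors the shape of the Reranker interface.
type reranker interface {
	Rerank(results []searchResult, queryTokenCount int) []searchResult
}

// matchRatioReranker is a hypothetical implementation that prefers results
// matching a larger share of the query tokens.
type matchRatioReranker struct{}

func (matchRatioReranker) Rerank(results []searchResult, queryTokenCount int) []searchResult {
	if queryTokenCount <= 0 {
		return results // nothing to normalize against, keep the fused order
	}
	ratio := func(r searchResult) float64 {
		return float64(len(r.MatchedTokens)) / float64(queryTokenCount)
	}
	// Stable sort keeps the incoming order for equal ratios. The slice is
	// reordered in place, which the interface contract allows.
	sort.SliceStable(results, func(i, j int) bool {
		return ratio(results[i]) > ratio(results[j])
	})
	return results
}

func main() {
	var r reranker = matchRatioReranker{}
	out := r.Rerank([]searchResult{
		{Name: "ParseConfig", Similarity: 0.30, MatchedTokens: []string{"config"}},
		{Name: "LoadPostgresConfig", Similarity: 0.30, MatchedTokens: []string{"config", "postgres"}},
	}, 2)
	fmt.Println(out[0].Name) // prints LoadPostgresConfig
}
```

The implementation does no I/O and works only from the fields passed in, which is the purity requirement stated above.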
type SearchResult ¶
type SearchResult struct {
Symbol Symbol
Similarity float64
// BM25Score is the additive per-symbol BM25 score for this query,
// computed alongside the Jaccard similarity at search time. Not a
// replacement for Similarity — both signals are returned so callers
// (or a downstream re-rank layer) can reason about them independently.
// Zero when the BM25 corpus stats table is empty (pre-v1.9.0 index).
BM25Score float64
// MatchedTokens are the query tokens that appeared in this symbol's
// filtered shingle set (intersection of query and symbol tokens).
// Populated only when BM25 scoring is active. Useful reasoning output
// for LLMs: "this result matched because it contains X, Y, Z".
MatchedTokens []string
PathFlags []string
EdgeCount int
ConfidenceWarnings []string
}
SearchResult pairs a symbol with its similarity score and a set of machine-readable reasoning fields that tell an LLM (or a human) WHY this result was returned and how confident celeste is in it.
PathFlags: markers attached when the symbol's file path triggered the path-based post-filter — e.g. ["test"], ["mock", "generated"]. Clean-path results have an empty PathFlags slice. SemanticSearch demotes flagged results below clean results by default; see SemanticSearchOptions.ApplyPathFilter to disable.
EdgeCount: total incoming + outgoing edges on this symbol in the code graph. A function that is called from 4 places and calls 2 others has EdgeCount=6. Zero-edge symbols are suspicious — they may be genuine dead code, but they may also be symbols the parser failed to resolve (especially TS/Python/Rust where the regex parser can't follow call sites through type definitions). SPEC §8.2 Issue #2 documents this ambiguity explicitly; LLMs should NOT treat EdgeCount=0 as proof of dead code without corroborating evidence.
ConfidenceWarnings: human-readable strings describing caveats about this result. Derived at query time from PathFlags, EdgeCount, Kind, and Similarity — no schema change, no precomputation. Callers should surface these to whoever consumes the search results so low-quality matches are recognized as such instead of being treated as confident answers.
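The query-time derivation might look roughly like the sketch below, which covers a subset of the warnings. The warning strings are copied from the constants documented above; deriveWarnings, its parameter list, and the order in which warnings are appended are assumptions, and the Kind-based check is omitted.

```go
package main

import "fmt"

// Warning strings copied verbatim from the package's documented constants.
const (
	warnDemotedTest   = "demoted: test path"
	warnDemotedMock   = "demoted: mock path"
	warnZeroEdge      = "zero edges — may be dead code or parser limitation"
	warnLowConfidence = "low confidence (jaccard < 0.10)"
)

// deriveWarnings builds the warning list from fields already on a result.
// ASSUMPTION: the ordering and the exact set of checks are illustrative.
func deriveWarnings(pathFlags []string, edgeCount int, similarity float64) []string {
	var warnings []string
	for _, f := range pathFlags {
		switch f {
		case "test":
			warnings = append(warnings, warnDemotedTest)
		case "mock":
			warnings = append(warnings, warnDemotedMock)
		}
	}
	if edgeCount == 0 {
		warnings = append(warnings, warnZeroEdge)
	}
	if similarity < 0.10 {
		warnings = append(warnings, warnLowConfidence)
	}
	return warnings
}

func main() {
	for _, w := range deriveWarnings([]string{"mock"}, 0, 0.25) {
		fmt.Println(w)
	}
}
```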
type SemanticSearchOptions ¶ added in v1.9.0
type SemanticSearchOptions struct {
// TopK is the maximum number of results to return. Required.
TopK int
// MinSimilarity is the Jaccard floor below which results are dropped
// entirely. Zero means use the default (0.05).
MinSimilarity float64
// ApplyPathFilter, when true, demotes results whose file path matches
// a known "noisy" pattern (test/mock/generated/vendored/declaration)
// below clean-path results. Default when using SemanticSearch is true.
// Set false for raw unfiltered results.
ApplyPathFilter bool
// Reranker, when non-nil, is applied to the candidate list after
// the Jaccard + BM25 fusion and before the path filter tiering.
// A pluggable seam — the default (set via SemanticSearch) is
// StructuralReranker which does pure-Go feature-based rescoring.
// Future cloud/local embedding rerankers can implement this
// interface without touching the search pipeline.
//
// Pass a zero value (nil) together with
// DisableRerank=true to get the pre-Task-24 behavior (fusion-only).
Reranker Reranker
// DisableRerank bypasses the Reranker even if one is set.
// Useful for A/B testing and for callers that want the raw
// fused ordering without any structural adjustments.
DisableRerank bool
}
SemanticSearchOptions configures SemanticSearch behavior. Existing callers of SemanticSearch(query, topK) get the default behavior — path filter ON, structural rerank ON — without any changes.
type StopWords ¶ added in v1.9.0
type StopWords struct {
Version string
Universal map[string]bool
ByLang map[string]map[string]bool
Compound map[string]bool
}
StopWords holds the parsed lookup sets used at shingle-generation time and query-tokenization time. Built once at init.
func (*StopWords) Filter ¶ added in v1.9.0
Filter removes any tokens in the universal set or in the per-language set for the given lang from the input slice. Empty lang means universal-only filtering. The input slice is NOT mutated.
Preserves the order of surviving tokens. Returns a freshly allocated slice (safe for the caller to hold).
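The filtering behavior described here (universal set plus an optional per-language set, order preserved, input untouched) can be sketched as follows. The Filter signature is not shown in this documentation, so the stopSets type and the (tokens, lang) parameter order are assumptions.

```go
package main

import "fmt"

// stopSets is a local stand-in holding the two lookup sets Filter consults.
type stopSets struct {
	universal map[string]bool
	byLang    map[string]map[string]bool
}

// filter returns a new slice containing the tokens that are in neither the
// universal set nor the set for lang. The input slice is never modified.
func (s *stopSets) filter(tokens []string, lang string) []string {
	// For "" or an unknown language this is a nil map, and lookups on a nil
	// map return false, so only the universal set applies.
	langSet := s.byLang[lang]
	out := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if s.universal[tok] || langSet[tok] {
			continue
		}
		out = append(out, tok)
	}
	return out
}

func main() {
	sw := &stopSets{
		universal: map[string]bool{"get": true},
		byLang:    map[string]map[string]bool{"go": {"func": true}},
	}
	tokens := []string{"get", "func", "postgresql"}
	fmt.Println(sw.filter(tokens, "go")) // prints [postgresql]
	fmt.Println(sw.filter(tokens, ""))   // prints [func postgresql]
}
```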
func (*StopWords) IsCompound ¶ added in v1.9.0
IsCompound returns true if the lowercased identifier is in the compound_identifiers list. Used by splitIdentifier to keep known compound names (jquery, github, mysql, ...) atomic instead of decomposing them into parts that pollute searches.
The splitCamelCase fix in v1.9.0 (min-3-uppercase rule) already handles most compound-name cases structurally, so this lookup is a belt-and-suspenders layer: it catches snake_case compounds (mysql_config → would split to ["mysql", "config"] without this, but we WANT "mysql" to stay atomic because splitting to "my"+"sql" or similar is worse) and any lowercase-only compounds that splitCamelCase wouldn't touch at all.
func (*StopWords) UniversalSize ¶ added in v1.9.0
UniversalSize returns the number of universal stop words. Used by the anchor test to assert the embedded file isn't obviously broken.
type Store ¶
type Store struct {
// contains filtered or unexported fields
}
Store manages the SQLite database for the code graph.
func NewStore ¶
NewStore opens (or creates) a SQLite database at the given path and initializes the schema.
func (*Store) DeleteFile ¶
DeleteFile removes a file record.
func (*Store) DeleteFileSymbols ¶
DeleteFileSymbols removes all symbols (and their edges) for a file.
func (*Store) FindAllFunctionsWithEdges ¶ added in v1.8.3
func (s *Store) FindAllFunctionsWithEdges() ([]FunctionEdgeInfo, error)
FindAllFunctionsWithEdges returns all functions/methods with their edge counts. Used by the unified code smell detector for single-pass analysis.
func (*Store) FindLazyRedirectCandidates ¶ added in v1.8.3
func (s *Store) FindLazyRedirectCandidates(includeTests bool) ([]LazyRedirectCandidate, error)
FindLazyRedirectCandidates returns functions/methods with low outgoing edges (0-2) that are NOT known leaf patterns (constructors, getters, interface impls). These are candidates for lazy redirect analysis via shingle/edge divergence.
func (*Store) FindStubs ¶ added in v1.8.3
func (s *Store) FindStubs(includeTests bool) ([]StubResult, error)
FindStubs returns functions/methods with zero outgoing call edges. These are likely stubs, placeholders, or dead code.
func (*Store) GetAllFiles ¶
func (s *Store) GetAllFiles() ([]FileRecord, error)
GetAllFiles returns all indexed file records.
func (*Store) GetAllMinHashes ¶
func (s *Store) GetAllMinHashes() ([]MinHashEntry, error)
GetAllMinHashes retrieves all symbol IDs and their MinHash signatures for similarity search. Symbols without a signature are skipped.
func (*Store) GetEdgesFrom ¶
GetEdgesFrom returns all outgoing edges from the given symbol.
func (*Store) GetEdgesTo ¶
GetEdgesTo returns all incoming edges to the given symbol.
func (*Store) GetFile ¶
func (s *Store) GetFile(path string) (*FileRecord, error)
GetFile retrieves a file record by path.
func (*Store) GetFileGraph ¶ added in v1.8.3
GetFileGraph returns file-level connectivity data for visualization. Works for all languages — shows which files call into other files.
func (*Store) GetIDFs ¶ added in v1.9.0
GetIDFs reads IDF values for a set of tokens in one batched query. Returns a map containing only tokens that exist in token_stats — missing tokens contribute zero to BM25 scores.
func (*Store) GetMeta ¶ added in v1.9.0
GetMeta reads a raw byte value from the meta key/value table. Returns (nil, nil) if the key is not present — callers should treat nil as "not set" and decide whether to generate and persist.
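A minimal sketch of the "nil means not set" pattern described above, using an in-memory map in place of the SQLite meta table. The metaStore type, the loadOrCreate helper, and the key name are hypothetical; the ([]byte, error) return shape is assumed from the "(nil, nil)" wording.

```go
package main

import "fmt"

// metaStore is an in-memory stand-in for the meta key/value table.
type metaStore map[string][]byte

// GetMeta returns (nil, nil) when the key is absent, as documented.
func (m metaStore) GetMeta(key string) ([]byte, error) {
	v, ok := m[key]
	if !ok {
		return nil, nil
	}
	return v, nil
}

// SetMeta overwrites any existing value (upsert semantics).
func (m metaStore) SetMeta(key string, value []byte) error {
	m[key] = value
	return nil
}

// loadOrCreate (hypothetical) treats a nil value as "not set",
// generates a value, and persists it for later calls.
func loadOrCreate(m metaStore, key string, generate func() []byte) ([]byte, error) {
	v, err := m.GetMeta(key)
	if err != nil {
		return nil, err
	}
	if v != nil {
		return v, nil
	}
	v = generate()
	if err := m.SetMeta(key, v); err != nil {
		return nil, err
	}
	return v, nil
}

func main() {
	store := metaStore{}
	calls := 0
	gen := func() []byte { calls++; return []byte("seed-1") }
	a, _ := loadOrCreate(store, "example_key", gen)
	b, _ := loadOrCreate(store, "example_key", gen)
	fmt.Println(string(a), string(b), calls) // the generator runs only once
}
```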
func (*Store) GetMinHash ¶
func (s *Store) GetMinHash(symbolID int64) (MinHashSignature, error)
GetMinHash retrieves the MinHash signature for a symbol.
func (*Store) GetPackageGraph ¶ added in v1.8.3
func (s *Store) GetPackageGraph() ([]PackageInfo, []PackageEdge, error)
GetPackageGraph returns package-level connectivity data for visualization.
func (*Store) GetSymbolIDByName ¶ added in v1.8.2
GetSymbolIDByName returns the ID of a symbol by exact name match. If multiple symbols share the same name, returns the first found.
func (*Store) GetSymbolTokens ¶ added in v1.9.0
GetSymbolTokens reads the stored TF map for a single symbol. Used at query time to compute BM25 scores. Returns an empty map (not nil) if the symbol has no token rows, so callers can treat it as "zero contribution" without nil-checks.
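To show how the IDF map and the per-symbol TF map might combine at query time, here is a sketch using the standard Okapi BM25 term formula. The bm25Score function and the k1 and b values are assumptions for illustration; the package's actual parameters are not stated in this documentation.

```go
package main

import "fmt"

// bm25Score sums standard Okapi BM25 term contributions for one symbol.
// Tokens absent from idf, or absent from the symbol's TF map, contribute
// zero. k1 and b are the conventional defaults, assumed here.
func bm25Score(query []string, tf map[string]int, idf map[string]float64, docLen, avgDocLen float64) float64 {
	const k1, b = 1.2, 0.75
	if avgDocLen <= 0 {
		return 0 // no corpus stats available, nothing to score against
	}
	score := 0.0
	for _, tok := range query {
		w, ok := idf[tok]
		f := float64(tf[tok])
		if !ok || f == 0 {
			continue
		}
		norm := f + k1*(1-b+b*docLen/avgDocLen)
		score += w * f * (k1 + 1) / norm
	}
	return score
}

func main() {
	idf := map[string]float64{"postgresql": 3.0, "get": 0.1}
	tf := map[string]int{"postgresql": 1, "get": 1}
	fmt.Printf("rare:    %.2f\n", bm25Score([]string{"postgresql"}, tf, idf, 2, 2))
	fmt.Printf("common:  %.2f\n", bm25Score([]string{"get"}, tf, idf, 2, 2))
	fmt.Printf("missing: %.2f\n", bm25Score([]string{"kubernetes"}, tf, idf, 2, 2))
}
```

Because both maps return zero values for absent keys, the empty-map (not nil) return of GetSymbolTokens and the "missing tokens contribute zero" behavior of GetIDFs need no extra branches in the scoring loop.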
func (*Store) GetSymbolsByFile ¶
GetSymbolsByFile returns all symbols in the given file.
func (*Store) GetSymbolsByPackage ¶
GetSymbolsByPackage returns all symbols in the given package.
func (*Store) ReadBM25Stats ¶ added in v1.9.0
func (s *Store) ReadBM25Stats() (*BM25CorpusStats, error)
ReadBM25Stats reads the cached corpus-wide BM25 stats. Returns (nil, nil) if the meta row is absent (fresh index or pre-BM25 index).
func (*Store) RebuildTokenStats ¶ added in v1.9.0
func (s *Store) RebuildTokenStats() (*BM25CorpusStats, error)
RebuildTokenStats walks the entire symbol_tokens table and computes df + idf for every token. Replaces the contents of token_stats atomically (delete-all + insert) so re-runs produce a consistent state. Called at the end of Build() and Update() — cheap compared to the full indexing pass because it's just aggregation over rows we just wrote.
Also computes the corpus-wide NumDocs + AvgDocLength stats and persists them to the meta table so query time can read them in a single lookup instead of COUNT(DISTINCT symbol_id) and AVG() scans.
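The aggregation described above can be sketched in memory as follows. This is an illustration under stated assumptions: rebuildTokenStats and corpusStats are hypothetical names, the input map stands in for the symbol_tokens table, and the IDF formula is the common BM25 variant ln(1 + (N - df + 0.5)/(df + 0.5)), which the package may or may not use.

```go
package main

import (
	"fmt"
	"math"
)

// corpusStats mirrors the two corpus-wide values named in the docs.
type corpusStats struct {
	NumDocs      int
	AvgDocLength float64
}

// rebuildTokenStats aggregates per-symbol TF maps (keyed by symbol ID)
// into df and idf per token, plus the corpus-wide stats.
func rebuildTokenStats(symbolTokens map[int64]map[string]int) (map[string]int, map[string]float64, corpusStats) {
	df := make(map[string]int)
	totalLen := 0
	for _, tf := range symbolTokens {
		for tok, count := range tf {
			df[tok]++ // each symbol counts once per token it contains
			totalLen += count
		}
	}
	stats := corpusStats{NumDocs: len(symbolTokens)}
	if stats.NumDocs > 0 {
		stats.AvgDocLength = float64(totalLen) / float64(stats.NumDocs)
	}
	idf := make(map[string]float64, len(df))
	n := float64(stats.NumDocs)
	for tok, d := range df {
		// Assumed IDF variant; always positive, smaller for common tokens.
		idf[tok] = math.Log(1 + (n-float64(d)+0.5)/(float64(d)+0.5))
	}
	return df, idf, stats
}

func main() {
	tokens := map[int64]map[string]int{
		1: {"get": 1, "postgresql": 1},
		2: {"get": 1, "pool": 1},
		3: {"get": 1},
	}
	df, idf, stats := rebuildTokenStats(tokens)
	fmt.Println(df["get"], df["postgresql"], stats.NumDocs)
	fmt.Printf("idf(get)=%.3f idf(postgresql)=%.3f avgLen=%.3f\n",
		idf["get"], idf["postgresql"], stats.AvgDocLength)
}
```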
func (*Store) SearchSymbolsByName ¶
SearchSymbolsByName returns symbols whose name contains the query (case-insensitive).
func (*Store) SetMeta ¶ added in v1.9.0
SetMeta writes a raw byte value to the meta key/value table. Upserts on conflict so the caller can treat this as idempotent.
func (*Store) Stats ¶
func (s *Store) Stats() (*StoreStats, error)
Stats returns aggregate counts for the indexed codebase.
func (*Store) UpdateMinHash ¶
func (s *Store) UpdateMinHash(symbolID int64, sig MinHashSignature) error
UpdateMinHash stores the MinHash signature for a symbol.
func (*Store) UpsertFile ¶
func (s *Store) UpsertFile(f FileRecord) error
UpsertFile inserts or updates a file record.
func (*Store) UpsertSymbol ¶
UpsertSymbol inserts or updates a symbol. Uniqueness is determined by (name, kind, package, file). Returns the row ID.
func (*Store) UpsertSymbolTokens ¶ added in v1.9.0
UpsertSymbolTokens writes the per-symbol token frequencies for a given symbol. Called from indexFile after the shingles are computed so we have the raw frequencies before deduplication collapses them to 1-per-token.
tokens is passed as a slice (not a set) because we want TF counts: the same token appearing twice in the shingle stream should count 2. Celeste's current shingle pipeline dedupes, so TF is always 1 in practice, but we preserve the more general API for future extractor improvements that might count frequency more accurately.
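The slice-versus-set distinction can be made concrete with a small sketch. countTF is a hypothetical helper showing how duplicates in a token stream become counts; it is not the package's code.

```go
package main

import "fmt"

// countTF turns a token stream into a term-frequency map, keeping
// duplicates as counts rather than collapsing them to one entry.
func countTF(tokens []string) map[string]int {
	tf := make(map[string]int, len(tokens))
	for _, tok := range tokens {
		tf[tok]++
	}
	return tf
}

func main() {
	tf := countTF([]string{"parse", "config", "parse"})
	fmt.Println(tf["parse"], tf["config"]) // prints: 2 1
}
```

With a deduplicated stream every count is 1, which matches the "TF is always 1 in practice" note above.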
type StoreStats ¶
type StoreStats struct {
TotalSymbols int
TotalEdges int
TotalFiles int
SymbolsByKind map[SymbolKind]int
FilesByLang map[string]int
}
StoreStats holds aggregate counts for the indexed codebase.
type StructuralReranker ¶ added in v1.9.0
type StructuralReranker struct {
// MatchedTokenWeight scales the matched-token-ratio contribution.
// Default 1.0 — a full-match symbol gets +1.0 added to its base
// score, which is significant relative to the typical Jaccard
// range of 0.1-0.2 but doesn't trivially override BM25.
MatchedTokenWeight float64
// EdgeDensityWeight scales the log-normalized edge count
// contribution. Default 0.3 — mild boost; edge count alone
// shouldn't overwhelm real textual relevance.
EdgeDensityWeight float64
// KindBoostFunction is the additive weight for function / method
// symbols. Default 0.15 — small but enough to break ties in favor
// of actual implementations over type aliases.
KindBoostFunction float64
// ZeroEdgePenalty is the additive weight (usually negative) for
// function/method symbols with zero edges. Default -0.25 — pushes
// likely-dead-code below real matches without entirely removing it.
ZeroEdgePenalty float64
}
StructuralReranker is the default Reranker shipped in v1.9.0. It scores each candidate using a weighted combination of features that the RRF fusion can't see:
MatchedTokenRatio: fraction of query tokens that appear in the symbol's filtered shingle set. A symbol matching 4/4 query tokens should rank above one matching 1/4 even if the Jaccard estimator happens to put them at similar percentiles.
EdgeDensity: log-normalized edge count. Well-connected symbols are more likely to be the "real" implementation of a feature than zero-edge stub interfaces. Capped logarithmically so that a symbol with 200 edges doesn't dwarf one with 20.
KindWeight: function and method symbols get a small boost over type/interface declarations for implementation-hunting queries. Tuned on the SPEC §5.1 benchmark queries which all target "find me the code that does X" rather than "find me the type definition for X".
ZeroEdgePenalty: a symbol with zero edges and kind in {function, method} is either dead code or a parser limitation. Push it below other candidates that actually have connectivity.
All features are normalized to roughly the [0, 1] range before the weighted sum. Weights are exposed on the struct so callers can A/B tune them without modifying the package; the zero value uses sensible defaults picked by hand-inspection on the Task 23 content-control benchmark.
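The weighted sum over these four features might look like the sketch below. The field names and default weights come from the struct documentation; the structuralBonus function, its parameters, and the specific edge normalization (log1p of the edge count relative to log1p(200), clamped to 1) are assumptions chosen only to illustrate the shape of the computation.

```go
package main

import (
	"fmt"
	"math"
)

// StructuralReranker mirrors the documented weight fields.
type StructuralReranker struct {
	MatchedTokenWeight float64
	EdgeDensityWeight  float64
	KindBoostFunction  float64
	ZeroEdgePenalty    float64
}

// Documented default weights.
var defaults = StructuralReranker{1.0, 0.3, 0.15, -0.25}

// structuralBonus (hypothetical) combines the four features into one
// additive adjustment for a single candidate.
func structuralBonus(w StructuralReranker, matched, queryTokens, edges int, isFuncOrMethod bool) float64 {
	bonus := 0.0
	if queryTokens > 0 {
		bonus += w.MatchedTokenWeight * float64(matched) / float64(queryTokens)
	}
	// Assumed normalization: logarithmic, saturating at 200 edges.
	density := math.Min(1, math.Log1p(float64(edges))/math.Log1p(200))
	bonus += w.EdgeDensityWeight * density
	if isFuncOrMethod {
		bonus += w.KindBoostFunction
		if edges == 0 {
			bonus += w.ZeroEdgePenalty // likely dead code or a parser gap
		}
	}
	return bonus
}

func main() {
	fmt.Printf("4/4 match, 20 edges, func: %.3f\n", structuralBonus(defaults, 4, 4, 20, true))
	fmt.Printf("1/4 match, 20 edges, func: %.3f\n", structuralBonus(defaults, 1, 4, 20, true))
	fmt.Printf("4/4 match,  0 edges, func: %.3f\n", structuralBonus(defaults, 4, 4, 0, true))
	fmt.Printf("4/4 match,  0 edges, type: %.3f\n", structuralBonus(defaults, 4, 4, 0, false))
}
```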
func NewStructuralReranker ¶ added in v1.9.0
func NewStructuralReranker() *StructuralReranker
NewStructuralReranker returns a reranker with the default weights picked by hand-inspection on the content-control benchmark. Callers that want to experiment can construct StructuralReranker{} directly with custom weights instead.
func (*StructuralReranker) Rerank ¶ added in v1.9.0
func (r *StructuralReranker) Rerank(results []SearchResult, queryTokenCount int) []SearchResult
Rerank applies the structural rescore to results and returns a new ordering. The original Similarity / BM25Score fields on each SearchResult are preserved; only the slice order changes. Callers can audit the rerank by comparing the old order to the new one.
Ties (exact equal structural scores) are broken by the incoming order so the rerank is stable relative to the fused ranking. This matters because the fused ranking already encodes meaningful signal — we're enhancing it, not replacing it, and ties should fall back to "trust the upstream signal".
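The tie-breaking rule can be sketched with a stable sort from the standard library. The rerankStable helper and the pared-down SearchResult type are hypothetical; the point is only that sort.SliceStable leaves equal-scoring candidates in their incoming (fused) order and that sorting a copy keeps the caller's slice intact.

```go
package main

import (
	"fmt"
	"sort"
)

// SearchResult is a pared-down stand-in for the package's result type.
type SearchResult struct {
	Name       string
	Similarity float64
	BM25Score  float64
}

// rerankStable (hypothetical) orders a copy of results by descending
// structural score, leaving ties in their incoming order.
func rerankStable(results []SearchResult, score func(SearchResult) float64) []SearchResult {
	out := make([]SearchResult, len(results))
	copy(out, results) // fields are preserved; only the order changes
	sort.SliceStable(out, func(i, j int) bool {
		return score(out[i]) > score(out[j])
	})
	return out
}

func main() {
	fused := []SearchResult{{Name: "a"}, {Name: "b"}, {Name: "c"}}
	scores := map[string]float64{"a": 0.5, "b": 0.9, "c": 0.5}
	out := rerankStable(fused, func(r SearchResult) float64 { return scores[r.Name] })
	for _, r := range out {
		fmt.Println(r.Name)
	}
}
```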
type StubResult ¶ added in v1.8.3
StubResult represents a function/method with zero outgoing call edges.
type Symbol ¶
type Symbol struct {
ID int64
Name string
Kind SymbolKind
Package string
File string
Line int
Signature string
}
Symbol represents a code entity (function, type, interface, etc.).
type SymbolKind ¶
type SymbolKind string
SymbolKind identifies the kind of code symbol.
const (
SymbolFunction SymbolKind = "function"
SymbolMethod SymbolKind = "method"
SymbolType SymbolKind = "type"
SymbolInterface SymbolKind = "interface"
SymbolConst SymbolKind = "const"
SymbolVar SymbolKind = "var"
SymbolStruct SymbolKind = "struct"
SymbolImport SymbolKind = "import"
SymbolClass SymbolKind = "class"
)
type TSParser ¶ added in v1.9.0
type TSParser struct {
// contains filtered or unexported fields
}
TSParser parses TypeScript and TSX source files using tree-sitter. One parser holds both language pointers — selection is per-file by extension. The underlying tree_sitter.Parser is re-used across files (Parse() resets the internal state) so allocation stays cheap.
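Per-file grammar selection by extension can be sketched without touching tree-sitter at all. grammarFor is a hypothetical helper that returns a grammar name as a string; the real parser would hold language pointers instead, and its exact matching rules are not documented here.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// grammarFor (hypothetical) picks which bundled grammar applies to a
// file, based only on its extension. The bool is false for files the
// TypeScript parser should not handle.
func grammarFor(path string) (string, bool) {
	switch filepath.Ext(path) {
	case ".ts":
		return "typescript", true
	case ".tsx":
		return "tsx", true
	default:
		return "", false
	}
}

func main() {
	for _, p := range []string{"src/index.ts", "src/App.tsx", "tools/build.py"} {
		name, ok := grammarFor(p)
		fmt.Println(p, name, ok)
	}
}
```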
func NewTSParser ¶ added in v1.9.0
func NewTSParser() *TSParser
NewTSParser initializes the tree-sitter parser with the TypeScript and TSX grammars loaded. Returns an error if grammar wiring fails (shouldn't happen in practice — the grammars are statically linked).