codegraph

package
v1.9.0
Published: Apr 14, 2026 License: MIT Imports: 24 Imported by: 0

Documentation

Overview

BM25 additive scoring on top of the MinHash codegraph. Layered on the existing Jaccard ranking so both signals are returned per result and the caller (or future rank-fusion code) can reason about them independently.

BM25 addresses SPEC §8.2 Issue #2 (ubiquitous noise) at the scoring level rather than the token-filtering level: common tokens like "get" or "error" get low IDF weight and contribute little to the score even when they're shared between query and symbol. Rare tokens like "postgresql" or "kubernetes" dominate the score when they match.

Unlike the stopwords filter, BM25 is SYMMETRIC by design: a token's IDF is the same whether it shows up in a query or a symbol, so scoring is reciprocal and doesn't need asymmetric filtering logic.

TypeScript parser backed by tree-sitter. Replaces GenericParser for .ts and .tsx files when celeste is built with CGo enabled. The regex-based GenericParser produced symbols that matched identifier shapes but could not resolve call-graph edges through TypeScript's type-aware method dispatch, leaving most TS interfaces with edgeCount=0 in the codegraph (documented in SPEC §8.2 and surfaced by the Task 19 ⚠ zero-edge warning). An AST-based parser sees the real call sites and writes the edges that were previously missing.

Scope for v2.0.0: TypeScript (.ts and .tsx) only. Python and Rust stay on the regex GenericParser for now — they aren't the validation target for this task and they have no zero-edge warnings in the bundled benchmark corpus.

CGo caveat: this file and its dependencies pull in tree-sitter's C runtime and the bundled TypeScript/TSX grammars. The //go:build cgo constraint at the top gates compilation on CGO_ENABLED=1 — when cross-building release binaries from a Linux host for darwin/windows the Go toolchain disables CGo implicitly, and the stub in parser_ts_stub.go takes over. Stub builds fall back to the regex GenericParser for TypeScript files; they still work, just without the tree-sitter edge-resolution improvement. Users who want the full experience must either build from source with CGo enabled or wait for the v2.1.0 release workflow which will cross-compile against a proper C toolchain (zig CC or matrix of native runners).

Structural rerank layer.

After the Jaccard + BM25 fused ranking produces a preliminary order, this layer applies a scalar rescore that incorporates features the fusion doesn't see: how many query tokens actually matched the symbol's shingle set, how well-connected the symbol is in the call graph, and what KIND of symbol it is. The goal is to bubble the "obviously relevant" results above close-score ties where the fused order alone is ambiguous.

Pure Go, zero dependencies, zero cloud calls. All signals come from fields already present on SearchResult, so this layer has no effect on indexing latency and only a tiny reorder cost at query time.

The Reranker interface below is a deliberate seam: the current StructuralReranker is hand-tuned feature engineering, but a future EmbeddingReranker (local llama.cpp bridge, xAI embeddings endpoint, ONNX sentence-transformers, ...) can drop in at the same call site without touching the search pipeline.

Stopwords runtime integration — embeds stopwords.json at build time and exposes parsed lookup sets for ShinglesForSymbol and SemanticSearch to consume.

The embedded file is produced by celeste-stopwords and licensed under CC BY 4.0 — see stopwords_NOTICE.md for attribution.

Index

Constants

const (
	// WarnDemotedTest — result was demoted because its path matches a
	// test directory or test filename suffix.
	WarnDemotedTest = "demoted: test path"

	// WarnDemotedMock — result was demoted because its path is in a
	// mocks/, fixtures/, or stubs/ directory.
	WarnDemotedMock = "demoted: mock path"

	// WarnDemotedDeclaration — result was demoted because the symbol
	// lives in a .d.ts / .d.mts declaration-only file. Useful for TS
	// consumers: declaration files describe API surfaces but have no
	// runtime code, so a matching declaration probably isn't what you
	// want when looking for implementations.
	WarnDemotedDeclaration = "demoted: declaration-only file"

	// WarnDemotedVendored — result was demoted because the symbol is
	// in a vendored third-party directory (vendor/, node_modules/,
	// third_party/).
	WarnDemotedVendored = "demoted: vendored code"

	// WarnDemotedGenerated — result was demoted because the symbol is
	// in a build-output directory (dist/, build/, .next/, target/).
	WarnDemotedGenerated = "demoted: generated code"

	// WarnZeroEdge — the symbol has zero incoming AND zero outgoing
	// edges in the code graph. Two possible interpretations:
	//   1. Genuine dead code. Nothing calls it, it calls nothing.
	//   2. Parser limitation. The regex parser for TS/Python/Rust
	//      cannot resolve many call sites and edges for non-Go
	//      languages are systematically undercounted. An LLM should
	//      NOT conclude "dead code" from this warning alone — verify
	//      by reading the file.
	// SPEC §8.2 Issue #2 documents this ambiguity.
	WarnZeroEdge = "zero edges — may be dead code or parser limitation"

	// WarnLowConfidence — Jaccard similarity is below 0.10. Results
	// at this tier are right at the signal/noise boundary for MinHash
	// with 128 hash functions (pairwise-independent FNV variant). An
	// LLM should treat these as "maybe relevant" not "definitely
	// relevant" and verify by reading the source.
	WarnLowConfidence = "low confidence (jaccard < 0.10)"

	// WarnDeclarationOnlyType — symbol is a pure type/interface with
	// no body and zero edges. Common in TS type declaration files
	// and Go interface-only types. Probably not runtime code the
	// user wants to find.
	WarnDeclarationOnlyType = "type/interface declaration without references"
)

Confidence warning constants. These strings are stable across releases because callers (LLM tool users, UIs, scripts) may match on them directly. Add new ones freely but do NOT rename or remove existing ones without a version bump.

const DefaultNumHashes = 128

DefaultNumHashes is the number of hash functions used for MinHash signatures. 128 provides good accuracy with sub-10ms query time for 50k symbols.

Variables

This section is empty.

Functions

func BytesToSeeds added in v1.9.0

func BytesToSeeds(data []byte) ([]uint64, error)

BytesToSeeds deserializes a byte slice back into a seed slice. Returns an error if the length is not a multiple of 8.

func ComputeBM25Score added in v1.9.0

func ComputeBM25Score(queryTokens []string, docTokens map[string]int, docLength int, idf map[string]float64, avgDocLen float64) float64

ComputeBM25Score computes the BM25 score for a single symbol against a query.

  • queryTokens: deduplicated lowercase tokens from the query
  • docTokens: map[token] = term frequency (TF) for this symbol
  • docLength: total token count for the symbol
  • idf: map[token] = precomputed IDF for each query token
  • avgDocLen: average doc length across the corpus

Pure function, no store access. Callers resolve the inputs from stored data first, then call this repeatedly across the candidate set.
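As a rough illustration of how the inputs listed above combine, here is a minimal sketch of the textbook Okapi BM25 sum. The constants k1 = 1.2 and b = 0.75 are conventional defaults assumed for this sketch; the package does not document which values (or which BM25 variant) it uses, and bm25Score is a hypothetical name, not the package's function.

```go
package main

import "fmt"

// Assumed free parameters (common textbook defaults, not taken from
// the codegraph package).
const (
	k1 = 1.2
	b  = 0.75
)

// bm25Score sums, over the query tokens present in the document,
// idf * tf*(k1+1) / (tf + k1*(1 - b + b*docLength/avgDocLen)).
func bm25Score(queryTokens []string, docTokens map[string]int, docLength int, idf map[string]float64, avgDocLen float64) float64 {
	if avgDocLen <= 0 {
		return 0 // no corpus stats, nothing to normalize against
	}
	norm := k1 * (1 - b + b*float64(docLength)/avgDocLen)
	score := 0.0
	for _, tok := range queryTokens {
		tf := float64(docTokens[tok])
		if tf == 0 {
			continue // token absent from this symbol
		}
		score += idf[tok] * tf * (k1 + 1) / (tf + norm)
	}
	return score
}

func main() {
	idf := map[string]float64{"postgresql": 3.0, "get": 0.1}
	doc := map[string]int{"postgresql": 1, "get": 1}
	// The rare token contributes far more than the common one.
	fmt.Printf("%.3f\n", bm25Score([]string{"postgresql", "get"}, doc, 10, idf, 10))
}
```

With an average-length document and a term frequency of 1, each matching token contributes roughly its IDF, which is why a rare token such as "postgresql" outweighs a common one such as "get".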

func ComputeFusedRanking added in v1.9.0

func ComputeFusedRanking(jaccardRanks, bm25Ranks map[int64]int) []int64

ComputeFusedRanking combines two ranked lists into a single ranking using Reciprocal Rank Fusion: each entry's fused score is the sum of 1/(k + rank) across the lists it appears in. Higher fused score = better overall rank.

jaccardRanks and bm25Ranks each map a symbol ID to its position in the corresponding list (1-indexed). A symbol absent from a list contributes nothing for that list.

This is deliberately not a method on any type — it's a pure function over data structures so it's trivial to test in isolation.
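The fusion rule described above can be sketched in a few lines. The constant k = 60 is the value commonly used for Reciprocal Rank Fusion and is an assumption here, as is the tie-break by ascending ID; fuseRanks is a hypothetical stand-in, not the package's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfK is an assumed smoothing constant; the package's value is not documented.
const rrfK = 60

// fuseRanks returns symbol IDs ordered by descending fused score, where
// each list a symbol appears in contributes 1/(k + rank).
func fuseRanks(jaccardRanks, bm25Ranks map[int64]int) []int64 {
	scores := make(map[int64]float64)
	for id, r := range jaccardRanks {
		scores[id] += 1.0 / float64(rrfK+r)
	}
	for id, r := range bm25Ranks {
		scores[id] += 1.0 / float64(rrfK+r)
	}
	ids := make([]int64, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		si, sj := scores[ids[i]], scores[ids[j]]
		if si != sj {
			return si > sj
		}
		return ids[i] < ids[j] // assumed tie-break for a stable order
	})
	return ids
}

func main() {
	jaccard := map[int64]int{10: 1, 20: 2, 30: 3}
	bm25 := map[int64]int{20: 1, 30: 2}
	// Symbol 10 tops the Jaccard list but is missing from the BM25
	// list, so symbols present in both lists outrank it.
	fmt.Println(fuseRanks(jaccard, bm25)) // [20 30 10]
}
```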

func DefaultIndexPath added in v1.8.3

func DefaultIndexPath(projectRoot string) string

DefaultIndexPath returns the path to the code graph database for a project. It stores the index under ~/.celeste/projects/<hash>/codegraph.db to avoid polluting the project directory.

func DetectLanguage

func DetectLanguage(filename string) string

DetectLanguage returns the language for a file based on its extension. Returns empty string if the language is not recognized.

func DetectProjectLanguage

func DetectProjectLanguage(dir string) string

DetectProjectLanguage determines the primary language of a project by checking for manifest files in the given directory.

func FormatConfidenceLine added in v1.9.0

func FormatConfidenceLine(r SearchResult) string

FormatConfidenceLine returns a human-readable one-line summary of a SearchResult's confidence metadata, suitable for appending to CLI / tool output. Empty string if there's nothing notable.

Example output:

"  ⚠ demoted: mock path; zero edges — may be dead code or parser limitation; edges=0"
"  edges=12"

func IsDemotable added in v1.9.0

func IsDemotable(flags []PathFlag) bool

IsDemotable returns true if the flag set is non-empty — i.e., at least one demotion reason applies. Pure convenience helper.

func IsIndexableFile

func IsIndexableFile(filename string) bool

IsIndexableFile returns true if the file's language has parser support.

func JaccardSimilarity

func JaccardSimilarity(a, b MinHashSignature) float64

JaccardSimilarity estimates the Jaccard similarity between two MinHash signatures. Returns a value between 0.0 (completely different) and 1.0 (identical sets).

IMPORTANT: both signatures must have been computed with the SAME hash seeds for this to produce meaningful results. Comparing signatures from different MinHashers (different seeds) yields noise.
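A minimal sketch of the standard MinHash estimator, for intuition: the estimate is the fraction of signature positions where the two signatures agree. estimateJaccard is a hypothetical name, and returning 0 for mismatched lengths is an assumption of this sketch rather than documented behavior.

```go
package main

import "fmt"

// estimateJaccard returns the fraction of positions at which the two
// signatures hold the same minimum hash value.
func estimateJaccard(a, b []uint64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0 // assumed handling of incomparable signatures
	}
	match := 0
	for i := range a {
		if a[i] == b[i] {
			match++
		}
	}
	return float64(match) / float64(len(a))
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{1, 9, 3, 8}
	fmt.Println(estimateJaccard(a, b)) // 0.5
}
```

Position i of each signature only means something relative to hash function i, which is why signatures built from different seeds cannot be compared.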

func PathFlagStrings added in v1.9.0

func PathFlagStrings(flags []PathFlag) []string

PathFlagStrings converts a []PathFlag to a []string for serialization to JSON / API responses / logs.

func SeedsToBytes added in v1.9.0

func SeedsToBytes(seeds []uint64) []byte

SeedsToBytes serializes a seed slice to bytes for persistence. Layout is little-endian uint64 × N, for a total of 8*N bytes.
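The layout described above (and the length check in BytesToSeeds) can be sketched with encoding/binary. The lowercase function names and the error message are illustrative only; this is not the package's source.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// seedsToBytes packs each seed as 8 little-endian bytes, back to back.
func seedsToBytes(seeds []uint64) []byte {
	out := make([]byte, 8*len(seeds))
	for i, s := range seeds {
		binary.LittleEndian.PutUint64(out[i*8:], s)
	}
	return out
}

// bytesToSeeds reverses seedsToBytes and rejects inputs whose length
// is not a multiple of 8.
func bytesToSeeds(data []byte) ([]uint64, error) {
	if len(data)%8 != 0 {
		return nil, errors.New("seed blob length is not a multiple of 8")
	}
	seeds := make([]uint64, len(data)/8)
	for i := range seeds {
		seeds[i] = binary.LittleEndian.Uint64(data[i*8:])
	}
	return seeds, nil
}

func main() {
	blob := seedsToBytes([]uint64{1, 0xDEADBEEF})
	fmt.Println(len(blob)) // 16
	seeds, err := bytesToSeeds(blob)
	fmt.Println(seeds, err)
}
```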

func ShinglesForSymbol

func ShinglesForSymbol(sym Symbol, source []byte, lang string) []string

ShinglesForSymbol generates enriched shingles for a symbol, used as input to MinHash for semantic similarity search. Each shingle is a lowercased token derived from the symbol's name, types, body references, package, and comments.

The final token list is filtered through the embedded stopwords.json (celeste-stopwords v1.0.0, CC BY 4.0) via stopWords.Filter. The filter applies the universal set plus the per-language set identified by lang. Pass "" for lang to apply only the universal set — this is the right choice for callers that don't know the file's language.

func ShouldSkipPath

func ShouldSkipPath(path string) bool

ShouldSkipPath returns true if the path should be excluded from indexing.

Types

type BM25CorpusStats added in v1.9.0

type BM25CorpusStats struct {
	NumDocs      int
	AvgDocLength float64
}

BM25CorpusStats holds the corpus-wide statistics BM25 needs for scoring: total document count and average document length (in shingle tokens). Computed once at the end of Build() and cached via the meta table so query-time scoring is a single lookup.
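To make the role of these two statistics concrete, the sketch below derives them from a list of document lengths and feeds NumDocs into an IDF formula. The IDF variant shown, ln(1 + (N - n + 0.5)/(n + 0.5)), is a widely used BM25 form chosen as an assumption; the package does not state which formula it uses, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// corpusStats returns the document count and the mean document length.
func corpusStats(docLengths []int) (numDocs int, avgDocLength float64) {
	if len(docLengths) == 0 {
		return 0, 0
	}
	total := 0
	for _, n := range docLengths {
		total += n
	}
	return len(docLengths), float64(total) / float64(len(docLengths))
}

// idf uses an assumed BM25 IDF variant that is always positive:
// tokens found in many documents get a weight close to zero.
func idf(numDocs, docFreq int) float64 {
	n, df := float64(numDocs), float64(docFreq)
	return math.Log(1 + (n-df+0.5)/(df+0.5))
}

func main() {
	numDocs, avg := corpusStats([]int{2, 4, 6})
	fmt.Println(numDocs, avg) // 3 4
	// A token in 900 of 1000 symbols versus one in only 5 of them.
	fmt.Printf("common=%.3f rare=%.3f\n", idf(1000, 900), idf(1000, 5))
}
```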

type CodeSmell added in v1.8.3

type CodeSmell struct {
	Kind      CodeSmellKind `json:"kind"`
	Name      string        `json:"name"`
	File      string        `json:"file"`
	Line      int           `json:"line"`
	FuncKind  string        `json:"func_kind"`
	OutEdges  int           `json:"outgoing_edges"`
	InEdges   int           `json:"incoming_edges"`
	Score     float64       `json:"score"`
	Reason    string        `json:"reason"`
	Signature string        `json:"signature,omitempty"`
	Snippet   string        `json:"snippet,omitempty"`
}

CodeSmell represents a structurally detected code issue.

type CodeSmellKind added in v1.8.3

type CodeSmellKind string

CodeSmellKind categorizes the type of code smell detected.

const (
	SmellLazyRedirect CodeSmellKind = "LAZY_REDIRECT"
	SmellStub         CodeSmellKind = "STUB"
	SmellPlaceholder  CodeSmellKind = "PLACEHOLDER"
	SmellTodoFixme    CodeSmellKind = "TODO_FIXME"
	SmellEmptyHandler CodeSmellKind = "EMPTY_HANDLER"
	SmellHardcoded    CodeSmellKind = "HARDCODED"
)

type Edge

type Edge struct {
	SourceID int64
	TargetID int64
	Kind     EdgeKind
}

Edge represents a relationship between two symbols.

type EdgeKind

type EdgeKind string

EdgeKind identifies the kind of relationship between symbols.

const (
	EdgeCalls      EdgeKind = "calls"
	EdgeImports    EdgeKind = "imports"
	EdgeImplements EdgeKind = "implements"
	EdgeEmbeds     EdgeKind = "embeds"
	EdgeReferences EdgeKind = "references"
)

type FileEdge added in v1.8.3

type FileEdge struct {
	Source string
	Target string
	Count  int
}

FileEdge represents a connection between two files.

type FileRecord

type FileRecord struct {
	Path        string
	Language    string
	Size        int64
	ContentHash string
	IndexedAt   int64
}

FileRecord tracks indexed files for incremental updates.

type FunctionEdgeInfo added in v1.8.3

type FunctionEdgeInfo struct {
	Name      string
	File      string
	Line      int
	Kind      string
	Signature string
	OutEdges  int
	InEdges   int
}

FunctionEdgeInfo holds a function's identity and edge counts for analysis.

type GenericParser

type GenericParser struct {
	// contains filtered or unexported fields
}

GenericParser extracts symbols from non-Go source files using regex patterns. Covers Python, JavaScript, TypeScript, and Rust. No call graph (would need tree-sitter / CGo). Focuses on declarations: functions, classes, imports.

func NewGenericParser

func NewGenericParser(language string) *GenericParser

NewGenericParser creates a parser for the given language.

func (*GenericParser) ParseFile

func (p *GenericParser) ParseFile(path string) (*ParseResult, error)

ParseFile parses a source file and extracts symbols using regex.

type GitignoreFilter added in v1.8.3

type GitignoreFilter struct {
	// contains filtered or unexported fields
}

GitignoreFilter holds compiled gitignore patterns for matching.

func LoadGitignore added in v1.8.3

func LoadGitignore(projectRoot string) *GitignoreFilter

LoadGitignore reads a .gitignore file and returns a filter. Returns nil (no filter) if the file doesn't exist or can't be read.

func (*GitignoreFilter) ShouldSkip added in v1.8.3

func (f *GitignoreFilter) ShouldSkip(relPath string, isDir bool) bool

ShouldSkip returns true if the given relative path should be ignored. isDir indicates whether the path is a directory.

type GoParser

type GoParser struct{}

GoParser extracts symbols and edges from Go source files using go/ast.

func NewGoParser

func NewGoParser() *GoParser

NewGoParser creates a new Go AST parser.

func (*GoParser) ParseFile

func (p *GoParser) ParseFile(path string) (*ParseResult, error)

ParseFile parses a single Go source file and extracts symbols and edges.

type Indexer

type Indexer struct {
	// contains filtered or unexported fields
}

Indexer manages the code graph lifecycle: build, update, and query.

func NewIndexer

func NewIndexer(workspace, dbPath string) (*Indexer, error)

NewIndexer creates an indexer for the given workspace, using the specified SQLite database path.

Reloads the MinHasher seeds from the store's meta table if present so stored signatures remain comparable across process invocations. If no seeds are stored (fresh index or pre-v1.9.0 index), generates fresh random seeds that will be persisted on the first Build().

func NewIndexerWithStore added in v1.8.3

func NewIndexerWithStore(store *Store, workspace string) *Indexer

NewIndexerWithStore creates an indexer using an existing store. This is useful for testing where the store is set up manually. Unlike NewIndexer, does NOT attempt to load seeds from the store — the caller is responsible for passing a store that either has no meta row yet or whose seeds are irrelevant for the test.

func (*Indexer) Build

func (idx *Indexer) Build() error

Build performs a full index of the workspace. Walks the file tree, parses source files, extracts symbols and edges, computes MinHash signatures, and stores everything in SQLite.

func (*Indexer) Close

func (idx *Indexer) Close() error

Close releases the underlying database connection and any native resources held by the tree-sitter TS parser.

func (*Indexer) FindCodeSmells added in v1.8.3

func (idx *Indexer) FindCodeSmells(kinds []CodeSmellKind, maxResults int, includeTests bool) ([]CodeSmell, error)

FindCodeSmells performs a single-pass structural analysis over all functions in the graph, detecting multiple code smell patterns simultaneously. This is more efficient than separate queries and more powerful than grep because it combines graph structure (edges, connectivity) with body analysis.

func (*Indexer) FindLazyRedirects added in v1.8.3

func (idx *Indexer) FindLazyRedirects(maxResults int, includeTests bool) ([]LazyRedirectResult, error)

FindLazyRedirects uses structural analysis to detect functions whose names imply complex behavior but whose graph structure shows they're trivially simple. This goes beyond grep-based detection by measuring the divergence between a function's semantic vocabulary (shingles) and its actual call graph connectivity.

Scoring factors:

  • Name complexity: action verbs in name suggest the function should DO work
  • Edge poverty: fewer outgoing edges = less actual work done
  • Shingle richness: domain-specific vocabulary in body that doesn't connect to edges

Returns results sorted by divergence score (highest = most suspicious).

func (*Indexer) KeywordSearch

func (idx *Indexer) KeywordSearch(query string, limit int) ([]Symbol, error)

KeywordSearch finds symbols matching a keyword query using SQL LIKE.

func (*Indexer) PackageGraph added in v1.8.3

func (idx *Indexer) PackageGraph() ([]PackageInfo, []PackageEdge, error)

PackageGraph returns package-level connectivity for visualization.

func (*Indexer) ProjectSummary

func (idx *Indexer) ProjectSummary() string

ProjectSummary returns a brief summary suitable for the system prompt.

func (*Indexer) SemanticSearch

func (idx *Indexer) SemanticSearch(query string, topK int) ([]SearchResult, error)

SemanticSearch finds symbols semantically similar to the query string. The query is split into shingles, MinHashed, then compared against all symbol signatures using brute-force Jaccard similarity.

Applies the path-based post-filter by default: test/mock/generated/vendored/declaration results are partitioned below clean-path results of comparable similarity. Use SemanticSearchWithOptions to disable.

func (*Indexer) SemanticSearchWithOptions added in v1.9.0

func (idx *Indexer) SemanticSearchWithOptions(query string, opts SemanticSearchOptions) ([]SearchResult, error)

SemanticSearchWithOptions is the full-options variant of SemanticSearch.

func (*Indexer) Stats

func (idx *Indexer) Stats() (*StoreStats, error)

Stats returns aggregate stats for the indexed codebase.

func (*Indexer) Store

func (idx *Indexer) Store() *Store

Store returns the underlying store for direct queries (used by tools).

func (*Indexer) Update

func (idx *Indexer) Update() error

Update performs an incremental update. Only re-indexes files whose content hash has changed since the last index. Removes symbols for deleted files.
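The decision Update has to make can be sketched as a comparison of two path-to-hash maps. This is an illustration under assumptions: SHA-256 as the digest and the planUpdate helper are both hypothetical, and the real indexer reads its stored hashes from the SQLite store rather than from a map.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// contentHash stands in for whatever digest the index stores;
// SHA-256 is an assumption made for this sketch.
func contentHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// planUpdate compares stored hashes (path -> hash at last index) with
// current hashes (path -> hash on disk now). New or changed files are
// re-indexed; files that disappeared have their symbols removed.
func planUpdate(stored, current map[string]string) (reindex, removed []string) {
	for path, h := range current {
		if stored[path] != h {
			reindex = append(reindex, path)
		}
	}
	for path := range stored {
		if _, ok := current[path]; !ok {
			removed = append(removed, path)
		}
	}
	sort.Strings(reindex)
	sort.Strings(removed)
	return reindex, removed
}

func main() {
	stored := map[string]string{
		"a.go":   contentHash([]byte("package a")),
		"b.go":   contentHash([]byte("package b")),
		"old.go": contentHash([]byte("package old")),
	}
	current := map[string]string{
		"a.go":   contentHash([]byte("package a")),
		"b.go":   contentHash([]byte("package b // edited")),
		"new.go": contentHash([]byte("package n")),
	}
	reindex, removed := planUpdate(stored, current)
	fmt.Println(reindex, removed) // [b.go new.go] [old.go]
}
```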

type LazyRedirectCandidate added in v1.8.3

type LazyRedirectCandidate struct {
	Name      string
	File      string
	Line      int
	Kind      string
	OutEdges  int
	InEdges   int
	Signature string
}

LazyRedirectCandidate represents a function whose name implies complex behavior but whose graph structure shows it's structurally trivial — a potential lazy redirect.

type LazyRedirectResult added in v1.8.3

type LazyRedirectResult struct {
	Name      string  `json:"name"`
	File      string  `json:"file"`
	Line      int     `json:"line"`
	Kind      string  `json:"kind"`
	OutEdges  int     `json:"outgoing_edges"`
	InEdges   int     `json:"incoming_edges"`
	Score     float64 `json:"divergence_score"`
	Reason    string  `json:"reason"`
	Signature string  `json:"signature"`
}

LazyRedirectResult is a scored candidate for lazy redirect detection.

type MinHashEntry

type MinHashEntry struct {
	SymbolID  int64
	Signature MinHashSignature
}

MinHashEntry pairs a symbol ID with its MinHash signature for bulk queries.

type MinHashSignature

type MinHashSignature []uint64

MinHashSignature is a fixed-length slice of hash values for similarity search, one value per hash function.

type MinHasher

type MinHasher struct {
	// contains filtered or unexported fields
}

MinHasher computes MinHash signatures for sets of shingles. Uses FNV-1a with different uint64 seeds to simulate N independent hash functions. Seeds are a fixed []uint64 so the hasher can be persisted to the codegraph store and restored across process invocations — essential for reliable cross-process semantic search.

func NewMinHasher

func NewMinHasher(numHashes int) *MinHasher

NewMinHasher creates a MinHasher with the specified number of hash functions, generating fresh random seeds from crypto/rand. Use NewMinHasherFromSeeds when reloading a persisted hasher from the store.

func NewMinHasherFromSeeds added in v1.9.0

func NewMinHasherFromSeeds(seeds []uint64) *MinHasher

NewMinHasherFromSeeds creates a MinHasher with pre-determined seeds, typically reloaded from the codegraph store's meta table. This is the critical path for cross-process signature stability: a MinHash signature computed with seeds S can only be compared to another signature computed with the SAME seeds S. Persisting the seeds and restoring them on Open is what makes SemanticSearch work across process boundaries.

func (*MinHasher) NumHashes added in v1.9.0

func (m *MinHasher) NumHashes() int

NumHashes returns the signature length.

func (*MinHasher) Seeds added in v1.9.0

func (m *MinHasher) Seeds() []uint64

Seeds returns a copy of the hasher's seeds. Used by the indexer to persist them into the codegraph store's meta table at build time.

func (*MinHasher) Signature

func (m *MinHasher) Signature(shingles []string) MinHashSignature

Signature computes the MinHash signature for a set of shingles. Each element of the returned slice is the minimum hash value across all shingles for that hash function.
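A minimal sketch of the per-seed minimum described above, using hash/fnv. How the real MinHasher mixes a seed into FNV-1a is not documented; writing the seed's 8 bytes ahead of the shingle is an assumption of this sketch, and the signature function is a hypothetical stand-in.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
)

// signature returns, for each seed, the smallest seeded FNV-1a hash
// seen across all shingles.
func signature(seeds []uint64, shingles []string) []uint64 {
	sig := make([]uint64, len(seeds))
	var seedBuf [8]byte
	for i, seed := range seeds {
		lowest := uint64(math.MaxUint64)
		binary.LittleEndian.PutUint64(seedBuf[:], seed)
		for _, s := range shingles {
			h := fnv.New64a()
			h.Write(seedBuf[:]) // assumed seeding scheme: prefix the seed bytes
			h.Write([]byte(s))
			if v := h.Sum64(); v < lowest {
				lowest = v
			}
		}
		sig[i] = lowest
	}
	return sig
}

func main() {
	seeds := []uint64{11, 22, 33, 44}
	a := signature(seeds, []string{"parse", "file"})
	b := signature(seeds, []string{"file", "parse"})
	// Shingle order does not matter: only the per-seed minimum is kept.
	fmt.Println(len(a), a[0] == b[0])
}
```

Because each position keeps only a minimum, adding shingles to a set can lower a position's value but never raise it.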

type PackageEdge added in v1.8.3

type PackageEdge struct {
	Source string
	Target string
	Count  int
}

PackageEdge represents a connection between two packages.

type PackageInfo added in v1.8.3

type PackageInfo struct {
	Name        string
	SymbolCount int
	FileCount   int
}

PackageInfo holds package-level stats for visualization.

type ParseResult

type ParseResult struct {
	Symbols []Symbol
	Edges   []RawEdge
	Source  []byte // raw file content for shingle generation
}

ParseResult holds the symbols and edges extracted from a single file.

type PathFlag added in v1.9.0

type PathFlag string

PathFlag is a machine-readable marker attached to a search result when the symbol's file path matches a known pattern that affects its interpretation — test fixture, mock, type declaration, vendored code, build output, etc.

Flags are computed at query time (not stored in the index), so adding new flag categories does not invalidate existing codegraph databases. Callers can read SearchResult.PathFlags to understand WHY a symbol was demoted from the "clean" ranking tier.

const (
	// FlagTest — symbol lives in a test file or test directory. These
	// are genuine test helpers: TestFoo functions in Go's _test.go files,
	// tests/*.py in Python, *.spec.ts / *.test.ts in TypeScript, and so on.
	FlagTest PathFlag = "test"

	// FlagMock — symbol is in a mocks/, fixtures/, or stubs/ directory.
	// Mock handlers, fake services, test doubles. These pollute queries
	// like "http request handler middleware" because they share
	// discriminative tokens with production middleware without BEING
	// production middleware. Q2 in the grafana A/B test was 100% mock
	// handlers for exactly this reason.
	FlagMock PathFlag = "mock"

	// FlagDeclaration — symbol is in a pure type declaration file
	// (e.g. TypeScript .d.ts). These describe an API surface but have
	// no runtime code. Usually undesirable as a semantic search match
	// because the user is looking for implementations, not declarations.
	// JQueryStatic lives in a .d.ts file and this flag would demote it
	// even without the splitCamelCase fix.
	FlagDeclaration PathFlag = "declaration"

	// FlagVendored — symbol is in a vendored dependency or third-party
	// package directory (vendor/, node_modules/, bower_components/).
	// These are external code the user didn't write. Usually irrelevant.
	FlagVendored PathFlag = "vendored"

	// FlagGenerated — symbol is in a generated-code output directory
	// (dist/, build/, .next/, out/, target/). Post-compile artifacts,
	// transpiled output, build caches. Never what the user wants.
	FlagGenerated PathFlag = "generated"
)

func ClassifyPath added in v1.9.0

func ClassifyPath(path string) []PathFlag

ClassifyPath inspects a file path and returns the set of PathFlags that apply. Empty result means the symbol is in a "clean" path with no demotion warranted.

Deterministic, fast (O(path length)), and order-independent: the same path always produces the same flag set.
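For orientation, the kind of matching this implies can be sketched as below. The directory names and suffixes are taken from the flag and warning comments in this package's documentation, except "test" and "tests" as directory names, which are assumptions. The matching rules, the fixed flag order, and the classify function itself are hypothetical; the real ClassifyPath may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// classify returns flag names for a slash-separated path, in a fixed
// order so the same path always yields the same result.
func classify(path string) []string {
	segs := strings.Split(strings.ToLower(path), "/")
	file := segs[len(segs)-1]
	dirs := segs[:len(segs)-1]

	inDir := func(names ...string) bool {
		for _, d := range dirs {
			for _, n := range names {
				if d == n {
					return true
				}
			}
		}
		return false
	}
	hasSuffix := func(suffixes ...string) bool {
		for _, s := range suffixes {
			if strings.HasSuffix(file, s) {
				return true
			}
		}
		return false
	}

	var flags []string
	if inDir("test", "tests") || hasSuffix("_test.go", ".spec.ts", ".test.ts") {
		flags = append(flags, "test")
	}
	if inDir("mocks", "fixtures", "stubs") {
		flags = append(flags, "mock")
	}
	if hasSuffix(".d.ts", ".d.mts") {
		flags = append(flags, "declaration")
	}
	if inDir("vendor", "node_modules", "third_party", "bower_components") {
		flags = append(flags, "vendored")
	}
	if inDir("dist", "build", ".next", "out", "target") {
		flags = append(flags, "generated")
	}
	return flags
}

func main() {
	fmt.Println(classify("pkg/api/mocks/handler_test.go"))  // [test mock]
	fmt.Println(classify("node_modules/jquery/index.d.ts")) // [declaration vendored]
	fmt.Println(len(classify("src/server.ts")))             // 0
}
```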

type RawEdge

type RawEdge struct {
	SourceName string
	TargetName string
	Kind       EdgeKind
}

RawEdge is an unresolved edge that uses symbol names instead of IDs. Resolved to Edge (with IDs) when inserted into the store.

type Reranker added in v1.9.0

type Reranker interface {
	Rerank(results []SearchResult, queryTokenCount int) []SearchResult
}

Reranker rescores a set of SearchResult candidates using features beyond the baseline Jaccard + BM25 fusion. Implementations MUST be pure — no network, no file I/O beyond what was passed in. Callers are responsible for cloning the input if they need the original order preserved; Rerank is allowed to mutate the slice in place.

queryTokenCount is the number of distinct query shingles that made it through the stop-word filter. Rerankers that compute a matched-token ratio need this for normalization.
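To show how an implementation might use queryTokenCount for normalization, here is a deliberately simple, hypothetical reranker over a trimmed-down local result type. The type names, the weight, and the scoring rule are all assumptions made for illustration; this is not StructuralReranker.

```go
package main

import (
	"fmt"
	"sort"
)

// result is a local stand-in carrying only the fields this sketch needs.
type result struct {
	Name          string
	Similarity    float64
	MatchedTokens []string
}

// reranker mirrors the shape of the Reranker interface for the local type.
type reranker interface {
	Rerank(results []result, queryTokenCount int) []result
}

// matchRatioReranker boosts results by the share of query tokens they
// matched. The weight is an arbitrary illustrative knob.
type matchRatioReranker struct{ weight float64 }

func (m matchRatioReranker) Rerank(results []result, queryTokenCount int) []result {
	if queryTokenCount <= 0 {
		return results // nothing to normalize against
	}
	score := func(r result) float64 {
		ratio := float64(len(r.MatchedTokens)) / float64(queryTokenCount)
		return r.Similarity + m.weight*ratio
	}
	// Sorts in place, which the interface contract permits.
	sort.SliceStable(results, func(i, j int) bool {
		return score(results[i]) > score(results[j])
	})
	return results
}

func main() {
	var r reranker = matchRatioReranker{weight: 0.1}
	out := r.Rerank([]result{
		{Name: "A", Similarity: 0.30, MatchedTokens: []string{"http"}},
		{Name: "B", Similarity: 0.28, MatchedTokens: []string{"http", "request", "handler", "middleware"}},
	}, 4)
	fmt.Println(out[0].Name, out[1].Name) // B A
}
```

B starts slightly behind A on similarity but matched all four query tokens, so the ratio term lifts it above A.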

type SearchResult

type SearchResult struct {
	Symbol     Symbol
	Similarity float64

	// BM25Score is the additive per-symbol BM25 score for this query,
	// computed alongside the Jaccard similarity at search time. Not a
	// replacement for Similarity — both signals are returned so callers
	// (or a downstream re-rank layer) can reason about them independently.
	// Zero when the BM25 corpus stats table is empty (pre-v1.9.0 index).
	BM25Score float64

	// MatchedTokens are the query tokens that appeared in this symbol's
	// filtered shingle set (intersection of query and symbol tokens).
	// Populated only when BM25 scoring is active. Useful reasoning output
	// for LLMs: "this result matched because it contains X, Y, Z".
	MatchedTokens []string

	PathFlags          []string
	EdgeCount          int
	ConfidenceWarnings []string
}

SearchResult pairs a symbol with its similarity score and a set of machine-readable reasoning fields that tell an LLM (or a human) WHY this result was returned and how confident celeste is in it.

PathFlags: markers attached when the symbol's file path triggered the path-based post-filter, e.g. ["test"], ["mock", "generated"]. Clean-path results have an empty PathFlags slice. SemanticSearch demotes flagged results below clean results by default; see SemanticSearchOptions.ApplyPathFilter to disable.

EdgeCount: total incoming + outgoing edges on this symbol in the code graph. A function that is called from 4 places and calls 2 others has EdgeCount=6. Zero-edge symbols are suspicious — they may be genuine dead code, but they may also be symbols the parser failed to resolve (especially TS/Python/Rust where the regex parser can't follow call sites through type definitions). SPEC §8.2 Issue #2 documents this ambiguity explicitly; LLMs should NOT treat EdgeCount=0 as proof of dead code without corroborating evidence.

ConfidenceWarnings: human-readable strings describing caveats about this result. Derived at query time from PathFlags, EdgeCount, Kind, and Similarity — no schema change, no precomputation. Callers should surface these to whoever consumes the search results so low-quality matches are recognized as such instead of being treated as confident answers.

type SemanticSearchOptions added in v1.9.0

type SemanticSearchOptions struct {
	// TopK is the maximum number of results to return. Required.
	TopK int

	// MinSimilarity is the Jaccard floor below which results are dropped
	// entirely. Zero means use the default (0.05).
	MinSimilarity float64

	// ApplyPathFilter, when true, demotes results whose file path matches
	// a known "noisy" pattern (test/mock/generated/vendored/declaration)
	// below clean-path results. Default when using SemanticSearch is true.
	// Set false for raw unfiltered results.
	ApplyPathFilter bool

	// Reranker, when non-nil, is applied to the candidate list after
	// the Jaccard + BM25 fusion and before the path filter tiering.
	// A pluggable seam — the default (set via SemanticSearch) is
	// StructuralReranker which does pure-Go feature-based rescoring.
	// Future cloud/local embedding rerankers can implement this
	// interface without touching the search pipeline.
	//
	// Pass a zero value (nil) together with
	// DisableRerank=true to get the pre-Task-24 behavior (fusion-only).
	Reranker Reranker

	// DisableRerank bypasses the Reranker even if one is set.
	// Useful for A/B testing and for callers that want the raw
	// fused ordering without any structural adjustments.
	DisableRerank bool
}

SemanticSearchOptions configures SemanticSearch behavior. Existing callers of SemanticSearch(query, topK) get the default behavior — path filter ON, structural rerank ON — without any changes.

type StopWords added in v1.9.0

type StopWords struct {
	Version   string
	Universal map[string]bool
	ByLang    map[string]map[string]bool
	Compound  map[string]bool
}

StopWords holds the parsed lookup sets used at shingle-generation time and query-tokenization time. Built once at init.

func (*StopWords) Filter added in v1.9.0

func (s *StopWords) Filter(tokens []string, lang string) []string

Filter removes any tokens in the universal set or in the per-language set for the given lang from the input slice. Empty lang means universal-only filtering. The input slice is NOT mutated.

Preserves the order of surviving tokens. Returns a freshly allocated slice (safe for the caller to hold).
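The contract above (universal set plus optional per-language set, order preserved, input untouched) can be sketched with a local type that mirrors two of the StopWords fields. The lowercase names are hypothetical and the sample words are made up for the example.

```go
package main

import "fmt"

// stopWords is a local stand-in with the same shape as the Universal
// and ByLang fields of StopWords.
type stopWords struct {
	Universal map[string]bool
	ByLang    map[string]map[string]bool
}

// filter copies surviving tokens into a new slice, keeping their order.
func (s *stopWords) filter(tokens []string, lang string) []string {
	out := make([]string, 0, len(tokens))
	langSet := s.ByLang[lang] // nil for "" or unknown languages; lookups on a nil map are safe
	for _, t := range tokens {
		if s.Universal[t] || langSet[t] {
			continue
		}
		out = append(out, t)
	}
	return out
}

func main() {
	sw := &stopWords{
		Universal: map[string]bool{"get": true},
		ByLang:    map[string]map[string]bool{"go": {"func": true}},
	}
	in := []string{"get", "func", "postgresql"}
	fmt.Println(sw.filter(in, "go")) // [postgresql]
	fmt.Println(sw.filter(in, ""))   // [func postgresql]
}
```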

func (*StopWords) IsCompound added in v1.9.0

func (s *StopWords) IsCompound(name string) bool

IsCompound returns true if the lowercased identifier is in the compound_identifiers list. Used by splitIdentifier to keep known compound names (jquery, github, mysql, ...) atomic instead of decomposing them into parts that pollute searches.

The splitCamelCase fix in v1.9.0 (min-3-uppercase rule) already handles most compound-name cases structurally, so this lookup is a belt-and-suspenders layer: it catches snake_case compounds (mysql_config → would split to ["mysql", "config"] without this, but we WANT "mysql" to stay atomic because splitting to "my"+"sql" or similar is worse) and any lowercase-only compounds that splitCamelCase wouldn't touch at all.

func (*StopWords) UniversalSize added in v1.9.0

func (s *StopWords) UniversalSize() int

UniversalSize returns the number of universal stop words. Used by the anchor test to assert the embedded file isn't obviously broken.

type Store

type Store struct {
	// contains filtered or unexported fields
}

Store manages the SQLite database for the code graph.

func NewStore

func NewStore(dbPath string) (*Store, error)

NewStore opens (or creates) a SQLite database at the given path and initializes the schema.

func (*Store) AddEdge

func (s *Store) AddEdge(sourceID, targetID int64, kind EdgeKind) error

AddEdge records a directional relationship between two symbols.

func (*Store) Close

func (s *Store) Close() error

Close closes the underlying database connection.

func (*Store) DeleteFile

func (s *Store) DeleteFile(path string) error

DeleteFile removes a file record.

func (*Store) DeleteFileSymbols

func (s *Store) DeleteFileSymbols(file string) error

DeleteFileSymbols removes all symbols (and their edges) for a file.

func (*Store) FindAllFunctionsWithEdges added in v1.8.3

func (s *Store) FindAllFunctionsWithEdges() ([]FunctionEdgeInfo, error)

FindAllFunctionsWithEdges returns all functions/methods with their edge counts. Used by the unified code smell detector for single-pass analysis.

func (*Store) FindLazyRedirectCandidates added in v1.8.3

func (s *Store) FindLazyRedirectCandidates(includeTests bool) ([]LazyRedirectCandidate, error)

FindLazyRedirectCandidates returns functions/methods with few outgoing edges (0-2) that do not match known leaf patterns (constructors, getters, interface implementations). These are candidates for lazy redirect analysis via shingle/edge divergence.

func (*Store) FindStubs added in v1.8.3

func (s *Store) FindStubs(includeTests bool) ([]StubResult, error)

FindStubs returns functions/methods with zero outgoing call edges. These are likely stubs, placeholders, or dead code.
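The two edge-count criteria above (zero outgoing edges for stubs, 0-2 outgoing edges minus leaf patterns for lazy-redirect candidates) can be illustrated with two predicates. This is a rough sketch under stated assumptions: the fn type is hypothetical, and the name-prefix test in isLeafPattern is an invented stand-in, since the real leaf-pattern rules are not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// fn is a hypothetical row: a function name plus its outgoing call-edge count.
type fn struct {
	Name     string
	OutEdges int
}

// isLeafPattern is an ASSUMED heuristic standing in for "known leaf
// patterns"; the package's actual detection rules are not shown above.
func isLeafPattern(name string) bool {
	return strings.HasPrefix(name, "New") || strings.HasPrefix(name, "Get")
}

// isStub matches the FindStubs description: zero outgoing call edges.
func isStub(f fn) bool { return f.OutEdges == 0 }

// isLazyCandidate matches the FindLazyRedirectCandidates description:
// 0-2 outgoing edges and not a known leaf pattern.
func isLazyCandidate(f fn) bool {
	return f.OutEdges <= 2 && !isLeafPattern(f.Name)
}

func main() {
	for _, f := range []fn{{"todo", 0}, {"Forward", 1}, {"NewStore", 1}, {"Index", 7}} {
		fmt.Println(f.Name, isStub(f), isLazyCandidate(f))
	}
}
```

Note that the two sets overlap: a zero-edge function that is not a leaf pattern satisfies both predicates.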

func (*Store) GetAllFiles

func (s *Store) GetAllFiles() ([]FileRecord, error)

GetAllFiles returns all indexed file records.

func (*Store) GetAllMinHashes

func (s *Store) GetAllMinHashes() ([]MinHashEntry, error)

GetAllMinHashes retrieves all symbol IDs and their MinHash signatures for similarity search. Symbols without a signature are skipped.

func (*Store) GetEdgesFrom

func (s *Store) GetEdgesFrom(sourceID int64) ([]Edge, error)

GetEdgesFrom returns all outgoing edges from the given symbol.

func (*Store) GetEdgesTo

func (s *Store) GetEdgesTo(targetID int64) ([]Edge, error)

GetEdgesTo returns all incoming edges to the given symbol.

func (*Store) GetFile

func (s *Store) GetFile(path string) (*FileRecord, error)

GetFile retrieves a file record by path.

func (*Store) GetFileGraph added in v1.8.3

func (s *Store) GetFileGraph() ([]FileEdge, error)

GetFileGraph returns file-level connectivity data for visualization. Works for all languages — shows which files call into other files.
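One way to picture file-level connectivity is as an aggregation of symbol-level edges by the files that own each endpoint. The sketch below is an assumption about the shape of that aggregation, not the package's query: symbolEdge and fileEdge are hypothetical local types, and dropping same-file edges is a choice made here for illustration.

```go
package main

import "fmt"

// symbolEdge and fileEdge are hypothetical local types; the real
// FileEdge fields are not shown in this documentation.
type symbolEdge struct{ From, To int64 }
type fileEdge struct{ FromFile, ToFile string }

// fileGraph counts cross-file edges. fileOf maps a symbol ID to its file.
// Same-file edges are skipped here, which is an assumption of this sketch.
func fileGraph(fileOf map[int64]string, edges []symbolEdge) map[fileEdge]int {
	out := make(map[fileEdge]int)
	for _, e := range edges {
		src, okSrc := fileOf[e.From]
		dst, okDst := fileOf[e.To]
		if !okSrc || !okDst || src == dst {
			continue
		}
		out[fileEdge{FromFile: src, ToFile: dst}]++
	}
	return out
}

func main() {
	fileOf := map[int64]string{1: "store.go", 2: "store.go", 3: "search.go"}
	edges := []symbolEdge{{3, 1}, {3, 2}, {1, 2}}
	g := fileGraph(fileOf, edges)
	fmt.Println(g[fileEdge{"search.go", "store.go"}]) // 2
}
```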

func (*Store) GetIDFs added in v1.9.0

func (s *Store) GetIDFs(tokens []string) (map[string]float64, error)

GetIDFs reads IDF values for a set of tokens in one batched query. Returns a map containing only tokens that exist in token_stats — missing tokens contribute zero to BM25 scores.

func (*Store) GetMeta added in v1.9.0

func (s *Store) GetMeta(key string) ([]byte, error)

GetMeta reads a raw byte value from the meta key/value table. Returns (nil, nil) if the key is not present — callers should treat nil as "not set" and decide whether to generate and persist.

func (*Store) GetMinHash

func (s *Store) GetMinHash(symbolID int64) (MinHashSignature, error)

GetMinHash retrieves the MinHash signature for a symbol.

func (*Store) GetPackageGraph added in v1.8.3

func (s *Store) GetPackageGraph() ([]PackageInfo, []PackageEdge, error)

GetPackageGraph returns package-level connectivity data for visualization.

func (*Store) GetSymbol

func (s *Store) GetSymbol(id int64) (*Symbol, error)

GetSymbol retrieves a symbol by its ID.

func (*Store) GetSymbolIDByName added in v1.8.2

func (s *Store) GetSymbolIDByName(name string) (int64, bool)

GetSymbolIDByName returns the ID of a symbol by exact name match. If multiple symbols share the same name, returns the first found.

func (*Store) GetSymbolTokens added in v1.9.0

func (s *Store) GetSymbolTokens(symbolID int64) (map[string]int, int, error)

GetSymbolTokens reads the stored TF map for a single symbol. Used at query time to compute BM25 scores. Returns an empty map (not nil) if the symbol has no token rows, so callers can treat it as "zero contribution" without nil-checks.
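To show how a TF map and an IDF map might combine at query time, here is a minimal BM25 sketch. Several things are assumptions rather than facts about this package: the k1 = 1.2 and b = 0.75 constants are the textbook defaults, docLen is taken as an explicit parameter (the second return value of GetSymbolTokens is not documented above, so the sketch does not rely on it), and bm25Score is a hypothetical name.

```go
package main

import "fmt"

// Textbook BM25 parameters; the values used by the package are not documented here.
const (
	k1 = 1.2
	b  = 0.75
)

// bm25Score sums the per-token BM25 contribution for every query token.
// Tokens absent from idf (not in token_stats) or absent from tf (not in
// the symbol) contribute zero, matching the behavior described above.
func bm25Score(query []string, tf map[string]int, docLen int, idf map[string]float64, avgDocLen float64) float64 {
	if avgDocLen <= 0 {
		return 0
	}
	// Length normalization: longer-than-average symbols are damped.
	norm := k1 * (1 - b + b*float64(docLen)/avgDocLen)
	var score float64
	for _, tok := range query {
		weight, known := idf[tok]
		freq := float64(tf[tok])
		if !known || freq == 0 {
			continue
		}
		score += weight * freq * (k1 + 1) / (freq + norm)
	}
	return score
}

func main() {
	idf := map[string]float64{"postgresql": 2.0, "get": 0.1}
	tf := map[string]int{"postgresql": 1, "get": 1}
	fmt.Println(bm25Score([]string{"get", "postgresql", "missing"}, tf, 4, idf, 4))
}
```

Because the rare token carries most of the IDF weight, it dominates the sum even though both tokens match, which is the noise-suppression effect the package overview describes.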

func (*Store) GetSymbolsByFile

func (s *Store) GetSymbolsByFile(file string) ([]Symbol, error)

GetSymbolsByFile returns all symbols in the given file.

func (*Store) GetSymbolsByPackage

func (s *Store) GetSymbolsByPackage(pkg string) ([]Symbol, error)

GetSymbolsByPackage returns all symbols in the given package.

func (*Store) ReadBM25Stats added in v1.9.0

func (s *Store) ReadBM25Stats() (*BM25CorpusStats, error)

ReadBM25Stats reads the cached corpus-wide BM25 stats. Returns (nil, nil) if the meta row is absent (fresh index or pre-BM25 index).

func (*Store) RebuildTokenStats added in v1.9.0

func (s *Store) RebuildTokenStats() (*BM25CorpusStats, error)

RebuildTokenStats walks the entire symbol_tokens table and computes df + idf for every token. Replaces the contents of token_stats atomically (delete-all + insert) so re-runs produce a consistent state. Called at the end of Build() and Update() — cheap compared to the full indexing pass because it's just aggregation over rows we just wrote.

Also computes the corpus-wide NumDocs + AvgDocLength stats and persists them to the meta table so query time can read them in a single lookup instead of COUNT(DISTINCT symbol_id) and AVG() scans.
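The aggregation described above can be sketched in memory. This is an illustration only: the real method works against SQLite tables, the local tokenStat and corpusStats types merely echo the documented field names, and the IDF formula used here (the common smoothed form ln(1 + (N - df + 0.5) / (df + 0.5))) is an assumption, since the package's exact formula is not stated.

```go
package main

import (
	"fmt"
	"math"
)

// Local stand-ins echoing the documented names; not the package's types.
type tokenStat struct {
	DF  int
	IDF float64
}

type corpusStats struct {
	NumDocs      int
	AvgDocLength float64
}

// rebuildTokenStats takes symbolID -> (token -> TF) and derives df, idf,
// and corpus-wide stats in a single pass over the rows.
func rebuildTokenStats(docs map[int64]map[string]int) (map[string]tokenStat, corpusStats) {
	df := make(map[string]int)
	totalLen := 0
	for _, tf := range docs {
		for tok, count := range tf {
			df[tok]++ // one symbol containing tok
			totalLen += count
		}
	}
	corpus := corpusStats{NumDocs: len(docs)}
	if corpus.NumDocs > 0 {
		corpus.AvgDocLength = float64(totalLen) / float64(corpus.NumDocs)
	}
	stats := make(map[string]tokenStat, len(df))
	n := float64(corpus.NumDocs)
	for tok, d := range df {
		// ASSUMED smoothed IDF; always positive, lower for common tokens.
		idf := math.Log(1 + (n-float64(d)+0.5)/(float64(d)+0.5))
		stats[tok] = tokenStat{DF: d, IDF: idf}
	}
	return stats, corpus
}

func main() {
	docs := map[int64]map[string]int{
		1: {"get": 1, "pool": 1},
		2: {"get": 1},
		3: {"get": 1, "postgresql": 1},
	}
	stats, corpus := rebuildTokenStats(docs)
	fmt.Printf("docs=%d avgLen=%.2f\n", corpus.NumDocs, corpus.AvgDocLength)
	fmt.Printf("get idf=%.3f postgresql idf=%.3f\n", stats["get"].IDF, stats["postgresql"].IDF)
}
```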

func (*Store) SearchSymbolsByName

func (s *Store) SearchSymbolsByName(query string) ([]Symbol, error)

SearchSymbolsByName returns symbols whose name contains the query (case-insensitive).

func (*Store) SetMeta added in v1.9.0

func (s *Store) SetMeta(key string, value []byte) error

SetMeta writes a raw byte value to the meta key/value table. Upserts on conflict so the caller can treat this as idempotent.
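Together with GetMeta's (nil, nil) convention for absent keys, the upsert behavior supports a simple get-or-create pattern. The sketch below assumes a hypothetical metaStore interface covering the two documented method signatures and uses an in-memory map so it runs without SQLite; getOrCreate and the "seed" key are illustrative names only.

```go
package main

import "fmt"

// metaStore is a hypothetical interface matching the two documented methods.
type metaStore interface {
	GetMeta(key string) ([]byte, error)
	SetMeta(key string, value []byte) error
}

// memStore is an in-memory stand-in for the SQLite-backed Store.
type memStore map[string][]byte

func (m memStore) GetMeta(key string) ([]byte, error) {
	return m[key], nil // nil slice when the key is absent
}

func (m memStore) SetMeta(key string, value []byte) error {
	m[key] = value // overwrite on conflict, like an upsert
	return nil
}

// getOrCreate returns the stored value, generating and persisting it
// only when GetMeta reports "not set" via a nil value.
func getOrCreate(s metaStore, key string, generate func() ([]byte, error)) ([]byte, error) {
	val, err := s.GetMeta(key)
	if err != nil {
		return nil, err
	}
	if val != nil {
		return val, nil
	}
	val, err = generate()
	if err != nil {
		return nil, err
	}
	if err := s.SetMeta(key, val); err != nil {
		return nil, err
	}
	return val, nil
}

func main() {
	store := memStore{}
	gen := func() ([]byte, error) { return []byte("generated-once"), nil }
	first, _ := getOrCreate(store, "seed", gen)
	second, _ := getOrCreate(store, "seed", gen)
	fmt.Println(string(first), string(second))
}
```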

func (*Store) Stats

func (s *Store) Stats() (*StoreStats, error)

Stats returns aggregate counts for the indexed codebase.

func (*Store) UpdateMinHash

func (s *Store) UpdateMinHash(symbolID int64, sig MinHashSignature) error

UpdateMinHash stores the MinHash signature for a symbol.

func (*Store) UpsertFile

func (s *Store) UpsertFile(f FileRecord) error

UpsertFile inserts or updates a file record.

func (*Store) UpsertSymbol

func (s *Store) UpsertSymbol(sym Symbol) (int64, error)

UpsertSymbol inserts or updates a symbol. Uniqueness is determined by (name, kind, package, file). Returns the row ID.

func (*Store) UpsertSymbolTokens added in v1.9.0

func (s *Store) UpsertSymbolTokens(symbolID int64, tokens []string) error

UpsertSymbolTokens writes the per-symbol token frequencies for a given symbol. Called from indexFile after the shingles are computed so we have the raw frequencies before deduplication collapses them to 1-per-token.

tokens is passed as a slice (not a set) because we want TF counts: the same token appearing twice in the shingle stream should count 2. Celeste's current shingle pipeline dedupes, so TF is always 1 in practice, but we preserve the more general API for future extractor improvements that might count frequency more accurately.
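The reason for accepting a slice rather than a set can be seen in a small term-frequency sketch. The termFrequencies helper is hypothetical and only illustrates the counting step; how the package persists the result is not shown here.

```go
package main

import "fmt"

// termFrequencies collapses a token stream into token -> count.
// Passing a set instead of a slice would lose the repeat counts.
func termFrequencies(tokens []string) map[string]int {
	tf := make(map[string]int, len(tokens))
	for _, tok := range tokens {
		tf[tok]++
	}
	return tf
}

func main() {
	fmt.Println(termFrequencies([]string{"pool", "postgresql", "pool"}))
	// map[pool:2 postgresql:1]
}
```

With a deduplicating shingle pipeline, as described above, every count would be 1, but the same function handles a non-deduplicated stream without changes.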

type StoreStats

type StoreStats struct {
	TotalSymbols  int
	TotalEdges    int
	TotalFiles    int
	SymbolsByKind map[SymbolKind]int
	FilesByLang   map[string]int
}

StoreStats holds aggregate counts for the indexed codebase.

type StructuralReranker added in v1.9.0

type StructuralReranker struct {
	// MatchedTokenWeight scales the matched-token-ratio contribution.
	// Default 1.0 — a full-match symbol gets +1.0 added to its base
	// score, which is significant relative to the typical Jaccard
	// range of 0.1-0.2 but doesn't trivially override BM25.
	MatchedTokenWeight float64

	// EdgeDensityWeight scales the log-normalized edge count
	// contribution. Default 0.3 — mild boost; edge count alone
	// shouldn't overwhelm real textual relevance.
	EdgeDensityWeight float64

	// KindBoostFunction is the additive weight for function / method
	// symbols. Default 0.15 — small but enough to break ties in favor
	// of actual implementations over type aliases.
	KindBoostFunction float64

	// ZeroEdgePenalty is the additive weight (usually negative) for
	// function/method symbols with zero edges. Default -0.25 — pushes
	// likely-dead-code below real matches without entirely removing it.
	ZeroEdgePenalty float64
}

StructuralReranker is the default Reranker shipped in v1.9.0. It scores each candidate using a weighted combination of features that the RRF fusion can't see:

  • MatchedTokenRatio: fraction of query tokens that appear in the symbol's filtered shingle set. A symbol matching 4/4 query tokens should rank above one matching 1/4 even if the Jaccard estimator happens to put them at similar percentiles.

  • EdgeDensity: log-normalized edge count. Well-connected symbols are more likely to be the "real" implementation of a feature than zero-edge stub interfaces. Capped logarithmically so that a symbol with 200 edges doesn't dwarf one with 20.

  • KindWeight: function and method symbols get a small boost over type/interface declarations for implementation-hunting queries. Tuned on the SPEC §5.1 benchmark queries which all target "find me the code that does X" rather than "find me the type definition for X".

  • ZeroEdgePenalty: a symbol with zero edges and kind in {function, method} is either dead code or a parser limitation. Push it below other candidates that actually have connectivity.

All features are normalized to [0,1]-ish ranges before the weighted sum. Weights are exposed on the struct so callers can A/B tune without recompiling; the zero value uses sensible defaults picked by hand-inspection on the Task 23 content-control benchmark.
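The four features listed above can be combined in a weighted sum along the following lines. This is a sketch under assumptions, not the shipped scoring code: the candidate type and structuralScore function are hypothetical, the edge cap of 100 used for log normalization is invented for illustration, and only the default weight values (1.0, 0.3, 0.15, -0.25) come from the struct comments.

```go
package main

import (
	"fmt"
	"math"
)

// weights echoes the four documented fields and their documented defaults.
type weights struct {
	MatchedTokenWeight float64
	EdgeDensityWeight  float64
	KindBoostFunction  float64
	ZeroEdgePenalty    float64
}

func defaultWeights() weights {
	return weights{MatchedTokenWeight: 1.0, EdgeDensityWeight: 0.3, KindBoostFunction: 0.15, ZeroEdgePenalty: -0.25}
}

// candidate is a hypothetical feature bundle for one search result.
type candidate struct {
	Base          float64 // upstream score the features are added to
	MatchedTokens int     // query tokens found in the symbol's shingle set
	Edges         int     // total edge count
	IsCallable    bool    // function or method
}

// structuralScore adds the weighted features to the base score.
func structuralScore(w weights, c candidate, queryTokenCount int) float64 {
	score := c.Base
	if queryTokenCount > 0 {
		ratio := float64(c.MatchedTokens) / float64(queryTokenCount)
		score += w.MatchedTokenWeight * ratio
	}
	// ASSUMED normalization: log1p scaled against a cap of 100 edges.
	const edgeCap = 100.0
	density := math.Min(1, math.Log1p(float64(c.Edges))/math.Log1p(edgeCap))
	score += w.EdgeDensityWeight * density
	if c.IsCallable {
		score += w.KindBoostFunction
		if c.Edges == 0 {
			score += w.ZeroEdgePenalty // usually negative
		}
	}
	return score
}

func main() {
	w := defaultWeights()
	full := candidate{Base: 0.15, MatchedTokens: 4, Edges: 20, IsCallable: true}
	stub := candidate{Base: 0.15, MatchedTokens: 1, Edges: 0, IsCallable: true}
	fmt.Printf("full=%.3f stub=%.3f\n", structuralScore(w, full, 4), structuralScore(w, stub, 4))
}
```

With these defaults a zero-edge function nets -0.10 from the kind boost and penalty combined, so it ranks below an otherwise identical type declaration.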

func NewStructuralReranker added in v1.9.0

func NewStructuralReranker() *StructuralReranker

NewStructuralReranker returns a reranker with the default weights picked by hand-inspection on the content-control benchmark. Callers that want to experiment can construct StructuralReranker{} directly with custom weights instead.

func (*StructuralReranker) Rerank added in v1.9.0

func (r *StructuralReranker) Rerank(results []SearchResult, queryTokenCount int) []SearchResult

Rerank applies the structural rescore to results and returns a new ordering. The original Similarity / BM25Score fields on each SearchResult are preserved; only the slice order changes. Callers can audit the rerank by comparing the old order to the new one.

Ties (exact equal structural scores) are broken by the incoming order so the rerank is stable relative to the fused ranking. This matters because the fused ranking already encodes meaningful signal — we're enhancing it, not replacing it, and ties should fall back to "trust the upstream signal".
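The tie-breaking guarantee described above is what a stable sort provides. The following sketch uses sort.SliceStable from the standard library; the result type and its unexported structural field are hypothetical stand-ins, since the package's internal score storage is not documented.

```go
package main

import (
	"fmt"
	"sort"
)

// result is a hypothetical stand-in for SearchResult plus a computed
// structural score. Similarity and BM25Score are carried through untouched.
type result struct {
	Name       string
	Similarity float64
	BM25Score  float64
	structural float64
}

// rerank returns a reordered copy; equal structural scores keep their
// incoming (fused-ranking) order because the sort is stable.
func rerank(results []result) []result {
	out := make([]result, len(results))
	copy(out, results)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].structural > out[j].structural
	})
	return out
}

func main() {
	in := []result{
		{Name: "first", structural: 1.0},
		{Name: "best", structural: 2.0},
		{Name: "second", structural: 1.0},
	}
	for _, r := range rerank(in) {
		fmt.Println(r.Name)
	}
	// best
	// first
	// second
}
```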

type StubResult added in v1.8.3

type StubResult struct {
	Name     string
	File     string
	Line     int
	Kind     string
	OutEdges int
	InEdges  int
}

StubResult represents a function/method with zero outgoing call edges.

type Symbol

type Symbol struct {
	ID        int64
	Name      string
	Kind      SymbolKind
	Package   string
	File      string
	Line      int
	Signature string
}

Symbol represents a code entity (function, type, interface, etc.).

type SymbolKind

type SymbolKind string

SymbolKind identifies the kind of code symbol.

const (
	SymbolFunction  SymbolKind = "function"
	SymbolMethod    SymbolKind = "method"
	SymbolType      SymbolKind = "type"
	SymbolInterface SymbolKind = "interface"
	SymbolConst     SymbolKind = "const"
	SymbolVar       SymbolKind = "var"
	SymbolStruct    SymbolKind = "struct"
	SymbolImport    SymbolKind = "import"
	SymbolClass     SymbolKind = "class"
)

type TSParser added in v1.9.0

type TSParser struct {
	// contains filtered or unexported fields
}

TSParser parses TypeScript and TSX source files using tree-sitter. One parser holds both language pointers — selection is per-file by extension. The underlying tree_sitter.Parser is re-used across files (Parse() resets the internal state) so allocation stays cheap.

func NewTSParser added in v1.9.0

func NewTSParser() *TSParser

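NewTSParser initializes the tree-sitter parser with the TypeScript and TSX grammars loaded. The signature shown above returns only *TSParser and no error value, so a grammar wiring failure is not reported through a return value; such a failure is not expected in practice because the grammars are statically linked.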

func (*TSParser) Close added in v1.9.0

func (p *TSParser) Close()

Close releases the tree-sitter parser's native resources.

func (*TSParser) ParseFile added in v1.9.0

func (p *TSParser) ParseFile(path string) (*ParseResult, error)

ParseFile reads a .ts or .tsx file and returns extracted symbols + edges. Uses the TSX grammar for .tsx files and the plain TypeScript grammar for everything else.
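The per-file grammar choice described above amounts to a check on the file extension. A minimal sketch, assuming an exact, case-sensitive match on ".tsx" (the package's actual matching rules are not documented) and a hypothetical grammarFor helper that returns a label rather than a tree-sitter language pointer:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// grammarFor picks the TSX grammar for .tsx files and the plain
// TypeScript grammar for everything else, as the doc comment states.
func grammarFor(path string) string {
	if filepath.Ext(path) == ".tsx" {
		return "tsx"
	}
	return "typescript"
}

func main() {
	fmt.Println(grammarFor("src/App.tsx"))  // tsx
	fmt.Println(grammarFor("src/store.ts")) // typescript
}
```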

type TokenStat added in v1.9.0

type TokenStat struct {
	Token string
	DF    int
	IDF   float64
}

TokenStat is a per-token corpus statistic: document frequency (how many symbols contain this token) and precomputed IDF.
