Documentation
¶
Overview ¶
Package eval scores ckv against a known-query fixture. It produces recall@k, MRR, and citation-accuracy metrics so model/chunker changes can be detected as regressions.
Fixture format (testdata/queries.yaml):
schema_version: "1"
queries:
- id: q1
intent: "..."
expected:
file: server.go
symbol: Server.Listen
kind: Method
line_range: [22, 29]
A query passes when at least one hit in top-K (default K=5) cites the expected file and the hit's line range overlaps the expected one.
"Why-queries" fixtures (testdata/why-queries.yaml) share the same schema with two extra fields per entry:
- pending: true — the corpus does not yet index the answer (e.g. docs corpus or PR/commit corpus not built). Loader allows missing/zero line_range for these entries. They still execute in Run() — they just typically miss until the corpus catches up.
- expected_kind: pr_summary | commit_message | doc_section — the chunk type that should answer the query (informational; the scorer does not enforce it).
Index ¶
Constants ¶
const DefaultK = 5
DefaultK is the top-K used by Run when Options.K is zero.
const FixtureSchemaVersion = "1"
FixtureSchemaVersion is the version this binary writes/expects. Bump on breaking changes; consumers should refuse loading newer majors and warn on newer minors.
Variables ¶
This section is empty.
Functions ¶
func RecordSession ¶
func RecordSession(ctx context.Context, eng *query.Engine, fixturePath string, opts RecordOptions, in io.Reader, out io.Writer) error
RecordSession runs an interactive loop: the user types a query intent, sees top-K results, selects which are correct, and the entry is appended to the fixture file at fixturePath. The loop continues until the user sends an empty line or EOF.
in/out are separated from os.Stdin/os.Stdout for testability.
Types ¶
type Aggregate ¶
type Aggregate struct {
Total int `json:"total"`
Found int `json:"found"` // queries with ≥1 correct hit in top-K
RecallAt1 float64 `json:"recall_at_1"`
RecallAt3 float64 `json:"recall_at_3"`
RecallAt5 float64 `json:"recall_at_5"`
MRR float64 `json:"mrr"`
CitationAccuracy float64 `json:"citation_accuracy"` // mean(citation_correct over queries that found a hit)
// HallucinationRate is the fraction of returned hits across all
// queries whose snippet did not align with the source file. 0
// means perfect — every snippet appears at the cited location.
// Populated only when Options.SrcRoot is set.
HallucinationRate float64 `json:"hallucination_rate,omitempty"`
HallucinationHits int `json:"hallucination_hits,omitempty"` // numerator
TotalHits int `json:"total_hits,omitempty"` // denominator (returned hits, not queries)
}
Aggregate is the corpus-level summary across all queries.
type Expected ¶
type Expected struct {
File string `yaml:"file"`
Symbol string `yaml:"symbol,omitempty"`
Kind types.SymbolKind `yaml:"kind,omitempty"`
LineRange [2]int `yaml:"line_range"`
// Section is an optional human-readable anchor inside File (e.g. a
// markdown heading like "§4 Vector store — decision matrix") used by
// why-queries fixtures. Purely informational — Score() does not match
// on it.
Section string `yaml:"section,omitempty"`
// ExpectedKind hints which chunk_kind should answer the query.
// Values: pr_summary | commit_message | doc_section. Informational
// only — used by eval to filter retrieval by kind.
ExpectedKind string `yaml:"expected_kind,omitempty"`
}
Expected describes the correct retrieval target. LineRange is [start, end] inclusive; the scorer treats any hit overlapping this range (and matching File) as correct.
type Fixture ¶
type Fixture struct {
SchemaVersion string `yaml:"schema_version"`
Queries []Query `yaml:"queries"`
}
Fixture is the parsed top-level document.
func LoadFixture ¶
LoadFixture reads and validates a YAML fixture from path. Validation checks: non-empty schema_version, every query has id + intent + expected.file + a sane line_range.
type Options ¶
type Options struct {
K int // top-K for recall counting (default 5)
Threshold float64 // pass-through to query.Options.Threshold; <0 disables
SrcRoot string // pass-through for citation enforcement; empty → manifest default
// EnableBM25Rerank toggles BM25 candidate-rerank on the eval pass.
// Defaults false so existing baselines are preserved by default.
// Both A and B legs of an A/B comparison use the same fixture +
// engine, only this flag differs.
EnableBM25Rerank bool
}
Options control one eval pass.
type PerQuery ¶
type PerQuery struct {
QueryID string `json:"query_id"`
Intent string `json:"intent"`
FoundRank int `json:"found_rank"` // 1-based rank of first correct hit; 0 if absent
HitsReturned int `json:"hits_returned"`
TopHitFile string `json:"top_hit_file"`
TopHitScore float64 `json:"top_hit_score"`
ReciprocalRank float64 `json:"reciprocal_rank"` // 1/found_rank, or 0
CitationCorrect bool `json:"citation_correct"` // hits-with-matching-file have valid file+line in expected range
// Hallucination metrics. Populated only when SrcRoot is set on the
// eval Options — without a source tree we can't verify a hit's
// snippet against actual file content. HallucinationCount is the
// number of returned hits whose snippet did not survive VerifyHit
// (file_missing / out_of_range / snippet_not_found).
HallucinationCount int `json:"hallucination_count,omitempty"`
HallucinationReason string `json:"hallucination_reason,omitempty"` // first non-empty reason across hits, for triage
}
PerQuery is the scoring result for one fixture entry.
func Score ¶
Score compares one query's response against its expected target. k is the effective top-K used for recall counting. When srcRoot is non-empty, every returned hit is also verified against the source tree and the per-query hallucination_count is populated. Empty srcRoot leaves hallucination fields zero.
type Query ¶
type Query struct {
ID string `yaml:"id"`
Intent string `yaml:"intent"`
Expected Expected `yaml:"expected"`
Notes string `yaml:"notes,omitempty"`
// Pending marks entries whose ground-truth corpus is not yet indexed
// (e.g. docs/PR/commit corpora not yet built). The loader relaxes
// line_range validation for these; Run() still executes them so misses
// are counted in the aggregate, and Score() treats them like any other
// query (typically reporting a miss until the corpus is indexed).
Pending bool `yaml:"pending,omitempty"`
RecordedVia string `yaml:"recorded_via,omitempty"`
Timestamp string `yaml:"timestamp,omitempty"`
}
Query is one ground-truth entry.
type RecordOptions ¶
RecordOptions controls one interactive record session.
type Result ¶
type Result struct {
Fixture string `json:"fixture"`
K int `json:"k"`
Aggregate Aggregate `json:"aggregate"`
PerQuery []PerQuery `json:"per_query"`
}
Result is the full eval pass output.
Directories
¶
| Path | Synopsis |
|---|---|
|
Package prregress implements PR-based regression evaluation: given a merged PR, check out the world *before* it landed, build a ckv index over that snapshot, hand the PR's Background to an agent, and compare the agent's plan against what the PR actually did.
|
Package prregress implements PR-based regression evaluation: given a merged PR, check out the world *before* it landed, build a ckv index over that snapshot, hand the PR's Background to an agent, and compare the agent's plan against what the PR actually did. |