eval

package
v0.0.0-...-78728ec
Published: Jun 16, 2026 License: AGPL-3.0 Imports: 12 Imported by: 0

Documentation

Overview

Package eval scores ckv against a known-query fixture. It produces recall@k, MRR, and citation-accuracy metrics so model/chunker changes can be detected as regressions.

Fixture format (testdata/queries.yaml):

schema_version: "1"
queries:
  - id: q1
    intent: "..."
    expected:
      file: server.go
      symbol: Server.Listen
      kind: Method
      line_range: [22, 29]

A query passes when at least one hit in top-K (default K=5) cites the expected file and the hit's line range overlaps the expected one.
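
The pass criterion above can be sketched as a small predicate. This is an illustration only: the `hit` type and the `overlaps` and `passes` helpers are hypothetical names, not part of package eval, and the fallback to 5 simply mirrors the documented default.

```go
package main

import "fmt"

// hit is a hypothetical stand-in for one retrieval result; the real hit
// type lives in the query package and is not shown on this page.
type hit struct {
	File      string
	LineRange [2]int // [start, end], inclusive
}

// overlaps reports whether two inclusive line ranges share at least one line.
func overlaps(a, b [2]int) bool {
	return a[0] <= b[1] && b[0] <= a[1]
}

// passes reports whether any of the first k hits cites wantFile with a
// line range that overlaps want. A non-positive k falls back to 5,
// mirroring DefaultK.
func passes(hits []hit, k int, wantFile string, want [2]int) bool {
	if k <= 0 {
		k = 5
	}
	if k > len(hits) {
		k = len(hits)
	}
	for _, h := range hits[:k] {
		if h.File == wantFile && overlaps(h.LineRange, want) {
			return true
		}
	}
	return false
}

func main() {
	hits := []hit{
		{File: "client.go", LineRange: [2]int{10, 40}},
		{File: "server.go", LineRange: [2]int{28, 35}},
	}
	// The second hit cites server.go and shares lines 28-29 with [22, 29].
	fmt.Println(passes(hits, 5, "server.go", [2]int{22, 29})) // true
}
```

Because both ranges are inclusive, a hit ending on line 22 or starting on line 29 still counts as overlapping [22, 29].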

"Why-queries" fixtures (testdata/why-queries.yaml) share the same schema with two extra fields per entry:

  • pending: true — the corpus does not yet index the answer (e.g. docs corpus or PR/commit corpus not built). Loader allows missing/zero line_range for these entries. They still execute in Run() — they just typically miss until the corpus catches up.
  • expected_kind: pr_summary | commit_message | doc_section — the chunk type that should answer the query (informational; the scorer does not enforce it).

Constants

const DefaultK = 5

DefaultK is the top-K used by Run when Options.K is zero.

const FixtureSchemaVersion = "1"

FixtureSchemaVersion is the version this binary writes/expects. Bump on breaking changes; consumers should refuse loading newer majors and warn on newer minors.
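
One way a consumer might apply the refuse-on-newer-major, warn-on-newer-minor policy is sketched below. The `parseVersion` and `checkVersion` helpers are hypothetical, and the dotted "MAJOR.MINOR" form is an assumption: the fixtures shown on this page only use "1". Older versions are accepted silently here, which is also an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "MAJOR" or "MAJOR.MINOR" into integers. The dotted
// form is an assumption; the fixtures on this page only show "1".
func parseVersion(s string) (major, minor int, err error) {
	majStr, minStr, hasMinor := strings.Cut(s, ".")
	if major, err = strconv.Atoi(majStr); err != nil {
		return 0, 0, fmt.Errorf("bad major in %q: %w", s, err)
	}
	if hasMinor {
		if minor, err = strconv.Atoi(minStr); err != nil {
			return 0, 0, fmt.Errorf("bad minor in %q: %w", s, err)
		}
	}
	return major, minor, nil
}

// checkVersion compares a fixture's schema_version against the version
// the binary supports. It returns an error for a newer major and a
// non-empty warning for a newer minor within the same major.
func checkVersion(got, supported string) (warning string, err error) {
	gMaj, gMin, err := parseVersion(got)
	if err != nil {
		return "", err
	}
	sMaj, sMin, err := parseVersion(supported)
	if err != nil {
		return "", err
	}
	switch {
	case gMaj > sMaj:
		return "", fmt.Errorf("fixture schema %s is newer than supported %s", got, supported)
	case gMaj == sMaj && gMin > sMin:
		return fmt.Sprintf("fixture schema %s has a newer minor than %s", got, supported), nil
	}
	return "", nil
}

func main() {
	for _, v := range []string{"1", "1.3", "2"} {
		warning, err := checkVersion(v, "1")
		fmt.Printf("%s: warning=%q err=%v\n", v, warning, err)
	}
}
```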

Variables

This section is empty.

Functions

func RecordSession

func RecordSession(ctx context.Context, eng *query.Engine, fixturePath string, opts RecordOptions, in io.Reader, out io.Writer) error

RecordSession runs an interactive loop: the user types a query intent, sees top-K results, selects which are correct, and the entry is appended to the fixture file at fixturePath. The loop continues until the user sends an empty line or EOF.

in/out are separated from os.Stdin/os.Stdout for testability.
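
The benefit of passing in/out explicitly can be illustrated with a reduced loop. This is a sketch under stated assumptions, not RecordSession itself: `readIntents` and `runOn` are hypothetical helpers that only collect intents, with no engine call and no fixture write.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// readIntents is a hypothetical helper, not part of package eval. It
// prompts on out and collects one intent per line from in, stopping at
// the first empty line or at EOF.
func readIntents(in io.Reader, out io.Writer) ([]string, error) {
	var intents []string
	sc := bufio.NewScanner(in)
	for {
		fmt.Fprint(out, "intent> ")
		if !sc.Scan() {
			break // EOF or read error; sc.Err() distinguishes them
		}
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			break
		}
		intents = append(intents, line)
	}
	return intents, sc.Err()
}

// runOn drives readIntents from a string and discards the prompts, the
// way a unit test would.
func runOn(input string) ([]string, error) {
	return readIntents(strings.NewReader(input), io.Discard)
}

func main() {
	intents, err := runOn("where does the server start listening\nhow are chunks scored\n\nnever read\n")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(len(intents), "intents recorded") // 2 intents recorded
}
```

A command-line caller would pass os.Stdin and os.Stdout instead of the string reader and io.Discard.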

Types

type Aggregate

type Aggregate struct {
	Total            int     `json:"total"`
	Found            int     `json:"found"` // queries with ≥1 correct hit in top-K
	RecallAt1        float64 `json:"recall_at_1"`
	RecallAt3        float64 `json:"recall_at_3"`
	RecallAt5        float64 `json:"recall_at_5"`
	MRR              float64 `json:"mrr"`
	CitationAccuracy float64 `json:"citation_accuracy"` // mean(citation_correct over queries that found a hit)
	// HallucinationRate is the fraction of returned hits across all
	// queries whose snippet did not align with the source file. 0
	// means perfect — every snippet appears at the cited location.
	// Populated only when Options.SrcRoot is set.
	HallucinationRate float64 `json:"hallucination_rate,omitempty"`
	HallucinationHits int     `json:"hallucination_hits,omitempty"` // numerator
	TotalHits         int     `json:"total_hits,omitempty"`         // denominator (returned hits, not queries)
}

Aggregate is the corpus-level summary across all queries.
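
The two ratio fields can be read off the struct comments as follows. This is a minimal sketch, assuming that TotalHits is the sum of per-query HitsReturned and that HallucinationHits is the sum of per-query HallucinationCount; the `perQuery` type and both helpers are local, hypothetical reductions of the real types.

```go
package main

import "fmt"

// perQuery keeps only the PerQuery fields this sketch needs.
type perQuery struct {
	FoundRank          int // 1-based; 0 means no correct hit
	CitationCorrect    bool
	HitsReturned       int
	HallucinationCount int
}

// citationAccuracy is the mean of CitationCorrect over queries that
// found a hit. It returns 0 when no query found one.
func citationAccuracy(perQ []perQuery) float64 {
	found, correct := 0, 0
	for _, q := range perQ {
		if q.FoundRank == 0 {
			continue
		}
		found++
		if q.CitationCorrect {
			correct++
		}
	}
	if found == 0 {
		return 0
	}
	return float64(correct) / float64(found)
}

// hallucinationRate returns bad/total over returned hits (not queries),
// together with the numerator and denominator.
func hallucinationRate(perQ []perQuery) (rate float64, bad, total int) {
	for _, q := range perQ {
		bad += q.HallucinationCount
		total += q.HitsReturned
	}
	if total == 0 {
		return 0, bad, total
	}
	return float64(bad) / float64(total), bad, total
}

func main() {
	perQ := []perQuery{
		{FoundRank: 1, CitationCorrect: true, HitsReturned: 5},
		{FoundRank: 2, CitationCorrect: false, HitsReturned: 5, HallucinationCount: 1},
		{FoundRank: 0, HitsReturned: 4, HallucinationCount: 2},
	}
	rate, bad, total := hallucinationRate(perQ)
	fmt.Printf("citation_accuracy=%.2f hallucination_rate=%.3f (%d/%d)\n",
		citationAccuracy(perQ), rate, bad, total)
}
```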

func Summarize

func Summarize(perQ []PerQuery) Aggregate

Summarize computes corpus-level metrics from per-query scores. Recall@1/3/5 are derived from FoundRank, so no separate K argument is needed. Hallucination metrics are populated only when per-query HallucinationCount values are present (Score was called with a non-empty srcRoot).
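
Deriving recall@k and MRR from FoundRank alone can be sketched as below. The `perQuery` type, `recallAt`, and `mrr` are hypothetical local names, and dividing by the total number of queries (misses included) is an assumption based on the field descriptions on this page.

```go
package main

import "fmt"

// perQuery keeps only the field needed here: the 1-based rank of the
// first correct hit, or 0 when there was none.
type perQuery struct {
	FoundRank int
}

// recallAt is the fraction of queries whose first correct hit has rank <= k.
func recallAt(perQ []perQuery, k int) float64 {
	if len(perQ) == 0 {
		return 0
	}
	n := 0
	for _, q := range perQ {
		if q.FoundRank >= 1 && q.FoundRank <= k {
			n++
		}
	}
	return float64(n) / float64(len(perQ))
}

// mrr is the mean of 1/FoundRank over all queries, counting a miss as 0.
func mrr(perQ []perQuery) float64 {
	if len(perQ) == 0 {
		return 0
	}
	sum := 0.0
	for _, q := range perQ {
		if q.FoundRank > 0 {
			sum += 1 / float64(q.FoundRank)
		}
	}
	return sum / float64(len(perQ))
}

func main() {
	perQ := []perQuery{{FoundRank: 1}, {FoundRank: 3}, {FoundRank: 0}, {FoundRank: 2}}
	// recall@1 = 1/4, recall@3 = 3/4, MRR = (1 + 1/3 + 0 + 1/2) / 4
	fmt.Printf("%.3f %.3f %.3f\n", recallAt(perQ, 1), recallAt(perQ, 3), mrr(perQ))
	// prints: 0.250 0.750 0.458
}
```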

type Expected

type Expected struct {
	File      string           `yaml:"file"`
	Symbol    string           `yaml:"symbol,omitempty"`
	Kind      types.SymbolKind `yaml:"kind,omitempty"`
	LineRange [2]int           `yaml:"line_range"`
	// Section is an optional human-readable anchor inside File (e.g. a
	// markdown heading like "§4 Vector store — decision matrix") used by
	// why-queries fixtures. Purely informational — Score() does not match
	// on it.
	Section string `yaml:"section,omitempty"`
	// ExpectedKind hints which chunk_kind should answer the query.
	// Values: pr_summary | commit_message | doc_section. The scorer
	// does not enforce it; eval may use it to filter retrieval by kind.
	ExpectedKind string `yaml:"expected_kind,omitempty"`
}

Expected describes the correct retrieval target. LineRange is [start, end] inclusive; the scorer treats any hit overlapping this range (and matching File) as correct.

type Fixture

type Fixture struct {
	SchemaVersion string  `yaml:"schema_version"`
	Queries       []Query `yaml:"queries"`
}

Fixture is the parsed top-level document.

func LoadFixture

func LoadFixture(path string) (*Fixture, error)

LoadFixture reads and validates a YAML fixture from path. Validation checks: non-empty schema_version, every query has id + intent + expected.file + a sane line_range.
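
The validation rules listed above might look like the following once the YAML has been decoded. This is a sketch, not the loader's code: the local types are reduced copies, `validate` is a hypothetical name, and "sane" is assumed to mean start >= 1 and end >= start. Pending entries may leave line_range at zero, as described in the overview.

```go
package main

import (
	"errors"
	"fmt"
)

type expected struct {
	File      string
	LineRange [2]int // [start, end], inclusive
}

type query struct {
	ID, Intent string
	Expected   expected
	Pending    bool
}

type fixture struct {
	SchemaVersion string
	Queries       []query
}

// validate applies the documented checks to an already-decoded fixture.
func validate(fx fixture) error {
	if fx.SchemaVersion == "" {
		return errors.New("schema_version is empty")
	}
	for i, q := range fx.Queries {
		switch {
		case q.ID == "":
			return fmt.Errorf("queries[%d]: missing id", i)
		case q.Intent == "":
			return fmt.Errorf("queries[%d] (%s): missing intent", i, q.ID)
		case q.Expected.File == "":
			return fmt.Errorf("queries[%d] (%s): missing expected.file", i, q.ID)
		}
		lr := q.Expected.LineRange
		if q.Pending && lr[0] == 0 && lr[1] == 0 {
			continue // pending entries may omit line_range
		}
		// Assumed meaning of "sane": 1-based start, end not before start.
		if lr[0] < 1 || lr[1] < lr[0] {
			return fmt.Errorf("queries[%d] (%s): bad line_range %v", i, q.ID, lr)
		}
	}
	return nil
}

func main() {
	fx := fixture{SchemaVersion: "1", Queries: []query{
		{ID: "q1", Intent: "...", Expected: expected{File: "server.go", LineRange: [2]int{22, 29}}},
		{ID: "w1", Intent: "...", Expected: expected{File: "docs/design.md"}, Pending: true},
	}}
	fmt.Println(validate(fx)) // <nil>
}
```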

type Options

type Options struct {
	K         int     // top-K for recall counting (default 5)
	Threshold float64 // pass-through to query.Options.Threshold; <0 disables
	SrcRoot   string  // pass-through for citation enforcement; empty → manifest default
	// EnableBM25Rerank toggles BM25 candidate-rerank on the eval pass.
	// Defaults false so existing baselines are preserved by default.
	// Both A and B legs of an A/B comparison use the same fixture +
	// engine, only this flag differs.
	EnableBM25Rerank bool
}

Options control one eval pass.

type PerQuery

type PerQuery struct {
	QueryID         string  `json:"query_id"`
	Intent          string  `json:"intent"`
	FoundRank       int     `json:"found_rank"` // 1-based rank of first correct hit; 0 if absent
	HitsReturned    int     `json:"hits_returned"`
	TopHitFile      string  `json:"top_hit_file"`
	TopHitScore     float64 `json:"top_hit_score"`
	ReciprocalRank  float64 `json:"reciprocal_rank"`  // 1/found_rank, or 0
	CitationCorrect bool    `json:"citation_correct"` // hits-with-matching-file have valid file+line in expected range
	// Hallucination metrics. Populated only when SrcRoot is set on the
	// eval Options — without a source tree we can't verify a hit's
	// snippet against actual file content. HallucinationCount is the
	// number of returned hits whose snippet did not survive VerifyHit
	// (file_missing / out_of_range / snippet_not_found).
	HallucinationCount  int    `json:"hallucination_count,omitempty"`
	HallucinationReason string `json:"hallucination_reason,omitempty"` // first non-empty reason across hits, for triage
}

PerQuery is the scoring result for one fixture entry.

func Score

func Score(q Query, resp *query.Response, k int, srcRoot string) PerQuery

Score compares one query's response against its expected target. k is the effective top-K used for recall counting. When srcRoot is non-empty, every returned hit is also verified against the source tree and the per-query hallucination_count is populated. Empty srcRoot leaves hallucination fields zero.
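
How FoundRank and ReciprocalRank relate to k can be shown with a reduced sketch. It assumes that only the first k hits are considered when looking for the first correct one, which this page implies but does not state outright; `foundRank` and `reciprocalRank` are hypothetical helpers that take a precomputed correctness flag per hit.

```go
package main

import "fmt"

// foundRank returns the 1-based position of the first correct hit
// within the first k entries, or 0 if none of them is correct.
func foundRank(correct []bool, k int) int {
	for i, ok := range correct {
		if i >= k {
			break
		}
		if ok {
			return i + 1
		}
	}
	return 0
}

// reciprocalRank is 1/rank for a found hit and 0 for a miss.
func reciprocalRank(rank int) float64 {
	if rank <= 0 {
		return 0
	}
	return 1 / float64(rank)
}

func main() {
	correct := []bool{false, false, true} // third hit is the first correct one

	r5 := foundRank(correct, 5)
	fmt.Printf("k=5: found_rank=%d reciprocal_rank=%.3f\n", r5, reciprocalRank(r5))
	// prints: k=5: found_rank=3 reciprocal_rank=0.333

	r2 := foundRank(correct, 2)
	fmt.Printf("k=2: found_rank=%d reciprocal_rank=%.3f\n", r2, reciprocalRank(r2))
	// prints: k=2: found_rank=0 reciprocal_rank=0.000
}
```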

type Query

type Query struct {
	ID       string   `yaml:"id"`
	Intent   string   `yaml:"intent"`
	Expected Expected `yaml:"expected"`
	Notes    string   `yaml:"notes,omitempty"`

	// Pending marks entries whose ground-truth corpus is not yet indexed
	// (e.g. docs/PR/commit corpora not yet built). The loader relaxes
	// line_range validation for these; Run() still executes them so misses
	// are counted in the aggregate, and Score() treats them like any other
	// query (typically reporting a miss until the corpus is indexed).
	Pending bool `yaml:"pending,omitempty"`

	RecordedVia string `yaml:"recorded_via,omitempty"`
	Timestamp   string `yaml:"timestamp,omitempty"`
}

Query is one ground-truth entry.

type RecordOptions

type RecordOptions struct {
	K                int
	Threshold        float64
	SrcRoot          string
	EnableBM25Rerank bool
}

RecordOptions controls one interactive record session.

type Result

type Result struct {
	Fixture   string     `json:"fixture"`
	K         int        `json:"k"`
	Aggregate Aggregate  `json:"aggregate"`
	PerQuery  []PerQuery `json:"per_query"`
}

Result is the full eval pass output.

func Run

func Run(ctx context.Context, eng *query.Engine, fx *Fixture, opts Options) (*Result, error)

Run executes every query in fx against eng and returns a Result. Errors during a single query are folded into PerQuery (FoundRank=0, HitsReturned=0) so one bad query doesn't abort the whole pass.
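
The error-folding behaviour can be pictured with a stand-in for the engine. Everything here is hypothetical: `searchFunc` replaces the real *query.Engine call, `failOnEmpty` is a toy backend, and correctness is reduced to a file-name match so the sketch stays short.

```go
package main

import (
	"errors"
	"fmt"
)

// row keeps the PerQuery fields relevant to error folding.
type row struct {
	QueryID      string
	FoundRank    int
	HitsReturned int
}

type query struct{ ID, Intent, WantFile string }

// searchFunc is a hypothetical stand-in for the engine call; it returns
// the cited file of each hit for one intent.
type searchFunc func(intent string) ([]string, error)

// runAll executes every query and never aborts on a per-query error: a
// failed query becomes a row with FoundRank=0 and HitsReturned=0.
func runAll(queries []query, search searchFunc) []row {
	rows := make([]row, 0, len(queries))
	for _, q := range queries {
		r := row{QueryID: q.ID}
		files, err := search(q.Intent)
		if err != nil {
			rows = append(rows, r) // zero-valued miss; keep going
			continue
		}
		r.HitsReturned = len(files)
		for i, f := range files {
			if f == q.WantFile {
				r.FoundRank = i + 1
				break
			}
		}
		rows = append(rows, r)
	}
	return rows
}

// failOnEmpty is a toy backend that errors on an empty intent.
func failOnEmpty(intent string) ([]string, error) {
	if intent == "" {
		return nil, errors.New("empty intent")
	}
	return []string{"client.go", "server.go"}, nil
}

func main() {
	rows := runAll([]query{
		{ID: "q1", Intent: "", WantFile: "server.go"},
		{ID: "q2", Intent: "where does the server listen", WantFile: "server.go"},
	}, failOnEmpty)
	for _, r := range rows {
		fmt.Printf("%s rank=%d hits=%d\n", r.QueryID, r.FoundRank, r.HitsReturned)
	}
	// prints:
	// q1 rank=0 hits=0
	// q2 rank=2 hits=2
}
```

Folding the failure into the row keeps the denominator of every aggregate metric equal to the fixture size, so an engine error shows up as a miss rather than silently shrinking the sample.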

Directories

Path Synopsis
Package prregress implements PR-based regression evaluation: given a merged PR, check out the world *before* it landed, build a ckv index over that snapshot, hand the PR's Background to an agent, and compare the agent's plan against what the PR actually did.
