Overview ¶
Package eval is the LLM-based benchmark harness for CKG.
It compares graph-context (β/γ/δ baselines) against raw-file context (α baseline) to validate the project's value proposition. The rest of CKG (build/serve/mcp/export-static/export-postgres/audit) is LLM-free; eval is the only part that pulls in LLM SDKs.
EXTRACTION NOTE: this package is slated to move to a sister repo (e.g. code-knowledge-graph-eval). When that happens:
- Move: internal/eval/, cmd/ckg/eval.go, eval/tasks/*.yaml
- Imports: already use pkg/store (this repo's public read surface); the new module imports it as a regular Go module dep.
- Drop from this repo's go.mod: github.com/anthropics/anthropic-sdk-go and github.com/0xmhha/cli-wrapper
- CI: remove the CI=true skip in llm_cli_test.go::TestCLIClient_Complete_Smoke_ClaudeFallback once the new repo's CI runs the test reliably (the skip was added to dodge a cli-wrapper Manager-reuse race).
- Open TODO at extraction time: smartContext in runner.go duplicates the get_context_for_task MCP tool; extract to a shared package (e.g. pkg/contextcompose) so both eval and mcp share one implementation.
The paths and test names above (cmd/ckg/eval.go, eval/tasks/*.yaml, TestCLIClient_Complete_Smoke_ClaudeFallback, ...) reference concrete repo state. If you rename or move any of them before extraction lands, update this doc in the same commit so the checklist doesn't go stale.
Package eval runs the four-baseline measurement (spec §9). Each task is a YAML file; baselines differ only in the MCP tool allowlist (and α uses no tools at all).
Index ¶
- Variables
- func AllowedTools(b Baseline) []string
- func FilterHallucinations(text string, hallu HallucinationResult) (filtered string, warnings []string)
- func PrecisionRecall(got, want []string) (precision, recall float64)
- func RubricCheck(output string, rubric []string) (hits, total int)
- func SystemPrompt(b Baseline) string
- func WriteReport(path string, results []Result) error
- type APIClient
- type Baseline
- type CLIClient
- type CLIClientOptions
- type Citation
- type CitationResult
- type Expected
- type HallucinationResult
- type LLMClient
- type LLMResult
- type Result
- type Scoring
- type Task
Variables ¶
var ErrClaudeNotFound = errors.New("claude CLI binary not found in PATH; provide --llm-claude-binary")
ErrClaudeNotFound is returned when --llm-backend=cli is requested but the claude binary cannot be located (PATH lookup failed and no override was provided).
var ErrCliwrapAgentNotFound = errors.New(
"cliwrap-agent path not provided: set CLIWRAP_AGENT env var or " +
"pass CLIClientOptions.AgentPath (see https://github.com/0xmhha/cli-wrapper for installation)",
)
ErrCliwrapAgentNotFound is returned when the cliwrap-agent path cannot be resolved. Set the CLIWRAP_AGENT environment variable or provide CLIClientOptions.AgentPath. See https://github.com/0xmhha/cli-wrapper.
var ErrNoAPIKey = errors.New("ANTHROPIC_API_KEY not set")
ErrNoAPIKey is returned by NewAPIClient when ANTHROPIC_API_KEY is unset.
Functions ¶
func AllowedTools ¶
func AllowedTools(b Baseline) []string
AllowedTools maps a baseline to the set of MCP tool names the LLM may call. α returns nil (no tools).
func FilterHallucinations ¶
func FilterHallucinations(text string, hallu HallucinationResult) (filtered string, warnings []string)
FilterHallucinations rewrites `text` so each symbol mention that HallucinationResult flagged as Hallucinated is replaced with an inline marker. The function does NOT remove sentences or paragraphs — that decision belongs to the consumer (a model answer can carry useful prose around a single bad symbol; full strip would throw the useful prose away too).
The returned `warnings` slice carries a short human-readable summary of every replacement plus the qname-diverged mentions (the latter are not replaced — they resolved by bare name and might still be correct — but they warrant a flag).
Axis 4 of the 4-axis evaluation roadmap (2026-05-22). Sits downstream of T-04 ValidateMentions: the hallucination metric only *measures*, while the filter *acts on the measurement* before the text reaches a consumer. Together they give the 0%-error target a two-step path: prompt engineering reduces the rate at the source, and the filter scrubs whatever leaks through.
Design rules:
- Replacement only: `[unverified: <symbol>]`. No deletion.
- Word-boundary aware: `Token.transferFrom` is not rewritten when `Token.transfer` is flagged. strings.Replace with a boundary check covers the common Go/Sol/TS dotted-identifier shape.
- Idempotent: re-filtering an already-filtered text is a no-op because the marker `[unverified: ...]` does not contain a bare dotted identifier the validator would re-flag.
- Empty hallucination set → returns text unchanged + nil warnings.
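The design rules above can be illustrated with a minimal sketch. It is not the package's implementation: `markSymbol`, `isIdentByte`, and the byte-level boundary test are hypothetical stand-ins, and the "already inside a marker" check is one assumed way to keep the pass idempotent.

```go
package main

import (
	"fmt"
	"strings"
)

// markerPrefix mirrors the `[unverified: <symbol>]` marker from the design rules.
const markerPrefix = "[unverified: "

// isIdentByte reports whether b can be part of an identifier segment.
func isIdentByte(b byte) bool {
	return b == '_' || (b >= '0' && b <= '9') ||
		(b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// markSymbol (hypothetical helper) replaces standalone occurrences of symbol
// with the marker and returns the rewritten text plus the replacement count.
// Occurrences that continue into a longer identifier, or that already sit
// inside a marker, are left untouched.
func markSymbol(text, symbol string) (string, int) {
	if symbol == "" {
		return text, 0
	}
	var out strings.Builder
	count := 0
	for i := 0; i < len(text); {
		j := strings.Index(text[i:], symbol)
		if j < 0 {
			out.WriteString(text[i:])
			break
		}
		start, end := i+j, i+j+len(symbol)
		leftOK := start == 0 || !isIdentByte(text[start-1])
		rightOK := end == len(text) || !isIdentByte(text[end])
		alreadyMarked := strings.HasSuffix(text[:start], markerPrefix)
		out.WriteString(text[i:start])
		if leftOK && rightOK && !alreadyMarked {
			out.WriteString(markerPrefix + symbol + "]")
			count++
		} else {
			out.WriteString(symbol)
		}
		i = end
	}
	return out.String(), count
}

func main() {
	in := "call Token.transfer then Token.transferFrom"
	once, n := markSymbol(in, "Token.transfer")
	twice, m := markSymbol(once, "Token.transfer")
	fmt.Println(once, n)
	fmt.Println(twice == once, m)
}
```

Replacing instead of deleting keeps the surrounding prose intact, which matches the stated reason for leaving sentence removal to the consumer.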
func PrecisionRecall ¶
func PrecisionRecall(got, want []string) (precision, recall float64)
PrecisionRecall returns precision and recall when comparing got and want as unordered string sets.
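As a rough illustration of set-based precision and recall, the sketch below uses a hypothetical `precisionRecall` helper. Returning 0 when a denominator is empty is an assumption; the documentation does not state how the real function handles empty inputs.

```go
package main

import "fmt"

// precisionRecall treats got and want as unordered sets, so duplicates collapse.
// Assumption: an empty denominator yields 0 rather than NaN.
func precisionRecall(got, want []string) (precision, recall float64) {
	wantSet := make(map[string]struct{}, len(want))
	for _, w := range want {
		wantSet[w] = struct{}{}
	}
	gotSet := make(map[string]struct{}, len(got))
	for _, g := range got {
		gotSet[g] = struct{}{}
	}
	hits := 0
	for g := range gotSet {
		if _, ok := wantSet[g]; ok {
			hits++
		}
	}
	if len(gotSet) > 0 {
		precision = float64(hits) / float64(len(gotSet))
	}
	if len(wantSet) > 0 {
		recall = float64(hits) / float64(len(wantSet))
	}
	return precision, recall
}

func main() {
	p, r := precisionRecall([]string{"A", "B", "C", "D"}, []string{"A", "B"})
	fmt.Printf("precision=%.2f recall=%.2f\n", p, r)
}
```

In this example the answer names every wanted symbol (recall 1.0) but half of what it names is extra (precision 0.5).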
func RubricCheck ¶
func RubricCheck(output string, rubric []string) (hits, total int)
RubricCheck performs naive case-insensitive substring matching of each rubric item's keywords against the output text. V0 is intentionally crude — manual review is expected for high-stakes scoring.
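A minimal sketch of this kind of naive matching follows. How the real function derives keywords from a rubric item is not documented here, so the hypothetical `rubricHits` simply assumes the whole item is the substring to look for.

```go
package main

import (
	"fmt"
	"strings"
)

// rubricHits counts rubric items that appear in output, ignoring case.
// Assumption: each rubric item is matched as one literal substring.
func rubricHits(output string, rubric []string) (hits, total int) {
	lower := strings.ToLower(output)
	for _, item := range rubric {
		total++
		needle := strings.ToLower(strings.TrimSpace(item))
		if needle != "" && strings.Contains(lower, needle) {
			hits++
		}
	}
	return hits, total
}

func main() {
	hits, total := rubricHits("Uses a Mutex to guard the map", []string{"mutex", "channel"})
	fmt.Printf("%d/%d rubric items matched\n", hits, total)
}
```

Substring matching of this kind cannot tell "uses a mutex" from "does not need a mutex", which is one reason manual review remains necessary.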
func SystemPrompt ¶
func SystemPrompt(b Baseline) string
SystemPrompt returns the system prompt fragment that primes the LLM about what's available. α also receives raw file dumps appended to user content.
func WriteReport ¶
func WriteReport(path string, results []Result) error
WriteReport summarizes results.csv into a Markdown report (spec §9.5).
T-04 V1 (2026-05-21): every baseline row now carries hallucination statistics in addition to score. The summary table gains two columns (avg hallucination rate, total mentions), and a per-task detail section lists the literal hallucinated / qname-diverged symbols for triage.
Axis 1 (2026-05-22): the summary table now also reports population standard deviation alongside each mean, so multi-shot runs (--n-runs > 1) surface the LLM-side non-determinism the third smoke run made visible (3 single-shots produced 0/0/4 hallucinations). Single-shot runs (n=1) report std=0 — the columns stay structurally consistent across shot counts.
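For reference, population standard deviation divides by n rather than n-1, which is why a single-shot run reports std=0. The sketch below uses a hypothetical `meanStd` helper and is not taken from the report code.

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and the population standard deviation of xs.
// An empty slice yields (0, 0).
func meanStd(xs []float64) (mean, std float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var sumSq float64
	for _, x := range xs {
		d := x - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / float64(len(xs)))
}

func main() {
	// Hallucination counts from three repeats of the same fixture.
	m, s := meanStd([]float64{0, 0, 4})
	fmt.Printf("mean=%.2f std=%.2f\n", m, s)

	// A single-shot run has no spread.
	m1, s1 := meanStd([]float64{4})
	fmt.Printf("mean=%.2f std=%.2f\n", m1, s1)
}
```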
Types ¶
type APIClient ¶
type APIClient struct {
// contains filtered or unexported fields
}
APIClient wraps the Anthropic Messages API. Construct one per ckg eval run.
func NewAPIClient ¶
NewAPIClient constructs an APIClient. It reads ANTHROPIC_API_KEY from the environment and returns ErrNoAPIKey when unset.
func (*APIClient) Close ¶
Close releases resources. APIClient holds none; this is a no-op so the interface contract is uniform across backends.
type Baseline ¶
type Baseline string
Baseline determines what tools the LLM may call and how raw context is supplied (α only). See spec §9.1.
type CLIClient ¶
type CLIClient struct {
// contains filtered or unexported fields
}
CLIClient runs `claude -p` via cli-wrapper as the LLM backend. It is a drop-in alternative to APIClient: each call to Complete spawns one subprocess, waits for exit, parses the JSON the binary writes to stdout, and returns an LLMResult. The cli-wrapper Manager is reused across invocations and torn down by Close.
func NewCLIClient ¶
func NewCLIClient(opts CLIClientOptions) (*CLIClient, error)
NewCLIClient constructs a CLIClient. It resolves the claude binary path (override or PATH lookup), locates cliwrap-agent, and constructs a cli-wrapper Manager that will be reused across Complete calls.
func (*CLIClient) Complete ¶
Complete spawns one `claude -p` invocation, waits for exit, snapshots stdout, and parses the result.
Flags used:
- -p / --print: non-interactive mode required for piping/JSON output
- --no-session-persistence: do not write sessions to disk (prevents the user's session DB from being polluted by eval runs)
- --output-format json: single JSON document on stdout (schema below)
Note: --bare is intentionally NOT used. Without it, claude uses normal auth (OAuth, keychain, ANTHROPIC_API_KEY — whatever the user configured), which is required for Pro/Max users who rely on OAuth/keychain auth.
The `system` argument, if non-empty, is forwarded as --append-system-prompt. The `user` argument is the final positional prompt.
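The flag handling described above might be assembled roughly as follows. `buildClaudeArgs` is a hypothetical helper and the flag order is an assumption; only the flags themselves come from the documentation.

```go
package main

import "fmt"

// buildClaudeArgs assembles the argument list for one `claude -p` call.
// The system prompt is forwarded only when non-empty; the user prompt is
// always the final positional argument.
func buildClaudeArgs(system, user string) []string {
	args := []string{
		"-p",
		"--no-session-persistence",
		"--output-format", "json",
	}
	if system != "" {
		args = append(args, "--append-system-prompt", system)
	}
	return append(args, user)
}

func main() {
	fmt.Printf("%q\n", buildClaudeArgs("You may call graph tools.", "List callers of Foo"))
	fmt.Printf("%q\n", buildClaudeArgs("", "List callers of Foo"))
}
```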
func (*CLIClient) CompleteWithTools ¶
func (c *CLIClient) CompleteWithTools(ctx context.Context, system, user string, store pkgstore.Reader) (LLMResult, error)
CompleteWithTools runs the γ multi-turn tool-use loop via the prompt-based protocol. See gamma_loop.go.
func (*CLIClient) Close ¶
Close shuts down the underlying cli-wrapper Manager, draining its WAL outbox. A 5s timeout is used so a wedged shutdown does not hang the eval run; failures are returned to the caller.
The deferred recover guards against an upstream lifecycle bug surfaced by smoke-run 2026-05-21: when h.Start() fails on a processHandle that's already registered with the Manager, the subsequent Shutdown walks the handle list, calls processHandle.Stop, and reaches into a nil ipc.Conn (Seqs nil dereference). Without the recover, the runner's `defer llm.Close()` panics and overwrites the real error ("start claude: ...exec...") that the user actually needs to see. With the recover, Close converts the panic into a normal error return so the runner surfaces both: the original spawn failure on stderr from runOne, and this Close error from the final Run return value.
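The recover-to-error conversion described for Close can be sketched in isolation. The `manager` and `conn` types below are hypothetical stand-ins for the cli-wrapper internals, and the 5s shutdown timeout is left out to keep the example focused on the panic handling.

```go
package main

import "fmt"

// conn and manager are hypothetical stand-ins for the upstream types.
type conn struct{ seq int }

type manager struct{ conn *conn }

// Shutdown touches the connection; with a nil conn this panics,
// mimicking the nil dereference described above.
func (m *manager) Shutdown() error {
	m.conn.seq++
	return nil
}

// safeClose converts a panic raised during Shutdown into an ordinary error,
// so a deferred Close cannot mask an earlier failure with a crash.
func safeClose(m *manager) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("close: recovered from panic during shutdown: %v", r)
		}
	}()
	return m.Shutdown()
}

func main() {
	fmt.Println("healthy:", safeClose(&manager{conn: &conn{}}))
	fmt.Println("wedged: ", safeClose(&manager{}))
}
```

The named return value is what lets the deferred function overwrite the result after the panic has been recovered.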
type CLIClientOptions ¶
type CLIClientOptions struct {
// Binary is the path to the `claude` executable. If empty,
// exec.LookPath("claude") is used; if that fails, NewCLIClient
// returns ErrClaudeNotFound.
Binary string
// AgentPath is the absolute path to the cliwrap-agent binary. If
// empty, the CLIWRAP_AGENT environment variable is consulted. CKG
// does NOT install cliwrap-agent; set CLIWRAP_AGENT or pass this
// field explicitly. See https://github.com/0xmhha/cli-wrapper.
AgentPath string
// RuntimeDir is where cli-wrapper stores per-process WAL/state. If
// empty, a directory under os.TempDir() is created.
RuntimeDir string
}
CLIClientOptions configures CLIClient construction.
type Citation ¶
Citation is a single file:line reference extracted from LLM output.
func ExtractCitations ¶
ExtractCitations pulls file:line references from LLM output text.
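One plausible way to pull file:line references out of free text is a regular expression. The pattern, the `citation` type, and `extractCitations` below are assumptions for illustration; the exact citation syntax the real function accepts is not documented here.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// citation is a hypothetical stand-in for the package's Citation type.
type citation struct {
	File string
	Line int
}

// citationRE matches a path with a file extension followed by ":<digits>".
var citationRE = regexp.MustCompile(`([A-Za-z0-9_./-]+\.[A-Za-z0-9]+):(\d+)`)

// extractCitations returns every file:line reference found in text,
// in order of appearance.
func extractCitations(text string) []citation {
	var out []citation
	for _, m := range citationRE.FindAllStringSubmatch(text, -1) {
		line, err := strconv.Atoi(m[2])
		if err != nil {
			continue // line number too large to represent; skip it
		}
		out = append(out, citation{File: m[1], Line: line})
	}
	return out
}

func main() {
	for _, c := range extractCitations("see core/blockchain.go:120 and cmd/main.go:7.") {
		fmt.Printf("%s line %d\n", c.File, c.Line)
	}
}
```

A pattern this loose also matches strings such as host:port pairs, so a real extractor would likely need stricter rules.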
type CitationResult ¶
type CitationResult struct {
Total int
FileExists int
LineInNode int
Hallucinated []Citation
Precision float64
}
CitationResult is the per-response classification of file:line citations extracted from an LLM output (T-03).
func ValidateCitations ¶
func ValidateCitations(output string, store pkgstore.Reader) (CitationResult, error)
ValidateCitations checks every file:line citation in output against the graph store. For each citation:
- FileExists increments if NodesByFilePath returns any nodes
- LineInNode increments if the cited line falls within at least one node's [start_line, end_line] range
Precision = LineInNode / Total (0 when Total == 0). store may be nil, in which case ValidateCitations returns a zero result without error.
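The counting rules above can be sketched against an in-memory index. The `nodes` map stands in for the store's NodesByFilePath lookup, all type names are hypothetical, and treating every citation that is not LineInNode as hallucinated is an assumption.

```go
package main

import "fmt"

type citation struct {
	File string
	Line int
}

// span is a node's [Start, End] line range, inclusive.
type span struct{ Start, End int }

type citationResult struct {
	Total, FileExists, LineInNode int
	Hallucinated                  []citation
	Precision                     float64
}

// classify counts citations against nodes, a map from file path to the
// line ranges of the nodes defined in that file.
func classify(cites []citation, nodes map[string][]span) citationResult {
	var r citationResult
	for _, c := range cites {
		r.Total++
		spans := nodes[c.File]
		if len(spans) == 0 {
			r.Hallucinated = append(r.Hallucinated, c)
			continue
		}
		r.FileExists++
		inNode := false
		for _, s := range spans {
			if c.Line >= s.Start && c.Line <= s.End {
				inNode = true
				break
			}
		}
		if inNode {
			r.LineInNode++
		} else {
			r.Hallucinated = append(r.Hallucinated, c)
		}
	}
	if r.Total > 0 {
		r.Precision = float64(r.LineInNode) / float64(r.Total)
	}
	return r
}

func main() {
	nodes := map[string][]span{"a.go": {{Start: 10, End: 20}}}
	cites := []citation{{"a.go", 12}, {"a.go", 20}, {"a.go", 99}, {"missing.go", 1}}
	fmt.Printf("%+v\n", classify(cites, nodes))
}
```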
type Expected ¶
type Expected struct {
// symbol_set kind
Symbols []string `yaml:"symbols,omitempty"`
// code_patch kind
MustUseSymbols []string `yaml:"must_use_symbols,omitempty"`
MustCall []string `yaml:"must_call,omitempty"`
MustNotBreakSig bool `yaml:"must_not_break_signature,omitempty"`
// rubric kind
Rubric []string `yaml:"rubric,omitempty"`
}
type HallucinationResult ¶
type HallucinationResult struct {
// Total mentions extracted from the output, after lowercase
// deduplication. A response that mentions "core.NewBlockChain"
// three times contributes 1 to Total.
Total int
// Found is the subset of mentions whose bare name (last
// dot-segment) resolves in the store, *case-insensitively*.
Found []string
// QnameDiverged is the subset of Found where the bare name
// resolved but no candidate node had a qualified name matching
// the mentioned dotted form. The mention is plausible but the
// package prefix is wrong (`eth.NewBlockChain` vs
// `core.NewBlockChain`). V0 surfaces these for manual triage
// without counting them against Rate.
QnameDiverged []string
// Hallucinated is the subset of mentions where the bare name
// did not resolve at all.
Hallucinated []string
// Rate is the hallucination fraction: len(Hallucinated) / Total.
Rate float64
}
HallucinationResult is the per-response classification of the symbol mentions extracted from an LLM output.
Rate = len(Hallucinated) / Total. Total = 0 (no mentions extracted) produces Rate = 0; the call site that wants to distinguish "answer had no symbols" from "answer had only valid symbols" reads Total directly.
func ValidateMentions ¶
func ValidateMentions(output string, store pkgstore.Reader) (HallucinationResult, error)
ValidateMentions classifies every symbol mention in `output` as Found, QnameDiverged, or Hallucinated by looking each up in `store`. The tokenizer is extractSymbols (the same path scoreTask uses).
store may be nil — in which case every mention is recorded as Found with Rate = 0. This keeps call sites that lack a graph (e.g. the rubric-only scoring path) from short-circuiting on a nil dereference; hallucination measurement is opt-in.
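The three-way classification can be sketched against a plain list of known qualified names. Everything below is illustrative: `classifyMentions` and `bareName` are hypothetical, the store is replaced by a slice, and the suffix comparison used to decide qualified-name agreement is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

type mentionResult struct {
	Total                              int
	Found, QnameDiverged, Hallucinated []string
	Rate                               float64
}

// bareName returns the last dot-separated segment of s.
func bareName(s string) string {
	if i := strings.LastIndex(s, "."); i >= 0 {
		return s[i+1:]
	}
	return s
}

// classifyMentions deduplicates mentions case-insensitively, resolves each
// by bare name, and flags dotted mentions whose prefix matches no candidate.
func classifyMentions(mentions, knownQnames []string) mentionResult {
	byBare := map[string][]string{}
	for _, q := range knownQnames {
		lq := strings.ToLower(q)
		byBare[bareName(lq)] = append(byBare[bareName(lq)], lq)
	}
	var r mentionResult
	seen := map[string]bool{}
	for _, m := range mentions {
		lm := strings.ToLower(m)
		if seen[lm] {
			continue
		}
		seen[lm] = true
		r.Total++
		cands, ok := byBare[bareName(lm)]
		if !ok {
			r.Hallucinated = append(r.Hallucinated, m)
			continue
		}
		r.Found = append(r.Found, m)
		if !strings.Contains(lm, ".") {
			continue
		}
		match := false
		for _, q := range cands {
			if q == lm || strings.HasSuffix(q, "."+lm) {
				match = true
				break
			}
		}
		if !match {
			r.QnameDiverged = append(r.QnameDiverged, m)
		}
	}
	if r.Total > 0 {
		r.Rate = float64(len(r.Hallucinated)) / float64(r.Total)
	}
	return r
}

func main() {
	known := []string{"core.NewBlockChain"}
	out := []string{"core.NewBlockChain", "core.newblockchain", "eth.NewBlockChain", "core.MadeUp"}
	fmt.Printf("%+v\n", classifyMentions(out, known))
}
```

Note that the diverged mention stays in Found and does not raise Rate, consistent with the field comments on HallucinationResult.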
type LLMClient ¶
type LLMClient interface {
Complete(ctx context.Context, system, user string) (LLMResult, error)
CompleteWithTools(ctx context.Context, system, user string,
store pkgstore.Reader) (LLMResult, error)
Close() error
}
LLMClient is the abstraction the eval runner uses for completions. The Anthropic Messages API (APIClient) and the Claude Code CLI (CLIClient) both implement it. Close releases backend-specific resources (e.g., shutting down a cli-wrapper Manager); APIClient.Close is a no-op.
CompleteWithTools runs the γ multi-turn tool-use loop via prompt-based pseudo-tool-use (see gamma_loop.go). Both backends implement the same protocol on top of plain Complete calls — no API-specific support required.
type LLMResult ¶
type LLMResult struct {
OutputText string
InputTokens int
OutputTokens int
CacheReadTokens int
CacheCreateTokens int
NumToolCalls int
// NumCachedCalls is populated only by γ (runGammaPromptLoop): it
// counts (name, args) tuples the LLM re-issued that were served
// from the in-loop cache instead of re-traversing the store. The
// sum NumToolCalls + NumCachedCalls reports the total fan-out the
// model attempted; NumToolCalls alone reports the work the store
// actually did. Other baselines leave this zero. P2 #6.
NumCachedCalls int
UserPromptBytes int
}
LLMResult bundles a single completion's output text and usage counters. UserPromptBytes is populated only by γ (CompleteWithTools) where the multi-turn loop accumulates a much larger user message than the runner's pre-tool prompt. Other baselines leave it zero and the runner falls back to its own len(user) measurement.
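The relationship between NumToolCalls and NumCachedCalls can be shown with a small memoizing wrapper. This is only a sketch: `toolCache` is hypothetical, and representing the tool arguments as a single string is an assumption.

```go
package main

import "fmt"

// toolCache memoizes tool results by (name, args) and keeps both counters.
type toolCache struct {
	results        map[string]string
	NumToolCalls   int // calls that actually reached the store
	NumCachedCalls int // repeated calls served from the cache
}

func newToolCache() *toolCache {
	return &toolCache{results: make(map[string]string)}
}

// call runs exec only the first time a (name, args) pair is seen.
func (c *toolCache) call(name, args string, exec func(name, args string) string) string {
	key := name + "\x00" + args // NUL separator avoids ambiguous concatenations
	if v, ok := c.results[key]; ok {
		c.NumCachedCalls++
		return v
	}
	v := exec(name, args)
	c.results[key] = v
	c.NumToolCalls++
	return v
}

func main() {
	c := newToolCache()
	exec := func(name, args string) string { return name + "(" + args + ")" }
	c.call("get_subgraph", "Foo", exec)
	c.call("get_subgraph", "Foo", exec) // repeated: served from the cache
	c.call("get_subgraph", "Bar", exec)
	fmt.Printf("store calls=%d cached=%d attempted=%d\n",
		c.NumToolCalls, c.NumCachedCalls, c.NumToolCalls+c.NumCachedCalls)
}
```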
type Result ¶
type Result struct {
TaskID string
Baseline Baseline
// RunIdx identifies which of the N repeats of a (task, baseline)
// pair this row represents (Axis 1, 2026-05-22). Single-shot eval
// runs leave it at 0; multi-shot runs (--n-runs > 1) fill it with
// 0..N-1. The report aggregator groups rows by (TaskID, Baseline)
// and computes mean ± std across RunIdx, surfacing the
// non-determinism the third smoke run made unmistakable (3 runs
// of the same fixture produced 0, 0, and 4 hallucinated symbols).
RunIdx int
// UserPromptBytes is the application-level size of the
// per-invocation user prompt the runner built for this row
// (post-baseline-specific append: raw files for α,
// get_subgraph result for β, smartContext for δ). It is the
// only "prompt size" measurement that is independent of
// claude CLI's internal prompt cache state, which carries
// Claude Code's workspace context across invocations and
// inflates cached_tokens to hundreds of thousands. H1's
// question — "does δ supply less context than α?" — answers
// cleanly against this field; cached_tokens reads the
// CLI-side cache pattern instead and is the wrong proxy
// (audit 2026-05-22).
UserPromptBytes int
InputTokens int
OutputTokens int
CachedTokens int
Score float64
LatencyMS int64
NumToolCalls int
Stale bool
RawOutput string
// Hallucination is the per-response classification of every symbol
// mention the LLM emitted, looked up against the same store the
// runner used to answer. T-04 V1 (HANDOFF.md 2026-05-11, wired
// 2026-05-21). Populated by runOne after scoreTask; nil store on
// the runOne path is the rubric-only short-circuit and produces
// Total=0 / Rate=0. The detailed Found/QnameDiverged/Hallucinated
// lists are surfaced via the report.md path, not CSV, because
// the lists are variable-length and would balloon CSV column
// count beyond what spreadsheet readers handle cleanly.
Hallucination HallucinationResult
// Citation is the per-response file:line accuracy check (T-03).
// Populated by runOne after scoreTask; measures whether the LLM's
// source citations point to real files and valid line ranges.
Citation CitationResult
}
Result is one row in the CSV.
func Run ¶
func Run(ctx context.Context, tasks []Task, baselines []Baseline, graphDir string, llm LLMClient, outDir string, nRuns int) ([]Result, error)
Run loops tasks × baselines × nRuns and writes results.csv plus report.md. Each (task, baseline) pair runs nRuns times; per-run rows carry RunIdx 0..nRuns-1 so the report aggregator can compute mean ± std across repeats (Axis 1, 2026-05-22). nRuns ≤ 0 is treated as 1 for backwards compatibility with single-shot callers.
Run takes ownership of llm: it is Closed when Run returns, regardless of error path. Callers must NOT Close llm themselves.
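The iteration shape described for Run can be sketched as a plan of rows. `planRuns` and `row` are hypothetical, and the nesting order of the three loops is an assumption; the clamp of non-positive nRuns to 1 follows the documented behavior.

```go
package main

import "fmt"

// row identifies one (task, baseline, repeat) combination.
type row struct {
	TaskID   string
	Baseline string
	RunIdx   int
}

// planRuns enumerates tasks x baselines x nRuns, numbering repeats 0..nRuns-1.
// nRuns <= 0 is treated as 1, matching single-shot callers.
func planRuns(taskIDs, baselines []string, nRuns int) []row {
	if nRuns <= 0 {
		nRuns = 1
	}
	rows := make([]row, 0, len(taskIDs)*len(baselines)*nRuns)
	for _, t := range taskIDs {
		for _, b := range baselines {
			for i := 0; i < nRuns; i++ {
				rows = append(rows, row{TaskID: t, Baseline: b, RunIdx: i})
			}
		}
	}
	return rows
}

func main() {
	for _, r := range planRuns([]string{"task-1"}, []string{"α", "δ"}, 2) {
		fmt.Printf("%s %s run=%d\n", r.TaskID, r.Baseline, r.RunIdx)
	}
}
```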
type Task ¶
type Task struct {
ID string `yaml:"id"`
Corpus string `yaml:"corpus"` // "synthetic" | "real" | absolute path
CorpusPath string `yaml:"corpus_path"` // optional override
Description string `yaml:"description"`
ExpectedKind string `yaml:"expected_kind"` // "symbol_set" | "code_patch" | "rubric"
Expected Expected `yaml:"expected"`
Scoring Scoring `yaml:"scoring"`
}
Task mirrors the eval/tasks/*.yaml schema (spec §9.3).
Directories ¶
| Path | Synopsis |
|---|---|
| | Package retrieval implements the LLM-free retrieval-accuracy measurement (EV1 Phase 2): it loads YAML probe fixtures, dispatches each to the matching MCP tool through StoreReader, and scores result symbols against an expected set with recall / precision / F1. |