eval

package

Version: v0.0.0-...-42ce3c7 (latest)
Published: May 29, 2026
License: AGPL-3.0
Imports: 24
Imported by: 0

Repository

github.com/0xmhha/code-knowledge-graph

Documentation ¶

Overview ¶

Package eval is the LLM-based benchmark harness for CKG.

It compares graph-context (β/γ/δ baselines) against raw-file context (α baseline) to validate the project's value proposition. The rest of CKG (build/serve/mcp/export-static/export-postgres/audit) is LLM-free; eval is the only part that pulls in LLM SDKs.

EXTRACTION NOTE: this package is slated to move to a sister repo (e.g. code-knowledge-graph-eval). When that happens:

Move: internal/eval/, cmd/ckg/eval.go, eval/tasks/*.yaml
Imports: already use pkg/store (this repo's public read surface); the new module imports it as a regular Go module dep.
Drop from this repo's go.mod: github.com/anthropics/anthropic-sdk-go and github.com/0xmhha/cli-wrapper
CI: remove the CI=true skip in llm_cli_test.go::TestCLIClient_Complete_Smoke_ClaudeFallback once the new repo's CI runs the test reliably (the skip was added to dodge a cli-wrapper Manager-reuse race).
Open TODO at extraction time: smartContext in runner.go duplicates the get_context_for_task MCP tool; extract to a shared package (e.g. pkg/contextcompose) so both eval and mcp share one implementation.

The paths and test names above (cmd/ckg/eval.go, eval/tasks/*.yaml, TestCLIClient_Complete_Smoke_ClaudeFallback, ...) reference concrete repo state. If you rename or move any of them before extraction lands, update this doc in the same commit so the checklist doesn't go stale.

Package eval runs the four-baseline measurement (spec §9). Each task is a YAML file; baselines differ only in the MCP tool allowlist (and α uses no tools at all).

Index ¶

Variables
func AllowedTools(b Baseline) []string
func FilterHallucinations(text string, hallu HallucinationResult) (filtered string, warnings []string)
func PrecisionRecall(got, want []string) (precision, recall float64)
func RubricCheck(output string, rubric []string) (hits, total int)
func SystemPrompt(b Baseline) string
func WriteReport(path string, results []Result) error
type APIClient
- func NewAPIClient(model string) (*APIClient, error)
- func (l *APIClient) Close() error
- func (l *APIClient) Complete(ctx context.Context, system, user string) (LLMResult, error)
- func (l *APIClient) CompleteWithTools(ctx context.Context, system, user string, store pkgstore.Reader) (LLMResult, error)
type Baseline
type CLIClient
- func NewCLIClient(opts CLIClientOptions) (*CLIClient, error)
- func (c *CLIClient) Close() (err error)
- func (c *CLIClient) Complete(ctx context.Context, system, user string) (LLMResult, error)
- func (c *CLIClient) CompleteWithTools(ctx context.Context, system, user string, store pkgstore.Reader) (LLMResult, error)
type CLIClientOptions
type Citation
- func ExtractCitations(text string) []Citation
type CitationResult
- func ValidateCitations(output string, store pkgstore.Reader) (CitationResult, error)
type Expected
type HallucinationResult
- func ValidateMentions(output string, store pkgstore.Reader) (HallucinationResult, error)
type LLMClient
type LLMResult
type Result
- func Run(ctx context.Context, tasks []Task, baselines []Baseline, graphDir string, ...) ([]Result, error)
type Scoring
type Task
- func LoadTasks(glob string) ([]Task, error)

Constants ¶

This section is empty.

Variables ¶

var ErrClaudeNotFound = errors.New("claude CLI binary not found in PATH; provide --llm-claude-binary")

ErrClaudeNotFound is returned when --llm-backend=cli is requested but the claude binary cannot be located (PATH lookup failed and no override was provided).

var ErrCliwrapAgentNotFound = errors.New(
	"cliwrap-agent path not provided: set CLIWRAP_AGENT env var or " +
		"pass CLIClientOptions.AgentPath (see https://github.com/0xmhha/cli-wrapper for installation)",
)

ErrCliwrapAgentNotFound is returned when cliwrap-agent path cannot be resolved. Set the CLIWRAP_AGENT environment variable or provide CLIClientOptions.AgentPath. See https://github.com/0xmhha/cli-wrapper.

var ErrNoAPIKey = errors.New("ANTHROPIC_API_KEY not set")

ErrNoAPIKey is returned by NewAPIClient when ANTHROPIC_API_KEY is unset.

Functions ¶

func AllowedTools ¶

func AllowedTools(b Baseline) []string

AllowedTools maps a baseline to the set of MCP tool names the LLM may call. α returns nil (no tools).

func FilterHallucinations ¶

func FilterHallucinations(text string, hallu HallucinationResult) (filtered string, warnings []string)

FilterHallucinations rewrites `text` so each symbol mention that HallucinationResult flagged as Hallucinated is replaced with an inline marker. The function does NOT remove sentences or paragraphs — that decision belongs to the consumer (a model answer can carry useful prose around a single bad symbol; full strip would throw the useful prose away too).

The returned `warnings` slice carries a short human-readable summary of every replacement plus the qname-diverged mentions (the latter are not replaced — they resolved by bare name and might still be correct — but they warrant a flag).

Axis 4 of the 4-axis evaluation roadmap (2026-05-22). Sits downstream of T-04 ValidateMentions: the hallucination metric only *measures*, the filter *acts on the measurement* before the text reaches a consumer. Together they give the 0%-error goal a two-step path: prompt engineering reduces the rate at the source, and the filter scrubs whatever leaks through.

Design rules:

Replacement only: `[unverified: <symbol>]`. No deletion.
Word-boundary aware: `Token.transferFrom` is not rewritten when `Token.transfer` is flagged. strings.Replace with a boundary check covers the common Go/Sol/TS dotted-identifier shape.
Idempotent: re-filtering an already-filtered text is a no-op because the marker `[unverified: ...]` does not contain a bare dotted identifier the validator would re-flag.
Empty hallucination set → returns text unchanged + nil warnings.
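
The design rules above can be sketched as follows. This is a minimal illustration, not the package's implementation: markUnverified, bounded, and isWord are hypothetical names, and the idempotence check here (skip an occurrence that already sits inside a marker) is an assumption chosen so the sketch can be re-run with the same flagged set.

```go
package main

import (
	"fmt"
	"strings"
)

// markerPrefix opens the inline marker described in the design rules.
const markerPrefix = "[unverified: "

// isWord reports whether b can appear inside an identifier segment.
func isWord(b byte) bool {
	return b == '_' || (b >= '0' && b <= '9') ||
		(b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// bounded reports whether text[start:end] is a whole dotted identifier
// rather than a fragment of a longer one.
func bounded(text string, start, end int) bool {
	if start > 0 && (isWord(text[start-1]) || text[start-1] == '.') {
		return false
	}
	if end < len(text) {
		if isWord(text[end]) {
			return false
		}
		// A dot continues the identifier only when another segment follows;
		// a sentence-ending period does not.
		if text[end] == '.' && end+1 < len(text) && isWord(text[end+1]) {
			return false
		}
	}
	return true
}

// markUnverified (hypothetical name) wraps each whole-identifier occurrence
// of a flagged symbol in the marker and emits one warning per replaced symbol.
func markUnverified(text string, flagged []string) (string, []string) {
	var warnings []string
	for _, sym := range flagged {
		if sym == "" {
			continue
		}
		var out strings.Builder
		replaced, i := 0, 0
		for {
			j := strings.Index(text[i:], sym)
			if j < 0 {
				out.WriteString(text[i:])
				break
			}
			start, end := i+j, i+j+len(sym)
			alreadyMarked := strings.HasSuffix(text[:start], markerPrefix)
			if bounded(text, start, end) && !alreadyMarked {
				out.WriteString(text[i:start])
				out.WriteString(markerPrefix + sym + "]")
				replaced++
			} else {
				out.WriteString(text[i:end])
			}
			i = end
		}
		text = out.String()
		if replaced > 0 {
			warnings = append(warnings,
				fmt.Sprintf("replaced %d mention(s) of %q", replaced, sym))
		}
	}
	return text, warnings
}

func main() {
	in := "Call Token.transfer, then Token.transferFrom."
	out, warnings := markUnverified(in, []string{"Token.transfer"})
	fmt.Println(out)
	fmt.Println(warnings)
}
```

In this sketch only the standalone `Token.transfer` is wrapped; `Token.transferFrom` is left alone because the character after the match is an identifier character.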

func PrecisionRecall ¶

func PrecisionRecall(got, want []string) (precision, recall float64)

PrecisionRecall returns precision and recall when comparing got and want as unordered string sets.
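
A minimal sketch of a set-based comparison of this kind. The helper names are hypothetical, and the handling of empty inputs (precision 0 for an empty got, recall 0 for an empty want) is an assumption; the package may treat those cases differently.

```go
package main

import "fmt"

// toSet collapses a slice into a set so duplicates and order do not matter.
func toSet(items []string) map[string]struct{} {
	set := make(map[string]struct{}, len(items))
	for _, it := range items {
		set[it] = struct{}{}
	}
	return set
}

// precisionRecall compares got and want as unordered sets.
// Assumption: an empty got yields precision 0, an empty want yields recall 0.
func precisionRecall(got, want []string) (precision, recall float64) {
	gotSet, wantSet := toSet(got), toSet(want)
	hits := 0
	for s := range gotSet {
		if _, ok := wantSet[s]; ok {
			hits++
		}
	}
	if len(gotSet) > 0 {
		precision = float64(hits) / float64(len(gotSet))
	}
	if len(wantSet) > 0 {
		recall = float64(hits) / float64(len(wantSet))
	}
	return precision, recall
}

func main() {
	p, r := precisionRecall(
		[]string{"A", "B", "C", "A"}, // duplicate "A" counts once
		[]string{"A", "B", "D", "E"},
	)
	fmt.Printf("precision=%.3f recall=%.3f\n", p, r) // precision=0.667 recall=0.500
}
```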

func RubricCheck ¶

func RubricCheck(output string, rubric []string) (hits, total int)

RubricCheck performs naive case-insensitive substring matching of each rubric item's keywords against the output text. V0 is intentionally crude — manual review is expected for high-stakes scoring.
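
A rough sketch of naive case-insensitive matching. It simplifies by treating each rubric item as a single phrase; how the package splits an item into keywords is not stated here, so that part is an assumption, and rubricHits is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// rubricHits counts rubric items that occur in output as a
// case-insensitive substring. Blank items count toward total but never hit.
func rubricHits(output string, rubric []string) (hits, total int) {
	lower := strings.ToLower(output)
	for _, item := range rubric {
		total++
		needle := strings.ToLower(strings.TrimSpace(item))
		if needle != "" && strings.Contains(lower, needle) {
			hits++
		}
	}
	return hits, total
}

func main() {
	out := "The fix adds a Nonce check before calling Transfer."
	hits, total := rubricHits(out, []string{"nonce check", "reentrancy guard"})
	fmt.Printf("%d/%d rubric items matched\n", hits, total) // 1/2 rubric items matched
}
```

Substring matching like this cannot tell "adds a nonce check" from "forgot the nonce check", which is why manual review remains necessary for high-stakes scoring.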

func SystemPrompt ¶

func SystemPrompt(b Baseline) string

SystemPrompt returns the system prompt fragment that primes the LLM about what's available. α also receives raw file dumps appended to user content.

func WriteReport ¶

func WriteReport(path string, results []Result) error

WriteReport summarizes results.csv into a Markdown report (spec §9.5).

T-04 V1 (2026-05-21): every baseline row now carries hallucination statistics in addition to score. The summary table gains two columns (avg hallucination rate, total mentions), and a per-task detail section lists the literal hallucinated / qname-diverged symbols for triage.

Axis 1 (2026-05-22): the summary table now also reports population standard deviation alongside each mean, so multi-shot runs (--n-runs > 1) surface the LLM-side non-determinism the third smoke run made visible (3 single-shots produced 0/0/4 hallucinations). Single-shot runs (n=1) report std=0 — the columns stay structurally consistent across shot counts.
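
For reference, a short sketch of the mean and population standard deviation described above. meanStd is a hypothetical helper, and the sample input reuses the 0/0/4 hallucination counts mentioned in the text purely as illustration.

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and the population standard deviation
// (divide by n, not n-1). A single sample therefore yields std = 0.
func meanStd(xs []float64) (mean, std float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))

	var sumSq float64
	for _, x := range xs {
		d := x - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / float64(len(xs)))
}

func main() {
	mean, std := meanStd([]float64{0, 0, 4})
	fmt.Printf("multi-shot:  mean=%.3f std=%.3f\n", mean, std)

	mean, std = meanStd([]float64{4})
	fmt.Printf("single-shot: mean=%.3f std=%.3f\n", mean, std)
}
```

Dividing by n rather than n-1 is what lets the n=1 case report std=0 instead of an undefined value, which keeps the columns consistent across shot counts.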

Types ¶

type APIClient ¶

type APIClient struct {
	// contains filtered or unexported fields
}

APIClient wraps the Anthropic Messages API. Construct one per ckg eval run.

func NewAPIClient ¶

func NewAPIClient(model string) (*APIClient, error)

NewAPIClient constructs an APIClient. It reads ANTHROPIC_API_KEY from the environment and returns ErrNoAPIKey when unset.

func (*APIClient) Close ¶

func (l *APIClient) Close() error

Close releases resources. APIClient holds none; this is a no-op so the interface contract is uniform across backends.

func (*APIClient) Complete ¶

func (l *APIClient) Complete(ctx context.Context, system, user string) (LLMResult, error)

Complete runs a single message exchange via the Anthropic API. The V0 implementation does not loop over tool_use round-trips; the runner pre-resolves any tool calls in-process before invoking Complete.

func (*APIClient) CompleteWithTools ¶

func (l *APIClient) CompleteWithTools(ctx context.Context, system, user string,
	store pkgstore.Reader) (LLMResult, error)

CompleteWithTools runs the γ multi-turn tool-use loop via the prompt-based protocol. See gamma_loop.go.

type Baseline ¶

type Baseline string

Baseline determines what tools the LLM may call and how raw context is supplied (α only). See spec §9.1.

const (
	BaselineAlpha Baseline = "alpha" // raw file dump, no tools
	BaselineBeta  Baseline = "beta"  // get_subgraph(root, depth=99), 1 tool
	BaselineGamma Baseline = "gamma" // 5 granular tools
	BaselineDelta Baseline = "delta" // get_context_for_task only (smart)
)

type CLIClient ¶

type CLIClient struct {
	// contains filtered or unexported fields
}

CLIClient runs `claude -p` via cli-wrapper as the LLM backend. It is a drop-in alternative to APIClient: each call to Complete spawns one subprocess, waits for exit, parses the JSON the binary writes to stdout, and returns an LLMResult. The cli-wrapper Manager is reused across invocations and torn down by Close.

func NewCLIClient ¶

func NewCLIClient(opts CLIClientOptions) (*CLIClient, error)

NewCLIClient constructs a CLIClient. It resolves the claude binary path (override or PATH lookup), locates cliwrap-agent, and constructs a cli-wrapper Manager that will be reused across Complete calls.

func (*CLIClient) Close ¶

func (c *CLIClient) Close() (err error)

Close shuts down the underlying cli-wrapper Manager, draining its WAL outbox. A 5s timeout is used so a wedged shutdown does not hang the eval run; failures are returned to the caller.

The deferred recover guards against an upstream lifecycle bug surfaced by smoke-run 2026-05-21: when h.Start() fails on a processHandle that's already registered with the Manager, the subsequent Shutdown walks the handle list, calls processHandle.Stop, and reaches into a nil ipc.Conn (Seqs nil dereference). Without the recover, the runner's `defer llm.Close()` panics and overwrites the real error ("start claude: ...exec...") that the user actually needs to see. With the recover, Close converts the panic into a normal error return so the runner surfaces both: the original spawn failure on stderr from runOne, and this Close error from the final Run return value.

func (*CLIClient) Complete ¶

func (c *CLIClient) Complete(ctx context.Context, system, user string) (LLMResult, error)

Complete spawns one `claude -p` invocation, waits for exit, snapshots stdout, and parses the result.

Flags used:

-p / --print: non-interactive mode required for piping/JSON output
--no-session-persistence: do not write sessions to disk (prevents user's session DB from being polluted by eval runs)
--output-format json: single JSON document on stdout (schema below)

Note: --bare is intentionally NOT used. Without it, claude uses normal auth (OAuth, keychain, or ANTHROPIC_API_KEY, whichever the user configured), which is required for Pro/Max users who rely on OAuth/keychain auth.

The `system` argument, if non-empty, is forwarded as --append-system-prompt. The `user` argument is the final positional prompt.

func (*CLIClient) CompleteWithTools ¶

func (c *CLIClient) CompleteWithTools(ctx context.Context, system, user string,
	store pkgstore.Reader) (LLMResult, error)

CompleteWithTools runs the γ multi-turn tool-use loop via the prompt-based protocol. See gamma_loop.go.
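
The recover guard described for Close can be sketched as below. The shutdowner interface and both fake managers are hypothetical stand-ins; the real cli-wrapper Manager API is not shown in this document, so only the pattern (named error return, deferred recover, 5s timeout) is meant to carry over.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// shutdowner is a hypothetical stand-in for the cli-wrapper Manager.
type shutdowner interface {
	Shutdown(ctx context.Context) error
}

// closeWithRecover bounds shutdown to 5 seconds and converts a panic
// inside Shutdown into an ordinary error via the named return value.
func closeWithRecover(m shutdowner) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("close: shutdown panicked: %v", r)
		}
	}()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return m.Shutdown(ctx)
}

// healthy shuts down cleanly.
type healthy struct{}

func (healthy) Shutdown(context.Context) error { return nil }

// broken simulates the upstream bug by panicking during shutdown.
type broken struct{}

func (broken) Shutdown(context.Context) error { panic("nil ipc.Conn") }

func main() {
	// The spawn failure is the error the user needs to see first.
	spawnErr := errors.New("start claude: exec failed")
	closeErr := closeWithRecover(broken{})

	fmt.Println("spawn error:", spawnErr)
	fmt.Println("close error:", closeErr)
	fmt.Println("healthy close:", closeWithRecover(healthy{}))
}
```

The named return value matters here: a deferred function can only replace the result of the enclosing call if that result has a name it can assign to.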

type CLIClientOptions ¶

type CLIClientOptions struct {
	// Binary is the path to the `claude` executable. If empty,
	// exec.LookPath("claude") is used; if that fails, NewCLIClient
	// returns ErrClaudeNotFound.
	Binary string

	// AgentPath is the absolute path to the cliwrap-agent binary. If
	// empty, the CLIWRAP_AGENT environment variable is consulted. CKG
	// does NOT install cliwrap-agent; set CLIWRAP_AGENT or pass this
	// field explicitly. See https://github.com/0xmhha/cli-wrapper.
	AgentPath string

	// RuntimeDir is where cli-wrapper stores per-process WAL/state. If
	// empty, a directory under os.TempDir() is created.
	RuntimeDir string
}

CLIClientOptions configures CLIClient construction.
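
The fallback order in the field comments can be sketched roughly as follows. The options type, resolve function, error variables, and the temp-directory prefix are all hypothetical; getenv is injected only to keep the sketch testable.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// Local stand-ins for ErrClaudeNotFound and ErrCliwrapAgentNotFound.
var (
	errClaudeNotFound = errors.New("claude CLI binary not found in PATH")
	errAgentNotFound  = errors.New("cliwrap-agent path not provided")
)

// options mirrors the three CLIClientOptions fields.
type options struct {
	Binary, AgentPath, RuntimeDir string
}

// resolve fills empty fields using the fallbacks the field comments describe.
// Pass os.Getenv as getenv outside of tests.
func resolve(o options, getenv func(string) string) (options, error) {
	if o.Binary == "" {
		path, err := exec.LookPath("claude")
		if err != nil {
			return o, errClaudeNotFound
		}
		o.Binary = path
	}
	if o.AgentPath == "" {
		o.AgentPath = getenv("CLIWRAP_AGENT")
		if o.AgentPath == "" {
			return o, errAgentNotFound
		}
	}
	if o.RuntimeDir == "" {
		// The prefix is illustrative; the real directory name is not documented.
		dir, err := os.MkdirTemp("", "ckg-eval-*")
		if err != nil {
			return o, fmt.Errorf("create runtime dir: %w", err)
		}
		o.RuntimeDir = dir
	}
	return o, nil
}

func main() {
	o, err := resolve(options{Binary: "/usr/local/bin/claude"}, os.Getenv)
	fmt.Printf("%+v err=%v\n", o, err)
}
```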

type Citation ¶

type Citation struct {
	File string
	Line int
}

Citation is a single file:line reference extracted from LLM output.

func ExtractCitations ¶

func ExtractCitations(text string) []Citation

ExtractCitations pulls file:line references from LLM output text.
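
One plausible way to pull file:line references out of free text is a regular expression. The pattern below is an assumption (a path ending in a file extension, a colon, then digits); the pattern the package actually uses is not documented here and may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// citation mirrors the Citation struct's two fields.
type citation struct {
	File string
	Line int
}

// citationRE is an assumed pattern, not the package's own.
var citationRE = regexp.MustCompile(`([A-Za-z0-9_./-]+\.[A-Za-z0-9]+):(\d+)`)

// extractCitations returns every file:line match in text, in order.
func extractCitations(text string) []citation {
	var out []citation
	for _, m := range citationRE.FindAllStringSubmatch(text, -1) {
		line, err := strconv.Atoi(m[2])
		if err != nil {
			continue // line number does not fit in an int; skip it
		}
		out = append(out, citation{File: m[1], Line: line})
	}
	return out
}

func main() {
	text := "See core/blockchain.go:120 and store.go:7."
	for _, c := range extractCitations(text) {
		fmt.Printf("%s line %d\n", c.File, c.Line)
	}
}
```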

type CitationResult ¶

type CitationResult struct {
	Total        int
	FileExists   int
	LineInNode   int
	Hallucinated []Citation
	Precision    float64
}

CitationResult is the per-response classification of file:line citations extracted from an LLM output (T-03).

func ValidateCitations ¶

func ValidateCitations(output string, store pkgstore.Reader) (CitationResult, error)

ValidateCitations checks every file:line citation in output against the graph store. For each citation:

FileExists increments if NodesByFilePath returns any nodes
LineInNode increments if the cited line falls within at least one node's [start_line, end_line] range

Precision = LineInNode / Total (0 when Total == 0). store may be nil — returns zero result without error.
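
The counting rules above can be illustrated with an in-memory stand-in for the store. nodesByFile is a hypothetical substitute for the NodesByFilePath lookup, and the sketch leaves out the Hallucinated list because the rule for populating it is not spelled out here.

```go
package main

import "fmt"

type citation struct {
	File string
	Line int
}

// span is a node's [Start, End] line range, inclusive.
type span struct{ Start, End int }

type citationStats struct {
	Total, FileExists, LineInNode int
	Precision                     float64
}

// validate applies the FileExists / LineInNode rules to each citation.
func validate(cites []citation, nodesByFile map[string][]span) citationStats {
	var st citationStats
	for _, c := range cites {
		st.Total++
		nodes := nodesByFile[c.File]
		if len(nodes) == 0 {
			continue // file unknown to the graph
		}
		st.FileExists++
		for _, n := range nodes {
			if c.Line >= n.Start && c.Line <= n.End {
				st.LineInNode++
				break // one containing node is enough
			}
		}
	}
	if st.Total > 0 {
		st.Precision = float64(st.LineInNode) / float64(st.Total)
	}
	return st
}

func main() {
	nodes := map[string][]span{"core/blockchain.go": {{Start: 100, End: 150}}}
	st := validate([]citation{
		{"core/blockchain.go", 120}, // inside the node
		{"core/blockchain.go", 100}, // boundary line, still inside
		{"core/blockchain.go", 999}, // file exists, line outside any node
		{"missing.go", 1},           // file not in the graph
	}, nodes)
	fmt.Printf("%+v\n", st)
}
```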

type Expected ¶

type Expected struct {
	// symbol_set kind
	Symbols []string `yaml:"symbols,omitempty"`
	// code_patch kind
	MustUseSymbols  []string `yaml:"must_use_symbols,omitempty"`
	MustCall        []string `yaml:"must_call,omitempty"`
	MustNotBreakSig bool     `yaml:"must_not_break_signature,omitempty"`
	// rubric kind
	Rubric []string `yaml:"rubric,omitempty"`
}

type HallucinationResult ¶

type HallucinationResult struct {
	// Total mentions extracted from the output, after lowercase
	// deduplication. A response that mentions "core.NewBlockChain"
	// three times contributes 1 to Total.
	Total int

	// Found is the subset of mentions whose bare name (last
	// dot-segment) resolves in the store, *case-insensitively*.
	Found []string

	// QnameDiverged is the subset of Found where the bare name
	// resolved but no candidate node had a qualified name matching
	// the mentioned dotted form. The mention is plausible but the
	// package prefix is wrong (`eth.NewBlockChain` vs
	// `core.NewBlockChain`). V0 surfaces these for manual triage
	// without counting them against Rate.
	QnameDiverged []string

	// Hallucinated is the subset of mentions where the bare name
	// did not resolve at all.
	Hallucinated []string

	// Rate is the hallucination fraction: len(Hallucinated) / Total.
	Rate float64
}

HallucinationResult is the per-response classification of the symbol mentions extracted from an LLM output.

Rate = len(Hallucinated) / Total. Total = 0 (no mentions extracted) produces Rate = 0; the call site that wants to distinguish "answer had no symbols" from "answer had only valid symbols" reads Total directly.

func ValidateMentions ¶

func ValidateMentions(output string, store pkgstore.Reader) (HallucinationResult, error)

ValidateMentions classifies every symbol mention in `output` as Found, QnameDiverged, or Hallucinated by looking each up in `store`. The tokenizer is extractSymbols (the same path scoreTask uses).

store may be nil — in which case every mention is recorded as Found with Rate = 0. This keeps call sites that lack a graph (e.g. the rubric-only scoring path) from short-circuiting on a nil dereference; hallucination measurement is opt-in.
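
The three-way classification can be sketched under a few stated assumptions: mentions arrive already extracted (extractSymbols is not reproduced), the store is replaced by a hypothetical index from lowercase bare name to known qualified names, and a qualified name "matches" when it equals the mention case-insensitively. The real matching rule may be looser or stricter.

```go
package main

import (
	"fmt"
	"strings"
)

type result struct {
	Total         int
	Found         []string
	QnameDiverged []string
	Hallucinated  []string
	Rate          float64
}

// classify sorts mentions into Found, QnameDiverged, and Hallucinated.
func classify(mentions []string, index map[string][]string) result {
	var res result
	seen := make(map[string]bool)
	for _, m := range mentions {
		key := strings.ToLower(m)
		if seen[key] {
			continue // lowercase deduplication
		}
		seen[key] = true
		res.Total++

		// Bare name is the last dot-segment; LastIndex returns -1 without a dot.
		bare := key[strings.LastIndex(key, ".")+1:]
		candidates := index[bare]
		if len(candidates) == 0 {
			res.Hallucinated = append(res.Hallucinated, m)
			continue
		}
		res.Found = append(res.Found, m)
		if !strings.Contains(m, ".") {
			continue // nothing to diverge from
		}
		matched := false
		for _, q := range candidates {
			if strings.EqualFold(q, m) {
				matched = true
				break
			}
		}
		if !matched {
			res.QnameDiverged = append(res.QnameDiverged, m)
		}
	}
	if res.Total > 0 {
		res.Rate = float64(len(res.Hallucinated)) / float64(res.Total)
	}
	return res
}

func main() {
	index := map[string][]string{"newblockchain": {"core.NewBlockChain"}}
	res := classify([]string{
		"core.NewBlockChain", "CORE.newblockchain", // second is a duplicate
		"eth.NewBlockChain", "NewBlockChain", "Token.mintAll",
	}, index)
	fmt.Printf("%+v\n", res)
}
```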

type LLMClient ¶

type LLMClient interface {
	Complete(ctx context.Context, system, user string) (LLMResult, error)
	CompleteWithTools(ctx context.Context, system, user string,
		store pkgstore.Reader) (LLMResult, error)
	Close() error
}

LLMClient is the abstraction the eval runner uses for completions. The Anthropic Messages API (APIClient) and the Claude Code CLI (CLIClient) both implement it. Close releases backend-specific resources (e.g., shutting down a cli-wrapper Manager); APIClient.Close is a no-op.

CompleteWithTools runs the γ multi-turn tool-use loop via prompt-based pseudo-tool-use (see gamma_loop.go). Both backends implement the same protocol on top of plain Complete calls — no API-specific support required.

type LLMResult ¶

type LLMResult struct {
	OutputText        string
	InputTokens       int
	OutputTokens      int
	CacheReadTokens   int
	CacheCreateTokens int
	NumToolCalls      int
	// NumCachedCalls is populated only by γ (runGammaPromptLoop): it
	// counts (name, args) tuples the LLM re-issued that were served
	// from the in-loop cache instead of re-traversing the store. The
	// sum NumToolCalls + NumCachedCalls reports the total fan-out the
	// model attempted; NumToolCalls alone reports the work the store
	// actually did. Other baselines leave this zero. P2 #6.
	NumCachedCalls  int
	UserPromptBytes int
}

LLMResult bundles a single completion's output text and usage counters. UserPromptBytes is populated only by γ (CompleteWithTools) where the multi-turn loop accumulates a much larger user message than the runner's pre-tool prompt. Other baselines leave it zero and the runner falls back to its own len(user) measurement.
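
The relationship between NumToolCalls and NumCachedCalls can be illustrated with a small in-loop cache keyed on the (name, args) tuple. toolLoop and its call method are hypothetical, and the JSON argument strings are made up for the example; get_subgraph is the only tool name taken from this document.

```go
package main

import "fmt"

// toolLoop tracks executed versus cache-served tool calls.
type toolLoop struct {
	cache          map[string]string
	NumToolCalls   int // calls that actually reached the store
	NumCachedCalls int // repeats served from the cache
}

func newToolLoop() *toolLoop {
	return &toolLoop{cache: make(map[string]string)}
}

// call runs exec once per distinct (name, args) pair and serves repeats
// from the cache. The NUL separator keeps name and args from colliding.
func (l *toolLoop) call(name, args string, exec func(name, args string) string) string {
	key := name + "\x00" + args
	if res, ok := l.cache[key]; ok {
		l.NumCachedCalls++
		return res
	}
	res := exec(name, args)
	l.cache[key] = res
	l.NumToolCalls++
	return res
}

func main() {
	loop := newToolLoop()
	exec := func(name, args string) string { return "result of " + name + args }

	loop.call("get_subgraph", `{"root":"core"}`, exec)
	loop.call("get_subgraph", `{"root":"core"}`, exec) // repeat, served from cache
	loop.call("get_subgraph", `{"root":"eth"}`, exec)

	fmt.Println("store work:   ", loop.NumToolCalls)
	fmt.Println("total fan-out:", loop.NumToolCalls+loop.NumCachedCalls)
}
```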

type Result ¶

type Result struct {
	TaskID   string
	Baseline Baseline
	// RunIdx identifies which of the N repeats of a (task, baseline)
	// pair this row represents (Axis 1, 2026-05-22). Single-shot eval
	// runs leave it at 0; multi-shot runs (--n-runs > 1) fill it with
	// 0..N-1. The report aggregator groups rows by (TaskID, Baseline)
	// and computes mean ± std across RunIdx, surfacing the
	// non-determinism the third smoke run made unmistakable (3 runs
	// of the same fixture produced 0, 0, and 4 hallucinated symbols).
	RunIdx int
	// UserPromptBytes is the application-level size of the
	// per-invocation user prompt the runner built for this row
	// (post-baseline-specific append: raw files for α,
	// get_subgraph result for β, smartContext for δ). It is the
	// only "prompt size" measurement that is independent of
	// claude CLI's internal prompt cache state, which carries
	// Claude Code's workspace context across invocations and
	// inflates cached_tokens to hundreds of thousands. H1's
	// question — "does δ supply less context than α?" — answers
	// cleanly against this field; cached_tokens reads the
	// CLI-side cache pattern instead and is the wrong proxy
	// (audit 2026-05-22).
	UserPromptBytes int
	InputTokens     int
	OutputTokens    int
	CachedTokens    int
	Score           float64
	LatencyMS       int64
	NumToolCalls    int
	Stale           bool
	RawOutput       string

	// Hallucination is the per-response classification of every symbol
	// mention the LLM emitted, looked up against the same store the
	// runner used to answer. T-04 V1 (HANDOFF.md 2026-05-11, wired
	// 2026-05-21). Populated by runOne after scoreTask; nil store on
	// the runOne path is the rubric-only short-circuit and produces
	// Total=0 / Rate=0. The detailed Found/QnameDiverged/Hallucinated
	// lists are surfaced via the report.md path, not CSV, because
	// the lists are variable-length and would balloon CSV column
	// count beyond what spreadsheet readers handle cleanly.
	Hallucination HallucinationResult

	// Citation is the per-response file:line accuracy check (T-03).
	// Populated by runOne after scoreTask; measures whether the LLM's
	// source citations point to real files and valid line ranges.
	Citation CitationResult
}

Result is one row in the CSV.

func Run ¶

func Run(ctx context.Context, tasks []Task, baselines []Baseline,
	graphDir string, llm LLMClient, outDir string, nRuns int) ([]Result, error)

Run loops tasks × baselines × nRuns and writes results.csv plus report.md. Each (task, baseline) pair runs nRuns times; per-run rows carry RunIdx 0..nRuns-1 so the report aggregator can compute mean ± std across repeats (Axis 1, 2026-05-22). nRuns ≤ 0 is treated as 1 for backwards compatibility with single-shot callers.

Run takes ownership of llm: it is Closed when Run returns, regardless of error path. Callers must NOT Close llm themselves.
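
The loop shape and the ownership rule can be sketched as below. runAll, row, and countingCloser are hypothetical; the sketch omits the actual LLM calls and file output and only shows the tasks × baselines × nRuns expansion, the nRuns ≤ 0 fallback, and closing the client on every return path.

```go
package main

import "fmt"

// row carries the identifying fields of one result row.
type row struct {
	TaskID, Baseline string
	RunIdx           int
}

// closer is the part of the client interface this sketch needs.
type closer interface{ Close() error }

// runAll expands tasks × baselines × nRuns and closes llm before returning.
func runAll(tasks, baselines []string, nRuns int, llm closer) (rows []row, err error) {
	// The function owns llm: close it on every return path and surface
	// the Close error when nothing else failed.
	defer func() {
		if cerr := llm.Close(); err == nil {
			err = cerr
		}
	}()
	if nRuns <= 0 {
		nRuns = 1 // single-shot compatibility
	}
	for _, t := range tasks {
		for _, b := range baselines {
			for i := 0; i < nRuns; i++ {
				rows = append(rows, row{TaskID: t, Baseline: b, RunIdx: i})
			}
		}
	}
	return rows, nil
}

// countingCloser records how many times Close was called.
type countingCloser struct{ closed int }

func (c *countingCloser) Close() error { c.closed++; return nil }

func main() {
	llm := &countingCloser{}
	rows, err := runAll([]string{"t1", "t2"}, []string{"alpha", "delta"}, 3, llm)
	fmt.Println(len(rows), err, llm.closed)
}
```

Because the callee closes the client, a caller that also defers Close would close it twice, which is the reason for the "must NOT Close" rule.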

type Scoring ¶

type Scoring struct {
	Type      string             `yaml:"type"` // "precision_recall" | "rubric"
	Threshold map[string]float64 `yaml:"threshold,omitempty"`
}

type Task ¶

type Task struct {
	ID           string   `yaml:"id"`
	Corpus       string   `yaml:"corpus"`      // "synthetic" | "real" | absolute path
	CorpusPath   string   `yaml:"corpus_path"` // optional override
	Description  string   `yaml:"description"`
	ExpectedKind string   `yaml:"expected_kind"` // "symbol_set" | "code_patch" | "rubric"
	Expected     Expected `yaml:"expected"`
	Scoring      Scoring  `yaml:"scoring"`
}

Task mirrors the eval/tasks/*.yaml schema (spec §9.3).

func LoadTasks ¶

func LoadTasks(glob string) ([]Task, error)

LoadTasks reads any *.yaml under glob (e.g. "eval/tasks/synthetic-*.yaml"). Environment variables in corpus_path (e.g. ${STABLENET_SRC}) are expanded via os.ExpandEnv so task YAMLs stay portable across machines.
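
A small sketch of the environment expansion step. expandCorpusPath and demo are hypothetical helpers, and the directory value is made up; YAML parsing is left out to keep the example within the standard library.

```go
package main

import (
	"fmt"
	"os"
)

// expandCorpusPath applies os.ExpandEnv to a corpus_path value.
// Variables that are unset expand to the empty string.
func expandCorpusPath(raw string) string {
	return os.ExpandEnv(raw)
}

// demo sets an illustrative value for STABLENET_SRC and expands a path.
func demo() string {
	os.Setenv("STABLENET_SRC", "/home/dev/stablenet")
	return expandCorpusPath("${STABLENET_SRC}/core")
}

func main() {
	fmt.Println(demo())                           // /home/dev/stablenet/core
	fmt.Println(expandCorpusPath("/abs/corpus")) // /abs/corpus
}
```

Since an unset variable silently becomes an empty string, a task YAML that relies on one can end up with a path like "/core"; checking the expanded path before use is worth considering.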

Directories ¶

Path	Synopsis
retrieval	Package retrieval implements the LLM-free retrieval-accuracy measurement (EV1 Phase 2): load YAML probe fixtures, dispatch each to the matching MCP tool through StoreReader, score result symbols against an expected set with recall / precision / F1.
