Documentation ¶
Overview ¶
Package summaryeval is a quality-gate harness for session summaries.
What this solves ¶
Session summarization is a lossy LLM operation. Every change to the pipeline — swapping models, tuning prompts, adjusting pkg/tokenopt, adding a tokenstrip stage, landing a REDACT.md LLM redactor — can silently degrade summary quality in ways unit tests won't catch. Without a quality gate, "does the new thing make summaries worse?" can only be answered by flying blind.
Design ¶
The harness scores a candidate summary against a hand-reviewed reference summary using a rubric of five dimensions:
- Title fidelity — semantic match against reference title
- Summary coverage — keyword overlap against reference summary
- Key actions recall — set overlap over bullet points
- Outcome correctness — exact match of success|partial|failed
- Aha moments — count match + positional overlap
Each dimension produces a 0.0–1.0 score. A weighted aggregate gives an overall session score. A corpus of N sessions gives an aggregate distribution that can be tracked over time and gated in CI.
Why deterministic scoring (v1) ¶
The scorer uses lexical metrics (Jaccard similarity, set overlap, exact match) rather than an LLM judge. This is a deliberate v1 choice:
- Deterministic — same input always produces same score. Regressions are attributable to the thing that changed, not model variance.
- Fast and free — no API calls. Runs on every CI build.
- Self-contained — stdlib only, no Anthropic SDK dependency.
Lexical metrics are approximate. A summary that paraphrases using different vocabulary may score lower than one that copies the reference. That's a known tradeoff. v2 can introduce an optional LLM-judge scorer on top of the same Rubric/Score shapes for cases where semantic equivalence matters.
Curating the golden corpus ¶
See pkg/summaryeval/CORPUS.md (or the README in the testdata directory) for the process: pick 20–30 diverse sessions, hand-review and polish summaries to a trusted reference, then run the harness to establish the baseline score distribution.
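A minimal sketch of establishing that baseline, assuming the candidate summaries have already been produced and keyed by session name (the corpus path here is illustrative):

corpus, err := summaryeval.LoadCorpus("testdata/corpus") // illustrative path
if err != nil {
	log.Fatal(err)
}

// candidates maps session name -> the pipeline's candidate summary.
// How they are produced (re-running the distiller, reading saved JSON
// with LoadCandidate, ...) is up to the caller.
candidates := map[string]summaryeval.Summary{ /* ... */ }

report := summaryeval.ScoreCorpus(corpus, candidates, summaryeval.DefaultWeights(), nil)
fmt.Printf("corpus=%d mean=%.3f min=%.3f max=%.3f\n",
	report.CorpusSize, report.OverallMean, report.OverallMin, report.OverallMax)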
Index ¶
- Constants
- func BuildJudgePrompt(reference *Summary, candidate Summary, opts JudgeOptions) string
- func Dimensions() []string
- type AhaMoment
- type Completer
- type CompletionResult
- type DimensionScore
- type Gates
- type GoldenSession
- type Judge
- type JudgeOptions
- type JudgeResult
- type Report
- type SessionScore
- type Summary
- type Weights
Constants ¶
const (
	DimTitle      = "title"
	DimSummary    = "summary"
	DimKeyActions = "key_actions"
	DimOutcome    = "outcome"
	DimAhaMoments = "aha_moments"
)
Dimension names for the scored rubric dimensions. They are constants so typos fail at compile time and callers can iterate over a known set.
Variables ¶
This section is empty.
Functions ¶
func BuildJudgePrompt ¶
func BuildJudgePrompt(reference *Summary, candidate Summary, opts JudgeOptions) string
BuildJudgePrompt constructs the prompt text sent to the LLM judge. Paired mode (reference != nil) asks for semantic equivalence; absolute mode (reference == nil) asks for on-merits evaluation against the rubric's definition of a good summary.
The prompt asks for a strict JSON response to simplify parsing and make the judge's output CI-gateable. Models that reply with free-form text instead will fail to parse and surface as errors.
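A short sketch of both modes; golden and candidate are assumed to come from the caller (e.g. LoadGoldenSession and the pipeline under test):

// Paired mode: judge semantic equivalence against a curated reference.
paired := summaryeval.BuildJudgePrompt(&golden.Reference, candidate, summaryeval.JudgeOptions{})

// Absolute mode: judge the candidate on its own merits.
absolute := summaryeval.BuildJudgePrompt(nil, candidate, summaryeval.JudgeOptions{IncludeSuggestions: true})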
func Dimensions ¶
func Dimensions() []string
Dimensions returns the canonical ordered list. Callers use this for stable report ordering.
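For example, a report printer might iterate this list instead of hard-coding dimension names (sketch; report is a Report returned by ScoreCorpus):

for _, dim := range summaryeval.Dimensions() {
	fmt.Printf("%-12s %.3f\n", dim, report.DimensionMeans[dim])
}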
Types ¶
type AhaMoment ¶
AhaMoment is a minimal moment shape for scoring. We score count match and approximate sequence-position overlap, not the full highlight text.
type Completer ¶
type Completer func(ctx context.Context, prompt string) (CompletionResult, error)
Completer is the abstraction for calling an LLM. Callers provide the actual implementation; pkg/summaryeval does NOT depend on any specific SDK (Anthropic, OpenAI, Bedrock, etc.). This keeps the package importable with only stdlib and leaves all API concerns — authentication, retries, rate limiting, model selection, streaming — to the caller.
The prompt is the full text to send. The return is the model's raw response text. Errors propagate up to the Judge caller.
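A minimal sketch of a Completer; callLLM and toCompletionResult are hypothetical caller-owned helpers, since the SDK wiring and the CompletionResult field population belong to the caller:

var complete summaryeval.Completer = func(ctx context.Context, prompt string) (summaryeval.CompletionResult, error) {
	// callLLM is whatever client the caller owns (Anthropic SDK, OpenAI,
	// Bedrock, an internal proxy, ...); auth, retries, and model
	// selection live there, not in pkg/summaryeval.
	raw, err := callLLM(ctx, prompt)
	if err != nil {
		return summaryeval.CompletionResult{}, err
	}
	// toCompletionResult copies the response text (and token counts,
	// when the API reports them; 0 when unknown) into a CompletionResult.
	return toCompletionResult(raw), nil
}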
type CompletionResult ¶
CompletionResult is the response shape Completer returns. Token counts are optional — pass 0 when unknown.
type DimensionScore ¶
type DimensionScore struct {
Dimension string `json:"dimension"`
Score float64 `json:"score"`
Reason string `json:"reason,omitempty"`
}
DimensionScore is one dimension's 0.0–1.0 result for a single session with a reason string explaining how it was computed.
type Gates ¶
type Gates struct {
MinOverall float64 `json:"min_overall,omitempty"`
MinDimensions map[string]float64 `json:"min_dimensions,omitempty"`
}
Gates is an optional set of minimum-score thresholds. Any dimension or overall score below its threshold fails the gate and surfaces in Report.GatesFailed. Useful as a CI regression guard.
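An illustrative gate configuration; the thresholds are examples, not recommended values:

gates := &summaryeval.Gates{
	MinOverall: 0.75,
	MinDimensions: map[string]float64{
		summaryeval.DimOutcome:    0.90,
		summaryeval.DimKeyActions: 0.60,
	},
}
report := summaryeval.ScoreCorpus(corpus, candidates, summaryeval.DefaultWeights(), gates)
if len(report.GatesFailed) > 0 {
	log.Fatalf("summary quality gate failed: %v", report.GatesFailed)
}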
type GoldenSession ¶
type GoldenSession struct {
// Name identifies the session (matches ledger session dir name).
Name string `json:"name"`
// Notes is free-form human context explaining why this session was
// chosen and what a good summary should capture. Not scored; helps
// future curators.
Notes string `json:"notes,omitempty"`
// Reference is the trusted summary: what a great distillation looks
// like for this session. Candidates are scored against this.
Reference Summary `json:"reference"`
}
GoldenSession is a single entry in the reference corpus: a session name + the hand-reviewed reference summary we score candidates against. Stored on disk as <corpus>/<session_name>/reference.json.
func LoadCorpus ¶
func LoadCorpus(corpusDir string) ([]GoldenSession, error)
LoadCorpus walks corpusDir and returns all GoldenSessions found. Each subdirectory containing reference.json is a session. Order is lexicographic by directory name for reproducible reports.
func LoadGoldenSession ¶
func LoadGoldenSession(corpusDir, sessionName string) (*GoldenSession, error)
LoadGoldenSession reads a GoldenSession from a corpus directory layout:
<corpus>/<session_name>/reference.json
Returns (nil, nil) if the session dir doesn't exist — callers can distinguish "not curated" from "exists but broken".
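A sketch of handling the three outcomes (the corpus path and session name are illustrative):

gs, err := summaryeval.LoadGoldenSession("testdata/corpus", "session-001")
if err != nil {
	log.Fatal(err) // session dir exists but reference.json is missing or malformed
}
if gs == nil {
	return // session hasn't been curated yet
}
// gs.Reference is the trusted summary to score candidates against.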
type Judge ¶
type Judge interface {
Score(ctx context.Context, name string, reference *Summary, candidate Summary) (JudgeResult, error)
}
Judge evaluates a candidate Summary semantically, returning a JudgeResult. Complements the deterministic rubric Score (scorer.go) for cases where lexical metrics aren't enough — paraphrased summaries that convey the same meaning with different vocabulary score poorly on Jaccard but should score well on semantic equivalence.
Two modes:
- Paired: Score with a non-nil reference. The judge evaluates semantic equivalence between candidate and reference.
- Absolute: Score with a nil reference. The judge evaluates the candidate on its own merits against the rubric description, without needing a curated corpus.
Absolute mode is what the daemon runs in production — no corpus maintenance required, just "is this summary good on its face?"
func NewJudge ¶
func NewJudge(c Completer, opts JudgeOptions) Judge
NewJudge constructs a Judge backed by the given Completer. The judge uses a fixed prompt template (see BuildJudgePrompt) and parses the LLM's JSON response. Callers that need different prompts or response shapes should implement Judge directly.
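A sketch of running a Judge in absolute mode, assuming complete is a Completer like the one sketched above and candidate is the summary under evaluation:

judge := summaryeval.NewJudge(complete, summaryeval.JudgeOptions{
	ModelHint:          "haiku", // hint only; the Completer may ignore it
	IncludeSuggestions: true,
})
jr, err := judge.Score(ctx, "session-001", nil, candidate) // nil reference = absolute mode
if err != nil {
	log.Fatal(err)
}
ss := jr.ToSessionScore() // feeds the same Report aggregation as the deterministic scorer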
type JudgeOptions ¶
type JudgeOptions struct {
// ModelHint is a free-form tag the Judge can use to signal the
// desired model class to the Completer. The Completer may ignore
// it. Example: "haiku" for cheap/fast; "opus" for deep evaluation.
// If empty, the Completer picks.
ModelHint string
// IncludeSuggestions, when true, asks the judge for up to 5 concrete
// suggestions for improving the summary. Slight prompt length
// increase; usually worth it for diagnostic mode.
IncludeSuggestions bool
// MaxRationaleChars caps the rationale length the judge is asked
// to produce. Default: 600. Set lower to reduce completion tokens.
MaxRationaleChars int
}
JudgeOptions tune a Judge's behavior.
type JudgeResult ¶
type JudgeResult struct {
// Name identifies the session scored.
Name string `json:"name"`
// Dimensions are per-dimension 0.0-1.0 scores (same dimension
// names as Rubric). Absent dimensions are treated as 0.0 when
// converting to SessionScore.
Dimensions []DimensionScore `json:"dimensions"`
// Overall is the judge's aggregate verdict, 0.0-1.0.
Overall float64 `json:"overall"`
// Rationale is a short human-readable explanation of the verdict.
// Useful for humans debugging "why did this summary score 0.4?"
Rationale string `json:"rationale,omitempty"`
// Suggestions is a short list of specific, actionable fixes the
// judge thinks would improve the summary. Surfaced to the user
// via diagnostic output; consumed by CI in "hint to the prompt
// engineer" mode.
Suggestions []string `json:"suggestions,omitempty"`
// ModelUsed identifies which model produced the judgment, for
// reproducibility tracking ("haiku-4-5", "opus-4-7", etc.).
ModelUsed string `json:"model_used,omitempty"`
// DurationMs captures end-to-end judge latency including the LLM
// call. Useful for operational telemetry.
DurationMs int64 `json:"duration_ms,omitempty"`
// PromptTokens / CompletionTokens are the raw token counts the
// LLM reported, when available. Used for cost attribution.
PromptTokens int `json:"prompt_tokens,omitempty"`
CompletionTokens int `json:"completion_tokens,omitempty"`
}
JudgeResult is what a Judge returns for one session.
func (JudgeResult) LogValue ¶
func (jr JudgeResult) LogValue() slog.Value
LogValue implements slog.LogValuer so callers can emit one-line judge telemetry via the existing slog path.
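For example (the message and keys here are illustrative):

slog.Info("summary judged", "session", jr.Name, "judge", jr)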
func (JudgeResult) ToSessionScore ¶
func (jr JudgeResult) ToSessionScore() SessionScore
ToSessionScore converts a JudgeResult to a SessionScore so it composes with the same ScoreCorpus / Report aggregation the deterministic scorer uses. The overall score is taken verbatim from the judge; individual dimensions carry through.
type Report ¶
type Report struct {
ScoredAt time.Time `json:"scored_at"`
CorpusSize int `json:"corpus_size"`
OverallMean float64 `json:"overall_mean"`
OverallMin float64 `json:"overall_min"`
OverallMax float64 `json:"overall_max"`
DimensionMeans map[string]float64 `json:"dimension_means"`
Sessions []SessionScore `json:"sessions"`
// GatesFailed lists any minimum-threshold gates that didn't pass.
// Empty = all gates met (or no gates configured).
GatesFailed []string `json:"gates_failed,omitempty"`
}
Report is the aggregate result of running an eval over a corpus.
func ScoreCorpus ¶
func ScoreCorpus(corpus []GoldenSession, candidates map[string]Summary, w Weights, gates *Gates) Report
ScoreCorpus evaluates each golden session against its paired candidate and returns an aggregated Report. The candidates map is keyed by session name. Sessions with no paired candidate are scored as all-zeros (missing_candidate reason) and counted in the corpus size.
Gates (optional) are checked after aggregation; any unmet thresholds appear in Report.GatesFailed.
type SessionScore ¶
type SessionScore struct {
Name string `json:"name"`
Dimensions []DimensionScore `json:"dimensions"`
Overall float64 `json:"overall"`
}
SessionScore is the full per-session result: all dimensions + a weighted aggregate.
func Score ¶
func Score(name string, reference, candidate Summary, w Weights) SessionScore
Score evaluates a candidate summary against a reference using the given weights. Each dimension produces a 0.0–1.0 score; the overall is the weighted sum. All scoring is deterministic and lexical — same input always produces the same score. No LLM calls.
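A sketch scoring a single session, where golden comes from LoadGoldenSession and candidate is the pipeline's output:

ss := summaryeval.Score(golden.Name, golden.Reference, candidate, summaryeval.DefaultWeights())
for _, d := range ss.Dimensions {
	fmt.Printf("%-12s %.2f  %s\n", d.Dimension, d.Score, d.Reason)
}
fmt.Printf("overall: %.2f\n", ss.Overall)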
type Summary ¶
type Summary struct {
Title string `json:"title"`
Summary string `json:"summary"`
KeyActions []string `json:"key_actions"`
Outcome string `json:"outcome"`
AhaMoments []AhaMoment `json:"aha_moments,omitempty"`
TopicsFound []string `json:"topics_found,omitempty"`
}
Summary is a minimal shape mirroring the fields from pkg/sessionsummary.SummarizeResponse that the rubric actually scores. Kept separate so summaryeval has no dependency on sessionsummary — the two evolve independently, and the eval harness doesn't care about the full summary schema (quality_score, sageox_score, chapter lists, etc. are out of scope).
func LoadCandidate ¶
LoadCandidate reads a candidate Summary from a JSON file. Used when comparing a distiller's output against a golden reference.
type Weights ¶
type Weights struct {
Title float64 `json:"title"`
Summary float64 `json:"summary"`
KeyActions float64 `json:"key_actions"`
Outcome float64 `json:"outcome"`
AhaMoments float64 `json:"aha_moments"`
}
Weights are the rubric's per-dimension contribution to the overall score. Must sum to 1.0. Defaults chosen deliberately:
- Title carries user-visible fidelity of the session's identity
- Summary is the longest-form signal
- Key actions are the "what was done" spine
- Outcome is small but binary-important (success vs failed)
- Aha moments are secondary but measurable
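A custom rubric is a plain struct literal; the values below are illustrative, not the package defaults, and still sum to 1.0:

w := summaryeval.Weights{
	Title:      0.15,
	Summary:    0.35,
	KeyActions: 0.25,
	Outcome:    0.15,
	AhaMoments: 0.10,
}
// Pass w to Score or ScoreCorpus in place of DefaultWeights().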
func DefaultWeights ¶
func DefaultWeights() Weights
DefaultWeights returns the canonical rubric weights (sum = 1.0).