Documentation ¶
Overview ¶
Package report — benchmark.go implements benchmark.json generation logic.
Benchmark is the final aggregation step of the evaluation pipeline. It computes statistics (mean, stddev) for pass_rate, time, and tokens, and optionally calculates the delta between with_skill and without_skill runs.
Two modes (design doc):
- Simplified mode (default): only with_skill, without_skill=null, delta=null
- Full mode (benchmark.enabled=true): with_skill + without_skill + delta
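The two modes above can be sketched with a standalone aggregation function. The types here are hypothetical stand-ins for the package's real ones, and only pass rate is shown (the package also aggregates time and tokens):

```go
package main

import "fmt"

// stats is a hypothetical stand-in for per-configuration statistics.
type stats struct{ PassRateMean float64 }

// summary mirrors the run_summary shape: WithoutSkill and Delta stay
// nil in simplified mode.
type summary struct {
	WithSkill    stats
	WithoutSkill *stats
	Delta        *float64
}

// summarize aggregates in either mode: pass nil for withoutSkill to get
// simplified mode.
func summarize(withSkill, withoutSkill []bool) summary {
	s := summary{WithSkill: stats{passRate(withSkill)}}
	if withoutSkill != nil {
		ws := stats{passRate(withoutSkill)}
		d := s.WithSkill.PassRateMean - ws.PassRateMean
		s.WithoutSkill = &ws
		s.Delta = &d
	}
	return s
}

func passRate(passed []bool) float64 {
	if len(passed) == 0 {
		return 0
	}
	n := 0
	for _, p := range passed {
		if p {
			n++
		}
	}
	return float64(n) / float64(len(passed))
}

func main() {
	simplified := summarize([]bool{true, true, false}, nil)
	fmt.Println(simplified.WithoutSkill == nil, simplified.Delta == nil) // true true

	full := summarize([]bool{true, true, false}, []bool{true, false, false})
	fmt.Printf("delta=%.3f\n", *full.Delta)
}
```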
Package report — benchmark_anthropic.go implements the Anthropic-compatible benchmark.json format and computation logic.
This file contains types and functions for generating benchmark outputs that are compatible with Anthropic's eval-viewer and skill-creator tooling.
Package report — benchmark_md.go generates human-readable Markdown benchmark reports.
Format matches demo/chinese-jokes-workspace/iteration-1/benchmark.md.
Package report — grading.go implements Anthropic-compatible grading.json and eval_metadata.json writers.
These formats align with Anthropic's skill-creator evaluation outputs, enabling interoperability with eval-viewer and other Anthropic tooling.
Package report — helpers.go provides shared utility functions used across report writers.
Package report emits JSON, JUnit, and HTML reports from evaluation runs.
Package report — template_helpers.go provides shared helper functions for HTML template rendering across different report types.
Package report — workspace.go manages the Anthropic-compatible iteration directory structure for evaluation outputs.
Directory layout:

<skill-name>-workspace/
  iteration-<N>/
    benchmark.json
    benchmark.md
    report.html
    <case-id>/
      eval_metadata.json
      with_skill/
        outputs/
          response.md
        grading.json
      without_skill/        # optional, only when benchmark.enabled=true
        outputs/
          response.md
        grading.json
Index ¶
- func SharedTemplateFuncs() template.FuncMap
- func WithTimestamp(t time.Time) func(*benchmarkOptions)
- func WriteAnthropicBenchmark(path string, bm *AnthropicBenchmark) error
- func WriteBenchmarkMarkdown(w io.Writer, bm *AnthropicBenchmark) error
- func WriteBenchmarkMarkdownFile(path string, bm *AnthropicBenchmark) (err error)
- func WriteEvalMetadata(path string, meta *EvalMetadata) error
- func WriteGradingJSON(path string, grading *AnthropicGrading) error
- func WriteHTMLReport(ctx context.Context, path string, in Input) error
- type AnthropicBenchmark
- type AnthropicDelta
- type AnthropicExpectation
- type AnthropicGrading
- type AnthropicRunSummary
- type AnthropicStatSummary
- type AnthropicStatValue
- type AnthropicSummary
- type BenchmarkDelta
- type BenchmarkMetadata
- type BenchmarkResult
- type BenchmarkRun
- type BenchmarkRunResult
- type BenchmarkRunSummary
- type BenchmarkStats
- type CaseMetrics
- type CaseResult
- type EvalMetadata
- type HTMLReporter
- type Input
- type IterationWorkspace
- func (w *IterationWorkspace) CaseDir(caseID string) string
- func (w *IterationWorkspace) ConfigDir(caseID, config string) string
- func (w *IterationWorkspace) EnsureDirs(caseIDs []string) error
- func (w *IterationWorkspace) EnsureDirsWithBaseline(caseIDs []string) error
- func (w *IterationWorkspace) IterationDir() string
- func (w *IterationWorkspace) WithSkillDir(caseID string) string
- func (w *IterationWorkspace) WithoutSkillDir(caseID string) string
- func (w *IterationWorkspace) WriteBenchmark(bm *AnthropicBenchmark) error
- func (w *IterationWorkspace) WriteBenchmarkMD(bm *AnthropicBenchmark) error
- func (w *IterationWorkspace) WriteEvalMeta(caseID string, meta *EvalMetadata) error
- func (w *IterationWorkspace) WriteFile(relPath string, data []byte) error
- func (w *IterationWorkspace) WriteGrading(caseID, config string, grading *AnthropicGrading) error
- func (w *IterationWorkspace) WriteResponse(caseID, config, content string) error
- type JSONReporter
- type JUnitReporter
- type Reporter
- type StatValue
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func SharedTemplateFuncs ¶
func SharedTemplateFuncs() template.FuncMap
SharedTemplateFuncs returns the template.FuncMap shared across all HTML reporters.
func WithTimestamp ¶
func WithTimestamp(t time.Time) func(*benchmarkOptions)
WithTimestamp sets a fixed timestamp for the benchmark metadata.
func WriteAnthropicBenchmark ¶
func WriteAnthropicBenchmark(path string, bm *AnthropicBenchmark) error
WriteAnthropicBenchmark writes an AnthropicBenchmark to the specified file.
func WriteBenchmarkMarkdown ¶
func WriteBenchmarkMarkdown(w io.Writer, bm *AnthropicBenchmark) error
WriteBenchmarkMarkdown writes a Markdown benchmark report to the given writer.
func WriteBenchmarkMarkdownFile ¶
func WriteBenchmarkMarkdownFile(path string, bm *AnthropicBenchmark) (err error)
WriteBenchmarkMarkdownFile writes the benchmark Markdown report to a file.
func WriteEvalMetadata ¶
func WriteEvalMetadata(path string, meta *EvalMetadata) error
WriteEvalMetadata writes an EvalMetadata to the specified file path as formatted JSON.
func WriteGradingJSON ¶
func WriteGradingJSON(path string, grading *AnthropicGrading) error
WriteGradingJSON writes an AnthropicGrading to the specified file path as formatted JSON.
Types ¶
type AnthropicBenchmark ¶
type AnthropicBenchmark struct {
Metadata BenchmarkMetadata `json:"metadata"`
Runs []BenchmarkRun `json:"runs"`
RunSummary AnthropicRunSummary `json:"run_summary"`
Notes []string `json:"notes,omitempty"`
}
AnthropicBenchmark corresponds to the full Anthropic benchmark.json schema.
This format includes metadata, per-run details, summary statistics, and optional notes — matching the demo/chinese-jokes-workspace/benchmark.json.
func ComputeAnthropicBenchmark ¶
func ComputeAnthropicBenchmark(
	skillName, skillPath string,
	withSkillRuns []BenchmarkRun,
	withoutSkillRuns []BenchmarkRun,
	opts ...func(*benchmarkOptions),
) *AnthropicBenchmark
ComputeAnthropicBenchmark builds the full Anthropic-compatible benchmark from evaluation run data. A fixed timestamp can be supplied via WithTimestamp; if none is provided, the current UTC time is used.
type AnthropicDelta ¶
type AnthropicDelta struct {
PassRate string `json:"pass_rate"`
TimeSeconds string `json:"time_seconds"`
Tokens string `json:"tokens"`
}
AnthropicDelta holds the string-formatted delta values between configurations.
type AnthropicExpectation ¶
type AnthropicExpectation struct {
Text string `json:"text"`
Passed bool `json:"passed"`
Evidence string `json:"evidence"`
}
AnthropicExpectation is a single expectation result in the Anthropic format.
type AnthropicGrading ¶
type AnthropicGrading struct {
Expectations []AnthropicExpectation `json:"expectations"`
Summary AnthropicSummary `json:"summary"`
}
AnthropicGrading corresponds to the Anthropic grading.json schema.
Example output (from demo/chinese-jokes-workspace):
{
  "expectations": [
    {"text": "...", "passed": true, "evidence": "..."}
  ],
  "summary": {"passed": 5, "failed": 0, "total": 5, "pass_rate": 1.0}
}
func ConvertToAnthropicGrading ¶
func ConvertToAnthropicGrading(result *judge.Result) *AnthropicGrading
ConvertToAnthropicGrading converts an internal judge.Result to the Anthropic grading.json format.
Mapping:
- judge.AssertionResult.Text -> AnthropicExpectation.Text
- judge.AssertionResult.Passed -> AnthropicExpectation.Passed
- judge.AssertionResult.Evidence -> AnthropicExpectation.Evidence
- judge.ResultSummary -> AnthropicSummary (direct field mapping)
type AnthropicRunSummary ¶
type AnthropicRunSummary struct {
WithSkill AnthropicStatSummary `json:"with_skill"`
WithoutSkill *AnthropicStatSummary `json:"without_skill"`
Delta *AnthropicDelta `json:"delta"`
}
AnthropicRunSummary holds per-configuration summary statistics.
type AnthropicStatSummary ¶
type AnthropicStatSummary struct {
PassRate AnthropicStatValue `json:"pass_rate"`
TimeSeconds AnthropicStatValue `json:"time_seconds"`
Tokens AnthropicStatValue `json:"tokens"`
}
AnthropicStatSummary holds statistics with min/max for a configuration.
type AnthropicStatValue ¶
type AnthropicStatValue struct {
Mean float64 `json:"mean"`
StdDev float64 `json:"stddev"`
Min float64 `json:"min"`
Max float64 `json:"max"`
}
AnthropicStatValue holds mean, stddev, min, max for a metric.
func ComputeAnthropicStatValue ¶
func ComputeAnthropicStatValue(values []float64) AnthropicStatValue
ComputeAnthropicStatValue computes an AnthropicStatValue from raw values.
type AnthropicSummary ¶
type AnthropicSummary struct {
Passed int `json:"passed"`
Failed int `json:"failed"`
Total int `json:"total"`
PassRate float64 `json:"pass_rate"`
}
AnthropicSummary holds aggregate pass/fail statistics in the Anthropic format.
type BenchmarkDelta ¶
type BenchmarkDelta struct {
PassRate float64 `json:"pass_rate"`
TimeSeconds float64 `json:"time_seconds"`
InputTokens float64 `json:"input_tokens"`
OutputTokens float64 `json:"output_tokens"`
}
BenchmarkDelta holds the difference between with_skill and without_skill.
func ComputeBenchmarkDelta ¶
func ComputeBenchmarkDelta(withSkill, withoutSkill BenchmarkStats) BenchmarkDelta
ComputeBenchmarkDelta computes the delta between with_skill and without_skill stats.
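The delta is presumably the per-metric difference of means, so a positive pass-rate delta favors with_skill. A sketch with hypothetical local types (only two of the four metrics shown):

```go
package main

import "fmt"

type statVal struct{ Mean float64 }

type benchStats struct {
	PassRate, TimeSeconds statVal
}

type delta struct {
	PassRate, TimeSeconds float64
}

// computeDelta subtracts without_skill means from with_skill means, so
// a positive pass-rate delta means the skill helped.
func computeDelta(with, without benchStats) delta {
	return delta{
		PassRate:    with.PassRate.Mean - without.PassRate.Mean,
		TimeSeconds: with.TimeSeconds.Mean - without.TimeSeconds.Mean,
	}
}

func main() {
	d := computeDelta(
		benchStats{statVal{0.9}, statVal{12.5}},
		benchStats{statVal{0.6}, statVal{15.0}},
	)
	fmt.Printf("pass_rate %+.2f, time %+.1fs\n", d.PassRate, d.TimeSeconds)
}
```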
type BenchmarkMetadata ¶
type BenchmarkMetadata struct {
SkillName string `json:"skill_name"`
SkillPath string `json:"skill_path"`
Timestamp string `json:"timestamp"`
EvalsRun []int `json:"evals_run"`
RunsPerConfiguration int `json:"runs_per_configuration"`
}
BenchmarkMetadata holds skill and execution metadata.
type BenchmarkResult ¶
type BenchmarkResult struct {
RunSummary BenchmarkRunSummary `json:"run_summary"`
}
BenchmarkResult is the top-level structure for benchmark.json.
func ComputeBenchmark ¶
func ComputeBenchmark(withSkillMetrics []CaseMetrics, withoutSkillMetrics []CaseMetrics) *BenchmarkResult
ComputeBenchmark builds the complete BenchmarkResult.
- Simplified mode: pass withoutSkillMetrics as nil.
- Full mode: pass both withSkillMetrics and withoutSkillMetrics.
type BenchmarkRun ¶
type BenchmarkRun struct {
EvalID int `json:"eval_id"`
EvalName string `json:"eval_name"`
Configuration string `json:"configuration"`
RunNumber int `json:"run_number"`
Result BenchmarkRunResult `json:"result"`
Expectations []AnthropicExpectation `json:"expectations"`
}
BenchmarkRun holds per-eval, per-configuration run details.
type BenchmarkRunResult ¶
type BenchmarkRunResult struct {
PassRate float64 `json:"pass_rate"`
Passed int `json:"passed"`
Failed int `json:"failed"`
Total int `json:"total"`
TimeSeconds float64 `json:"time_seconds"`
Tokens int `json:"tokens"`
Errors int `json:"errors"`
}
BenchmarkRunResult holds aggregated metrics for a single run.
type BenchmarkRunSummary ¶
type BenchmarkRunSummary struct {
WithSkill BenchmarkStats `json:"with_skill"`
WithoutSkill *BenchmarkStats `json:"without_skill"`
Delta *BenchmarkDelta `json:"delta"`
}
BenchmarkRunSummary holds the stats for with_skill and optionally without_skill.
type BenchmarkStats ¶
type BenchmarkStats struct {
PassRate StatValue `json:"pass_rate"`
TimeSeconds StatValue `json:"time_seconds"`
InputTokens StatValue `json:"input_tokens"`
OutputTokens StatValue `json:"output_tokens"`
}
BenchmarkStats holds the computed statistics for a run.
func ComputeBenchmarkStats ¶
func ComputeBenchmarkStats(metrics []CaseMetrics) BenchmarkStats
ComputeBenchmarkStats computes BenchmarkStats from a slice of CaseMetrics.
type CaseMetrics ¶
type CaseMetrics struct {
Passed bool // whether the case passed
TimeSeconds float64 // execution time in seconds
InputTokens float64 // input tokens consumed
OutputTokens float64 // output tokens consumed
}
CaseMetrics holds the raw metrics for a single case execution.
func ExtractMetrics ¶
func ExtractMetrics(results []CaseResult) []CaseMetrics
ExtractMetrics converts CaseResults into CaseMetrics for benchmark computation.
type CaseResult ¶
type CaseResult struct {
CaseID string `json:"case_id"`
Title string `json:"title"`
Status judge.Status `json:"status"`
DurationMs int64 `json:"duration_ms"`
Turns int `json:"turns"`
InputTokens int `json:"input_tokens"`
OutputTokens int `json:"output_tokens"`
Error string `json:"error,omitempty"`
Grading *judge.Result `json:"grading"`
Configuration string `json:"configuration,omitempty"` // "with_skill" or "without_skill"
Prompt string `json:"prompt,omitempty"` // input prompt sent to the agent
Response string `json:"response,omitempty"` // agent final message
}
CaseResult represents the result of a single case execution.
type EvalMetadata ¶
type EvalMetadata struct {
EvalID int `json:"eval_id"`
EvalName string `json:"eval_name"`
Prompt string `json:"prompt"`
Assertions []string `json:"assertions"`
}
EvalMetadata corresponds to the per-case eval_metadata.json file.
Example output:
{
  "eval_id": 1,
  "eval_name": "bored-coding",
  "prompt": "I'm so bored after coding all afternoon, my brain is fried.",
  "assertions": ["...", "..."]
}
type HTMLReporter ¶
type HTMLReporter struct {
// OutputPath is the file path to write the HTML report.
// If empty, writes to stdout.
OutputPath string
}
HTMLReporter writes human-readable HTML summaries. Generates an interactive single-page report with case navigation, collapsible grading details, benchmark visualization, and feedback support.
type Input ¶
type Input struct {
SkillName string `json:"skill_name"`
SchemaVersion string `json:"schema_version"`
EngineName string `json:"engine_name"`
ModelName string `json:"model_name"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
CaseResults []CaseResult `json:"case_results"`
TotalTokens int `json:"total_tokens"`
Benchmark *BenchmarkResult `json:"benchmark,omitempty"`
}
Input aggregates run results for reporting.
func (Input) OverallPassRate ¶
OverallPassRate calculates the overall pass rate across all cases.
func (Input) TotalDuration ¶
TotalDuration calculates the total wall-clock duration from StartTime to EndTime. Falls back to summing individual case durations if StartTime/EndTime are not set.
type IterationWorkspace ¶
type IterationWorkspace struct {
// RootDir is the <skill-name>-workspace directory.
RootDir string
// IterationNum is the iteration number (1-based).
IterationNum int
// SkillName is the name of the skill being evaluated.
SkillName string
}
IterationWorkspace manages a single evaluation iteration's artifact output.
func NewIterationWorkspace ¶
func NewIterationWorkspace(outputDir, skillName string, iterNum int) (*IterationWorkspace, error)
NewIterationWorkspace creates a workspace for the given iteration number. iterNum must be >= 1. If outputDir is empty, defaults to "<skillName>-workspace" in the current directory.
func (*IterationWorkspace) CaseDir ¶
func (w *IterationWorkspace) CaseDir(caseID string) string
CaseDir returns the path to a case's directory.
func (*IterationWorkspace) ConfigDir ¶
func (w *IterationWorkspace) ConfigDir(caseID, config string) string
ConfigDir returns the directory for a given configuration ("with_skill" or "without_skill").
func (*IterationWorkspace) EnsureDirs ¶
func (w *IterationWorkspace) EnsureDirs(caseIDs []string) error
EnsureDirs creates all necessary directory structure for the given case IDs.
func (*IterationWorkspace) EnsureDirsWithBaseline ¶
func (w *IterationWorkspace) EnsureDirsWithBaseline(caseIDs []string) error
EnsureDirsWithBaseline creates case directories including without_skill outputs.
func (*IterationWorkspace) IterationDir ¶
func (w *IterationWorkspace) IterationDir() string
IterationDir returns the path to the current iteration directory.
func (*IterationWorkspace) WithSkillDir ¶
func (w *IterationWorkspace) WithSkillDir(caseID string) string
WithSkillDir returns the with_skill subdirectory for a case.
func (*IterationWorkspace) WithoutSkillDir ¶
func (w *IterationWorkspace) WithoutSkillDir(caseID string) string
WithoutSkillDir returns the without_skill subdirectory for a case.
func (*IterationWorkspace) WriteBenchmark ¶
func (w *IterationWorkspace) WriteBenchmark(bm *AnthropicBenchmark) error
WriteBenchmark writes benchmark.json to the iteration directory.
func (*IterationWorkspace) WriteBenchmarkMD ¶
func (w *IterationWorkspace) WriteBenchmarkMD(bm *AnthropicBenchmark) error
WriteBenchmarkMD writes benchmark.md to the iteration directory.
func (*IterationWorkspace) WriteEvalMeta ¶
func (w *IterationWorkspace) WriteEvalMeta(caseID string, meta *EvalMetadata) error
WriteEvalMeta writes eval_metadata.json to the case directory.
func (*IterationWorkspace) WriteFile ¶
func (w *IterationWorkspace) WriteFile(relPath string, data []byte) error
WriteFile writes arbitrary content to a file in the iteration directory.
func (*IterationWorkspace) WriteGrading ¶
func (w *IterationWorkspace) WriteGrading(caseID, config string, grading *AnthropicGrading) error
WriteGrading writes grading.json to the config directory.
func (*IterationWorkspace) WriteResponse ¶
func (w *IterationWorkspace) WriteResponse(caseID, config, content string) error
WriteResponse writes the agent response to outputs/response.md. config is "with_skill" or "without_skill".
type JSONReporter ¶
type JSONReporter struct {
// OutputPath is the file path to write the JSON report.
// If empty, writes to stdout.
OutputPath string
}
JSONReporter writes machine-readable JSON results. JSON is the source of truth; JUnit and HTML are derived views.
type JUnitReporter ¶
type JUnitReporter struct {
// OutputPath is the file path to write the JUnit XML report.
// If empty, writes to stdout.
OutputPath string
}
JUnitReporter writes JUnit XML for CI systems.
Mapping: each case → <testcase>, failed assertions → <failure> message.