Documentation ¶
Overview ¶
Package report — benchmark.go implements benchmark.json generation logic.
Benchmark is the final aggregation step of the evaluation pipeline. It computes statistics (mean, stddev) for pass_rate, time, and tokens, and optionally calculates the delta between with_skill and without_skill runs.
Two modes (design doc):
- Simplified mode (default): only with_skill, without_skill=null, delta=null
- Full mode (benchmark.enabled=true): with_skill + without_skill + delta
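The two modes above can be sketched with a standalone aggregation function. The types here are hypothetical stand-ins for the package's real ones, and only pass rate is shown (the package also aggregates time and tokens):

```go
package main

import "fmt"

// stats is a hypothetical stand-in for per-configuration statistics.
type stats struct{ PassRateMean float64 }

// summary mirrors the run_summary shape: WithoutSkill and Delta stay
// nil in simplified mode.
type summary struct {
	WithSkill    stats
	WithoutSkill *stats
	Delta        *float64
}

// summarize aggregates in either mode: pass nil for withoutSkill to get
// simplified mode.
func summarize(withSkill, withoutSkill []bool) summary {
	s := summary{WithSkill: stats{passRate(withSkill)}}
	if withoutSkill != nil {
		ws := stats{passRate(withoutSkill)}
		d := s.WithSkill.PassRateMean - ws.PassRateMean
		s.WithoutSkill = &ws
		s.Delta = &d
	}
	return s
}

func passRate(passed []bool) float64 {
	if len(passed) == 0 {
		return 0
	}
	n := 0
	for _, p := range passed {
		if p {
			n++
		}
	}
	return float64(n) / float64(len(passed))
}

func main() {
	simplified := summarize([]bool{true, true, false}, nil)
	fmt.Println(simplified.WithoutSkill == nil, simplified.Delta == nil) // true true

	full := summarize([]bool{true, true, false}, []bool{true, false, false})
	fmt.Printf("delta=%.3f\n", *full.Delta)
}
```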
Package report — benchmark_anthropic.go implements the Anthropic-compatible benchmark.json format and computation logic.
This file contains types and functions for generating benchmark outputs that are compatible with Anthropic's eval-viewer and skill-creator tooling.
Package report — benchmark_md.go generates human-readable Markdown benchmark reports.
Format matches demo/chinese-jokes-workspace/iteration-1/benchmark.md.
Package report — grading.go implements Anthropic-compatible grading.json and eval_metadata.json writers.
These formats align with Anthropic's skill-creator evaluation outputs, enabling interoperability with eval-viewer and other Anthropic tooling.
Package report — helpers.go provides shared utility functions used across report writers.
Package report emits JSON, JUnit, and HTML reports from evaluation runs.
Package report — template_helpers.go provides shared helper functions for HTML template rendering across different report types.
Package report — workspace.go manages the Anthropic-compatible iteration directory structure for evaluation outputs.
Directory layout:

<skill-name>-workspace/
  iteration-<N>/
    benchmark.json
    benchmark.md
    report.html
    <case-id>/
      eval_metadata.json
      with_skill/
        outputs/
          response.md
        grading.json
      without_skill/        # optional, only when benchmark.enabled=true
        outputs/
          response.md
        grading.json
Index ¶
- func SharedTemplateFuncs() template.FuncMap
- func WithTimestamp(t time.Time) func(*benchmarkOptions)
- func WriteAnthropicBenchmark(path string, bm *AnthropicBenchmark) error
- func WriteBenchmarkMarkdown(w io.Writer, bm *AnthropicBenchmark) error
- func WriteBenchmarkMarkdownFile(path string, bm *AnthropicBenchmark) (err error)
- func WriteEvalMetadata(path string, meta *EvalMetadata) error
- func WriteGradingJSON(path string, grading *AnthropicGrading) error
- func WriteHTMLReport(ctx context.Context, path string, in Input) error
- type AnthropicBenchmark
- type AnthropicDelta
- type AnthropicExpectation
- type AnthropicGrading
- type AnthropicRunSummary
- type AnthropicStatSummary
- type AnthropicStatValue
- type AnthropicSummary
- type BenchmarkDelta
- type BenchmarkMetadata
- type BenchmarkResult
- type BenchmarkRun
- type BenchmarkRunResult
- type BenchmarkRunSummary
- type BenchmarkStats
- type CaseMetrics
- type CaseResult
- type EvalMetadata
- type HTMLReporter
- type Input
- type IterationWorkspace
- func (w *IterationWorkspace) CaseDir(caseID string) string
- func (w *IterationWorkspace) ConfigDir(caseID, config string) string
- func (w *IterationWorkspace) EnsureDirs(caseIDs []string) error
- func (w *IterationWorkspace) EnsureDirsWithBaseline(caseIDs []string) error
- func (w *IterationWorkspace) IterationDir() string
- func (w *IterationWorkspace) WithSkillDir(caseID string) string
- func (w *IterationWorkspace) WithoutSkillDir(caseID string) string
- func (w *IterationWorkspace) WriteBenchmark(bm *AnthropicBenchmark) error
- func (w *IterationWorkspace) WriteBenchmarkMD(bm *AnthropicBenchmark) error
- func (w *IterationWorkspace) WriteEvalMeta(caseID string, meta *EvalMetadata) error
- func (w *IterationWorkspace) WriteFile(relPath string, data []byte) error
- func (w *IterationWorkspace) WriteGrading(caseID, config string, grading *AnthropicGrading) error
- func (w *IterationWorkspace) WriteResponse(caseID, config, content string) error
- type JSONReporter
- type JUnitReporter
- type Reporter
- type StatValue
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func SharedTemplateFuncs ¶
func SharedTemplateFuncs() template.FuncMap
SharedTemplateFuncs returns the template.FuncMap shared across all HTML reporters.
func WithTimestamp ¶
func WithTimestamp(t time.Time) func(*benchmarkOptions)
WithTimestamp sets a fixed timestamp for the benchmark metadata.
func WriteAnthropicBenchmark ¶
func WriteAnthropicBenchmark(path string, bm *AnthropicBenchmark) error
WriteAnthropicBenchmark writes an AnthropicBenchmark to the specified file.
func WriteBenchmarkMarkdown ¶
func WriteBenchmarkMarkdown(w io.Writer, bm *AnthropicBenchmark) error
WriteBenchmarkMarkdown writes a Markdown benchmark report to the given writer.
func WriteBenchmarkMarkdownFile ¶
func WriteBenchmarkMarkdownFile(path string, bm *AnthropicBenchmark) (err error)
WriteBenchmarkMarkdownFile writes the benchmark Markdown report to a file.
func WriteEvalMetadata ¶
func WriteEvalMetadata(path string, meta *EvalMetadata) error
WriteEvalMetadata writes an EvalMetadata to the specified file path as formatted JSON.
func WriteGradingJSON ¶
func WriteGradingJSON(path string, grading *AnthropicGrading) error
WriteGradingJSON writes an AnthropicGrading to the specified file path as formatted JSON.
Types ¶
type AnthropicBenchmark ¶
type AnthropicBenchmark struct {
Metadata BenchmarkMetadata `json:"metadata"`
Runs []BenchmarkRun `json:"runs"`
RunSummary AnthropicRunSummary `json:"run_summary"`
Notes []string `json:"notes,omitempty"`
}
AnthropicBenchmark corresponds to the full Anthropic benchmark.json schema.
This format includes metadata, per-run details, summary statistics, and optional notes — matching the demo/chinese-jokes-workspace/benchmark.json.
func ComputeAnthropicBenchmark ¶
func ComputeAnthropicBenchmark(
	skillName, skillPath string,
	withSkillRuns []BenchmarkRun,
	withoutSkillRuns []BenchmarkRun,
	opts ...func(*benchmarkOptions),
) *AnthropicBenchmark
ComputeAnthropicBenchmark builds the full Anthropic-compatible benchmark from evaluation run data. A fixed timestamp can be supplied via WithTimestamp; if none is provided, the current UTC time is used.
type AnthropicDelta ¶
type AnthropicDelta struct {
PassRate string `json:"pass_rate"`
TimeSeconds string `json:"time_seconds"`
Tokens string `json:"tokens"`
}
AnthropicDelta holds the string-formatted delta values between configurations.
type AnthropicExpectation ¶
type AnthropicExpectation struct {
Text string `json:"text"`
Passed bool `json:"passed"`
Evidence string `json:"evidence"`
}
AnthropicExpectation is a single expectation result in the Anthropic format.
type AnthropicGrading ¶
type AnthropicGrading struct {
Expectations []AnthropicExpectation `json:"expectations"`
Summary AnthropicSummary `json:"summary"`
}
AnthropicGrading corresponds to the Anthropic grading.json schema.
Example output (from demo/chinese-jokes-workspace):
{
  "expectations": [
    {"text": "...", "passed": true, "evidence": "..."}
  ],
  "summary": {"passed": 5, "failed": 0, "total": 5, "pass_rate": 1.0}
}
func ConvertToAnthropicGrading ¶
func ConvertToAnthropicGrading(result *judge.Result) *AnthropicGrading
ConvertToAnthropicGrading converts an internal judge.Result to the Anthropic grading.json format.
Mapping:
- judge.AssertionResult.Text -> AnthropicExpectation.Text
- judge.AssertionResult.Passed -> AnthropicExpectation.Passed
- judge.AssertionResult.Evidence -> AnthropicExpectation.Evidence
- judge.ResultSummary -> AnthropicSummary (direct field mapping)
type AnthropicRunSummary ¶
type AnthropicRunSummary struct {
WithSkill AnthropicStatSummary `json:"with_skill"`
WithoutSkill *AnthropicStatSummary `json:"without_skill"`
Delta *AnthropicDelta `json:"delta"`
}
AnthropicRunSummary holds per-configuration summary statistics.
type AnthropicStatSummary ¶
type AnthropicStatSummary struct {
PassRate AnthropicStatValue `json:"pass_rate"`
TimeSeconds AnthropicStatValue `json:"time_seconds"`
Tokens AnthropicStatValue `json:"tokens"`
}
AnthropicStatSummary holds statistics with min/max for a configuration.
type AnthropicStatValue ¶
type AnthropicStatValue struct {
Mean float64 `json:"mean"`
StdDev float64 `json:"stddev"`
Min float64 `json:"min"`
Max float64 `json:"max"`
}
AnthropicStatValue holds mean, stddev, min, max for a metric.
func ComputeAnthropicStatValue ¶
func ComputeAnthropicStatValue(values []float64) AnthropicStatValue
ComputeAnthropicStatValue computes an AnthropicStatValue from raw values.
type AnthropicSummary ¶
type AnthropicSummary struct {
Passed int `json:"passed"`
Failed int `json:"failed"`
Total int `json:"total"`
PassRate float64 `json:"pass_rate"`
}
AnthropicSummary holds aggregate pass/fail statistics in the Anthropic format.
type BenchmarkDelta ¶
type BenchmarkDelta struct {
PassRate float64 `json:"pass_rate"`
TimeSeconds float64 `json:"time_seconds"`
InputTokens float64 `json:"input_tokens"`
OutputTokens float64 `json:"output_tokens"`
}
BenchmarkDelta holds the difference between with_skill and without_skill.
func ComputeBenchmarkDelta ¶
func ComputeBenchmarkDelta(withSkill, withoutSkill BenchmarkStats) BenchmarkDelta
ComputeBenchmarkDelta computes the delta between with_skill and without_skill stats.
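The delta is presumably the per-metric difference of means, so a positive pass-rate delta favors with_skill. A sketch with hypothetical local types (only two of the four metrics shown):

```go
package main

import "fmt"

type statVal struct{ Mean float64 }

type benchStats struct {
	PassRate, TimeSeconds statVal
}

type delta struct {
	PassRate, TimeSeconds float64
}

// computeDelta subtracts without_skill means from with_skill means, so
// a positive pass-rate delta means the skill helped.
func computeDelta(with, without benchStats) delta {
	return delta{
		PassRate:    with.PassRate.Mean - without.PassRate.Mean,
		TimeSeconds: with.TimeSeconds.Mean - without.TimeSeconds.Mean,
	}
}

func main() {
	d := computeDelta(
		benchStats{statVal{0.9}, statVal{12.5}},
		benchStats{statVal{0.6}, statVal{15.0}},
	)
	fmt.Printf("pass_rate %+.2f, time %+.1fs\n", d.PassRate, d.TimeSeconds)
}
```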
type BenchmarkMetadata ¶
type BenchmarkMetadata struct {
SkillName string `json:"skill_name"`
SkillPath string `json:"skill_path"`
Timestamp string `json:"timestamp"`
EvalsRun []int `json:"evals_run"`
RunsPerConfiguration int `json:"runs_per_configuration"`
}
BenchmarkMetadata holds skill and execution metadata.
type BenchmarkResult ¶
type BenchmarkResult struct {
RunSummary BenchmarkRunSummary `json:"run_summary"`
}
BenchmarkResult is the top-level structure for benchmark.json.
func ComputeBenchmark ¶
func ComputeBenchmark(withSkillMetrics []CaseMetrics, withoutSkillMetrics []CaseMetrics) *BenchmarkResult
ComputeBenchmark builds the complete BenchmarkResult.
- Simplified mode: pass withoutSkillMetrics as nil.
- Full mode: pass both withSkillMetrics and withoutSkillMetrics.
type BenchmarkRun ¶
type BenchmarkRun struct {
EvalID int `json:"eval_id"`
EvalName string `json:"eval_name"`
Configuration string `json:"configuration"`
RunNumber int `json:"run_number"`
Result BenchmarkRunResult `json:"result"`
Expectations []AnthropicExpectation `json:"expectations"`
}
BenchmarkRun holds per-eval, per-configuration run details.
type BenchmarkRunResult ¶
type BenchmarkRunResult struct {
PassRate float64 `json:"pass_rate"`
Passed int `json:"passed"`
Failed int `json:"failed"`
Total int `json:"total"`
TimeSeconds float64 `json:"time_seconds"`
Tokens int `json:"tokens"`
Errors int `json:"errors"`
}
BenchmarkRunResult holds aggregated metrics for a single run.
type BenchmarkRunSummary ¶
type BenchmarkRunSummary struct {
WithSkill BenchmarkStats `json:"with_skill"`
WithoutSkill *BenchmarkStats `json:"without_skill"`
Delta *BenchmarkDelta `json:"delta"`
}
BenchmarkRunSummary holds the stats for with_skill and optionally without_skill.
type BenchmarkStats ¶
type BenchmarkStats struct {
PassRate StatValue `json:"pass_rate"`
TimeSeconds StatValue `json:"time_seconds"`
InputTokens StatValue `json:"input_tokens"`
OutputTokens StatValue `json:"output_tokens"`
}
BenchmarkStats holds the computed statistics for a run.
func ComputeBenchmarkStats ¶
func ComputeBenchmarkStats(metrics []CaseMetrics) BenchmarkStats
ComputeBenchmarkStats computes BenchmarkStats from a slice of CaseMetrics.
type CaseMetrics ¶
type CaseMetrics struct {
Passed bool // whether the case passed
TimeSeconds float64 // execution time in seconds
InputTokens float64 // input tokens consumed
OutputTokens float64 // output tokens consumed
}
CaseMetrics holds the raw metrics for a single case execution.
func ExtractMetrics ¶
func ExtractMetrics(results []CaseResult) []CaseMetrics
ExtractMetrics converts CaseResults into CaseMetrics for benchmark computation.
type CaseResult ¶
type CaseResult struct {
CaseID string `json:"case_id"`
Title string `json:"title"`
Status judge.Status `json:"status"`
DurationMs int64 `json:"duration_ms"`
Turns int `json:"turns"`
InputTokens int `json:"input_tokens"`
OutputTokens int `json:"output_tokens"`
Error string `json:"error,omitempty"`
Grading *judge.Result `json:"grading"`
Configuration string `json:"configuration,omitempty"` // "with_skill" or "without_skill"
Prompt string `json:"prompt,omitempty"` // input prompt sent to the agent
Response string `json:"response,omitempty"` // agent final message
}
CaseResult represents the result of a single case execution.
type EvalMetadata ¶
type EvalMetadata struct {
EvalID int `json:"eval_id"`
EvalName string `json:"eval_name"`
Prompt string `json:"prompt"`
Assertions []string `json:"assertions"`
}
EvalMetadata corresponds to the per-case eval_metadata.json file.
Example output:
{
  "eval_id": 1,
  "eval_name": "bored-coding",
  "prompt": "I'm so bored after coding all afternoon, my brain is fried.",
  "assertions": ["...", "..."]
}
type HTMLReporter ¶
type HTMLReporter struct {
// OutputPath is the file path to write the HTML report.
// If empty, writes to stdout.
OutputPath string
}
HTMLReporter writes human-readable HTML summaries. Generates an interactive single-page report with case navigation, collapsible grading details, benchmark visualization, and feedback support.
type Input ¶
type Input struct {
SkillName string `json:"skill_name"`
SchemaVersion string `json:"schema_version"`
EngineName string `json:"engine_name"`
ModelName string `json:"model_name"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
CaseResults []CaseResult `json:"case_results"`
TotalTokens int `json:"total_tokens"`
Benchmark *BenchmarkResult `json:"benchmark,omitempty"`
}
Input aggregates run results for reporting.
func (Input) OverallPassRate ¶
OverallPassRate calculates the overall pass rate across all cases.
func (Input) TotalDuration ¶
TotalDuration calculates the total wall-clock duration from StartTime to EndTime. Falls back to summing individual case durations if StartTime/EndTime are not set.
type IterationWorkspace ¶
type IterationWorkspace struct {
// RootDir is the <skill-name>-workspace directory.
RootDir string
// IterationNum is the iteration number (1-based).
IterationNum int
// SkillName is the name of the skill being evaluated.
SkillName string
}
IterationWorkspace manages a single evaluation iteration's artifact output.
func NewIterationWorkspace ¶
func NewIterationWorkspace(outputDir, skillName string, iterNum int) (*IterationWorkspace, error)
NewIterationWorkspace creates a workspace for the given iteration number. iterNum must be >= 1. If outputDir is empty, defaults to "<skillName>-workspace" in the current directory.
func (*IterationWorkspace) CaseDir ¶
func (w *IterationWorkspace) CaseDir(caseID string) string
CaseDir returns the path to a case's directory.
func (*IterationWorkspace) ConfigDir ¶
func (w *IterationWorkspace) ConfigDir(caseID, config string) string
ConfigDir returns the directory for a given configuration ("with_skill" or "without_skill").
func (*IterationWorkspace) EnsureDirs ¶
func (w *IterationWorkspace) EnsureDirs(caseIDs []string) error
EnsureDirs creates all necessary directory structure for the given case IDs.
func (*IterationWorkspace) EnsureDirsWithBaseline ¶
func (w *IterationWorkspace) EnsureDirsWithBaseline(caseIDs []string) error
EnsureDirsWithBaseline creates case directories including without_skill outputs.
func (*IterationWorkspace) IterationDir ¶
func (w *IterationWorkspace) IterationDir() string
IterationDir returns the path to the current iteration directory.
func (*IterationWorkspace) WithSkillDir ¶
func (w *IterationWorkspace) WithSkillDir(caseID string) string
WithSkillDir returns the with_skill subdirectory for a case.
func (*IterationWorkspace) WithoutSkillDir ¶
func (w *IterationWorkspace) WithoutSkillDir(caseID string) string
WithoutSkillDir returns the without_skill subdirectory for a case.
func (*IterationWorkspace) WriteBenchmark ¶
func (w *IterationWorkspace) WriteBenchmark(bm *AnthropicBenchmark) error
WriteBenchmark writes benchmark.json to the iteration directory.
func (*IterationWorkspace) WriteBenchmarkMD ¶
func (w *IterationWorkspace) WriteBenchmarkMD(bm *AnthropicBenchmark) error
WriteBenchmarkMD writes benchmark.md to the iteration directory.
func (*IterationWorkspace) WriteEvalMeta ¶
func (w *IterationWorkspace) WriteEvalMeta(caseID string, meta *EvalMetadata) error
WriteEvalMeta writes eval_metadata.json to the case directory.
func (*IterationWorkspace) WriteFile ¶
func (w *IterationWorkspace) WriteFile(relPath string, data []byte) error
WriteFile writes arbitrary content to a file in the iteration directory.
func (*IterationWorkspace) WriteGrading ¶
func (w *IterationWorkspace) WriteGrading(caseID, config string, grading *AnthropicGrading) error
WriteGrading writes grading.json to the config directory.
func (*IterationWorkspace) WriteResponse ¶
func (w *IterationWorkspace) WriteResponse(caseID, config, content string) error
WriteResponse writes the agent response to outputs/response.md. config is "with_skill" or "without_skill".
type JSONReporter ¶
type JSONReporter struct {
// OutputPath is the file path to write the JSON report.
// If empty, writes to stdout.
OutputPath string
}
JSONReporter writes machine-readable JSON results. JSON is the source of truth; JUnit and HTML are derived views.
type JUnitReporter ¶
type JUnitReporter struct {
// OutputPath is the file path to write the JUnit XML report.
// If empty, writes to stdout.
OutputPath string
}
JUnitReporter writes JUnit XML for CI systems.
Mapping: each case → <testcase>, failed assertions → <failure> message.