package report v0.1.0

Published: May 14, 2026 License: Apache-2.0 Imports: 16 Imported by: 0

README

report — Reporting and Artifact Generation Layer

internal/report is the reporting layer of the skill-up framework. It converts evaluation run results into output files in multiple formats and manages the Anthropic-compatible workspace directory structure. All report generators share a unified Reporter interface that takes an Input and writes to a given path or stdout.

Position in the System

The report module is the last stage of the evaluation pipeline. It receives the evaluation results from internal/judge and produces report artifacts for both developers and CI systems:

Runner executes cases
      │
      ▼
 ┌──────────────┐
 │   Judge      │──▶ judge.Result (evaluation result)
 │  (eval layer)│
 └──────┬───────┘
        │
        ▼
 ┌──────────────┐
 │  Report      │──▶ JSON / HTML / JUnit / Markdown
 │(report layer)│   benchmark.json / grading.json / eval_metadata.json
 └──────────────┘

This corresponds to the Report Generator component in the design doc (docs/design-docs/0.1.0-design.md) and is responsible for:

  • Receiving aggregated case execution results from the Case Runner
  • Computing pass rate and benchmark statistics (mean / stddev / min / max)
  • Generating JSON / JUnit / HTML reports
  • Producing the Anthropic-compatible workspace directory structure (iteration-<N>/)

Feature Overview

File Responsibility
reporter.go Core interface (Reporter) and shared data types (Input, CaseResult, BenchmarkResult, StatValue, etc.)
json.go JSONReporter — machine-readable JSON report (the "source of truth"; other formats are derived views)
html.go HTMLReporter — human-readable HTML evaluation report with summary cards and a per-case detail table
junit.go JUnitReporter — JUnit XML report consumable by CI systems
grading.go Writers for the Anthropic-compatible grading.json and eval_metadata.json
benchmark.go Benchmark statistics computation (mean / stddev / min / max)
benchmark_anthropic.go Type definitions and computation for the Anthropic full-format benchmark.json
benchmark_md.go Human-readable Markdown benchmark report (benchmark.md)
workspace.go IterationWorkspace — manages the Anthropic-compatible iteration-<N>/ directory structure and artifact writes
template_helpers.go Shared helpers for HTML templates (formatting time, percentages, nil checks, etc.)
helpers.go Generic JSON file writer helper (writeJSONFile)
templates/ Embedded HTML template directory (report.html, review.html), loaded via go:embed; the review.html template exists, but its Go generator is not yet implemented

Architecture and Module Relationships

@startuml
skinparam packageStyle rectangle

package "internal/report" {

  interface Reporter {
    +Write(ctx, Input) error
  }

  class Input {
    SkillName : string
    SchemaVersion : string
    EngineName : string
    ModelName : string
    StartTime : time.Time
    EndTime : time.Time
    CaseResults : []CaseResult
    TotalTokens : int
    Benchmark : *BenchmarkResult
    +TotalDuration() time.Duration
    +OverallPassRate() float64
  }

  class CaseResult {
    CaseID : string
    Title : string
    Status : judge.Status
    DurationMs : int64
    Turns : int
    Error : string
    Grading : *judge.Result
  }

  class JSONReporter {
    OutputPath : string
  }

  class HTMLReporter {
    OutputPath : string
  }

  class JUnitReporter {
    OutputPath : string
  }

  class IterationWorkspace {
    RootDir : string
    IterationNum : int
    SkillName : string
    +IterationDir() string
    +CaseDir(caseID) string
    +WithSkillDir(caseID) string
    +WithoutSkillDir(caseID) string
    +EnsureDirs(caseIDs) error
    +EnsureDirsWithBaseline(caseIDs) error
    +WriteResponse(caseID, config, content) error
    +WriteGrading(caseID, config, grading) error
    +WriteEvalMeta(caseID, meta) error
    +WriteBenchmark(bm) error
    +WriteBenchmarkMD(bm) error
    +WriteFile(relPath, data) error
  }

  Reporter <|.. JSONReporter
  Reporter <|.. HTMLReporter
  Reporter <|.. JUnitReporter
  Reporter ..> Input : receives
  Input *-- CaseResult
  Input *-- BenchmarkResult
  CaseResult --> "judge.Result" : Grading
  IterationWorkspace --> AnthropicGrading : WriteGrading()
  IterationWorkspace --> EvalMetadata : WriteEvalMeta()
  IterationWorkspace --> AnthropicBenchmark : WriteBenchmark()
}

package "internal/judge" {
  class Status
  class Result
  class AssertionResult
  class ResultSummary
}

CaseResult --> Status
CaseResult --> Result

note right of JSONReporter
  **"Source of truth"**:
  the JSON report is the basis for all other formats;
  JUnit and HTML are derived views.
end note

note right of IterationWorkspace
  **Anthropic-compatible**:
  writes artifacts under iteration-<N>/ for the given iteration number;
  manages with_skill / without_skill subdirectories.
end note

@enduml

Report Generators

JSONReporter (json.go)
  • Directly serializes Input to formatted JSON
  • Acts as the "source of truth" for every report format
  • When the output path is empty, writes to stdout
HTMLReporter (html.go)
  • Rendered with the standard library's html/template; the template is loaded via go:embed from templates/report.html
  • Bundles responsive CSS styles
  • Displays: skill name, engine, model, start time, execution time, pass rate
  • Summary cards: Total / Passed / Failed / Skipped / Errors / Pass Rate
  • Per-case detail table: status icons, assertion results, evidence
JUnitReporter (junit.go)
  • Generates standard JUnit XML for CI systems (Jenkins, GitHub Actions, etc.) to consume
  • Mapping rules:
    • Each case → <testcase>
    • StatusFail → <failure> element (with details about failed assertions)
    • StatusError → <error> element
    • StatusSkip → <skipped> element

Anthropic-Compatible Data Formats

grading.json (grading.go)
  • AnthropicGrading: contains expectations (per-assertion text / passed / evidence) and summary (passed / failed / total / pass_rate)
  • ConvertToAnthropicGrading(): converts the internal judge.Result into the Anthropic format
  • EvalMetadata: corresponds to per-case eval_metadata.json (eval_id / eval_name / prompt / assertions)
benchmark.json (benchmark.go + benchmark_anthropic.go)

Provides two layers of data structures:

Simplified mode (internal statistics; BenchmarkResult in benchmark.go):

  • BenchmarkStats: pass_rate / time_seconds / tokens, each with mean + stddev
  • BenchmarkDelta: deltas between the two configurations

Anthropic full format (AnthropicBenchmark in benchmark_anthropic.go):

  • BenchmarkMetadata: skill name, path, timestamp, eval ID list
  • BenchmarkRun: per-run details (pass_rate / passed / failed / total / time_seconds / tokens)
  • AnthropicRunSummary: per-configuration statistics summary with mean / stddev / min / max
  • AnthropicDelta: deltas formatted as strings
benchmark.md (benchmark_md.go)
  • Generates a human-readable Markdown benchmark report
  • Includes a summary table (Pass Rate ± StdDev) and per-case results
  • Supports both with-baseline and no-baseline display modes

Workspace Management (workspace.go)

IterationWorkspace manages the Anthropic-compatible evaluation artifact directory layout:

<skill-name>-workspace/
  iteration-<N>/
    benchmark.json
    benchmark.md
    <case-id>/
      eval_metadata.json
      with_skill/
        outputs/
          response.md
        grading.json
      without_skill/          # Optional; only when benchmark.enabled=true
        outputs/
          response.md
        grading.json

Key behaviors:

  • NewIterationWorkspace(): creates the iteration-N/ directory using the iteration number provided; requires N >= 1
  • EnsureDirs() / EnsureDirsWithBaseline(): bulk-create the per-case directory structures; the WithBaseline variant also creates the without_skill subdirectories
  • WriteResponse() / WriteGrading() / WriteEvalMeta() / WriteBenchmark(): write artifact files at the corresponding locations

Template Helpers (template_helpers.go)

Template functions shared by every HTML report:

Function Description
fmtDuration Milliseconds → seconds (e.g. 1500 → "1.5s")
fmtPercent Float → percent (e.g. 0.85 → "85%")
fmtPercentSigned Signed percent (e.g. +0.1 → "+10%")
passFailClass Bool → CSS class ("pass" / "fail")
passFailIcon Bool → HTML icon (✅ / ❌)
notNil Generic nil check (pointer, interface, slice, map, ...)
derefFloat Safely dereference *float64 (nil → 0)
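Two of these helpers can be sketched as html/template functions (illustrative reimplementations; the real package exposes its helpers via SharedTemplateFuncs):

```go
package main

import (
	"fmt"
	"html/template"
	"os"
)

// fmtDuration renders milliseconds as seconds, e.g. 1500 -> "1.5s".
func fmtDuration(ms int64) string {
	return fmt.Sprintf("%.1fs", float64(ms)/1000)
}

// fmtPercent renders a ratio as a percentage, e.g. 0.85 -> "85%".
func fmtPercent(v float64) string {
	return fmt.Sprintf("%.0f%%", v*100)
}

func main() {
	// Funcs must be registered before Parse so the template can resolve them.
	tmpl := template.Must(template.New("row").
		Funcs(template.FuncMap{"fmtDuration": fmtDuration, "fmtPercent": fmtPercent}).
		Parse("{{fmtDuration .Ms}} at {{fmtPercent .Rate}}\n"))
	data := struct {
		Ms   int64
		Rate float64
	}{Ms: 1500, Rate: 0.85}
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
	// prints: 1.5s at 85%
}
```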

Package Dependencies

Dependency Purpose
internal/judge Imports the constants Status, StatusPass, StatusFail, StatusSkip, StatusError and the structs Result, AssertionResult, ResultSummary

Standard library: context, embed, encoding/json, encoding/xml, fmt, html/template, io, math, os, path/filepath, reflect, regexp, sort, strconv, strings, time

Testing

Run all report tests:

go test ./internal/report/ -v -count=1 -timeout 60s

Run specific subsets:

# Reporter interface implementation tests
go test ./internal/report/ -run TestReporter -v

# Benchmark computation tests
go test ./internal/report/ -run TestBenchmark -v
go test ./internal/report/ -run TestBenchmarkMd -v

# Grading conversion tests
go test ./internal/report/ -run TestGrading -v

# Workspace directory management tests
go test ./internal/report/ -run TestWorkspace -v

# E2E integration tests
go test ./internal/report/ -run TestE2E -v
Test file Description
reporter_test.go Reporter interface implementations + Input methods
benchmark_test.go Statistics functions: Mean / StdDev / PassRate / ComputeBenchmark, etc.
benchmark_md_test.go Markdown benchmark report generation
grading_test.go Anthropic grading.json conversion and writing
workspace_test.go IterationWorkspace directory creation and artifact writing
e2e_test.go Full-pipeline integration tests

Documentation

Overview

Package report — benchmark.go implements benchmark.json generation logic.

Benchmark is the final aggregation step of the evaluation pipeline. It computes statistics (mean, stddev) for pass_rate, time, and tokens, and optionally computes the delta between with_skill and without_skill runs.

Two modes (design doc):

  • Simplified mode (default): only with_skill, without_skill=null, delta=null
  • Full mode (benchmark.enabled=true): with_skill + without_skill + delta

Package report — benchmark_anthropic.go implements the Anthropic-compatible benchmark.json format and computation logic.

This file contains types and functions for generating benchmark outputs that are compatible with Anthropic's eval-viewer and skill-creator tooling.

Package report — benchmark_md.go generates human-readable Markdown benchmark reports.

Format matches demo/chinese-jokes-workspace/iteration-1/benchmark.md.

Package report — grading.go implements Anthropic-compatible grading.json and eval_metadata.json writers.

These formats align with Anthropic's skill-creator evaluation outputs, enabling interoperability with eval-viewer and other Anthropic tooling.

Package report — helpers.go provides shared utility functions used across report writers.

Package report emits JSON, JUnit, and HTML reports from evaluation runs.

Package report — template_helpers.go provides shared helper functions for HTML template rendering across different report types.

Package report — workspace.go manages the Anthropic-compatible iteration directory structure for evaluation outputs.

Directory layout:

<skill-name>-workspace/
  iteration-<N>/
    benchmark.json
    benchmark.md
    report.html
    <case-id>/
      eval_metadata.json
      with_skill/
        outputs/
          response.md
        grading.json
      without_skill/          # optional, only when benchmark.enabled=true
        outputs/
          response.md
        grading.json

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func SharedTemplateFuncs

func SharedTemplateFuncs() template.FuncMap

SharedTemplateFuncs returns the template.FuncMap shared across all HTML reporters.

func WithTimestamp

func WithTimestamp(t time.Time) func(*benchmarkOptions)

WithTimestamp sets a fixed timestamp for the benchmark metadata.

func WriteAnthropicBenchmark

func WriteAnthropicBenchmark(path string, bm *AnthropicBenchmark) error

WriteAnthropicBenchmark writes an AnthropicBenchmark to the specified file.

func WriteBenchmarkMarkdown

func WriteBenchmarkMarkdown(w io.Writer, bm *AnthropicBenchmark) error

WriteBenchmarkMarkdown writes a Markdown benchmark report to the given writer.

func WriteBenchmarkMarkdownFile

func WriteBenchmarkMarkdownFile(path string, bm *AnthropicBenchmark) (err error)

WriteBenchmarkMarkdownFile writes the benchmark Markdown report to a file.

func WriteEvalMetadata

func WriteEvalMetadata(path string, meta *EvalMetadata) error

WriteEvalMetadata writes an EvalMetadata to the specified file path as formatted JSON.

func WriteGradingJSON

func WriteGradingJSON(path string, grading *AnthropicGrading) error

WriteGradingJSON writes an AnthropicGrading to the specified file path as formatted JSON.

func WriteHTMLReport

func WriteHTMLReport(ctx context.Context, path string, in Input) error

WriteHTMLReport is a convenience function that writes an HTML report to the specified path.

Types

type AnthropicBenchmark

type AnthropicBenchmark struct {
	Metadata   BenchmarkMetadata   `json:"metadata"`
	Runs       []BenchmarkRun      `json:"runs"`
	RunSummary AnthropicRunSummary `json:"run_summary"`
	Notes      []string            `json:"notes,omitempty"`
}

AnthropicBenchmark corresponds to the full Anthropic benchmark.json schema.

This format includes metadata, per-run details, summary statistics, and optional notes — matching the demo/chinese-jokes-workspace/benchmark.json.

func ComputeAnthropicBenchmark

func ComputeAnthropicBenchmark(
	skillName, skillPath string,
	withSkillRuns []BenchmarkRun,
	withoutSkillRuns []BenchmarkRun,
	opts ...func(*benchmarkOptions),
) *AnthropicBenchmark

ComputeAnthropicBenchmark builds the full Anthropic-compatible benchmark from evaluation run data. A fixed timestamp can be supplied via the WithTimestamp option; otherwise the current UTC time is used.

type AnthropicDelta

type AnthropicDelta struct {
	PassRate    string `json:"pass_rate"`
	TimeSeconds string `json:"time_seconds"`
	Tokens      string `json:"tokens"`
}

AnthropicDelta holds the string-formatted delta values between configurations.

type AnthropicExpectation

type AnthropicExpectation struct {
	Text     string `json:"text"`
	Passed   bool   `json:"passed"`
	Evidence string `json:"evidence"`
}

AnthropicExpectation is a single expectation result in the Anthropic format.

type AnthropicGrading

type AnthropicGrading struct {
	Expectations []AnthropicExpectation `json:"expectations"`
	Summary      AnthropicSummary       `json:"summary"`
}

AnthropicGrading corresponds to the Anthropic grading.json schema.

Example output (from demo/chinese-jokes-workspace):

{
  "expectations": [
    {"text": "...", "passed": true, "evidence": "..."}
  ],
  "summary": {"passed": 5, "failed": 0, "total": 5, "pass_rate": 1.0}
}

func ConvertToAnthropicGrading

func ConvertToAnthropicGrading(result *judge.Result) *AnthropicGrading

ConvertToAnthropicGrading converts an internal judge.Result to the Anthropic grading.json format.

Mapping:

  • judge.AssertionResult.Text -> AnthropicExpectation.Text
  • judge.AssertionResult.Passed -> AnthropicExpectation.Passed
  • judge.AssertionResult.Evidence -> AnthropicExpectation.Evidence
  • judge.ResultSummary -> AnthropicSummary (direct field mapping)
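The field-by-field mapping above is mechanical. A sketch with simplified local stand-in types (assertionResult here is an illustrative mirror, not the real judge.AssertionResult):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// assertionResult is a simplified stand-in for judge.AssertionResult.
type assertionResult struct {
	Text     string
	Passed   bool
	Evidence string
}

// expectation mirrors AnthropicExpectation.
type expectation struct {
	Text     string `json:"text"`
	Passed   bool   `json:"passed"`
	Evidence string `json:"evidence"`
}

// grading mirrors AnthropicGrading with its summary block.
type grading struct {
	Expectations []expectation `json:"expectations"`
	Summary      struct {
		Passed   int     `json:"passed"`
		Failed   int     `json:"failed"`
		Total    int     `json:"total"`
		PassRate float64 `json:"pass_rate"`
	} `json:"summary"`
}

// toGrading copies each assertion field-by-field and aggregates the summary.
func toGrading(assertions []assertionResult) grading {
	var g grading
	g.Expectations = []expectation{}
	for _, a := range assertions {
		g.Expectations = append(g.Expectations, expectation(a))
		if a.Passed {
			g.Summary.Passed++
		} else {
			g.Summary.Failed++
		}
	}
	g.Summary.Total = len(assertions)
	if g.Summary.Total > 0 {
		g.Summary.PassRate = float64(g.Summary.Passed) / float64(g.Summary.Total)
	}
	return g
}

func main() {
	g := toGrading([]assertionResult{{Text: "reply contains a joke", Passed: true, Evidence: "line 3"}})
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(g); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```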

type AnthropicRunSummary

type AnthropicRunSummary struct {
	WithSkill    AnthropicStatSummary  `json:"with_skill"`
	WithoutSkill *AnthropicStatSummary `json:"without_skill"`
	Delta        *AnthropicDelta       `json:"delta"`
}

AnthropicRunSummary holds per-configuration summary statistics.

type AnthropicStatSummary

type AnthropicStatSummary struct {
	PassRate    AnthropicStatValue `json:"pass_rate"`
	TimeSeconds AnthropicStatValue `json:"time_seconds"`
	Tokens      AnthropicStatValue `json:"tokens"`
}

AnthropicStatSummary holds statistics with min/max for a configuration.

type AnthropicStatValue

type AnthropicStatValue struct {
	Mean   float64 `json:"mean"`
	StdDev float64 `json:"stddev"`
	Min    float64 `json:"min"`
	Max    float64 `json:"max"`
}

AnthropicStatValue holds mean, stddev, min, max for a metric.

func ComputeAnthropicStatValue

func ComputeAnthropicStatValue(values []float64) AnthropicStatValue

ComputeAnthropicStatValue computes an AnthropicStatValue from raw values.

type AnthropicSummary

type AnthropicSummary struct {
	Passed   int     `json:"passed"`
	Failed   int     `json:"failed"`
	Total    int     `json:"total"`
	PassRate float64 `json:"pass_rate"`
}

AnthropicSummary holds aggregate pass/fail statistics in the Anthropic format.

type BenchmarkDelta

type BenchmarkDelta struct {
	PassRate     float64 `json:"pass_rate"`
	TimeSeconds  float64 `json:"time_seconds"`
	InputTokens  float64 `json:"input_tokens"`
	OutputTokens float64 `json:"output_tokens"`
}

BenchmarkDelta holds the difference between with_skill and without_skill.

func ComputeBenchmarkDelta

func ComputeBenchmarkDelta(withSkill, withoutSkill BenchmarkStats) BenchmarkDelta

ComputeBenchmarkDelta computes the delta between with_skill and without_skill stats.

type BenchmarkMetadata

type BenchmarkMetadata struct {
	SkillName            string `json:"skill_name"`
	SkillPath            string `json:"skill_path"`
	Timestamp            string `json:"timestamp"`
	EvalsRun             []int  `json:"evals_run"`
	RunsPerConfiguration int    `json:"runs_per_configuration"`
}

BenchmarkMetadata holds skill and execution metadata.

type BenchmarkResult

type BenchmarkResult struct {
	RunSummary BenchmarkRunSummary `json:"run_summary"`
}

BenchmarkResult is the top-level structure for benchmark.json.

func ComputeBenchmark

func ComputeBenchmark(withSkillMetrics []CaseMetrics, withoutSkillMetrics []CaseMetrics) *BenchmarkResult

ComputeBenchmark builds the complete BenchmarkResult.

  • Simplified mode: pass withoutSkillMetrics as nil.
  • Full mode: pass both withSkillMetrics and withoutSkillMetrics.

type BenchmarkRun

type BenchmarkRun struct {
	EvalID        int                    `json:"eval_id"`
	EvalName      string                 `json:"eval_name"`
	Configuration string                 `json:"configuration"`
	RunNumber     int                    `json:"run_number"`
	Result        BenchmarkRunResult     `json:"result"`
	Expectations  []AnthropicExpectation `json:"expectations"`
}

BenchmarkRun holds per-eval, per-configuration run details.

type BenchmarkRunResult

type BenchmarkRunResult struct {
	PassRate    float64 `json:"pass_rate"`
	Passed      int     `json:"passed"`
	Failed      int     `json:"failed"`
	Total       int     `json:"total"`
	TimeSeconds float64 `json:"time_seconds"`
	Tokens      int     `json:"tokens"`
	Errors      int     `json:"errors"`
}

BenchmarkRunResult holds aggregated metrics for a single run.

type BenchmarkRunSummary

type BenchmarkRunSummary struct {
	WithSkill    BenchmarkStats  `json:"with_skill"`
	WithoutSkill *BenchmarkStats `json:"without_skill"`
	Delta        *BenchmarkDelta `json:"delta"`
}

BenchmarkRunSummary holds the stats for with_skill and optionally without_skill.

type BenchmarkStats

type BenchmarkStats struct {
	PassRate     StatValue `json:"pass_rate"`
	TimeSeconds  StatValue `json:"time_seconds"`
	InputTokens  StatValue `json:"input_tokens"`
	OutputTokens StatValue `json:"output_tokens"`
}

BenchmarkStats holds the computed statistics for a run.

func ComputeBenchmarkStats

func ComputeBenchmarkStats(metrics []CaseMetrics) BenchmarkStats

ComputeBenchmarkStats computes BenchmarkStats from a slice of CaseMetrics.

type CaseMetrics

type CaseMetrics struct {
	Passed       bool    // whether the case passed
	TimeSeconds  float64 // execution time in seconds
	InputTokens  float64 // input tokens consumed
	OutputTokens float64 // output tokens consumed
}

CaseMetrics holds the raw metrics for a single case execution.

func ExtractMetrics

func ExtractMetrics(results []CaseResult) []CaseMetrics

ExtractMetrics converts CaseResults into CaseMetrics for benchmark computation.
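The conversion is a unit change plus a status check. A sketch with simplified stand-in types (minimalCase is not the real CaseResult, and the "pass" string stands in for judge.StatusPass):

```go
package main

import "fmt"

// caseMetrics mirrors report.CaseMetrics.
type caseMetrics struct {
	Passed       bool
	TimeSeconds  float64
	InputTokens  float64
	OutputTokens float64
}

// minimalCase is an illustrative stand-in for the CaseResult fields read here.
type minimalCase struct {
	Status       string
	DurationMs   int64
	InputTokens  int
	OutputTokens int
}

// extractMetrics converts per-case results into the float-based metrics
// that the statistics functions consume: ms -> seconds, ints -> floats.
func extractMetrics(results []minimalCase) []caseMetrics {
	out := make([]caseMetrics, len(results))
	for i, r := range results {
		out[i] = caseMetrics{
			Passed:       r.Status == "pass",
			TimeSeconds:  float64(r.DurationMs) / 1000,
			InputTokens:  float64(r.InputTokens),
			OutputTokens: float64(r.OutputTokens),
		}
	}
	return out
}

func main() {
	m := extractMetrics([]minimalCase{{Status: "pass", DurationMs: 1500, InputTokens: 100, OutputTokens: 40}})
	fmt.Printf("%+v\n", m[0])
}
```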

type CaseResult

type CaseResult struct {
	CaseID        string        `json:"case_id"`
	Title         string        `json:"title"`
	Status        judge.Status  `json:"status"`
	DurationMs    int64         `json:"duration_ms"`
	Turns         int           `json:"turns"`
	InputTokens   int           `json:"input_tokens"`
	OutputTokens  int           `json:"output_tokens"`
	Error         string        `json:"error,omitempty"`
	Grading       *judge.Result `json:"grading"`
	Configuration string        `json:"configuration,omitempty"` // "with_skill" or "without_skill"
	Prompt        string        `json:"prompt,omitempty"`        // input prompt sent to the agent
	Response      string        `json:"response,omitempty"`      // agent final message
}

CaseResult represents the result of a single case execution.

type EvalMetadata

type EvalMetadata struct {
	EvalID     int      `json:"eval_id"`
	EvalName   string   `json:"eval_name"`
	Prompt     string   `json:"prompt"`
	Assertions []string `json:"assertions"`
}

EvalMetadata corresponds to the per-case eval_metadata.json file.

Example output:

{
  "eval_id": 1,
  "eval_name": "bored-coding",
  "prompt": "I'm so bored after coding all afternoon, my brain is fried.",
  "assertions": ["...", "..."]
}

type HTMLReporter

type HTMLReporter struct {
	// OutputPath is the file path to write the HTML report.
	// If empty, writes to stdout.
	OutputPath string
}

HTMLReporter writes human-readable HTML summaries. Generates an interactive single-page report with case navigation, collapsible grading details, benchmark visualization, and feedback support.

func (*HTMLReporter) Write

func (r *HTMLReporter) Write(_ context.Context, in Input) error

Write implements the Reporter interface.

type Input

type Input struct {
	SkillName     string           `json:"skill_name"`
	SchemaVersion string           `json:"schema_version"`
	EngineName    string           `json:"engine_name"`
	ModelName     string           `json:"model_name"`
	StartTime     time.Time        `json:"start_time"`
	EndTime       time.Time        `json:"end_time"`
	CaseResults   []CaseResult     `json:"case_results"`
	TotalTokens   int              `json:"total_tokens"`
	Benchmark     *BenchmarkResult `json:"benchmark,omitempty"`
}

Input aggregates run results for reporting.

func (Input) OverallPassRate

func (in Input) OverallPassRate() float64

OverallPassRate calculates the overall pass rate across all cases.

func (Input) TotalDuration

func (in Input) TotalDuration() time.Duration

TotalDuration calculates the total wall-clock duration from StartTime to EndTime. Falls back to summing individual case durations if StartTime/EndTime are not set.

type IterationWorkspace

type IterationWorkspace struct {
	// RootDir is the <skill-name>-workspace directory.
	RootDir string

	// IterationNum is the iteration number (1-based).
	IterationNum int

	// SkillName is the name of the skill being evaluated.
	SkillName string
}

IterationWorkspace manages a single evaluation iteration's artifact output.

func NewIterationWorkspace

func NewIterationWorkspace(outputDir, skillName string, iterNum int) (*IterationWorkspace, error)

NewIterationWorkspace creates a workspace for the given iteration number. iterNum must be >= 1. If outputDir is empty, defaults to "<skillName>-workspace" in the current directory.

func (*IterationWorkspace) CaseDir

func (w *IterationWorkspace) CaseDir(caseID string) string

CaseDir returns the path to a case's directory.

func (*IterationWorkspace) ConfigDir

func (w *IterationWorkspace) ConfigDir(caseID, config string) string

ConfigDir returns the directory for a given configuration ("with_skill" or "without_skill").

func (*IterationWorkspace) EnsureDirs

func (w *IterationWorkspace) EnsureDirs(caseIDs []string) error

EnsureDirs creates all necessary directory structure for the given case IDs.

func (*IterationWorkspace) EnsureDirsWithBaseline

func (w *IterationWorkspace) EnsureDirsWithBaseline(caseIDs []string) error

EnsureDirsWithBaseline creates case directories including without_skill outputs.

func (*IterationWorkspace) IterationDir

func (w *IterationWorkspace) IterationDir() string

IterationDir returns the path to the current iteration directory.

func (*IterationWorkspace) WithSkillDir

func (w *IterationWorkspace) WithSkillDir(caseID string) string

WithSkillDir returns the with_skill subdirectory for a case.

func (*IterationWorkspace) WithoutSkillDir

func (w *IterationWorkspace) WithoutSkillDir(caseID string) string

WithoutSkillDir returns the without_skill subdirectory for a case.

func (*IterationWorkspace) WriteBenchmark

func (w *IterationWorkspace) WriteBenchmark(bm *AnthropicBenchmark) error

WriteBenchmark writes benchmark.json to the iteration directory.

func (*IterationWorkspace) WriteBenchmarkMD

func (w *IterationWorkspace) WriteBenchmarkMD(bm *AnthropicBenchmark) error

WriteBenchmarkMD writes benchmark.md to the iteration directory.

func (*IterationWorkspace) WriteEvalMeta

func (w *IterationWorkspace) WriteEvalMeta(caseID string, meta *EvalMetadata) error

WriteEvalMeta writes eval_metadata.json to the case directory.

func (*IterationWorkspace) WriteFile

func (w *IterationWorkspace) WriteFile(relPath string, data []byte) error

WriteFile writes arbitrary content to a file in the iteration directory.

func (*IterationWorkspace) WriteGrading

func (w *IterationWorkspace) WriteGrading(caseID, config string, grading *AnthropicGrading) error

WriteGrading writes grading.json to the config directory.

func (*IterationWorkspace) WriteResponse

func (w *IterationWorkspace) WriteResponse(caseID, config, content string) error

WriteResponse writes the agent response to outputs/response.md. config is "with_skill" or "without_skill".

type JSONReporter

type JSONReporter struct {
	// OutputPath is the file path to write the JSON report.
	// If empty, writes to stdout.
	OutputPath string
}

JSONReporter writes machine-readable JSON results. JSON is the "source of truth" — JUnit and HTML are derived views.

func (*JSONReporter) Write

func (r *JSONReporter) Write(_ context.Context, in Input) error

Write implements the Reporter interface.

type JUnitReporter

type JUnitReporter struct {
	// OutputPath is the file path to write the JUnit XML report.
	// If empty, writes to stdout.
	OutputPath string
}

JUnitReporter writes JUnit XML for CI systems.

Mapping: each case → <testcase>, failed assertions → <failure> message.

func (*JUnitReporter) Write

func (r *JUnitReporter) Write(_ context.Context, in Input) error

Write implements the Reporter interface.

type Reporter

type Reporter interface {
	Write(ctx context.Context, in Input) error
}

Reporter writes evaluation output to a chosen format.

type StatValue

type StatValue struct {
	Mean   float64 `json:"mean"`
	StdDev float64 `json:"stddev"`
}

StatValue holds mean and standard deviation.
