bench

package module
v0.0.0-...-e209eb6 Latest
Warning

This package is not in the latest version of its module.

Published: May 5, 2026 | License: MIT | Imports: 17 | Imported by: 0

README

bench

General-purpose accuracy and quality benchmarking framework for Go

Note: This package is for model/system quality evaluation (accuracy, ranking, calibration, regression), not Go micro-benchmarks. For CPU/memory micro-benchmarks see go test -bench and the per-package *_test.go files.

Think of bench as testing.B for classification accuracy, ranking quality, probability calibration, and regression error. Evaluators are backed by gokit providers, datasets flow through pipelines, and metrics are fully pluggable.

Features

  • Generics-first: Sample[L], Prediction[L], Evaluator[L] are parameterised on the label type
  • Provider integration: any provider.RequestResponse becomes an evaluator with one adapter call
  • Pipeline integration: datasets expose a pipeline.Pipeline / pipeline.Iterator for lazy, backpressure-aware loading
  • Pluggable metrics: classification, probability, ranking, regression, matching, or bring your own
  • Multiple output formats: JSON, Markdown, CSV, HTML, JUnit XML, Vega-Lite, SVG visualisations
  • Comparison & regression detection: diff two runs, surface fixed/regressed samples, gate CI on thresholds
  • CLI helpers: CLIRunner wires up run → store → compare → print in a few lines
  • Concurrent evaluation: fan out across evaluators with configurable concurrency and per-sample timeouts

Install

go get github.com/kbukum/gokit/bench@latest

Quick Start

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/kbukum/gokit/bench"
	"github.com/kbukum/gokit/bench/metric"
	"github.com/kbukum/gokit/bench/report"
)

func main() {
	ctx := context.Background()

	// 1. Define an evaluator (wraps any prediction function).
	eval := bench.EvaluatorFunc("my-classifier",
		func(ctx context.Context, input []byte) (bench.Prediction[string], error) {
			// Replace with your model / API call.
			return bench.Prediction[string]{
				Label: "positive",
				Score: 0.92,
				Scores: map[string]float64{
					"positive": 0.92,
					"negative": 0.08,
				},
			}, nil
		},
	)

	// 2. Create a runner with metrics.
	runner := bench.NewBenchRunner(
		bench.WithTag[string]("v1.0"),
		bench.WithConcurrency[string](8),
		bench.WithMetrics(
			metric.AsRunMetric(metric.BinaryClassification[string]("positive")),
			metric.AsRunMetric(metric.AUCROC[string]("positive")),
			metric.AsRunMetric(metric.BrierScore[string]("positive")),
		),
	)

	// 3. Register one or more evaluators (branches).
	runner.Register("baseline", eval)

	// 4. Load a dataset (directory with manifest.json + sample files).
	dataset := bench.NewDatasetLoader("./testdata", func(s string) (string, error) {
		return s, nil // string labels → string
	})

	// 5. Run the benchmark.
	result, err := runner.Run(ctx, dataset)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// 6. Generate a Markdown report.
	_ = report.Markdown().Generate(os.Stdout, result)
}

Key Types & Functions

| Symbol | Kind | Description |
|---|---|---|
| Sample[L] | struct | Labeled data point: ID, Input, Label, Source, Metadata |
| Prediction[L] | struct | Evaluator output: Label, Score, per-label Scores, Metadata |
| ScoredSample[L] | struct | Pairs a Sample with its Prediction |
| Evaluator[L] | interface | provider.RequestResponse[[]byte, Prediction[L]] |
| EvaluatorFunc[L] | func | Wraps a plain func(ctx, []byte) (Prediction[L], error) as an Evaluator |
| FromProvider[I,O,L] | func | Adapts any provider.RequestResponse[I,O] into an Evaluator[L] |
| DatasetLoader[L] | struct | Reads a manifest directory into []Sample[L] or a pipeline.Pipeline |
| LabelMapper[L] | func | func(string) (L, error); converts manifest string labels to typed L |
| BenchRunner[L] | struct | Orchestrates evaluation: load → evaluate → compute metrics → store |
| RunResult | struct | Full benchmark output: metrics, branch results, per-sample details, curves |
| RunComparator | struct | Diffs two RunResults, reports metric changes & sample regressions |
| CLIRunner | struct | Convenience wrapper (run, compare, list, show) that writes to an io.Writer |
| FileStorage | struct | Stores RunResult as JSON files on disk |
| RunStorage | interface | Save / Load / Latest / List for benchmark results |

Sub-packages

| Package | Description |
|---|---|
| bench/metric | Metric implementations: classification, probability, ranking, regression, matching |
| bench/report | Output-format reporters: JSON, Markdown, CSV, Table, JUnit, Vega-Lite, HTML |
| bench/viz | Pure-Go SVG visualisation generation: ROC, confusion matrix, calibration, distribution, comparison |
| bench/storage | Cloud-storage adapter for bench results; wraps gokit/storage |

Available Metrics

Classification

| Constructor | Description |
|---|---|
| BinaryClassification[L](positiveLabel, ...ClassificationOption) | Precision, recall, F1, accuracy, FPR + confusion counts |
| ConfusionMatrix[L](labels) | Full N×N confusion matrix |
| ThresholdSweep[L](positiveLabel, thresholds) | Metrics at each threshold (default 0.1–0.9) |
| MultiClassClassification[L](labels) | Macro / micro / weighted precision, recall, F1 |

Probability & Calibration

| Constructor | Description |
|---|---|
| AUCROC[L](positiveLabel) | Area under the ROC curve |
| BrierScore[L](positiveLabel) | Mean squared error of predicted probabilities (lower is better) |
| LogLoss[L](positiveLabel) | Logarithmic loss (cross-entropy) |
| Calibration[L](positiveLabel, bins) | Calibration curve: predicted probability vs actual frequency |

Ranking

| Constructor | Description |
|---|---|
| NDCG[L](k) | Normalised Discounted Cumulative Gain at k |
| MAP[L](positiveLabel) | Mean Average Precision |
| PrecisionAtK[L](positiveLabel, k) | Precision at top k |
| RecallAtK[L](positiveLabel, k) | Recall at top k |

Regression

| Constructor | Description |
|---|---|
| MAE() | Mean Absolute Error (Metric[float64]) |
| MSE() | Mean Squared Error |
| RMSE() | Root Mean Squared Error |
| RSquared() | Coefficient of determination (R²) |

Matching

| Constructor | Description |
|---|---|
| ExactMatch[L]() | Fraction of exact label matches |
| FuzzyMatch(threshold) | Levenshtein-based string similarity (Metric[string]) |

Composite

| Constructor | Description |
|---|---|
| Weighted[L](weights) | Weighted combination of multiple metrics |

Use metric.AsRunMetric / metric.AsRunMetrics to pass any Metric[L] into bench.WithMetrics.
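To make the probability metrics above more concrete, the sketch below shows what a Brier score measures: the mean squared error between the predicted positive-class probability and the 0/1 outcome. It uses only the standard library. The `scored` type and `brierScore` function are hypothetical stand-ins for illustration, not the package's implementation.

```go
package main

import "fmt"

// scored is a hypothetical stand-in for a (ground truth, prediction) pair:
// Positive is the true outcome, Prob the predicted probability of "positive".
type scored struct {
	Positive bool
	Prob     float64
}

// brierScore returns the mean squared error between predicted probabilities
// and the 0/1 outcomes. Lower is better; 0 is a perfect score.
// An empty input scores 0 (a choice made for this sketch).
func brierScore(samples []scored) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		outcome := 0.0
		if s.Positive {
			outcome = 1.0
		}
		diff := s.Prob - outcome
		sum += diff * diff
	}
	return sum / float64(len(samples))
}

func main() {
	samples := []scored{
		{Positive: true, Prob: 0.9},
		{Positive: false, Prob: 0.2},
		{Positive: true, Prob: 0.6},
	}
	fmt.Printf("brier = %.4f\n", brierScore(samples))
}
```

A confident wrong prediction is penalised far more than a hesitant one, which is why the score is useful next to plain accuracy.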

Reporters

| Constructor | Output |
|---|---|
| report.JSON() | Canonical Bench JSON with $schema and version |
| report.HTML() | Self-contained HTML with embedded Vega-Lite charts |
| report.Markdown() | GitHub-flavoured Markdown tables |
| report.CSV() | Flat CSV, one row per metric |
| report.JUnit(opts...) | JUnit XML: metrics become test cases, gated by targets |
| report.VegaLite() | Vega-Lite spec JSON ({ filename: spec, … }) |

SVG Visualisations (bench/viz)

| Function | Description |
|---|---|
| viz.RenderAll(result, ...RenderOption) | All available SVGs as map[string]string |
| viz.RenderROC(roc) | ROC curve |
| viz.RenderConfusion(cm) | Confusion-matrix heatmap |
| viz.RenderCalibration(cal) | Calibration curve |
| viz.RenderDistribution(dists) | Score-distribution histograms |
| viz.RenderComparison(branches) | Branch comparison grouped bar chart |

Usage Examples

Multi-class Classification

labels := []string{"cat", "dog", "bird"}

runner := bench.NewBenchRunner(
	bench.WithMetrics(
		metric.AsRunMetric(metric.MultiClassClassification(labels)),
		metric.AsRunMetric(metric.ConfusionMatrix(labels)),
	),
)

Regression

runner := bench.NewBenchRunner(
	bench.WithMetrics(
		metric.AsRunMetric(metric.RMSE()),
		metric.AsRunMetric(metric.RSquared()),
	),
)

Adapting an Existing Provider

eval := bench.FromProvider(
	myProvider,                              // provider.RequestResponse[MyInput, MyOutput]
	func(raw []byte) MyInput { ... },        // []byte → provider input
	func(out MyOutput) bench.Prediction[string] { ... }, // provider output → Prediction
)
runner.Register("my-provider", eval)

CI / CD with JUnit Targets

targets := map[string]float64{"f1": 0.90, "accuracy": 0.85}

runner := bench.NewBenchRunner(
	bench.WithTargets[string](targets),
	bench.WithFailOnRegression[string](true),
	bench.WithMetrics(
		metric.AsRunMetric(metric.BinaryClassification[string]("positive")),
	),
)

// JUnit reporter uses the same targets to pass/fail test cases.
junit := report.JUnit(report.WithTargets(targets))
_ = junit.Generate(junitFile, result)

Comparison & Regression Detection

cmp := bench.NewRunComparator(bench.WithChangeThreshold(0.02))

diff := cmp.Compare(baseResult, latestResult)

fmt.Println(diff.Summary())
// e.g. "f1: 0.91 → 0.93 (+0.02 ✓) | accuracy: 0.88 → 0.86 (−0.02 ✗)"

if diff.HasRegression() {
	fmt.Printf("Regressed samples: %v\n", diff.Regressed)
	os.Exit(1)
}

CLI Helper

store := bench.NewFileStorage("./results")
cli := bench.NewCLIRunner(store, bench.WithOutput(os.Stdout))

_ = cli.RunAndPrint(ctx, runner, dataset)  // run + print report
_ = cli.CompareLatest(ctx)                 // diff last two runs
_ = cli.ListRuns(ctx)                      // list stored runs
_ = cli.ShowRun(ctx, "run-abc123")         // show a specific run

  • provider: Evaluator is a provider.RequestResponse under the hood
  • pipeline: DatasetLoader.Pipeline() returns a lazy pipeline.Pipeline
  • process: wrap a subprocess as a provider, then adapt to an evaluator
  • storage: bench/storage adapts gokit/storage for cloud result persistence

License

MIT — Copyright (c) 2024 kbukum


Documentation

Overview

Package bench provides a pluggable evaluation framework for benchmarking providers against labeled datasets.

The framework bridges gokit's provider and pipeline packages to create a composable evaluation workflow:

  • Evaluator = provider.RequestResponse[[]byte, Prediction[L]]
  • Dataset = pipeline.Iterator[Sample[L]], loaded from manifest files
  • Metrics = pluggable scorers that consume (ground-truth, prediction) pairs

Architecture

bench models evaluation as a data pipeline:

Dataset → Evaluator → ScoredSample → Metrics → Results

Datasets are loaded lazily through pipeline.Iterator, so arbitrarily large datasets stream through memory without loading everything at once. Evaluators wrap any provider.RequestResponse via the FromProvider adapter, or use EvaluatorFunc for quick inline definitions. Metrics are stateless functions that receive a slice of ScoredSample and return scalar or structured results.

Quick Start

// Load every sample in the dataset directory; string labels stay strings.
loader := bench.NewDatasetLoader[string](dir, func(s string) (string, error) {
    return s, nil
})
samples, err := loader.All(ctx)
if err != nil {
    log.Fatal(err)
}

// Wrap the model call as an evaluator.
eval := bench.EvaluatorFunc("my-model", func(ctx context.Context, input []byte) (bench.Prediction[string], error) {
    label, score := myModel.Predict(input)
    return bench.Prediction[string]{Label: label, Score: score}, nil
})

// Score each sample, stopping at the first evaluation error.
scored := make([]bench.ScoredSample[string], 0, len(samples))
for _, s := range samples {
    pred, err := eval.Execute(ctx, s.Input)
    if err != nil {
        log.Fatal(err)
    }
    scored = append(scored, bench.ScoredSample[string]{Sample: s, Prediction: pred})
}

// Compute the metrics over all scored samples.
suite := metric.NewSuite(metric.BinaryClassification("positive"))
results := suite.Compute(scored)

Sub-packages

  • metric: pluggable metric implementations (classification, probability, ranking, regression, matching)
  • report: result formatting and output (JSON, Markdown, CSV, HTML, JUnit XML, Vega-Lite)

Constants

const SchemaURL = "https://gokit.dev/bench/v1/schema.json"

SchemaURL is the schema URL for Bench JSON output.

const SchemaVersion = "1.0"

SchemaVersion is the current Bench JSON schema version.

Variables

This section is empty.

Functions

This section is empty.

Types

type BenchRunner

type BenchRunner[L comparable] struct {
	// contains filtered or unexported fields
}

BenchRunner orchestrates evaluation runs.

func NewBenchRunner

func NewBenchRunner[L comparable](opts ...RunOption[L]) *BenchRunner[L]

NewBenchRunner creates a new runner with the given options.

func (*BenchRunner[L]) Register

func (r *BenchRunner[L]) Register(name string, eval Evaluator[L], opts ...BranchOption)

Register adds an evaluator branch to the runner.

func (*BenchRunner[L]) Run

func (r *BenchRunner[L]) Run(ctx context.Context, dataset *DatasetLoader[L]) (*RunResult, error)

Run executes the benchmark: loads samples, runs evaluators, computes metrics, and stores results.

type BranchOption

type BranchOption func(*branchConfig)

BranchOption configures a branch registration.

func WithTier

func WithTier(tier int) BranchOption

WithTier sets the tier for a branch (used for tiered evaluation).

type BranchResult

type BranchResult struct {
	Name             string             `json:"name"`
	Tier             int                `json:"tier"`
	Metrics          map[string]float64 `json:"metrics"`
	AvgScorePositive float64            `json:"avg_score_positive"`
	AvgScoreNegative float64            `json:"avg_score_negative"`
	Duration         time.Duration      `json:"duration_ms"`
	Errors           int                `json:"errors"`
}

BranchResult holds results for a single evaluator branch.

type CLIOption

type CLIOption func(*CLIRunner)

CLIOption configures a CLIRunner.

func WithOutput

func WithOutput(w io.Writer) CLIOption

WithOutput sets the output writer for CLI output (default: os.Stdout).

type CLIRunner

type CLIRunner struct {
	// contains filtered or unexported fields
}

CLIRunner provides CLI-friendly helpers for benchmark operations.

func NewCLIRunner

func NewCLIRunner(storage RunStorage, opts ...CLIOption) *CLIRunner

NewCLIRunner creates a CLI runner backed by the given storage.

func (*CLIRunner) CompareLatest

func (c *CLIRunner) CompareLatest(ctx context.Context) error

CompareLatest compares the two most recent runs.

func (*CLIRunner) CompareRuns

func (c *CLIRunner) CompareRuns(ctx context.Context, baseID, targetID string) error

CompareRuns loads two runs by ID and prints their comparison.

func (*CLIRunner) ListRuns

func (c *CLIRunner) ListRuns(ctx context.Context, opts ...ListOption) error

ListRuns prints a table of stored runs.

func (*CLIRunner) RunAndPrint

func (c *CLIRunner) RunAndPrint(ctx context.Context, runner *BenchRunner[string], dataset *DatasetLoader[string]) error

RunAndPrint executes a benchmark run and prints the results.

func (*CLIRunner) ShowRun

func (c *CLIRunner) ShowRun(ctx context.Context, runID string) error

ShowRun loads and prints a specific run in detail.

type CachingMiddleware

type CachingMiddleware[L comparable] struct {
	// contains filtered or unexported fields
}

CachingMiddleware wraps an evaluator and caches results by input hash.

func WithCaching

func WithCaching[L comparable](eval Evaluator[L]) *CachingMiddleware[L]

WithCaching wraps an evaluator with SHA-256 input-based caching.

func (*CachingMiddleware[L]) Execute

func (m *CachingMiddleware[L]) Execute(ctx context.Context, input []byte) (Prediction[L], error)

func (*CachingMiddleware[L]) IsAvailable

func (m *CachingMiddleware[L]) IsAvailable(ctx context.Context) bool

func (*CachingMiddleware[L]) Name

func (m *CachingMiddleware[L]) Name() string

func (*CachingMiddleware[L]) Stats

func (m *CachingMiddleware[L]) Stats() (hits, misses int)

Stats returns the number of cache hits and misses.
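The idea behind input-hash caching can be sketched with the standard library alone. The example below memoises a prediction function by the SHA-256 digest of its input and counts hits and misses. The `predictFunc` and `cache` names are hypothetical, and details such as not caching errors are assumptions of this sketch rather than documented behaviour of CachingMiddleware.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// predictFunc is a hypothetical stand-in for an evaluator's Execute method.
type predictFunc func(input []byte) (string, error)

// cache memoises predictFunc results keyed by the SHA-256 digest of the input.
type cache struct {
	mu     sync.Mutex
	next   predictFunc
	store  map[[sha256.Size]byte]string
	hits   int
	misses int
}

func newCache(next predictFunc) *cache {
	return &cache{next: next, store: make(map[[sha256.Size]byte]string)}
}

// Execute returns the cached label when the same bytes were seen before and
// calls the wrapped function otherwise. Errors are not cached (assumption).
func (c *cache) Execute(input []byte) (string, error) {
	key := sha256.Sum256(input)

	c.mu.Lock()
	if label, ok := c.store[key]; ok {
		c.hits++
		c.mu.Unlock()
		return label, nil
	}
	c.misses++
	c.mu.Unlock()

	label, err := c.next(input)
	if err != nil {
		return "", err
	}

	c.mu.Lock()
	c.store[key] = label
	c.mu.Unlock()
	return label, nil
}

// Stats reports the number of cache hits and misses.
func (c *cache) Stats() (hits, misses int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.hits, c.misses
}

func main() {
	calls := 0
	c := newCache(func(input []byte) (string, error) {
		calls++
		return "positive", nil
	})
	c.Execute([]byte("same input"))
	c.Execute([]byte("same input"))
	hits, misses := c.Stats()
	fmt.Println("hits:", hits, "misses:", misses, "model calls:", calls)
}
```

Caching of this kind pays off when a dataset contains duplicate inputs or when the same dataset is re-run against an expensive model.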

type CalibrationCurve

type CalibrationCurve struct {
	PredictedProbability []float64
	ActualFrequency      []float64
	BinCount             []int
}

CalibrationCurve holds calibration curve data.
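One plausible way to fill such a curve is equal-width binning, sketched below with the standard library. The struct mirrors the one documented above. The `calibrate` function, its equal-width bins over [0, 1], and the choice to skip empty bins are assumptions of this example, not a description of how the package computes the curve.

```go
package main

import "fmt"

// CalibrationCurve mirrors the struct documented above.
type CalibrationCurve struct {
	PredictedProbability []float64
	ActualFrequency      []float64
	BinCount             []int
}

// calibrate groups predictions into equal-width bins over [0, 1] and records,
// for each non-empty bin, the mean predicted probability and the observed
// fraction of positives. probs and positive must have equal length.
func calibrate(probs []float64, positive []bool, bins int) CalibrationCurve {
	if bins < 1 {
		bins = 1
	}
	sumProb := make([]float64, bins)
	positives := make([]int, bins)
	counts := make([]int, bins)
	for i, p := range probs {
		b := int(p * float64(bins))
		if b >= bins { // p == 1.0 belongs to the last bin
			b = bins - 1
		}
		if b < 0 {
			b = 0
		}
		sumProb[b] += p
		counts[b]++
		if positive[i] {
			positives[b]++
		}
	}
	var curve CalibrationCurve
	for b, n := range counts {
		if n == 0 {
			continue // skip empty bins
		}
		curve.PredictedProbability = append(curve.PredictedProbability, sumProb[b]/float64(n))
		curve.ActualFrequency = append(curve.ActualFrequency, float64(positives[b])/float64(n))
		curve.BinCount = append(curve.BinCount, n)
	}
	return curve
}

func main() {
	curve := calibrate(
		[]float64{0.1, 0.2, 0.8, 0.9},
		[]bool{false, false, true, false},
		2,
	)
	fmt.Printf("%+v\n", curve)
}
```

A well-calibrated model has ActualFrequency close to PredictedProbability in every bin; in the example the upper bin predicts about 0.85 but only half of its samples are positive.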

type CompareOption

type CompareOption func(*RunComparator)

CompareOption configures comparison.

func WithChangeThreshold

func WithChangeThreshold(t float64) CompareOption

WithChangeThreshold sets the minimum absolute change to report as significant (default: 0.01).
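A minimal sketch of how a change threshold can separate noise from meaningful movement is shown below. The MetricChange struct mirrors the one documented in this package. The `change` function is hypothetical, and two of its choices are assumptions: higher values count as better, and a change is significant when its absolute delta is at least the threshold.

```go
package main

import (
	"fmt"
	"math"
)

// MetricChange mirrors the struct documented in this package.
type MetricChange struct {
	Name        string
	OldValue    float64
	NewValue    float64
	Delta       float64
	Improved    bool
	Significant bool
}

// change classifies the move from oldV to newV. It assumes higher is better
// and treats |delta| >= threshold as significant; both are assumptions.
func change(name string, oldV, newV, threshold float64) MetricChange {
	delta := newV - oldV
	return MetricChange{
		Name:        name,
		OldValue:    oldV,
		NewValue:    newV,
		Delta:       delta,
		Improved:    delta > 0,
		Significant: math.Abs(delta) >= threshold,
	}
}

func main() {
	for _, c := range []MetricChange{
		change("f1", 0.90, 0.95, 0.02),
		change("accuracy", 0.90, 0.905, 0.02),
		change("recall", 0.90, 0.80, 0.02),
	} {
		fmt.Printf("%-8s delta=%+.3f improved=%t significant=%t\n",
			c.Name, c.Delta, c.Improved, c.Significant)
	}
}
```

Raising the threshold makes a CI gate less sensitive to run-to-run noise at the cost of missing small real regressions.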

type ConfusionMatrixDetail

type ConfusionMatrixDetail struct {
	Labels      []string
	Matrix      [][]int
	Orientation string // "row=actual, col=predicted"
}

ConfusionMatrixDetail holds confusion matrix data.

type DatasetInfo

type DatasetInfo struct {
	Name              string         `json:"name"`
	Version           string         `json:"version"`
	SampleCount       int            `json:"sample_count"`
	LabelDistribution map[string]int `json:"label_distribution"`
}

DatasetInfo holds summary info about the dataset used.

type DatasetLoader

type DatasetLoader[L comparable] struct {
	// contains filtered or unexported fields
}

DatasetLoader loads labeled samples from a manifest file.

func NewDatasetLoader

func NewDatasetLoader[L comparable](dir string, mapper LabelMapper[L], opts ...DatasetOption) *DatasetLoader[L]

NewDatasetLoader creates a loader for the given directory.

func (*DatasetLoader[L]) All

func (d *DatasetLoader[L]) All(ctx context.Context) ([]Sample[L], error)

All loads all samples into memory.

func (*DatasetLoader[L]) Filter

func (d *DatasetLoader[L]) Filter(fn func(ManifestSample) bool) *DatasetLoader[L]

Filter returns a new loader that only yields matching samples.

func (*DatasetLoader[L]) Iterator

func (d *DatasetLoader[L]) Iterator(ctx context.Context) (pipeline.Iterator[Sample[L]], error)

Iterator returns a pipeline.Iterator that lazily loads samples.

func (*DatasetLoader[L]) Manifest

func (d *DatasetLoader[L]) Manifest() (*DatasetManifest, error)

Manifest returns the parsed manifest.

func (*DatasetLoader[L]) Pipeline

func (d *DatasetLoader[L]) Pipeline() *pipeline.Pipeline[Sample[L]]

Pipeline returns a Pipeline[Sample[L]] for composition.

type DatasetManifest

type DatasetManifest struct {
	Name    string           `json:"name"`
	Version string           `json:"version"`
	Samples []ManifestSample `json:"samples"`
}

DatasetManifest describes a labeled dataset on disk.

type DatasetOption

type DatasetOption func(*datasetConfig)

DatasetOption configures dataset loading.

func WithManifestFile

func WithManifestFile(name string) DatasetOption

WithManifestFile sets the manifest filename (default: "manifest.json").

type Evaluator

type Evaluator[L comparable] interface {
	provider.RequestResponse[[]byte, Prediction[L]]
}

Evaluator is a provider.RequestResponse that produces predictions from raw input.

func EvaluatorFunc

func EvaluatorFunc[L comparable](name string, fn func(ctx context.Context, input []byte) (Prediction[L], error)) Evaluator[L]

EvaluatorFunc wraps a plain function as an Evaluator.

func FromProcess

func FromProcess[L comparable](
	name string,
	buildCmd func(Sample[L]) process.Command,
	parseOutput func(*process.Result) (Prediction[L], error),
) Evaluator[L]

FromProcess creates an Evaluator that calls a subprocess. buildCmd creates the process command from a sample's raw input. parseOutput extracts a prediction from the process result.

func FromProvider

func FromProvider[I, O any, L comparable](
	p provider.RequestResponse[I, O],
	toInput func([]byte) I,
	toPrediction func(O) Prediction[L],
) Evaluator[L]

FromProvider adapts any RequestResponse provider into an Evaluator using mapper functions for input/output transformation.

type FileStorage

type FileStorage struct {
	// contains filtered or unexported fields
}

FileStorage stores results as JSON files on disk.

func NewFileStorage

func NewFileStorage(dir string) *FileStorage

NewFileStorage creates a FileStorage that persists results under dir.

func (*FileStorage) Latest

func (fs *FileStorage) Latest(ctx context.Context) (*RunResult, error)

Latest returns the most recent RunResult by timestamp.

func (*FileStorage) List

func (fs *FileStorage) List(_ context.Context, opts ...ListOption) ([]RunSummary, error)

List returns summaries of stored results, sorted by timestamp descending.

func (*FileStorage) Load

func (fs *FileStorage) Load(_ context.Context, runID string) (*RunResult, error)

Load reads a RunResult from disk by run ID.

func (*FileStorage) Save

func (fs *FileStorage) Save(_ context.Context, result *RunResult) (string, error)

Save writes the RunResult as a JSON file named {runID}.json.

type LabelMapper

type LabelMapper[L comparable] func(string) (L, error)

LabelMapper converts a string label from a manifest into a typed label.
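A mapper does not have to be the identity function. The sketch below maps manifest strings to a small typed label. The LabelMapper type is redeclared locally so the example is self-contained; `Sentiment` and `sentimentMapper` are hypothetical names used only here.

```go
package main

import (
	"fmt"
	"strings"
)

// LabelMapper mirrors the documented type: it converts a manifest's string
// label into a typed label.
type LabelMapper[L comparable] func(string) (L, error)

// Sentiment is a hypothetical typed label used only for this example.
type Sentiment int

const (
	Negative Sentiment = iota
	Positive
)

// sentimentMapper accepts "positive" and "negative" in any case and rejects
// everything else, so a typo in a manifest surfaces as an error instead of
// a silently wrong label.
var sentimentMapper LabelMapper[Sentiment] = func(s string) (Sentiment, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "positive":
		return Positive, nil
	case "negative":
		return Negative, nil
	default:
		return 0, fmt.Errorf("unknown label %q", s)
	}
}

func main() {
	for _, raw := range []string{"positive", " Negative ", "meh"} {
		label, err := sentimentMapper(raw)
		fmt.Println(raw, label, err)
	}
}
```

Because the label type only needs to be comparable, an integer enum like this works anywhere the package is parameterised on L.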

type ListOption

type ListOption func(*listConfig)

ListOption configures result listing.

func WithDatasetFilter

func WithDatasetFilter(dataset string) ListOption

WithDatasetFilter filters results by dataset name.

func WithLimit

func WithLimit(n int) ListOption

WithLimit sets the maximum number of results to return.

func WithTagFilter

func WithTagFilter(tag string) ListOption

WithTagFilter filters results by tag.

type ListParams

type ListParams struct {
	Limit   int
	Tag     string
	Dataset string
}

ListParams holds the resolved parameters from ListOption values.

func ResolveListOptions

func ResolveListOptions(opts ...ListOption) ListParams

ResolveListOptions applies the given options and returns the resolved parameters. This is useful for external RunStorage implementations that need to inspect filter values.

type ManifestSample

type ManifestSample struct {
	ID     string         `json:"id"`
	File   string         `json:"file"`
	Label  string         `json:"label"`
	Source string         `json:"source,omitempty"`
	Meta   map[string]any `json:"metadata,omitempty"`
}

ManifestSample is one entry in a dataset manifest file.
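Based on the JSON tags of DatasetManifest and ManifestSample, a manifest file could look like the one embedded in the sketch below. The dataset name, sample IDs, and file names are made up for illustration, and how the loader resolves the `file` field is not shown here. The structs are redeclared locally so the example runs with the standard library only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ManifestSample and DatasetManifest mirror the documented structs.
type ManifestSample struct {
	ID     string         `json:"id"`
	File   string         `json:"file"`
	Label  string         `json:"label"`
	Source string         `json:"source,omitempty"`
	Meta   map[string]any `json:"metadata,omitempty"`
}

type DatasetManifest struct {
	Name    string           `json:"name"`
	Version string           `json:"version"`
	Samples []ManifestSample `json:"samples"`
}

// exampleManifest is a made-up manifest; all values are illustrative.
const exampleManifest = `{
  "name": "sentiment-demo",
  "version": "1",
  "samples": [
    {"id": "s1", "file": "s1.txt", "label": "positive"},
    {"id": "s2", "file": "s2.txt", "label": "negative",
     "source": "manual", "metadata": {"lang": "en"}}
  ]
}`

// parseManifest decodes manifest JSON into the documented struct shape.
func parseManifest(data []byte) (*DatasetManifest, error) {
	var m DatasetManifest
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("parse manifest: %w", err)
	}
	return &m, nil
}

func main() {
	m, err := parseManifest([]byte(exampleManifest))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, s := range m.Samples {
		fmt.Println(s.ID, s.File, s.Label)
	}
}
```

Note that the Go field is named Meta while the JSON key is `metadata`.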

type MetricChange

type MetricChange struct {
	Name        string
	OldValue    float64
	NewValue    float64
	Delta       float64
	Improved    bool
	Significant bool // above threshold
}

MetricChange represents a change in a metric between two runs.

type MetricResult

type MetricResult struct {
	Name   string             `json:"name"`
	Value  float64            `json:"value"`
	Values map[string]float64 `json:"values,omitempty"`
	Detail any                `json:"detail,omitempty"`
}

MetricResult pairs a metric name with its result.

type PrecisionRecallCurve

type PrecisionRecallCurve struct {
	Precision  []float64
	Recall     []float64
	Thresholds []float64
}

PrecisionRecallCurve holds precision-recall curve data.

type Prediction

type Prediction[L comparable] struct {
	SampleID string
	Label    L
	Score    float64
	Scores   map[L]float64
	Metadata map[string]any
}

Prediction represents an evaluator's output for a single sample.

type ROCCurve

type ROCCurve struct {
	FPR        []float64
	TPR        []float64
	Thresholds []float64
	AUC        float64
}

ROCCurve holds receiver operating characteristic curve data.
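Given the FPR and TPR points of such a curve, the area under it can be approximated with the trapezoidal rule. The sketch below is a generic illustration with a hypothetical `auc` helper; it is not necessarily how the package derives the AUC field.

```go
package main

import "fmt"

// auc integrates TPR over FPR with the trapezoidal rule.
// Points must be ordered by increasing FPR.
func auc(fpr, tpr []float64) float64 {
	var area float64
	for i := 1; i < len(fpr) && i < len(tpr); i++ {
		width := fpr[i] - fpr[i-1]
		height := (tpr[i] + tpr[i-1]) / 2
		area += width * height
	}
	return area
}

func main() {
	// A perfect classifier reaches TPR 1 before any false positive.
	fmt.Println(auc([]float64{0, 0, 1}, []float64{0, 1, 1}))
	// A random classifier follows the diagonal.
	fmt.Println(auc([]float64{0, 1}, []float64{0, 1}))
}
```

The two calls cover the usual reference points: an area of 1 for a perfect ranking and 0.5 for a random one.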

type RunComparator

type RunComparator struct {
	// contains filtered or unexported fields
}

RunComparator compares two benchmark runs.

func NewRunComparator

func NewRunComparator(opts ...CompareOption) *RunComparator

NewRunComparator creates a comparator with default settings.

func (*RunComparator) Compare

func (c *RunComparator) Compare(base, target *RunResult) *RunDiff

Compare compares two RunResults and returns the diff.

type RunDiff

type RunDiff struct {
	BaseID    string
	TargetID  string
	Changes   []MetricChange
	Fixed     []string // sample IDs that went from wrong to correct
	Regressed []string // sample IDs that went from correct to wrong
}

RunDiff holds the comparison result between two benchmark runs.

func (*RunDiff) HasRegression

func (d *RunDiff) HasRegression() bool

HasRegression returns true if any metric decreased significantly.

func (*RunDiff) Summary

func (d *RunDiff) Summary() string

Summary returns a human-readable summary of the comparison.

type RunMetric

type RunMetric[L comparable] interface {
	Name() string
	Compute(scored []ScoredSample[L]) MetricResult
}

RunMetric computes evaluation scores from predictions vs ground truth. This interface mirrors metric.Metric[L] but lives in bench to avoid an import cycle (bench/metric already imports bench). Use metric.AsRunMetric to adapt metric.Metric[L] values.

type RunOption

type RunOption[L comparable] func(*runConfig[L])

RunOption configures a BenchRunner.

func WithConcurrency

func WithConcurrency[L comparable](n int) RunOption[L]

WithConcurrency sets the number of parallel evaluation workers. Values <= 1 mean sequential execution.
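The behaviour described here (n workers, with n <= 1 meaning sequential) matches a common worker-pool pattern, sketched below with the standard library. The `evalAll` function and its string-based signature are hypothetical simplifications; the runner's internals may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// evalAll applies fn to every input using up to n workers and returns the
// results in input order. n <= 1 runs sequentially.
func evalAll(inputs []string, n int, fn func(string) string) []string {
	out := make([]string, len(inputs))
	if n <= 1 {
		for i, in := range inputs {
			out[i] = fn(in)
		}
		return out
	}

	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handled by exactly one worker,
				// so writes to out never overlap.
				out[i] = fn(inputs[i])
			}
		}()
	}
	for i := range inputs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	inputs := []string{"a", "b", "c", "d"}
	results := evalAll(inputs, 2, func(s string) string { return s + "!" })
	fmt.Println(results)
}
```

Writing results by index keeps the output order stable regardless of which worker finishes first, which matters when per-sample results are later compared across runs.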

func WithFailOnRegression

func WithFailOnRegression[L comparable](b bool) RunOption[L]

WithFailOnRegression configures whether the run should fail if a regression is detected compared to the previous run.

func WithMetrics

func WithMetrics[L comparable](metrics ...RunMetric[L]) RunOption[L]

WithMetrics configures the metrics to compute.

func WithStorage

func WithStorage[L comparable](s RunStorage) RunOption[L]

WithStorage configures the storage backend for persisting results.

func WithTag

func WithTag[L comparable](tag string) RunOption[L]

WithTag sets a human-readable tag for the run.

func WithTargets

func WithTargets[L comparable](targets map[string]float64) RunOption[L]

WithTargets sets metric target thresholds (metric name → minimum value).
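Conceptually, a target map is a set of per-metric minimums. The sketch below checks computed metric values against such a map and lists the ones that fall short. The `failedTargets` helper is hypothetical, and treating a target with no matching metric as failed is an assumption of this example.

```go
package main

import (
	"fmt"
	"sort"
)

// failedTargets returns the sorted names of metrics whose value is below
// the target minimum. A target with no matching metric counts as failed
// (an assumption of this sketch).
func failedTargets(metrics, targets map[string]float64) []string {
	var failed []string
	for name, minimum := range targets {
		value, ok := metrics[name]
		if !ok || value < minimum {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return failed
}

func main() {
	metrics := map[string]float64{"f1": 0.92, "accuracy": 0.80}
	targets := map[string]float64{"f1": 0.90, "accuracy": 0.85}

	if failed := failedTargets(metrics, targets); len(failed) > 0 {
		fmt.Println("targets not met:", failed)
	}
}
```

In a CI job the non-empty result would typically translate into a non-zero exit code or failed JUnit test cases.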

func WithTimeout

func WithTimeout[L comparable](d time.Duration) RunOption[L]

WithTimeout sets the per-sample evaluation timeout.
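A per-sample timeout is usually implemented by deriving a deadline-bound context for each call. The sketch below shows that pattern with the standard library. All names (`withTimeout`, `slowModel`, `runDemo`) are hypothetical, and the evaluator must honour context cancellation for the timeout to take effect.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// withTimeout runs fn with a context that is cancelled after d.
func withTimeout(ctx context.Context, d time.Duration,
	fn func(context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()
	return fn(ctx)
}

// slowModel simulates a model call that takes latency to answer and stops
// early when its context is cancelled.
func slowModel(latency time.Duration) func(context.Context) (string, error) {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(latency):
			return "positive", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// runDemo evaluates one simulated sample and reports whether it timed out.
func runDemo(timeoutMillis, latencyMillis int) (label string, timedOut bool) {
	label, err := withTimeout(
		context.Background(),
		time.Duration(timeoutMillis)*time.Millisecond,
		slowModel(time.Duration(latencyMillis)*time.Millisecond),
	)
	return label, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	label, timedOut := runDemo(500, 1)
	fmt.Println("fast sample:", label, "timed out:", timedOut)

	label, timedOut = runDemo(10, 500)
	fmt.Println("slow sample:", label, "timed out:", timedOut)
}
```

Bounding each sample separately keeps one stuck request from stalling the whole run.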

type RunResult

type RunResult struct {
	ID        string                  `json:"id"`
	Schema    string                  `json:"schema"`
	Timestamp time.Time               `json:"timestamp"`
	Tag       string                  `json:"tag,omitempty"`
	Duration  time.Duration           `json:"duration_ms"`
	Dataset   DatasetInfo             `json:"dataset"`
	Metrics   []MetricResult          `json:"metrics"`
	Branches  map[string]BranchResult `json:"branches"`
	Samples   []SampleResult          `json:"samples"`
	Curves    map[string]any          `json:"curves,omitempty"`
}

RunResult holds the complete output of a benchmark run.

type RunStorage

type RunStorage interface {
	Save(ctx context.Context, result *RunResult) (string, error)
	Load(ctx context.Context, runID string) (*RunResult, error)
	Latest(ctx context.Context) (*RunResult, error)
	List(ctx context.Context, opts ...ListOption) ([]RunSummary, error)
}

RunStorage persists benchmark results.

type RunSummary

type RunSummary struct {
	ID        string    `json:"id"`
	Timestamp time.Time `json:"timestamp"`
	Tag       string    `json:"tag,omitempty"`
	Dataset   string    `json:"dataset"`
	F1        float64   `json:"f1,omitempty"`
}

RunSummary is a lightweight summary for listing runs.

type Sample

type Sample[L comparable] struct {
	ID       string
	Input    []byte
	Label    L
	Source   string
	Metadata map[string]any
}

Sample represents a labeled data point in an evaluation dataset.

type SampleResult

type SampleResult struct {
	ID           string             `json:"id"`
	Label        string             `json:"label"`
	Predicted    string             `json:"predicted"`
	Score        float64            `json:"score"`
	Correct      bool               `json:"correct"`
	BranchScores map[string]float64 `json:"branch_scores,omitempty"`
	Duration     time.Duration      `json:"duration_ms"`
	Error        string             `json:"error,omitempty"`
}

SampleResult holds per-sample evaluation results.

type ScoreDistribution

type ScoreDistribution struct {
	Label  string
	Bins   []float64
	Counts []int
}

ScoreDistribution holds a histogram of scores for a label.

type ScoredSample

type ScoredSample[L comparable] struct {
	Sample     Sample[L]
	Prediction Prediction[L]
}

ScoredSample pairs a ground-truth sample with its prediction.

type ThresholdPoint

type ThresholdPoint struct {
	Threshold float64
	Precision float64
	Recall    float64
	F1        float64
	Accuracy  float64
}

ThresholdPoint holds classification metrics at a specific threshold.
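To illustrate what one such point contains, the sketch below computes precision, recall, F1, and accuracy at a single threshold. The struct mirrors ThresholdPoint above. The `pointAt` and `ratio` helpers are hypothetical, and two choices are assumptions: a score at or above the threshold counts as a positive prediction, and a ratio with a zero denominator is reported as 0.

```go
package main

import "fmt"

// ThresholdPoint mirrors the struct documented above.
type ThresholdPoint struct {
	Threshold float64
	Precision float64
	Recall    float64
	F1        float64
	Accuracy  float64
}

// ratio returns n/d, or 0 when d is zero.
func ratio(n, d int) float64 {
	if d == 0 {
		return 0
	}
	return float64(n) / float64(d)
}

// pointAt treats score >= threshold as a positive prediction and derives
// the metrics from the resulting confusion counts.
// scores and positive must have equal length.
func pointAt(scores []float64, positive []bool, threshold float64) ThresholdPoint {
	var tp, fp, tn, fn int
	for i, s := range scores {
		predicted := s >= threshold
		switch {
		case predicted && positive[i]:
			tp++
		case predicted && !positive[i]:
			fp++
		case !predicted && positive[i]:
			fn++
		default:
			tn++
		}
	}
	precision := ratio(tp, tp+fp)
	recall := ratio(tp, tp+fn)
	f1 := 0.0
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return ThresholdPoint{
		Threshold: threshold,
		Precision: precision,
		Recall:    recall,
		F1:        f1,
		Accuracy:  ratio(tp+tn, len(scores)),
	}
}

func main() {
	scores := []float64{0.9, 0.8, 0.4, 0.2}
	positive := []bool{true, false, true, false}
	for _, t := range []float64{0.3, 0.5, 0.85} {
		fmt.Printf("%+v\n", pointAt(scores, positive, t))
	}
}
```

Sweeping the threshold over a range and collecting the points makes the precision/recall trade-off visible and helps pick an operating point.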

type TimingMiddleware

type TimingMiddleware[L comparable] struct {
	// contains filtered or unexported fields
}

TimingMiddleware wraps an evaluator and records per-sample execution times.

func WithTiming

func WithTiming[L comparable](eval Evaluator[L]) *TimingMiddleware[L]

WithTiming wraps an evaluator with timing instrumentation.

func (*TimingMiddleware[L]) Execute

func (m *TimingMiddleware[L]) Execute(ctx context.Context, input []byte) (Prediction[L], error)

func (*TimingMiddleware[L]) IsAvailable

func (m *TimingMiddleware[L]) IsAvailable(ctx context.Context) bool

func (*TimingMiddleware[L]) Name

func (m *TimingMiddleware[L]) Name() string

func (*TimingMiddleware[L]) Timings

func (m *TimingMiddleware[L]) Timings() map[string]time.Duration

Timings returns a copy of the recorded per-sample durations.

Directories

| Path | Synopsis |
|---|---|
| metric | Package metric provides pluggable evaluation metrics for the bench framework. |
| report | Package report generates formatted output from bench evaluation results. |
| viz | Package viz generates SVG visualizations from bench evaluation results. |
