bench

Version: v0.0.0-...-e209eb6 · Published: May 5, 2026 · License: MIT · Imports: 17 · Imported by: 0

Repository

github.com/kbukum/gokit

README ¶

bench

General-purpose accuracy and quality benchmarking framework for Go

Note: This package is for model/system quality evaluation (accuracy, ranking, calibration, regression), not Go micro-benchmarks. For CPU/memory micro-benchmarks see go test -bench and the per-package *_test.go files.

Think of bench as testing.B for classification accuracy, ranking quality, probability calibration, and regression error. Evaluators are backed by gokit providers, datasets flow through pipelines, and metrics are fully pluggable.

Features

Generics-first — Sample[L], Prediction[L], Evaluator[L] are parameterised on the label type
Provider integration — any provider.RequestResponse becomes an evaluator with one adapter call
Pipeline integration — datasets expose a pipeline.Pipeline / pipeline.Iterator for lazy, backpressure-aware loading
Pluggable metrics — classification, probability, ranking, regression, matching — or bring your own
Multiple output formats — JSON, Markdown, CSV, HTML, JUnit XML, Vega-Lite, SVG visualisations
Comparison & regression detection — diff two runs, surface fixed/regressed samples, gate CI on thresholds
CLI helpers — CLIRunner wires up run → store → compare → print in a few lines
Concurrent evaluation — fan out across evaluators with configurable concurrency and per-sample timeouts

Install

go get github.com/kbukum/gokit/bench@latest

Quick Start

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/kbukum/gokit/bench"
	"github.com/kbukum/gokit/bench/metric"
	"github.com/kbukum/gokit/bench/report"
)

func main() {
	ctx := context.Background()

	// 1. Define an evaluator (wraps any prediction function).
	eval := bench.EvaluatorFunc("my-classifier",
		func(ctx context.Context, input []byte) (bench.Prediction[string], error) {
			// Replace with your model / API call.
			return bench.Prediction[string]{
				Label: "positive",
				Score: 0.92,
				Scores: map[string]float64{
					"positive": 0.92,
					"negative": 0.08,
				},
			}, nil
		},
	)

	// 2. Create a runner with metrics.
	runner := bench.NewBenchRunner(
		bench.WithTag[string]("v1.0"),
		bench.WithConcurrency[string](8),
		bench.WithMetrics(
			metric.AsRunMetric(metric.BinaryClassification[string]("positive")),
			metric.AsRunMetric(metric.AUCROC[string]("positive")),
			metric.AsRunMetric(metric.BrierScore[string]("positive")),
		),
	)

	// 3. Register one or more evaluators (branches).
	runner.Register("baseline", eval)

	// 4. Load a dataset (directory with manifest.json + sample files).
	dataset := bench.NewDatasetLoader("./testdata", func(s string) (string, error) {
		return s, nil // string labels → string
	})

	// 5. Run the benchmark.
	result, err := runner.Run(ctx, dataset)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// 6. Generate a Markdown report.
	_ = report.Markdown().Generate(os.Stdout, result)
}

Key Types & Functions

Symbol	Kind	Description
`Sample[L]`	struct	Labeled data point — ID, Input, Label, Source, Metadata
`Prediction[L]`	struct	Evaluator output — Label, Score, per-label Scores, Metadata
`ScoredSample[L]`	struct	Pairs a `Sample` with its `Prediction`
`Evaluator[L]`	interface	`provider.RequestResponse[[]byte, Prediction[L]]`
`EvaluatorFunc[L]`	func	Wraps a plain `func(ctx, []byte) (Prediction[L], error)` as an `Evaluator`
`FromProvider[I,O,L]`	func	Adapts any `provider.RequestResponse[I,O]` into an `Evaluator[L]`
`DatasetLoader[L]`	struct	Reads a manifest directory into `[]Sample[L]` or a `pipeline.Pipeline`
`LabelMapper[L]`	func	`func(string) (L, error)` — converts manifest string labels to typed `L`
`BenchRunner[L]`	struct	Orchestrates evaluation: load → evaluate → compute metrics → store
`RunResult`	struct	Full benchmark output — metrics, branch results, per-sample details, curves
`RunComparator`	struct	Diffs two `RunResult`s, reports metric changes & sample regressions
`CLIRunner`	struct	Convenience wrapper: run, compare, list, show — writes to `io.Writer`
`FileStorage`	struct	Stores `RunResult` as JSON files on disk
`RunStorage`	interface	Save / Load / Latest / List for benchmark results

Sub-packages

Package	Description
`bench/metric`	Metric implementations — classification, probability, ranking, regression, matching
`bench/report`	Output-format reporters — JSON, Markdown, CSV, Table, JUnit, Vega-Lite, HTML
`bench/viz`	Pure-Go SVG visualisation generation — ROC, confusion matrix, calibration, distribution, comparison
`bench/storage`	Cloud-storage adapter for bench results — wraps `gokit/storage`

Available Metrics

Classification

Constructor	Description
`BinaryClassification[L](positiveLabel, ...ClassificationOption)`	Precision, recall, F1, accuracy, FPR + confusion counts
`ConfusionMatrix[L](labels)`	Full N×N confusion matrix
`ThresholdSweep[L](positiveLabel, thresholds)`	Metrics at each threshold (default 0.1–0.9)
`MultiClassClassification[L](labels)`	Macro / micro / weighted precision, recall, F1
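
The binary classification figures listed above all derive from four confusion counts. The following is a minimal standalone sketch of that arithmetic, not the package's implementation; the `Counts` type and `Count` helper are hypothetical names introduced for illustration, and the convention of returning 0 for an empty denominator is an assumption.

```go
package main

import "fmt"

// Counts holds confusion counts for one positive label (hypothetical type).
type Counts struct{ TP, FP, TN, FN int }

// ratio returns num/den, or 0 when the denominator is zero (assumed convention).
func ratio(num, den int) float64 {
	if den == 0 {
		return 0
	}
	return float64(num) / float64(den)
}

// Count tallies confusion counts from parallel slices of actual and
// predicted labels. It assumes both slices have the same length.
func Count(actual, predicted []string, positive string) Counts {
	var c Counts
	for i := range actual {
		switch {
		case predicted[i] == positive && actual[i] == positive:
			c.TP++
		case predicted[i] == positive:
			c.FP++
		case actual[i] == positive:
			c.FN++
		default:
			c.TN++
		}
	}
	return c
}

func (c Counts) Precision() float64 { return ratio(c.TP, c.TP+c.FP) }
func (c Counts) Recall() float64    { return ratio(c.TP, c.TP+c.FN) }
func (c Counts) FPR() float64       { return ratio(c.FP, c.FP+c.TN) }
func (c Counts) Accuracy() float64  { return ratio(c.TP+c.TN, c.TP+c.FP+c.TN+c.FN) }

// F1 is the harmonic mean of precision and recall, written in count form.
func (c Counts) F1() float64 { return ratio(2*c.TP, 2*c.TP+c.FP+c.FN) }

func main() {
	actual := []string{"positive", "positive", "negative", "negative"}
	predicted := []string{"positive", "negative", "positive", "negative"}
	c := Count(actual, predicted, "positive")
	fmt.Printf("%+v precision=%.2f recall=%.2f f1=%.2f fpr=%.2f\n",
		c, c.Precision(), c.Recall(), c.F1(), c.FPR())
}
```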

Probability & Calibration

Constructor	Description
`AUCROC[L](positiveLabel)`	Area under the ROC curve
`BrierScore[L](positiveLabel)`	Mean squared error of predicted probabilities (lower is better)
`LogLoss[L](positiveLabel)`	Logarithmic loss (cross-entropy)
`Calibration[L](positiveLabel, bins)`	Calibration curve — predicted probability vs actual frequency
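
To make the two scoring rules above concrete, here is a small standalone sketch of the Brier score and binary log loss. It does not use the bench API: the function names and the `[]bool` outcome encoding are hypothetical, and the clipping constant `eps` is an assumed safeguard against `log(0)`.

```go
package main

import (
	"fmt"
	"math"
)

// BrierScore is the mean squared difference between the predicted
// probability of the positive label and the outcome (1 if positive, else 0).
func BrierScore(p []float64, positive []bool) float64 {
	if len(p) == 0 {
		return 0
	}
	var sum float64
	for i, pi := range p {
		target := 0.0
		if positive[i] {
			target = 1
		}
		d := pi - target
		sum += d * d
	}
	return sum / float64(len(p))
}

// LogLoss is the mean binary cross-entropy. Probabilities are clipped to
// [eps, 1-eps] so that a confident wrong prediction stays finite.
func LogLoss(p []float64, positive []bool) float64 {
	const eps = 1e-15 // assumed clipping constant
	if len(p) == 0 {
		return 0
	}
	var sum float64
	for i, pi := range p {
		pi = math.Min(math.Max(pi, eps), 1-eps)
		if positive[i] {
			sum -= math.Log(pi)
		} else {
			sum -= math.Log(1 - pi)
		}
	}
	return sum / float64(len(p))
}

func main() {
	probs := []float64{0.9, 0.2, 0.6}
	truth := []bool{true, false, false}
	fmt.Printf("brier=%.4f logloss=%.4f\n", BrierScore(probs, truth), LogLoss(probs, truth))
}
```

Both values are lower-is-better, which matters when a comparison step decides whether a change counts as an improvement.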

Ranking

Constructor	Description
`NDCG[L](k)`	Normalised Discounted Cumulative Gain at k
`MAP[L](positiveLabel)`	Mean Average Precision
`PrecisionAtK[L](positiveLabel, k)`	Precision at top k
`RecallAtK[L](positiveLabel, k)`	Recall at top k
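
As an illustration of what the ranking metrics measure, the sketch below computes Precision@k and NDCG@k over an already ranked list. It is independent of the bench package: the function names, the `[]bool` relevance flags, and the `[]float64` gain values are assumptions chosen for the example, and the `gain / log2(rank+1)` discount is the common textbook form rather than a confirmed detail of this library.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// PrecisionAtK returns the fraction of the top k ranked items that are relevant.
// Items beyond the end of the list count as not relevant.
func PrecisionAtK(relevant []bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	n := k
	if n > len(relevant) {
		n = len(relevant)
	}
	hits := 0
	for _, r := range relevant[:n] {
		if r {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// dcg sums gain/log2(rank+1) over the first k positions (rank starts at 1).
func dcg(gains []float64, k int) float64 {
	if k > len(gains) {
		k = len(gains)
	}
	var sum float64
	for i := 0; i < k; i++ {
		sum += gains[i] / math.Log2(float64(i)+2)
	}
	return sum
}

// NDCG divides the DCG of the given ranking by the DCG of the ideal
// (descending-gain) ranking, yielding a value in [0, 1].
func NDCG(gains []float64, k int) float64 {
	ideal := append([]float64(nil), gains...)
	sort.Sort(sort.Reverse(sort.Float64Slice(ideal)))
	best := dcg(ideal, k)
	if best == 0 {
		return 0
	}
	return dcg(gains, k) / best
}

func main() {
	fmt.Printf("P@2=%.2f\n", PrecisionAtK([]bool{true, false, true, false}, 2))
	fmt.Printf("NDCG@3=%.4f\n", NDCG([]float64{1, 3, 2}, 3))
}
```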

Regression

Constructor	Description
`MAE()`	Mean Absolute Error (`Metric[float64]`)
`MSE()`	Mean Squared Error
`RMSE()`	Root Mean Squared Error
`RSquared()`	Coefficient of determination (R²)
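
The four regression metrics are closely related, which the following standalone sketch tries to show. It operates on plain `[]float64` slices rather than bench types; the function signatures are hypothetical, and returning 0 for empty input or zero variance is an assumed convention.

```go
package main

import (
	"fmt"
	"math"
)

// MAE is the mean absolute difference between truth and prediction.
func MAE(yTrue, yPred []float64) float64 {
	if len(yTrue) == 0 {
		return 0
	}
	var sum float64
	for i := range yTrue {
		sum += math.Abs(yTrue[i] - yPred[i])
	}
	return sum / float64(len(yTrue))
}

// MSE is the mean squared difference between truth and prediction.
func MSE(yTrue, yPred []float64) float64 {
	if len(yTrue) == 0 {
		return 0
	}
	var sum float64
	for i := range yTrue {
		d := yTrue[i] - yPred[i]
		sum += d * d
	}
	return sum / float64(len(yTrue))
}

// RMSE is the square root of MSE, in the same unit as the target.
func RMSE(yTrue, yPred []float64) float64 { return math.Sqrt(MSE(yTrue, yPred)) }

// RSquared is 1 - SS_res/SS_tot. It returns 0 when the targets have no variance.
func RSquared(yTrue, yPred []float64) float64 {
	if len(yTrue) == 0 {
		return 0
	}
	var mean float64
	for _, y := range yTrue {
		mean += y
	}
	mean /= float64(len(yTrue))
	var ssRes, ssTot float64
	for i, y := range yTrue {
		ssRes += (y - yPred[i]) * (y - yPred[i])
		ssTot += (y - mean) * (y - mean)
	}
	if ssTot == 0 {
		return 0
	}
	return 1 - ssRes/ssTot
}

func main() {
	yTrue := []float64{1, 2, 3}
	yPred := []float64{1, 2, 4}
	fmt.Printf("mae=%.3f mse=%.3f rmse=%.3f r2=%.3f\n",
		MAE(yTrue, yPred), MSE(yTrue, yPred), RMSE(yTrue, yPred), RSquared(yTrue, yPred))
}
```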

Matching

Constructor	Description
`ExactMatch[L]()`	Fraction of exact label matches
`FuzzyMatch(threshold)`	Levenshtein-based string similarity (`Metric[string]`)
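
A Levenshtein-based similarity can be sketched as follows. This is a standalone illustration, not the code behind `FuzzyMatch`: normalising the distance by the longer string's rune count and treating `similarity >= threshold` as a match are assumptions, and all function names are hypothetical.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// Levenshtein returns the edit distance between a and b, compared rune by
// rune, using two rolling rows of the dynamic-programming table.
func Levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

// Similarity maps the distance to [0, 1]; 1 means identical strings.
func Similarity(a, b string) float64 {
	la, lb := len([]rune(a)), len([]rune(b))
	longest := la
	if lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(Levenshtein(a, b))/float64(longest)
}

// FuzzyEqual reports whether two strings are at least threshold-similar.
func FuzzyEqual(a, b string, threshold float64) bool {
	return Similarity(a, b) >= threshold
}

func main() {
	fmt.Println(Levenshtein("kitten", "sitting"))
	fmt.Printf("%.3f\n", Similarity("kitten", "sitting"))
	fmt.Println(FuzzyEqual("kitten", "sitting", 0.5))
}
```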

Composite

Constructor	Description
`Weighted[L](weights)`	Weighted combination of multiple metrics

Use metric.AsRunMetric / metric.AsRunMetrics to pass any Metric[L] into bench.WithMetrics.

Reporters

Constructor	Output
`report.JSON()`	Canonical Bench JSON with `$schema` and version
`report.HTML()`	Self-contained HTML with embedded Vega-Lite charts
`report.Markdown()`	GitHub-flavoured Markdown tables
`report.CSV()`	Flat CSV — one row per metric
`report.JUnit(opts...)`	JUnit XML — metrics become test cases, gated by targets
`report.VegaLite()`	Vega-Lite spec JSON (`{ filename: spec, … }`)

SVG Visualisations (`bench/viz`)

Function	Description
`viz.RenderAll(result, ...RenderOption)`	All available SVGs as `map[string]string`
`viz.RenderROC(roc)`	ROC curve
`viz.RenderConfusion(cm)`	Confusion-matrix heatmap
`viz.RenderCalibration(cal)`	Calibration curve
`viz.RenderDistribution(dists)`	Score-distribution histograms
`viz.RenderComparison(branches)`	Branch comparison grouped bar chart

Usage Examples

Multi-class Classification

labels := []string{"cat", "dog", "bird"}

runner := bench.NewBenchRunner(
	bench.WithMetrics(
		metric.AsRunMetric(metric.MultiClassClassification(labels)),
		metric.AsRunMetric(metric.ConfusionMatrix(labels)),
	),
)

Regression

runner := bench.NewBenchRunner(
	bench.WithMetrics(
		metric.AsRunMetric(metric.RMSE()),
		metric.AsRunMetric(metric.RSquared()),
	),
)

Adapting an Existing Provider

eval := bench.FromProvider(
	myProvider,                              // provider.RequestResponse[MyInput, MyOutput]
	func(raw []byte) MyInput { ... },        // []byte → provider input
	func(out MyOutput) bench.Prediction[string] { ... }, // provider output → Prediction
)
runner.Register("my-provider", eval)

CI / CD with JUnit Targets

targets := map[string]float64{"f1": 0.90, "accuracy": 0.85}

runner := bench.NewBenchRunner(
	bench.WithTargets[string](targets),
	bench.WithFailOnRegression[string](true),
	bench.WithMetrics(
		metric.AsRunMetric(metric.BinaryClassification[string]("positive")),
	),
)

// JUnit reporter uses the same targets to pass/fail test cases.
junit := report.JUnit(report.WithTargets(targets))
_ = junit.Generate(junitFile, result)

Comparison & Regression Detection

cmp := bench.NewRunComparator(bench.WithChangeThreshold(0.02))

diff := cmp.Compare(baseResult, latestResult)

fmt.Println(diff.Summary())
// e.g. "f1: 0.91 → 0.93 (+0.02 ✓) | accuracy: 0.88 → 0.86 (−0.02 ✗)"

if diff.HasRegression() {
	fmt.Printf("Regressed samples: %v\n", diff.Regressed)
	os.Exit(1)
}

CLI Helper

store := bench.NewFileStorage("./results")
cli := bench.NewCLIRunner(store, bench.WithOutput(os.Stdout))

_ = cli.RunAndPrint(ctx, runner, dataset)  // run + print report
_ = cli.CompareLatest(ctx)                 // diff last two runs
_ = cli.ListRuns(ctx)                      // list stored runs
_ = cli.ShowRun(ctx, "run-abc123")         // show a specific run

Related Packages

provider — Evaluator is a provider.RequestResponse under the hood
pipeline — DatasetLoader.Pipeline() returns a lazy pipeline.Pipeline
process — wrap a subprocess as a provider, then adapt to an evaluator
storage — bench/storage adapts gokit/storage for cloud result persistence

License

MIT

Documentation ¶

Overview ¶

Package bench provides a pluggable evaluation framework for benchmarking providers against labeled datasets.

The framework bridges gokit's provider and pipeline packages to create a composable evaluation workflow:

Evaluator = provider.RequestResponse[[]byte, Prediction[L]]
Dataset = pipeline.Iterator[Sample[L]], loaded from manifest files
Metrics = pluggable scorers that consume (ground-truth, prediction) pairs

Architecture ¶

bench models evaluation as a data pipeline:

Dataset → Evaluator → ScoredSample → Metrics → Results

Datasets are loaded lazily through pipeline.Iterator, so large datasets can be streamed sample by sample instead of being held in memory all at once. Evaluators wrap any provider.RequestResponse via the FromProvider adapter, or use EvaluatorFunc for quick inline definitions. Metrics are stateless: each receives a slice of ScoredSample and returns a scalar or structured result.
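
The data flow described above can be sketched with stand-in types. Everything in the block below is hypothetical and stdlib-only: the local `Sample`, `Prediction`, and `ScoredSample` types are reduced copies for illustration, `Iterator` is a plain closure rather than `pipeline.Iterator`, and the evaluator is an ordinary function rather than a `provider.RequestResponse`.

```go
package main

import "fmt"

// Reduced stand-ins for the bench types of the same names.
type Sample[L comparable] struct {
	ID    string
	Input []byte
	Label L
}

type Prediction[L comparable] struct {
	Label L
	Score float64
}

type ScoredSample[L comparable] struct {
	Sample     Sample[L]
	Prediction Prediction[L]
}

// Iterator yields one sample per call; ok is false once the data is exhausted.
type Iterator[L comparable] func() (s Sample[L], ok bool)

// FromSlice adapts an in-memory slice to the lazy Iterator shape.
func FromSlice[L comparable](samples []Sample[L]) Iterator[L] {
	i := 0
	return func() (Sample[L], bool) {
		if i >= len(samples) {
			var zero Sample[L]
			return zero, false
		}
		s := samples[i]
		i++
		return s, true
	}
}

// Score pulls samples one at a time and pairs each with its prediction.
func Score[L comparable](next Iterator[L], eval func([]byte) Prediction[L]) []ScoredSample[L] {
	var out []ScoredSample[L]
	for s, ok := next(); ok; s, ok = next() {
		out = append(out, ScoredSample[L]{Sample: s, Prediction: eval(s.Input)})
	}
	return out
}

// Accuracy is a stateless metric over the scored samples.
func Accuracy[L comparable](scored []ScoredSample[L]) float64 {
	if len(scored) == 0 {
		return 0
	}
	correct := 0
	for _, s := range scored {
		if s.Sample.Label == s.Prediction.Label {
			correct++
		}
	}
	return float64(correct) / float64(len(scored))
}

func main() {
	samples := []Sample[string]{
		{ID: "a", Input: []byte("good"), Label: "positive"},
		{ID: "b", Input: []byte("bad"), Label: "negative"},
		{ID: "c", Input: []byte("meh"), Label: "positive"},
	}
	eval := func(in []byte) Prediction[string] {
		if string(in) == "good" {
			return Prediction[string]{Label: "positive", Score: 0.9}
		}
		return Prediction[string]{Label: "negative", Score: 0.1}
	}
	scored := Score(FromSlice(samples), eval)
	fmt.Printf("scored=%d accuracy=%.2f\n", len(scored), Accuracy(scored))
}
```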

Quick Start ¶

loader := bench.NewDatasetLoader[string](dir, func(s string) (string, error) {
    return s, nil // manifest labels are already strings
})
samples, err := loader.All(ctx)
if err != nil {
    return err
}

eval := bench.EvaluatorFunc("my-model", func(ctx context.Context, input []byte) (bench.Prediction[string], error) {
    label, score := myModel.Predict(input)
    return bench.Prediction[string]{Label: label, Score: score}, nil
})

scored := make([]bench.ScoredSample[string], 0, len(samples))
for _, s := range samples {
    pred, err := eval.Execute(ctx, s.Input)
    if err != nil {
        return err // a failed evaluation must not be scored as a zero-value prediction
    }
    scored = append(scored, bench.ScoredSample[string]{Sample: s, Prediction: pred})
}

suite := metric.NewSuite(metric.BinaryClassification("positive"))
results := suite.Compute(scored)

Sub-packages ¶

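metric: pluggable metric implementations (classification, probability and calibration, ranking, regression, matching)
report: result formatting and output (JSON, Markdown, CSV, HTML, JUnit XML, Vega-Lite)
viz: SVG visualisations generated from evaluation results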

Index ¶

Constants
type BenchRunner
- func NewBenchRunner[L comparable](opts ...RunOption[L]) *BenchRunner[L]
- func (r *BenchRunner[L]) Register(name string, eval Evaluator[L], opts ...BranchOption)
- func (r *BenchRunner[L]) Run(ctx context.Context, dataset *DatasetLoader[L]) (*RunResult, error)
type BranchOption
- func WithTier(tier int) BranchOption
type BranchResult
type CLIOption
- func WithOutput(w io.Writer) CLIOption
type CLIRunner
- func NewCLIRunner(storage RunStorage, opts ...CLIOption) *CLIRunner
- func (c *CLIRunner) CompareLatest(ctx context.Context) error
- func (c *CLIRunner) CompareRuns(ctx context.Context, baseID, targetID string) error
- func (c *CLIRunner) ListRuns(ctx context.Context, opts ...ListOption) error
- func (c *CLIRunner) RunAndPrint(ctx context.Context, runner *BenchRunner[string], ...) error
- func (c *CLIRunner) ShowRun(ctx context.Context, runID string) error
type CachingMiddleware
- func WithCaching[L comparable](eval Evaluator[L]) *CachingMiddleware[L]
- func (m *CachingMiddleware[L]) Execute(ctx context.Context, input []byte) (Prediction[L], error)
- func (m *CachingMiddleware[L]) IsAvailable(ctx context.Context) bool
- func (m *CachingMiddleware[L]) Name() string
- func (m *CachingMiddleware[L]) Stats() (hits, misses int)
type CalibrationCurve
type CompareOption
- func WithChangeThreshold(t float64) CompareOption
type ConfusionMatrixDetail
type DatasetInfo
type DatasetLoader
- func NewDatasetLoader[L comparable](dir string, mapper LabelMapper[L], opts ...DatasetOption) *DatasetLoader[L]
- func (d *DatasetLoader[L]) All(ctx context.Context) ([]Sample[L], error)
- func (d *DatasetLoader[L]) Filter(fn func(ManifestSample) bool) *DatasetLoader[L]
- func (d *DatasetLoader[L]) Iterator(ctx context.Context) (pipeline.Iterator[Sample[L]], error)
- func (d *DatasetLoader[L]) Manifest() (*DatasetManifest, error)
- func (d *DatasetLoader[L]) Pipeline() *pipeline.Pipeline[Sample[L]]
type DatasetManifest
type DatasetOption
- func WithManifestFile(name string) DatasetOption
type Evaluator
- func EvaluatorFunc[L comparable](name string, fn func(ctx context.Context, input []byte) (Prediction[L], error)) Evaluator[L]
- func FromProcess[L comparable](name string, buildCmd func(Sample[L]) process.Command, ...) Evaluator[L]
- func FromProvider[I, O any, L comparable](p provider.RequestResponse[I, O], toInput func([]byte) I, ...) Evaluator[L]
type FileStorage
- func NewFileStorage(dir string) *FileStorage
- func (fs *FileStorage) Latest(ctx context.Context) (*RunResult, error)
- func (fs *FileStorage) List(_ context.Context, opts ...ListOption) ([]RunSummary, error)
- func (fs *FileStorage) Load(_ context.Context, runID string) (*RunResult, error)
- func (fs *FileStorage) Save(_ context.Context, result *RunResult) (string, error)
type LabelMapper
type ListOption
- func WithDatasetFilter(dataset string) ListOption
- func WithLimit(n int) ListOption
- func WithTagFilter(tag string) ListOption
type ListParams
- func ResolveListOptions(opts ...ListOption) ListParams
type ManifestSample
type MetricChange
type MetricResult
type PrecisionRecallCurve
type Prediction
type ROCCurve
type RunComparator
- func NewRunComparator(opts ...CompareOption) *RunComparator
- func (c *RunComparator) Compare(base, target *RunResult) *RunDiff
type RunDiff
- func (d *RunDiff) HasRegression() bool
- func (d *RunDiff) Summary() string
type RunMetric
type RunOption
- func WithConcurrency[L comparable](n int) RunOption[L]
- func WithFailOnRegression[L comparable](b bool) RunOption[L]
- func WithMetrics[L comparable](metrics ...RunMetric[L]) RunOption[L]
- func WithStorage[L comparable](s RunStorage) RunOption[L]
- func WithTag[L comparable](tag string) RunOption[L]
- func WithTargets[L comparable](targets map[string]float64) RunOption[L]
- func WithTimeout[L comparable](d time.Duration) RunOption[L]
type RunResult
type RunStorage
type RunSummary
type Sample
type SampleResult
type ScoreDistribution
type ScoredSample
type ThresholdPoint
type TimingMiddleware
- func WithTiming[L comparable](eval Evaluator[L]) *TimingMiddleware[L]
- func (m *TimingMiddleware[L]) Execute(ctx context.Context, input []byte) (Prediction[L], error)
- func (m *TimingMiddleware[L]) IsAvailable(ctx context.Context) bool
- func (m *TimingMiddleware[L]) Name() string
- func (m *TimingMiddleware[L]) Timings() map[string]time.Duration

Constants ¶

const SchemaURL = "https://gokit.dev/bench/v1/schema.json"

SchemaURL is the schema URL for Bench JSON output.

const SchemaVersion = "1.0"

SchemaVersion is the current Bench JSON schema version.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type BenchRunner ¶

type BenchRunner[L comparable] struct {
	// contains filtered or unexported fields
}

BenchRunner orchestrates evaluation runs.

func NewBenchRunner ¶

func NewBenchRunner[L comparable](opts ...RunOption[L]) *BenchRunner[L]

NewBenchRunner creates a new runner with the given options.

func (*BenchRunner[L]) Register ¶

func (r *BenchRunner[L]) Register(name string, eval Evaluator[L], opts ...BranchOption)

Register adds an evaluator branch to the runner.

func (*BenchRunner[L]) Run ¶

func (r *BenchRunner[L]) Run(ctx context.Context, dataset *DatasetLoader[L]) (*RunResult, error)

Run executes the benchmark: loads samples, runs evaluators, computes metrics, and stores results.

type BranchOption ¶

type BranchOption func(*branchConfig)

BranchOption configures a branch registration.

func WithTier ¶

func WithTier(tier int) BranchOption

WithTier sets the tier for a branch (used for tiered evaluation).

type BranchResult ¶

type BranchResult struct {
	Name             string             `json:"name"`
	Tier             int                `json:"tier"`
	Metrics          map[string]float64 `json:"metrics"`
	AvgScorePositive float64            `json:"avg_score_positive"`
	AvgScoreNegative float64            `json:"avg_score_negative"`
	Duration         time.Duration      `json:"duration_ms"`
	Errors           int                `json:"errors"`
}

BranchResult holds results for a single evaluator branch.

type CLIOption ¶

type CLIOption func(*CLIRunner)

CLIOption configures a CLIRunner.

func WithOutput ¶

func WithOutput(w io.Writer) CLIOption

WithOutput sets the output writer for CLI output (default: os.Stdout).

type CLIRunner ¶

type CLIRunner struct {
	// contains filtered or unexported fields
}

CLIRunner provides CLI-friendly helpers for benchmark operations.

func NewCLIRunner ¶

func NewCLIRunner(storage RunStorage, opts ...CLIOption) *CLIRunner

NewCLIRunner creates a CLI runner backed by the given storage.

func (*CLIRunner) CompareLatest ¶

func (c *CLIRunner) CompareLatest(ctx context.Context) error

CompareLatest compares the two most recent runs.

func (*CLIRunner) CompareRuns ¶

func (c *CLIRunner) CompareRuns(ctx context.Context, baseID, targetID string) error

CompareRuns loads two runs by ID and prints their comparison.

func (*CLIRunner) ListRuns ¶

func (c *CLIRunner) ListRuns(ctx context.Context, opts ...ListOption) error

ListRuns prints a table of stored runs.

func (*CLIRunner) RunAndPrint ¶

func (c *CLIRunner) RunAndPrint(ctx context.Context, runner *BenchRunner[string], dataset *DatasetLoader[string]) error

RunAndPrint executes a benchmark run and prints the results.

func (*CLIRunner) ShowRun ¶

func (c *CLIRunner) ShowRun(ctx context.Context, runID string) error

ShowRun loads and prints a specific run in detail.

type CachingMiddleware ¶

type CachingMiddleware[L comparable] struct {
	// contains filtered or unexported fields
}

CachingMiddleware wraps an evaluator and caches results by input hash.

func WithCaching ¶

func WithCaching[L comparable](eval Evaluator[L]) *CachingMiddleware[L]

WithCaching wraps an evaluator with SHA-256 input-based caching.

func (*CachingMiddleware[L]) Execute ¶

func (m *CachingMiddleware[L]) Execute(ctx context.Context, input []byte) (Prediction[L], error)

func (*CachingMiddleware[L]) IsAvailable ¶

func (m *CachingMiddleware[L]) IsAvailable(ctx context.Context) bool

func (*CachingMiddleware[L]) Name ¶

func (m *CachingMiddleware[L]) Name() string

func (*CachingMiddleware[L]) Stats ¶

func (m *CachingMiddleware[L]) Stats() (hits, misses int)

Stats returns the number of cache hits and misses.
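
Caching by input hash can be sketched in a few lines. The block below is a simplified standalone illustration, not the `CachingMiddleware` source: the `Cache` type, its string result, and the choice to leave errors uncached are assumptions.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// Cache memoises the result of fn, keyed by the SHA-256 digest of the input.
type Cache struct {
	mu           sync.Mutex
	entries      map[[sha256.Size]byte]string
	hits, misses int
	fn           func([]byte) (string, error)
}

func NewCache(fn func([]byte) (string, error)) *Cache {
	return &Cache{entries: make(map[[sha256.Size]byte]string), fn: fn}
}

// Execute returns the cached result for input, calling fn only on a miss.
// Errors are returned to the caller and not cached (assumed policy).
func (c *Cache) Execute(input []byte) (string, error) {
	key := sha256.Sum256(input)

	c.mu.Lock()
	if v, ok := c.entries[key]; ok {
		c.hits++
		c.mu.Unlock()
		return v, nil
	}
	c.misses++
	c.mu.Unlock()

	// fn runs outside the lock so slow evaluations do not block other callers.
	v, err := c.fn(input)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.entries[key] = v
	c.mu.Unlock()
	return v, nil
}

// Stats returns the number of cache hits and misses so far.
func (c *Cache) Stats() (hits, misses int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.hits, c.misses
}

func main() {
	c := NewCache(func(in []byte) (string, error) { return "label-for-" + string(in), nil })
	for _, in := range []string{"a", "a", "b"} {
		v, _ := c.Execute([]byte(in))
		fmt.Println(v)
	}
	hits, misses := c.Stats()
	fmt.Printf("hits=%d misses=%d\n", hits, misses)
}
```

Hashing the input keeps the map key at a fixed 32 bytes regardless of sample size, at the cost of one digest computation per call.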

type CalibrationCurve ¶

type CalibrationCurve struct {
	PredictedProbability []float64
	ActualFrequency      []float64
	BinCount             []int
}

CalibrationCurve holds calibration curve data.

type CompareOption ¶

type CompareOption func(*RunComparator)

CompareOption configures comparison.

func WithChangeThreshold ¶

func WithChangeThreshold(t float64) CompareOption

WithChangeThreshold sets the minimum absolute change to report as significant (default: 0.01).

type ConfusionMatrixDetail ¶

type ConfusionMatrixDetail struct {
	Labels      []string
	Matrix      [][]int
	Orientation string // "row=actual, col=predicted"
}

ConfusionMatrixDetail holds confusion matrix data.

type DatasetInfo ¶

type DatasetInfo struct {
	Name              string         `json:"name"`
	Version           string         `json:"version"`
	SampleCount       int            `json:"sample_count"`
	LabelDistribution map[string]int `json:"label_distribution"`
}

DatasetInfo holds summary info about the dataset used.

type DatasetLoader ¶

type DatasetLoader[L comparable] struct {
	// contains filtered or unexported fields
}

DatasetLoader loads labeled samples from a manifest file.

func NewDatasetLoader ¶

func NewDatasetLoader[L comparable](dir string, mapper LabelMapper[L], opts ...DatasetOption) *DatasetLoader[L]

NewDatasetLoader creates a loader for the given directory.

func (*DatasetLoader[L]) All ¶

func (d *DatasetLoader[L]) All(ctx context.Context) ([]Sample[L], error)

All loads all samples into memory.

func (*DatasetLoader[L]) Filter ¶

func (d *DatasetLoader[L]) Filter(fn func(ManifestSample) bool) *DatasetLoader[L]

Filter returns a new loader that only yields matching samples.

func (*DatasetLoader[L]) Iterator ¶

func (d *DatasetLoader[L]) Iterator(ctx context.Context) (pipeline.Iterator[Sample[L]], error)

Iterator returns a pipeline.Iterator that lazily loads samples.

func (*DatasetLoader[L]) Manifest ¶

func (d *DatasetLoader[L]) Manifest() (*DatasetManifest, error)

Manifest returns the parsed manifest.

func (*DatasetLoader[L]) Pipeline ¶

func (d *DatasetLoader[L]) Pipeline() *pipeline.Pipeline[Sample[L]]

Pipeline returns a Pipeline[Sample[L]] for composition.

type DatasetManifest ¶

type DatasetManifest struct {
	Name    string           `json:"name"`
	Version string           `json:"version"`
	Samples []ManifestSample `json:"samples"`
}

DatasetManifest describes a labeled dataset on disk.

type DatasetOption ¶

type DatasetOption func(*datasetConfig)

DatasetOption configures dataset loading.

func WithManifestFile ¶

func WithManifestFile(name string) DatasetOption

WithManifestFile sets the manifest filename (default: "manifest.json").

type Evaluator ¶

type Evaluator[L comparable] interface {
	provider.RequestResponse[[]byte, Prediction[L]]
}

Evaluator is a provider.RequestResponse that produces predictions from raw input.

func EvaluatorFunc ¶

func EvaluatorFunc[L comparable](name string, fn func(ctx context.Context, input []byte) (Prediction[L], error)) Evaluator[L]

EvaluatorFunc wraps a plain function as an Evaluator.

func FromProcess ¶

func FromProcess[L comparable](
	name string,
	buildCmd func(Sample[L]) process.Command,
	parseOutput func(*process.Result) (Prediction[L], error),
) Evaluator[L]

FromProcess creates an Evaluator that calls a subprocess. buildCmd builds the process command from a sample, and parseOutput extracts a prediction from the process result.

func FromProvider ¶

func FromProvider[I, O any, L comparable](
	p provider.RequestResponse[I, O],
	toInput func([]byte) I,
	toPrediction func(O) Prediction[L],
) Evaluator[L]

FromProvider adapts any RequestResponse provider into an Evaluator using mapper functions for input/output transformation.

type FileStorage ¶

type FileStorage struct {
	// contains filtered or unexported fields
}

FileStorage stores results as JSON files on disk.

func NewFileStorage ¶

func NewFileStorage(dir string) *FileStorage

NewFileStorage creates a FileStorage that persists results under dir.

func (*FileStorage) Latest ¶

func (fs *FileStorage) Latest(ctx context.Context) (*RunResult, error)

Latest returns the most recent RunResult by timestamp.

func (*FileStorage) List ¶

func (fs *FileStorage) List(_ context.Context, opts ...ListOption) ([]RunSummary, error)

List returns summaries of stored results, sorted by timestamp descending.

func (*FileStorage) Load ¶

func (fs *FileStorage) Load(_ context.Context, runID string) (*RunResult, error)

Load reads a RunResult from disk by run ID.

func (*FileStorage) Save ¶

func (fs *FileStorage) Save(_ context.Context, result *RunResult) (string, error)

Save writes the RunResult as a JSON file named {runID}.json.

type LabelMapper ¶

type LabelMapper[L comparable] func(string) (L, error)

LabelMapper converts a string label from a manifest into a typed label.

type ListOption ¶

type ListOption func(*listConfig)

ListOption configures result listing.

func WithDatasetFilter ¶

func WithDatasetFilter(dataset string) ListOption

WithDatasetFilter filters results by dataset name.

func WithLimit ¶

func WithLimit(n int) ListOption

WithLimit sets the maximum number of results to return.

func WithTagFilter ¶

func WithTagFilter(tag string) ListOption

WithTagFilter filters results by tag.

type ListParams ¶

type ListParams struct {
	Limit   int
	Tag     string
	Dataset string
}

ListParams holds the resolved parameters from ListOption values.

func ResolveListOptions ¶

func ResolveListOptions(opts ...ListOption) ListParams

ResolveListOptions applies the given options and returns the resolved parameters. This is useful for external RunStorage implementations that need to inspect filter values.

type ManifestSample ¶

type ManifestSample struct {
	ID     string         `json:"id"`
	File   string         `json:"file"`
	Label  string         `json:"label"`
	Source string         `json:"source,omitempty"`
	Meta   map[string]any `json:"metadata,omitempty"`
}

ManifestSample is one entry in a dataset manifest file.
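
Given the JSON tags on DatasetManifest and ManifestSample, a manifest file can be read with `encoding/json` alone. The sketch below copies the two struct definitions locally so it runs without the bench module; the example manifest contents, the `ParseManifest` and `LabelDistribution` helpers, and the sample file names are all made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented structs, including their JSON tags.
type ManifestSample struct {
	ID     string         `json:"id"`
	File   string         `json:"file"`
	Label  string         `json:"label"`
	Source string         `json:"source,omitempty"`
	Meta   map[string]any `json:"metadata,omitempty"`
}

type DatasetManifest struct {
	Name    string           `json:"name"`
	Version string           `json:"version"`
	Samples []ManifestSample `json:"samples"`
}

// ParseManifest decodes manifest JSON into a DatasetManifest (hypothetical helper).
func ParseManifest(data []byte) (*DatasetManifest, error) {
	var m DatasetManifest
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("parse manifest: %w", err)
	}
	return &m, nil
}

// LabelDistribution counts samples per label, similar in shape to
// DatasetInfo.LabelDistribution.
func LabelDistribution(m *DatasetManifest) map[string]int {
	dist := make(map[string]int)
	for _, s := range m.Samples {
		dist[s.Label]++
	}
	return dist
}

// exampleManifest is invented content in the shape the struct tags imply.
const exampleManifest = `{
  "name": "sentiment",
  "version": "1",
  "samples": [
    {"id": "s1", "file": "s1.txt", "label": "positive"},
    {"id": "s2", "file": "s2.txt", "label": "negative", "source": "manual"},
    {"id": "s3", "file": "s3.txt", "label": "positive", "metadata": {"lang": "en"}}
  ]
}`

func main() {
	m, err := ParseManifest([]byte(exampleManifest))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m.Name, m.Version, len(m.Samples))
	fmt.Println(LabelDistribution(m))
}
```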

type MetricChange ¶

type MetricChange struct {
	Name        string
	OldValue    float64
	NewValue    float64
	Delta       float64
	Improved    bool
	Significant bool // above threshold
}

MetricChange represents a change in a metric between two runs.

type MetricResult ¶

type MetricResult struct {
	Name   string             `json:"name"`
	Value  float64            `json:"value"`
	Values map[string]float64 `json:"values,omitempty"`
	Detail any                `json:"detail,omitempty"`
}

MetricResult pairs a metric name with its result.

type PrecisionRecallCurve ¶

type PrecisionRecallCurve struct {
	Precision  []float64
	Recall     []float64
	Thresholds []float64
}

PrecisionRecallCurve holds precision-recall curve data.

type Prediction ¶

type Prediction[L comparable] struct {
	SampleID string
	Label    L
	Score    float64
	Scores   map[L]float64
	Metadata map[string]any
}

Prediction represents an evaluator's output for a single sample.

type ROCCurve ¶

type ROCCurve struct {
	FPR        []float64
	TPR        []float64
	Thresholds []float64
	AUC        float64
}

ROCCurve holds receiver operating characteristic curve data.

type RunComparator ¶

type RunComparator struct {
	// contains filtered or unexported fields
}

RunComparator compares two benchmark runs.

func NewRunComparator ¶

func NewRunComparator(opts ...CompareOption) *RunComparator

NewRunComparator creates a comparator with default settings.

func (*RunComparator) Compare ¶

func (c *RunComparator) Compare(base, target *RunResult) *RunDiff

Compare compares two RunResults and returns the diff.

type RunDiff ¶

type RunDiff struct {
	BaseID    string
	TargetID  string
	Changes   []MetricChange
	Fixed     []string // sample IDs that went from wrong to correct
	Regressed []string // sample IDs that went from correct to wrong
}

RunDiff holds the comparison result between two benchmark runs.

func (*RunDiff) HasRegression ¶

func (d *RunDiff) HasRegression() bool

HasRegression returns true if any metric decreased significantly.

func (*RunDiff) Summary ¶

func (d *RunDiff) Summary() string

Summary returns a human-readable summary of the comparison.
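
The relationship between the change threshold, MetricChange, and HasRegression might look roughly like the sketch below. It is an approximation under stated assumptions: metrics are compared as flat `map[string]float64` values, a higher value is treated as better for every metric (which would be wrong for lower-is-better metrics such as the Brier score), and `Diff` and the free function `HasRegression` are hypothetical helpers rather than the package's methods.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// MetricChange mirrors the documented struct of the same name.
type MetricChange struct {
	Name        string
	OldValue    float64
	NewValue    float64
	Delta       float64
	Improved    bool
	Significant bool // absolute change is at least the threshold
}

// Diff compares metrics present in both runs and returns changes sorted by name.
func Diff(base, target map[string]float64, threshold float64) []MetricChange {
	names := make([]string, 0, len(base))
	for name := range base {
		if _, ok := target[name]; ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	changes := make([]MetricChange, 0, len(names))
	for _, name := range names {
		delta := target[name] - base[name]
		changes = append(changes, MetricChange{
			Name:        name,
			OldValue:    base[name],
			NewValue:    target[name],
			Delta:       delta,
			Improved:    delta > 0, // assumes higher is better
			Significant: math.Abs(delta) >= threshold,
		})
	}
	return changes
}

// HasRegression reports whether any metric dropped by a significant amount.
func HasRegression(changes []MetricChange) bool {
	for _, c := range changes {
		if c.Significant && c.Delta < 0 {
			return true
		}
	}
	return false
}

func main() {
	base := map[string]float64{"f1": 0.90, "accuracy": 0.80, "recall": 0.880}
	target := map[string]float64{"f1": 0.95, "accuracy": 0.70, "recall": 0.875}
	changes := Diff(base, target, 0.02)
	for _, c := range changes {
		fmt.Printf("%s: %.3f -> %.3f significant=%t improved=%t\n",
			c.Name, c.OldValue, c.NewValue, c.Significant, c.Improved)
	}
	fmt.Println("regression:", HasRegression(changes))
}
```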

type RunMetric ¶

type RunMetric[L comparable] interface {
	Name() string
	Compute(scored []ScoredSample[L]) MetricResult
}

RunMetric computes evaluation scores from predictions vs ground truth. This interface mirrors metric.Metric[L] but lives in bench to avoid an import cycle (bench/metric already imports bench). Use metric.AsRunMetric to adapt metric.Metric[L] values.

type RunOption ¶

type RunOption[L comparable] func(*runConfig[L])

RunOption configures a BenchRunner.

func WithConcurrency ¶

func WithConcurrency[L comparable](n int) RunOption[L]

WithConcurrency sets the number of parallel evaluation workers. Values <= 1 mean sequential execution.

func WithFailOnRegression ¶

func WithFailOnRegression[L comparable](b bool) RunOption[L]

WithFailOnRegression configures whether the run should fail if a regression is detected compared to the previous run.

func WithMetrics ¶

func WithMetrics[L comparable](metrics ...RunMetric[L]) RunOption[L]

WithMetrics configures the metrics to compute.

func WithStorage ¶

func WithStorage[L comparable](s RunStorage) RunOption[L]

WithStorage configures the storage backend for persisting results.

func WithTag ¶

func WithTag[L comparable](tag string) RunOption[L]

WithTag sets a human-readable tag for the run.

func WithTargets ¶

func WithTargets[L comparable](targets map[string]float64) RunOption[L]

WithTargets sets metric target thresholds (metric name → minimum value).

func WithTimeout ¶

func WithTimeout[L comparable](d time.Duration) RunOption[L]

WithTimeout sets the per-sample evaluation timeout.
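
How bounded concurrency and a per-sample timeout interact can be illustrated with a semaphore channel and `context.WithTimeout`. This is a stdlib-only sketch, not BenchRunner's scheduler: `EvalFunc`, `Outcome`, `RunAll`, and `Demo` are hypothetical names, and the evaluator is assumed to honour context cancellation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// EvalFunc stands in for an evaluator's Execute method.
type EvalFunc func(ctx context.Context, input []byte) (string, error)

// Outcome records the result for one input.
type Outcome struct {
	Label string
	Err   error
}

// RunAll evaluates every input with at most workers evaluations in flight,
// giving each one its own deadline. Values below 1 mean sequential execution.
func RunAll(ctx context.Context, inputs [][]byte, eval EvalFunc, workers int, perSample time.Duration) []Outcome {
	if workers < 1 {
		workers = 1
	}
	out := make([]Outcome, len(inputs))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup
	for i, in := range inputs {
		wg.Add(1)
		sem <- struct{}{} // blocks while all worker slots are busy
		go func(i int, in []byte) {
			defer wg.Done()
			defer func() { <-sem }()
			sctx, cancel := context.WithTimeout(ctx, perSample)
			defer cancel()
			label, err := eval(sctx, in)
			out[i] = Outcome{Label: label, Err: err} // each goroutine owns index i
		}(i, in)
	}
	wg.Wait()
	return out
}

// Demo runs one fast and one deliberately slow input with a 20ms deadline.
func Demo() []Outcome {
	eval := func(ctx context.Context, input []byte) (string, error) {
		if string(input) != "slow" {
			return string(input), nil
		}
		timer := time.NewTimer(200 * time.Millisecond)
		defer timer.Stop()
		select {
		case <-timer.C:
			return string(input), nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	inputs := [][]byte{[]byte("fast"), []byte("slow")}
	return RunAll(context.Background(), inputs, eval, 2, 20*time.Millisecond)
}

func main() {
	for i, o := range Demo() {
		fmt.Printf("sample %d: label=%q err=%v\n", i, o.Label, o.Err)
	}
}
```

The timeout only takes effect if the evaluator watches `ctx.Done()`; a function that ignores its context will run to completion regardless of the deadline.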

type RunResult ¶

type RunResult struct {
	ID        string                  `json:"id"`
	Schema    string                  `json:"schema"`
	Timestamp time.Time               `json:"timestamp"`
	Tag       string                  `json:"tag,omitempty"`
	Duration  time.Duration           `json:"duration_ms"`
	Dataset   DatasetInfo             `json:"dataset"`
	Metrics   []MetricResult          `json:"metrics"`
	Branches  map[string]BranchResult `json:"branches"`
	Samples   []SampleResult          `json:"samples"`
	Curves    map[string]any          `json:"curves,omitempty"`
}

RunResult holds the complete output of a benchmark run.

type RunStorage ¶

type RunStorage interface {
	Save(ctx context.Context, result *RunResult) (string, error)
	Load(ctx context.Context, runID string) (*RunResult, error)
	Latest(ctx context.Context) (*RunResult, error)
	List(ctx context.Context, opts ...ListOption) ([]RunSummary, error)
}

RunStorage persists benchmark results.

type RunSummary ¶

type RunSummary struct {
	ID        string    `json:"id"`
	Timestamp time.Time `json:"timestamp"`
	Tag       string    `json:"tag,omitempty"`
	Dataset   string    `json:"dataset"`
	F1        float64   `json:"f1,omitempty"`
}

RunSummary is a lightweight summary for listing runs.

type Sample ¶

type Sample[L comparable] struct {
	ID       string
	Input    []byte
	Label    L
	Source   string
	Metadata map[string]any
}

Sample represents a labeled data point in an evaluation dataset.

type SampleResult ¶

type SampleResult struct {
	ID           string             `json:"id"`
	Label        string             `json:"label"`
	Predicted    string             `json:"predicted"`
	Score        float64            `json:"score"`
	Correct      bool               `json:"correct"`
	BranchScores map[string]float64 `json:"branch_scores,omitempty"`
	Duration     time.Duration      `json:"duration_ms"`
	Error        string             `json:"error,omitempty"`
}

SampleResult holds per-sample evaluation results.

type ScoreDistribution ¶

type ScoreDistribution struct {
	Label  string
	Bins   []float64
	Counts []int
}

ScoreDistribution holds a histogram of scores for a label.

type ScoredSample ¶

type ScoredSample[L comparable] struct {
	Sample     Sample[L]
	Prediction Prediction[L]
}

ScoredSample pairs a ground-truth sample with its prediction.

type ThresholdPoint ¶

type ThresholdPoint struct {
	Threshold float64
	Precision float64
	Recall    float64
	F1        float64
	Accuracy  float64
}

ThresholdPoint holds classification metrics at a specific threshold.

type TimingMiddleware ¶

type TimingMiddleware[L comparable] struct {
	// contains filtered or unexported fields
}

TimingMiddleware wraps an evaluator and records per-sample execution times.

func WithTiming ¶

func WithTiming[L comparable](eval Evaluator[L]) *TimingMiddleware[L]

WithTiming wraps an evaluator with timing instrumentation.

func (*TimingMiddleware[L]) Execute ¶

func (m *TimingMiddleware[L]) Execute(ctx context.Context, input []byte) (Prediction[L], error)

func (*TimingMiddleware[L]) IsAvailable ¶

func (m *TimingMiddleware[L]) IsAvailable(ctx context.Context) bool

func (*TimingMiddleware[L]) Name ¶

func (m *TimingMiddleware[L]) Name() string

func (*TimingMiddleware[L]) Timings ¶

func (m *TimingMiddleware[L]) Timings() map[string]time.Duration

Timings returns a copy of the recorded per-sample durations.

Directories ¶

Path	Synopsis
metric	Package metric provides pluggable evaluation metrics for the bench framework.
report	Package report generates formatted output from bench evaluation results.
viz	Package viz generates SVG visualizations from bench evaluation results.
