Documentation ¶
Overview ¶
Package bench provides a pluggable evaluation framework for benchmarking providers against labeled datasets.
The framework bridges gokit's provider and pipeline packages to create a composable evaluation workflow:
- Evaluator = provider.RequestResponse[[]byte, Prediction[L]]
- Dataset = pipeline.Iterator[Sample[L]], loaded from manifest files
- Metrics = pluggable scorers that consume (ground-truth, prediction) pairs
Architecture ¶
bench models evaluation as a data pipeline:
Dataset → Evaluator → ScoredSample → Metrics → Results
Datasets are loaded lazily through pipeline.Iterator, so arbitrarily large datasets stream through memory without loading everything at once. Evaluators wrap any provider.RequestResponse via the FromProvider adapter, or use EvaluatorFunc for quick inline definitions. Metrics are stateless functions that receive a slice of ScoredSample and return scalar or structured results.
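The data flow described above can be sketched without importing bench at all. The types below are reduced local stand-ins that only mirror the documented shapes of Sample, Prediction and ScoredSample; the evaluate function and its rule are hypothetical, and accuracy is just one possible stateless metric, not necessarily how the metric sub-package implements it.

```go
package main

import "fmt"

// Local stand-ins mirroring the documented shapes; not the real bench types.
type Sample[L comparable] struct {
	ID    string
	Input []byte
	Label L
}

type Prediction[L comparable] struct {
	Label L
	Score float64
}

type ScoredSample[L comparable] struct {
	Sample     Sample[L]
	Prediction Prediction[L]
}

// evaluate is a hypothetical evaluator: non-empty input is "positive".
func evaluate(input []byte) Prediction[string] {
	if len(input) > 0 {
		return Prediction[string]{Label: "positive", Score: 0.9}
	}
	return Prediction[string]{Label: "negative", Score: 0.1}
}

// score pairs every sample with its prediction (Dataset → Evaluator → ScoredSample).
func score(samples []Sample[string]) []ScoredSample[string] {
	scored := make([]ScoredSample[string], 0, len(samples))
	for _, s := range samples {
		scored = append(scored, ScoredSample[string]{Sample: s, Prediction: evaluate(s.Input)})
	}
	return scored
}

// accuracy is a stateless metric: it depends only on the slice it receives.
func accuracy[L comparable](scored []ScoredSample[L]) float64 {
	if len(scored) == 0 {
		return 0
	}
	correct := 0
	for _, s := range scored {
		if s.Sample.Label == s.Prediction.Label {
			correct++
		}
	}
	return float64(correct) / float64(len(scored))
}

func main() {
	samples := []Sample[string]{
		{ID: "a", Input: []byte("x"), Label: "positive"},
		{ID: "b", Input: nil, Label: "negative"},
		{ID: "c", Input: []byte("y"), Label: "negative"},
	}
	fmt.Printf("accuracy: %.2f\n", accuracy(score(samples)))
}
```

Because the metric takes the whole slice and keeps no state, it can be recomputed on any subset of scored samples, for example per source or per branch.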
Quick Start ¶
// Load the dataset from dir; manifest labels are used as-is.
loader := bench.NewDatasetLoader[string](dir, func(s string) (string, error) {
	return s, nil
})
samples, _ := loader.All(ctx) // error ignored for brevity

// Wrap the model in an Evaluator.
eval := bench.EvaluatorFunc("my-model", func(ctx context.Context, input []byte) (bench.Prediction[string], error) {
	label, score := myModel.Predict(input)
	return bench.Prediction[string]{Label: label, Score: score}, nil
})

// Pair each sample with its prediction.
var scored []bench.ScoredSample[string]
for _, s := range samples {
	pred, _ := eval.Execute(ctx, s.Input) // error ignored for brevity
	scored = append(scored, bench.ScoredSample[string]{Sample: s, Prediction: pred})
}

// Compute metrics over the scored samples.
suite := metric.NewSuite(metric.BinaryClassification("positive"))
results := suite.Compute(scored)
Sub-packages ¶
- metric: pluggable metric implementations (classification, confusion matrix, threshold sweep)
- report: result formatting and output (planned)
Index ¶
- Constants
- type BenchRunner
- type BranchOption
- type BranchResult
- type CLIOption
- type CLIRunner
- func (c *CLIRunner) CompareLatest(ctx context.Context) error
- func (c *CLIRunner) CompareRuns(ctx context.Context, baseID, targetID string) error
- func (c *CLIRunner) ListRuns(ctx context.Context, opts ...ListOption) error
- func (c *CLIRunner) RunAndPrint(ctx context.Context, runner *BenchRunner[string], ...) error
- func (c *CLIRunner) ShowRun(ctx context.Context, runID string) error
- type CachingMiddleware
- type CalibrationCurve
- type CompareOption
- type ConfusionMatrixDetail
- type DatasetInfo
- type DatasetLoader
- func (d *DatasetLoader[L]) All(ctx context.Context) ([]Sample[L], error)
- func (d *DatasetLoader[L]) Filter(fn func(ManifestSample) bool) *DatasetLoader[L]
- func (d *DatasetLoader[L]) Iterator(ctx context.Context) (pipeline.Iterator[Sample[L]], error)
- func (d *DatasetLoader[L]) Manifest() (*DatasetManifest, error)
- func (d *DatasetLoader[L]) Pipeline() *pipeline.Pipeline[Sample[L]]
- type DatasetManifest
- type DatasetOption
- type Evaluator
- func EvaluatorFunc[L comparable](name string, fn func(ctx context.Context, input []byte) (Prediction[L], error)) Evaluator[L]
- func FromProcess[L comparable](name string, buildCmd func(Sample[L]) process.Command, ...) Evaluator[L]
- func FromProvider[I, O any, L comparable](p provider.RequestResponse[I, O], toInput func([]byte) I, ...) Evaluator[L]
- type FileStorage
- func (fs *FileStorage) Latest(ctx context.Context) (*RunResult, error)
- func (fs *FileStorage) List(_ context.Context, opts ...ListOption) ([]RunSummary, error)
- func (fs *FileStorage) Load(_ context.Context, runID string) (*RunResult, error)
- func (fs *FileStorage) Save(_ context.Context, result *RunResult) (string, error)
- type LabelMapper
- type ListOption
- type ListParams
- type ManifestSample
- type MetricChange
- type MetricResult
- type PrecisionRecallCurve
- type Prediction
- type ROCCurve
- type RunComparator
- type RunDiff
- type RunMetric
- type RunOption
- func WithConcurrency[L comparable](n int) RunOption[L]
- func WithFailOnRegression[L comparable](b bool) RunOption[L]
- func WithMetrics[L comparable](metrics ...RunMetric[L]) RunOption[L]
- func WithStorage[L comparable](s RunStorage) RunOption[L]
- func WithTag[L comparable](tag string) RunOption[L]
- func WithTargets[L comparable](targets map[string]float64) RunOption[L]
- func WithTimeout[L comparable](d time.Duration) RunOption[L]
- type RunResult
- type RunStorage
- type RunSummary
- type Sample
- type SampleResult
- type ScoreDistribution
- type ScoredSample
- type ThresholdPoint
- type TimingMiddleware
Constants ¶
const SchemaURL = "https://gokit.dev/bench/v1/schema.json"
SchemaURL is the schema URL for Bench JSON output.
const SchemaVersion = "1.0"
SchemaVersion is the current Bench JSON schema version.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BenchRunner ¶
type BenchRunner[L comparable] struct {
// contains filtered or unexported fields
}
BenchRunner orchestrates evaluation runs.
func NewBenchRunner ¶
func NewBenchRunner[L comparable](opts ...RunOption[L]) *BenchRunner[L]
NewBenchRunner creates a new runner with the given options.
func (*BenchRunner[L]) Register ¶
func (r *BenchRunner[L]) Register(name string, eval Evaluator[L], opts ...BranchOption)
Register adds an evaluator branch to the runner.
func (*BenchRunner[L]) Run ¶
func (r *BenchRunner[L]) Run(ctx context.Context, dataset *DatasetLoader[L]) (*RunResult, error)
Run executes the benchmark: loads samples, runs evaluators, computes metrics, and stores results.
type BranchOption ¶
type BranchOption func(*branchConfig)
BranchOption configures a branch registration.
func WithTier ¶
func WithTier(tier int) BranchOption
WithTier sets the tier for a branch (used for tiered evaluation).
type BranchResult ¶
type BranchResult struct {
Name string `json:"name"`
Tier int `json:"tier"`
Metrics map[string]float64 `json:"metrics"`
AvgScorePositive float64 `json:"avg_score_positive"`
AvgScoreNegative float64 `json:"avg_score_negative"`
Duration time.Duration `json:"duration_ms"`
Errors int `json:"errors"`
}
BranchResult holds results for a single evaluator branch.
type CLIOption ¶
type CLIOption func(*CLIRunner)
CLIOption configures a CLIRunner.
func WithOutput ¶
WithOutput sets the output writer for CLI output (default: os.Stdout).
type CLIRunner ¶
type CLIRunner struct {
// contains filtered or unexported fields
}
CLIRunner provides CLI-friendly helpers for benchmark operations.
func NewCLIRunner ¶
func NewCLIRunner(storage RunStorage, opts ...CLIOption) *CLIRunner
NewCLIRunner creates a CLI runner backed by the given storage.
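func (*CLIRunner) CompareLatest ¶
func (c *CLIRunner) CompareLatest(ctx context.Context) error
CompareLatest compares the two most recent runs.
func (*CLIRunner) CompareRuns ¶
func (c *CLIRunner) CompareRuns(ctx context.Context, baseID, targetID string) error
CompareRuns loads two runs by ID and prints their comparison.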
func (*CLIRunner) ListRuns ¶
func (c *CLIRunner) ListRuns(ctx context.Context, opts ...ListOption) error
ListRuns prints a table of stored runs.
func (*CLIRunner) RunAndPrint ¶
func (c *CLIRunner) RunAndPrint(ctx context.Context, runner *BenchRunner[string], dataset *DatasetLoader[string]) error
RunAndPrint executes a benchmark run and prints the results.
type CachingMiddleware ¶
type CachingMiddleware[L comparable] struct {
// contains filtered or unexported fields
}
CachingMiddleware wraps an evaluator and caches results by input hash.
func WithCaching ¶
func WithCaching[L comparable](eval Evaluator[L]) *CachingMiddleware[L]
WithCaching wraps an evaluator with SHA-256 input-based caching.
func (*CachingMiddleware[L]) Execute ¶
func (m *CachingMiddleware[L]) Execute(ctx context.Context, input []byte) (Prediction[L], error)
func (*CachingMiddleware[L]) IsAvailable ¶
func (m *CachingMiddleware[L]) IsAvailable(ctx context.Context) bool
func (*CachingMiddleware[L]) Name ¶
func (m *CachingMiddleware[L]) Name() string
func (*CachingMiddleware[L]) Stats ¶
func (m *CachingMiddleware[L]) Stats() (hits, misses int)
Stats returns the number of cache hits and misses.
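As a rough illustration of caching by input hash, the sketch below memoizes results keyed by the SHA-256 digest of the input and counts hits and misses. It is not the CachingMiddleware implementation: evalFunc, cache and newCache are hypothetical names, the result type is simplified to a string, and the choice not to cache errors is an assumption.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// evalFunc is a hypothetical stand-in for an evaluator's Execute method.
type evalFunc func(input []byte) (string, error)

// cache memoizes results keyed by the SHA-256 digest of the input.
type cache struct {
	mu           sync.Mutex
	next         evalFunc
	results      map[[sha256.Size]byte]string
	hits, misses int
}

func newCache(next evalFunc) *cache {
	return &cache{next: next, results: make(map[[sha256.Size]byte]string)}
}

// Execute returns the cached result for identical input bytes, or calls the
// wrapped function on a miss. Holding the lock during the call keeps the
// sketch simple but serializes evaluation.
func (c *cache) Execute(input []byte) (string, error) {
	key := sha256.Sum256(input)
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.results[key]; ok {
		c.hits++
		return v, nil
	}
	c.misses++
	v, err := c.next(input)
	if err != nil {
		return "", err // assumption: failed calls are not cached
	}
	c.results[key] = v
	return v, nil
}

// Stats reports cache hits and misses so far.
func (c *cache) Stats() (hits, misses int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.hits, c.misses
}

func main() {
	c := newCache(func(in []byte) (string, error) {
		return fmt.Sprintf("len=%d", len(in)), nil
	})
	for _, in := range []string{"abc", "abc", "abcd"} {
		if _, err := c.Execute([]byte(in)); err != nil {
			fmt.Println("error:", err)
		}
	}
	hits, misses := c.Stats()
	fmt.Println("hits:", hits, "misses:", misses)
}
```

Keying on a digest rather than the raw bytes keeps map keys at a fixed 32 bytes regardless of input size.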
type CalibrationCurve ¶
type CalibrationCurve struct {
PredictedProbability []float64
ActualFrequency []float64
BinCount []int
}
CalibrationCurve holds calibration curve data.
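One plausible way to fill these three slices is equal-width binning of scores in [0, 1]: each bin records the mean predicted score, the observed fraction of positives, and the sample count. The sketch below assumes that scheme and skips empty bins; the calibrate function is hypothetical and the package may bin differently.

```go
package main

import "fmt"

// CalibrationCurve copies the documented struct.
type CalibrationCurve struct {
	PredictedProbability []float64
	ActualFrequency      []float64
	BinCount             []int
}

// calibrate groups scores into equal-width bins over [0, 1].
// scores and positive must have the same length, and bins must be > 0.
func calibrate(scores []float64, positive []bool, bins int) CalibrationCurve {
	sumScore := make([]float64, bins)
	sumPos := make([]float64, bins)
	count := make([]int, bins)
	for i, s := range scores {
		b := int(s * float64(bins))
		if b >= bins {
			b = bins - 1 // a score of exactly 1.0 belongs to the last bin
		}
		if b < 0 {
			b = 0
		}
		sumScore[b] += s
		if positive[i] {
			sumPos[b]++
		}
		count[b]++
	}
	var c CalibrationCurve
	for b := 0; b < bins; b++ {
		if count[b] == 0 {
			continue // assumption: empty bins are omitted
		}
		n := float64(count[b])
		c.PredictedProbability = append(c.PredictedProbability, sumScore[b]/n)
		c.ActualFrequency = append(c.ActualFrequency, sumPos[b]/n)
		c.BinCount = append(c.BinCount, count[b])
	}
	return c
}

func main() {
	c := calibrate(
		[]float64{0.1, 0.2, 0.8, 0.9},
		[]bool{false, false, true, false},
		2,
	)
	for i := range c.BinCount {
		fmt.Printf("bin %d: predicted=%.2f actual=%.2f n=%d\n",
			i, c.PredictedProbability[i], c.ActualFrequency[i], c.BinCount[i])
	}
}
```

A well-calibrated evaluator has ActualFrequency close to PredictedProbability in every bin.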
type CompareOption ¶
type CompareOption func(*RunComparator)
CompareOption configures comparison.
func WithChangeThreshold ¶
func WithChangeThreshold(t float64) CompareOption
WithChangeThreshold sets the minimum absolute change to report as significant (default: 0.01).
type ConfusionMatrixDetail ¶
type ConfusionMatrixDetail struct {
Labels []string
Matrix [][]int
Orientation string // "row=actual, col=predicted"
}
ConfusionMatrixDetail holds confusion matrix data.
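To make the "row=actual, col=predicted" orientation concrete, the following sketch builds the matrix from parallel slices of actual and predicted labels. The confusion function is a hypothetical helper, and ignoring labels outside the declared set is an assumption.

```go
package main

import "fmt"

// ConfusionMatrixDetail copies the documented struct.
type ConfusionMatrixDetail struct {
	Labels      []string
	Matrix      [][]int
	Orientation string
}

// confusion counts (actual, predicted) pairs; Matrix[r][c] is the number of
// samples whose actual label is Labels[r] and predicted label is Labels[c].
func confusion(labels, actual, predicted []string) ConfusionMatrixDetail {
	index := make(map[string]int, len(labels))
	for i, l := range labels {
		index[l] = i
	}
	m := make([][]int, len(labels))
	for i := range m {
		m[i] = make([]int, len(labels))
	}
	for i := range actual {
		r, okR := index[actual[i]]
		c, okC := index[predicted[i]]
		if !okR || !okC {
			continue // assumption: unknown labels are skipped
		}
		m[r][c]++
	}
	return ConfusionMatrixDetail{
		Labels:      labels,
		Matrix:      m,
		Orientation: "row=actual, col=predicted",
	}
}

func main() {
	d := confusion(
		[]string{"spam", "ham"},
		[]string{"spam", "spam", "ham", "ham"},
		[]string{"spam", "ham", "ham", "ham"},
	)
	for i, row := range d.Matrix {
		fmt.Println(d.Labels[i], row)
	}
}
```

With this orientation, the diagonal holds correct predictions and each row sums to the number of samples with that actual label.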
type DatasetInfo ¶
type DatasetInfo struct {
Name string `json:"name"`
Version string `json:"version"`
SampleCount int `json:"sample_count"`
LabelDistribution map[string]int `json:"label_distribution"`
}
DatasetInfo holds summary info about the dataset used.
type DatasetLoader ¶
type DatasetLoader[L comparable] struct {
// contains filtered or unexported fields
}
DatasetLoader loads labeled samples from a manifest file.
func NewDatasetLoader ¶
func NewDatasetLoader[L comparable](dir string, mapper LabelMapper[L], opts ...DatasetOption) *DatasetLoader[L]
NewDatasetLoader creates a loader for the given directory.
func (*DatasetLoader[L]) All ¶
func (d *DatasetLoader[L]) All(ctx context.Context) ([]Sample[L], error)
All loads all samples into memory.
func (*DatasetLoader[L]) Filter ¶
func (d *DatasetLoader[L]) Filter(fn func(ManifestSample) bool) *DatasetLoader[L]
Filter returns a new loader that only yields matching samples.
func (*DatasetLoader[L]) Manifest ¶
func (d *DatasetLoader[L]) Manifest() (*DatasetManifest, error)
Manifest returns the parsed manifest.
type DatasetManifest ¶
type DatasetManifest struct {
Name string `json:"name"`
Version string `json:"version"`
Samples []ManifestSample `json:"samples"`
}
DatasetManifest describes a labeled dataset on disk.
type DatasetOption ¶
type DatasetOption func(*datasetConfig)
DatasetOption configures dataset loading.
func WithManifestFile ¶
func WithManifestFile(name string) DatasetOption
WithManifestFile sets the manifest filename (default: "manifest.json").
type Evaluator ¶
type Evaluator[L comparable] interface {
provider.RequestResponse[[]byte, Prediction[L]]
}
Evaluator is a provider.RequestResponse that produces predictions from raw input.
func EvaluatorFunc ¶
func EvaluatorFunc[L comparable](name string, fn func(ctx context.Context, input []byte) (Prediction[L], error)) Evaluator[L]
EvaluatorFunc wraps a plain function as an Evaluator.
func FromProcess ¶
func FromProcess[L comparable](name string, buildCmd func(Sample[L]) process.Command, parseOutput func(*process.Result) (Prediction[L], error)) Evaluator[L]
FromProcess creates an Evaluator that calls a subprocess. buildCmd creates the process command from a sample's raw input. parseOutput extracts a prediction from the process result.
func FromProvider ¶
func FromProvider[I, O any, L comparable](p provider.RequestResponse[I, O], toInput func([]byte) I, toPrediction func(O) Prediction[L]) Evaluator[L]
FromProvider adapts any RequestResponse provider into an Evaluator using mapper functions for input/output transformation.
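The adapter idea can be sketched with local types. RequestResponse below is a stand-in whose method set (Name, IsAvailable, Execute) is assumed from the methods the middleware types on this page expose; it is not copied from the provider package. The lengthProvider, adapted and classify names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

// RequestResponse is a local stand-in; its method set is an assumption.
type RequestResponse[I, O any] interface {
	Name() string
	IsAvailable(ctx context.Context) bool
	Execute(ctx context.Context, in I) (O, error)
}

// Prediction is a reduced version of the documented struct.
type Prediction[L comparable] struct {
	Label L
	Score float64
}

// adapted converts []byte input to I, calls the provider, and maps O to a Prediction.
type adapted[I, O any, L comparable] struct {
	p            RequestResponse[I, O]
	toInput      func([]byte) I
	toPrediction func(O) Prediction[L]
}

func (a adapted[I, O, L]) Name() string { return a.p.Name() }

func (a adapted[I, O, L]) IsAvailable(ctx context.Context) bool { return a.p.IsAvailable(ctx) }

func (a adapted[I, O, L]) Execute(ctx context.Context, input []byte) (Prediction[L], error) {
	out, err := a.p.Execute(ctx, a.toInput(input))
	if err != nil {
		return Prediction[L]{}, err
	}
	return a.toPrediction(out), nil
}

// lengthProvider is a hypothetical provider: string in, length out.
type lengthProvider struct{}

func (lengthProvider) Name() string                     { return "length" }
func (lengthProvider) IsAvailable(context.Context) bool { return true }
func (lengthProvider) Execute(_ context.Context, s string) (int, error) {
	return len(s), nil
}

// classify adapts lengthProvider into a []byte → Prediction[string] evaluator.
func classify(text string) (Prediction[string], error) {
	var eval RequestResponse[[]byte, Prediction[string]] = adapted[string, int, string]{
		p:       lengthProvider{},
		toInput: func(b []byte) string { return string(b) },
		toPrediction: func(n int) Prediction[string] {
			if n > 5 {
				return Prediction[string]{Label: "long", Score: 1}
			}
			return Prediction[string]{Label: "short", Score: 0}
		},
	}
	return eval.Execute(context.Background(), []byte(text))
}

func main() {
	for _, text := range []string{"hi", "hello world"} {
		p, err := classify(text)
		fmt.Println(text, p.Label, err)
	}
}
```

The two mapper functions keep the provider unaware of the evaluation framework: only the adapter knows about raw bytes and predictions.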
type FileStorage ¶
type FileStorage struct {
// contains filtered or unexported fields
}
FileStorage stores results as JSON files on disk.
func NewFileStorage ¶
func NewFileStorage(dir string) *FileStorage
NewFileStorage creates a FileStorage that persists results under dir.
func (*FileStorage) Latest ¶
func (fs *FileStorage) Latest(ctx context.Context) (*RunResult, error)
Latest returns the most recent RunResult by timestamp.
func (*FileStorage) List ¶
func (fs *FileStorage) List(_ context.Context, opts ...ListOption) ([]RunSummary, error)
List returns summaries of stored results, sorted by timestamp descending.
type LabelMapper ¶
type LabelMapper[L comparable] func(string) (L, error)
LabelMapper converts a string label from a manifest into a typed label.
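A mapper is useful when manifest labels are strings but the evaluation works with another comparable type. The sketch below redeclares LabelMapper locally and defines a hypothetical boolLabel mapper; the accepted spellings are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// LabelMapper is redeclared locally so the example compiles on its own.
type LabelMapper[L comparable] func(string) (L, error)

// boolLabel is a hypothetical mapper from manifest strings to bool labels.
func boolLabel(s string) (bool, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "positive", "true", "1":
		return true, nil
	case "negative", "false", "0":
		return false, nil
	default:
		return false, fmt.Errorf("unknown label %q", s)
	}
}

func main() {
	var m LabelMapper[bool] = boolLabel
	for _, raw := range []string{"positive", " Negative ", "maybe"} {
		v, err := m(raw)
		fmt.Println(v, err)
	}
}
```

Returning an error for unknown strings, rather than a default label, keeps mislabeled manifest entries from silently skewing the metrics.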
type ListOption ¶
type ListOption func(*listConfig)
ListOption configures result listing.
func WithDatasetFilter ¶
func WithDatasetFilter(dataset string) ListOption
WithDatasetFilter filters results by dataset name.
func WithLimit ¶
func WithLimit(n int) ListOption
WithLimit sets the maximum number of results to return.
func WithTagFilter ¶
func WithTagFilter(tag string) ListOption
WithTagFilter filters results by tag.
type ListParams ¶
ListParams holds the resolved parameters from ListOption values.
func ResolveListOptions ¶
func ResolveListOptions(opts ...ListOption) ListParams
ResolveListOptions applies the given options and returns the resolved parameters. This is useful for external RunStorage implementations that need to inspect filter values.
type ManifestSample ¶
type ManifestSample struct {
ID string `json:"id"`
File string `json:"file"`
Label string `json:"label"`
Source string `json:"source,omitempty"`
Meta map[string]any `json:"metadata,omitempty"`
}
ManifestSample is one entry in a dataset manifest file.
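The JSON tags above imply a manifest layout like the one in the sketch below. The structs are copied from this page; the file contents (dataset name, sample IDs, file names, metadata keys) are made up for illustration, and parseManifest is a hypothetical helper rather than part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Structs copied from the documentation above.
type ManifestSample struct {
	ID     string         `json:"id"`
	File   string         `json:"file"`
	Label  string         `json:"label"`
	Source string         `json:"source,omitempty"`
	Meta   map[string]any `json:"metadata,omitempty"`
}

type DatasetManifest struct {
	Name    string           `json:"name"`
	Version string           `json:"version"`
	Samples []ManifestSample `json:"samples"`
}

// exampleManifest is illustrative data, not a file shipped with the package.
const exampleManifest = `{
  "name": "spam-v1",
  "version": "1.0",
  "samples": [
    {"id": "s1", "file": "s1.txt", "label": "spam", "source": "inbox"},
    {"id": "s2", "file": "s2.txt", "label": "ham", "metadata": {"lang": "en"}}
  ]
}`

// parseManifest decodes manifest JSON into the documented struct.
func parseManifest(data []byte) (*DatasetManifest, error) {
	var m DatasetManifest
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("parse manifest: %w", err)
	}
	return &m, nil
}

func main() {
	m, err := parseManifest([]byte(exampleManifest))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(m.Name, m.Version, len(m.Samples))
	for _, s := range m.Samples {
		fmt.Println(s.ID, s.File, s.Label)
	}
}
```

Note that the Go field Meta is serialized under the key "metadata", so manifest files should use that key.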
type MetricChange ¶
type MetricChange struct {
Name string
OldValue float64
NewValue float64
Delta float64
Improved bool
Significant bool // above threshold
}
MetricChange represents a change in a metric between two runs.
type MetricResult ¶
type MetricResult struct {
Name string `json:"name"`
Value float64 `json:"value"`
Values map[string]float64 `json:"values,omitempty"`
Detail any `json:"detail,omitempty"`
}
MetricResult pairs a metric name with its result.
type PrecisionRecallCurve ¶
PrecisionRecallCurve holds precision-recall curve data.
type Prediction ¶
type Prediction[L comparable] struct {
SampleID string
Label L
Score float64
Scores map[L]float64
Metadata map[string]any
}
Prediction represents an evaluator's output for a single sample.
type RunComparator ¶
type RunComparator struct {
// contains filtered or unexported fields
}
RunComparator compares two benchmark runs.
func NewRunComparator ¶
func NewRunComparator(opts ...CompareOption) *RunComparator
NewRunComparator creates a comparator with default settings.
func (*RunComparator) Compare ¶
func (c *RunComparator) Compare(base, target *RunResult) *RunDiff
Compare compares two RunResults and returns the diff.
type RunDiff ¶
type RunDiff struct {
BaseID string
TargetID string
Changes []MetricChange
Fixed []string // sample IDs that went from wrong to correct
Regressed []string // sample IDs that went from correct to wrong
}
RunDiff holds the comparison result between two benchmark runs.
func (*RunDiff) HasRegression ¶
HasRegression returns true if any metric decreased significantly.
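How MetricChange values and the regression check relate can be sketched as follows. This is an interpretation under stated assumptions, not the package's code: higher metric values are treated as better, a change counts as significant when its absolute delta is at least the threshold (0.01 mirrors the documented default), and the compare and hasRegression helpers are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// MetricChange copies the documented struct.
type MetricChange struct {
	Name        string
	OldValue    float64
	NewValue    float64
	Delta       float64
	Improved    bool
	Significant bool
}

// compare builds a MetricChange for one metric.
// Assumptions: higher is better; significance means |delta| >= threshold.
func compare(name string, oldV, newV, threshold float64) MetricChange {
	delta := newV - oldV
	return MetricChange{
		Name:        name,
		OldValue:    oldV,
		NewValue:    newV,
		Delta:       delta,
		Improved:    delta > 0,
		Significant: math.Abs(delta) >= threshold,
	}
}

// hasRegression reports whether any metric decreased significantly.
func hasRegression(changes []MetricChange) bool {
	for _, c := range changes {
		if c.Significant && c.Delta < 0 {
			return true
		}
	}
	return false
}

func main() {
	const threshold = 0.01
	changes := []MetricChange{
		compare("f1", 0.80, 0.85, threshold),
		compare("precision", 0.90, 0.895, threshold),
		compare("recall", 0.70, 0.60, threshold),
	}
	for _, c := range changes {
		fmt.Printf("%s: delta=%+.3f improved=%t significant=%t\n",
			c.Name, c.Delta, c.Improved, c.Significant)
	}
	fmt.Println("regression:", hasRegression(changes))
}
```

Under these assumptions, a small dip below the threshold is reported in Changes but does not count as a regression.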
type RunMetric ¶
type RunMetric[L comparable] interface {
Name() string
Compute(scored []ScoredSample[L]) MetricResult
}
RunMetric computes evaluation scores from predictions vs ground truth. This interface mirrors metric.Metric[L] but lives in bench to avoid an import cycle (bench/metric already imports bench). Use metric.AsRunMetric to adapt metric.Metric[L] values.
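A custom metric only has to satisfy the two-method interface. The sketch below uses reduced local copies of the documented types and a hypothetical macroRecall metric that puts per-label recall in Values and their unweighted mean in Value; that use of the two fields is an assumption about how MetricResult is meant to be populated.

```go
package main

import "fmt"

// Reduced local copies of the documented types.
type Sample[L comparable] struct{ Label L }
type Prediction[L comparable] struct{ Label L }

type ScoredSample[L comparable] struct {
	Sample     Sample[L]
	Prediction Prediction[L]
}

type MetricResult struct {
	Name   string
	Value  float64
	Values map[string]float64
}

type RunMetric[L comparable] interface {
	Name() string
	Compute(scored []ScoredSample[L]) MetricResult
}

// macroRecall is a hypothetical metric: per-label recall plus their mean.
type macroRecall[L comparable] struct{}

// Compile-time check that macroRecall satisfies the interface.
var _ RunMetric[string] = macroRecall[string]{}

func (macroRecall[L]) Name() string { return "macro_recall" }

func (m macroRecall[L]) Compute(scored []ScoredSample[L]) MetricResult {
	total := map[L]int{}
	correct := map[L]int{}
	for _, s := range scored {
		total[s.Sample.Label]++
		if s.Prediction.Label == s.Sample.Label {
			correct[s.Sample.Label]++
		}
	}
	res := MetricResult{Name: m.Name(), Values: map[string]float64{}}
	for label, n := range total {
		r := float64(correct[label]) / float64(n)
		res.Values[fmt.Sprint(label)] = r
		res.Value += r
	}
	if len(total) > 0 {
		res.Value /= float64(len(total))
	}
	return res
}

func main() {
	scored := []ScoredSample[string]{
		{Sample[string]{"spam"}, Prediction[string]{"spam"}},
		{Sample[string]{"spam"}, Prediction[string]{"ham"}},
		{Sample[string]{"ham"}, Prediction[string]{"ham"}},
	}
	res := macroRecall[string]{}.Compute(scored)
	fmt.Println(res.Name, res.Value, res.Values)
}
```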
type RunOption ¶
type RunOption[L comparable] func(*runConfig[L])
RunOption configures a BenchRunner.
func WithConcurrency ¶
func WithConcurrency[L comparable](n int) RunOption[L]
WithConcurrency sets the number of parallel evaluation workers. Values <= 1 mean sequential execution.
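The "values <= 1 mean sequential" rule can be illustrated with a small worker pool. This is a generic sketch of the pattern, not the runner's implementation; evalAll and its signature are hypothetical, and inputs and outputs are simplified to strings and ints.

```go
package main

import (
	"fmt"
	"sync"
)

// evalAll applies eval to every input, using the given number of workers.
// With workers <= 1 it runs sequentially in the calling goroutine.
func evalAll(inputs []string, workers int, eval func(string) int) []int {
	out := make([]int, len(inputs))
	if workers <= 1 {
		for i, in := range inputs {
			out[i] = eval(in)
		}
		return out
	}

	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handed to exactly one worker, so writes
				// to out never overlap.
				out[i] = eval(inputs[i])
			}
		}()
	}
	for i := range inputs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	inputs := []string{"a", "bb", "ccc", "dddd"}
	length := func(s string) int { return len(s) }
	fmt.Println(evalAll(inputs, 1, length))
	fmt.Println(evalAll(inputs, 3, length))
}
```

Writing results by index keeps the output in dataset order even though workers finish in arbitrary order.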
func WithFailOnRegression ¶
func WithFailOnRegression[L comparable](b bool) RunOption[L]
WithFailOnRegression configures whether the run should fail if a regression is detected compared to the previous run.
func WithMetrics ¶
func WithMetrics[L comparable](metrics ...RunMetric[L]) RunOption[L]
WithMetrics configures the metrics to compute.
func WithStorage ¶
func WithStorage[L comparable](s RunStorage) RunOption[L]
WithStorage configures the storage backend for persisting results.
func WithTag ¶
func WithTag[L comparable](tag string) RunOption[L]
WithTag sets a human-readable tag for the run.
func WithTargets ¶
func WithTargets[L comparable](targets map[string]float64) RunOption[L]
WithTargets sets metric target thresholds (metric name → minimum value).
func WithTimeout ¶
func WithTimeout[L comparable](d time.Duration) RunOption[L]
WithTimeout sets the per-sample evaluation timeout.
type RunResult ¶
type RunResult struct {
ID string `json:"id"`
Schema string `json:"schema"`
Timestamp time.Time `json:"timestamp"`
Tag string `json:"tag,omitempty"`
Duration time.Duration `json:"duration_ms"`
Dataset DatasetInfo `json:"dataset"`
Metrics []MetricResult `json:"metrics"`
Branches map[string]BranchResult `json:"branches"`
Samples []SampleResult `json:"samples"`
Curves map[string]any `json:"curves,omitempty"`
}
RunResult holds the complete output of a benchmark run.
type RunStorage ¶
type RunStorage interface {
Save(ctx context.Context, result *RunResult) (string, error)
Load(ctx context.Context, runID string) (*RunResult, error)
Latest(ctx context.Context) (*RunResult, error)
List(ctx context.Context, opts ...ListOption) ([]RunSummary, error)
}
RunStorage persists benchmark results.
type RunSummary ¶
type RunSummary struct {
ID string `json:"id"`
Timestamp time.Time `json:"timestamp"`
Tag string `json:"tag,omitempty"`
Dataset string `json:"dataset"`
F1 float64 `json:"f1,omitempty"`
}
RunSummary is a lightweight summary for listing runs.
type Sample ¶
type Sample[L comparable] struct {
ID string
Input []byte
Label L
Source string
Metadata map[string]any
}
Sample represents a labeled data point in an evaluation dataset.
type SampleResult ¶
type SampleResult struct {
ID string `json:"id"`
Label string `json:"label"`
Predicted string `json:"predicted"`
Score float64 `json:"score"`
Correct bool `json:"correct"`
BranchScores map[string]float64 `json:"branch_scores,omitempty"`
Duration time.Duration `json:"duration_ms"`
Error string `json:"error,omitempty"`
}
SampleResult holds per-sample evaluation results.
type ScoreDistribution ¶
ScoreDistribution holds a histogram of scores for a label.
type ScoredSample ¶
type ScoredSample[L comparable] struct {
Sample Sample[L]
Prediction Prediction[L]
}
ScoredSample pairs a ground-truth sample with its prediction.
type ThresholdPoint ¶
type ThresholdPoint struct {
Threshold float64
Precision float64
Recall float64
F1 float64
Accuracy float64
}
ThresholdPoint holds classification metrics at a specific threshold.
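For a binary task, one point of a threshold sweep can be computed as in the sketch below. It assumes a sample is predicted positive when its score is greater than or equal to the threshold, and that undefined ratios (zero denominators) are reported as 0; atThreshold is a hypothetical helper, not a function of the package.

```go
package main

import "fmt"

// ThresholdPoint copies the documented struct.
type ThresholdPoint struct {
	Threshold float64
	Precision float64
	Recall    float64
	F1        float64
	Accuracy  float64
}

// atThreshold classifies each score as positive when score >= t and
// derives the standard binary classification metrics.
func atThreshold(scores []float64, positive []bool, t float64) ThresholdPoint {
	var tp, fp, tn, fn float64
	for i, s := range scores {
		pred := s >= t
		switch {
		case pred && positive[i]:
			tp++
		case pred && !positive[i]:
			fp++
		case !pred && positive[i]:
			fn++
		default:
			tn++
		}
	}
	p := ThresholdPoint{Threshold: t}
	if tp+fp > 0 {
		p.Precision = tp / (tp + fp)
	}
	if tp+fn > 0 {
		p.Recall = tp / (tp + fn)
	}
	if p.Precision+p.Recall > 0 {
		p.F1 = 2 * p.Precision * p.Recall / (p.Precision + p.Recall)
	}
	if n := tp + fp + tn + fn; n > 0 {
		p.Accuracy = (tp + tn) / n
	}
	return p
}

func main() {
	scores := []float64{0.9, 0.8, 0.4, 0.2}
	positive := []bool{true, false, true, false}
	for _, t := range []float64{0.3, 0.5, 0.85} {
		fmt.Printf("%+v\n", atThreshold(scores, positive, t))
	}
}
```

Sweeping t over a grid and collecting the points gives the data needed to pick an operating threshold, for example the one with the highest F1.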
type TimingMiddleware ¶
type TimingMiddleware[L comparable] struct {
// contains filtered or unexported fields
}
TimingMiddleware wraps an evaluator and records per-sample execution times.
func WithTiming ¶
func WithTiming[L comparable](eval Evaluator[L]) *TimingMiddleware[L]
WithTiming wraps an evaluator with timing instrumentation.
func (*TimingMiddleware[L]) Execute ¶
func (m *TimingMiddleware[L]) Execute(ctx context.Context, input []byte) (Prediction[L], error)
func (*TimingMiddleware[L]) IsAvailable ¶
func (m *TimingMiddleware[L]) IsAvailable(ctx context.Context) bool
func (*TimingMiddleware[L]) Name ¶
func (m *TimingMiddleware[L]) Name() string
Directories ¶
| Path | Synopsis |
|---|---|
| metric | Package metric provides pluggable evaluation metrics for the bench framework. |
| report | Package report generates formatted output from bench evaluation results. |
| viz | Package viz generates SVG visualizations from bench evaluation results. |