experiment

package
Version: v1.3.2
Published: Mar 28, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package experiment provides tools for evaluating and comparing agent configurations.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ABOption

type ABOption func(*abConfig)

ABOption configures ABTest.

func WithJudge

func WithJudge(agent *daneel.Agent) ABOption

WithJudge sets an LLM judge agent to score each run pair.

func WithMetrics

func WithMetrics(ms ...Metric) ABOption

WithMetrics specifies which metrics to collect.

func WithRuns

func WithRuns(n int) ABOption

WithRuns sets the number of runs per candidate (default: 1).

type ABResult

type ABResult struct {
	NameA    string
	NameB    string
	ScoreA   float64
	ScoreB   float64
	Winner   string
	MetricsA MetricSnapshot // averages across runs
	MetricsB MetricSnapshot
	Runs     []RunPair
}

ABResult is the aggregate outcome of an ABTest.

func ABTest

func ABTest(ctx context.Context, input string, agentA, agentB *daneel.Agent, opts ...ABOption) (*ABResult, error)

ABTest runs both agents on the same input and compares results.
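
A minimal sketch of a head-to-head comparison. The harness package name and import paths are placeholders, and construction of the two agents and the judge is left out because the daneel constructor API is not shown on this page; only the experiment functions documented here are used.

package agentlab // hypothetical package holding the evaluation harness

import (
	"context"
	"fmt"

	"example.com/daneel"            // placeholder import path for the agent runtime
	"example.com/daneel/experiment" // placeholder import path for this package
)

// compareAgents runs both agents on one prompt three times each, collects
// latency and token counts, and asks a judge agent to score every run pair.
func compareAgents(ctx context.Context, agentA, agentB, judge *daneel.Agent) error {
	res, err := experiment.ABTest(ctx, "Summarize this quarter's incident reports.",
		agentA, agentB,
		experiment.WithRuns(3),
		experiment.WithMetrics(experiment.Latency, experiment.TokenCount),
		experiment.WithJudge(judge),
	)
	if err != nil {
		return err
	}
	fmt.Printf("%s: %.2f  %s: %.2f  winner: %s\n",
		res.NameA, res.ScoreA, res.NameB, res.ScoreB, res.Winner)
	return nil
}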

type Candidate

type Candidate struct {
	Name  string
	Agent *daneel.Agent
}

Candidate pairs a name with an agent for batch evaluation.

type EvalOption

type EvalOption func(*evalConfig)

EvalOption configures Evaluate.

func WithConcurrency

func WithConcurrency(n int) EvalOption

WithConcurrency sets how many (candidate, input) pairs run in parallel.

func WithEvalJudge

func WithEvalJudge(agent *daneel.Agent) EvalOption

WithEvalJudge sets the judge agent for scoring outputs.

type EvalResults

type EvalResults struct {
	Runs []EvalRun
}

EvalResults holds all evaluation runs.

func Evaluate

func Evaluate(ctx context.Context, dataset []string, candidates []Candidate, opts ...EvalOption) (*EvalResults, error)

Evaluate runs each candidate against each input in the dataset, optionally scoring outputs with a judge agent. Returns all results.
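
A sketch of a batch evaluation over a small dataset, written in the same hypothetical harness package as the ABTest example above; the two candidate agents and the judge are again assumed to be built elsewhere.

// evaluatePromptVariants runs every candidate on every input, four pairs at a
// time, and has the judge agent score each output.
func evaluatePromptVariants(ctx context.Context, concise, verbose, judge *daneel.Agent) (*experiment.EvalResults, error) {
	dataset := []string{
		"Draft a release note for version 2.1.",
		"Explain the retry policy to a new teammate.",
	}
	candidates := []experiment.Candidate{
		{Name: "concise-prompt", Agent: concise},
		{Name: "verbose-prompt", Agent: verbose},
	}
	return experiment.Evaluate(ctx, dataset, candidates,
		experiment.WithConcurrency(4),
		experiment.WithEvalJudge(judge),
	)
}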

func (*EvalResults) AverageScore

func (r *EvalResults) AverageScore() map[string]float64

AverageScore returns the average judge score per candidate.

func (*EvalResults) ExportCSV

func (r *EvalResults) ExportCSV(path string) error

ExportCSV writes results to a CSV file.

func (*EvalResults) ExportJSON

func (r *EvalResults) ExportJSON(path string) error

ExportJSON writes results to a JSON file.
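
Continuing in the same hypothetical harness file, the aggregate scores can be read and the raw runs written out once Evaluate returns; the file names are arbitrary.

// reportResults prints each candidate's mean judge score and saves the full
// run data in both supported formats.
func reportResults(results *experiment.EvalResults) error {
	for name, score := range results.AverageScore() {
		fmt.Printf("%-16s mean judge score %.2f\n", name, score)
	}
	if err := results.ExportCSV("eval.csv"); err != nil {
		return err
	}
	return results.ExportJSON("eval.json")
}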

type EvalRun

type EvalRun struct {
	CandidateName string
	Input         string
	Result        *daneel.RunResult
	Metrics       MetricSnapshot
	JudgeScore    float64
	JudgeReason   string
}

EvalRun is the result of running one candidate on one input.
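
Each element of EvalResults.Runs describes one candidate on one input. A short sketch, under the same assumptions as above, that surfaces low-scoring runs along with the judge's reasoning:

// printLowScores lists every run whose judge score fell below the threshold.
func printLowScores(results *experiment.EvalResults, threshold float64) {
	for _, run := range results.Runs {
		if run.JudgeScore >= threshold {
			continue
		}
		fmt.Printf("%s on %q scored %.2f: %s\n",
			run.CandidateName, run.Input, run.JudgeScore, run.JudgeReason)
	}
}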

type JudgeResult

type JudgeResult struct {
	ScoreA float64
	ScoreB float64
	Reason string
}

JudgeResult holds the scores from an LLM judge comparison.

type Metric

type Metric string

Metric names a performance statistic that can be selected via ABTest and Evaluate options.

const (
	Latency    Metric = "latency"
	TokenCount Metric = "token_count"
	ToolCalls  Metric = "tool_calls"
	Turns      Metric = "turns"
)
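
As an illustration (same placeholder setup as the ABTest sketch above), an A/B test that records only latency and turn counts might look like:

// latencyOnly compares two agents while collecting just latency and turns.
func latencyOnly(ctx context.Context, input string, agentA, agentB *daneel.Agent) (*experiment.ABResult, error) {
	return experiment.ABTest(ctx, input, agentA, agentB,
		experiment.WithMetrics(experiment.Latency, experiment.Turns),
	)
}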

type MetricSnapshot

type MetricSnapshot struct {
	Latency   time.Duration
	Tokens    int
	ToolCalls int
	Turns     int
}

MetricSnapshot captures performance stats for a single run.

type RunPair

type RunPair struct {
	ResultA  *daneel.RunResult
	ResultB  *daneel.RunResult
	MetricsA MetricSnapshot
	MetricsB MetricSnapshot
	Judge    *JudgeResult
}

RunPair holds results from a single A/B run.
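
A sketch of walking the per-run pairs in an ABResult, printing each run's metrics and, when a judge was configured, its scores; same placeholder setup as above.

// inspectPairs prints latency and token counts for each run pair, plus the
// judge's verdict when one is present.
func inspectPairs(res *experiment.ABResult) {
	for i, pair := range res.Runs {
		fmt.Printf("run %d: A %v / %d tokens, B %v / %d tokens\n",
			i, pair.MetricsA.Latency, pair.MetricsA.Tokens,
			pair.MetricsB.Latency, pair.MetricsB.Tokens)
		if pair.Judge != nil {
			fmt.Printf("  judge: A %.2f, B %.2f (%s)\n",
				pair.Judge.ScoreA, pair.Judge.ScoreB, pair.Judge.Reason)
		}
	}
}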
