Documentation ¶
Overview ¶
Package experiment provides tools for evaluating and comparing agent configurations.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ABOption ¶
type ABOption func(*abConfig)
ABOption configures ABTest.
func WithMetrics ¶
WithMetrics specifies which metrics to collect.
type ABResult ¶
type ABResult struct {
	NameA    string
	NameB    string
	ScoreA   float64
	ScoreB   float64
	Winner   string
	MetricsA MetricSnapshot // averages across runs
	MetricsB MetricSnapshot
	Runs     []RunPair
}
ABResult is the aggregate outcome of an ABTest.
type EvalOption ¶
type EvalOption func(*evalConfig)
EvalOption configures Evaluate.
func WithConcurrency ¶
func WithConcurrency(n int) EvalOption
WithConcurrency sets how many (candidate, input) pairs run in parallel.
func WithEvalJudge ¶
func WithEvalJudge(agent *daneel.Agent) EvalOption
WithEvalJudge sets the judge agent for scoring outputs.
type EvalResults ¶
type EvalResults struct {
	Runs []EvalRun
}
EvalResults holds all evaluation runs.
func Evaluate ¶
func Evaluate(ctx context.Context, dataset []string, candidates []Candidate, opts ...EvalOption) (*EvalResults, error)
Evaluate runs each candidate against each input in the dataset, optionally scoring each output with a judge agent, and returns one EvalRun per (candidate, input) pair.
func (*EvalResults) AverageScore ¶
func (r *EvalResults) AverageScore() map[string]float64
AverageScore returns the average judge score per candidate.
func (*EvalResults) ExportCSV ¶
func (r *EvalResults) ExportCSV(path string) error
ExportCSV writes results to a CSV file.
func (*EvalResults) ExportJSON ¶
func (r *EvalResults) ExportJSON(path string) error
ExportJSON writes results to a JSON file.
type EvalRun ¶
type EvalRun struct {
	CandidateName string
	Input         string
	Result        *daneel.RunResult
	Metrics       MetricSnapshot
	JudgeScore    float64
	JudgeReason   string
}
EvalRun is the result of running one candidate on one input.
type JudgeResult ¶
JudgeResult holds the scores from an LLM judge comparison.
type MetricSnapshot ¶
MetricSnapshot captures performance stats for a single run.
type RunPair ¶
type RunPair struct {
	ResultA  *daneel.RunResult
	ResultB  *daneel.RunResult
	MetricsA MetricSnapshot
	MetricsB MetricSnapshot
	Judge    *JudgeResult
}
RunPair holds the paired results of a single A/B run: one run for each configuration, plus the judge's comparison of the two.