experiment

package
Version: v1.3.2
Published: Mar 28, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package experiment provides tools for evaluating and comparing agent configurations.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ABOption

type ABOption func(*abConfig)

ABOption configures ABTest.

func WithJudge

func WithJudge(agent *daneel.Agent) ABOption

WithJudge sets an LLM judge agent to score each run pair.

func WithMetrics

func WithMetrics(ms ...Metric) ABOption

WithMetrics specifies which metrics to collect.

func WithRuns

func WithRuns(n int) ABOption

WithRuns sets the number of runs per candidate (default: 1).

type ABResult

type ABResult struct {
	NameA    string
	NameB    string
	ScoreA   float64
	ScoreB   float64
	Winner   string
	MetricsA MetricSnapshot // averages across runs
	MetricsB MetricSnapshot
	Runs     []RunPair
}

ABResult is the aggregate outcome of an ABTest.

func ABTest

func ABTest(ctx context.Context, input string, agentA, agentB *daneel.Agent, opts ...ABOption) (*ABResult, error)

ABTest runs both agents on the same input and compares results.
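
A minimal sketch of a head-to-head comparison. The harness package name and import paths are placeholders, and construction of the two agents and the judge is left out because the daneel constructor API is not shown on this page; only the experiment functions documented here are used.

package agentlab // hypothetical package holding the evaluation harness

import (
	"context"
	"fmt"

	"example.com/daneel"            // placeholder import path for the agent runtime
	"example.com/daneel/experiment" // placeholder import path for this package
)

// compareAgents runs both agents on one prompt three times each, collects
// latency and token counts, and asks a judge agent to score every run pair.
func compareAgents(ctx context.Context, agentA, agentB, judge *daneel.Agent) error {
	res, err := experiment.ABTest(ctx, "Summarize this quarter's incident reports.",
		agentA, agentB,
		experiment.WithRuns(3),
		experiment.WithMetrics(experiment.Latency, experiment.TokenCount),
		experiment.WithJudge(judge),
	)
	if err != nil {
		return err
	}
	fmt.Printf("%s: %.2f  %s: %.2f  winner: %s\n",
		res.NameA, res.ScoreA, res.NameB, res.ScoreB, res.Winner)
	return nil
}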

type Candidate

type Candidate struct {
	Name  string
	Agent *daneel.Agent
}

Candidate pairs a name with an agent for batch evaluation.

type EvalOption

type EvalOption func(*evalConfig)

EvalOption configures Evaluate.

func WithConcurrency

func WithConcurrency(n int) EvalOption

WithConcurrency sets how many (candidate, input) pairs run in parallel.

func WithEvalJudge

func WithEvalJudge(agent *daneel.Agent) EvalOption

WithEvalJudge sets the judge agent for scoring outputs.

type EvalResults

type EvalResults struct {
	Runs []EvalRun
}

EvalResults holds all evaluation runs.

func Evaluate

func Evaluate(ctx context.Context, dataset []string, candidates []Candidate, opts ...EvalOption) (*EvalResults, error)

Evaluate runs each candidate against each input in the dataset, optionally scoring outputs with a judge agent. Returns all results.
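
A sketch of a batch evaluation over a small dataset, written in the same hypothetical harness package as the ABTest example above; the two candidate agents and the judge are again assumed to be built elsewhere.

// evaluatePromptVariants runs every candidate on every input, four pairs at a
// time, and has the judge agent score each output.
func evaluatePromptVariants(ctx context.Context, concise, verbose, judge *daneel.Agent) (*experiment.EvalResults, error) {
	dataset := []string{
		"Draft a release note for version 2.1.",
		"Explain the retry policy to a new teammate.",
	}
	candidates := []experiment.Candidate{
		{Name: "concise-prompt", Agent: concise},
		{Name: "verbose-prompt", Agent: verbose},
	}
	return experiment.Evaluate(ctx, dataset, candidates,
		experiment.WithConcurrency(4),
		experiment.WithEvalJudge(judge),
	)
}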

func (*EvalResults) AverageScore

func (r *EvalResults) AverageScore() map[string]float64

AverageScore returns the average judge score per candidate.

func (*EvalResults) ExportCSV

func (r *EvalResults) ExportCSV(path string) error

ExportCSV writes results to a CSV file.

func (*EvalResults) ExportJSON

func (r *EvalResults) ExportJSON(path string) error

ExportJSON writes results to a JSON file.
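
Continuing in the same hypothetical harness file, the aggregate scores can be read and the raw runs written out once Evaluate returns; the file names are arbitrary.

// reportResults prints each candidate's mean judge score and saves the full
// run data in both supported formats.
func reportResults(results *experiment.EvalResults) error {
	for name, score := range results.AverageScore() {
		fmt.Printf("%-16s mean judge score %.2f\n", name, score)
	}
	if err := results.ExportCSV("eval.csv"); err != nil {
		return err
	}
	return results.ExportJSON("eval.json")
}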

type EvalRun

type EvalRun struct {
	CandidateName string
	Input         string
	Result        *daneel.RunResult
	Metrics       MetricSnapshot
	JudgeScore    float64
	JudgeReason   string
}

EvalRun is the result of running one candidate on one input.
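
Each element of EvalResults.Runs describes one candidate on one input. A short sketch, under the same assumptions as above, that surfaces low-scoring runs along with the judge's reasoning:

// printLowScores lists every run whose judge score fell below the threshold.
func printLowScores(results *experiment.EvalResults, threshold float64) {
	for _, run := range results.Runs {
		if run.JudgeScore >= threshold {
			continue
		}
		fmt.Printf("%s on %q scored %.2f: %s\n",
			run.CandidateName, run.Input, run.JudgeScore, run.JudgeReason)
	}
}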

type JudgeResult

type JudgeResult struct {
	ScoreA float64
	ScoreB float64
	Reason string
}

JudgeResult holds the scores from an LLM judge comparison.

type Metric

type Metric string

Metric names a performance statistic that can be selected via ABTest and Evaluate options.

const (
	Latency    Metric = "latency"
	TokenCount Metric = "token_count"
	ToolCalls  Metric = "tool_calls"
	Turns      Metric = "turns"
)
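
As an illustration (same placeholder setup as the ABTest sketch above), an A/B test that records only latency and turn counts might look like:

// latencyOnly compares two agents while collecting just latency and turns.
func latencyOnly(ctx context.Context, input string, agentA, agentB *daneel.Agent) (*experiment.ABResult, error) {
	return experiment.ABTest(ctx, input, agentA, agentB,
		experiment.WithMetrics(experiment.Latency, experiment.Turns),
	)
}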

type MetricSnapshot

type MetricSnapshot struct {
	Latency   time.Duration
	Tokens    int
	ToolCalls int
	Turns     int
}

MetricSnapshot captures performance stats for a single run.

type RunPair

type RunPair struct {
	ResultA  *daneel.RunResult
	ResultB  *daneel.RunResult
	MetricsA MetricSnapshot
	MetricsB MetricSnapshot
	Judge    *JudgeResult
}

RunPair holds results from a single A/B run.
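
A sketch of walking the per-run pairs in an ABResult, printing each run's metrics and, when a judge was configured, its scores; same placeholder setup as above.

// inspectPairs prints latency and token counts for each run pair, plus the
// judge's verdict when one is present.
func inspectPairs(res *experiment.ABResult) {
	for i, pair := range res.Runs {
		fmt.Printf("run %d: A %v / %d tokens, B %v / %d tokens\n",
			i, pair.MetricsA.Latency, pair.MetricsA.Tokens,
			pair.MetricsB.Latency, pair.MetricsB.Tokens)
		if pair.Judge != nil {
			fmt.Printf("  judge: A %.2f, B %.2f (%s)\n",
				pair.Judge.ScoreA, pair.Judge.ScoreB, pair.Judge.Reason)
		}
	}
}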
