eval

package
v0.5.0 Latest
Published: Apr 16, 2026 License: Apache-2.0 Imports: 19 Imported by: 0

Documentation

Overview

Package eval provides tools for evaluating AI model outputs. Evaluations help measure AI application performance (accuracy/quality) and create an effective feedback loop for AI development. They help teams understand if updates improve or regress application quality. Evaluations are a key part of the Braintrust platform.

An evaluation consists of three main components:

  • Dataset: A set of test examples with inputs and expected outputs
  • TaskFunc: The unit of work we are evaluating, usually one or more calls to an LLM
  • Scorer: A function that scores the result of a task against the expected result

Type Parameters

This package uses two generic type parameters throughout its API:

  • I: The input type for the task (e.g., string, struct, []byte)
  • R: The result/output type from the task (e.g., string, struct, complex types)

All input and result types must be JSON-encodable.

For example:

  • Case[string, string] is a test case with string input and string output
  • TaskFunc[Input, Output] is a task that takes Input and returns Output
  • Dataset[string, bool] is an iterator over Cases with string inputs and boolean outputs

See Evaluator.Run for running evaluations.

Example

Example demonstrates how to run a basic evaluation.

ctx := context.Background()

// Create tracer provider
tp := trace.NewTracerProvider()
defer func() { _ = tp.Shutdown(ctx) }()

// Create Braintrust client (reads BRAINTRUST_API_KEY from environment)
client, err := braintrust.New(tp, braintrust.WithProject("test-project"))
if err != nil {
	log.Fatal(err)
}

// Create an evaluator with string input and output types
evaluator := braintrust.NewEvaluator[string, string](client)

// Define a simple task that adds exclamation marks
task := eval.T(func(ctx context.Context, input string) (string, error) {
	return input + "!", nil
})

// Create test cases
dataset := eval.NewDataset([]eval.Case[string, string]{
	{Input: "hello", Expected: "hello!"},
	{Input: "world", Expected: "world!"},
})

// Create a scorer
scorer := eval.NewScorer("exact-match", func(ctx context.Context, result eval.TaskResult[string, string]) (eval.Scores, error) {
	if result.Output == result.Expected {
		return eval.S(1.0), nil
	}
	return eval.S(0.0), nil
})

// Run the evaluation
result, err := evaluator.Run(ctx, eval.Opts[string, string]{
	Experiment: "example-eval",
	Dataset:    dataset,
	Task:       task,
	Scorers:    []eval.Scorer[string, string]{scorer},
	Quiet:      true,
})
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Evaluation complete: %s\n", result.Name())


Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Case

type Case[I, R any] struct {
	// Input is the input to the task function.
	Input I

	// Expected is the expected output (for scoring).
	// Optional.
	Expected R

	// Tags are labels to attach to this case.
	// Optional.
	Tags []string

	// Metadata is additional metadata for this case.
	// Optional.
	Metadata map[string]interface{}

	// These fields are only set if the Case is part of a Dataset.
	// They link the eval result back to the source dataset row.
	ID      string // Dataset record ID
	XactID  string // Transaction ID
	Created string // Creation timestamp
}

Case represents a single test case in an evaluation.

type Dataset

type Dataset[I, R any] interface {
	// Next returns the next case, or io.EOF if there are no more cases.
	Next() (Case[I, R], error)

	// ID returns the dataset ID if backed by a Braintrust dataset.
	// Returns empty string for literal in-memory cases.
	ID() string

	// Version returns the dataset version if applicable.
	// Returns empty string for literal cases or unversioned datasets.
	Version() string
}

Dataset is an iterator interface for evaluation datasets. It is commonly an in-memory slice of cases, but can also be a dataset lazily loaded from the Braintrust API.

func NewDataset

func NewDataset[I, R any](cases []Case[I, R]) Dataset[I, R]

NewDataset creates a Dataset iterator from a slice of cases.

type DatasetAPI

type DatasetAPI[I, R any] struct {
	// contains filtered or unexported fields
}

DatasetAPI provides methods for loading datasets with automatic type conversion so they can be easily used in evals.

func (*DatasetAPI[I, R]) Get

func (d *DatasetAPI[I, R]) Get(ctx context.Context, id string) (Dataset[I, R], error)

Get loads a dataset by ID and returns a Dataset iterator.

func (*DatasetAPI[I, R]) Query

func (d *DatasetAPI[I, R]) Query(ctx context.Context, opts DatasetQueryOpts) (Dataset[I, R], error)

Query loads a dataset with advanced query options.

type DatasetQueryOpts

type DatasetQueryOpts struct {
	// Name is the dataset name (requires project context)
	Name string

	// ID is the dataset ID
	ID string

	// Version specifies a specific dataset version
	Version string

	// Limit specifies the maximum number of records to return (0 = unlimited)
	Limit int
}

DatasetQueryOpts contains options for querying datasets.

type Evaluator

type Evaluator[I, R any] struct {
	// contains filtered or unexported fields
}

Evaluator provides a reusable way to run multiple evaluations with the same input and output types. This is useful when you need to run several evaluations in sequence with the same type signature, or use hosted prompts, scorers and datasets with automatic type conversion.

func NewEvaluator

func NewEvaluator[I, R any](s *auth.Session, tp *trace.TracerProvider, api *api.API, project string) *Evaluator[I, R]

NewEvaluator creates a new evaluator with explicit dependencies. The type parameters I (input) and R (result/output) must be specified explicitly. Users create Evaluators with braintrust.NewEvaluator.

func (*Evaluator[I, R]) Datasets

func (e *Evaluator[I, R]) Datasets() *DatasetAPI[I, R]

Datasets provides access to the datasets API for loading datasets with this evaluator's type parameters.

func (*Evaluator[I, R]) Functions added in v0.1.0

func (e *Evaluator[I, R]) Functions() *FunctionsAPI[I, R]

Functions is used to execute hosted Braintrust functions (e.g. hosted tasks and hosted scorers) as part of an eval. As long as I and R are JSON-serializable, FunctionsAPI will automatically convert the input and output to and from JSON.

func (*Evaluator[I, R]) Run

func (e *Evaluator[I, R]) Run(ctx context.Context, opts Opts[I, R]) (*Result, error)

Run executes an evaluation using this evaluator's dependencies.

type FunctionOpts added in v0.1.0

type FunctionOpts struct {
	// Slug is the function slug (required)
	Slug string

	// Project overrides the default project name (optional)
	Project string

	// Version pins to a specific function version (optional, e.g., "5878bd218351fb8e")
	Version string

	// Environment specifies the deployment environment (optional, e.g., "dev", "staging", "production")
	Environment string
}

FunctionOpts contains options for loading functions.

type FunctionsAPI added in v0.1.0

type FunctionsAPI[I, R any] struct {
	// contains filtered or unexported fields
}

FunctionsAPI provides access for executing tasks and scorers hosted at braintrust.dev.

func (*FunctionsAPI[I, R]) Scorer added in v0.1.0

func (f *FunctionsAPI[I, R]) Scorer(ctx context.Context, opts FunctionOpts) (Scorer[I, R], error)

Scorer loads a server-side scorer and returns a Scorer. The returned scorer, when called, will invoke the Braintrust scorer function remotely.

func (*FunctionsAPI[I, R]) Task added in v0.1.0

func (f *FunctionsAPI[I, R]) Task(ctx context.Context, opts FunctionOpts) (TaskFunc[I, R], error)

Task loads a server-side task/prompt and returns a TaskFunc. The returned function, when called, will invoke the Braintrust function remotely.

type Metadata

type Metadata map[string]any

Metadata is a map of strings to a JSON-encodable value.

type Opts

type Opts[I, R any] struct {
	// Required
	Experiment string
	Dataset    Dataset[I, R]
	Task       TaskFunc[I, R]
	Scorers    []Scorer[I, R]

	// Optional
	ProjectName string   // Project name (uses default from config if not specified)
	Tags        []string // Tags to apply to the experiment
	Metadata    Metadata // Metadata to attach to the experiment
	Update      bool     // If true, append to existing experiment (default: false)
	Parallelism int      // Number of goroutines (default: 1)
	Quiet       bool     // Suppress result output (default: false)
}

Opts defines the options for running an evaluation. I is the input type and R is the result/output type.

Dataset can be in-memory cases created with NewDataset or API-backed datasets loaded with Evaluator.Datasets.

Task can be a TaskFunc, a function wrapped with T, or a hosted task function loaded with [Evaluator.Functions().Task].

Scorers can be local functions created with NewScorer or hosted scorer functions loaded with [Evaluator.Functions().Scorer].

type Result

type Result struct {
	// contains filtered or unexported fields
}

Result contains the results of an evaluation.

func (*Result) Error

func (r *Result) Error() error

Error returns any errors that were encountered while running the eval.

func (*Result) ID

func (r *Result) ID() string

ID returns the experiment ID.

func (*Result) Name

func (r *Result) Name() string

Name returns the experiment name.

func (*Result) Permalink

func (r *Result) Permalink() (string, error)

Permalink returns a link to this eval in the Braintrust UI.

func (*Result) String

func (r *Result) String() string

String returns a string representation of the result for printing on the console.

The format it prints will change and shouldn't be relied on for programmatic use.

type Score

type Score struct {
	// Name is the name of the score (e.g., "accuracy", "exact_match").
	Name string

	// Score is the numeric score value.
	Score float64

	// Metadata is optional additional metadata for this score.
	Metadata map[string]interface{}
}

Score represents a single score result.

type ScoreFunc

type ScoreFunc[I, R any] func(ctx context.Context, result TaskResult[I, R]) (Scores, error)

ScoreFunc is a function that evaluates a task result and returns a list of Scores.

type Scorer

type Scorer[I, R any] interface {
	// Name returns the name of this scorer.
	Name() string
	// Run evaluates the task result.
	// It returns one or more Score results.
	Run(context.Context, TaskResult[I, R]) (Scores, error)
}

Scorer is an interface for scoring the output of a task.

func NewScorer

func NewScorer[I, R any](name string, scoreFunc ScoreFunc[I, R]) Scorer[I, R]

NewScorer creates a new scorer with the given name and score function.

type Scores

type Scores = []Score

Scores is a collection of Score results returned by scorers.

func S

func S(score float64) Scores

S is a helper function to concisely return a single score from scorers. Scores created with S will default to the name of the scorer that creates them. S(0.5) is equivalent to Scores{{Score: 0.5}}.

type TaskFunc

type TaskFunc[I, R any] func(ctx context.Context, input I, hooks *TaskHooks) (TaskOutput[R], error)

TaskFunc is the signature for evaluation task functions. It receives the input, hooks for accessing eval context, and returns a TaskOutput.

func T

func T[I, R any](fn func(ctx context.Context, input I) (R, error)) TaskFunc[I, R]

T is a convenience function for writing short task functions (TaskFunc) that only use the input and output and don't need Hooks or other advanced features.

task := eval.T(func(ctx context.Context, input string) (string, error) {
	return input, nil
})

type TaskHooks

type TaskHooks struct {
	// The eval and task spans are exposed so you can add custom attributes or events.
	TaskSpan oteltrace.Span
	EvalSpan oteltrace.Span

	// Read-only fields. Tasks don't usually need these,
	// but they are available for advanced use cases.
	Expected any      // Not usually used in tasks, so this is untyped
	Metadata Metadata // Case metadata
	Tags     []string // Case tags
}

TaskHooks provides access to evaluation context within a task. The spans may be annotated; the remaining fields are read-only.

type TaskOutput

type TaskOutput[R any] struct {
	Value R

	// UserData allows passing custom application context to scorers.
	// This field is NOT logged and isn't supported outside the context of the Go SDK.
	// Use this for in-process data like database connections, file handles, or metrics.
	UserData any
}

TaskOutput wraps the output value from a task.

type TaskResult

type TaskResult[I, R any] struct {
	Input    I        // The case input
	Expected R        // What we expected
	Output   R        // What the task actually returned
	Metadata Metadata // Case metadata

	// UserData is custom application context from the task.
	// This field is NOT logged and isn't supported outside the context of the Go SDK.
	UserData any
}

TaskResult represents the complete result of executing a task on a case. This is passed to scorers for evaluation.
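
A scorer derives its scores from this struct, typically by comparing Output against Expected. The sketch below uses local stand-ins mirroring the documented TaskResult, Score, and Scores types (and omits the context.Context parameter a real ScoreFunc receives):

```go
package main

import "fmt"

// Local stand-ins mirroring the documented types.
type Score struct {
	Name  string
	Score float64
}

type Scores = []Score

type TaskResult[I, R any] struct {
	Input    I
	Expected R
	Output   R
}

// exactMatch is the kind of logic a ScoreFunc body runs: compare
// Output against Expected and return one or more named scores.
func exactMatch(r TaskResult[string, string]) Scores {
	s := 0.0
	if r.Output == r.Expected {
		s = 1.0
	}
	return Scores{{Name: "exact-match", Score: s}}
}

func main() {
	r := TaskResult[string, string]{Input: "hello", Expected: "hello!", Output: "hello!"}
	got := exactMatch(r)
	fmt.Printf("%s=%v\n", got[0].Name, got[0].Score)
	// → exact-match=1
}
```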
