eval

package
v0.5.0 Latest
Published: Apr 16, 2026 License: Apache-2.0 Imports: 19 Imported by: 0

Documentation

Overview

Package eval provides tools for evaluating AI model outputs. Evaluations help measure AI application performance (accuracy/quality) and create an effective feedback loop for AI development. They help teams understand if updates improve or regress application quality. Evaluations are a key part of the Braintrust platform.

An evaluation consists of three main components:

  • Dataset: A set of test examples with inputs and expected outputs
  • TaskFunc: The unit of work we are evaluating, usually one or more calls to an LLM
  • Scorer: A function that scores the result of a task against the expected result

Type Parameters

This package uses two generic type parameters throughout its API:

  • I: The input type for the task (e.g., string, struct, []byte)
  • R: The result/output type from the task (e.g., string, struct, complex types)

All input and result types must be JSON-encodable.

For example:

  • Case[string, string] is a test case with string input and string output
  • TaskFunc[Input, Output] is a task that takes Input and returns Output
  • Dataset[string, bool] is an iterator over Cases with string inputs and boolean outputs

See Evaluator.Run for running evaluations.

Example

Example demonstrates how to run a basic evaluation.

ctx := context.Background()

// Create tracer provider
tp := trace.NewTracerProvider()
defer func() { _ = tp.Shutdown(ctx) }()

// Create Braintrust client (reads BRAINTRUST_API_KEY from environment)
client, err := braintrust.New(tp, braintrust.WithProject("test-project"))
if err != nil {
	log.Fatal(err)
}

// Create an evaluator with string input and output types
evaluator := braintrust.NewEvaluator[string, string](client)

// Define a simple task that adds exclamation marks
task := eval.T(func(ctx context.Context, input string) (string, error) {
	return input + "!", nil
})

// Create test cases
dataset := eval.NewDataset([]eval.Case[string, string]{
	{Input: "hello", Expected: "hello!"},
	{Input: "world", Expected: "world!"},
})

// Create a scorer
scorer := eval.NewScorer("exact-match", func(ctx context.Context, result eval.TaskResult[string, string]) (eval.Scores, error) {
	if result.Output == result.Expected {
		return eval.S(1.0), nil
	}
	return eval.S(0.0), nil
})

// Run the evaluation
result, err := evaluator.Run(ctx, eval.Opts[string, string]{
	Experiment: "example-eval",
	Dataset:    dataset,
	Task:       task,
	Scorers:    []eval.Scorer[string, string]{scorer},
	Quiet:      true,
})
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Evaluation complete: %s\n", result.Name())


Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Case

type Case[I, R any] struct {
	// Input is the input to the task function.
	Input I

	// Expected is the expected output (for scoring).
	// Optional.
	Expected R

	// Tags are labels to attach to this case.
	// Optional.
	Tags []string

	// Metadata is additional metadata for this case.
	// Optional.
	Metadata map[string]interface{}

	// These fields are only set if the Case is part of a Dataset.
	// They link the eval result back to the source dataset row.
	ID      string // Dataset record ID
	XactID  string // Transaction ID
	Created string // Creation timestamp
}

Case represents a single test case in an evaluation.

type Dataset

type Dataset[I, R any] interface {
	// Next returns the next case, or io.EOF if there are no more cases.
	Next() (Case[I, R], error)

	// ID returns the dataset ID if backed by a Braintrust dataset.
	// Returns empty string for literal in-memory cases.
	ID() string

	// Version returns the dataset version if applicable.
	// Returns empty string for literal cases or unversioned datasets.
	Version() string
}

Dataset is an iterator interface for evaluation datasets. It is commonly an in-memory slice of cases, but can also be a dataset lazily loaded from the Braintrust API.

func NewDataset

func NewDataset[I, R any](cases []Case[I, R]) Dataset[I, R]

NewDataset creates a Dataset iterator from a slice of cases.

type DatasetAPI

type DatasetAPI[I, R any] struct {
	// contains filtered or unexported fields
}

DatasetAPI provides methods for loading datasets with automatic type conversion so they can be easily used in evals.

func (*DatasetAPI[I, R]) Get

func (d *DatasetAPI[I, R]) Get(ctx context.Context, id string) (Dataset[I, R], error)

Get loads a dataset by ID and returns a Dataset iterator.

func (*DatasetAPI[I, R]) Query

func (d *DatasetAPI[I, R]) Query(ctx context.Context, opts DatasetQueryOpts) (Dataset[I, R], error)

Query loads a dataset with advanced query options.

type DatasetQueryOpts

type DatasetQueryOpts struct {
	// Name is the dataset name (requires project context)
	Name string

	// ID is the dataset ID
	ID string

	// Version specifies a specific dataset version
	Version string

	// Limit specifies the maximum number of records to return (0 = unlimited)
	Limit int
}

DatasetQueryOpts contains options for querying datasets.

type Evaluator

type Evaluator[I, R any] struct {
	// contains filtered or unexported fields
}

Evaluator provides a reusable way to run multiple evaluations with the same input and output types. This is useful when you need to run several evaluations in sequence with the same type signature, or use hosted prompts, scorers and datasets with automatic type conversion.

func NewEvaluator

func NewEvaluator[I, R any](s *auth.Session, tp *trace.TracerProvider, api *api.API, project string) *Evaluator[I, R]

NewEvaluator creates a new evaluator with explicit dependencies. The type parameters I (input) and R (result/output) must be specified explicitly. Users create Evaluators with braintrust.NewEvaluator.

func (*Evaluator[I, R]) Datasets

func (e *Evaluator[I, R]) Datasets() *DatasetAPI[I, R]

Datasets provides access to the datasets API for loading datasets with this evaluator's type parameters.

func (*Evaluator[I, R]) Functions added in v0.1.0

func (e *Evaluator[I, R]) Functions() *FunctionsAPI[I, R]

Functions is used to execute hosted Braintrust functions (e.g. hosted tasks and hosted scorers) as part of an eval. As long as I and R are JSON-serializable, FunctionsAPI will automatically convert the input and output to and from JSON.

func (*Evaluator[I, R]) Run

func (e *Evaluator[I, R]) Run(ctx context.Context, opts Opts[I, R]) (*Result, error)

Run executes an evaluation using this evaluator's dependencies.

type FunctionOpts added in v0.1.0

type FunctionOpts struct {
	// Slug is the function slug (required)
	Slug string

	// Project overrides the default project name (optional)
	Project string

	// Version pins to a specific function version (optional, e.g., "5878bd218351fb8e")
	Version string

	// Environment specifies the deployment environment (optional, e.g., "dev", "staging", "production")
	Environment string
}

FunctionOpts contains options for loading functions.

type FunctionsAPI added in v0.1.0

type FunctionsAPI[I, R any] struct {
	// contains filtered or unexported fields
}

FunctionsAPI provides access for executing tasks and scorers hosted at braintrust.dev.

func (*FunctionsAPI[I, R]) Scorer added in v0.1.0

func (f *FunctionsAPI[I, R]) Scorer(ctx context.Context, opts FunctionOpts) (Scorer[I, R], error)

Scorer loads a server-side scorer and returns a Scorer. The returned scorer, when called, will invoke the Braintrust scorer function remotely.

func (*FunctionsAPI[I, R]) Task added in v0.1.0

func (f *FunctionsAPI[I, R]) Task(ctx context.Context, opts FunctionOpts) (TaskFunc[I, R], error)

Task loads a server-side task/prompt and returns a TaskFunc. The returned function, when called, will invoke the Braintrust function remotely.

type Metadata

type Metadata map[string]any

Metadata is a map of strings to a JSON-encodable value.

type Opts

type Opts[I, R any] struct {
	// Required
	Experiment string
	Dataset    Dataset[I, R]
	Task       TaskFunc[I, R]
	Scorers    []Scorer[I, R]

	// Optional
	ProjectName string   // Project name (uses default from config if not specified)
	Tags        []string // Tags to apply to the experiment
	Metadata    Metadata // Metadata to attach to the experiment
	Update      bool     // If true, append to existing experiment (default: false)
	Parallelism int      // Number of goroutines (default: 1)
	Quiet       bool     // Suppress result output (default: false)
}

Opts defines the options for running an evaluation. I is the input type and R is the result/output type.

Dataset can be in-memory cases created with NewDataset or API-backed datasets loaded with Evaluator.Datasets.

Task can be a TaskFunc, a function wrapped with T, or a hosted task function loaded with [Evaluator.Functions().Task].

Scorers can be local functions created with NewScorer or hosted scorer functions loaded with [Evaluator.Functions().Scorer].

type Result

type Result struct {
	// contains filtered or unexported fields
}

Result contains the results of an evaluation.

func (*Result) Error

func (r *Result) Error() error

Error returns any errors that were encountered while running the eval.

func (*Result) ID

func (r *Result) ID() string

ID returns the experiment ID.

func (*Result) Name

func (r *Result) Name() string

Name returns the experiment name.

func (*Result) Permalink

func (r *Result) Permalink() (string, error)

Permalink returns a link to this eval in the Braintrust UI.

func (*Result) String

func (r *Result) String() string

String returns a string representation of the result for printing on the console.

The format it prints will change and shouldn't be relied on for programmatic use.

type Score

type Score struct {
	// Name is the name of the score (e.g., "accuracy", "exact_match").
	Name string

	// Score is the numeric score value.
	Score float64

	// Metadata is optional additional metadata for this score.
	Metadata map[string]interface{}
}

Score represents a single score result.

type ScoreFunc

type ScoreFunc[I, R any] func(ctx context.Context, result TaskResult[I, R]) (Scores, error)

ScoreFunc is a function that evaluates a task result and returns a list of Scores.

type Scorer

type Scorer[I, R any] interface {
	// Name returns the name of this scorer.
	Name() string
	// Run evaluates the task result.
	// It returns one or more Score results.
	Run(context.Context, TaskResult[I, R]) (Scores, error)
}

Scorer is an interface for scoring the output of a task.

func NewScorer

func NewScorer[I, R any](name string, scoreFunc ScoreFunc[I, R]) Scorer[I, R]

NewScorer creates a new scorer with the given name and score function.

type Scores

type Scores = []Score

Scores is a collection of Score results returned by scorers.

func S

func S(score float64) Scores

S is a helper function to concisely return a single score from scorers. Scores created with S will default to the name of the scorer that creates them. S(0.5) is equivalent to Scores{{Score: 0.5}}.

type TaskFunc

type TaskFunc[I, R any] func(ctx context.Context, input I, hooks *TaskHooks) (TaskOutput[R], error)

TaskFunc is the signature for evaluation task functions. It receives the input, hooks for accessing eval context, and returns a TaskOutput.

func T

func T[I, R any](fn func(ctx context.Context, input I) (R, error)) TaskFunc[I, R]

T is a convenience function for writing short task functions (TaskFunc) that only use the input and output and don't need Hooks or other advanced features.

task := eval.T(func(ctx context.Context, input string) (string, error) {
	return input, nil
})

type TaskHooks

type TaskHooks struct {
	// The eval and task spans are exposed so you can add custom attributes or events.
	TaskSpan oteltrace.Span
	EvalSpan oteltrace.Span

	// Read-only fields. Tasks don't usually need these,
	// but they are available for advanced use cases.
	Expected any      // Not usually used in tasks, so this is untyped
	Metadata Metadata // Case metadata
	Tags     []string // Case tags
}

TaskHooks provides access to evaluation context within a task. The spans may be annotated; the remaining fields are read-only.

type TaskOutput

type TaskOutput[R any] struct {
	Value R

	// UserData allows passing custom application context to scorers.
	// This field is NOT logged and isn't supported outside the context of the Go SDK.
	// Use this for in-process data like database connections, file handles, or metrics.
	UserData any
}

TaskOutput wraps the output value from a task.

type TaskResult

type TaskResult[I, R any] struct {
	Input    I        // The case input
	Expected R        // What we expected
	Output   R        // What the task actually returned
	Metadata Metadata // Case metadata

	// UserData is custom application context from the task.
	// This field is NOT logged and isn't supported outside the context of the Go SDK.
	UserData any
}

TaskResult represents the complete result of executing a task on a case. This is passed to scorers for evaluation.
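
A scorer derives its scores from this struct, typically by comparing Output against Expected. The sketch below uses local stand-ins mirroring the documented TaskResult, Score, and Scores types (and omits the context.Context parameter a real ScoreFunc receives):

```go
package main

import "fmt"

// Local stand-ins mirroring the documented types.
type Score struct {
	Name  string
	Score float64
}

type Scores = []Score

type TaskResult[I, R any] struct {
	Input    I
	Expected R
	Output   R
}

// exactMatch is the kind of logic a ScoreFunc body runs: compare
// Output against Expected and return one or more named scores.
func exactMatch(r TaskResult[string, string]) Scores {
	s := 0.0
	if r.Output == r.Expected {
		s = 1.0
	}
	return Scores{{Name: "exact-match", Score: s}}
}

func main() {
	r := TaskResult[string, string]{Input: "hello", Expected: "hello!", Output: "hello!"}
	got := exactMatch(r)
	fmt.Printf("%s=%v\n", got[0].Name, got[0].Score)
	// → exact-match=1
}
```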
