Documentation
Overview ¶
Package eval provides tools for evaluating AI model outputs. Evaluations measure AI application performance (accuracy/quality) and create an effective feedback loop for AI development, helping teams understand whether updates improve or regress application quality. Evaluations are a key part of the Braintrust platform.
An evaluation consists of three main components:
- Dataset: A set of test examples with inputs and expected outputs
- TaskFunc: The unit of work we are evaluating, usually one or more calls to an LLM
- Scorer: A function that scores the result of a task against the expected result
Type Parameters ¶
This package uses two generic type parameters throughout its API:
- I: The input type for the task (e.g., string, struct, []byte)
- R: The result/output type from the task (e.g., string, struct, complex types)
All of the input and result types must be JSON-encodable.
For example:
- Case[string, string] is a test case with string input and string output
- TaskFunc[Input, Output] is a task that takes Input and returns Output
- Dataset[string, bool] is an iterator over Cases with string inputs and boolean outputs
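Struct types work the same way, provided they are JSON-encodable. A sketch with hypothetical Question and Answer types (the field names and the client variable are illustrative):

```go
type Question struct {
	Text string `json:"text"`
}

type Answer struct {
	Text       string  `json:"text"`
	Confidence float64 `json:"confidence"`
}

// An evaluator whose tasks take a Question and return an Answer.
// client is a *braintrust.Client created with braintrust.New.
evaluator := braintrust.NewEvaluator[Question, Answer](client)
```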
See Evaluator.Run for running evaluations.
Example ¶
Example demonstrates how to run a basic evaluation.
ctx := context.Background()

// Create a tracer provider.
tp := trace.NewTracerProvider()
defer func() { _ = tp.Shutdown(ctx) }()

// Create a Braintrust client (reads BRAINTRUST_API_KEY from the environment).
client, err := braintrust.New(tp, braintrust.WithProject("test-project"))
if err != nil {
	log.Fatal(err)
}

// Create an evaluator with string input and output types.
evaluator := braintrust.NewEvaluator[string, string](client)

// Define a simple task that adds exclamation marks.
task := eval.T(func(ctx context.Context, input string) (string, error) {
	return input + "!", nil
})

// Create test cases.
dataset := eval.NewDataset([]eval.Case[string, string]{
	{Input: "hello", Expected: "hello!"},
	{Input: "world", Expected: "world!"},
})

// Create a scorer.
scorer := eval.NewScorer("exact-match", func(ctx context.Context, result eval.TaskResult[string, string]) (eval.Scores, error) {
	if result.Output == result.Expected {
		return eval.S(1.0), nil
	}
	return eval.S(0.0), nil
})

// Run the evaluation.
result, err := evaluator.Run(ctx, eval.Opts[string, string]{
	Experiment: "example-eval",
	Dataset:    dataset,
	Task:       task,
	Scorers:    []eval.Scorer[string, string]{scorer},
	Quiet:      true,
})
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Evaluation complete: %s\n", result.Name())
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Case ¶
type Case[I, R any] struct {
	// Input is the input to the task function.
	Input I

	// Expected is the expected output (for scoring).
	// Optional.
	Expected R

	// Tags are labels to attach to this case.
	// Optional.
	Tags []string

	// Metadata is additional metadata for this case.
	// Optional.
	Metadata map[string]interface{}

	// These fields are only set if the Case is part of a Dataset.
	// They link the eval result back to the source dataset row.
	ID      string // Dataset record ID
	XactID  string // Transaction ID
	Created string // Creation timestamp
}
Case represents a single test case in an evaluation.
type Dataset ¶
type Dataset[I, R any] interface {
	// Next returns the next case, or io.EOF if there are no more cases.
	Next() (Case[I, R], error)

	// ID returns the dataset ID if backed by a Braintrust dataset.
	// Returns empty string for literal in-memory cases.
	ID() string

	// Version returns the dataset version if applicable.
	// Returns empty string for literal cases or unversioned datasets.
	Version() string
}
Dataset is an iterator interface for evaluation datasets. It is commonly an in-memory slice of cases, but can also be a dataset lazily loaded from the Braintrust API.
func NewDataset ¶
NewDataset creates a Dataset iterator from a slice of cases.
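A literal dataset can also be consumed manually through the iterator interface. A sketch, using the io.EOF convention documented on Dataset.Next:

```go
ds := eval.NewDataset([]eval.Case[string, string]{
	{Input: "hello", Expected: "hello!"},
	{Input: "world", Expected: "world!"},
})
for {
	c, err := ds.Next()
	if err == io.EOF {
		break // no more cases
	}
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(c.Input, "->", c.Expected)
}
```

In practice you rarely iterate yourself; Evaluator.Run drives the iterator for you.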
type DatasetAPI ¶
type DatasetAPI[I, R any] struct {
	// contains filtered or unexported fields
}
DatasetAPI provides methods for loading datasets with automatic type conversion so they can be easily used in evals.
func (*DatasetAPI[I, R]) Query ¶
func (d *DatasetAPI[I, R]) Query(ctx context.Context, opts DatasetQueryOpts) (Dataset[I, R], error)
Query loads a dataset with advanced query options.
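A sketch of loading an API-backed dataset by name through an evaluator (the dataset name and limit are illustrative):

```go
ds, err := evaluator.Datasets().Query(ctx, eval.DatasetQueryOpts{
	Name:  "my-dataset", // illustrative dataset name
	Limit: 100,          // cap the number of records (0 = unlimited)
})
if err != nil {
	log.Fatal(err)
}
// ds satisfies Dataset[I, R] and can be passed directly as Opts.Dataset.
```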
type DatasetQueryOpts ¶
type DatasetQueryOpts struct {
// Name is the dataset name (requires project context)
Name string
// ID is the dataset ID
ID string
// Version specifies a specific dataset version
Version string
// Limit specifies the maximum number of records to return (0 = unlimited)
Limit int
}
DatasetQueryOpts contains options for querying datasets.
type Evaluator ¶
type Evaluator[I, R any] struct {
	// contains filtered or unexported fields
}
Evaluator provides a reusable way to run multiple evaluations with the same input and output types. This is useful when you need to run several evaluations in sequence with the same type signature, or use hosted prompts, scorers and datasets with automatic type conversion.
func NewEvaluator ¶
func NewEvaluator[I, R any](s *auth.Session, tp *trace.TracerProvider, api *api.API, project string) *Evaluator[I, R]
NewEvaluator creates a new evaluator with explicit dependencies. The type parameters I (input) and R (result/output) must be specified explicitly. Users create Evaluators with braintrust.NewEvaluator.
func (*Evaluator[I, R]) Datasets ¶
func (e *Evaluator[I, R]) Datasets() *DatasetAPI[I, R]
Datasets is used to access Datasets API for loading datasets with this evaluator's type parameters.
func (*Evaluator[I, R]) Functions ¶ added in v0.1.0
func (e *Evaluator[I, R]) Functions() *FunctionsAPI[I, R]
Functions is used to execute hosted Braintrust functions (e.g. hosted tasks and hosted scorers) as part of an eval. As long as I and R are JSON-serializable, FunctionsAPI will automatically convert the input and output to and from JSON.
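A sketch of loading a hosted task and a hosted scorer by slug and using them in place of local ones (the slugs are illustrative):

```go
// Load a hosted task (e.g. a prompt) by its function slug.
task, err := evaluator.Functions().Task(ctx, eval.FunctionOpts{Slug: "my-prompt"})
if err != nil {
	log.Fatal(err)
}

// Load a hosted scorer the same way.
scorer, err := evaluator.Functions().Scorer(ctx, eval.FunctionOpts{Slug: "my-scorer"})
if err != nil {
	log.Fatal(err)
}

// Both can be passed to Opts just like local tasks and scorers;
// invoking them calls the Braintrust functions remotely.
```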
type FunctionOpts ¶ added in v0.1.0
type FunctionOpts struct {
// Slug is the function slug (required)
Slug string
// Project overrides the default project name (optional)
Project string
// Version pins to a specific function version (optional, e.g., "5878bd218351fb8e")
Version string
// Environment specifies the deployment environment (optional, e.g., "dev", "staging", "production")
Environment string
}
FunctionOpts contains options for loading functions.
type FunctionsAPI ¶ added in v0.1.0
type FunctionsAPI[I, R any] struct {
	// contains filtered or unexported fields
}
FunctionsAPI provides access for executing tasks and scorers hosted at braintrust.dev.
func (*FunctionsAPI[I, R]) Scorer ¶ added in v0.1.0
func (f *FunctionsAPI[I, R]) Scorer(ctx context.Context, opts FunctionOpts) (Scorer[I, R], error)
Scorer loads a server-side scorer and returns a Scorer. The returned scorer, when called, will invoke the Braintrust scorer function remotely.
func (*FunctionsAPI[I, R]) Task ¶ added in v0.1.0
func (f *FunctionsAPI[I, R]) Task(ctx context.Context, opts FunctionOpts) (TaskFunc[I, R], error)
Task loads a server-side task/prompt and returns a TaskFunc. The returned function, when called, will invoke the Braintrust function remotely.
type Opts ¶
type Opts[I, R any] struct {
	// Required
	Experiment string
	Dataset    Dataset[I, R]
	Task       TaskFunc[I, R]
	Scorers    []Scorer[I, R]

	// Optional
	ProjectName string   // Project name (uses default from config if not specified)
	Tags        []string // Tags to apply to the experiment
	Metadata    Metadata // Metadata to attach to the experiment
	Update      bool     // If true, append to existing experiment (default: false)
	Parallelism int      // Number of goroutines (default: 1)
	Quiet       bool     // Suppress result output (default: false)
}
Opts defines the options for running an evaluation. I is the input type and R is the result/output type.
Dataset can be in-memory cases created with NewDataset or API-backed datasets loaded with Evaluator.Datasets.
Task can be a TaskFunc, a function wrapped with T, or a hosted task loaded through Evaluator.Functions with FunctionsAPI.Task.
Scorers can be local scorers created with NewScorer or hosted scorers loaded through Evaluator.Functions with FunctionsAPI.Scorer.
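For larger datasets, the optional fields control concurrency and experiment reuse. A sketch (the dataset, task, and scorer variables and all values are illustrative):

```go
result, err := evaluator.Run(ctx, eval.Opts[string, string]{
	Experiment:  "nightly-eval",
	Dataset:     dataset,
	Task:        task,
	Scorers:     []eval.Scorer[string, string]{scorer},
	Parallelism: 8,    // run up to 8 cases concurrently
	Update:      true, // append to the existing experiment instead of creating a new one
})
if err != nil {
	log.Fatal(err)
}
fmt.Println(result.Name())
```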
type Result ¶
type Result struct {
// contains filtered or unexported fields
}
Result contains the results of an evaluation.
type Score ¶
type Score struct {
// Name is the name of the score (e.g., "accuracy", "exact_match").
Name string
// Score is the numeric score value.
Score float64
// Metadata is optional additional metadata for this score.
Metadata map[string]interface{}
}
Score represents a single score result.
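A scorer can report several named scores for one case. Assuming Scores is a slice of Score (an assumption; the Scores type is not shown above), a sketch:

```go
scores := eval.Scores{
	{Name: "accuracy", Score: 1.0},
	{Name: "fluency", Score: 0.8, Metadata: map[string]interface{}{
		"judge": "exact", // illustrative metadata
	}},
}
```

For the common single-score case, the S helper (as in eval.S(1.0) in the package example) is more convenient.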
type Scorer ¶
type Scorer[I, R any] interface {
	// Name returns the name of this scorer.
	Name() string

	// Run evaluates the task result.
	// It returns one or more Score results.
	Run(context.Context, TaskResult[I, R]) (Scores, error)
}
Scorer is an interface for scoring the output of a task.
type TaskFunc ¶
TaskFunc is the signature for evaluation task functions. It receives the input, hooks for accessing eval context, and returns a TaskOutput.
type TaskHooks ¶
type TaskHooks struct {
	// TaskSpan and EvalSpan are the active spans for the task and the eval,
	// in case you want to add custom attributes or events.
	TaskSpan oteltrace.Span
	EvalSpan oteltrace.Span

	// Read-only fields. These are rarely needed inside a task function,
	// but are available for advanced use cases.
	Expected any      // Not usually used in tasks, so this is untyped
	Metadata Metadata // Case metadata
	Tags     []string // Case tags
}
TaskHooks provides access to evaluation context within a task. All fields are read-only except for span modification.
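The exact hooks-aware TaskFunc signature is not shown above; assuming it receives a *TaskHooks alongside the input (an assumption), a task might annotate its span like this (a sketch):

```go
task := func(ctx context.Context, input string, hooks *eval.TaskHooks) (eval.TaskOutput[string], error) {
	// Attach a custom attribute to the task span (assumed signature;
	// SetAttributes is the standard OpenTelemetry span API).
	hooks.TaskSpan.SetAttributes(attribute.Int("input.len", len(input)))
	return eval.TaskOutput[string]{Value: input + "!"}, nil
}
```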
type TaskOutput ¶
type TaskOutput[R any] struct {
	Value R

	// UserData allows passing custom application context to scorers.
	// This field is NOT logged and isn't supported outside the context of the Go SDK.
	// Use this for in-process data like database connections, file handles, or metrics.
	UserData any
}
TaskOutput wraps the output value from a task.
type TaskResult ¶
type TaskResult[I, R any] struct {
	Input    I        // The case input
	Expected R        // What we expected
	Output   R        // What the task actually returned
	Metadata Metadata // Case metadata

	// UserData is custom application context from the task.
	// This field is NOT logged and isn't supported outside the context of the Go SDK.
	UserData any
}
TaskResult represents the complete result of executing a task on a case. This is passed to scorers for evaluation.
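A sketch of a scorer that consumes UserData set by the task in TaskOutput.UserData (the latency value and threshold are illustrative):

```go
scorer := eval.NewScorer("latency", func(ctx context.Context, r eval.TaskResult[string, string]) (eval.Scores, error) {
	// UserData is whatever the task stored; assert its concrete type.
	latency, ok := r.UserData.(time.Duration)
	if !ok {
		return eval.S(0.0), nil
	}
	if latency < time.Second {
		return eval.S(1.0), nil
	}
	return eval.S(0.5), nil
})
```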