llmtest

package module
v0.0.1 Latest
Note: this package is not in the latest version of its module.

Published: Feb 28, 2026 License: MIT Imports: 14 Imported by: 0

README

llmtest

LLM testing for Go. Extends go test with eval scorers, no new tools to learn.

Install

go get github.com/adamwoolhether/llmtest

Quick Start

Already using strings.Contains in your tests? Replace it with llmtest.Contains. Same logic, now with structured results, scorer metrics, and a path to LLM-as-judge when you're ready.

package myapp_test

import (
    "testing"

    "github.com/adamwoolhether/llmtest"
)

func TestGreeting(t *testing.T) {
    llmtest.Run(t, "polite", func(e *llmtest.E) {
        e.Case(llmtest.TestCase{
            Input:        "Say hello",
            ActualOutput: callMyLLM("Say hello"),
        })
        e.Require(llmtest.Contains("hello"))
        e.Check(llmtest.LengthBetween(1, 500))
    })
}

Run with:

LLMTEST=1 go test ./...

Tests are skipped unless LLMTEST=1 is set, so evals never run accidentally during normal development.

Zero-Cost Quickstart (Ollama)

No API key needed. Run evals locally with Ollama:

# After installing Ollama, pull a model
ollama pull llama3.2

# Run your eval tests (llmtest auto-detects Ollama)
LLMTEST=1 LLMTEST_PROVIDER=ollama go test -v ./...

Free, private, runs on your laptop. Swap to OpenAI or Anthropic later by setting the API key.

Core Workflow

Run(t, name, func(e *E) {
    e.Case(tc)            // set the test case
    e.Config(opts...)     // optional: set eval-level options
    e.Require(scorer)     // hard constraint: stops on failure
    e.Check(scorer)       // soft constraint: records but continues
})

Require vs Check: Require calls t.FailNow on failure, so the test stops immediately. Check marks the test as failed but does not stop it; remaining scorers still execute. Use Require for hard constraints and Check for soft/informational metrics.

Run does not call t.Parallel(). Call it on t before Run if you want parallel subtests.

Verdicts

Every scorer returns one of three verdicts:

Verdict Score Meaning
Pass 1.0 Criterion fully met
Partial 0.5 Criterion partially met
Fail 0.0 Criterion not met

Deterministic scorers only return Pass or Fail. The Partial verdict is used by LLM-based scorers like Rubric.

TestCase Fields

Field Type Description
Input string The prompt sent to the LLM under test
ActualOutput string The LLM's response (the text being evaluated)
ExpectedOutput string Reference/ideal answer (optional)
Context []string Background info given to the LLM (optional)
RetrievalContext []string Documents retrieved by a RAG pipeline (optional)
Metadata map[string]any Arbitrary data for custom scorers

Deterministic Scorers

These run locally with no network calls.

Substring
Scorer Description
Contains(s) Output contains substring s
ContainsAll(s...) Output contains every substring
ContainsAny(s...) Output contains at least one substring
NotContains(s) Output does not contain s
NotContainsAny(s...) Output contains none of the substrings
Regex
Scorer Description
MatchesRegex(p) Output matches regex pattern p
NotMatchesRegex(p) Output does not match regex pattern p
JSON
Scorer Description
IsJSON() Output is valid JSON
MatchesJSONSchema(s) Output conforms to JSON Schema s (reports schema errors at scoring time)
MustMatchJSONSchema(s) Same, but panics at construction on an invalid schema (fail fast)
Structure
Scorer Description
LengthBetween(min, max) Output byte length is in [min, max]
ContainsTag(tag) Output has <tag>...</tag> pair
ExtractTag(tag, inner) Extract tag content, then apply inner scorer
Scorer Composition

ExtractTag enables composition: extract content from a tag, then evaluate it:

// Verify the <answer> tag contains "42"
llmtest.ExtractTag("answer", llmtest.Contains("42"))
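The extraction step itself is simple string scanning. A minimal self-contained sketch of how a tag extractor can work (illustrative only, not llmtest's actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// extractTag returns the text between <tag> and </tag> and whether a
// matching pair was found. Sketch only: no nesting, first match wins.
func extractTag(s, tag string) (string, bool) {
	open, closing := "<"+tag+">", "</"+tag+">"
	i := strings.Index(s, open)
	if i < 0 {
		return "", false
	}
	rest := s[i+len(open):]
	j := strings.Index(rest, closing)
	if j < 0 {
		return "", false
	}
	return rest[:j], true
}

func main() {
	inner, ok := extractTag("Result: <answer>42</answer>", "answer")
	fmt.Println(inner, ok) // 42 true
}
```

The inner scorer then runs against the extracted text instead of the full output.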

Rubric Scorer

Rubric is an LLM-based scorer that sends the test case to a judge model for criterion evaluation.

e.Require(llmtest.Rubric("Response is polite and professional"))

The judge returns one of three verdicts:

  • Pass (1.0): criterion fully met
  • Partial (0.5): criterion partially met
  • Fail (0.0): criterion not met
Options
Option Description
Model(m) Override the judge model for this scorer
Provider(p) Override the provider for this scorer
Threshold(t) Minimum score to pass (default 1.0; set to 0.5 to accept Partial)
ConsistencyCheck(n) Run n times, take majority verdict
e.Require(llmtest.Rubric("Is polite",
    llmtest.Model("gpt-4o"),
    llmtest.Threshold(0.5),
))

LLM calls include automatic rate-limit retry with backoff and JSON repair (re-prompts if the judge returns invalid JSON).
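The majority vote behind ConsistencyCheck(n) can be sketched in a few lines. This is an illustration of the idea only; llmtest's actual tie-breaking behavior is not documented here, so this sketch breaks ties by first occurrence:

```go
package main

import "fmt"

// majority returns the verdict that appears most often among the
// judge's runs. Ties go to the verdict that reached the top count
// first — a simplifying assumption for this sketch.
func majority(verdicts []string) string {
	counts := map[string]int{}
	best, bestN := "", 0
	for _, v := range verdicts {
		counts[v]++
		if counts[v] > bestN {
			best, bestN = v, counts[v]
		}
	}
	return best
}

func main() {
	fmt.Println(majority([]string{"PASS", "FAIL", "PASS"})) // PASS
}
```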

Configuration Priority

LLM-based scorers resolve model and provider in this order:

  1. Scorer-level: Model(), Provider() options on individual scorers
  2. Eval-level: EvalModel(), EvalProvider() via e.Config()
  3. Environment: LLMTEST_MODEL, LLMTEST_PROVIDER
  4. Auto-detection: probe API keys / local services
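This priority order amounts to "first non-empty value wins". A self-contained sketch of that resolution logic (illustrative, not the library's code):

```go
package main

import "fmt"

// resolveModel picks the first non-empty value in priority order:
// scorer-level option, eval-level option, environment variable,
// then the auto-detected provider default.
func resolveModel(scorer, eval, env, detected string) string {
	for _, v := range []string{scorer, eval, env, detected} {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// Eval-level value wins when no scorer-level override is set,
	// even though an environment variable is also present.
	fmt.Println(resolveModel("", "gpt-4o", "llama3.2", "gpt-4.1-mini")) // gpt-4o
}
```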

Providers

Provider Constructor Default Model Required Env Var
OpenAI OpenAI() gpt-4.1-mini OPENAI_API_KEY
Anthropic Anthropic() claude-sonnet-4-5-20250929 ANTHROPIC_API_KEY
Ollama Ollama() llama3.2 (none, local service)

Auto-detection order when LLMTEST_PROVIDER is unset:

  1. OPENAI_API_KEY present → OpenAI
  2. ANTHROPIC_API_KEY present → Anthropic
  3. Ollama reachable → Ollama

Most users don't need to import the provider package directly. Just set the environment variable.
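The detection order above is a straight cascade. A sketch of that logic, with the Ollama reachability probe stubbed out as a boolean (illustrative only, not the library's code):

```go
package main

import "fmt"

// detectProvider mirrors the documented auto-detection order.
// ollamaUp stands in for actually probing the local Ollama endpoint.
func detectProvider(env map[string]string, ollamaUp bool) (string, error) {
	switch {
	case env["OPENAI_API_KEY"] != "":
		return "openai", nil
	case env["ANTHROPIC_API_KEY"] != "":
		return "anthropic", nil
	case ollamaUp:
		return "ollama", nil
	}
	return "", fmt.Errorf("no provider found: set OPENAI_API_KEY or ANTHROPIC_API_KEY, or start Ollama")
}

func main() {
	p, _ := detectProvider(map[string]string{"ANTHROPIC_API_KEY": "sk-..."}, false)
	fmt.Println(p) // anthropic
}
```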

Environment Variables

Variable Description Default
LLMTEST Set to 1 to enable eval tests (unset = skip all)
LLMTEST_PROVIDER Provider: openai, anthropic, or ollama auto-detect
LLMTEST_MODEL Override default model provider default
LLMTEST_CONCURRENCY Max parallel LLM calls 5
LLMTEST_NO_CACHE Set to 1 to bypass response cache (unset = cache enabled)
LLMTEST_OUTPUT Path to write JSON summary (unset = no file)
LLMTEST_OLLAMA_URL Ollama endpoint http://localhost:11434

Patterns

Table-Driven Tests
cases := []struct {
    name   string
    input  string
    output string
}{
    {"greeting", "Say hi", "Hello there!"},
    {"farewell", "Say bye", "Goodbye!"},
}

for _, tc := range cases {
    llmtest.Run(t, tc.name, func(e *llmtest.E) {
        e.Case(llmtest.TestCase{
            Input:        tc.input,
            ActualOutput: tc.output,
        })
        e.Require(llmtest.LengthBetween(1, 200))
    })
}

Call Run directly in the loop. Do not wrap it in another t.Run, or you'll get double-nested test names.

Combining Scorers
llmtest.Run(t, "structured_response", func(e *llmtest.E) {
    e.Case(llmtest.TestCase{
        Input:        "Explain Go interfaces",
        ActualOutput: response,
    })
    e.Require(llmtest.IsJSON())
    e.Require(llmtest.Contains("interface"))
    e.Check(llmtest.LengthBetween(100, 2000))
    e.Check(llmtest.Rubric("Explanation is clear and accurate"))
})
JSON Output

Collect structured results by setting LLMTEST_OUTPUT and calling Flush from TestMain:

func TestMain(m *testing.M) {
    code := m.Run()
    llmtest.Flush()
    os.Exit(code)
}
LLMTEST=1 LLMTEST_OUTPUT=results.json go test ./...

Caching & Concurrency

  • LLM responses are cached to disk in a .llmtest-cache/ directory (created in the working directory). Cache keys are derived from scorer name, prompt, and model. Entries expire after 24 hours. Add .llmtest-cache/ to your .gitignore.
  • Set LLMTEST_NO_CACHE=1 to bypass both reads and writes.
  • Concurrent LLM calls are limited by LLMTEST_CONCURRENCY (default: 5).
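Since cache keys are derived from scorer name, prompt, and model, changing any one of them produces a cache miss. A sketch of how such a key can be derived (the exact key format llmtest uses is an internal detail; this is an assumption-laden illustration):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// cacheKey hashes the three documented inputs into a short hex string
// suitable for a filename under .llmtest-cache/. The NUL separator
// prevents ("ab","c") and ("a","bc") from colliding.
func cacheKey(scorer, prompt, model string) string {
	h := sha256.Sum256([]byte(scorer + "\x00" + prompt + "\x00" + model))
	return fmt.Sprintf("%x", h[:8])
}

func main() {
	fmt.Println(cacheKey(`Rubric("Is polite")`, "Say hello", "gpt-4.1-mini"))
}
```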

Structured Test Attributes (T.Attr)

Every scorer call emits structured key-value attributes via testing.T.Attr (Go 1.25+):

Attribute Key Value
llmtest.scorer Scorer name, e.g. Contains("hello")
llmtest.verdict PASS, PARTIAL, or FAIL
llmtest.score Numeric score: 1.00, 0.50, 0.00
llmtest.reason Human-readable explanation
llmtest.tokens LLM tokens consumed (0 for deterministic)
llmtest.latency_ms Scorer wall-clock time in ms

Attributes are visible in go test -v output and machine-readable via go test -json. Eval results are structured data inside go test, not a separate tool.

CI Integration

Run evals in GitHub Actions with go test -json to get machine-readable results:

name: LLM Evals
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.25'
      - name: Run evals
        env:
          LLMTEST: "1"
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: go test -json ./... | tee eval-results.json

The go test -json output contains llmtest.* attributes for each scorer result. Parse them downstream for dashboards, alerts, or trend tracking.

Comparison

Feature                    llmtest    promptfoo           deepeval      goeval        maragu.dev/gai/eval
Language                   Go         YAML/Node           Python        Go            Go
Runs in go test            ✅         ❌ (separate CLI)   ❌ (pytest)   ✅            ✅
Structured attrs (T.Attr)  ✅         ❌                  ❌            ❌            ❌
LLM judge                  ✅         ✅                  ✅            ❌            ⚠️ minimal
Deterministic scorers      ✅         ✅                  ✅            ⚠️ limited    ⚠️ minimal
Config format              Go code    YAML                Python        Go code       Go code

Documentation

Overview

Package llmtest provides a testing framework for evaluating LLM outputs using deterministic and LLM-based scorers within Go's testing.T.

Getting Started

The minimal workflow is: call Run inside a test, set the TestCase with E.Case, then assert with E.Require or E.Check:

func TestGreeting(t *testing.T) {
    llmtest.Run(t, "polite", func(e *llmtest.E) {
        e.Case(llmtest.TestCase{
            Input:        "Say hello",
            ActualOutput: callMyLLM("Say hello"),
        })
        e.Require(llmtest.Contains("hello"))
    })
}

Tests are skipped unless LLMTEST=1 is set, so LLM evals never run accidentally during normal development.

Scorers

Scorers come in two flavors:

Deterministic scorers run locally with no network calls (Contains, IsJSON, MatchesRegex, LengthBetween, and friends).

LLM-based scorers send the test case to a judge model:

  • Rubric — criterion evaluation with Pass/Partial/Fail verdicts

Require vs Check

E.Require calls t.FailNow on failure — the test stops immediately. Use it for hard constraints that make further evaluation meaningless.

E.Check marks the test as failed but does not stop it — remaining scorers still execute. Use it for soft constraints where you want all results.

Scorer Composition

ExtractTag wraps an inner scorer, enabling composition:

// Extract <answer>...</answer> then check its content.
llmtest.ExtractTag("answer", llmtest.Contains("42"))

Configuration Priority

LLM-based scorers resolve model and provider with this priority:

  1. Scorer-level options: Model, Provider
  2. Eval-level options: EvalModel, EvalProvider (via E.Config)
  3. Environment variables: LLMTEST_MODEL, LLMTEST_PROVIDER
  4. Auto-detection (see below)

Environment Variables

  • LLMTEST=1 Enable LLM eval tests (omit to skip all evals)
  • LLMTEST_PROVIDER Select provider: "openai", "anthropic", or "ollama"
  • LLMTEST_MODEL Override the default model for the selected provider
  • LLMTEST_CONCURRENCY Max parallel LLM calls (default: 5)
  • LLMTEST_NO_CACHE=1 Bypass the response cache
  • LLMTEST_OUTPUT Path to write a JSON summary file (e.g. results.json)
  • LLMTEST_OLLAMA_URL Ollama endpoint (default: http://localhost:11434)

Auto-detection order when LLMTEST_PROVIDER is unset:

  1. OPENAI_API_KEY present → OpenAI
  2. ANTHROPIC_API_KEY present → Anthropic
  3. Ollama reachable at URL → Ollama
  4. None found → clear error listing all options

JSON Output

Set LLMTEST_OUTPUT to a file path to collect structured results. Call Flush from TestMain after m.Run completes:

func TestMain(m *testing.M) {
    code := m.Run()
    llmtest.Flush()
    os.Exit(code)
}

Caching

LLM-based scorer results are cached to disk in a `.llmtest-cache/` directory (created in the working directory). Cache keys are derived from the scorer name, prompt, and model; entries expire after 24 hours. Set LLMTEST_NO_CACHE=1 to bypass both reads and writes.

Constants

This section is empty.

Variables

This section is empty.

Functions

func Flush

func Flush()

Flush writes accumulated eval results to the file specified by LLMTEST_OUTPUT. If the variable is unset, Flush is a no-op.

Call Flush from TestMain after m.Run completes:

func TestMain(m *testing.M) {
    code := m.Run()
    llmtest.Flush()
    os.Exit(code)
}

func Run

func Run(t *testing.T, name string, fn func(e *E))

Run executes an LLM eval within a *testing.T context. Evals are skipped unless LLMTEST=1 is set.

Run internally calls t.Run(name, ...) to create a subtest. In table-driven tests, call Run directly in the loop — do NOT wrap it in another t.Run, or you'll get double-nested output:

for _, tc := range cases {
    llmtest.Run(t, tc.name, func(e *llmtest.E) { ... })  // correct
}

NOT:

for _, tc := range cases {
    t.Run(tc.name, func(t *testing.T) {
        llmtest.Run(t, tc.name, ...)  // double-nested: TestX/name/name
    })
}

Run does NOT call t.Parallel(). Call it yourself before Run if needed.

Types

type E

type E struct {
	// contains filtered or unexported fields
}

E is the eval context passed to every Run callback. Use its methods to set the test case, configure LLM options, and run scorers.

func (*E) Case

func (e *E) Case(tc TestCase)

Case sets the TestCase to evaluate. Call this before any Require or Check.

func (*E) Check

func (e *E) Check(s Scorer)

Check runs the scorer and marks the test as failed if the verdict is Fail, but does not stop the test. Use Check for soft constraints — remaining scorers still execute even on failure.

func (*E) Config

func (e *E) Config(opts ...EvalOption)

Config applies eval-level options (e.g. EvalModel, EvalProvider) that are inherited by all LLM-based scorers in this eval. Scorer-level options take precedence over eval-level options.

func (*E) Require

func (e *E) Require(s Scorer)

Require runs the scorer and calls t.FailNow if the verdict is Fail. Use Require for hard constraints — the test stops immediately on failure.

type EvalOption

type EvalOption func(*evalOptions)

EvalOption configures all LLM-based scorers within a single Run call. Set via E.Config. Scorer-level ScorerOption values take precedence.

func EvalModel

func EvalModel(m string) EvalOption

EvalModel sets the model for all LLM-based scorers in this eval. Scorer-level Model() takes precedence over EvalModel().

func EvalProvider

func EvalProvider(p provider.Provider) EvalOption

EvalProvider sets the provider for all LLM-based scorers in this eval. Scorer-level Provider() takes precedence over EvalProvider().

type Result

type Result struct {
	// Verdict is the categorical outcome: Pass, Partial, or Fail.
	Verdict Verdict

	// Score is a numeric value in the range [0.0, 1.0].
	// Pass=1.0, Partial=0.5, Fail=0.0.
	Score float64

	// Reason is a human-readable explanation of the verdict.
	Reason string

	// Tokens is the total number of LLM tokens consumed (input + output).
	// Zero for deterministic scorers.
	Tokens int

	// LatencyMS is the wall-clock time of the scorer in milliseconds.
	// Zero for deterministic scorers.
	LatencyMS int64
}

Result is the outcome of a Scorer evaluation.

type Scorer

type Scorer interface {
	// Name returns a human-readable identifier for this scorer,
	// typically including its configuration (e.g. `Contains("hello")`).
	Name() string

	// Score evaluates tc and returns a [Result]. The context carries
	// cancellation and deadline from the test. Deterministic scorers
	// may ignore ctx; LLM-based scorers use it for API calls.
	Score(ctx context.Context, tc TestCase) (Result, error)
}

Scorer evaluates a TestCase and returns a Result.

Deterministic scorers (e.g. Contains, IsJSON) run locally with no network calls. LLM-based scorers (e.g. Rubric) send the test case to a language model for evaluation.

Every Scorer must be safe for concurrent use.

func Contains

func Contains(substr string) Scorer

Contains returns a scorer that passes when ActualOutput contains substr.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "Hello, world!"}
	res, _ := llmtest.Contains("world").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func ContainsAll

func ContainsAll(substrs ...string) Scorer

ContainsAll returns a scorer that passes when ActualOutput contains every one of the given substrings.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "The quick brown fox"}
	res, _ := llmtest.ContainsAll("quick", "fox").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func ContainsAny

func ContainsAny(substrs ...string) Scorer

ContainsAny returns a scorer that passes when ActualOutput contains at least one of the given substrings.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "The color is blue"}
	res, _ := llmtest.ContainsAny("red", "blue", "green").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func ContainsTag

func ContainsTag(tag string) Scorer

ContainsTag returns a scorer that passes when ActualOutput contains a matching XML-style tag pair: <tag>...</tag>.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{
		ActualOutput: "Reasoning: <thinking>Let me think...</thinking>",
	}
	res, _ := llmtest.ContainsTag("thinking").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func ExtractTag

func ExtractTag(tag string, inner Scorer) Scorer

ExtractTag returns a composable scorer that extracts the text between <tag> and </tag> in ActualOutput, then delegates to inner. This enables scorer composition — for example:

ExtractTag("answer", Contains("42"))

extracts the content of <answer>...</answer> and checks that it contains "42".

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{
		ActualOutput: "Here is the result: <answer>42</answer>",
	}
	scorer := llmtest.ExtractTag("answer", llmtest.Contains("42"))
	res, _ := scorer.Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func IsJSON

func IsJSON() Scorer

IsJSON returns a scorer that passes when ActualOutput is valid JSON.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: `{"status":"ok","count":3}`}
	res, _ := llmtest.IsJSON().Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func LengthBetween

func LengthBetween(min, max int) Scorer

LengthBetween returns a scorer that passes when the byte length of ActualOutput is between min and max (inclusive).

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "short"}
	res, _ := llmtest.LengthBetween(1, 10).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func MatchesJSONSchema

func MatchesJSONSchema(schema string) Scorer

MatchesJSONSchema returns a scorer that passes when ActualOutput is valid JSON conforming to the given JSON Schema. The schema is compiled on each call to Score; to fail fast on an invalid schema, use MustMatchJSONSchema, which compiles it once at construction.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	schema := `{
		"type": "object",
		"properties": {
			"name": {"type": "string"},
			"age":  {"type": "integer"}
		},
		"required": ["name", "age"]
	}`
	tc := llmtest.TestCase{ActualOutput: `{"name":"Alice","age":30}`}
	res, _ := llmtest.MatchesJSONSchema(schema).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func MatchesRegex

func MatchesRegex(pattern string) Scorer

MatchesRegex returns a scorer that passes when ActualOutput matches the given regular expression pattern. The pattern is compiled on each call to Score; an invalid pattern returns an error.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "Order #12345 confirmed"}
	res, _ := llmtest.MatchesRegex(`#\d{5}`).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func MustMatchJSONSchema

func MustMatchJSONSchema(schema string) Scorer

MustMatchJSONSchema compiles the JSON schema at construction time and panics if the schema is invalid (similar to regexp.MustCompile). Use this when the schema is a compile-time constant and you want to catch errors early. For schemas that may be invalid at runtime, use MatchesJSONSchema instead, which reports schema errors from Score().

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	schema := `{
		"type": "object",
		"properties": {
			"status": {"type": "string"},
			"count":  {"type": "integer"}
		},
		"required": ["status", "count"]
	}`
	tc := llmtest.TestCase{ActualOutput: `{"status":"ok","count":3}`}
	res, _ := llmtest.MustMatchJSONSchema(schema).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func NotContains

func NotContains(substr string) Scorer

NotContains returns a scorer that passes when ActualOutput does not contain substr.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "The answer is 42"}
	res, _ := llmtest.NotContains("error").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func NotContainsAny

func NotContainsAny(substrs ...string) Scorer

NotContainsAny returns a scorer that passes when ActualOutput contains none of the given substrings.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "The weather is sunny"}
	res, _ := llmtest.NotContainsAny("error", "fail", "crash").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func NotMatchesRegex

func NotMatchesRegex(pattern string) Scorer

NotMatchesRegex returns a scorer that passes when ActualOutput does not match the given regular expression pattern.

Example
package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "No numbers here"}
	res, _ := llmtest.NotMatchesRegex(`\d+`).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}
Output:

PASS

func Rubric

func Rubric(criterion string, opts ...ScorerOption) Scorer

Rubric returns an LLM-based scorer that asks a judge model to evaluate ActualOutput against the given criterion. The judge returns a verdict of Pass, Partial, or Fail along with its reasoning.

Results are cached by (criterion + input + output + model) so identical calls within a test run do not repeat the LLM request. Set LLMTEST_NO_CACHE=1 to bypass the cache.

Options:

  • Model / Provider: override the judge model or provider for this scorer
  • Threshold: set the minimum score required to pass (0.5 accepts a Partial verdict)
  • ConsistencyCheck: run the judge n times and take the majority verdict

type ScorerOption

type ScorerOption func(*scorerOptions)

ScorerOption configures a single LLM-based scorer (e.g. Rubric). Scorer-level options take precedence over eval-level EvalOption values, which in turn override environment variables.

func ConsistencyCheck

func ConsistencyCheck(n int) ScorerOption

ConsistencyCheck runs the scorer n times and requires majority agreement.

func Model

func Model(m string) ScorerOption

Model sets the model for this specific scorer. Takes precedence over EvalModel().

func Provider

func Provider(p provider.Provider) ScorerOption

Provider sets the provider for this specific scorer. Takes precedence over EvalProvider().

func Threshold

func Threshold(t float64) ScorerOption

Threshold sets the minimum score a Rubric result needs to count as passing; with Threshold(0.5), a Partial verdict passes. Only meaningful for LLM-based scorers (Rubric); deterministic scorers ignore it. A threshold of 0 is treated as unset (defaults to 1.0).
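The threshold check reduces to a score comparison. A sketch of that logic under the documented defaults (illustrative, not the library's code):

```go
package main

import "fmt"

// passes reports whether a result with the given score meets the
// threshold. A zero threshold is treated as unset (1.0), matching
// the documented default.
func passes(score, threshold float64) bool {
	if threshold == 0 {
		threshold = 1.0
	}
	return score >= threshold
}

func main() {
	fmt.Println(passes(0.5, 0))   // false: Partial fails at the default
	fmt.Println(passes(0.5, 0.5)) // true: Threshold(0.5) accepts Partial
}
```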

type TestCase

type TestCase struct {
	// Input is the prompt or question sent to the LLM under test.
	Input string

	// ActualOutput is the LLM's response — the text being evaluated.
	ActualOutput string

	// ExpectedOutput is the ideal or reference answer (optional).
	// Used by scorers that compare actual vs expected output.
	ExpectedOutput string

	// Context provides background information given to the LLM (optional).
	// For example, system prompt content or retrieved documents.
	Context []string

	// RetrievalContext holds the documents retrieved by a RAG pipeline
	// (optional). Useful for evaluating retrieval relevance.
	RetrievalContext []string

	// Metadata carries arbitrary key-value data for custom scorers.
	Metadata map[string]any
}

TestCase holds the data that scorers evaluate.

type Verdict

type Verdict string

Verdict represents a scorer's classification of the output. The three possible values are Pass (score 1.0), Partial (score 0.5), and Fail (score 0.0).

const (
	Pass    Verdict = "PASS"
	Partial Verdict = "PARTIAL"
	Fail    Verdict = "FAIL"
)

Verdict constants for scorer results.

func (Verdict) String

func (v Verdict) String() string

Directories

Path Synopsis
internal
  cache    Package cache provides disk-based caching for LLM scorer responses.
  limiter  Package limiter provides concurrency limiting for parallel LLM calls.
  prompt   Package prompt provides prompt templates for LLM-based scorers.
  retry    Package retry provides retry logic with backoff for LLM API calls.
provider   Package provider defines the Provider interface and built-in LLM backends used by llmtest's LLM-based scorers.
