Documentation
¶
Overview ¶
Package llmtest provides a testing framework for evaluating LLM outputs using deterministic and LLM-based scorers within Go's testing.T.
Getting Started ¶
The minimal workflow is: call Run inside a test, set the TestCase with E.Case, then assert with E.Require or E.Check:
func TestGreeting(t *testing.T) {
llmtest.Run(t, "polite", func(e *llmtest.E) {
e.Case(llmtest.TestCase{
Input: "Say hello",
ActualOutput: callMyLLM("Say hello"),
})
e.Require(llmtest.Contains("hello"))
})
}
Tests are skipped unless LLMTEST=1 is set, so LLM evals never run accidentally during normal development.
Scorers ¶
Scorers come in two flavors:
Deterministic scorers run locally with no network calls:
- Contains, ContainsAll, ContainsAny — substring checks
- NotContains, NotContainsAny — forbidden-term checks
- MatchesRegex, NotMatchesRegex — regex matching
- IsJSON, MatchesJSONSchema, MustMatchJSONSchema — JSON validation
- LengthBetween — output length bounds
- ContainsTag — XML-style tag presence
- ExtractTag — composable tag extraction (wraps another scorer)
LLM-based scorers send the test case to a judge model:
- Rubric — criterion evaluation with Pass/Partial/Fail verdicts
Require vs Check ¶
E.Require calls t.Fatalf on failure — the test stops immediately. Use it for hard constraints that make further evaluation meaningless.
E.Check marks the test as failed but does not stop it — remaining scorers still execute. Use it for soft constraints where you want all results.
Scorer Composition ¶
ExtractTag wraps an inner scorer, enabling composition:
// Extract <answer>...</answer> then check its content.
llmtest.ExtractTag("answer", llmtest.Contains("42"))
Configuration Priority ¶
LLM-based scorers resolve model and provider with this priority:
- Scorer-level options: Model, Provider
- Eval-level options: EvalModel, EvalProvider (via E.Config)
- Environment variables: LLMTEST_MODEL, LLMTEST_PROVIDER
- Auto-detection (see below)
Environment Variables ¶
- LLMTEST=1 Enable LLM eval tests (omit to skip all evals)
- LLMTEST_PROVIDER Select provider: "openai", "anthropic", or "ollama"
- LLMTEST_MODEL Override the default model for the selected provider
- LLMTEST_CONCURRENCY Max parallel LLM calls (default: 5)
- LLMTEST_NO_CACHE=1 Bypass the response cache
- LLMTEST_OUTPUT Path to write a JSON summary file (e.g. results.json)
- LLMTEST_OLLAMA_URL Ollama endpoint (default: http://localhost:11434)
Auto-detection order when LLMTEST_PROVIDER is unset:
- OPENAI_API_KEY present → OpenAI
- ANTHROPIC_API_KEY present → Anthropic
- Ollama reachable at URL → Ollama
- None found → clear error listing all options
JSON Output ¶
Set LLMTEST_OUTPUT to a file path to collect structured results. Call Flush from TestMain after m.Run completes:
func TestMain(m *testing.M) {
code := m.Run()
llmtest.Flush()
os.Exit(code)
}
Caching ¶
LLM-based scorer results are cached to disk in a `.llmtest-cache/` directory (created in the working directory). Cache keys are derived from the scorer name, prompt, and model; entries expire after 24 hours. Set LLMTEST_NO_CACHE=1 to bypass both reads and writes.
Index ¶
- func Flush()
- func Run(t *testing.T, name string, fn func(e *E))
- type E
- type EvalOption
- type Result
- type Scorer
- func Contains(substr string) Scorer
- func ContainsAll(substrs ...string) Scorer
- func ContainsAny(substrs ...string) Scorer
- func ContainsTag(tag string) Scorer
- func ExtractTag(tag string, inner Scorer) Scorer
- func IsJSON() Scorer
- func LengthBetween(min, max int) Scorer
- func MatchesJSONSchema(schema string) Scorer
- func MatchesRegex(pattern string) Scorer
- func MustMatchJSONSchema(schema string) Scorer
- func NotContains(substr string) Scorer
- func NotContainsAny(substrs ...string) Scorer
- func NotMatchesRegex(pattern string) Scorer
- func Rubric(criterion string, opts ...ScorerOption) Scorer
- type ScorerOption
- type TestCase
- type Verdict
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Flush ¶
func Flush()
Flush writes accumulated eval results to the file specified by LLMTEST_OUTPUT. If the variable is unset, Flush is a no-op.
Call Flush from TestMain after m.Run completes:
func TestMain(m *testing.M) {
code := m.Run()
llmtest.Flush()
os.Exit(code)
}
func Run ¶
Run executes an LLM eval within a *testing.T context. Evals are skipped unless LLMTEST=1 is set.
Run internally calls t.Run(name, ...) to create a subtest. In table-driven tests, call Run directly in the loop — do NOT wrap it in another t.Run, or you'll get double-nested output:
for _, tc := range cases {
llmtest.Run(t, tc.name, func(e *llmtest.E) { ... }) // correct
}
NOT:
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
llmtest.Run(t, tc.name, ...) // double-nested: TestX/name/name
})
}
Run does NOT call t.Parallel(). Call it yourself before Run if needed.
Types ¶
type E ¶
type E struct {
// contains filtered or unexported fields
}
E is the eval context passed to every Run callback. Use its methods to set the test case, configure LLM options, and run scorers.
func (*E) Check ¶
Check runs the scorer and marks the test as failed if the verdict is Fail, but does not stop the test. Use Check for soft constraints — remaining scorers still execute even on failure.
func (*E) Config ¶
func (e *E) Config(opts ...EvalOption)
Config applies eval-level options (e.g. EvalModel, EvalProvider) that are inherited by all LLM-based scorers in this eval. Scorer-level options take precedence over eval-level options.
type EvalOption ¶
type EvalOption func(*evalOptions)
EvalOption configures all LLM-based scorers within a single Run call. Set via E.Config. Scorer-level ScorerOption values take precedence.
func EvalModel ¶
func EvalModel(m string) EvalOption
EvalModel sets the model for all LLM-based scorers in this eval. Scorer-level Model() takes precedence over EvalModel().
func EvalProvider ¶
func EvalProvider(p provider.Provider) EvalOption
EvalProvider sets the provider for all LLM-based scorers in this eval. Scorer-level Provider() takes precedence over EvalProvider().
type Result ¶
type Result struct {
// Verdict is the categorical outcome: Pass, Partial, or Fail.
Verdict Verdict
// Score is a numeric value in the range [0.0, 1.0].
// Pass=1.0, Partial=0.5, Fail=0.0.
Score float64
// Reason is a human-readable explanation of the verdict.
Reason string
// Tokens is the total number of LLM tokens consumed (input + output).
// Zero for deterministic scorers.
Tokens int
// LatencyMS is the wall-clock time of the scorer in milliseconds.
// Zero for deterministic scorers.
LatencyMS int64
}
Result is the outcome of a Scorer evaluation.
type Scorer ¶
type Scorer interface {
// Name returns a human-readable identifier for this scorer,
// typically including its configuration (e.g. `Contains("hello")`).
Name() string
// Score evaluates tc and returns a [Result]. The context carries
// cancellation and deadline from the test. Deterministic scorers
// may ignore ctx; LLM-based scorers use it for API calls.
Score(ctx context.Context, tc TestCase) (Result, error)
}
Scorer evaluates a TestCase and returns a Result.
Deterministic scorers (e.g. Contains, IsJSON) run locally with no network calls. LLM-based scorers (e.g. Rubric) send the test case to a language model for evaluation.
Every Scorer must be safe for concurrent use.
func Contains ¶
Contains returns a scorer that passes when ActualOutput contains substr.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: "Hello, world!"}
res, _ := llmtest.Contains("world").Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func ContainsAll ¶
ContainsAll returns a scorer that passes when ActualOutput contains every one of the given substrings.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: "The quick brown fox"}
res, _ := llmtest.ContainsAll("quick", "fox").Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func ContainsAny ¶
ContainsAny returns a scorer that passes when ActualOutput contains at least one of the given substrings.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: "The color is blue"}
res, _ := llmtest.ContainsAny("red", "blue", "green").Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func ContainsTag ¶
ContainsTag returns a scorer that passes when ActualOutput contains a matching XML-style tag pair: <tag>...</tag>.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{
ActualOutput: "Reasoning: <thinking>Let me think...</thinking>",
}
res, _ := llmtest.ContainsTag("thinking").Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func ExtractTag ¶
ExtractTag returns a composable scorer that extracts the text between <tag> and </tag> in ActualOutput, then delegates to inner. This enables scorer composition — for example:
ExtractTag("answer", Contains("42"))
extracts the content of <answer>...</answer> and checks that it contains "42".
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{
ActualOutput: "Here is the result: <answer>42</answer>",
}
scorer := llmtest.ExtractTag("answer", llmtest.Contains("42"))
res, _ := scorer.Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func IsJSON ¶
func IsJSON() Scorer
IsJSON returns a scorer that passes when ActualOutput is valid JSON.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: `{"status":"ok","count":3}`}
res, _ := llmtest.IsJSON().Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func LengthBetween ¶
LengthBetween returns a scorer that passes when the byte length of ActualOutput is between min and max (inclusive).
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: "short"}
res, _ := llmtest.LengthBetween(1, 10).Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func MatchesJSONSchema ¶
MatchesJSONSchema returns a scorer that passes when ActualOutput is valid JSON conforming to the given JSON Schema. The schema is compiled on each call to Score; for compile-time validation use MustMatchJSONSchema.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
schema := `{
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
},
"required": ["name", "age"]
}`
tc := llmtest.TestCase{ActualOutput: `{"name":"Alice","age":30}`}
res, _ := llmtest.MatchesJSONSchema(schema).Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func MatchesRegex ¶
MatchesRegex returns a scorer that passes when ActualOutput matches the given regular expression pattern. The pattern is compiled on each call to Score; an invalid pattern returns an error.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: "Order #12345 confirmed"}
res, _ := llmtest.MatchesRegex(`#\d{5}`).Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func MustMatchJSONSchema ¶
MustMatchJSONSchema compiles the JSON schema at construction time and panics if the schema is invalid (similar to regexp.MustCompile). Use this when the schema is a compile-time constant and you want to catch errors early. For schemas that may be invalid at runtime, use MatchesJSONSchema instead, which reports schema errors from Score().
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
schema := `{
"type": "object",
"properties": {
"status": {"type": "string"},
"count": {"type": "integer"}
},
"required": ["status", "count"]
}`
tc := llmtest.TestCase{ActualOutput: `{"status":"ok","count":3}`}
res, _ := llmtest.MustMatchJSONSchema(schema).Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func NotContains ¶
NotContains returns a scorer that passes when ActualOutput does not contain substr.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: "The answer is 42"}
res, _ := llmtest.NotContains("error").Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func NotContainsAny ¶
NotContainsAny returns a scorer that passes when ActualOutput contains none of the given substrings.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: "The weather is sunny"}
res, _ := llmtest.NotContainsAny("error", "fail", "crash").Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func NotMatchesRegex ¶
NotMatchesRegex returns a scorer that passes when ActualOutput does not match the given regular expression pattern.
Example ¶
package main
import (
"context"
"fmt"
"github.com/adamwoolhether/llmtest"
)
func main() {
tc := llmtest.TestCase{ActualOutput: "No numbers here"}
res, _ := llmtest.NotMatchesRegex(`\d+`).Score(context.Background(), tc)
fmt.Println(res.Verdict)
}
Output: PASS
func Rubric ¶
func Rubric(criterion string, opts ...ScorerOption) Scorer
Rubric returns an LLM-based scorer that asks a judge model to evaluate ActualOutput against the given criterion. The judge returns a verdict of Pass, Partial, or Fail along with its reasoning.
Results are cached by (criterion + input + output + model) so identical calls within a test run do not repeat the LLM request. Set LLMTEST_NO_CACHE=1 to bypass the cache.
Options:
- Model / Provider: override the judge model or provider for this scorer
- Threshold: set the minimum score to promote a Partial verdict to Pass
- ConsistencyCheck: run the judge n times and take the majority verdict
type ScorerOption ¶
type ScorerOption func(*scorerOptions)
ScorerOption configures a single LLM-based scorer (e.g. Rubric). Scorer-level options take precedence over eval-level EvalOption values, which in turn override environment variables.
func ConsistencyCheck ¶
func ConsistencyCheck(n int) ScorerOption
ConsistencyCheck runs the scorer n times and requires majority agreement.
func Model ¶
func Model(m string) ScorerOption
Model sets the model for this specific scorer. Takes precedence over EvalModel().
func Provider ¶
func Provider(p provider.Provider) ScorerOption
Provider sets the provider for this specific scorer. Takes precedence over EvalProvider().
func Threshold ¶
func Threshold(t float64) ScorerOption
Threshold sets the minimum score for a Rubric scorer to be promoted to Pass. Only meaningful for LLM-based scorers (Rubric); deterministic scorers ignore it. A threshold of 0 is treated as unset (defaults to 1.0).
type TestCase ¶
type TestCase struct {
// Input is the prompt or question sent to the LLM under test.
Input string
// ActualOutput is the LLM's response — the text being evaluated.
ActualOutput string
// ExpectedOutput is the ideal or reference answer (optional).
// Used by scorers that compare actual vs expected output.
ExpectedOutput string
// Context provides background information given to the LLM (optional).
// For example, system prompt content or retrieved documents.
Context []string
// RetrievalContext holds the documents retrieved by a RAG pipeline
// (optional). Useful for evaluating retrieval relevance.
RetrievalContext []string
// Metadata carries arbitrary key-value data for custom scorers.
Metadata map[string]any
}
TestCase holds the data that scorers evaluate.
Source Files
¶
Directories
¶
| Path | Synopsis |
|---|---|
|
internal
|
|
|
cache
Package cache provides disk-based caching for LLM scorer responses.
|
Package cache provides disk-based caching for LLM scorer responses. |
|
limiter
Package limiter provides concurrency limiting for parallel LLM calls.
|
Package limiter provides concurrency limiting for parallel LLM calls. |
|
prompt
Package prompt provides prompt templates for LLM-based scorers.
|
Package prompt provides prompt templates for LLM-based scorers. |
|
retry
Package retry provides retry logic with backoff for LLM API calls.
|
Package retry provides retry logic with backoff for LLM API calls. |
|
Package provider defines the Provider interface and built-in LLM backends used by llmtest's LLM-based scorers.
|
Package provider defines the Provider interface and built-in LLM backends used by llmtest's LLM-based scorers. |