llmtest

package module

Version: v0.0.1 (latest) | Published: Feb 28, 2026 | License: MIT | Imports: 14 | Imported by: 0

Repository

github.com/adamwoolhether/llmtest

README ¶

llmtest

LLM testing for Go. Extends go test with eval scorers, so there are no new tools to learn.

Install

go get github.com/adamwoolhether/llmtest

Quick Start

Already using strings.Contains in your tests? Replace it with llmtest.Contains. Same logic, now with structured results, scorer metrics, and a path to LLM-as-judge when you're ready.

package myapp_test

import (
    "testing"

    "github.com/adamwoolhether/llmtest"
)

func TestGreeting(t *testing.T) {
    llmtest.Run(t, "polite", func(e *llmtest.E) {
        e.Case(llmtest.TestCase{
            Input:        "Say hello",
            ActualOutput: callMyLLM("Say hello"),
        })
        e.Require(llmtest.Contains("hello"))
        e.Check(llmtest.LengthBetween(1, 500))
    })
}

Run with:

LLMTEST=1 go test ./...

Tests are skipped unless LLMTEST=1 is set, so evals never run accidentally during normal development.

Zero-Cost Quickstart (Ollama)

No API key needed. Run evals locally with Ollama:

# Install Ollama and pull a model
ollama pull llama3.2

# Run your eval tests (LLMTEST_PROVIDER=ollama selects Ollama explicitly)
LLMTEST=1 LLMTEST_PROVIDER=ollama go test -v ./...

Free, private, and runs on your laptop. To swap to OpenAI or Anthropic later, set the corresponding API key and change or unset LLMTEST_PROVIDER.

Core Workflow

Run(t, name, func(e *E) {
    e.Case(tc)            // set the test case
    e.Config(opts...)     // optional: set eval-level options
    e.Require(scorer)     // hard constraint: stops on failure
    e.Check(scorer)       // soft constraint: records but continues
})

Require vs Check: Require calls t.FailNow on failure, so the test stops immediately. Check marks the test as failed but does not stop it; remaining scorers still execute. Use Require for hard constraints and Check for soft/informational metrics.

Run does not call t.Parallel(). Call it on t before Run if you want parallel subtests.

Verdicts

Every scorer returns one of three verdicts:

Verdict	Score	Meaning
`Pass`	1.0	Criterion fully met
`Partial`	0.5	Criterion partially met
`Fail`	0.0	Criterion not met

Deterministic scorers only return Pass or Fail. The Partial verdict is used by LLM-based scorers like Rubric.

TestCase Fields

Field	Type	Description
`Input`	`string`	The prompt sent to the LLM under test
`ActualOutput`	`string`	The LLM's response (the text being evaluated)
`ExpectedOutput`	`string`	Reference/ideal answer (optional)
`Context`	`[]string`	Background info given to the LLM (optional)
`RetrievalContext`	`[]string`	Documents retrieved by a RAG pipeline (optional)
`Metadata`	`map[string]any`	Arbitrary data for custom scorers

Deterministic Scorers

These run locally with no network calls.

Substring

Scorer	Description
`Contains(s)`	Output contains substring `s`
`ContainsAll(s..)`	Output contains every substring
`ContainsAny(s..)`	Output contains at least one substring
`NotContains(s)`	Output does not contain `s`
`NotContainsAny(s..)`	Output contains none of the substrings

Regex

Scorer	Description
`MatchesRegex(p)`	Output matches regex pattern `p`
`NotMatchesRegex(p)`	Output does not match regex pattern `p`

JSON

Scorer	Description
`IsJSON()`	Output is valid JSON
`MatchesJSONSchema(s)`	Output conforms to JSON Schema `s` (reports schema errors at scoring time)
`MustMatchJSONSchema(s)`	Same, but panics at construction time on an invalid schema (fail-fast, like `regexp.MustCompile`)

Structure

Scorer	Description
`LengthBetween(min, max)`	Output byte length is in `[min, max]`
`ContainsTag(tag)`	Output has `<tag>...</tag>` pair
`ExtractTag(tag, inner)`	Extract tag content, then apply `inner` scorer

Scorer Composition

ExtractTag enables composition: extract content from a tag, then evaluate it:

// Verify the <answer> tag contains "42"
llmtest.ExtractTag("answer", llmtest.Contains("42"))
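The extraction step can be pictured with a small stand-alone sketch. The helper `extractTag` below is hypothetical and is not llmtest's implementation; it assumes the first `<tag>` and the next `</tag>` delimit the content, and it ignores nesting and attributes.

```go
package main

import (
	"fmt"
	"strings"
)

// extractTag returns the text between the first <tag> and the next </tag>.
// Hypothetical helper: llmtest's real rules for nesting, attributes, or
// repeated tags are not documented here.
func extractTag(output, tag string) (string, bool) {
	openTag, closeTag := "<"+tag+">", "</"+tag+">"
	start := strings.Index(output, openTag)
	if start < 0 {
		return "", false
	}
	start += len(openTag)
	end := strings.Index(output[start:], closeTag)
	if end < 0 {
		return "", false
	}
	return output[start : start+end], true
}

func main() {
	inner, ok := extractTag("Here is the result: <answer>42</answer>", "answer")
	// The inner scorer would then see only the extracted text.
	fmt.Println(inner, ok, strings.Contains(inner, "42"))
}
```

Running the inner check on the extracted text only means surrounding prose that happens to contain "42" would not satisfy it.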

Rubric Scorer

Rubric is an LLM-based scorer that sends the test case to a judge model for criterion evaluation.

e.Require(llmtest.Rubric("Response is polite and professional"))

The judge returns one of three verdicts:

Pass (1.0): criterion fully met
Partial (0.5): criterion partially met
Fail (0.0): criterion not met

Options

Option	Description
`Model(m)`	Override the judge model for this scorer
`Provider(p)`	Override the provider for this scorer
`Threshold(t)`	Minimum score to pass (default 1.0; set to 0.5 to accept Partial)
`ConsistencyCheck(n)`	Run `n` times, take majority verdict

e.Require(llmtest.Rubric("Is polite",
    llmtest.Model("gpt-4o"),
    llmtest.Threshold(0.5),
))
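As a rough illustration of how a threshold interacts with the three scores, the sketch below uses a hypothetical `passes` function. It assumes a simple "score >= threshold" comparison and, following the Threshold documentation, treats 0 as unset (falling back to 1.0); the library's internal logic may differ.

```go
package main

import "fmt"

// passes reports whether a score meets the threshold.
// Assumption: a threshold of 0 means "unset" and falls back to 1.0,
// as the Threshold documentation describes.
func passes(score, threshold float64) bool {
	if threshold == 0 {
		threshold = 1.0
	}
	return score >= threshold
}

func main() {
	for _, score := range []float64{1.0, 0.5, 0.0} {
		fmt.Printf("score=%.2f default=%v threshold0.5=%v\n",
			score, passes(score, 0), passes(score, 0.5))
	}
}
```

With the default threshold only a Pass (1.0) is accepted; with 0.5 a Partial is accepted as well.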

LLM calls include automatic rate-limit retry with backoff and JSON repair (re-prompts if the judge returns invalid JSON).

Configuration Priority

LLM-based scorers resolve model and provider in this order:

Scorer-level: Model(), Provider() options on individual scorers
Eval-level: EvalModel(), EvalProvider() via e.Config()
Environment: LLMTEST_MODEL, LLMTEST_PROVIDER
Auto-detection: probe API keys / local services
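The resolution order above amounts to "first non-empty value wins". A minimal sketch, using a hypothetical `firstNonEmpty` helper and made-up values rather than the library's own code:

```go
package main

import "fmt"

// firstNonEmpty returns the first non-empty value, mirroring the documented
// order: scorer-level, eval-level, environment, auto-detected default.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	scorerModel := ""          // no Model() option on the scorer
	evalModel := "gpt-4o"      // set via EvalModel() in e.Config
	envModel := "llama3.2"     // LLMTEST_MODEL
	detected := "gpt-4.1-mini" // provider default after auto-detection
	fmt.Println(firstNonEmpty(scorerModel, evalModel, envModel, detected))
}
```

Here the eval-level value is used because no scorer-level option was given, even though the environment variable is also set.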

Providers

Provider	Constructor	Default Model	Required Env Var
OpenAI	`OpenAI()`	`gpt-4.1-mini`	`OPENAI_API_KEY`
Anthropic	`Anthropic()`	`claude-sonnet-4-5-20250929`	`ANTHROPIC_API_KEY`
Ollama	`Ollama()`	`llama3.2`	(none, local service)

Auto-detection order when LLMTEST_PROVIDER is unset:

OPENAI_API_KEY present → OpenAI
ANTHROPIC_API_KEY present → Anthropic
Ollama reachable → Ollama

Most users don't need to import the provider package directly. Just set the environment variable.

Environment Variables

Variable	Description	Default
`LLMTEST`	Set to `1` to enable eval tests	(unset = skip all)
`LLMTEST_PROVIDER`	Provider: `openai`, `anthropic`, or `ollama`	auto-detect
`LLMTEST_MODEL`	Override default model	provider default
`LLMTEST_CONCURRENCY`	Max parallel LLM calls	`5`
`LLMTEST_NO_CACHE`	Set to `1` to bypass response cache	(unset = cache enabled)
`LLMTEST_OUTPUT`	Path to write JSON summary	(unset = no file)
`LLMTEST_OLLAMA_URL`	Ollama endpoint	`http://localhost:11434`

Patterns

Table-Driven Tests

cases := []struct {
    name   string
    input  string
    output string
}{
    {"greeting", "Say hi", "Hello there!"},
    {"farewell", "Say bye", "Goodbye!"},
}

for _, tc := range cases {
    llmtest.Run(t, tc.name, func(e *llmtest.E) {
        e.Case(llmtest.TestCase{
            Input:        tc.input,
            ActualOutput: tc.output,
        })
        e.Require(llmtest.LengthBetween(1, 200))
    })
}

Call Run directly in the loop. Do not wrap it in another t.Run, or you'll get double-nested test names.

Combining Scorers

llmtest.Run(t, "structured_response", func(e *llmtest.E) {
    e.Case(llmtest.TestCase{
        Input:        "Explain Go interfaces",
        ActualOutput: response,
    })
    e.Require(llmtest.IsJSON())
    e.Require(llmtest.Contains("interface"))
    e.Check(llmtest.LengthBetween(100, 2000))
    e.Check(llmtest.Rubric("Explanation is clear and accurate"))
})

JSON Output

Collect structured results by setting LLMTEST_OUTPUT and calling Flush from TestMain:

func TestMain(m *testing.M) {
    code := m.Run()
    llmtest.Flush()
    os.Exit(code)
}

LLMTEST=1 LLMTEST_OUTPUT=results.json go test ./...

Caching & Concurrency

LLM responses are cached to disk in a .llmtest-cache/ directory (created in the working directory). Cache keys are derived from scorer name, prompt, and model. Entries expire after 24 hours. Add .llmtest-cache/ to your .gitignore.
Set LLMTEST_NO_CACHE=1 to bypass both reads and writes.
Concurrent LLM calls are limited by LLMTEST_CONCURRENCY (default: 5).

Structured Test Attributes (`T.Attr`)

Every scorer call emits structured key-value attributes via testing.T.Attr (Go 1.25+):

Attribute Key	Value
`llmtest.scorer`	Scorer name, e.g. `Contains("hello")`
`llmtest.verdict`	`PASS`, `PARTIAL`, or `FAIL`
`llmtest.score`	Numeric score: `1.00`, `0.50`, `0.00`
`llmtest.reason`	Human-readable explanation
`llmtest.tokens`	LLM tokens consumed (0 for deterministic)
`llmtest.latency_ms`	Scorer wall-clock time in ms

Attributes are visible in go test -v output and machine-readable via go test -json. Eval results are structured data inside go test, not a separate tool.

CI Integration

Run evals in GitHub Actions with go test -json to get machine-readable results:

name: LLM Evals
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.25'
      - name: Run evals
        env:
          LLMTEST: "1"
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: go test -json ./... | tee eval-results.json

The go test -json output contains llmtest.* attributes for each scorer result. Parse them downstream for dashboards, alerts, or trend tracking.

Comparison

Feature	llmtest	promptfoo	deepeval	goeval	maragu.dev/gai/eval
Language	Go	YAML/Node	Python	Go	Go
Runs in `go test`	✅	❌ (separate CLI)	❌ (pytest)	❌	✅
Structured attrs (`T.Attr`)	✅	❌	❌	❌	❌
LLM judge	✅	✅	✅	✅	⚠️ minimal
Deterministic scorers	✅	✅	✅	⚠️ limited	⚠️ minimal
Config format	Go code	YAML	Python	Go code	Go code

Documentation ¶

Overview ¶

Package llmtest provides a testing framework for evaluating LLM outputs using deterministic and LLM-based scorers within Go's testing.T.

Getting Started ¶

The minimal workflow is: call Run inside a test, set the TestCase with E.Case, then assert with E.Require or E.Check:

func TestGreeting(t *testing.T) {
    llmtest.Run(t, "polite", func(e *llmtest.E) {
        e.Case(llmtest.TestCase{
            Input:        "Say hello",
            ActualOutput: callMyLLM("Say hello"),
        })
        e.Require(llmtest.Contains("hello"))
    })
}

Tests are skipped unless LLMTEST=1 is set, so LLM evals never run accidentally during normal development.

Scorers ¶

Scorers come in two flavors:

Deterministic scorers run locally with no network calls:

Contains, ContainsAll, ContainsAny — substring checks
NotContains, NotContainsAny — forbidden-term checks
MatchesRegex, NotMatchesRegex — regex matching
IsJSON, MatchesJSONSchema, MustMatchJSONSchema — JSON validation
LengthBetween — output length bounds
ContainsTag — XML-style tag presence
ExtractTag — composable tag extraction (wraps another scorer)

LLM-based scorers send the test case to a judge model:

Rubric — criterion evaluation with Pass/Partial/Fail verdicts

Require vs Check ¶

E.Require fails the test and stops it immediately (t.FailNow semantics). Use it for hard constraints that make further evaluation meaningless.

E.Check marks the test as failed but does not stop it; remaining scorers still execute. Use it for soft constraints where you want all results.

Scorer Composition ¶

ExtractTag wraps an inner scorer, enabling composition:

// Extract <answer>...</answer> then check its content.
llmtest.ExtractTag("answer", llmtest.Contains("42"))

Configuration Priority ¶

LLM-based scorers resolve model and provider with this priority:

Scorer-level options: Model, Provider
Eval-level options: EvalModel, EvalProvider (via E.Config)
Environment variables: LLMTEST_MODEL, LLMTEST_PROVIDER
Auto-detection (see below)

Environment Variables ¶

LLMTEST=1             Enable LLM eval tests (omit to skip all evals)
LLMTEST_PROVIDER      Select provider: "openai", "anthropic", or "ollama"
LLMTEST_MODEL         Override the default model for the selected provider
LLMTEST_CONCURRENCY   Max parallel LLM calls (default: 5)
LLMTEST_NO_CACHE=1    Bypass the response cache
LLMTEST_OUTPUT        Path to write a JSON summary file (e.g. results.json)
LLMTEST_OLLAMA_URL    Ollama endpoint (default: http://localhost:11434)

Auto-detection order when LLMTEST_PROVIDER is unset:

OPENAI_API_KEY present → OpenAI
ANTHROPIC_API_KEY present → Anthropic
Ollama reachable at URL → Ollama
None found → clear error listing all options
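The detection order can be sketched as a simple cascade. `detectProvider` below is a hypothetical function: `getenv` and `ollamaUp` are injected stand-ins for `os.Getenv` and a reachability probe, and the error text is invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// detectProvider follows the documented order for the case where
// LLMTEST_PROVIDER is unset: OpenAI key, then Anthropic key, then Ollama.
func detectProvider(getenv func(string) string, ollamaUp func() bool) (string, error) {
	switch {
	case getenv("OPENAI_API_KEY") != "":
		return "openai", nil
	case getenv("ANTHROPIC_API_KEY") != "":
		return "anthropic", nil
	case ollamaUp():
		return "ollama", nil
	}
	return "", errors.New("no provider found: set OPENAI_API_KEY, ANTHROPIC_API_KEY, or start Ollama")
}

func main() {
	env := map[string]string{"ANTHROPIC_API_KEY": "sk-example"}
	lookup := func(k string) string { return env[k] }
	name, err := detectProvider(lookup, func() bool { return true })
	fmt.Println(name, err) // the API key outranks the reachable Ollama service
}
```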

JSON Output ¶

Set LLMTEST_OUTPUT to a file path to collect structured results. Call Flush from TestMain after m.Run completes:

func TestMain(m *testing.M) {
    code := m.Run()
    llmtest.Flush()
    os.Exit(code)
}

Caching ¶

LLM-based scorer results are cached to disk in a `.llmtest-cache/` directory (created in the working directory). Cache keys are derived from the scorer name, prompt, and model; entries expire after 24 hours. Set LLMTEST_NO_CACHE=1 to bypass both reads and writes.

Index ¶

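func Flush()
func Run(t *testing.T, name string, fn func(e *E))
type E
- func (e *E) Case(tc TestCase)
- func (e *E) Check(s Scorer)
- func (e *E) Config(opts ...EvalOption)
- func (e *E) Require(s Scorer)
type EvalOption
- func EvalModel(m string) EvalOption
- func EvalProvider(p provider.Provider) EvalOption
type Result
type Scorer
- func Contains(substr string) Scorer
- func ContainsAll(substrs ...string) Scorer
- func ContainsAny(substrs ...string) Scorer
- func ContainsTag(tag string) Scorer
- func ExtractTag(tag string, inner Scorer) Scorer
- func IsJSON() Scorer
- func LengthBetween(min, max int) Scorer
- func MatchesJSONSchema(schema string) Scorer
- func MatchesRegex(pattern string) Scorer
- func MustMatchJSONSchema(schema string) Scorer
- func NotContains(substr string) Scorer
- func NotContainsAny(substrs ...string) Scorer
- func NotMatchesRegex(pattern string) Scorer
- func Rubric(criterion string, opts ...ScorerOption) Scorer
type ScorerOption
- func ConsistencyCheck(n int) ScorerOption
- func Model(m string) ScorerOption
- func Provider(p provider.Provider) ScorerOption
- func Threshold(t float64) ScorerOption
type TestCase
type Verdict
- func (v Verdict) String() string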

Functions ¶

func Flush ¶

func Flush()

Flush writes accumulated eval results to the file specified by LLMTEST_OUTPUT. If the variable is unset, Flush is a no-op.

Call Flush from TestMain after m.Run completes:

func TestMain(m *testing.M) {
    code := m.Run()
    llmtest.Flush()
    os.Exit(code)
}

func Run ¶

func Run(t *testing.T, name string, fn func(e *E))

Run executes an LLM eval within a *testing.T context. Evals are skipped unless LLMTEST=1 is set.

Run internally calls t.Run(name, ...) to create a subtest. In table-driven tests, call Run directly in the loop — do NOT wrap it in another t.Run, or you'll get double-nested output:

for _, tc := range cases {
    llmtest.Run(t, tc.name, func(e *llmtest.E) { ... })  // correct
}

NOT:

for _, tc := range cases {
    t.Run(tc.name, func(t *testing.T) {
        llmtest.Run(t, tc.name, ...)  // double-nested: TestX/name/name
    })
}

Run does NOT call t.Parallel(). Call it yourself before Run if needed.

Types ¶

type E ¶

type E struct {
	// contains filtered or unexported fields
}

E is the eval context passed to every Run callback. Use its methods to set the test case, configure LLM options, and run scorers.

func (*E) Case ¶

func (e *E) Case(tc TestCase)

Case sets the TestCase to evaluate. Call this before any Require or Check.

func (*E) Check ¶

func (e *E) Check(s Scorer)

Check runs the scorer and marks the test as failed if the verdict is Fail, but does not stop the test. Use Check for soft constraints — remaining scorers still execute even on failure.

func (*E) Config ¶

func (e *E) Config(opts ...EvalOption)

Config applies eval-level options (e.g. EvalModel, EvalProvider) that are inherited by all LLM-based scorers in this eval. Scorer-level options take precedence over eval-level options.

func (*E) Require ¶

func (e *E) Require(s Scorer)

Require runs the scorer and calls t.FailNow if the verdict is Fail. Use Require for hard constraints — the test stops immediately on failure.

type EvalOption ¶

type EvalOption func(*evalOptions)

EvalOption configures all LLM-based scorers within a single Run call. Set via E.Config. Scorer-level ScorerOption values take precedence.

func EvalModel ¶

func EvalModel(m string) EvalOption

EvalModel sets the model for all LLM-based scorers in this eval. Scorer-level Model() takes precedence over EvalModel().

func EvalProvider ¶

func EvalProvider(p provider.Provider) EvalOption

EvalProvider sets the provider for all LLM-based scorers in this eval. Scorer-level Provider() takes precedence over EvalProvider().

type Result ¶

type Result struct {
	// Verdict is the categorical outcome: Pass, Partial, or Fail.
	Verdict Verdict

	// Score is a numeric value in the range [0.0, 1.0].
	// Pass=1.0, Partial=0.5, Fail=0.0.
	Score float64

	// Reason is a human-readable explanation of the verdict.
	Reason string

	// Tokens is the total number of LLM tokens consumed (input + output).
	// Zero for deterministic scorers.
	Tokens int

	// LatencyMS is the wall-clock time of the scorer in milliseconds.
	// Zero for deterministic scorers.
	LatencyMS int64
}

Result is the outcome of a Scorer evaluation.

type Scorer ¶

type Scorer interface {
	// Name returns a human-readable identifier for this scorer,
	// typically including its configuration (e.g. `Contains("hello")`).
	Name() string

	// Score evaluates tc and returns a [Result]. The context carries
	// cancellation and deadline from the test. Deterministic scorers
	// may ignore ctx; LLM-based scorers use it for API calls.
	Score(ctx context.Context, tc TestCase) (Result, error)
}

Scorer evaluates a TestCase and returns a Result.

Deterministic scorers (e.g. Contains, IsJSON) run locally with no network calls. LLM-based scorers (e.g. Rubric) send the test case to a language model for evaluation.

Every Scorer must be safe for concurrent use.

func Contains ¶

func Contains(substr string) Scorer

Contains returns a scorer that passes when ActualOutput contains substr.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "Hello, world!"}
	res, _ := llmtest.Contains("world").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func ContainsAll ¶

func ContainsAll(substrs ...string) Scorer

ContainsAll returns a scorer that passes when ActualOutput contains every one of the given substrings.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "The quick brown fox"}
	res, _ := llmtest.ContainsAll("quick", "fox").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func ContainsAny ¶

func ContainsAny(substrs ...string) Scorer

ContainsAny returns a scorer that passes when ActualOutput contains at least one of the given substrings.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "The color is blue"}
	res, _ := llmtest.ContainsAny("red", "blue", "green").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func ContainsTag ¶

func ContainsTag(tag string) Scorer

ContainsTag returns a scorer that passes when ActualOutput contains a matching XML-style tag pair: <tag>...</tag>.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{
		ActualOutput: "Reasoning: <thinking>Let me think...</thinking>",
	}
	res, _ := llmtest.ContainsTag("thinking").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func ExtractTag ¶

func ExtractTag(tag string, inner Scorer) Scorer

ExtractTag returns a composable scorer that extracts the text between <tag> and </tag> in ActualOutput, then delegates to inner. This enables scorer composition — for example:

ExtractTag("answer", Contains("42"))

extracts the content of <answer>...</answer> and checks that it contains "42".

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{
		ActualOutput: "Here is the result: <answer>42</answer>",
	}
	scorer := llmtest.ExtractTag("answer", llmtest.Contains("42"))
	res, _ := scorer.Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func IsJSON ¶

func IsJSON() Scorer

IsJSON returns a scorer that passes when ActualOutput is valid JSON.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: `{"status":"ok","count":3}`}
	res, _ := llmtest.IsJSON().Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func LengthBetween ¶

func LengthBetween(min, max int) Scorer

LengthBetween returns a scorer that passes when the byte length of ActualOutput is between min and max (inclusive).
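Because the bound is on bytes rather than characters, non-ASCII output counts for more than its visible length. The snippet below shows the difference using only the standard library; `byteAndRuneLen` is a helper written for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen returns the byte length (what len reports) and the
// number of Unicode code points in s.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneLen("héllo")
	fmt.Println(b, r) // 6 5: "é" occupies two bytes in UTF-8
}
```

When sizing bounds for multilingual output, leave headroom for multi-byte characters.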

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "short"}
	res, _ := llmtest.LengthBetween(1, 10).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func MatchesJSONSchema ¶

func MatchesJSONSchema(schema string) Scorer

MatchesJSONSchema returns a scorer that passes when ActualOutput is valid JSON conforming to the given JSON Schema. The schema is compiled on each call to Score; to validate the schema once at construction time, use MustMatchJSONSchema.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	schema := `{
		"type": "object",
		"properties": {
			"name": {"type": "string"},
			"age":  {"type": "integer"}
		},
		"required": ["name", "age"]
	}`
	tc := llmtest.TestCase{ActualOutput: `{"name":"Alice","age":30}`}
	res, _ := llmtest.MatchesJSONSchema(schema).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func MatchesRegex ¶

func MatchesRegex(pattern string) Scorer

MatchesRegex returns a scorer that passes when ActualOutput matches the given regular expression pattern. The pattern is compiled on each call to Score; an invalid pattern causes Score to return an error.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "Order #12345 confirmed"}
	res, _ := llmtest.MatchesRegex(`#\d{5}`).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func MustMatchJSONSchema ¶

func MustMatchJSONSchema(schema string) Scorer

MustMatchJSONSchema compiles the JSON schema at construction time and panics if the schema is invalid (similar to regexp.MustCompile). Use this when the schema is a compile-time constant and you want to catch errors early. For schemas that may be invalid at runtime, use MatchesJSONSchema instead, which reports schema errors from Score().

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	schema := `{
		"type": "object",
		"properties": {
			"status": {"type": "string"},
			"count":  {"type": "integer"}
		},
		"required": ["status", "count"]
	}`
	tc := llmtest.TestCase{ActualOutput: `{"status":"ok","count":3}`}
	res, _ := llmtest.MustMatchJSONSchema(schema).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func NotContains ¶

func NotContains(substr string) Scorer

NotContains returns a scorer that passes when ActualOutput does not contain substr.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "The answer is 42"}
	res, _ := llmtest.NotContains("error").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func NotContainsAny ¶

func NotContainsAny(substrs ...string) Scorer

NotContainsAny returns a scorer that passes when ActualOutput contains none of the given substrings.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "The weather is sunny"}
	res, _ := llmtest.NotContainsAny("error", "fail", "crash").Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func NotMatchesRegex ¶

func NotMatchesRegex(pattern string) Scorer

NotMatchesRegex returns a scorer that passes when ActualOutput does not match the given regular expression pattern.

Example ¶

package main

import (
	"context"
	"fmt"

	"github.com/adamwoolhether/llmtest"
)

func main() {
	tc := llmtest.TestCase{ActualOutput: "No numbers here"}
	res, _ := llmtest.NotMatchesRegex(`\d+`).Score(context.Background(), tc)
	fmt.Println(res.Verdict)
}

Output:

PASS

func Rubric ¶

func Rubric(criterion string, opts ...ScorerOption) Scorer

Rubric returns an LLM-based scorer that asks a judge model to evaluate ActualOutput against the given criterion. The judge returns a verdict of Pass, Partial, or Fail along with its reasoning.

Results are cached by (criterion + input + output + model), so identical calls do not repeat the LLM request until the cache entry expires (24 hours, per the Caching section). Set LLMTEST_NO_CACHE=1 to bypass the cache.

Options:

Model / Provider: override the judge model or provider for this scorer
Threshold: set the minimum score to promote a Partial verdict to Pass
ConsistencyCheck: run the judge n times and take the majority verdict

type ScorerOption ¶

type ScorerOption func(*scorerOptions)

ScorerOption configures a single LLM-based scorer (e.g. Rubric). Scorer-level options take precedence over eval-level EvalOption values, which in turn override environment variables.

func ConsistencyCheck ¶

func ConsistencyCheck(n int) ScorerOption

ConsistencyCheck runs the scorer n times and requires majority agreement.
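A majority vote over repeated verdicts could be sketched as below. The `majority` function and the local `Verdict` type are stand-ins for illustration; in particular, falling back to Fail when no verdict has a strict majority is an assumption, since the tie-handling rule is not documented here.

```go
package main

import "fmt"

// Verdict is a local stand-in for the documented llmtest.Verdict type.
type Verdict string

const (
	Pass    Verdict = "PASS"
	Partial Verdict = "PARTIAL"
	Fail    Verdict = "FAIL"
)

// majority returns the verdict held by more than half of the votes.
// Assumption: with no strict majority the sketch falls back to Fail.
func majority(votes []Verdict) Verdict {
	counts := map[Verdict]int{}
	for _, v := range votes {
		counts[v]++
	}
	for _, v := range []Verdict{Pass, Partial, Fail} {
		if counts[v]*2 > len(votes) {
			return v
		}
	}
	return Fail
}

func main() {
	fmt.Println(majority([]Verdict{Pass, Pass, Fail}))    // PASS
	fmt.Println(majority([]Verdict{Pass, Partial, Fail})) // FAIL (no strict majority)
}
```

An odd n avoids two-way ties, although three-way splits remain possible with three verdicts.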

func Model ¶

func Model(m string) ScorerOption

Model sets the model for this specific scorer. Takes precedence over EvalModel().

func Provider ¶

func Provider(p provider.Provider) ScorerOption

Provider sets the provider for this specific scorer. Takes precedence over EvalProvider().

func Threshold ¶

func Threshold(t float64) ScorerOption

Threshold sets the minimum score at which a Rubric result counts as Pass; for example, Threshold(0.5) accepts a Partial verdict. Only meaningful for LLM-based scorers (Rubric); deterministic scorers ignore it. A threshold of 0 is treated as unset (defaults to 1.0).

type TestCase ¶

type TestCase struct {
	// Input is the prompt or question sent to the LLM under test.
	Input string

	// ActualOutput is the LLM's response — the text being evaluated.
	ActualOutput string

	// ExpectedOutput is the ideal or reference answer (optional).
	// Used by scorers that compare actual vs expected output.
	ExpectedOutput string

	// Context provides background information given to the LLM (optional).
	// For example, system prompt content or retrieved documents.
	Context []string

	// RetrievalContext holds the documents retrieved by a RAG pipeline
	// (optional). Useful for evaluating retrieval relevance.
	RetrievalContext []string

	// Metadata carries arbitrary key-value data for custom scorers.
	Metadata map[string]any
}

TestCase holds the data that scorers evaluate.

type Verdict ¶

type Verdict string

Verdict represents a scorer's classification of the output. The three possible values are Pass (score 1.0), Partial (score 0.5), and Fail (score 0.0).

const (
	Pass    Verdict = "PASS"
	Partial Verdict = "PARTIAL"
	Fail    Verdict = "FAIL"
)

Verdict constants for scorer results.

func (Verdict) String ¶

func (v Verdict) String() string

Directories ¶

Path	Synopsis
internal/cache	Package cache provides disk-based caching for LLM scorer responses.
internal/limiter	Package limiter provides concurrency limiting for parallel LLM calls.
internal/prompt	Package prompt provides prompt templates for LLM-based scorers.
internal/retry	Package retry provides retry logic with backoff for LLM API calls.
provider	Package provider defines the Provider interface and built-in LLM backends used by llmtest's LLM-based scorers.
