gaugo

package module
v1.1.1 Latest
Published: May 16, 2026 License: MIT Imports: 13 Imported by: 0

README

Gaugo

Go-native evaluations for AI applications, runnable through go test.

Gaugo lets Go teams evaluate RAG systems, agents, chatbots, and other AI-backed services with deterministic test cases, optional LLM judges, concurrent execution, and structured results that fit naturally into CI. Gaugo includes 24 built-in metrics across RAG, safety, generation quality, structured output, instruction following, domain-specific checks, and deterministic contracts.

go get github.com/nnull13/gaugo

Why Gaugo

Capability               What it gives you
Native Go tests          Write AI evaluations as normal testing tests.
Deterministic reporting  Run cases concurrently while preserving registration order.
No-LLM checks            Catch required behavior with JSON, regex, latency, length, and ExpectedContains assertions.
LLM-judged metrics       Use structured-output judges for RAG, safety, answer quality, citations, summaries, and custom criteria.
Programmatic runs        Use Runner to feed dashboards, CLIs, and internal pipelines.
Provider adapters        Start with OpenAI, Anthropic, Gemini, xAI, or a local model service.
Extension points         Bring your own judge, metric, or reporter.
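
The "Deterministic reporting" row describes concurrent execution that still reports in registration order. One common way to get that behavior is sketched below with the standard library only. This is not Gaugo's implementation, and runOrdered is a hypothetical helper introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// runOrdered applies fn to every input with at most parallelism goroutines
// in flight and returns the results in input order.
// Illustrative sketch only; it is not taken from Gaugo's source.
func runOrdered(inputs []string, parallelism int, fn func(string) string) []string {
	if parallelism < 1 {
		parallelism = 1
	}
	results := make([]string, len(inputs))
	sem := make(chan struct{}, parallelism)
	var wg sync.WaitGroup
	for i, in := range inputs {
		wg.Add(1)
		sem <- struct{}{} // blocks while all worker slots are busy
		go func(i int, in string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = fn(in) // each goroutine writes only its own slot
		}(i, in)
	}
	wg.Wait()
	return results
}

func main() {
	out := runOrdered([]string{"a", "b", "c"}, 2, strings.ToUpper)
	fmt.Println(out) // [A B C]
}
```

Because every goroutine writes to the slot matching its registration index, the output order does not depend on which case finishes first.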

Quickstart (No Provider Needed)

This is the smallest end-to-end evaluation you can run with go test.

package rag_test

import (
	"context"
	"testing"

	"github.com/nnull13/gaugo"
)

func TestEnterprisePricing(t *testing.T) {
	suite := gaugo.New(t)

	suite.Case("answer mentions sales",
		gaugo.Question("What is enterprise pricing?"),
		gaugo.ContextDocs(
			gaugo.Document{
				ID:   "pricing.md",
				Text: "Enterprise plans are custom and sold via sales.",
			},
		),
		gaugo.ExpectedContains("sales"),
	)

	suite.Assert(context.Background(), func(ctx context.Context, in gaugo.Input) (gaugo.Output, error) {
		return gaugo.Output{Answer: "Contact sales for enterprise pricing."}, nil
	})
}

Run it like any other Go test:

go test ./...

Add an LLM judge

package rag_test

import (
	"context"
	"os"
	"testing"
	"time"

	"github.com/nnull13/gaugo"
	"github.com/nnull13/gaugo/provider/openai"
)

func TestRAGQuality(t *testing.T) {
	apiKey := os.Getenv("OPENAI_API_KEY")
	if apiKey == "" {
		t.Skip("OPENAI_API_KEY is not set")
	}

	judge, err := openai.New(openai.Config{
		APIKey: apiKey,
		Model:  "gpt-4.1-mini",
	})
	if err != nil {
		t.Fatal(err)
	}

	suite := gaugo.New(t,
		gaugo.WithJudge(judge),
		gaugo.WithParallelism(8),
		gaugo.WithCaseTimeout(15*time.Second),
	)

	suite.Case("pricing answer",
		gaugo.Question("What is enterprise pricing?"),
		gaugo.ContextDocs(gaugo.Document{
			ID:   "pricing.md",
			Text: "Enterprise pricing is custom and handled by sales.",
		}),
		gaugo.ExpectedContains("sales"),
	)

	suite.Assert(context.Background(), yourGaugoEvaluation,
		gaugo.ContextRelevancy(gaugo.WithThreshold(0.75)),
		gaugo.Faithfulness(gaugo.WithThreshold(0.8)),
		gaugo.AnswerRelevancy(gaugo.WithThreshold(0.7)),
	)
}

Hosted provider adapters validate URLs in strict mode by default: the URL must use https and point at an official provider host. Set AllowUnsafeURL: true only for trusted local stubs or custom gateways.

yourGaugoEvaluation is your adapter, a RunFunc that calls your application:

func yourGaugoEvaluation(ctx context.Context, in gaugo.Input) (gaugo.Output, error) {
	answer, err := myApp.Answer(ctx, in.Question, in.Context)
	if err != nil {
		return gaugo.Output{}, err
	}
	return gaugo.Output{Answer: answer}, nil
}

Go-native positioning

Python-first evaluation frameworks such as Ragas, DeepEval, and TruLens are good options when your eval stack already lives in notebooks, Python services, or dedicated observability platforms. Gaugo's narrower focus is Go-native evaluation: keep cases beside Go application code, run them with go test, and send structured results to the CI and reporting systems your team already uses.

Docs & Next Steps

I want to...                             Go to
Browse all docs from one place           Documentation index
Write my first evaluation                Getting started
Understand the evaluation model          Concepts
Use Gaugo inside go test                 Testing with Suite
Run evaluations from a CLI or pipeline   Programmatic Runner
Configure metrics and thresholds         Metrics reference
Choose and configure an LLM provider     Provider index (OpenAI, Anthropic, Gemini, xAI, Local)
Add a custom judge, metric, or reporter  Extending Gaugo
Debug a failure                          Troubleshooting

License

See LICENSE.

If Gaugo helps your team ship safer AI, consider giving it a star.

Crafted by NoName13.

Documentation

Overview

Package gaugo provides an idiomatic Go testing harness for AI application evaluation.

Gaugo evaluates RAG pipelines and AI systems directly within Go's testing workflow, producing deterministic, concurrent, CI-friendly results without external orchestration.

Quick start

Use Suite inside a standard Go test to register cases and assert metrics:

func TestRAG(t *testing.T) {
    suite := gaugo.New(t, gaugo.WithJudge(judge))
    suite.Case("basic",
        gaugo.Question("What is Go?"),
        gaugo.ContextDocs(gaugo.Doc("d1", "Go is a programming language.")),
    )
    suite.Assert(ctx, myRAG, gaugo.Faithfulness(), gaugo.AnswerRelevancy())
}

Programmatic usage

Use Runner when you need structured results outside the testing framework:

runner, _ := gaugo.NewRunner(gaugo.WithJudge(judge))
runner.Case("example", gaugo.Question("Q?"), gaugo.ExpectedContains("answer"))
result, _ := runner.Run(ctx, myFunc)
fmt.Println(result.Summary())

Built-in metrics

Built-in metrics cover RAG, safety, generation quality, structured output, instruction following, domain-specific checks, and deterministic contracts.

RAG and answer quality:

Safety and generation quality:

Structured output and deterministic checks:

Instruction and domain-specific metrics:

All metrics accept WithThreshold to set a custom pass/fail score in [0,1]. Metric interfaces, shared input/output types, and built-in constructors also live in the github.com/nnull13/gaugo/metric sub-package. The root package re-exports that public surface so callers can use either Faithfulness or metric.Faithfulness interchangeably.

Provider judges

Metrics that require LLM evaluation use a Judge interface. Built-in adapters are provided for OpenAI, Anthropic, Gemini, xAI, and local models (Ollama). See the provider sub-packages for configuration details.

Index

Constants

const (
	ErrorKindUnknown             = failure.KindUnknown
	ErrorKindConfig              = failure.KindConfig
	ErrorKindValidation          = failure.KindValidation
	ErrorKindContextCanceled     = failure.KindContextCanceled
	ErrorKindContextDeadline     = failure.KindContextDeadline
	ErrorKindPanic               = failure.KindPanic
	ErrorKindMetric              = failure.KindMetric
	ErrorKindMetricParse         = failure.KindMetricParse
	ErrorKindProviderRequest     = failure.KindProviderRequest
	ErrorKindProviderAuth        = failure.KindProviderAuth
	ErrorKindProviderRateLimit   = failure.KindProviderRateLimit
	ErrorKindProviderUnavailable = failure.KindProviderUnavailable
	ErrorKindProviderResponse    = failure.KindProviderResponse
	ErrorKindProviderRefusal     = failure.KindProviderRefusal
	ErrorKindProviderTruncated   = failure.KindProviderTruncated
)

Variables

var (
	ErrConfig              = failure.ErrConfig
	ErrValidation          = failure.ErrValidation
	ErrMetric              = failure.ErrMetric
	ErrMetricParse         = failure.ErrMetricParse
	ErrProviderRequest     = failure.ErrProviderRequest
	ErrProviderAuth        = failure.ErrProviderAuth
	ErrProviderRateLimit   = failure.ErrProviderRateLimit
	ErrProviderUnavailable = failure.ErrProviderUnavailable
	ErrProviderResponse    = failure.ErrProviderResponse
	ErrProviderRefusal     = failure.ErrProviderRefusal
	ErrProviderTruncated   = failure.ErrProviderTruncated
	ErrPanic               = failure.ErrPanic
)
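
Sentinel errors like the ones above are usually matched with errors.Is so that wrapped errors still compare equal. The sketch below shows that pattern with local stand-in sentinels. It assumes, without confirmation from this page, that Gaugo's Err* values are meant to be used the same way, and the choice of which kinds count as retryable is also an assumption. The names errRateLimit, errUnavailable, errAuth, wrapRequest, and shouldRetry are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in sentinels for this sketch only. Real code would compare against
// the exported Err* variables instead, assuming they support errors.Is.
var (
	errRateLimit   = errors.New("provider rate limit")
	errUnavailable = errors.New("provider unavailable")
	errAuth        = errors.New("provider auth")
)

// wrapRequest adds call-site context while keeping the sentinel matchable.
func wrapRequest(err error) error {
	return fmt.Errorf("judge request: %w", err)
}

// shouldRetry treats rate limits and outages as transient.
// That classification is an assumption made for this example.
func shouldRetry(err error) bool {
	return errors.Is(err, errRateLimit) || errors.Is(err, errUnavailable)
}

func main() {
	fmt.Println(shouldRetry(wrapRequest(errRateLimit))) // true
	fmt.Println(shouldRetry(wrapRequest(errAuth)))      // false
}
```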

Functions

func Assert

func Assert(t testing.TB, result RunResult)

Assert reports result failures through the Go testing package.

Types

type Case

type Case struct {
	Name     string
	Input    Input
	Expected Expected
}

Case defines one evaluation scenario.

type CaseOption

type CaseOption func(*Case)

CaseOption mutates one Case definition.

func ContextDocs

func ContextDocs(docs ...Document) CaseOption

ContextDocs sets retrieved context documents for a case.

func ExpectedAnswer added in v1.1.0

func ExpectedAnswer(answer string) CaseOption

ExpectedAnswer sets the reference answer for metrics that need ground truth.

func ExpectedContains

func ExpectedContains(substr string) CaseOption

ExpectedContains requires the output answer to contain the given substring.

func ExpectedInstructions added in v1.1.0

func ExpectedInstructions(instructions string) CaseOption

ExpectedInstructions sets the reference instructions for instruction-following metrics.

func Question

func Question(question string) CaseOption

Question sets the user question for a case.

type CaseResult

type CaseResult struct {
	Name     string
	Metrics  []MetricResult
	RunError error
	Elapsed  time.Duration
}

CaseResult contains execution and metric results for a single case.

func (CaseResult) Failed added in v1.1.0

func (c CaseResult) Failed() bool

Failed reports whether this case has a run error or any failing metric.

func (CaseResult) FailedMetrics added in v1.1.0

func (c CaseResult) FailedMetrics() []MetricResult

FailedMetrics returns only the metrics that did not pass.

func (CaseResult) MetricsByName added in v1.1.0

func (c CaseResult) MetricsByName(name string) []MetricResult

MetricsByName returns metrics matching the given name.

type Document

type Document = metric.Document

func Doc added in v1.1.0

func Doc(id, text string) Document

Doc is a convenience constructor for Document.

type Error added in v1.1.1

type Error = failure.Error

Error is Gaugo's typed operational error. It keeps a human message, structured integration metadata, and an optional wrapped cause.

type ErrorCode added in v1.1.1

type ErrorCode = failure.Code

ErrorCode is a stable machine-readable reason for a Gaugo failure.

type ErrorInfo added in v1.1.0

type ErrorInfo = failure.Info

ErrorInfo is a redacted, structured description of an operational failure.

func ClassifyError added in v1.1.0

func ClassifyError(err error) ErrorInfo

ClassifyError returns redacted operational metadata for err.

func MetricErrorInfo added in v1.1.0

func MetricErrorInfo(m MetricResult) (ErrorInfo, bool)

MetricErrorInfo extracts ErrorInfo stored in MetricResult details.

type ErrorKind added in v1.1.0

type ErrorKind = failure.Kind

ErrorKind classifies operational failures separately from low quality scores.

type EvalInput

type EvalInput = metric.EvalInput

type Expected

type Expected = metric.Expected

type Input

type Input = metric.Input

type Judge

type Judge = metric.Judge

type JudgeRequest

type JudgeRequest = metric.JudgeRequest

type JudgeResponse

type JudgeResponse = metric.JudgeResponse

type Metric

type Metric = metric.Metric

func AnswerCorrectness added in v1.1.0

func AnswerCorrectness(opts ...MetricOption) Metric

AnswerCorrectness scores how well the answer matches Expected.Answer.

func AnswerLength added in v1.1.0

func AnswerLength(opts ...MetricOption) Metric

AnswerLength scores whether the answer length lies in [min,max] runes.
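
Measuring length in runes rather than bytes matters for non-ASCII answers. The following sketch shows the difference using only the standard library. lengthInRange is a hypothetical helper, not Gaugo's code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengthInRange reports whether s has between lo and hi runes, inclusive.
// Illustrative sketch only; it is not taken from Gaugo's source.
func lengthInRange(s string, lo, hi int) bool {
	n := utf8.RuneCountInString(s)
	return n >= lo && n <= hi
}

func main() {
	s := "h\u00e9llo" // "héllo": 5 runes, but 6 bytes in UTF-8
	fmt.Println(len(s), utf8.RuneCountInString(s), lengthInRange(s, 1, 5))
	// prints: 6 5 true
}
```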

func AnswerRelevancy

func AnswerRelevancy(opts ...MetricOption) Metric

AnswerRelevancy scores how well the answer addresses the question.

func AnswerSimilarity added in v1.1.0

func AnswerSimilarity(opts ...MetricOption) Metric

AnswerSimilarity scores Jaccard token overlap against Expected.Answer.
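
Jaccard overlap is the size of the intersection of two token sets divided by the size of their union. The sketch below computes it for two strings. The tokenization (lowercasing and splitting on whitespace) and the score of 1 for two empty inputs are assumptions, since this page does not say how Gaugo tokenizes. tokenSet and jaccard are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet lowercases s and splits it on whitespace.
// This tokenization is an assumption made for the example.
func tokenSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, tok := range strings.Fields(strings.ToLower(s)) {
		set[tok] = struct{}{}
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B| over the token sets of a and b.
// Two empty inputs score 1 here, which is an assumed convention.
func jaccard(a, b string) float64 {
	sa, sb := tokenSet(a), tokenSet(b)
	if len(sa) == 0 && len(sb) == 0 {
		return 1
	}
	inter := 0
	for tok := range sa {
		if _, ok := sb[tok]; ok {
			inter++
		}
	}
	union := len(sa) + len(sb) - inter
	return float64(inter) / float64(union)
}

func main() {
	// Shared tokens: contact, sales (2). Union: 5 distinct tokens.
	fmt.Println(jaccard("contact sales for pricing", "Contact sales today")) // 0.4
}
```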

func Bias added in v1.1.0

func Bias(opts ...MetricOption) Metric

Bias scores how unbiased the answer is.

func CitationAccuracy added in v1.1.0

func CitationAccuracy(opts ...MetricOption) Metric

CitationAccuracy scores how accurate inline citations are against context.

func Coherence added in v1.1.0

func Coherence(opts ...MetricOption) Metric

Coherence scores how internally consistent the answer is.

func Completeness added in v1.1.0

func Completeness(opts ...MetricOption) Metric

Completeness scores how completely the answer addresses the question.

func Conciseness added in v1.1.0

func Conciseness(opts ...MetricOption) Metric

Conciseness scores how succinct the answer is without losing meaning.

func ContextPrecision added in v1.1.0

func ContextPrecision(opts ...MetricOption) Metric

ContextPrecision scores the fraction of context documents that are useful.

func ContextRecall added in v1.1.0

func ContextRecall(opts ...MetricOption) Metric

ContextRecall scores how much of the expected answer is supported by the context.

func ContextRelevancy added in v1.1.0

func ContextRelevancy(opts ...MetricOption) Metric

ContextRelevancy scores how relevant each context document is to the question.

func ExpectedJSON added in v1.1.0

func ExpectedJSON(opts ...MetricOption) Metric

ExpectedJSON scores how many expected JSON fields match the answer.

func ExpectedRegex added in v1.1.0

func ExpectedRegex(pattern string, opts ...MetricOption) Metric

ExpectedRegex scores whether the answer matches a Go regular expression.

func Faithfulness

func Faithfulness(opts ...MetricOption) Metric

Faithfulness scores whether the answer is supported by the provided context.

func GEval added in v1.1.0

func GEval(criteria string, opts ...MetricOption) Metric

GEval scores along a free-form criteria string. Criteria must be non-empty.

func Hallucination added in v1.1.0

func Hallucination(opts ...MetricOption) Metric

Hallucination scores the fraction of claims that are not hallucinated.

func InstructionAdherence added in v1.1.0

func InstructionAdherence(opts ...MetricOption) Metric

InstructionAdherence scores how closely the answer followed Expected.Instructions.

func JSONValidity added in v1.1.0

func JSONValidity(opts ...MetricOption) Metric

JSONValidity scores whether the answer parses as valid JSON.

func Latency added in v1.1.0

func Latency(opts ...MetricOption) Metric

Latency scores whether the run elapsed within the configured maximum.

func SchemaCompliance added in v1.1.0

func SchemaCompliance(opts ...MetricOption) Metric

SchemaCompliance scores whether the answer matches the configured JSON schema.

func SummarizationQuality added in v1.1.0

func SummarizationQuality(opts ...MetricOption) Metric

SummarizationQuality scores summary coverage, fidelity and conciseness.

func Toxicity added in v1.1.0

func Toxicity(opts ...MetricOption) Metric

Toxicity scores how non-toxic the answer is.

type MetricOption

type MetricOption = metric.Option

func WithExpectedFields added in v1.1.0

func WithExpectedFields(fields map[string]any) MetricOption

WithExpectedFields sets the dotted JSON paths and expected values used by ExpectedJSON.
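
A dotted path such as plan.tier names a value nested inside JSON objects. The sketch below resolves such paths in a decoded document and reports the fraction of expected fields that match. It is a standard-library illustration, not Gaugo's implementation; lookup and matchFraction are hypothetical names, and the comparison rule (reflect.DeepEqual on decoded values) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

// lookup walks a dotted path such as "plan.tier" through nested JSON objects.
func lookup(doc map[string]any, path string) (any, bool) {
	var cur any = doc
	for _, key := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = obj[key]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

// matchFraction returns matched fields divided by expected fields.
// Note that encoding/json decodes every JSON number as float64, so an
// expected number must be given as float64 to compare equal in this sketch.
func matchFraction(answer string, expected map[string]any) (float64, error) {
	var doc map[string]any
	if err := json.Unmarshal([]byte(answer), &doc); err != nil {
		return 0, err
	}
	if len(expected) == 0 {
		return 1, nil // assumed convention: nothing expected, nothing missed
	}
	matched := 0
	for path, want := range expected {
		if got, ok := lookup(doc, path); ok && reflect.DeepEqual(got, want) {
			matched++
		}
	}
	return float64(matched) / float64(len(expected)), nil
}

func main() {
	answer := `{"plan":{"tier":"enterprise"},"active":true}`
	score, err := matchFraction(answer, map[string]any{
		"plan.tier": "enterprise",
		"active":    false, // deliberately wrong, so one of two fields matches
	})
	fmt.Println(score, err) // 0.5 <nil>
}
```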

func WithMaxLatency added in v1.1.0

func WithMaxLatency(d time.Duration) MetricOption

WithMaxLatency sets the maximum allowed run latency for Latency.

func WithMaxLength added in v1.1.0

func WithMaxLength(n int) MetricOption

WithMaxLength sets the maximum allowed answer length in runes.

func WithMinLength added in v1.1.0

func WithMinLength(n int) MetricOption

WithMinLength sets the minimum allowed answer length in runes.

func WithSchema added in v1.1.0

func WithSchema(schema json.RawMessage) MetricOption

WithSchema sets the JSON Schema used by SchemaCompliance.

func WithThreshold

func WithThreshold(v float64) MetricOption

WithThreshold sets the pass/fail threshold in [0,1]. Default is 0.7.
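
MetricOption follows the functional options pattern: a constructor starts from defaults and applies each option in turn. The sketch below mirrors that shape with the documented default of 0.7. The lowercase names are hypothetical stand-ins, and treating a score equal to the threshold as a pass is an assumption.

```go
package main

import "fmt"

// settings holds the configurable parts of a metric in this sketch.
type settings struct {
	threshold float64
}

// option mutates settings, in the style of a functional option.
type option func(*settings)

// withThreshold overrides the default pass/fail threshold.
func withThreshold(v float64) option {
	return func(s *settings) { s.threshold = v }
}

// newSettings starts from the documented default of 0.7 and applies opts in order.
func newSettings(opts ...option) settings {
	s := settings{threshold: 0.7}
	for _, opt := range opts {
		opt(&s)
	}
	return s
}

// passes reports whether score meets the threshold.
// Using >= rather than > is an assumption made for this example.
func passes(score float64, opts ...option) bool {
	return score >= newSettings(opts...).threshold
}

func main() {
	fmt.Println(passes(0.75))                     // true: default threshold 0.7
	fmt.Println(passes(0.75, withThreshold(0.8))) // false: stricter threshold
}
```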

type MetricResult

type MetricResult = metric.Result

type Option

type Option func(*config) error

Option configures a Runner or Suite.

func WithCaseTimeout

func WithCaseTimeout(d time.Duration) Option

WithCaseTimeout applies a per-case timeout to run and metric evaluation.
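
A per-case timeout is typically built on context.WithTimeout, with the system under test expected to stop when the context is done. The sketch below shows that mechanism in isolation. It is not Gaugo's code, and runWithTimeout, simulateWork, and timedOut are hypothetical helpers.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithTimeout gives run a context that expires after timeout.
func runWithTimeout(parent context.Context, timeout time.Duration, run func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return run(ctx)
}

// simulateWork returns a function that finishes after d unless ctx ends first.
func simulateWork(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		timer := time.NewTimer(d)
		defer timer.Stop()
		select {
		case <-timer.C:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// timedOut reports whether work of the given length hit the deadline.
func timedOut(timeout, work time.Duration) bool {
	err := runWithTimeout(context.Background(), timeout, simulateWork(work))
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(timedOut(20*time.Millisecond, time.Second)) // true
	fmt.Println(timedOut(time.Second, time.Millisecond))    // false
}
```

A RunFunc that ignores its context cannot be interrupted this way, so passing ctx through to HTTP clients and model calls is what makes a timeout effective.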

func WithJudge

func WithJudge(j Judge) Option

WithJudge configures the LLM judge used by metrics.

func WithMetricDetailsLimit

func WithMetricDetailsLimit(bytes int) Option

WithMetricDetailsLimit caps stored metric detail bytes per metric result. Use 0 to disable details entirely.

func WithParallelism

func WithParallelism(n int) Option

WithParallelism configures the maximum number of concurrent case executions.

func WithReporter

func WithReporter(r Reporter) Option

WithReporter overrides the default testing reporter.

type Output

type Output = metric.Output

type Reporter

type Reporter interface {
	Report(ctx context.Context, result RunResult)
}

Reporter receives completed suite results without depending on testing.T.

type RetryConfig

type RetryConfig struct {
	// MaxAttempts is the total number of attempts, including the first request.
	// Zero uses the package default.
	MaxAttempts int
	// BaseDelay is the fallback delay before the second attempt.
	// Zero uses the package default.
	BaseDelay time.Duration
	// MaxDelay caps exponential backoff and Retry-After delays.
	// Zero uses the package default.
	MaxDelay time.Duration
}

RetryConfig controls retries for transient provider HTTP failures.
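
The three fields interact as capped exponential backoff. The sketch below computes the wait before each attempt under two assumptions this page does not confirm: the delay doubles on every retry, and no jitter is applied. backoff is a hypothetical helper, and the 500ms and 2s values are example inputs rather than the package defaults.

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns the wait before the given attempt, where attempt 1 is the
// first request and needs no wait. The delay starts at base for attempt 2,
// doubles for each later attempt (an assumption), and never exceeds max.
func backoff(attempt int, base, max time.Duration) time.Duration {
	if attempt < 2 {
		return 0
	}
	d := base
	for i := 2; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max // stop early, which also avoids overflow
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Println(attempt, backoff(attempt, 500*time.Millisecond, 2*time.Second))
	}
	// prints:
	// 1 0s
	// 2 500ms
	// 3 1s
	// 4 2s
	// 5 2s
}
```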

func DefaultRetryConfig

func DefaultRetryConfig() RetryConfig

DefaultRetryConfig returns the retry defaults used by bundled providers.

func (RetryConfig) Validate

func (cfg RetryConfig) Validate() error

Validate reports whether cfg is internally consistent.

type RunFunc

type RunFunc func(ctx context.Context, in Input) (Output, error)

RunFunc executes the system under test for one case.

type RunResult

type RunResult struct {
	Cases []CaseResult
}

RunResult is the deterministic output of a suite execution.

func (RunResult) Failed added in v1.1.0

func (r RunResult) Failed() bool

Failed reports whether any case has a run error or a failing metric.

func (RunResult) PassRate added in v1.1.0

func (r RunResult) PassRate() float64

PassRate returns passed checks divided by explicit checks plus run errors. Run errors count as failed executions in the denominator.
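
As a worked example of that formula, the sketch below counts each run error as one failed entry in the denominator. Two details are assumptions, not statements from this page: a case with a run error contributes no metric checks, and an empty result yields 0. caseOutcome and passRate are hypothetical names.

```go
package main

import "fmt"

// caseOutcome is a simplified stand-in for one case's result.
type caseOutcome struct {
	runFailed bool   // the system under test returned an error
	checks    []bool // one entry per explicit check; true means passed
}

// passRate returns passed checks / (explicit checks + run errors).
func passRate(cases []caseOutcome) float64 {
	passed, total := 0, 0
	for _, c := range cases {
		if c.runFailed {
			total++ // a run error counts as one failed execution
			continue
		}
		for _, ok := range c.checks {
			total++
			if ok {
				passed++
			}
		}
	}
	if total == 0 {
		return 0 // assumed convention for an empty run
	}
	return float64(passed) / float64(total)
}

func main() {
	cases := []caseOutcome{
		{checks: []bool{true, true}},
		{checks: []bool{true, false}},
		{runFailed: true},
	}
	// 3 passed checks / (4 explicit checks + 1 run error)
	fmt.Println(passRate(cases)) // 0.6
}
```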

func (RunResult) Summary added in v1.1.0

func (r RunResult) Summary() string

Summary returns a human-readable one-line summary suitable for logs and CI output.

type Runner

type Runner struct {
	// contains filtered or unexported fields
}

Runner executes registered cases and returns structured results without depending on the testing package.

func NewRunner

func NewRunner(opts ...Option) (*Runner, error)

NewRunner creates a programmatic evaluation runner.

func (*Runner) Case

func (r *Runner) Case(name string, opts ...CaseOption) error

Case registers one evaluation case.

func (*Runner) Run

func (r *Runner) Run(ctx context.Context, run RunFunc, metrics ...Metric) (RunResult, error)

Run executes all registered cases.

type Suite

type Suite struct {
	// contains filtered or unexported fields
}

Suite is a testing wrapper around Runner.

func New

func New(t testing.TB, opts ...Option) *Suite

New creates a testing Suite. Configuration errors fail the test immediately.

func (*Suite) Assert

func (s *Suite) Assert(ctx context.Context, run RunFunc, metrics ...Metric)

Assert executes all registered cases and reports failures through testing.

func (*Suite) Case

func (s *Suite) Case(name string, opts ...CaseOption)

Case registers one evaluation case and fails the test if it is invalid.

Directories

Path Synopsis
internal
metrics
Package metrics provides validation helpers shared by every built-in judge-metric sub-package under internal/metrics/*.
strictjson
Package strictjson decodes JSON into Go values while rejecting unknown fields and trailing tokens.
metric
Package metric defines the Metric interface used by gaugo to evaluate one case at a time, plus the built-in deterministic and LLM-judge metrics.
provider
xai
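
The strictjson synopsis describes decoding that rejects unknown fields and trailing tokens. The same behavior can be approximated with encoding/json, as in the sketch below. This is an illustration with the standard library, not the internal package's code; decodeStrict and verdict are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// verdict is an example target type for a judge response.
type verdict struct {
	Score float64 `json:"score"`
}

// decodeStrict decodes exactly one JSON value into v. It fails on fields
// that v does not declare and on any data after the first value.
func decodeStrict(data []byte, v any) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(v); err != nil {
		return err
	}
	// A second Decode must reach EOF; anything else is trailing data.
	if err := dec.Decode(&struct{}{}); !errors.Is(err, io.EOF) {
		return errors.New("unexpected trailing data after JSON value")
	}
	return nil
}

func main() {
	var v verdict
	fmt.Println(decodeStrict([]byte(`{"score":0.9}`), &v) == nil)           // true
	fmt.Println(decodeStrict([]byte(`{"score":0.9,"note":"x"}`), &v) == nil) // false
	fmt.Println(decodeStrict([]byte(`{"score":0.9} {}`), &v) == nil)        // false
}
```

Strict decoding of this kind is useful for model output, where an extra field or a second JSON object often signals that the model drifted from the requested schema.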
