goeval

v0.1.0 · Published: Oct 22, 2025 · License: MIT

A Go library for fast, automated evaluation of Large Language Model (LLM) outputs, inspired by Braintrust's autoevals.

Features

  • Simple, consistent scoring API returning scores in [0.0, 1.0]
  • LLM-as-a-judge evaluators: factuality, tonality, and moderation
  • Heuristic and embedding-based evaluators for speed and semantics
  • Structured outputs from LLM judges for debuggability (choices, confidences, evidence)
  • Support for Google Vertex AI (Gemini) via a pluggable generator/provider (more providers planned)

How Scoring Works

All scorers follow a simple design pattern:

  • input: The original question or prompt, used as context by some scorers (optional)
  • output: The actual response from your model
  • expected: The expected/reference response (optional for some scorers)

The scorer compares output against expected and returns a score between 0.0 and 1.0, where 1.0 is the best possible score.

Getting Started

package main

import (
    "context"
    "fmt"

    "github.com/datar-psa/goeval"
    "google.golang.org/genai"
)

func main() {
    ctx := context.Background()

    // Create Gemini client
    genaiClient, _ := genai.NewClient(ctx, &genai.ClientConfig{ /* Your config */ })

    // Create Judge wrapper with functional options
    judge := goeval.NewGeminiLLMJudge(
        goeval.WithGenaiClient(genaiClient),
        goeval.WithModelName("publishers/google/models/gemini-2.5-flash"),
    )

    // Create the Factuality scorer without passing LLM each time
    scorer := judge.Factuality(goeval.FactualityOptions{})

    // Score the model output against an expected answer, with the original question as input
    result := scorer.Score(ctx, goeval.ScoreInputs{
        Input:    "What is the capital of France?",
        Output:   "Paris",
        Expected: "Paris",
    })

    if result.Error != nil {
        panic(result.Error)
    }
    fmt.Printf("Score: %.2f, choice=%v\n", result.Score, result.Metadata["choice"])
}

Scorers

LLM-as-a-Judge Evaluations

Sophisticated evaluations using language models as judges.

Package: github.com/datar-psa/goeval/llmjudge

Scorer       Description
Factuality   LLM judge comparing Output vs Expected for factual consistency
Tonality     LLM judge for professionalism, kindness, clarity, helpfulness (A–E anchors)
Moderation   Content safety via a moderation provider; 1.0 = safe, 0.0 = unsafe

Heuristic Evaluations

Fast, rule-based scorers that don't require LLMs.

Package: github.com/datar-psa/goeval/heuristic

Scorer      Description
ExactMatch  Simple equality (configurable case/whitespace handling)

Embedding Evaluations

Semantic similarity using vector embeddings.

Package: github.com/datar-psa/goeval/embedding

Scorer               Description
EmbeddingSimilarity  Cosine similarity over embeddings (semantic closeness)

Use Cases

1) FAQ Answer Accuracy (Factuality)

Evaluate if an assistant's answer matches a knowledge base answer.

judge := goeval.NewGeminiLLMJudge(
    goeval.WithGenaiClient(genaiClient),
    goeval.WithModelName("publishers/google/models/gemini-2.5-flash"),
)
res := judge.Factuality(goeval.FactualityOptions{}).Score(ctx, goeval.ScoreInputs{
    Input:    "What are store hours on Sundays?",
    Output:   "We're open 10am–6pm on Sundays.",
    Expected: "We are open from 10:00 to 18:00 on Sundays.",
})
// res.Score in [0..1]; metadata includes {choice, explanation, raw_response}

2) Support Reply Tone (Tonality)

Enforce minimum tone quality across all dimensions with a threshold gate.

judge := goeval.NewGeminiLLMJudge(
    goeval.WithGenaiClient(genaiClient),
    goeval.WithModelName("publishers/google/models/gemini-2.5-flash"),
)
tonality := judge.Tonality(goeval.TonalityOptions{
    ProfessionalismWeight: 0.25,
    KindnessWeight:        0.25,
    ClarityWeight:         0.25,
    HelpfulnessWeight:     0.25,
    Threshold:             0.4, // if any used dimension < 0.4, overall score becomes 0
})

res := tonality.Score(ctx, goeval.ScoreInputs{
    Input:  "Customer complaint about delayed shipment",
    Output: "I'm sorry for the delay — here's what we're doing next...",
})
// res.Metadata contains per-dimension choices/scores and applied weights
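The weighting-plus-threshold logic can be sketched as follows (function and variable names here are illustrative, not the library's internals): dimensions are combined by weight, but if any used dimension falls below the threshold, the overall score collapses to 0.

```go
package main

import "fmt"

// tonalityScore combines per-dimension scores by weight, with a
// threshold gate: any used dimension below the threshold zeroes the
// overall score.
func tonalityScore(dims, weights map[string]float64, threshold float64) float64 {
	total, weightSum := 0.0, 0.0
	for name, w := range weights {
		if w == 0 {
			continue // unused dimension: ignored by the gate too
		}
		score := dims[name]
		if score < threshold {
			return 0.0 // threshold gate
		}
		total += w * score
		weightSum += w
	}
	if weightSum == 0 {
		return 0.0
	}
	return total / weightSum
}

func main() {
	weights := map[string]float64{"professionalism": 0.25, "kindness": 0.25, "clarity": 0.25, "helpfulness": 0.25}
	dims := map[string]float64{"professionalism": 0.9, "kindness": 0.8, "clarity": 0.7, "helpfulness": 0.85}
	fmt.Printf("%.3f\n", tonalityScore(dims, weights, 0.4)) // weighted mean: all dimensions clear the gate

	dims["clarity"] = 0.3
	fmt.Printf("%.3f\n", tonalityScore(dims, weights, 0.4)) // gated to 0: clarity below threshold
}
```

The gate makes the scorer a hard filter: a reply that is professional but incomprehensible fails outright instead of being averaged up.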

3) Chat Moderation (Moderation)

Block unsafe replies and steer away from sensitive topics (e.g., religion/politics).

// Create Google Cloud Language client
langClient, _ := language.NewRESTClient(ctx)

judge := goeval.NewGeminiLLMJudge(
    goeval.WithLanguageClient(langClient),
)
moderation := judge.Moderation(goeval.ModerationOptions{
    Threshold:  0.5,
    Categories: []string{"Toxic", "Derogatory", "Violent", "Insult", "ReligionBelief", "Politics"},
})

res := moderation.Score(ctx, goeval.ScoreInputs{Output: "Let's discuss your religion and political views..."})
// res.Score = 0.0 if unsafe; metadata includes flagged categories and is_safe=false
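The gating behavior can be sketched like this (names are illustrative, not the library's internals): the moderation provider returns a confidence per category, and the output scores 0.0 as soon as any selected category meets the threshold.

```go
package main

import "fmt"

// moderationScore flags any selected category whose confidence meets the
// threshold; one or more flags makes the output unsafe (score 0.0).
func moderationScore(confidences map[string]float64, categories []string, threshold float64) (float64, []string) {
	var flagged []string
	for _, cat := range categories {
		if confidences[cat] >= threshold {
			flagged = append(flagged, cat)
		}
	}
	if len(flagged) > 0 {
		return 0.0, flagged
	}
	return 1.0, nil
}

func main() {
	// Toy confidences standing in for a moderation provider's response.
	confidences := map[string]float64{"Toxic": 0.1, "Politics": 0.8}
	score, flagged := moderationScore(confidences, []string{"Toxic", "Politics"}, 0.5)
	fmt.Println(score, flagged)
}
```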

4) Intent Similarity (Embeddings)

Group similar user requests or route to the right workflow.

// Create embedding scorer
embedding := goeval.NewGeminiEmbedding(
    goeval.WithGenaiClient(genaiClient),
    goeval.WithModelName("text-embedding-005"),
)
sim := embedding.Similarity(goeval.EmbeddingSimilarityOptions{})

res := sim.Score(ctx, goeval.ScoreInputs{
    Output:   "Reset my password",
    Expected: "I can't log in to my account",
})
// Higher scores indicate closer semantic intent

5) Exact Match Validation (Heuristic)

Fast validation for exact matches with configurable options.

// Create heuristic scorer
heuristic := goeval.NewHeuristic()
exactMatch := heuristic.ExactMatch(goeval.ExactMatchOptions{
    CaseSensitive: false,
    TrimWhitespace: true,
})

res := exactMatch.Score(ctx, goeval.ScoreInputs{
    Output:   "Paris",
    Expected: "paris",
})
// res.Score = 1.0 for exact match (case-insensitive)
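The normalization implied by the options above can be sketched as follows (the library's implementation may differ in detail):

```go
package main

import (
	"fmt"
	"strings"
)

// exactMatch normalizes both strings per the options, then compares:
// trim surrounding whitespace first, then fold case if insensitive.
func exactMatch(output, expected string, caseSensitive, trimWhitespace bool) float64 {
	if trimWhitespace {
		output = strings.TrimSpace(output)
		expected = strings.TrimSpace(expected)
	}
	if !caseSensitive {
		output = strings.ToLower(output)
		expected = strings.ToLower(expected)
	}
	if output == expected {
		return 1.0
	}
	return 0.0
}

func main() {
	fmt.Println(exactMatch("  Paris ", "paris", false, true)) // 1 after normalization
	fmt.Println(exactMatch("Paris", "paris", true, false))    // 0 when case-sensitive
}
```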

Design Philosophy

The library is designed with flexibility and composability in mind:

  • Client-First Approach: We accept pre-configured clients (like *genai.Client, *language.Client) rather than raw credentials or project IDs. This gives you complete control over authentication, retry policies, and other client configurations.

  • Functional Options Pattern: All constructors use functional options for clean, extensible APIs that grow gracefully over time.

  • Pluggable Providers: The scoring interfaces are designed to be implemented by any provider, making it easy to add support for new LLM providers or evaluation services.
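A minimal sketch of the functional-options pattern the constructors use (WithGenaiClient, WithModelName, and friends); the option struct and defaults here are illustrative, not goeval's actual GeminiOptions:

```go
package main

import "fmt"

// options is a hypothetical config struct; callers never touch it directly.
type options struct {
	modelName string
	timeoutMS int
}

type Option func(*options)

func WithModelName(name string) Option {
	return func(o *options) { o.modelName = name }
}

func WithTimeoutMS(ms int) Option {
	return func(o *options) { o.timeoutMS = ms }
}

// NewJudge applies defaults first, then each option in order, so new
// options can be added later without breaking existing callers.
func NewJudge(opts ...Option) options {
	o := options{modelName: "default-model", timeoutMS: 30000}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	j := NewJudge(WithModelName("publishers/google/models/gemini-2.5-flash"))
	fmt.Println(j.modelName, j.timeoutMS)
}
```

Because each option is just a function over the config struct, adding a new knob is a new `With...` function rather than a breaking change to a constructor signature.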

Development

Running Tests

go test -short              # Unit tests only
go test                     # All tests
UPDATE_TESTS=true go test   # Update integration test cache (LLM requests)

Request Caching

Currently we're using hypert to cache LLM requests. The library's integration tests already demonstrate this pattern.

Roadmap
  • More Scorers: Additional evaluation methods
  • Request Caching: Built-in caching layer for LLM requests (currently one option is hypert)
  • OpenAI Provider: Native support for OpenAI's GPT models alongside Google Gemini

Documentation

Index

Constants

This section is empty.

Variables

View Source
var (
	ErrNoExpectedValue     = api.ErrNoExpectedValue
	ErrLLMGenerationFailed = api.ErrLLMGenerationFailed
)
View Source
var ModerationCategories = api.ModerationCategories

Functions

func WithEmbedder

func WithEmbedder(embedder api.Embedder) func(*EmbeddingOptions)

WithEmbedder sets the embedder for the embedding scorer

func WithGenaiClient

func WithGenaiClient(client *genai.Client) func(*GeminiOptions)

WithGenaiClient sets the Gemini client for the judge

func WithLLMGenerator

func WithLLMGenerator(llm api.LLMGenerator) func(*LLMJudgeOptions)

WithLLMGenerator sets the LLM generator for the judge

func WithLanguageClient

func WithLanguageClient(langClient *language.Client) func(*GeminiOptions)

WithLanguageClient sets the Google Cloud Language client for moderation

func WithModelName

func WithModelName(modelName string) func(*GeminiOptions)

WithModelName sets the model name for the judge

func WithModerationProvider

func WithModerationProvider(provider api.ModerationProvider) func(*LLMJudgeOptions)

WithModerationProvider sets the moderation provider for the judge

Types

type Embedder

type Embedder = api.Embedder

type Embedding

type Embedding struct {
	// contains filtered or unexported fields
}

Embedding wraps an embedder and exposes convenient constructors for embedding-based scorers.

func NewEmbedding

func NewEmbedding(opts ...func(*EmbeddingOptions)) *Embedding

NewEmbedding creates a new Embedding wrapper using functional options.

func NewGeminiEmbedding

func NewGeminiEmbedding(opts ...func(*GeminiOptions)) *Embedding

NewGeminiEmbedding creates an Embedding using Gemini client and model name. Example model: "text-embedding-005".

func (*Embedding) Similarity

func (e *Embedding) Similarity(opts EmbeddingSimilarityOptions) api.Scorer

Similarity returns a scorer that measures semantic similarity using embeddings.

type EmbeddingOptions

type EmbeddingOptions struct {
	// contains filtered or unexported fields
}

EmbeddingOptions configures Embedding creation

type EmbeddingSimilarityOptions

type EmbeddingSimilarityOptions = embedding.EmbeddingSimilarityOptions

type ExactMatchOptions

type ExactMatchOptions = heuristic.ExactMatchOptions

type FactualityOptions

type FactualityOptions = llmjudge.FactualityOptions

type GeminiOptions

type GeminiOptions struct {
	// contains filtered or unexported fields
}

GeminiOptions configures Gemini LLMJudge creation

type Heuristic

type Heuristic struct{}

Heuristic exposes convenient constructors for heuristic scorers.

func NewHeuristic

func NewHeuristic() *Heuristic

NewHeuristic creates a new Heuristic.

func (*Heuristic) ExactMatch

func (h *Heuristic) ExactMatch(opts ExactMatchOptions) api.Scorer

ExactMatch returns a scorer that checks if the output exactly matches the expected value.

type LLMGenerator

type LLMGenerator = api.LLMGenerator

type LLMJudge

type LLMJudge struct {
	// contains filtered or unexported fields
}

LLMJudge wraps an LLM generator and exposes convenient constructors for LLM-as-a-judge scorers. It allows creating scorers like Factuality and Tonality without passing the LLM each time.

func NewGeminiLLMJudge

func NewGeminiLLMJudge(opts ...func(*GeminiOptions)) *LLMJudge

NewGeminiLLMJudge creates a Judge using Gemini client and model name. Example model: "publishers/google/models/gemini-2.5-flash".

func NewLLMJudge

func NewLLMJudge(opts ...func(*LLMJudgeOptions)) *LLMJudge

NewLLMJudge creates a new Judge wrapper using functional options.

func (*LLMJudge) Factuality

func (j *LLMJudge) Factuality(opts FactualityOptions) api.Scorer

Factuality returns a scorer that compares Output against Expected for factual consistency.

func (*LLMJudge) Moderation

func (j *LLMJudge) Moderation(opts ModerationOptions) api.Scorer

Moderation returns a scorer that evaluates content safety using a moderation provider.

func (*LLMJudge) Tonality

func (j *LLMJudge) Tonality(opts TonalityOptions) api.Scorer

Tonality returns a scorer that evaluates professionalism, kindness, clarity and helpfulness.

type LLMJudgeOptions

type LLMJudgeOptions struct {
	// contains filtered or unexported fields
}

LLMJudgeOptions configures LLMJudge creation

type ModerationCategory

type ModerationCategory = api.ModerationCategory

type ModerationOptions

type ModerationOptions = llmjudge.ModerationOptions

type ModerationProvider

type ModerationProvider = api.ModerationProvider

type ModerationResult

type ModerationResult = api.ModerationResult

type Score

type Score = api.Score

type ScoreInputs

type ScoreInputs = api.ScoreInputs

type Scorer

type Scorer = api.Scorer

type TonalityOptions

type TonalityOptions = llmjudge.TonalityOptions

Directories

Path Synopsis
internal
