goeval

v0.1.0 · Published: Oct 22, 2025 · License: MIT

A Go library for fast, automated evaluation of Large Language Model (LLM) outputs, inspired by Braintrust's autoevals.

Features

  • Simple, consistent scoring API returning scores in [0.0, 1.0]
  • LLM-as-a-judge evaluators: factuality, tonality, and moderation
  • Heuristic and embedding-based evaluators for speed and semantics
  • Structured outputs from LLM judges for debuggability (choices, confidences, evidence)
  • Support for Google Vertex AI (Gemini) via a pluggable generator/provider (more providers planned)

How Scoring Works

All scorers follow a simple design pattern:

  • input: The original question or prompt, used as context by some scorers (optional)
  • output: The actual response from your model
  • expected: The expected/reference response (optional for some scorers)

The scorer compares output against expected and returns a score between 0.0 and 1.0, where 1.0 is the best possible score.

Getting Started

package main

import (
    "context"
    "fmt"

    "github.com/datar-psa/goeval"
    "google.golang.org/genai"
)

func main() {
    ctx := context.Background()

    // Create Gemini client
    genaiClient, _ := genai.NewClient(ctx, &genai.ClientConfig{ /* Your config */ })

    // Create Judge wrapper with functional options
    judge := goeval.NewGeminiLLMJudge(
        goeval.WithGenaiClient(genaiClient),
        goeval.WithModelName("publishers/google/models/gemini-2.5-flash"),
    )

    // Create the Factuality scorer without passing LLM each time
    scorer := judge.Factuality(goeval.FactualityOptions{})

    // Score the model output against an expected answer, with the original question as input
    result := scorer.Score(ctx, goeval.ScoreInputs{
        Input:    "What is the capital of France?",
        Output:   "Paris",
        Expected: "Paris",
    })

    if result.Error != nil {
        panic(result.Error)
    }
    fmt.Printf("Score: %.2f, choice=%v\n", result.Score, result.Metadata["choice"])
}

Scorers

LLM-as-a-Judge Evaluations

Sophisticated evaluations using language models as judges.

Package: github.com/datar-psa/goeval/llmjudge

Scorer       Description
Factuality   LLM judge comparing Output vs Expected for factual consistency
Tonality     LLM judge for professionalism, kindness, clarity, helpfulness (A–E anchors)
Moderation   Content safety via a moderation provider; 1.0 = safe, 0.0 = unsafe

Heuristic Evaluations

Fast, rule-based scorers that don't require LLMs.

Package: github.com/datar-psa/goeval/heuristic

Scorer      Description
ExactMatch  Simple equality (configurable case/whitespace handling)

Embedding Evaluations

Semantic similarity using vector embeddings.

Package: github.com/datar-psa/goeval/embedding

Scorer               Description
EmbeddingSimilarity  Cosine similarity over embeddings (semantic closeness)

Use Cases

1) FAQ Answer Accuracy (Factuality)

Evaluate if an assistant's answer matches a knowledge base answer.

judge := goeval.NewGeminiLLMJudge(
    goeval.WithGenaiClient(genaiClient),
    goeval.WithModelName("publishers/google/models/gemini-2.5-flash"),
)
res := judge.Factuality(goeval.FactualityOptions{}).Score(ctx, goeval.ScoreInputs{
    Input:    "What are store hours on Sundays?",
    Output:   "We're open 10am–6pm on Sundays.",
    Expected: "We are open from 10:00 to 18:00 on Sundays.",
})
// res.Score in [0..1]; metadata includes {choice, explanation, raw_response}

2) Support Reply Tone (Tonality)

Enforce minimum tone quality across all dimensions with a threshold gate.

judge := goeval.NewGeminiLLMJudge(
    goeval.WithGenaiClient(genaiClient),
    goeval.WithModelName("publishers/google/models/gemini-2.5-flash"),
)
tonality := judge.Tonality(goeval.TonalityOptions{
    ProfessionalismWeight: 0.25,
    KindnessWeight:        0.25,
    ClarityWeight:         0.25,
    HelpfulnessWeight:     0.25,
    Threshold:             0.4, // if any used dimension < 0.4, overall score becomes 0
})

res := tonality.Score(ctx, goeval.ScoreInputs{
    Input:  "Customer complaint about delayed shipment",
    Output: "I'm sorry for the delay — here's what we're doing next...",
})
// res.Metadata contains per-dimension choices/scores and applied weights
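The weighting-plus-threshold logic can be sketched as follows (function and variable names here are illustrative, not the library's internals): dimensions are combined by weight, but if any used dimension falls below the threshold, the overall score collapses to 0.

```go
package main

import "fmt"

// tonalityScore combines per-dimension scores by weight, with a
// threshold gate: any used dimension below the threshold zeroes the
// overall score.
func tonalityScore(dims, weights map[string]float64, threshold float64) float64 {
	total, weightSum := 0.0, 0.0
	for name, w := range weights {
		if w == 0 {
			continue // unused dimension: ignored by the gate too
		}
		score := dims[name]
		if score < threshold {
			return 0.0 // threshold gate
		}
		total += w * score
		weightSum += w
	}
	if weightSum == 0 {
		return 0.0
	}
	return total / weightSum
}

func main() {
	weights := map[string]float64{"professionalism": 0.25, "kindness": 0.25, "clarity": 0.25, "helpfulness": 0.25}
	dims := map[string]float64{"professionalism": 0.9, "kindness": 0.8, "clarity": 0.7, "helpfulness": 0.85}
	fmt.Printf("%.3f\n", tonalityScore(dims, weights, 0.4)) // weighted mean: all dimensions clear the gate

	dims["clarity"] = 0.3
	fmt.Printf("%.3f\n", tonalityScore(dims, weights, 0.4)) // gated to 0: clarity below threshold
}
```

The gate makes the scorer a hard filter: a reply that is professional but incomprehensible fails outright instead of being averaged up.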

3) Chat Moderation (Moderation)

Block unsafe replies and steer away from sensitive topics (e.g., religion/politics).

// Create Google Cloud Language client
langClient, _ := language.NewRESTClient(ctx)

judge := goeval.NewGeminiLLMJudge(
    goeval.WithLanguageClient(langClient),
)
moderation := judge.Moderation(goeval.ModerationOptions{
    Threshold:  0.5,
    Categories: []string{"Toxic", "Derogatory", "Violent", "Insult", "ReligionBelief", "Politics"},
})

res := moderation.Score(ctx, goeval.ScoreInputs{Output: "Let's discuss your religion and political views..."})
// res.Score = 0.0 if unsafe; metadata includes flagged categories and is_safe=false
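The gating behavior can be sketched like this (names are illustrative, not the library's internals): the moderation provider returns a confidence per category, and the output scores 0.0 as soon as any selected category meets the threshold.

```go
package main

import "fmt"

// moderationScore flags any selected category whose confidence meets the
// threshold; one or more flags makes the output unsafe (score 0.0).
func moderationScore(confidences map[string]float64, categories []string, threshold float64) (float64, []string) {
	var flagged []string
	for _, cat := range categories {
		if confidences[cat] >= threshold {
			flagged = append(flagged, cat)
		}
	}
	if len(flagged) > 0 {
		return 0.0, flagged
	}
	return 1.0, nil
}

func main() {
	// Toy confidences standing in for a moderation provider's response.
	confidences := map[string]float64{"Toxic": 0.1, "Politics": 0.8}
	score, flagged := moderationScore(confidences, []string{"Toxic", "Politics"}, 0.5)
	fmt.Println(score, flagged)
}
```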

4) Intent Similarity (Embeddings)

Group similar user requests or route to the right workflow.

// Create embedding scorer
embedding := goeval.NewGeminiEmbedding(
    goeval.WithGenaiClient(genaiClient),
    goeval.WithModelName("text-embedding-005"),
)
sim := embedding.Similarity(goeval.EmbeddingSimilarityOptions{})

res := sim.Score(ctx, goeval.ScoreInputs{
    Output:   "Reset my password",
    Expected: "I can't log in to my account",
})
// Higher scores indicate closer semantic intent

5) Exact Match Validation (Heuristic)

Fast validation for exact matches with configurable options.

// Create heuristic scorer
heuristic := goeval.NewHeuristic()
exactMatch := heuristic.ExactMatch(goeval.ExactMatchOptions{
    CaseSensitive: false,
    TrimWhitespace: true,
})

res := exactMatch.Score(ctx, goeval.ScoreInputs{
    Output:   "Paris",
    Expected: "paris",
})
// res.Score = 1.0 for exact match (case-insensitive)
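The normalization implied by the options above can be sketched as follows (the library's implementation may differ in detail):

```go
package main

import (
	"fmt"
	"strings"
)

// exactMatch normalizes both strings per the options, then compares:
// trim surrounding whitespace first, then fold case if insensitive.
func exactMatch(output, expected string, caseSensitive, trimWhitespace bool) float64 {
	if trimWhitespace {
		output = strings.TrimSpace(output)
		expected = strings.TrimSpace(expected)
	}
	if !caseSensitive {
		output = strings.ToLower(output)
		expected = strings.ToLower(expected)
	}
	if output == expected {
		return 1.0
	}
	return 0.0
}

func main() {
	fmt.Println(exactMatch("  Paris ", "paris", false, true)) // 1 after normalization
	fmt.Println(exactMatch("Paris", "paris", true, false))    // 0 when case-sensitive
}
```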

Design Philosophy

The library is designed with flexibility and composability in mind:

  • Client-First Approach: We accept pre-configured clients (like *genai.Client, *language.Client) rather than raw credentials or project IDs. This gives you complete control over authentication, retry policies, and other client configurations.

  • Functional Options Pattern: All constructors use functional options for clean, extensible APIs that grow gracefully over time.

  • Pluggable Providers: The scoring interfaces are designed to be implemented by any provider, making it easy to add support for new LLM providers or evaluation services.
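A minimal sketch of the functional-options pattern the constructors use (WithGenaiClient, WithModelName, and friends); the option struct and defaults here are illustrative, not goeval's actual GeminiOptions:

```go
package main

import "fmt"

// options is a hypothetical config struct; callers never touch it directly.
type options struct {
	modelName string
	timeoutMS int
}

type Option func(*options)

func WithModelName(name string) Option {
	return func(o *options) { o.modelName = name }
}

func WithTimeoutMS(ms int) Option {
	return func(o *options) { o.timeoutMS = ms }
}

// NewJudge applies defaults first, then each option in order, so new
// options can be added later without breaking existing callers.
func NewJudge(opts ...Option) options {
	o := options{modelName: "default-model", timeoutMS: 30000}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	j := NewJudge(WithModelName("publishers/google/models/gemini-2.5-flash"))
	fmt.Println(j.modelName, j.timeoutMS)
}
```

Because each option is just a function over the config struct, adding a new knob is a new `With...` function rather than a breaking change to a constructor signature.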

Development

Running Tests

go test -short              # Unit tests only
go test                     # All tests
UPDATE_TESTS=true go test   # Update integration test cache (LLM requests)

Request Caching

Currently we're using hypert to cache LLM requests. The library's integration tests already demonstrate this pattern.

Roadmap
  • More Scorers: Additional evaluation methods
  • Request Caching: Built-in caching layer for LLM requests (currently one option is hypert)
  • OpenAI Provider: Native support for OpenAI's GPT models alongside Google Gemini

Documentation

Index

Constants

This section is empty.

Variables

View Source
var (
	ErrNoExpectedValue     = api.ErrNoExpectedValue
	ErrLLMGenerationFailed = api.ErrLLMGenerationFailed
)
View Source
var ModerationCategories = api.ModerationCategories

Functions

func WithEmbedder

func WithEmbedder(embedder api.Embedder) func(*EmbeddingOptions)

WithEmbedder sets the embedder for the embedding scorer

func WithGenaiClient

func WithGenaiClient(client *genai.Client) func(*GeminiOptions)

WithGenaiClient sets the Gemini client for the judge

func WithLLMGenerator

func WithLLMGenerator(llm api.LLMGenerator) func(*LLMJudgeOptions)

WithLLMGenerator sets the LLM generator for the judge

func WithLanguageClient

func WithLanguageClient(langClient *language.Client) func(*GeminiOptions)

WithLanguageClient sets the Google Cloud Language client for moderation

func WithModelName

func WithModelName(modelName string) func(*GeminiOptions)

WithModelName sets the model name for the judge

func WithModerationProvider

func WithModerationProvider(provider api.ModerationProvider) func(*LLMJudgeOptions)

WithModerationProvider sets the moderation provider for the judge

Types

type Embedder

type Embedder = api.Embedder

type Embedding

type Embedding struct {
	// contains filtered or unexported fields
}

Embedding wraps an embedder and exposes convenient constructors for embedding-based scorers.

func NewEmbedding

func NewEmbedding(opts ...func(*EmbeddingOptions)) *Embedding

NewEmbedding creates a new Embedding wrapper using functional options.

func NewGeminiEmbedding

func NewGeminiEmbedding(opts ...func(*GeminiOptions)) *Embedding

NewGeminiEmbedding creates an Embedding using Gemini client and model name. Example model: "text-embedding-005".

func (*Embedding) Similarity

func (e *Embedding) Similarity(opts EmbeddingSimilarityOptions) api.Scorer

Similarity returns a scorer that measures semantic similarity using embeddings.

type EmbeddingOptions

type EmbeddingOptions struct {
	// contains filtered or unexported fields
}

EmbeddingOptions configures Embedding creation

type EmbeddingSimilarityOptions

type EmbeddingSimilarityOptions = embedding.EmbeddingSimilarityOptions

type ExactMatchOptions

type ExactMatchOptions = heuristic.ExactMatchOptions

type FactualityOptions

type FactualityOptions = llmjudge.FactualityOptions

type GeminiOptions

type GeminiOptions struct {
	// contains filtered or unexported fields
}

GeminiOptions configures Gemini LLMJudge creation

type Heuristic

type Heuristic struct{}

Heuristic exposes convenient constructors for heuristic scorers.

func NewHeuristic

func NewHeuristic() *Heuristic

NewHeuristic creates a new Heuristic.

func (*Heuristic) ExactMatch

func (h *Heuristic) ExactMatch(opts ExactMatchOptions) api.Scorer

ExactMatch returns a scorer that checks if the output exactly matches the expected value.

type LLMGenerator

type LLMGenerator = api.LLMGenerator

type LLMJudge

type LLMJudge struct {
	// contains filtered or unexported fields
}

LLMJudge wraps an LLM generator and exposes convenient constructors for LLM-as-a-judge scorers. It allows creating scorers like Factuality and Tonality without passing the LLM each time.

func NewGeminiLLMJudge

func NewGeminiLLMJudge(opts ...func(*GeminiOptions)) *LLMJudge

NewGeminiLLMJudge creates a Judge using Gemini client and model name. Example model: "publishers/google/models/gemini-2.5-flash".

func NewLLMJudge

func NewLLMJudge(opts ...func(*LLMJudgeOptions)) *LLMJudge

NewLLMJudge creates a new Judge wrapper using functional options.

func (*LLMJudge) Factuality

func (j *LLMJudge) Factuality(opts FactualityOptions) api.Scorer

Factuality returns a scorer that compares Output against Expected for factual consistency.

func (*LLMJudge) Moderation

func (j *LLMJudge) Moderation(opts ModerationOptions) api.Scorer

Moderation returns a scorer that evaluates content safety using a moderation provider.

func (*LLMJudge) Tonality

func (j *LLMJudge) Tonality(opts TonalityOptions) api.Scorer

Tonality returns a scorer that evaluates professionalism, kindness, clarity and helpfulness.

type LLMJudgeOptions

type LLMJudgeOptions struct {
	// contains filtered or unexported fields
}

LLMJudgeOptions configures LLMJudge creation

type ModerationCategory

type ModerationCategory = api.ModerationCategory

type ModerationOptions

type ModerationOptions = llmjudge.ModerationOptions

type ModerationProvider

type ModerationProvider = api.ModerationProvider

type ModerationResult

type ModerationResult = api.ModerationResult

type Score

type Score = api.Score

type ScoreInputs

type ScoreInputs = api.ScoreInputs

type Scorer

type Scorer = api.Scorer

type TonalityOptions

type TonalityOptions = llmjudge.TonalityOptions

Directories

Path Synopsis
internal
