eval

v0.11.1 Latest
Published: Apr 21, 2026 · License: MIT · Imports: 14 · Imported by: 0

README

axon-eval

Standalone tool · Part of the lamina workspace

Evaluation framework for scenario testing against live services. Define test plans in YAML with rubric criteria — response content checks, duration limits, tool-use expectations, and LLM-judged quality — then run them against a service cluster and collect graded results.

Getting started

go get github.com/benaskins/axon-eval@latest

Requires Go 1.26+.

Create a plan file (see example/smoke.yaml):

name: smoke test
scenarios:
  - name: greeting
    message: "Hello, how are you?"
    ideal_response: "A warm, friendly greeting."
    max_duration_ms: 5000
    rubric:
      - type: min_length
        value: "20"
      - type: llm_judge
        criterion: "Response is warm and conversational in tone"

Run it from Go:

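ctx := context.Background()

plan, err := eval.LoadPlan("example/smoke.yaml")
if err != nil {
    log.Fatal(err)
}

client, err := eval.NewClient(eval.Config{
    AuthURL:      "https://auth.studio.internal",
    ChatURL:      "https://chat.studio.internal",
    AnalyticsURL: "https://look.studio.internal",
})
if err != nil {
    log.Fatal(err)
}

run, err := client.Run("smoke test", []eval.Scenario{
    eval.Conversation("greeting", []eval.Message{
        {Role: "user", Content: plan.Scenarios[0].Message},
    }),
})
if err != nil {
    log.Fatal(err)
}

// judge is an eval.Judge, for example one built with eval.NewOllamaJudge.
// Passing nil skips the llm_judge criteria.
grade := eval.GradeScenario(ctx, plan.Scenarios[0], run.Responses[0].Responses[0], judge)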

Or use the lamina eval command from the workspace:

lamina eval plans/smoke.yaml

Key types

  • Plan, PlanScenario — YAML test plan structure with scenarios, rubrics, and tool expectations
  • Criterion, ToolExpect — assertion types: contains, not_contains, min_length, max_length, llm_judge
  • Scenario, Message — test scenario with a name and ordered messages to send
  • Conversation — constructor that builds a Scenario from a name and messages
  • Client, Config — HTTP client for running scenarios against auth, chat, and analytics services
  • Run, ScenarioResult, ChatResult — execution results with response text, duration, and tools used
  • Judge, OllamaJudge — LLM-based grading interface and Ollama-backed implementation
  • ScenarioGrade, CriterionResult — graded results per scenario and per criterion
  • EvalScenario, EvalReport — programmatic evaluation with custom CheckFunc assertions

License

MIT — see LICENSE.

Documentation

Overview

Package eval provides an evaluation framework for running scenario plans against a live service cluster. It supports YAML-defined test plans with assertions on HTTP responses and LLM-generated content.

Class: experiment
UseWhen: LLM output evaluation.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func BFCLResultSchema added in v0.3.0

func BFCLResultSchema() fact.Schema

BFCLResultSchema returns the fact.Schema for per-case BFCL eval results. One row per test case, ordered by (model, category, timestamp).

func LuthierResultSchema added in v0.3.0

func LuthierResultSchema() fact.Schema

LuthierResultSchema returns the fact.Schema for per-run luthier eval results. One row per evaluation run, ordered by (model, timestamp).

Types

type ChatResult

type ChatResult struct {
	Response   string   `json:"response"`
	Thinking   string   `json:"thinking,omitempty"`
	DurationMs int64    `json:"duration_ms"`
	ToolsUsed  []string `json:"tools_used"`
}

ChatResult holds the response from a synchronous chat request.

type CheckFunc

type CheckFunc func(response string) (pass bool, reason string)

CheckFunc evaluates a response and returns pass/fail with a reason.

func ResponseContains

func ResponseContains(substr string) CheckFunc

ResponseContains returns a check that passes if the response contains the substring.

func ResponseMinLength

func ResponseMinLength(n int) CheckFunc

ResponseMinLength returns a check that passes if the response has at least n characters.
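
Because CheckFunc is a plain function type, checks can be combined with ordinary Go code. The sketch below is self-contained: responseContains and responseMinLength are local stand-ins for the documented constructors (their reason strings and the rune-based length count are assumptions), and allOf is a hypothetical helper that is not part of the package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// CheckFunc mirrors the documented signature.
type CheckFunc func(response string) (pass bool, reason string)

// responseContains is a local stand-in for eval.ResponseContains.
// The real reason strings may differ.
func responseContains(substr string) CheckFunc {
	return func(response string) (bool, string) {
		if strings.Contains(response, substr) {
			return true, ""
		}
		return false, fmt.Sprintf("response does not contain %q", substr)
	}
}

// responseMinLength is a local stand-in for eval.ResponseMinLength.
// Counting runes rather than bytes is an assumption.
func responseMinLength(n int) CheckFunc {
	return func(response string) (bool, string) {
		if got := utf8.RuneCountInString(response); got < n {
			return false, fmt.Sprintf("response has %d characters, want at least %d", got, n)
		}
		return true, ""
	}
}

// allOf is a hypothetical combinator: it passes only if every check passes
// and otherwise returns the reason of the first failing check.
func allOf(checks ...CheckFunc) CheckFunc {
	return func(response string) (bool, string) {
		for _, check := range checks {
			if ok, reason := check(response); !ok {
				return false, reason
			}
		}
		return true, ""
	}
}

func main() {
	check := allOf(responseContains("hello"), responseMinLength(10))
	for _, response := range []string{"hello there, friend", "hello"} {
		ok, reason := check(response)
		fmt.Printf("%q: pass=%v reason=%q\n", response, ok, reason)
	}
}
```

A combined check like this can be assigned to EvalScenario.Check in the same way as a single one.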

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client is the main entry point for running test scenarios and evaluations. Call NewClient to create one — it handles service user and agent setup.

func NewClient

func NewClient(cfg Config) (*Client, error)

NewClient creates a test client, setting up the service user and test agent. It calls the auth service to create a robot user, notifies the chat service, and creates the test agent (slug "xagent" unless Config.AgentSlug is set).

func (*Client) EmitEvalResult

func (c *Client) EmitEvalResult(runID string, grade *ScenarioGrade, result ChatResult) error

EmitEvalResult sends an eval_result event to the analytics service.

func (*Client) Evaluate

func (c *Client) Evaluate(description string, scenarios []EvalScenario) (*EvalReport, error)

Evaluate runs evaluation scenarios and checks responses against expected outcomes.

func (*Client) Run

func (c *Client) Run(description string, scenarios []Scenario) (*Run, error)

Run executes a batch of test scenarios, bracketed by run_started/run_completed events.

func (*Client) Verify

func (c *Client) Verify(runID string) (*VerifyResult, error)

Verify queries the analytics service for event counts from a specific run.

type Config

type Config struct {
	AuthURL      string
	ChatURL      string
	AnalyticsURL string
	AgentSlug    string // defaults to "xagent" if empty
}

Config holds the URLs for the services that axon-eval interacts with.

type Criterion

type Criterion struct {
	Type      string `yaml:"type"`
	Value     string `yaml:"value"`
	Criterion string `yaml:"criterion"`
}

Criterion is a single rubric item for grading a scenario.

type CriterionResult

type CriterionResult struct {
	Criterion string  `json:"criterion"`
	Pass      bool    `json:"pass"`
	Score     float64 `json:"score"`
	Reason    string  `json:"reason,omitempty"`
}

CriterionResult holds the result of evaluating a single criterion.
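
To make the rubric types concrete, here is a minimal sketch of how the deterministic criterion types listed in the README (contains, not_contains, min_length, max_length) could be evaluated. checkCriterion is hypothetical and is not the package's grading code; case-sensitive matching, rune-based lengths, and the 0/1 scores are all assumptions. Value is a string in the YAML plan, so numeric limits are parsed with strconv.Atoi.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// Criterion and CriterionResult mirror the documented structs.
type Criterion struct {
	Type      string `yaml:"type"`
	Value     string `yaml:"value"`
	Criterion string `yaml:"criterion"`
}

type CriterionResult struct {
	Criterion string  `json:"criterion"`
	Pass      bool    `json:"pass"`
	Score     float64 `json:"score"`
	Reason    string  `json:"reason,omitempty"`
}

// checkCriterion is a hypothetical evaluator for the non-LLM criterion types.
// It returns an error for malformed values and for types it does not handle.
func checkCriterion(c Criterion, response string) (CriterionResult, error) {
	var pass bool
	switch c.Type {
	case "contains":
		pass = strings.Contains(response, c.Value)
	case "not_contains":
		pass = !strings.Contains(response, c.Value)
	case "min_length", "max_length":
		n, err := strconv.Atoi(c.Value)
		if err != nil {
			return CriterionResult{}, fmt.Errorf("%s: value %q is not an integer: %w", c.Type, c.Value, err)
		}
		length := utf8.RuneCountInString(response)
		if c.Type == "min_length" {
			pass = length >= n
		} else {
			pass = length <= n
		}
	default:
		return CriterionResult{}, fmt.Errorf("unsupported criterion type %q", c.Type)
	}

	res := CriterionResult{Criterion: c.Type, Pass: pass}
	if pass {
		res.Score = 1
	} else {
		res.Reason = "criterion not met"
	}
	return res, nil
}

func main() {
	res, err := checkCriterion(Criterion{Type: "min_length", Value: "20"}, "I'm doing well, thanks for asking!")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", res)
}
```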

type EvalReport

type EvalReport struct {
	Passed  int
	Failed  int
	Results []EvalResult
}

EvalReport holds the aggregate results of an evaluation run.

type EvalResult

type EvalResult struct {
	Messages []Message
	Response string
	Result   ChatResult
	Pass     bool
	Reason   string
}

EvalResult holds the result of a single evaluation scenario.

type EvalScenario

type EvalScenario struct {
	Messages []Message
	Check    CheckFunc
}

EvalScenario is a test scenario with an expected outcome check.
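
The following sketch illustrates how EvalScenario, EvalResult, and EvalReport relate, without contacting any service. evaluate is a hypothetical offline stand-in for Client.Evaluate, echo is a fake chat backend, and contains is a local stand-in for ResponseContains; the mirrored EvalResult omits the Result field for brevity. How the real client tallies results is not documented here, so treat the counting logic as an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Local mirrors of the documented types (EvalResult is trimmed).
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type CheckFunc func(response string) (pass bool, reason string)

type EvalScenario struct {
	Messages []Message
	Check    CheckFunc
}

type EvalResult struct {
	Messages []Message
	Response string
	Pass     bool
	Reason   string
}

type EvalReport struct {
	Passed  int
	Failed  int
	Results []EvalResult
}

// contains is a local stand-in for eval.ResponseContains.
func contains(substr string) CheckFunc {
	return func(response string) (bool, string) {
		if strings.Contains(response, substr) {
			return true, ""
		}
		return false, fmt.Sprintf("missing %q", substr)
	}
}

// echo is a fake chat backend that repeats the last message.
func echo(msgs []Message) string {
	if len(msgs) == 0 {
		return ""
	}
	return "You said: " + msgs[len(msgs)-1].Content
}

// evaluate is a hypothetical offline version of Client.Evaluate: it sends
// each scenario through send, applies the scenario's check, and tallies.
func evaluate(send func([]Message) string, scenarios []EvalScenario) *EvalReport {
	report := &EvalReport{}
	for _, sc := range scenarios {
		response := send(sc.Messages)
		pass, reason := sc.Check(response)
		if pass {
			report.Passed++
		} else {
			report.Failed++
		}
		report.Results = append(report.Results, EvalResult{
			Messages: sc.Messages, Response: response, Pass: pass, Reason: reason,
		})
	}
	return report
}

func main() {
	ping := []Message{{Role: "user", Content: "ping"}}
	report := evaluate(echo, []EvalScenario{
		{Messages: ping, Check: contains("ping")},
		{Messages: ping, Check: contains("pong")},
	})
	fmt.Printf("passed=%d failed=%d\n", report.Passed, report.Failed) // prints "passed=1 failed=1"
}
```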

type Judge

type Judge interface {
	Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)
}

Judge is an interface for LLM-based grading. See judge.go for implementation.

type JudgeResult

type JudgeResult struct {
	Pass   bool    `json:"pass"`
	Score  float64 `json:"score"`
	Reason string  `json:"reason"`
}

JudgeResult holds the output of an LLM judge evaluation.
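
Since Judge is a single-method interface, a deterministic stub can stand in for an LLM when testing grading code. The sketch below mirrors Judge and JudgeResult locally; keywordJudge and gradeGreeting are hypothetical names introduced only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// JudgeResult and Judge mirror the documented declarations.
type JudgeResult struct {
	Pass   bool    `json:"pass"`
	Score  float64 `json:"score"`
	Reason string  `json:"reason"`
}

type Judge interface {
	Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)
}

// keywordJudge is a hypothetical test double: it passes a response when it
// contains the keyword (case-insensitive) and ignores the other arguments.
type keywordJudge struct{ keyword string }

func (j keywordJudge) Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // honor cancellation like a real judge would
	}
	if strings.Contains(strings.ToLower(response), strings.ToLower(j.keyword)) {
		return &JudgeResult{Pass: true, Score: 1, Reason: "keyword found"}, nil
	}
	return &JudgeResult{Pass: false, Score: 0, Reason: "keyword missing"}, nil
}

// gradeGreeting is a small usage helper for this sketch.
func gradeGreeting(j Judge, response string) (*JudgeResult, error) {
	return j.Grade(context.Background(), response,
		"A warm, friendly greeting.", "Response greets the user")
}

func main() {
	res, err := gradeGreeting(keywordJudge{keyword: "hello"}, "Hello! How can I help?")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *res)
}
```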

type Message

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

Message is a single message in a conversation scenario.

type OllamaJudge

type OllamaJudge struct {
	// contains filtered or unexported fields
}

OllamaJudge implements Judge using an LLM via TextGenerator.

func NewOllamaJudge

func NewOllamaJudge(generate TextGenerator) *OllamaJudge

NewOllamaJudge creates a judge backed by the given text generator.

func (*OllamaJudge) Grade

func (j *OllamaJudge) Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)

Grade sends the response, ideal response, and criterion to the LLM judge and parses a structured JSON result.

type Plan

type Plan struct {
	Name      string         `yaml:"name"`
	Scenarios []PlanScenario `yaml:"scenarios"`
}

Plan is a YAML-driven test plan containing named scenarios.

func LoadPlan

func LoadPlan(path string) (*Plan, error)

LoadPlan reads and validates a YAML test plan from the given path.

type PlanScenario

type PlanScenario struct {
	Name          string      `yaml:"name"`
	Message       string      `yaml:"message"`
	IdealResponse string      `yaml:"ideal_response"`
	MaxDurationMs int64       `yaml:"max_duration_ms"`
	Tools         ToolExpect  `yaml:"tools"`
	Rubric        []Criterion `yaml:"rubric"`
}

PlanScenario is a single test scenario within a plan.

type Run

type Run struct {
	ID        string
	Responses []ScenarioResult
}

Run holds the results of a test run.

type Scenario

type Scenario struct {
	Name     string
	Messages []Message
}

Scenario is a test scenario to execute during a run.

func Conversation

func Conversation(name string, messages []Message) Scenario

Conversation creates a scenario from a name and messages.

type ScenarioGrade

type ScenarioGrade struct {
	Scenario string            `json:"scenario"`
	Results  []CriterionResult `json:"results"`
	Passed   int               `json:"passed"`
	Failed   int               `json:"failed"`
	Total    int               `json:"total"`
}

ScenarioGrade holds the aggregate grading results for a scenario.

func GradeScenario

func GradeScenario(ctx context.Context, scenario PlanScenario, result ChatResult, judge Judge) *ScenarioGrade

GradeScenario evaluates a scenario's rubric criteria and auto-checks against a chat result. If judge is nil, llm_judge criteria are skipped.
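
As an illustration of the ScenarioGrade fields, the sketch below tallies a list of criterion results. tally is a hypothetical helper, and the rule that Total equals the number of results (with every result counted as passed or failed) is an assumption; the source does not say whether skipped llm_judge criteria count toward Total.

```go
package main

import "fmt"

// CriterionResult and ScenarioGrade mirror the documented structs.
type CriterionResult struct {
	Criterion string  `json:"criterion"`
	Pass      bool    `json:"pass"`
	Score     float64 `json:"score"`
	Reason    string  `json:"reason,omitempty"`
}

type ScenarioGrade struct {
	Scenario string            `json:"scenario"`
	Results  []CriterionResult `json:"results"`
	Passed   int               `json:"passed"`
	Failed   int               `json:"failed"`
	Total    int               `json:"total"`
}

// tally is a hypothetical helper that builds a ScenarioGrade from results.
func tally(scenario string, results []CriterionResult) *ScenarioGrade {
	grade := &ScenarioGrade{Scenario: scenario, Results: results, Total: len(results)}
	for _, r := range results {
		if r.Pass {
			grade.Passed++
		} else {
			grade.Failed++
		}
	}
	return grade
}

func main() {
	grade := tally("greeting", []CriterionResult{
		{Criterion: "min_length", Pass: true, Score: 1},
		{Criterion: "llm_judge", Pass: false, Score: 0.4, Reason: "tone is flat"},
	})
	fmt.Printf("%s: %d/%d passed\n", grade.Scenario, grade.Passed, grade.Total) // prints "greeting: 1/2 passed"
}
```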

type ScenarioResult

type ScenarioResult struct {
	Name      string
	Responses []ChatResult
}

ScenarioResult holds the result of a single scenario.

type TextGenerator

type TextGenerator func(ctx context.Context, prompt string, temperature float64, maxTokens int) (string, error)

TextGenerator produces text from a prompt. Matches the axon-memo pattern.
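
Because TextGenerator is a function type, a canned generator is enough to exercise judge-style code offline. The sketch below pairs one with a tolerant JSON decoder; cannedGenerator, parseJudgeReply, and judgeOnce are hypothetical, and the approach of taking the text between the first "{" and the last "}" is one plausible way to read a structured reply, not necessarily what OllamaJudge does.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// TextGenerator and JudgeResult mirror the documented declarations.
type TextGenerator func(ctx context.Context, prompt string, temperature float64, maxTokens int) (string, error)

type JudgeResult struct {
	Pass   bool    `json:"pass"`
	Score  float64 `json:"score"`
	Reason string  `json:"reason"`
}

// cannedGenerator returns a TextGenerator that ignores its inputs and
// always replies with the given text.
func cannedGenerator(reply string) TextGenerator {
	return func(ctx context.Context, prompt string, temperature float64, maxTokens int) (string, error) {
		return reply, nil
	}
}

// parseJudgeReply decodes the JSON object embedded in text, tolerating
// any prose the model adds before or after it.
func parseJudgeReply(text string) (*JudgeResult, error) {
	start := strings.Index(text, "{")
	end := strings.LastIndex(text, "}")
	if start < 0 || end < start {
		return nil, errors.New("no JSON object in judge reply")
	}
	var res JudgeResult
	if err := json.Unmarshal([]byte(text[start:end+1]), &res); err != nil {
		return nil, fmt.Errorf("decode judge reply: %w", err)
	}
	return &res, nil
}

// judgeOnce calls the generator once and parses its reply. The temperature
// and token limit are illustrative values.
func judgeOnce(generate TextGenerator, prompt string) (*JudgeResult, error) {
	text, err := generate(context.Background(), prompt, 0, 256)
	if err != nil {
		return nil, err
	}
	return parseJudgeReply(text)
}

func main() {
	generate := cannedGenerator(`Verdict: {"pass": true, "score": 0.9, "reason": "warm tone"}`)
	res, err := judgeOnce(generate, "Grade this response against the criterion.")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *res)
}
```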

type ToolExpect

type ToolExpect struct {
	Expect []string `yaml:"expect"`
	Reject []string `yaml:"reject"`
}

ToolExpect specifies which tools should or should not be used.
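
One way to read a ToolExpect against ChatResult.ToolsUsed is sketched below. checkTools is hypothetical, the tool names are invented for the example, and exact string matching is an assumption about how the package compares names.

```go
package main

import "fmt"

// ToolExpect mirrors the documented struct.
type ToolExpect struct {
	Expect []string `yaml:"expect"`
	Reject []string `yaml:"reject"`
}

// checkTools is a hypothetical helper. It returns the expected tools that
// were not used and the rejected tools that were used.
func checkTools(want ToolExpect, used []string) (missing, forbidden []string) {
	seen := make(map[string]bool, len(used))
	for _, name := range used {
		seen[name] = true
	}
	for _, name := range want.Expect {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	for _, name := range want.Reject {
		if seen[name] {
			forbidden = append(forbidden, name)
		}
	}
	return missing, forbidden
}

func main() {
	want := ToolExpect{Expect: []string{"web_search"}, Reject: []string{"send_email"}}
	missing, forbidden := checkTools(want, []string{"send_email", "calculator"})
	fmt.Println("missing:", missing, "forbidden:", forbidden)
}
```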

type VerifyResult

type VerifyResult struct {
	RunID                 string `json:"run_id"`
	Messages              int    `json:"messages"`
	ToolInvocations       int    `json:"tool_invocations"`
	Conversations         int    `json:"conversations"`
	Memories              int    `json:"memories"`
	RelationshipSnapshots int    `json:"relationship_snapshots"`
	Consolidations        int    `json:"consolidations"`
}

VerifyResult holds the event counts for a run, queried from the analytics service.
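
The JSON tags on VerifyResult suggest a payload shaped like the one below. This sketch decodes such a payload and applies a simple threshold; checkRun and the sample body are hypothetical, and the actual analytics response may carry more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VerifyResult mirrors the documented struct.
type VerifyResult struct {
	RunID                 string `json:"run_id"`
	Messages              int    `json:"messages"`
	ToolInvocations       int    `json:"tool_invocations"`
	Conversations         int    `json:"conversations"`
	Memories              int    `json:"memories"`
	RelationshipSnapshots int    `json:"relationship_snapshots"`
	Consolidations        int    `json:"consolidations"`
}

// checkRun is a hypothetical helper: it decodes a verify payload and
// reports an error if fewer than wantMessages messages were recorded.
func checkRun(body []byte, wantMessages int) (*VerifyResult, error) {
	var res VerifyResult
	if err := json.Unmarshal(body, &res); err != nil {
		return nil, fmt.Errorf("decode verify result: %w", err)
	}
	if res.Messages < wantMessages {
		return &res, fmt.Errorf("run %s recorded %d messages, want at least %d",
			res.RunID, res.Messages, wantMessages)
	}
	return &res, nil
}

func main() {
	// Sample payload invented for this example.
	body := []byte(`{"run_id": "run-123", "messages": 2, "tool_invocations": 1, "conversations": 1}`)
	res, err := checkRun(body, 2)
	if err != nil {
		fmt.Println("verify failed:", err)
		return
	}
	fmt.Printf("run %s: %d messages, %d tool invocations\n", res.RunID, res.Messages, res.ToolInvocations)
}
```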

Directories

Path Synopsis
Package bfcl runs the Berkeley Function Calling Leaderboard (BFCL) benchmark against any loop.LLMClient.
cmd/bfcl-run command
bfcl-run executes the full BFCL benchmark against a loop.LLMClient and reports per-category and overall accuracy.
cmd
eval-ingest command
eval-ingest reads a JSON RunReport from stdin and records each result as a fact via the Pipeline to ClickHouse.
explorer command
