eval

package module

Version: v0.11.1 (latest) · Published: Apr 21, 2026 · License: MIT · Imports: 14 · Imported by: 0

Repository

github.com/benaskins/axon-eval

README ¶

axon-eval

Standalone tool · Part of the lamina workspace

Evaluation framework for scenario testing against live services. Define test plans in YAML with rubric criteria — response content checks, duration limits, tool-use expectations, and LLM-judged quality — then run them against a service cluster and collect graded results.

Getting started

go get github.com/benaskins/axon-eval@latest

Requires Go 1.26+.

Create a plan file (see example/smoke.yaml):

name: smoke test
scenarios:
  - name: greeting
    message: "Hello, how are you?"
    ideal_response: "A warm, friendly greeting."
    max_duration_ms: 5000
    rubric:
      - type: min_length
        value: "20"
      - type: llm_judge
        criterion: "Response is warm and conversational in tone"

Run it from Go:

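// Requires imports: "context", "log", and eval "github.com/benaskins/axon-eval".
plan, err := eval.LoadPlan("example/smoke.yaml")
if err != nil {
    log.Fatal(err)
}

client, err := eval.NewClient(eval.Config{
    AuthURL:      "https://auth.studio.internal",
    ChatURL:      "https://chat.studio.internal",
    AnalyticsURL: "https://look.studio.internal",
})
if err != nil {
    log.Fatal(err)
}

run, err := client.Run("smoke test", []eval.Scenario{
    eval.Conversation("greeting", []eval.Message{
        {Role: "user", Content: plan.Scenarios[0].Message},
    }),
})
if err != nil {
    log.Fatal(err)
}

// A nil judge skips llm_judge criteria; use eval.NewOllamaJudge to grade them.
var judge eval.Judge
grade := eval.GradeScenario(context.Background(), plan.Scenarios[0], run.Responses[0].Responses[0], judge)
log.Printf("%d/%d criteria passed", grade.Passed, grade.Total)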

Or use the lamina eval command from the workspace:

lamina eval plans/smoke.yaml

Key types

Plan, PlanScenario: YAML test plan structure with scenarios, rubrics, and tool expectations
Criterion, ToolExpect: rubric criteria (contains, not_contains, min_length, max_length, llm_judge) and tool-use expectations (expect, reject)
Scenario, Message: test scenario with a name and ordered messages to send
Conversation: constructor that builds a Scenario from a name and messages
Client, Config: HTTP client for running scenarios against auth, chat, and analytics services
Run, ScenarioResult, ChatResult: execution results with response text, duration, and tools used
Judge, OllamaJudge: LLM-based grading interface and Ollama-backed implementation
ScenarioGrade, CriterionResult: graded results per scenario and per criterion
EvalScenario, EvalReport: programmatic evaluation with custom CheckFunc assertions

License

MIT — see LICENSE.

Documentation ¶

Overview ¶

Package eval provides an evaluation framework for running scenario plans against a live service cluster. It supports YAML-defined test plans with assertions on HTTP responses and LLM-generated content.

Class: experiment
UseWhen: LLM output evaluation.

Index ¶

func BFCLResultSchema() fact.Schema
func LuthierResultSchema() fact.Schema
type ChatResult
type CheckFunc
- func ResponseContains(substr string) CheckFunc
- func ResponseMinLength(n int) CheckFunc
type Client
- func NewClient(cfg Config) (*Client, error)
- func (c *Client) EmitEvalResult(runID string, grade *ScenarioGrade, result ChatResult) error
- func (c *Client) Evaluate(description string, scenarios []EvalScenario) (*EvalReport, error)
- func (c *Client) Run(description string, scenarios []Scenario) (*Run, error)
- func (c *Client) Verify(runID string) (*VerifyResult, error)
type Config
type Criterion
type CriterionResult
type EvalReport
type EvalResult
type EvalScenario
type Judge
type JudgeResult
type Message
type OllamaJudge
- func NewOllamaJudge(generate TextGenerator) *OllamaJudge
- func (j *OllamaJudge) Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)
type Plan
- func LoadPlan(path string) (*Plan, error)
type PlanScenario
type Run
type Scenario
- func Conversation(name string, messages []Message) Scenario
type ScenarioGrade
- func GradeScenario(ctx context.Context, scenario PlanScenario, result ChatResult, judge Judge) *ScenarioGrade
type ScenarioResult
type TextGenerator
type ToolExpect
type VerifyResult

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func BFCLResultSchema ¶ added in v0.3.0

func BFCLResultSchema() fact.Schema

BFCLResultSchema returns the fact.Schema for per-case BFCL eval results. One row per test case, ordered by (model, category, timestamp).

func LuthierResultSchema ¶ added in v0.3.0

func LuthierResultSchema() fact.Schema

LuthierResultSchema returns the fact.Schema for per-run luthier eval results. One row per evaluation run, ordered by (model, timestamp).

Types ¶

type ChatResult ¶

type ChatResult struct {
	Response   string   `json:"response"`
	Thinking   string   `json:"thinking,omitempty"`
	DurationMs int64    `json:"duration_ms"`
	ToolsUsed  []string `json:"tools_used"`
}

ChatResult holds the response from a synchronous chat request.

type CheckFunc ¶

type CheckFunc func(response string) (pass bool, reason string)

CheckFunc evaluates a response and returns pass/fail with a reason.

func ResponseContains ¶

func ResponseContains(substr string) CheckFunc

ResponseContains returns a check that passes if the response contains the substring.

func ResponseMinLength ¶

func ResponseMinLength(n int) CheckFunc

ResponseMinLength returns a check that passes if the response has at least n characters.
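
Because CheckFunc is a plain function type, custom checks and combinators can sit alongside the built-in constructors. The sketch below redeclares the type locally so it compiles without importing the package; containsFold and allOf are hypothetical helpers for illustration and are not part of axon-eval.

```go
package main

import (
	"fmt"
	"strings"
)

// CheckFunc mirrors the documented signature; it is redeclared here so the
// sketch compiles on its own.
type CheckFunc func(response string) (pass bool, reason string)

// containsFold is a hypothetical custom check: a case-insensitive substring match.
func containsFold(substr string) CheckFunc {
	return func(response string) (bool, string) {
		if strings.Contains(strings.ToLower(response), strings.ToLower(substr)) {
			return true, ""
		}
		return false, fmt.Sprintf("response does not contain %q (case-insensitive)", substr)
	}
}

// allOf is a hypothetical combinator: it passes only if every check passes,
// and reports the reason from the first failing check.
func allOf(checks ...CheckFunc) CheckFunc {
	return func(response string) (bool, string) {
		for _, check := range checks {
			if ok, reason := check(response); !ok {
				return false, reason
			}
		}
		return true, ""
	}
}

func main() {
	check := allOf(containsFold("hello"), containsFold("help"))
	ok, reason := check("Hello! How can I help you today?")
	fmt.Println(ok, reason)
}
```

A combined check like this could be assigned to the Check field of an EvalScenario in the same way as ResponseContains or ResponseMinLength.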

type Client ¶

type Client struct {
	// contains filtered or unexported fields
}

Client is the main entry point for running test scenarios and evaluations. Call NewClient to create one — it handles service user and agent setup.

func NewClient ¶

func NewClient(cfg Config) (*Client, error)

NewClient creates a test client, setting up the service user and test agent. It calls the auth service to create a robot user, notifies the chat service, and creates the test agent (slug "xagent" unless Config.AgentSlug is set).

func (*Client) EmitEvalResult ¶

func (c *Client) EmitEvalResult(runID string, grade *ScenarioGrade, result ChatResult) error

EmitEvalResult sends an eval_result event to the analytics service.

func (*Client) Evaluate ¶

func (c *Client) Evaluate(description string, scenarios []EvalScenario) (*EvalReport, error)

Evaluate runs evaluation scenarios and checks responses against expected outcomes.

func (*Client) Run ¶

func (c *Client) Run(description string, scenarios []Scenario) (*Run, error)

Run executes a batch of test scenarios, bracketed by run_started/run_completed events.

func (*Client) Verify ¶

func (c *Client) Verify(runID string) (*VerifyResult, error)

Verify queries the analytics service for event counts from a specific run.

type Config ¶

type Config struct {
	AuthURL      string
	ChatURL      string
	AnalyticsURL string
	AgentSlug    string // defaults to "xagent" if empty
}

Config holds the URLs for the services that axon-eval interacts with.

type Criterion ¶

type Criterion struct {
	Type      string `yaml:"type"`
	Value     string `yaml:"value"`
	Criterion string `yaml:"criterion"`
}

Criterion is a single rubric item for grading a scenario.

type CriterionResult ¶

type CriterionResult struct {
	Criterion string  `json:"criterion"`
	Pass      bool    `json:"pass"`
	Score     float64 `json:"score"`
	Reason    string  `json:"reason,omitempty"`
}

CriterionResult holds the result of evaluating a single criterion.

type EvalReport ¶

type EvalReport struct {
	Passed  int
	Failed  int
	Results []EvalResult
}

EvalReport holds the aggregate results of an evaluation run.

type EvalResult ¶

type EvalResult struct {
	Messages []Message
	Response string
	Result   ChatResult
	Pass     bool
	Reason   string
}

EvalResult holds the result of a single evaluation scenario.

type EvalScenario ¶

type EvalScenario struct {
	Messages []Message
	Check    CheckFunc
}

EvalScenario is a test scenario with an expected outcome check.
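
To illustrate how EvalScenario, CheckFunc, and EvalReport fit together, here is a minimal offline sketch. It uses local copies of the documented types and a hypothetical tally function that takes a canned responder in place of the live chat service; it is not how Client.Evaluate is implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented types so the sketch compiles on its own.
// EvalResult omits the Result ChatResult field for brevity.
type Message struct {
	Role    string
	Content string
}

type CheckFunc func(response string) (pass bool, reason string)

type EvalScenario struct {
	Messages []Message
	Check    CheckFunc
}

type EvalResult struct {
	Messages []Message
	Response string
	Pass     bool
	Reason   string
}

type EvalReport struct {
	Passed  int
	Failed  int
	Results []EvalResult
}

// tally is a hypothetical offline stand-in for Client.Evaluate: respond
// replaces the call to the chat service.
func tally(scenarios []EvalScenario, respond func([]Message) string) EvalReport {
	var report EvalReport
	for _, s := range scenarios {
		response := respond(s.Messages)
		pass, reason := s.Check(response)
		if pass {
			report.Passed++
		} else {
			report.Failed++
		}
		report.Results = append(report.Results, EvalResult{
			Messages: s.Messages, Response: response, Pass: pass, Reason: reason,
		})
	}
	return report
}

// contains is a local stand-in for the documented ResponseContains.
func contains(substr string) CheckFunc {
	return func(response string) (bool, string) {
		if strings.Contains(response, substr) {
			return true, ""
		}
		return false, "missing substring: " + substr
	}
}

func main() {
	scenarios := []EvalScenario{
		{Messages: []Message{{Role: "user", Content: "ping"}}, Check: contains("pong")},
		{Messages: []Message{{Role: "user", Content: "ping"}}, Check: contains("error")},
	}
	// Canned responder used instead of a live service.
	report := tally(scenarios, func([]Message) string { return "pong" })
	fmt.Printf("passed=%d failed=%d\n", report.Passed, report.Failed)
}
```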

type Judge ¶

type Judge interface {
	Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)
}

Judge is an interface for LLM-based grading. See judge.go for implementation.
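
Since Judge is an interface, a deterministic stub can stand in for an LLM when testing grading logic. The following is a sketch only: the interface and JudgeResult are redeclared locally, and keywordJudge and runJudge are hypothetical names that do not exist in the package.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// JudgeResult and Judge mirror the documented declarations; they are
// redeclared here so the sketch compiles on its own.
type JudgeResult struct {
	Pass   bool    `json:"pass"`
	Score  float64 `json:"score"`
	Reason string  `json:"reason"`
}

type Judge interface {
	Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)
}

// keywordJudge is a hypothetical deterministic stand-in for an LLM judge:
// it passes when the response mentions the configured keyword.
type keywordJudge struct {
	keyword string
}

func (j keywordJudge) Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // honor cancellation like a real network-backed judge would
	}
	if strings.Contains(strings.ToLower(response), strings.ToLower(j.keyword)) {
		return &JudgeResult{Pass: true, Score: 1, Reason: "keyword found"}, nil
	}
	return &JudgeResult{Pass: false, Score: 0, Reason: "keyword missing"}, nil
}

// runJudge grades one response with a background context.
func runJudge(j Judge, response string) (*JudgeResult, error) {
	return j.Grade(context.Background(), response, "A warm, friendly greeting.", "Response greets the user")
}

func main() {
	res, err := runJudge(keywordJudge{keyword: "hello"}, "Hello there!")
	if err != nil {
		fmt.Println("grade failed:", err)
		return
	}
	fmt.Println(res.Pass, res.Score, res.Reason)
}
```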

type JudgeResult ¶

type JudgeResult struct {
	Pass   bool    `json:"pass"`
	Score  float64 `json:"score"`
	Reason string  `json:"reason"`
}

JudgeResult holds the output of an LLM judge evaluation.

type Message ¶

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

Message is a single message in a conversation scenario.

type OllamaJudge ¶

type OllamaJudge struct {
	// contains filtered or unexported fields
}

OllamaJudge implements Judge using an LLM via TextGenerator.

func NewOllamaJudge ¶

func NewOllamaJudge(generate TextGenerator) *OllamaJudge

NewOllamaJudge creates a judge backed by the given text generator.

func (*OllamaJudge) Grade ¶

func (j *OllamaJudge) Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)

Grade sends the response, ideal response, and criterion to the LLM judge and parses a structured JSON result.
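
The description above implies two steps: call a TextGenerator, then decode a JSON verdict from its output. The sketch below shows one way that could look. The prompt text, the parseJudgeReply helper, and the cannedGenerator stub are assumptions for illustration; the actual prompt and parsing rules used by OllamaJudge are not documented here.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"strings"
)

// TextGenerator and JudgeResult mirror the documented declarations.
type TextGenerator func(ctx context.Context, prompt string, temperature float64, maxTokens int) (string, error)

type JudgeResult struct {
	Pass   bool    `json:"pass"`
	Score  float64 `json:"score"`
	Reason string  `json:"reason"`
}

// parseJudgeReply is a hypothetical helper: it decodes the span from the
// first '{' to the last '}' in the model output, tolerating surrounding text.
func parseJudgeReply(raw string) (*JudgeResult, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return nil, fmt.Errorf("no JSON object in judge reply: %q", raw)
	}
	var res JudgeResult
	if err := json.Unmarshal([]byte(raw[start:end+1]), &res); err != nil {
		return nil, fmt.Errorf("decode judge reply: %w", err)
	}
	return &res, nil
}

// cannedGenerator is a stub TextGenerator that always returns reply.
func cannedGenerator(reply string) TextGenerator {
	return func(ctx context.Context, prompt string, temperature float64, maxTokens int) (string, error) {
		return reply, nil
	}
}

func main() {
	generate := cannedGenerator(`Verdict: {"pass": true, "score": 0.9, "reason": "warm tone"}`)
	raw, err := generate(context.Background(), "hypothetical grading prompt", 0, 256)
	if err != nil {
		fmt.Println("generate failed:", err)
		return
	}
	res, err := parseJudgeReply(raw)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("%+v\n", *res)
}
```

Any function with the TextGenerator signature could be passed to NewOllamaJudge; a canned generator like the one above may be useful for exercising grading code without a model.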

type Plan ¶

type Plan struct {
	Name      string         `yaml:"name"`
	Scenarios []PlanScenario `yaml:"scenarios"`
}

Plan is a YAML-driven test plan containing named scenarios.

func LoadPlan ¶

func LoadPlan(path string) (*Plan, error)

LoadPlan reads and validates a YAML test plan from the given path.

type PlanScenario ¶

type PlanScenario struct {
	Name          string      `yaml:"name"`
	Message       string      `yaml:"message"`
	IdealResponse string      `yaml:"ideal_response"`
	MaxDurationMs int64       `yaml:"max_duration_ms"`
	Tools         ToolExpect  `yaml:"tools"`
	Rubric        []Criterion `yaml:"rubric"`
}

PlanScenario is a single test scenario within a plan.

type Run ¶

type Run struct {
	ID        string
	Responses []ScenarioResult
}

Run holds the results of a test run.

type Scenario ¶

type Scenario struct {
	Name     string
	Messages []Message
}

Scenario is a test scenario to execute during a run.

func Conversation ¶

func Conversation(name string, messages []Message) Scenario

Conversation creates a scenario from a name and messages.

type ScenarioGrade ¶

type ScenarioGrade struct {
	Scenario string            `json:"scenario"`
	Results  []CriterionResult `json:"results"`
	Passed   int               `json:"passed"`
	Failed   int               `json:"failed"`
	Total    int               `json:"total"`
}

ScenarioGrade holds the aggregate grading results for a scenario.

func GradeScenario ¶

func GradeScenario(ctx context.Context, scenario PlanScenario, result ChatResult, judge Judge) *ScenarioGrade

GradeScenario evaluates a scenario's rubric criteria and auto-checks against a chat result. If judge is nil, llm_judge criteria are skipped.
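
As a rough illustration of how the deterministic rubric types (contains, not_contains, min_length, max_length) could be evaluated, consider the sketch below. It is not the package's implementation: checkCriterion is a hypothetical function, the types are local copies, and counting length in runes rather than bytes is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// Local copies of the documented types so the sketch compiles on its own.
type Criterion struct {
	Type      string `yaml:"type"`
	Value     string `yaml:"value"`
	Criterion string `yaml:"criterion"`
}

type CriterionResult struct {
	Criterion string  `json:"criterion"`
	Pass      bool    `json:"pass"`
	Score     float64 `json:"score"`
	Reason    string  `json:"reason,omitempty"`
}

// checkCriterion is a hypothetical evaluator for the deterministic rubric
// types. The second return value is false for types it does not handle
// (for example llm_judge).
func checkCriterion(c Criterion, response string) (CriterionResult, bool) {
	res := CriterionResult{Criterion: c.Type + ":" + c.Value}
	switch c.Type {
	case "contains":
		res.Pass = strings.Contains(response, c.Value)
	case "not_contains":
		res.Pass = !strings.Contains(response, c.Value)
	case "min_length", "max_length":
		n, err := strconv.Atoi(c.Value) // rubric values are strings in the YAML plan
		if err != nil {
			res.Reason = "value is not an integer: " + c.Value
			return res, true
		}
		length := utf8.RuneCountInString(response) // assumption: length is counted in runes
		if c.Type == "min_length" {
			res.Pass = length >= n
		} else {
			res.Pass = length <= n
		}
	default:
		return res, false
	}
	if res.Pass {
		res.Score = 1
	}
	return res, true
}

func main() {
	rubric := []Criterion{
		{Type: "min_length", Value: "20"},
		{Type: "not_contains", Value: "error"},
		{Type: "llm_judge", Criterion: "Response is warm and conversational in tone"},
	}
	response := "Hello! I'm doing well, thanks for asking."
	passed, failed := 0, 0
	for _, c := range rubric {
		res, handled := checkCriterion(c, response)
		if !handled {
			continue // llm_judge is skipped here, as it is when judge is nil
		}
		if res.Pass {
			passed++
		} else {
			failed++
		}
	}
	fmt.Printf("passed=%d failed=%d\n", passed, failed)
}
```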

type ScenarioResult ¶

type ScenarioResult struct {
	Name      string
	Responses []ChatResult
}

ScenarioResult holds the result of a single scenario.

type TextGenerator ¶

type TextGenerator func(ctx context.Context, prompt string, temperature float64, maxTokens int) (string, error)

TextGenerator produces text from a prompt. Matches the axon-memo pattern.

type ToolExpect ¶

type ToolExpect struct {
	Expect []string `yaml:"expect"`
	Reject []string `yaml:"reject"`
}

ToolExpect specifies which tools should or should not be used.
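
A tool expectation can be compared against the ToolsUsed slice of a ChatResult. The sketch below shows one plausible comparison using a local copy of the type; toolViolations and the tool names are hypothetical, and the package's own matching rules may differ.

```go
package main

import "fmt"

// ToolExpect mirrors the documented declaration.
type ToolExpect struct {
	Expect []string `yaml:"expect"`
	Reject []string `yaml:"reject"`
}

// toolViolations is a hypothetical helper: it reports every expected tool
// that was not used and every rejected tool that was used.
func toolViolations(want ToolExpect, used []string) []string {
	seen := make(map[string]bool, len(used))
	for _, t := range used {
		seen[t] = true
	}
	var problems []string
	for _, t := range want.Expect {
		if !seen[t] {
			problems = append(problems, "expected tool not used: "+t)
		}
	}
	for _, t := range want.Reject {
		if seen[t] {
			problems = append(problems, "rejected tool was used: "+t)
		}
	}
	return problems
}

func main() {
	// Tool names are made up for the example.
	want := ToolExpect{Expect: []string{"web_search"}, Reject: []string{"send_email"}}
	for _, p := range toolViolations(want, []string{"send_email"}) {
		fmt.Println(p)
	}
}
```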

type VerifyResult ¶

type VerifyResult struct {
	RunID                 string `json:"run_id"`
	Messages              int    `json:"messages"`
	ToolInvocations       int    `json:"tool_invocations"`
	Conversations         int    `json:"conversations"`
	Memories              int    `json:"memories"`
	RelationshipSnapshots int    `json:"relationship_snapshots"`
	Consolidations        int    `json:"consolidations"`
}

VerifyResult holds the event counts for a run, queried from the analytics service.
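
The json tags on VerifyResult suggest how a response body might map onto the struct. The sketch below decodes a sample payload into a local copy of the type; the flat payload shape and the run ID are assumptions, since the analytics service's wire format is not documented here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VerifyResult mirrors the documented declaration.
type VerifyResult struct {
	RunID                 string `json:"run_id"`
	Messages              int    `json:"messages"`
	ToolInvocations       int    `json:"tool_invocations"`
	Conversations         int    `json:"conversations"`
	Memories              int    `json:"memories"`
	RelationshipSnapshots int    `json:"relationship_snapshots"`
	Consolidations        int    `json:"consolidations"`
}

// decodeVerify decodes a JSON payload into a VerifyResult. The flat payload
// shape is an assumption based on the struct's json tags.
func decodeVerify(payload []byte) (*VerifyResult, error) {
	var res VerifyResult
	if err := json.Unmarshal(payload, &res); err != nil {
		return nil, fmt.Errorf("decode verify result: %w", err)
	}
	return &res, nil
}

func main() {
	payload := []byte(`{"run_id":"run-123","messages":4,"tool_invocations":1,"conversations":2}`)
	res, err := decodeVerify(payload)
	if err != nil {
		fmt.Println(err)
		return
	}
	// Fields absent from the payload keep their zero value.
	fmt.Printf("run=%s messages=%d memories=%d\n", res.RunID, res.Messages, res.Memories)
}
```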

Directories ¶

Path	Synopsis
bfcl	Package bfcl runs the Berkeley Function Calling Leaderboard (BFCL) benchmark against any loop.LLMClient.
cmd/bfcl-run (command)	bfcl-run executes the full BFCL benchmark against a loop.LLMClient and reports per-category and overall accuracy.
cmd
  eval-ingest (command)	eval-ingest reads a JSON RunReport from stdin and records each result as a fact via the Pipeline to ClickHouse.
  explorer (command)
explorer
