Documentation ¶
Overview ¶
Package eval provides an evaluation framework for running scenario plans against a live service cluster. It supports YAML-defined test plans with assertions on HTTP responses and LLM-generated content.
Class: experiment UseWhen: LLM output evaluation.
Index ¶
- func BFCLResultSchema() fact.Schema
- func LuthierResultSchema() fact.Schema
- type ChatResult
- type CheckFunc
- type Client
- func (c *Client) EmitEvalResult(runID string, grade *ScenarioGrade, result ChatResult) error
- func (c *Client) Evaluate(description string, scenarios []EvalScenario) (*EvalReport, error)
- func (c *Client) Run(description string, scenarios []Scenario) (*Run, error)
- func (c *Client) Verify(runID string) (*VerifyResult, error)
- type Config
- type Criterion
- type CriterionResult
- type EvalReport
- type EvalResult
- type EvalScenario
- type Judge
- type JudgeResult
- type Message
- type OllamaJudge
- type Plan
- type PlanScenario
- type Run
- type Scenario
- type ScenarioGrade
- type ScenarioResult
- type TextGenerator
- type ToolExpect
- type VerifyResult
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func BFCLResultSchema ¶ added in v0.3.0
BFCLResultSchema returns the fact.Schema for per-case BFCL eval results. One row per test case, ordered by (model, category, timestamp).
func LuthierResultSchema ¶ added in v0.3.0
LuthierResultSchema returns the fact.Schema for per-run luthier eval results. One row per evaluation run, ordered by (model, timestamp).
Types ¶
type ChatResult ¶
type ChatResult struct {
Response string `json:"response"`
Thinking string `json:"thinking,omitempty"`
DurationMs int64 `json:"duration_ms"`
ToolsUsed []string `json:"tools_used"`
}
ChatResult holds the response from a synchronous chat request.
type CheckFunc ¶
CheckFunc evaluates a response and returns pass/fail with a reason.
func ResponseContains ¶
ResponseContains returns a check that passes if the response contains the substring.
func ResponseMinLength ¶
ResponseMinLength returns a check that passes if the response has at least n characters.
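The two check constructors above can be sketched in a self-contained way. This page does not show the real CheckFunc signature, so the `checkFunc` type below, the lowercase helper names, and the choice to count runes rather than bytes are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// checkFunc is a hypothetical stand-in for eval.CheckFunc; the real
// signature is not shown in this documentation.
type checkFunc func(response string) (pass bool, reason string)

// responseContains returns a check that passes when the response
// contains substr.
func responseContains(substr string) checkFunc {
	return func(response string) (bool, string) {
		if strings.Contains(response, substr) {
			return true, ""
		}
		return false, fmt.Sprintf("response does not contain %q", substr)
	}
}

// responseMinLength returns a check that passes when the response has at
// least n characters. Counting runes instead of bytes is an assumption.
func responseMinLength(n int) checkFunc {
	return func(response string) (bool, string) {
		if got := utf8.RuneCountInString(response); got < n {
			return false, fmt.Sprintf("response has %d characters, want at least %d", got, n)
		}
		return true, ""
	}
}

func main() {
	checks := []checkFunc{responseContains("Paris"), responseMinLength(10)}
	for _, check := range checks {
		pass, reason := check("The capital of France is Paris.")
		fmt.Println(pass, reason)
	}
}
```

Returning a closure keeps each check's parameter (the substring or the minimum length) bound at construction time, so a scenario can carry a single function value.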
type Client ¶
type Client struct {
// contains filtered or unexported fields
}
Client is the main entry point for running test scenarios and evaluations. Call NewClient to create one; it handles service user and agent setup.
func NewClient ¶
NewClient creates a test client, setting up the service user and test agent. This calls the auth service to create a robot user, notifies the chat service, and creates the xagent test agent.
func (*Client) EmitEvalResult ¶
func (c *Client) EmitEvalResult(runID string, grade *ScenarioGrade, result ChatResult) error
EmitEvalResult sends an eval_result event to the analytics service.
func (*Client) Evaluate ¶
func (c *Client) Evaluate(description string, scenarios []EvalScenario) (*EvalReport, error)
Evaluate runs evaluation scenarios and checks responses against expected outcomes.
type Config ¶
type Config struct {
AuthURL string
ChatURL string
AnalyticsURL string
AgentSlug string // defaults to "xagent" if empty
}
Config holds the URLs of the services that axon-eval interacts with, plus the slug of the agent to test.
type Criterion ¶
type Criterion struct {
Type string `yaml:"type"`
Value string `yaml:"value"`
Criterion string `yaml:"criterion"`
}
Criterion is a single rubric item for grading a scenario.
type CriterionResult ¶
type CriterionResult struct {
Criterion string `json:"criterion"`
Pass bool `json:"pass"`
Score float64 `json:"score"`
Reason string `json:"reason,omitempty"`
}
CriterionResult holds the result of evaluating a single criterion.
type EvalReport ¶
type EvalReport struct {
Passed int
Failed int
Results []EvalResult
}
EvalReport holds the aggregate results of an evaluation run.
type EvalResult ¶
type EvalResult struct {
Messages []Message
Response string
Result ChatResult
Pass bool
Reason string
}
EvalResult holds the result of a single evaluation scenario.
type EvalScenario ¶
EvalScenario is a test scenario with an expected outcome check.
type Judge ¶
type Judge interface {
Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)
}
Judge is an interface for LLM-based grading. See judge.go for implementation.
type JudgeResult ¶
type JudgeResult struct {
Pass bool `json:"pass"`
Score float64 `json:"score"`
Reason string `json:"reason"`
}
JudgeResult holds the output of an LLM judge evaluation.
type OllamaJudge ¶
type OllamaJudge struct {
// contains filtered or unexported fields
}
OllamaJudge implements Judge using an LLM via TextGenerator.
func NewOllamaJudge ¶
func NewOllamaJudge(generate TextGenerator) *OllamaJudge
NewOllamaJudge creates a judge backed by the given text generator.
func (*OllamaJudge) Grade ¶
func (j *OllamaJudge) Grade(ctx context.Context, response, idealResponse, criterion string) (*JudgeResult, error)
Grade sends the response, ideal response, and criterion to the LLM judge and parses a structured JSON result.
type Plan ¶
type Plan struct {
Name string `yaml:"name"`
Scenarios []PlanScenario `yaml:"scenarios"`
}
Plan is a YAML-driven test plan containing named scenarios.
type PlanScenario ¶
type PlanScenario struct {
Name string `yaml:"name"`
Message string `yaml:"message"`
IdealResponse string `yaml:"ideal_response"`
MaxDurationMs int64 `yaml:"max_duration_ms"`
Tools ToolExpect `yaml:"tools"`
Rubric []Criterion `yaml:"rubric"`
}
PlanScenario is a single test scenario within a plan.
type Run ¶
type Run struct {
ID string
Responses []ScenarioResult
}
Run holds the results of a test run.
type Scenario ¶
Scenario is a test scenario to execute during a run.
func Conversation ¶
Conversation creates a scenario from a name and messages.
type ScenarioGrade ¶
type ScenarioGrade struct {
Scenario string `json:"scenario"`
Results []CriterionResult `json:"results"`
Passed int `json:"passed"`
Failed int `json:"failed"`
Total int `json:"total"`
}
ScenarioGrade holds the aggregate grading results for a scenario.
func GradeScenario ¶
func GradeScenario(ctx context.Context, scenario PlanScenario, result ChatResult, judge Judge) *ScenarioGrade
GradeScenario evaluates a scenario's rubric criteria and auto-checks against a chat result. If judge is nil, llm_judge criteria are skipped.
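The tallying behind a ScenarioGrade can be illustrated with a minimal sketch. This is not the package's implementation: `gradeSketch` and the `"contains"` criterion type are hypothetical, and excluding skipped `llm_judge` items from Total is an assumption. Only the `llm_judge` type name and the struct fields come from this page.

```go
package main

import (
	"fmt"
	"strings"
)

// Types copied from the documentation above (struct tags omitted).
type Criterion struct {
	Type      string
	Value     string
	Criterion string
}

type CriterionResult struct {
	Criterion string
	Pass      bool
	Score     float64
	Reason    string
}

type ScenarioGrade struct {
	Scenario string
	Results  []CriterionResult
	Passed   int
	Failed   int
	Total    int
}

// gradeSketch tallies rubric results for one response. "contains" is a
// hypothetical criterion type. "llm_judge" items are skipped, mirroring
// what GradeScenario is documented to do when the judge is nil.
func gradeSketch(name string, rubric []Criterion, response string) *ScenarioGrade {
	grade := &ScenarioGrade{Scenario: name}
	for _, c := range rubric {
		if c.Type == "llm_judge" {
			continue // no judge available in this sketch
		}
		res := CriterionResult{Criterion: c.Criterion}
		if c.Type == "contains" && strings.Contains(response, c.Value) {
			res.Pass, res.Score = true, 1
		} else {
			res.Reason = fmt.Sprintf("criterion type %q with value %q not satisfied", c.Type, c.Value)
		}
		grade.Results = append(grade.Results, res)
		if res.Pass {
			grade.Passed++
		} else {
			grade.Failed++
		}
	}
	// Assumption: Total counts only the criteria that were evaluated.
	grade.Total = len(grade.Results)
	return grade
}

func main() {
	rubric := []Criterion{
		{Type: "contains", Value: "Paris", Criterion: "mentions Paris"},
		{Type: "contains", Value: "Rome", Criterion: "mentions Rome"},
		{Type: "llm_judge", Criterion: "answer is polite"},
	}
	grade := gradeSketch("capital", rubric, "The capital of France is Paris.")
	fmt.Printf("passed=%d failed=%d total=%d\n", grade.Passed, grade.Failed, grade.Total)
}
```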
type ScenarioResult ¶
type ScenarioResult struct {
Name string
Responses []ChatResult
}
ScenarioResult holds the result of a single scenario.
type TextGenerator ¶
type TextGenerator func(ctx context.Context, prompt string, temperature float64, maxTokens int) (string, error)
TextGenerator produces text from a prompt. Matches the axon-memo pattern.
type ToolExpect ¶
ToolExpect specifies which tools should or should not be used.
type VerifyResult ¶
type VerifyResult struct {
RunID string `json:"run_id"`
Messages int `json:"messages"`
ToolInvocations int `json:"tool_invocations"`
Conversations int `json:"conversations"`
Memories int `json:"memories"`
RelationshipSnapshots int `json:"relationship_snapshots"`
Consolidations int `json:"consolidations"`
}
VerifyResult holds the event counts for a run, queried from the analytics service.
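Given the JSON tags on VerifyResult, decoding a response body into it could look like the sketch below. The payload is invented for illustration and `decodeVerifyResult` is a hypothetical helper; how the analytics service is actually queried is not shown on this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VerifyResult is copied from the documentation above.
type VerifyResult struct {
	RunID                 string `json:"run_id"`
	Messages              int    `json:"messages"`
	ToolInvocations       int    `json:"tool_invocations"`
	Conversations         int    `json:"conversations"`
	Memories              int    `json:"memories"`
	RelationshipSnapshots int    `json:"relationship_snapshots"`
	Consolidations        int    `json:"consolidations"`
}

// decodeVerifyResult is a hypothetical helper that parses a JSON payload.
// Counts absent from the payload keep their zero value.
func decodeVerifyResult(payload []byte) (*VerifyResult, error) {
	var vr VerifyResult
	if err := json.Unmarshal(payload, &vr); err != nil {
		return nil, fmt.Errorf("decode verify result: %w", err)
	}
	return &vr, nil
}

func main() {
	// Illustrative payload, not captured from a real service.
	payload := []byte(`{"run_id":"run-123","messages":4,"tool_invocations":2,"conversations":1}`)
	vr, err := decodeVerifyResult(payload)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", *vr)
}
```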
Directories ¶
| Path | Synopsis |
|---|---|
| | Package bfcl runs the Berkeley Function Calling Leaderboard (BFCL) benchmark against any loop.LLMClient. |
| cmd/bfcl-run (command) | bfcl-run executes the full BFCL benchmark against a loop.LLMClient and reports per-category and overall accuracy. |
| cmd | |
| eval-ingest (command) | eval-ingest reads a JSON RunReport from stdin and records each result as a fact via the Pipeline to ClickHouse. |
| explorer (command) | |