scoreboard

package

v0.3.0 (latest) Published: Mar 2, 2026 License: Apache-2.0 Imports: 8 Imported by: 0

Repository

github.com/maruel/genai

Documentation ¶

Overview ¶

Package scoreboard declares the structures to define a scoreboard.

It is in a separate package from genai to reduce noise.

Index ¶

Variables
func CompareScenarios(a, b Scenario) int
type Functionality
- func (f *Functionality) Less(rhs *Functionality) bool
- func (f *Functionality) Validate() error
type ModalCapability
type Modality
- func (m Modality) Validate() error
type Model
- func (m *Model) String() string
type Reason
type Scenario
- func ConsolidateUntestedScenarios(scenarios []Scenario) []Scenario
- func (s *Scenario) Untested() bool
- func (s *Scenario) Validate() error
type Score
- func (s *Score) SortScenarios()
- func (s *Score) Validate() error
type TriState

Constants ¶

This section is empty.

Variables ¶

var TestdataFiles embed.FS

TestdataFiles embeds the testdata/ directory for use in smoke tests.

They are the canonical data to be used to declare the supported modalities.

Functions ¶

func CompareScenarios ¶

func CompareScenarios(a, b Scenario) int

CompareScenarios compares two scenarios for sorting. Scenarios are sorted by preference flags: SOTA (0), Good (1), Cheap (2), then others. Within the same preference, untested scenarios are sorted last. Within the same preference and tested status, reasoning scenarios come before non-reasoning ones. Within the same preference, tested status, and reasoning status, scenarios are sorted alphabetically by first model name.
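
The ordering described above can be sketched as a chain of tie-breakers. This is not the package's implementation: the scenario type is a local stand-in, its tested field replaces the Untested method (whose criteria are not shown here), and the model names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// scenario is a local stand-in with only the fields the ordering depends on.
type scenario struct {
	models            []string
	sota, good, cheap bool
	reason            bool
	tested            bool // assumption: stands in for !Untested()
}

// rank maps the preference flags to the documented ranks.
func rank(s scenario) int {
	switch {
	case s.sota:
		return 0
	case s.good:
		return 1
	case s.cheap:
		return 2
	}
	return 3
}

// boolRank returns 0 when first is true, so that such entries sort earlier.
func boolRank(first bool) int {
	if first {
		return 0
	}
	return 1
}

// firstModel returns the first model name, or "" when there is none.
func firstModel(s scenario) string {
	if len(s.models) == 0 {
		return ""
	}
	return s.models[0]
}

// compare applies the tie-breakers in order: preference, tested status,
// reasoning, then first model name.
func compare(a, b scenario) int {
	if d := rank(a) - rank(b); d != 0 {
		return d
	}
	if d := boolRank(a.tested) - boolRank(b.tested); d != 0 {
		return d
	}
	if d := boolRank(a.reason) - boolRank(b.reason); d != 0 {
		return d
	}
	return strings.Compare(firstModel(a), firstModel(b))
}

// sortedNames sorts a copy of in and returns the first model of each entry.
func sortedNames(in []scenario) []string {
	s := append([]scenario(nil), in...)
	sort.SliceStable(s, func(i, j int) bool { return compare(s[i], s[j]) < 0 })
	names := make([]string, len(s))
	for i := range s {
		names[i] = firstModel(s[i])
	}
	return names
}

func main() {
	fmt.Println(sortedNames([]scenario{
		{models: []string{"m-plain"}, tested: true},
		{models: []string{"m-untested"}},
		{models: []string{"m-cheap"}, cheap: true, tested: true},
		{models: []string{"m-think"}, reason: true, tested: true},
		{models: []string{"m-sota"}, sota: true, tested: true},
	}))
	// Prints: [m-sota m-cheap m-think m-plain m-untested]
}
```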

Types ¶

type Functionality ¶

type Functionality struct {
	// ReportRateLimits means that the provider reports rate limits in its Usage.
	ReportRateLimits bool `json:"reportRateLimits,omitzero"`
	// ReportTokenUsage means that the token usage is correctly reported in all cases. It is flaky if it is not
	// reported in some specific cases. A frequent example is tokens not being reported in JSON output mode.
	ReportTokenUsage TriState `json:"reportTokenUsage,omitzero"`
	// ReportFinishReason means that the finish reason (FinishStop, FinishLength, etc.) is correctly reported.
	ReportFinishReason TriState `json:"reportFinishReason,omitzero"`
	// Seed is set when the provider and model combination supports seed for reproducibility.
	Seed bool `json:"seed,omitzero"`

	// Tools means that tool calling is supported. This is a requirement for MCP. Some providers support tool
	// calling but the model is very flaky at actually requesting the calls. This is more frequent on highly
	// quantized models, small models or MoE models.
	Tools TriState `json:"tools,omitzero"`
	// ToolsBiased is True when, asked to use a tool on an ambiguous, biased question, the LLM always replies
	// with the first readily available answer.
	//
	// This means that when using enum, it is important to understand that the LLM will put heavy weight on the
	// first option.
	//
	// This is affected by two factors: model size and quantization. Quantization affects this dramatically.
	ToolsBiased TriState `json:"toolsBiased,omitzero"`
	// ToolsIndecisive is True when, asked to use a tool on an ambiguous, biased question, the LLM calls both
	// options. It is Flaky when either behavior can happen.
	//
	// This is actually fine; it means that the LLM will be less opinionated in some cases. The cases in which
	// an LLM is indecisive are likely model-specific too.
	ToolsIndecisive TriState `json:"toolsIndecisive,omitzero"`
	// ToolCallRequired is true when the value genai.ToolCallRequired works. Not supporting it significantly
	// increases the risk of flakiness.
	ToolCallRequired bool `json:"toolCallRequired,omitzero"`
	// WebSearch is true if the provider supports web search via its own backend.
	WebSearch bool `json:"webSearch,omitzero"`
	// WebFetch is true if the provider supports fetching content from URLs via its own backend.
	WebFetch bool `json:"webFetch,omitzero"`

	// JSON means that the model supports enforcing that the response is valid JSON but not necessarily with a
	// schema.
	JSON bool `json:"json,omitzero"`
	// JSONSchema means that the model supports enforcing that the response is a specific JSON schema.
	JSONSchema bool `json:"jsonSchema,omitzero"`
	// Citations is set when the provider and model combination supports citations in the response.
	Citations bool `json:"citations,omitzero"`
	// TopLogprobs is set when the provider and model combination supports top_logprobs.
	TopLogprobs bool `json:"topLogprobs,omitzero"`
	// MaxTokens means that the provider supports limiting text output to a specific number of tokens.
	//
	// Tokens are neither characters nor words. The tokens are embedding specific, and each model family uses
	// a different vocabulary. Thus the number of characters generated varies wildly.
	//
	// It fails more often with models with implicit reasoning.
	MaxTokens bool `json:"maxTokens,omitzero"`
	// StopSequence means that the provider supports stop words. The number of stop words is generally limited,
	// frequently to 5 words. The sequence should be a valid token in the model's vocabulary.
	StopSequence bool `json:"stopSequence,omitzero"`
	// contains filtered or unexported fields
}

Functionality defines which functionalities are supported in a scenario.

The first group is for all models. The remainder is for text models.

The second group is about tool use, which as of 2025-08 is only supported for text models.

The third group is about text-specific features.
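
One way a caller might act on the JSON and JSONSchema flags is to choose the strictest structured-output mode that a scenario declares. The sketch below assumes a local stand-in type; the outputMode helper and the mode names "schema", "json" and "prompt" are hypothetical and not part of the package.

```go
package main

import "fmt"

// functionality is a local stand-in holding only the two structured-output
// flags documented above.
type functionality struct {
	JSON       bool
	JSONSchema bool
}

// outputMode picks the strictest declared mode: schema enforcement if
// available, then plain JSON enforcement, else prompt-only instructions.
// A nil pointer is treated as "nothing declared".
func outputMode(f *functionality) string {
	switch {
	case f == nil:
		return "prompt"
	case f.JSONSchema:
		return "schema"
	case f.JSON:
		return "json"
	}
	return "prompt"
}

func main() {
	fmt.Println(outputMode(&functionality{JSON: true, JSONSchema: true}))
	fmt.Println(outputMode(&functionality{JSON: true}))
	fmt.Println(outputMode(nil))
}
```

The nil check matters because Scenario stores GenSync and GenStream as pointers, which may be unset.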

func (*Functionality) Less ¶

func (f *Functionality) Less(rhs *Functionality) bool

Less returns true if the functionality is less than the other.

func (*Functionality) Validate ¶

func (f *Functionality) Validate() error

Validate returns an error if the Functionality contains invalid values.

type ModalCapability ¶

type ModalCapability struct {
	// Inline means content can be embedded directly (e.g., base64 encoded)
	Inline bool `json:"inline,omitzero"`
	// URL means content can be referenced by URL
	URL bool `json:"url,omitzero"`
	// MaxSize specifies the maximum size in bytes.
	MaxSize int64 `json:"maxSize,omitzero"`
	// SupportedFormats lists supported MIME types for this modality
	SupportedFormats []string `json:"supportedFormats,omitzero"`
}

ModalCapability describes how a modality is supported by a provider.
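
The sketch below shows how the fields of a ModalCapability could be combined to decide whether a payload can be sent inline. It uses a local stand-in type and a hypothetical acceptsInline helper. It also assumes, without confirmation from the package, that a zero MaxSize means no limit is declared and that an empty SupportedFormats means no format restriction is declared.

```go
package main

import "fmt"

// modalCapability is a local stand-in mirroring the documented fields.
type modalCapability struct {
	Inline           bool
	URL              bool
	MaxSize          int64
	SupportedFormats []string
}

// acceptsInline reports whether a payload with the given MIME type and size
// in bytes could be embedded directly.
// Assumptions: MaxSize == 0 means "no limit declared"; an empty
// SupportedFormats means "no restriction declared".
func acceptsInline(c modalCapability, mime string, size int64) bool {
	if !c.Inline {
		return false
	}
	if c.MaxSize > 0 && size > c.MaxSize {
		return false
	}
	if len(c.SupportedFormats) == 0 {
		return true
	}
	for _, f := range c.SupportedFormats {
		if f == mime {
			return true
		}
	}
	return false
}

func main() {
	c := modalCapability{
		Inline:           true,
		MaxSize:          1 << 20,
		SupportedFormats: []string{"image/png", "image/jpeg"},
	}
	fmt.Println(acceptsInline(c, "image/png", 4096))
	fmt.Println(acceptsInline(c, "image/gif", 4096))
	fmt.Println(acceptsInline(c, "image/png", 2<<20))
}
```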

type Modality ¶

type Modality string

Modality is one of the supported modalities.

const (
	// ModalityAudio is support for audio formats like MP3, WAV, Opus, Flac, etc.
	ModalityAudio Modality = "audio"
	// ModalityDocument is support for PDF with multi-modal comprehension, both images and text. This includes
	// code blocks.
	ModalityDocument Modality = "document"
	// ModalityImage is support for image formats like PNG, JPEG, often single frame GIF, and WEBP.
	ModalityImage Modality = "image"
	// ModalityText is for raw text.
	ModalityText Modality = "text"
	// ModalityVideo is support for video formats like MP4 or MKV.
	ModalityVideo Modality = "video"
)

func (Modality) Validate ¶

func (m Modality) Validate() error

Validate returns an error if the Modality is not a known value.
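
A validation method for a string-backed enumeration like Modality is commonly written as a switch over the known constants. The following is a minimal sketch using a local stand-in type; the error message wording is an assumption and may differ from what the package returns.

```go
package main

import "fmt"

// modality is a local stand-in for the documented Modality type.
type modality string

const (
	modalityAudio    modality = "audio"
	modalityDocument modality = "document"
	modalityImage    modality = "image"
	modalityText     modality = "text"
	modalityVideo    modality = "video"
)

// validate returns nil for the five known values and an error otherwise.
// The error text is hypothetical.
func (m modality) validate() error {
	switch m {
	case modalityAudio, modalityDocument, modalityImage, modalityText, modalityVideo:
		return nil
	}
	return fmt.Errorf("invalid modality %q", string(m))
}

func main() {
	fmt.Println(modality("image").validate())
	fmt.Println(modality("smell").validate())
}
```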

type Model ¶

type Model struct {
	Model  string
	Reason bool
}

Model specifies a model to test and whether it should run in reasoning mode.

Most models only support one or the other, but some support both. Functionality often differs depending on whether reasoning is enabled.

func (*Model) String ¶

func (m *Model) String() string

type Reason ¶

type Reason int8

Reason specifies if a model Scenario supports reasoning (thinking).

const (
	// ReasonNone means that no reasoning is supported.
	ReasonNone Reason = 0
	// ReasonInline means that the reasoning tokens are inline and must be explicitly parsed from Content.Text
	// with adapters.ProviderReasoning.
	ReasonInline Reason = 1
	// ReasonAuto means that the reasoning tokens are properly generated and handled by the provider and are
	// returned as Content.Reasoning.
	ReasonAuto Reason = -1
)

func (Reason) MarshalJSON ¶

func (t Reason) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler.

func (*Reason) UnmarshalJSON ¶

func (t *Reason) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler.

func (Reason) Validate ¶

func (t Reason) Validate() error

Validate returns an error if the Reason is not a known value.

type Scenario ¶

type Scenario struct {
	// Comments are notes about the scenario. For example, if a scenario is known to be bugged, deprecated,
	// expensive, etc.
	Comments string `json:"comments,omitzero"`
	// Models is a *non-exhaustive* list of models that support this scenario. It can't be exhaustive since
	// providers continuously release new models. It is still valuable to use the first value. Required.
	Models []string `json:"models"`

	// These mean that the model is automatically selected. There can be multiple SOTA models, one per
	// modality (e.g. text output vs image output). They must be first. Models must be a list of one model in
	// this case.
	SOTA  bool `json:"sota,omitzero"`
	Good  bool `json:"good,omitzero"`
	Cheap bool `json:"cheap,omitzero"`

	// Reason means that the model does either explicit chain-of-thought or hidden reasoning. For some
	// providers, this is controlled via an OptionsText. For some models (like Qwen3), a token "/no_think" or
	// "/think" is used to control it. ReasoningTokenStart and ReasoningTokenEnd must only be set on explicit
	// inline reasoning models. They often use <think> and </think>.
	Reason              bool   `json:"reason,omitzero"`
	ReasoningTokenStart string `json:"reasoningTokenStart,omitzero"`
	ReasoningTokenEnd   string `json:"reasoningTokenEnd,omitzero"`

	In  map[Modality]ModalCapability `json:"in,omitzero,omitempty"`
	Out map[Modality]ModalCapability `json:"out,omitzero,omitempty"`

	// GenSync declares features supported when using Provider.GenSync
	GenSync *Functionality `json:"GenSync,omitzero,omitempty"`
	// GenStream declares features supported when using Provider.GenStream
	GenStream *Functionality `json:"GenStream,omitzero,omitempty"`
	// contains filtered or unexported fields
}

Scenario defines one way to use the provider.
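
To illustrate what ReasoningTokenStart and ReasoningTokenEnd delimit, the sketch below separates an inline reasoning section from the rest of a reply. It is not the adapters.ProviderReasoning implementation; splitReasoning is a hypothetical helper that handles only a single, complete reasoning section in a non-streamed string.

```go
package main

import (
	"fmt"
	"strings"
)

// splitReasoning extracts the text between the first start marker and the
// following end marker. When the markers are not found in that order, the
// whole input is returned as the answer and reasoning is empty.
func splitReasoning(s, start, end string) (reasoning, answer string) {
	i := strings.Index(s, start)
	if i < 0 {
		return "", s
	}
	rest := s[i+len(start):]
	j := strings.Index(rest, end)
	if j < 0 {
		return "", s
	}
	reasoning = strings.TrimSpace(rest[:j])
	answer = strings.TrimSpace(s[:i] + rest[j+len(end):])
	return reasoning, answer
}

func main() {
	r, a := splitReasoning("<think>2+2 is 4</think>\nThe answer is 4.", "<think>", "</think>")
	fmt.Printf("reasoning=%q answer=%q\n", r, a)
}
```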

func ConsolidateUntestedScenarios ¶

func ConsolidateUntestedScenarios(scenarios []Scenario) []Scenario

ConsolidateUntestedScenarios merges untested scenarios by comments/reason.

Scenarios with matching Comments and Reason are merged, with their models combined and sorted. Untested scenarios with preference flags (SOTA, Good, Cheap) are not merged with others.
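
The merge rule above can be sketched as grouping by a (Comments, Reason) key. This is an illustration with a local stand-in type, not the package's code: the tested field stands in for the Untested method, and the choice to keep each merged group at the position of its first member is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// scenario is a local stand-in with only the fields the merge depends on.
type scenario struct {
	comments          string
	models            []string
	reason            bool
	sota, good, cheap bool
	tested            bool // assumption: stands in for !Untested()
}

// consolidate merges untested scenarios without preference flags that share
// the same comments and reason value. All other scenarios pass through as-is.
func consolidate(in []scenario) []scenario {
	type key struct {
		comments string
		reason   bool
	}
	var out []scenario
	index := map[key]int{} // key -> position of the merged scenario in out
	for _, s := range in {
		if s.tested || s.sota || s.good || s.cheap {
			out = append(out, s)
			continue
		}
		k := key{s.comments, s.reason}
		if i, ok := index[k]; ok {
			out[i].models = append(out[i].models, s.models...)
			continue
		}
		// Copy the slice so that appending never writes into the input.
		s.models = append([]string(nil), s.models...)
		index[k] = len(out)
		out = append(out, s)
	}
	for _, i := range index {
		sort.Strings(out[i].models)
	}
	return out
}

func main() {
	out := consolidate([]scenario{
		{models: []string{"m-tested"}, tested: true},
		{comments: "deprecated", models: []string{"m-b"}},
		{comments: "deprecated", models: []string{"m-a"}},
		{comments: "deprecated", models: []string{"m-sota"}, sota: true},
	})
	for _, s := range out {
		fmt.Println(s.models)
	}
}
```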

func (*Scenario) Untested ¶

func (s *Scenario) Untested() bool

Untested returns true if the scenario has no test results.

func (*Scenario) Validate ¶

func (s *Scenario) Validate() error

Validate returns an error if the Scenario is not correctly configured.

type Score ¶

type Score struct {
	// Warnings lists concerns the user should be aware of.
	Warnings []string `json:"warnings,omitzero,omitempty"`
	// Country where the provider is based, e.g. "US", "CN", "EU". Two exceptions: "Local" for local and "N/A"
	// for pure routers.
	Country string `json:"country"`
	// DashboardURL is the URL to the provider's dashboard, if available.
	DashboardURL string `json:"dashboardURL"`

	// Scenarios is the list of all known supported and tested scenarios.
	//
	// A single provider can provide various distinct use cases, like text-to-text, multi-modal-to-text,
	// text-to-audio, audio-to-text, etc.
	Scenarios []Scenario `json:"scenarios"`
	// contains filtered or unexported fields
}

Score is a snapshot of the capabilities of the provider. These are smoke tested to confirm the accuracy.
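
A consumer of a Score could use the preference flags to choose a default model. The sketch below walks a list of scenarios and returns the first model of the first match. The scenario type is a local stand-in, and the pick helper and model names are hypothetical.

```go
package main

import "fmt"

// scenario is a local stand-in with the model list and preference flags.
type scenario struct {
	models            []string
	sota, good, cheap bool
}

// pick returns the first model of the first scenario accepted by want.
// The boolean result is false when no scenario matches.
func pick(scenarios []scenario, want func(scenario) bool) (string, bool) {
	for _, s := range scenarios {
		if want(s) && len(s.models) > 0 {
			return s.models[0], true
		}
	}
	return "", false
}

func main() {
	scenarios := []scenario{
		{models: []string{"m-large"}, sota: true},
		{models: []string{"m-medium"}, good: true},
		{models: []string{"m-small"}, cheap: true},
	}
	m, ok := pick(scenarios, func(s scenario) bool { return s.cheap })
	fmt.Println(m, ok)
}
```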

func (*Score) SortScenarios ¶

func (s *Score) SortScenarios()

SortScenarios sorts the scenarios in place by preference flags. Untested scenarios are sorted last. Tested scenarios are sorted by preference flags: SOTA (0), Good (1), Cheap (2), then others. Within the same priority, reasoning scenarios come before non-reasoning. Within the same priority and reasoning status, scenarios are sorted alphabetically by first model name.

func (*Score) Validate ¶

func (s *Score) Validate() error

Validate returns an error if the Score is not correctly configured.

type TriState ¶

type TriState int8

TriState helps describe support when a feature "kinda works", which is frequent given the inherent non-determinism of LLMs.

const (
	// False means the feature is not supported.
	False TriState = 0
	// True means the feature is supported.
	True TriState = 1
	// Flaky means the feature works intermittently.
	Flaky TriState = -1
)

TriState values for feature support.
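
A caller might map the three values to a retry policy, for example when deciding how often to attempt a tool call. The sketch below uses a local stand-in type with the same constant values as documented above; the attempts helper and the policy itself are purely illustrative and not part of the package.

```go
package main

import "fmt"

// triState is a local stand-in with the documented constant values.
type triState int8

const (
	stateFalse triState = 0
	stateTrue  triState = 1
	stateFlaky triState = -1
)

// attempts suggests how many times to try a feature: never when it is
// unsupported, once when it is supported, and with extra retries when flaky.
func attempts(t triState, flakyRetries int) int {
	switch t {
	case stateTrue:
		return 1
	case stateFlaky:
		return 1 + flakyRetries
	}
	return 0
}

func main() {
	fmt.Println(attempts(stateFalse, 2), attempts(stateTrue, 2), attempts(stateFlaky, 2))
	// Prints: 0 1 3
}
```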

func (TriState) GoString ¶

func (t TriState) GoString() string

GoString implements fmt.GoStringer.

func (TriState) MarshalJSON ¶

func (t TriState) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler.

func (TriState) String ¶

func (t TriState) String() string

func (*TriState) UnmarshalJSON ¶

func (t *TriState) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler.

func (TriState) Validate ¶

func (t TriState) Validate() error

Validate returns an error if the TriState is not a known value.

Source Files ¶

scoreboard.go