Documentation ¶
Overview ¶
Package scoreboard declares the structures to define a scoreboard.
It is in a separate package from genai to reduce noise.
Index ¶
Constants ¶
This section is empty.
Variables ¶
var TestdataFiles embed.FS
TestdataFiles embeds the testdata/ directory for use in smoke tests.
The embedded files are the canonical data used to declare the supported modalities.
Functions ¶
func CompareScenarios ¶
CompareScenarios compares two scenarios for sorting. Scenarios are sorted by preference flags: SOTA (0), Good (1), Cheap (2), then others. Within the same preference, untested scenarios are sorted last. Within the same preference and tested status, reasoning scenarios come before non-reasoning ones. Within the same preference, tested status, and reasoning, scenarios are sorted alphabetically by first model name.
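The ordering can be sketched with sort.Slice over a stripped-down stand-in for Scenario (the field set and the compare helper below are illustrative; CompareScenarios itself operates on the real type):

```go
package main

import (
	"fmt"
	"sort"
)

// scenario is a stripped-down stand-in for scoreboard.Scenario,
// keeping only the fields the documented sort order depends on.
type scenario struct {
	SOTA, Good, Cheap bool
	Tested            bool
	Reason            bool
	Models            []string
}

// pref maps the preference flags to their sort priority:
// SOTA (0), Good (1), Cheap (2), then others (3).
func pref(s scenario) int {
	switch {
	case s.SOTA:
		return 0
	case s.Good:
		return 1
	case s.Cheap:
		return 2
	}
	return 3
}

// compare mirrors the documented ordering: preference flag, then
// tested before untested, then reasoning before non-reasoning,
// then alphabetically by first model name.
func compare(a, b scenario) bool {
	if pa, pb := pref(a), pref(b); pa != pb {
		return pa < pb
	}
	if a.Tested != b.Tested {
		return a.Tested // untested sorted last
	}
	if a.Reason != b.Reason {
		return a.Reason // reasoning first
	}
	return a.Models[0] < b.Models[0]
}

func main() {
	ss := []scenario{
		{Cheap: true, Tested: true, Models: []string{"mini"}},
		{SOTA: true, Tested: true, Models: []string{"big"}},
		{Tested: true, Reason: true, Models: []string{"mid"}},
	}
	sort.Slice(ss, func(i, j int) bool { return compare(ss[i], ss[j]) })
	fmt.Println(ss[0].Models[0], ss[1].Models[0], ss[2].Models[0])
	// big mini mid
}
```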
Types ¶
type Functionality ¶
type Functionality struct {
// ReportRateLimits means that the provider reports rate limits in its Usage.
ReportRateLimits bool `json:"reportRateLimits,omitzero"`
// ReportTokenUsage means that the token usage is correctly reported in all cases. It is Flaky if it is not
// reported in some specific cases; a frequent example is tokens not being reported in JSON output mode.
ReportTokenUsage TriState `json:"reportTokenUsage,omitzero"`
// ReportFinishReason means that the finish reason (FinishStop, FinishLength, etc) is correctly reported.
ReportFinishReason TriState `json:"reportFinishReason,omitzero"`
// Seed is set when the provider and model combination supports seed for reproducibility.
Seed bool `json:"seed,omitzero"`
// Tools means that tool calling is supported. This is a requirement for MCP. Some providers support tool
// calling but the model is very flaky at actually requesting the calls. This is more frequent on highly
// quantized models, small models or MoE models.
Tools TriState `json:"tools,omitzero"`
// ToolsBiased is true when, asked to use a tool in an ambiguous biased question, the LLM will always reply
// with the first readily available answer.
//
// This means that when using enum, it is important to understand that the LLM will put heavy weight on the
// first option.
//
// This is affected by two factors: model size and quantization. Quantization affects this dramatically.
ToolsBiased TriState `json:"toolsBiased,omitzero"`
// ToolsIndecisive is True when, asked to use a tool in an ambiguous biased question, the LLM calls both
// options. It is Flaky when both behaviors can happen.
//
// This is actually fine: it means that the LLM will be less opinionated in some cases. The degree to which
// an LLM is indecisive is likely model-specific too.
ToolsIndecisive TriState `json:"toolsIndecisive,omitzero"`
// ToolCallRequired is true when the value genai.ToolCallRequired works. Not supporting it significantly
// increases the risk of flakiness.
ToolCallRequired bool `json:"toolCallRequired,omitzero"`
// WebSearch is true if the provider supports web search via its own backend.
WebSearch bool `json:"webSearch,omitzero"`
// WebFetch is true if the provider supports fetching content from URLs via its own backend.
WebFetch bool `json:"webFetch,omitzero"`
// JSON means that the model supports enforcing that the response is valid JSON but not necessarily with a
// schema.
JSON bool `json:"json,omitzero"`
// JSONSchema means that the model supports enforcing that the response is a specific JSON schema.
JSONSchema bool `json:"jsonSchema,omitzero"`
// Citations is set when the provider and model combination supports citations in the response.
Citations bool `json:"citations,omitzero"`
// TopLogprobs is set when the provider and model combination supports top_logprobs.
TopLogprobs bool `json:"topLogprobs,omitzero"`
// MaxTokens means that the provider supports limiting text output to a specific number of tokens.
//
// Tokens are neither characters nor words. Tokens are embedding specific, and each model family uses a
// different vocabulary, so the number of characters generated varies wildly.
//
// It fails more often with models with implicit reasoning.
MaxTokens bool `json:"maxTokens,omitzero"`
// StopSequence means that the provider supports stop words. The number of stop words is generally limited,
// frequently to 5 words. The sequence should be a valid token in the model's vocabulary.
StopSequence bool `json:"stopSequence,omitzero"`
// contains filtered or unexported fields
}
Functionality defines which functionalities are supported in a scenario.
The first group of fields applies to all models; the remainder applies to text models.
The second group is about tool use, which as of 2025-08 is only supported for text models.
The third group is about text-specific features.
func (*Functionality) Less ¶
func (f *Functionality) Less(rhs *Functionality) bool
Less returns true if the functionality is less than the other.
func (*Functionality) Validate ¶
func (f *Functionality) Validate() error
Validate returns an error if the Functionality contains invalid values.
type ModalCapability ¶
type ModalCapability struct {
// Inline means content can be embedded directly (e.g., base64 encoded)
Inline bool `json:"inline,omitzero"`
// URL means content can be referenced by URL
URL bool `json:"url,omitzero"`
// MaxSize specifies the maximum size in bytes.
MaxSize int64 `json:"maxSize,omitzero"`
// SupportedFormats lists supported MIME types for this modality
SupportedFormats []string `json:"supportedFormats,omitzero"`
}
ModalCapability describes how a modality is supported by a provider.
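A sketch of how a caller might check content against a ModalCapability before sending it inline. The acceptsInline helper, and its treatment of an empty SupportedFormats or zero MaxSize as "no restriction", are assumptions for illustration, not part of the package:

```go
package main

import (
	"fmt"
	"slices"
)

// ModalCapability mirrors the struct above.
type ModalCapability struct {
	Inline           bool
	URL              bool
	MaxSize          int64
	SupportedFormats []string
}

// acceptsInline reports whether content of the given MIME type and
// size can be embedded directly under this capability. Empty
// SupportedFormats and zero MaxSize are treated as "no restriction".
func acceptsInline(c ModalCapability, mimeType string, size int64) bool {
	if !c.Inline {
		return false
	}
	if c.MaxSize > 0 && size > c.MaxSize {
		return false
	}
	return len(c.SupportedFormats) == 0 || slices.Contains(c.SupportedFormats, mimeType)
}

func main() {
	img := ModalCapability{
		Inline:           true,
		MaxSize:          20 << 20, // 20 MiB
		SupportedFormats: []string{"image/png", "image/jpeg"},
	}
	fmt.Println(acceptsInline(img, "image/png", 1<<20))  // true
	fmt.Println(acceptsInline(img, "image/webp", 1<<20)) // false
}
```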
type Modality ¶
type Modality string
Modality is one of the supported modalities.
const (
// ModalityAudio is support for audio formats like MP3, WAV, Opus, Flac, etc.
ModalityAudio Modality = "audio"
// ModalityDocument is support for PDF with multi-modal comprehension, both images and text. This includes
// code blocks.
ModalityDocument Modality = "document"
// ModalityImage is support for image formats like PNG, JPEG, often single frame GIF, and WEBP.
ModalityImage Modality = "image"
// ModalityText is for raw text.
ModalityText Modality = "text"
// ModalityVideo is support for video formats like MP4 or MKV.
ModalityVideo Modality = "video"
)
type Model ¶
Model specifies a model to test and whether it should run in reasoning mode.
Most models only support one or the other, but some support both. Functionality often differs depending on whether reasoning is enabled.
type Reason ¶
type Reason int8
Reason specifies if a model Scenario supports reasoning (thinking).
const (
// ReasonNone means that no reasoning is supported.
ReasonNone Reason = 0
// ReasonInline means that the reasoning tokens are inline and must be explicitly parsed from Content.Text
// with adapters.ProviderReasoning.
ReasonInline Reason = 1
// ReasonAuto means that the reasoning tokens are properly generated and handled by the provider and are
// returned as Content.Reasoning.
ReasonAuto Reason = -1
)
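For ReasonInline, the reasoning block must be carved out of the raw text using the scenario's ReasoningTokenStart/End markers. A minimal self-contained sketch of that parsing (the package delegates the real work to adapters.ProviderReasoning):

```go
package main

import (
	"fmt"
	"strings"
)

// splitReasoning separates an inline reasoning block from the answer,
// given the scenario's ReasoningTokenStart/End markers.
func splitReasoning(text, start, end string) (reasoning, answer string) {
	i := strings.Index(text, start)
	if i < 0 {
		return "", text
	}
	rest := text[i+len(start):]
	j := strings.Index(rest, end)
	if j < 0 {
		// Unterminated block: treat everything after start as reasoning.
		return strings.TrimSpace(rest), strings.TrimSpace(text[:i])
	}
	reasoning = strings.TrimSpace(rest[:j])
	answer = strings.TrimSpace(text[:i] + rest[j+len(end):])
	return reasoning, answer
}

func main() {
	r, a := splitReasoning("<think>2+2 is 4</think>The answer is 4.", "<think>", "</think>")
	fmt.Printf("reasoning=%q answer=%q\n", r, a)
	// reasoning="2+2 is 4" answer="The answer is 4."
}
```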
func (Reason) MarshalJSON ¶
MarshalJSON implements json.Marshaler.
func (*Reason) UnmarshalJSON ¶
UnmarshalJSON implements json.Unmarshaler.
type Scenario ¶
type Scenario struct {
// Comments are notes about the scenario. For example, if a scenario is known to be bugged, deprecated,
// expensive, etc.
Comments string `json:"comments,omitzero"`
// providers continuouly release new models. It is still valuable to use the first value. Required.
Models []string `json:"models"`
// SOTA, Good and Cheap mean that the model is automatically selected. There can be multiple SOTA models,
// one per modality (e.g. text output vs image output). They must be first. Models must contain exactly one
// model in this case.
SOTA bool `json:"sota,omitzero"`
Good bool `json:"good,omitzero"`
Cheap bool `json:"cheap,omitzero"`
// Reason means that the model does either explicit chain-of-thought or hidden reasoning. For some
// providers, this is controlled via OptionsText. For some models (like Qwen3), a "/no_think" or "/think"
// token is used to control it. ReasoningTokenStart and ReasoningTokenEnd must only be set on explicit
// inline reasoning models. They often use <think> and </think>.
Reason bool `json:"reason,omitzero"`
ReasoningTokenStart string `json:"reasoningTokenStart,omitzero"`
ReasoningTokenEnd string `json:"reasoningTokenEnd,omitzero"`
// In maps each supported input modality to its capability.
In map[Modality]ModalCapability `json:"in,omitzero,omitempty"`
// Out maps each supported output modality to its capability.
Out map[Modality]ModalCapability `json:"out,omitzero,omitempty"`
// GenSync declares features supported when using Provider.GenSync
GenSync *Functionality `json:"GenSync,omitzero,omitempty"`
// GenStream declares features supported when using Provider.GenStream
GenStream *Functionality `json:"GenStream,omitzero,omitempty"`
// contains filtered or unexported fields
}
Scenario defines one way to use the provider.
func ConsolidateUntestedScenarios ¶
ConsolidateUntestedScenarios merges untested scenarios by comments/reason.
Scenarios with matching Comments and Reason are merged, with their models combined and sorted. Untested scenarios with preference flags (SOTA, Good, Cheap) are not merged with others.
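The merge can be sketched as grouping by a (Comments, Reason) key over a stripped-down scenario type (the preference-flag exclusion is omitted here; field names are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// scenario is a stand-in keeping only the fields the merge uses.
type scenario struct {
	Comments string
	Reason   bool
	Models   []string
}

// consolidate merges scenarios sharing the same Comments and Reason,
// combining and sorting their model lists.
func consolidate(in []scenario) []scenario {
	type key struct {
		comments string
		reason   bool
	}
	idx := map[key]int{}
	var out []scenario
	for _, s := range in {
		k := key{s.Comments, s.Reason}
		if i, ok := idx[k]; ok {
			out[i].Models = append(out[i].Models, s.Models...)
			continue
		}
		idx[k] = len(out)
		out = append(out, s)
	}
	for i := range out {
		sort.Strings(out[i].Models)
	}
	return out
}

func main() {
	got := consolidate([]scenario{
		{Comments: "untested", Models: []string{"b"}},
		{Comments: "untested", Models: []string{"a"}},
		{Comments: "deprecated", Models: []string{"c"}},
	})
	fmt.Println(len(got), got[0].Models) // 2 [a b]
}
```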
type Score ¶
type Score struct {
// Warnings lists concerns the user should be aware of.
Warnings []string `json:"warnings,omitzero,omitempty"`
// Country where the provider is based, e.g. "US", "CN", "EU". Two exceptions: "Local" for local providers
// and "N/A" for pure routers.
Country string `json:"country"`
// DashboardURL is the URL to the provider's dashboard, if available.
DashboardURL string `json:"dashboardURL"`
// Scenarios is the list of all known supported and tested scenarios.
//
// A single provider can provide various distinct use cases, like text-to-text, multi-modal-to-text,
// text-to-audio, audio-to-text, etc.
Scenarios []Scenario `json:"scenarios"`
// contains filtered or unexported fields
}
Score is a snapshot of the capabilities of the provider. These are smoke tested to confirm the accuracy.
func (*Score) SortScenarios ¶
func (s *Score) SortScenarios()
SortScenarios sorts the scenarios in place by preference flags. Untested scenarios are sorted last. Tested scenarios are sorted by preference flags: SOTA (0), Good (1), Cheap (2), then others. Within the same priority, reasoning scenarios come before non-reasoning. Within the same priority and reasoning status, scenarios are sorted alphabetically by first model name.
type TriState ¶
type TriState int8
TriState helps describe support when a feature "kinda works", which is frequent given LLMs' inherent non-determinism.
const (
// False means the feature is not supported.
False TriState = 0
// True means the feature is supported.
True TriState = 1
// Flaky means the feature works intermittently.
Flaky TriState = -1
)
TriState values for feature support.
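A useful property of this encoding is that False is the zero value, so fields tagged omitzero simply disappear from the JSON when a feature is unsupported. As a sketch of how Flaky naturally arises from repeated smoke-test runs (merge is a hypothetical helper, not part of the package):

```go
package main

import "fmt"

// TriState mirrors the type above: False (0), True (1), Flaky (-1).
type TriState int8

const (
	False TriState = 0
	True  TriState = 1
	Flaky TriState = -1
)

// merge is a hypothetical helper combining two observations of the
// same feature across runs: consistent observations keep their value,
// disagreement means the feature works intermittently, i.e. Flaky.
func merge(a, b TriState) TriState {
	if a == b {
		return a
	}
	return Flaky
}

func main() {
	fmt.Println(merge(True, True))  // 1 (still True)
	fmt.Println(merge(True, False)) // -1 (Flaky)
}
```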
func (TriState) MarshalJSON ¶
MarshalJSON implements json.Marshaler.
func (*TriState) UnmarshalJSON ¶
UnmarshalJSON implements json.Unmarshaler.