scoreboard

package
v0.3.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 2, 2026 License: Apache-2.0 Imports: 8 Imported by: 0

Documentation

Overview

Package scoreboard declares the structures to define a scoreboard.

It is in a separate package from genai to reduce noise.

Index

Constants

This section is empty.

Variables

var TestdataFiles embed.FS

TestdataFiles embeds the testdata/ directory for use in smoke tests.

They are the canonical data to be used to declare the supported modalities.

Functions

func CompareScenarios

func CompareScenarios(a, b Scenario) int

CompareScenarios compares two scenarios for sorting. Scenarios are sorted by preference flags: SOTA (0), Good (1), Cheap (2), then others. Within the same preference, untested scenarios are sorted last. Within the same priority and tested status, reasoning scenarios come before non-reasoning. Within the same priority, tested status, and reasoning, scenarios are sorted alphabetically by first model name.
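The ordering described above can be sketched as a chained comparator. This is a minimal illustration, not the package's implementation: the `scenario` type is a local stand-in that mirrors only the fields needed for ordering, and its `Tested` field is a hypothetical replacement for the documented `Untested()` method.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// scenario is a local stand-in for the documented Scenario type.
// Tested is a hypothetical field standing in for !Scenario.Untested().
type scenario struct {
	Models            []string
	SOTA, Good, Cheap bool
	Reason            bool
	Tested            bool
}

// priority maps the preference flags to the documented order:
// SOTA (0), Good (1), Cheap (2), then others (3).
func priority(s scenario) int {
	switch {
	case s.SOTA:
		return 0
	case s.Good:
		return 1
	case s.Cheap:
		return 2
	default:
		return 3
	}
}

// rank returns 0 when first is true so that true values sort first.
func rank(first bool) int {
	if first {
		return 0
	}
	return 1
}

// firstModel returns the first model name, or "" when there is none.
func firstModel(s scenario) string {
	if len(s.Models) == 0 {
		return ""
	}
	return s.Models[0]
}

// compareScenarios applies each documented criterion in turn and returns
// the first non-zero result.
func compareScenarios(a, b scenario) int {
	return cmp.Or(
		cmp.Compare(priority(a), priority(b)),
		cmp.Compare(rank(a.Tested), rank(b.Tested)),
		cmp.Compare(rank(a.Reason), rank(b.Reason)),
		cmp.Compare(firstModel(a), firstModel(b)),
	)
}

func main() {
	list := []scenario{
		{Models: []string{"m-cheap"}, Cheap: true, Tested: true},
		{Models: []string{"m-sota"}, SOTA: true, Tested: true},
		{Models: []string{"m-other-b"}, Tested: true},
		{Models: []string{"m-other-a"}, Tested: true, Reason: true},
		{Models: []string{"m-untested"}},
	}
	slices.SortFunc(list, compareScenarios)
	// Prints m-sota, m-cheap, m-other-a, m-other-b, m-untested, one per line.
	for _, s := range list {
		fmt.Println(firstModel(s))
	}
}
```

With the real package, the documented `CompareScenarios` has the same `func(a, b Scenario) int` shape, so it should be usable directly as the comparison argument to `slices.SortFunc`.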

Types

type Functionality

type Functionality struct {
	// ReportRateLimits means that the provider reports rate limits in its Usage.
	ReportRateLimits bool `json:"reportRateLimits,omitzero"`
	// ReportTokenUsage means that the token usage is correctly reported in all cases. It is flaky if it is not
	// reported in some specific cases. A frequent example is tokens not being reported in JSON output mode.
	ReportTokenUsage TriState `json:"reportTokenUsage,omitzero"`
	// ReportFinishReason means that the finish reason (FinishStop, FinishLength, etc.) is correctly reported.
	ReportFinishReason TriState `json:"reportFinishReason,omitzero"`
	// Seed is set when the provider and model combination supports seed for reproducibility.
	Seed bool `json:"seed,omitzero"`

	// Tools means that tool calling is supported. This is a requirement for MCP. Some providers support tool
	// calling but the model is very flaky at actually requesting the calls. This is more frequent on highly
	// quantized models, small models or MoE models.
	Tools TriState `json:"tools,omitzero"`
	// ToolsBiased is True when, asked to use a tool on an ambiguous biased question, the LLM always
	// replies with the first readily available answer.
	//
	// This means that when using enum, it is important to understand that the LLM will put heavy weight on the
	// first option.
	//
	// This is affected by two factors: model size and quantization. Quantization affects this dramatically.
	ToolsBiased TriState `json:"toolsBiased,omitzero"`
	// ToolsIndecisive is True when, asked to use a tool on an ambiguous biased question, the LLM calls both
	// options. It is Flaky when either behavior can happen.
	//
	// This is actually fine; it means that the LLM will be less opinionated in some cases. The test on which
	// an LLM is indecisive is likely model-specific too.
	ToolsIndecisive TriState `json:"toolsIndecisive,omitzero"`
	// ToolCallRequired is true when the value genai.ToolCallRequired works. Not supporting it significantly
	// increases the risk of flakiness.
	ToolCallRequired bool `json:"toolCallRequired,omitzero"`
	// WebSearch is true if the provider supports web search via its own backend.
	WebSearch bool `json:"webSearch,omitzero"`
	// WebFetch is true if the provider supports fetching content from URLs via its own backend.
	WebFetch bool `json:"webFetch,omitzero"`

	// JSON means that the model supports enforcing that the response is valid JSON but not necessarily with a
	// schema.
	JSON bool `json:"json,omitzero"`
	// JSONSchema means that the model supports enforcing that the response is a specific JSON schema.
	JSONSchema bool `json:"jsonSchema,omitzero"`
	// Citations is set when the provider and model combination supports citations in the response.
	Citations bool `json:"citations,omitzero"`
	// TopLogprobs is set when the provider and model combination supports top_logprobs.
	TopLogprobs bool `json:"topLogprobs,omitzero"`
	// MaxTokens means that the provider supports limiting text output to a specific number of tokens.
	//
	// Tokens are neither characters nor words. The tokens are embedding specific, and each model family uses a
	// different vocabulary. Thus the number of characters generated varies wildly.
	//
	// It fails more often with models with implicit reasoning.
	MaxTokens bool `json:"maxTokens,omitzero"`
	// StopSequence means that the provider supports stop words. The number of stop words is generally limited,
	// frequently to 5 words. The sequence should be a valid token in the model's vocabulary.
	StopSequence bool `json:"stopSequence,omitzero"`
	// contains filtered or unexported fields
}

Functionality defines which functionalities are supported in a scenario.

The first group is for all models. The remainder is for text models.

The second group is about tool use and, as of 2025-08, is only supported for text models.

The third group is about text specific features.
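As an illustration of how a caller might act on the text-specific flags, the sketch below picks a structured-output strategy from the `JSON` and `JSONSchema` fields. The `functionality` type is a local stand-in mirroring only those two documented fields, and `outputMode` with its constants is hypothetical and not part of the package.

```go
package main

import "fmt"

// functionality is a local stand-in mirroring two documented fields of
// Functionality.
type functionality struct {
	JSON       bool
	JSONSchema bool
}

// outputMode is a hypothetical enum used only for this illustration.
type outputMode string

const (
	modeSchema outputMode = "schema" // enforce a specific JSON schema
	modeJSON   outputMode = "json"   // enforce valid JSON without a schema
	modePrompt outputMode = "prompt" // no enforcement; ask for JSON in the prompt
)

// pickOutputMode prefers the strictest mode the scenario declares. A nil
// pointer is treated as "nothing declared", since the documented GenSync and
// GenStream fields are pointers.
func pickOutputMode(f *functionality) outputMode {
	switch {
	case f == nil:
		return modePrompt
	case f.JSONSchema:
		return modeSchema
	case f.JSON:
		return modeJSON
	default:
		return modePrompt
	}
}

func main() {
	fmt.Println(pickOutputMode(&functionality{JSON: true, JSONSchema: true}))
	fmt.Println(pickOutputMode(&functionality{JSON: true}))
	fmt.Println(pickOutputMode(nil))
}
```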

func (*Functionality) Less

func (f *Functionality) Less(rhs *Functionality) bool

Less returns true if the functionality is less than the other.

func (*Functionality) Validate

func (f *Functionality) Validate() error

Validate returns an error if the Functionality contains invalid values.

type ModalCapability

type ModalCapability struct {
	// Inline means content can be embedded directly (e.g., base64 encoded)
	Inline bool `json:"inline,omitzero"`
	// URL means content can be referenced by URL
	URL bool `json:"url,omitzero"`
	// MaxSize specifies the maximum size in bytes.
	MaxSize int64 `json:"maxSize,omitzero"`
	// SupportedFormats lists supported MIME types for this modality
	SupportedFormats []string `json:"supportedFormats,omitzero"`
}

ModalCapability describes how a modality is supported by a provider.
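A caller could use these fields to decide whether a payload may be sent inline. The sketch below uses a local `modalCapability` type that mirrors the documented fields. It assumes, since the documentation does not say, that a zero `MaxSize` means no declared limit and that an empty `SupportedFormats` means no declared restriction; `canSendInline` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"slices"
)

// modalCapability is a local stand-in mirroring the documented fields of
// ModalCapability.
type modalCapability struct {
	Inline           bool
	URL              bool
	MaxSize          int64
	SupportedFormats []string
}

// canSendInline reports whether a payload of the given MIME type and size may
// be embedded directly.
//
// Assumptions for this sketch: MaxSize == 0 means no declared limit, and an
// empty SupportedFormats means no declared restriction.
func canSendInline(c modalCapability, mimeType string, size int64) bool {
	if !c.Inline {
		return false
	}
	if c.MaxSize > 0 && size > c.MaxSize {
		return false
	}
	if len(c.SupportedFormats) > 0 && !slices.Contains(c.SupportedFormats, mimeType) {
		return false
	}
	return true
}

func main() {
	image := modalCapability{
		Inline:           true,
		MaxSize:          5 << 20, // 5 MiB, an illustrative value
		SupportedFormats: []string{"image/png", "image/jpeg"},
	}
	fmt.Println(canSendInline(image, "image/png", 1<<20))
	fmt.Println(canSendInline(image, "image/gif", 1<<20))
	fmt.Println(canSendInline(image, "image/png", 10<<20))
}
```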

type Modality

type Modality string

Modality is one of the supported modalities.

const (
	// ModalityAudio is support for audio formats like MP3, WAV, Opus, Flac, etc.
	ModalityAudio Modality = "audio"
	// ModalityDocument is support for PDF with multi-modal comprehension, both images and text. This includes
	// code blocks.
	ModalityDocument Modality = "document"
	// ModalityImage is support for image formats like PNG, JPEG, often single frame GIF, and WEBP.
	ModalityImage Modality = "image"
	// ModalityText is for raw text.
	ModalityText Modality = "text"
	// ModalityVideo is support for video formats like MP4 or MKV.
	ModalityVideo Modality = "video"
)

func (Modality) Validate

func (m Modality) Validate() error

Validate returns an error if the Modality is not a known value.
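Validation of a string-based enumeration like this one is commonly written as a switch over the known constants. The following is a minimal sketch with a local `modality` type that reuses the documented string values; the error text is illustrative and not the package's actual message.

```go
package main

import "fmt"

// modality is a local stand-in for the documented Modality type.
type modality string

const (
	modalityAudio    modality = "audio"
	modalityDocument modality = "document"
	modalityImage    modality = "image"
	modalityText     modality = "text"
	modalityVideo    modality = "video"
)

// validate returns an error when m is not one of the known values.
func (m modality) validate() error {
	switch m {
	case modalityAudio, modalityDocument, modalityImage, modalityText, modalityVideo:
		return nil
	default:
		return fmt.Errorf("unknown modality %q", string(m))
	}
}

func main() {
	fmt.Println(modalityImage.validate())
	fmt.Println(modality("hologram").validate())
}
```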

type Model

type Model struct {
	Model  string
	Reason bool
}

Model specifies a model to test and whether it should run in reasoning mode.

Most models only support one or the other, but some support both. Functionality often differs depending on whether reasoning is enabled.

func (*Model) String

func (m *Model) String() string

type Reason

type Reason int8

Reason specifies if a model Scenario supports reasoning (thinking).

const (
	// ReasonNone means that no reasoning is supported.
	ReasonNone Reason = 0
	// ReasonInline means that the reasoning tokens are inline and must be explicitly parsed from Content.Text
	// with adapters.ProviderReasoning.
	ReasonInline Reason = 1
	// ReasonAuto means that the reasoning tokens are properly generated and handled by the provider and are
	// returned as Content.Reasoning.
	ReasonAuto Reason = -1
)

func (Reason) MarshalJSON

func (t Reason) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler.

func (*Reason) UnmarshalJSON

func (t *Reason) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler.

func (Reason) Validate

func (t Reason) Validate() error

Validate returns an error if the Reason is not a known value.

type Scenario

type Scenario struct {
	// Comments are notes about the scenario. For example, if a scenario is known to be bugged, deprecated,
	// expensive, etc.
	Comments string `json:"comments,omitzero"`
	// Models is a *non exhaustive* list of models that support this scenario. It can't be exhaustive since
	// providers continuously release new models. It is still valuable to use the first value. Required.
	Models []string `json:"models"`

	// These mean that the model is automatically selected. There can be multiple SOTA models, one per
	// modality (e.g. text output vs image output). They must be first. Models must be a list of one model in
	// this case.
	SOTA  bool `json:"sota,omitzero"`
	Good  bool `json:"good,omitzero"`
	Cheap bool `json:"cheap,omitzero"`

	// Reason means that the model does either explicit chain-of-thought or hidden reasoning. For some
	// providers, this is controlled via an OptionsText. For some models (like Qwen3), a token "/no_think" or
	// "/think" is used to control it. ReasoningTokenStart and ReasoningTokenEnd must only be set on explicit
	// inline reasoning models. They often use <think> and </think>.
	Reason              bool   `json:"reason,omitzero"`
	ReasoningTokenStart string `json:"reasoningTokenStart,omitzero"`
	ReasoningTokenEnd   string `json:"reasoningTokenEnd,omitzero"`

	In  map[Modality]ModalCapability `json:"in,omitzero,omitempty"`
	Out map[Modality]ModalCapability `json:"out,omitzero,omitempty"`

	// GenSync declares features supported when using Provider.GenSync
	GenSync *Functionality `json:"GenSync,omitzero,omitempty"`
	// GenStream declares features supported when using Provider.GenStream
	GenStream *Functionality `json:"GenStream,omitzero,omitempty"`
	// contains filtered or unexported fields
}

Scenario defines one way to use the provider.
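For inline reasoning models, ReasoningTokenStart and ReasoningTokenEnd delimit the reasoning inside the generated text. The sketch below shows the general idea of separating the two parts. It is a hand-rolled illustration with a hypothetical `splitReasoning` helper, not the `adapters.ProviderReasoning` adapter mentioned under ReasonInline, and it handles only a single reasoning block.

```go
package main

import (
	"fmt"
	"strings"
)

// splitReasoning separates one inline reasoning block, delimited by start and
// end, from the rest of the text. ok is false when no complete block is found,
// in which case the text is returned unchanged as the answer.
func splitReasoning(text, start, end string) (reasoning, answer string, ok bool) {
	before, rest, found := strings.Cut(text, start)
	if !found {
		return "", text, false
	}
	inner, after, found := strings.Cut(rest, end)
	if !found {
		return "", text, false
	}
	return strings.TrimSpace(inner), strings.TrimSpace(before + after), true
}

func main() {
	raw := "<think>2+2 is 4</think>The answer is 4."
	reasoning, answer, ok := splitReasoning(raw, "<think>", "</think>")
	fmt.Printf("%q %q %v\n", reasoning, answer, ok)
}
```

A streaming response would need an incremental version of this, since a delimiter can be split across chunks.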

func ConsolidateUntestedScenarios

func ConsolidateUntestedScenarios(scenarios []Scenario) []Scenario

ConsolidateUntestedScenarios merges untested scenarios by comments/reason.

Scenarios with matching Comments and Reason are merged, with their models combined and sorted. Untested scenarios with preference flags (SOTA, Good, Cheap) are not merged with others.
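The merge rule above can be sketched as a grouping keyed on (Comments, Reason). This is an illustration under assumptions, not the package's code: `scenario` is a local stand-in, `Tested` is a hypothetical field standing in for `!Untested()`, and keeping first-occurrence order in the output is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"slices"
)

// scenario is a local stand-in for the documented Scenario type.
// Tested is a hypothetical field standing in for !Scenario.Untested().
type scenario struct {
	Comments          string
	Models            []string
	SOTA, Good, Cheap bool
	Reason            bool
	Tested            bool
}

// consolidate merges untested scenarios that share Comments and Reason.
// Tested scenarios and untested ones carrying a preference flag pass through
// unchanged. Merged model lists are sorted.
func consolidate(in []scenario) []scenario {
	type key struct {
		comments string
		reason   bool
	}
	index := map[key]int{} // group key -> position in out
	var out []scenario
	for _, s := range in {
		if s.Tested || s.SOTA || s.Good || s.Cheap {
			out = append(out, s)
			continue
		}
		k := key{s.Comments, s.Reason}
		if i, ok := index[k]; ok {
			out[i].Models = append(out[i].Models, s.Models...)
			continue
		}
		// Clone so appending later does not modify the caller's slice.
		s.Models = slices.Clone(s.Models)
		index[k] = len(out)
		out = append(out, s)
	}
	for _, i := range index {
		slices.Sort(out[i].Models)
	}
	return out
}

func main() {
	merged := consolidate([]scenario{
		{Comments: "deprecated", Models: []string{"model-b"}},
		{Models: []string{"model-sota"}, SOTA: true},
		{Comments: "deprecated", Models: []string{"model-a"}},
	})
	for _, s := range merged {
		fmt.Println(s.Comments, s.Models)
	}
}
```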

func (*Scenario) Untested

func (s *Scenario) Untested() bool

Untested returns true if the scenario has no test results.

func (*Scenario) Validate

func (s *Scenario) Validate() error

Validate returns an error if the Scenario is not correctly configured.

type Score

type Score struct {
	// Warnings lists concerns the user should be aware of.
	Warnings []string `json:"warnings,omitzero,omitempty"`
	// Country where the provider is based, e.g. "US", "CN", "EU". Two exceptions: "Local" for local and "N/A"
	// for pure routers.
	Country string `json:"country"`
	// DashboardURL is the URL to the provider's dashboard, if available.
	DashboardURL string `json:"dashboardURL"`

	// Scenarios is the list of all known supported and tested scenarios.
	//
	// A single provider can provide various distinct use cases, like text-to-text, multi-modal-to-text,
	// text-to-audio, audio-to-text, etc.
	Scenarios []Scenario `json:"scenarios"`
	// contains filtered or unexported fields
}

Score is a snapshot of the capabilities of the provider. These are smoke tested to confirm the accuracy.

func (*Score) SortScenarios

func (s *Score) SortScenarios()

SortScenarios sorts the scenarios in place by preference flags. Untested scenarios are sorted last. Tested scenarios are sorted by preference flags: SOTA (0), Good (1), Cheap (2), then others. Within the same priority, reasoning scenarios come before non-reasoning. Within the same priority and reasoning status, scenarios are sorted alphabetically by first model name.

func (*Score) Validate

func (s *Score) Validate() error

Validate returns an error if the Score is not correctly configured.

type TriState

type TriState int8

TriState helps describe support when a feature "kinda works", which is frequent given the inherent non-determinism of LLMs.

const (
	// False means the feature is not supported.
	False TriState = 0
	// True means the feature is supported.
	True TriState = 1
	// Flaky means the feature works intermittently.
	Flaky TriState = -1
)

TriState values for feature support.
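One way a caller might act on the three states is to turn them into a retry budget. The sketch below uses a local `triState` type with the same numeric values as documented; `maxAttempts` and the budget of three attempts for a flaky feature are hypothetical choices for illustration only.

```go
package main

import "fmt"

// triState is a local stand-in using the documented numeric values.
type triState int8

const (
	stateFalse triState = 0
	stateTrue  triState = 1
	stateFlaky triState = -1
)

// maxAttempts is a hypothetical policy: skip unsupported features, call
// supported ones once, and allow a few retries when support is flaky.
func maxAttempts(t triState) int {
	switch t {
	case stateTrue:
		return 1
	case stateFlaky:
		return 3
	default:
		return 0
	}
}

func main() {
	for _, t := range []triState{stateFalse, stateTrue, stateFlaky} {
		fmt.Println(t, maxAttempts(t))
	}
}
```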

func (TriState) GoString

func (t TriState) GoString() string

GoString implements fmt.GoStringer.

func (TriState) MarshalJSON

func (t TriState) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler.

func (TriState) String

func (t TriState) String() string

func (*TriState) UnmarshalJSON

func (t *TriState) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler.

func (TriState) Validate

func (t TriState) Validate() error

Validate returns an error if the TriState is not a known value.
