Documentation
Overview ¶
Package inference provides a high-level API for loading GGUF models and running text generation, chat, embedding, and speculative decoding with minimal boilerplate. (Stability: stable)
Loading Models ¶
There are two entry points for loading a model:
- Load resolves a model by name or HuggingFace repo ID, pulling it from the registry if not already cached, and returns a ready-to-use Model.
- LoadFile loads a model directly from a local GGUF file path.
Both accept functional Option values to configure the compute device, cache directory, sequence length, and other parameters:
m, err := inference.Load("gemma-3-1b-q4",
    inference.WithDevice("cuda"),
    inference.WithMaxSeqLen(4096),
)
if err != nil {
    log.Fatal(err)
}
defer m.Close()

text, err := m.Generate(ctx, "Explain gradient descent briefly.",
    inference.WithMaxTokens(256),
    inference.WithTemperature(0.7),
)
Model Methods ¶
A loaded Model exposes several generation methods:
- Model.Generate produces text from a prompt and returns the full result.
- Model.GenerateStream delivers tokens incrementally via a callback.
- Model.GenerateBatch processes multiple prompts concurrently.
- Model.Chat formats a slice of Message values using the model's chat template and generates a Response with token usage statistics.
- Model.Embed returns an L2-normalized embedding vector for a text input by mean-pooling the model's token embedding table.
- Model.SpeculativeGenerate runs speculative decoding with a smaller draft model to accelerate generation from a larger target model.
Load Options ¶
The following Option functions configure model loading:
- WithDevice — compute device: "cpu", "cuda", "cuda:N", "rocm", "opencl"
- WithCacheDir — local directory for cached model files
- WithMaxSeqLen — override the model's default maximum sequence length
- WithRegistry — supply a custom model registry
- WithBackend — select "tensorrt" for TensorRT-optimized inference
- WithPrecision — set TensorRT compute precision ("fp16")
- WithDType — set GPU compute precision ("fp16", "fp8")
- WithKVDtype — set KV cache storage precision ("fp16")
- WithMmap — control memory-mapped model loading (default: enabled)
Generate Options ¶
The following GenerateOption functions configure sampling for generation methods:
- WithTemperature — sampling temperature (higher = more random)
- WithTopK — top-K sampling cutoff
- WithTopP — nucleus (top-P) sampling threshold
- WithMaxTokens — maximum number of tokens to generate
- WithRepetitionPenalty — penalize repeated tokens
- WithStopStrings — strings that terminate generation
- WithGrammar — constrained decoding via a grammar state machine
Model Aliases ¶
Short aliases such as "gemma-3-1b-q4" and "llama-3-8b-q4" map to full HuggingFace repository IDs. Use ResolveAlias to look up the mapping and RegisterAlias to add custom aliases.
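The alias behavior can be modeled as a simple map with pass-through lookup. The repo IDs below are hypothetical examples, not the package's built-in table:

```go
package main

import "fmt"

// aliases mimics the RegisterAlias/ResolveAlias behavior described above:
// known short names map to repo IDs, everything else passes through unchanged.
var aliases = map[string]string{
	"gemma-3-1b-q4": "example-org/gemma-3-1b-GGUF", // hypothetical repo ID
}

func registerAlias(short, repoID string) { aliases[short] = repoID }

func resolveAlias(name string) string {
	if repo, ok := aliases[name]; ok {
		return repo
	}
	return name // not an alias: returned unchanged
}

func main() {
	registerAlias("my-model", "acme/my-model-GGUF")
	fmt.Println(resolveAlias("my-model"))        // acme/my-model-GGUF
	fmt.Println(resolveAlias("acme/other.gguf")) // passes through unchanged
}
```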
Related Packages ¶
For lower-level control over text generation, KV caching, and sampling, see the github.com/zerfoo/zerfoo/generate package. For an OpenAI-compatible HTTP server built on top of this package, see github.com/zerfoo/zerfoo/serve.
Index ¶
- func AutoBuild(tensors map[string]*tensor.TensorNumeric[float32], cfg *gguf.ModelConfig, ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildArchGraph(arch string, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildEAGLEHead[T tensor.Numeric](engine compute.Engine[T], ops numeric.Arithmetic[T], config EAGLEConfig) (*core.EAGLEHead[T], error)
- func BuildJamba(jc JambaConfig, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildLLaVAModel(lc LLaVAConfig, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildMamba3(mc MambaConfig, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildMamba3MIMO(mc Mamba3Config, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildNemotronH(nc NemotronHConfig, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildQwenVLModel(qc QwenVLConfig, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildRWKV(rc RWKVConfig, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildResidualConnection[T tensor.Numeric](config ResidualConfig, engine compute.Engine[T]) any
- func BuildVoxtralModel(vc VoxtralConfig, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func BuildWhisperEncoder(wc WhisperConfig, tensors map[string]*tensor.TensorNumeric[float32], ...) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
- func ConvertGraphToTRT(g *graph.Graph[float32], workspaceBytes int, fp16 bool, ...) (*trtConversionResult, error)
- func FuseQuaRotWeights(tensors map[string]*tensor.TensorNumeric[float32], numLayers int) error
- func GenerateDraftTokens[T tensor.Numeric](ctx context.Context, eagleHead *core.EAGLEHead[T], engine compute.Engine[T], ...) ([]int, error)
- func HasEAGLEWeights(tensors map[string]*tensor.TensorNumeric[float32]) bool
- func IsEncoderArchitecture(arch string) bool
- func ListArchitectures() []string
- func LoadEAGLEWeights(tensors map[string]*tensor.TensorNumeric[float32], ...) (*core.EAGLEHead[float32], error)
- func LoadTRTEngine(key string) ([]byte, error)
- func RegisterAlias(shortName, repoID string)
- func RegisterArchitecture(name string, builder ArchBuilder)
- func ResolveAlias(name string) string
- func SaveTRTEngine(key string, data []byte) error
- func SerialExpertDispatch(items []ExpertWork) error
- func TRTCacheKey(modelID, precision string) (string, error)
- type ArchBuilder
- type ArchConfigRegistry
- type AsyncExpertDispatcher
- type ConfigParser
- type ConstantValueGetter
- type DTypeSetter
- type DeviceType
- type DynamicShapeConfig
- type EAGLEConfig
- type EncoderModel
- func (m *EncoderModel) Close() error
- func (m *EncoderModel) Config() *gguf.ModelConfig
- func (m *EncoderModel) Engine() compute.Engine[float32]
- func (m *EncoderModel) Forward(ctx context.Context, inputIDs []int) ([]float32, error)
- func (m *EncoderModel) Graph() *graph.Graph[float32]
- func (m *EncoderModel) OutputShape() []int
- type ExpertPlacement
- type ExpertPlacementPolicy
- type ExpertPrefetcher
- type ExpertWork
- type GGUFModel
- type GenerateOption
- func WithAdapter(name string) GenerateOption
- func WithGrammar(g *grammar.Grammar) GenerateOption
- func WithMaxTokens(n int) GenerateOption
- func WithRepetitionPenalty(p float64) GenerateOption
- func WithStopStrings(ss ...string) GenerateOption
- func WithTemperature(t float64) GenerateOption
- func WithTopK(k int) GenerateOption
- func WithTopP(p float64) GenerateOption
- type JambaConfig
- type LLaVAConfig
- type Mamba3Config
- type MambaConfig
- type Message
- type MoEDeviceMap
- type Model
- func (m *Model) Chat(ctx context.Context, messages []Message, opts ...GenerateOption) (Response, error)
- func (m *Model) ChatStream(ctx context.Context, messages []Message, handler generate.TokenStream, ...) error
- func (m *Model) Close() error
- func (m *Model) Config() ModelMetadata
- func (m *Model) Embed(text string) ([]float32, error)
- func (m *Model) EmbeddingWeights() ([]float32, int)
- func (m *Model) FormatMessages(messages []Message) string
- func (m *Model) Generate(ctx context.Context, prompt string, opts ...GenerateOption) (string, error)
- func (m *Model) GenerateBatch(ctx context.Context, prompts []string, opts ...GenerateOption) ([]string, error)
- func (m *Model) GenerateStream(ctx context.Context, prompt string, handler generate.TokenStream, ...) error
- func (m *Model) Generator() *generate.Generator[float32]
- func (m *Model) Info() *registry.ModelInfo
- func (m *Model) SetEmbeddingWeights(weights []float32, hiddenSize int)
- func (m *Model) SetMaxBatchConcurrency(n int)
- func (m *Model) SpeculativeGenerate(ctx context.Context, draft *Model, prompt string, draftLen int, ...) (string, error)
- func (m *Model) Tokenizer() tokenizer.Tokenizer
- func (m *Model) Transcribe(ctx context.Context, audioBytes []byte, opts ...GenerateOption) (string, error)
- type ModelMetadata
- type NemotronHConfig
- type Option
- func WithBackend(backend string) Option
- func WithCacheDir(dir string) Option
- func WithDType(dtype string) Option
- func WithDevice(device string) Option
- func WithKVDtype(dtype string) Option
- func WithMaxBatchConcurrency(n int) Option
- func WithMaxSeqLen(n int) Option
- func WithMmap(enabled bool) Option
- func WithPrecision(precision string) Option
- func WithQuaRot(enabled bool) Option
- func WithRegistry(r registry.ModelRegistry) Option
- func WithSessionPoolSize(n int) Option
- type PlacementOption
- type PrefetchStats
- type QwenVLConfig
- type RWKVConfig
- type ResidualConfig
- type Response
- type RopeScalingConfig
- type ShapeRange
- type TRTInferenceEngine
- type TrainingPair
- type TransferFunc
- type UnsupportedOpError
- type VoxtralConfig
- type WhisperConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AutoBuild ¶ added in v1.8.0
func AutoBuild(tensors map[string]*tensor.TensorNumeric[float32], cfg *gguf.ModelConfig, engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
AutoBuild reads GGUF metadata from cfg and constructs the appropriate computation graph automatically, without requiring a hand-written per-model builder. It detects architecture features from metadata and delegates to the shared buildTransformerGraph for standard decoder-only transformer architectures.
For non-transformer architectures (Mamba, Whisper, etc.) that have a registered ArchBuilder, AutoBuild falls back to that builder.
For completely unknown architectures with standard decoder-only tensor names, AutoBuild constructs a plain transformer graph.
func BuildArchGraph ¶ added in v1.5.0
func BuildArchGraph(arch string, tensors map[string]*tensor.TensorNumeric[float32], cfg *gguf.ModelConfig, engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildArchGraph dispatches to the appropriate architecture-specific graph builder. Exported for benchmark and integration tests that construct synthetic weight maps without loading from GGUF files.
func BuildEAGLEHead ¶ added in v1.28.0
func BuildEAGLEHead[T tensor.Numeric](engine compute.Engine[T], ops numeric.Arithmetic[T], config EAGLEConfig) (*core.EAGLEHead[T], error)
BuildEAGLEHead constructs an EAGLEHead layer from an engine and config. The returned head can be used with GenerateDraftTokens to produce draft tokens from the penultimate transformer layer's output.
func BuildJamba ¶ added in v1.5.0
func BuildJamba(jc JambaConfig, tensors map[string]*tensor.TensorNumeric[float32], engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildJamba constructs a computation graph for the Jamba hybrid architecture.
Attention layers use tensor names:
blk.{i}.attn_norm.weight
blk.{i}.attn_q.weight, blk.{i}.attn_k.weight, blk.{i}.attn_v.weight, blk.{i}.attn_output.weight
blk.{i}.ffn_norm.weight
blk.{i}.ffn_gate.weight, blk.{i}.ffn_up.weight, blk.{i}.ffn_down.weight
SSM layers use tensor names:
blk.{i}.ssm_norm.weight
blk.{i}.ssm_in_proj.weight, blk.{i}.ssm_conv1d.weight, blk.{i}.ssm_x_proj.weight
blk.{i}.ssm_dt_proj.weight, blk.{i}.ssm_A_log, blk.{i}.ssm_D, blk.{i}.ssm_out_proj.weight
func BuildLLaVAModel ¶ added in v1.7.0
func BuildLLaVAModel(lc LLaVAConfig, tensors map[string]*tensor.TensorNumeric[float32], cfg *gguf.ModelConfig, engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildLLaVAModel constructs the LLaVA computation graph from a weight map. Exported for benchmark and integration tests that construct synthetic weight maps.
func BuildMamba3 ¶ added in v1.5.0
func BuildMamba3(mc MambaConfig, tensors map[string]*tensor.TensorNumeric[float32], engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildMamba3 constructs a computation graph for Mamba-3 from a weight map.
Expected tensor names:
token_embd.weight — [vocab_size, d_model]
output.weight — [vocab_size, d_model]
output_norm.weight — [d_model]
mamba.{i}.norm.weight — [d_model]
mamba.{i}.in_proj.weight — [2*d_inner, d_model]
mamba.{i}.conv1d.weight — [d_inner, 1, d_conv]
mamba.{i}.conv1d.bias — [d_inner] (optional)
mamba.{i}.x_proj.weight — [dt_rank + 2*d_state, d_inner]
mamba.{i}.dt_proj.weight — [d_inner, dt_rank]
mamba.{i}.dt_proj.bias — [d_inner] (optional)
mamba.{i}.A_log — [d_inner, d_state]
mamba.{i}.D — [d_inner]
mamba.{i}.out_proj.weight — [d_model, d_inner]
func BuildMamba3MIMO ¶ added in v1.8.0
func BuildMamba3MIMO(mc Mamba3Config, tensors map[string]*tensor.TensorNumeric[float32], engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildMamba3MIMO constructs a computation graph for Mamba 3 using MIMO SSM blocks with exponential-trapezoidal discretization.
Expected tensor names:
token_embd.weight — [vocab_size, d_model]
output.weight — [vocab_size, d_model]
output_norm.weight — [d_model]
mamba3.{i}.norm.weight — [d_model]
mamba3.{i}.in_proj.weight — [2*d_inner, d_model]
mamba3.{i}.conv1d.weight — [d_inner, 1, d_conv]
mamba3.{i}.x_proj.weight — [dt_rank + 2*d_state*num_heads, d_inner]
mamba3.{i}.dt_proj.weight — [d_inner, dt_rank]
mamba3.{i}.A_log.{h} — [head_dim, d_state] per head
mamba3.{i}.D.{h} — [head_dim] per head
mamba3.{i}.head_mix.weight — [d_inner, d_inner]
mamba3.{i}.out_proj.weight — [d_model, d_inner]
func BuildNemotronH ¶ added in v1.33.0
func BuildNemotronH(nc NemotronHConfig, tensors map[string]*tensor.TensorNumeric[float32], engine compute.Engine[float32], moeEnabled bool) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildNemotronH constructs a computation graph for the Nemotron-H hybrid architecture. When moeEnabled is true, MoE layers are supported via tensor name probing.
Expected GGUF tensor names:
Global:
token_embd.weight, output_norm.weight, output.weight
Mamba layers (blk.{i}.):
attn_norm.weight, ssm_in.weight, ssm_conv1d.weight, ssm_dt.weight,
ssm_A.weight, ssm_D.weight, ssm_out.weight
Attention layers (blk.{i}.):
attn_norm.weight, attn_q.weight, attn_k.weight, attn_v.weight,
attn_output.weight, ffn_norm.weight, ffn_gate.weight, ffn_up.weight,
ffn_down.weight
Dense FFN layers (blk.{i}.):
attn_norm.weight, ffn_gate.weight, ffn_up.weight, ffn_down.weight
MoE layers (blk.{i}., moeEnabled only):
attn_norm.weight, ffn_gate_inp.weight, ffn_gate_exps.weight,
ffn_up_exps.weight, ffn_down_exps.weight
func BuildQwenVLModel ¶ added in v1.8.0
func BuildQwenVLModel(qc QwenVLConfig, tensors map[string]*tensor.TensorNumeric[float32], cfg *gguf.ModelConfig, engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildQwenVLModel constructs the Qwen-VL computation graph from a weight map. Exported for benchmark and integration tests that construct synthetic weight maps.
func BuildRWKV ¶ added in v1.7.0
func BuildRWKV(rc RWKVConfig, tensors map[string]*tensor.TensorNumeric[float32], engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildRWKV constructs a computation graph for the RWKV-6/7 architecture.
Expected tensor names (GGUF RWKV convention):
token_embd.weight — [vocab_size, hidden_size]
output.weight — [vocab_size, hidden_size]
output_norm.weight — [hidden_size]
output_norm.bias — [hidden_size]
blocks.{i}.ln0.weight — [hidden_size] (layer 0 only, pre-embedding norm)
blocks.{i}.ln0.bias — [hidden_size] (layer 0 only)
blocks.{i}.ln1.weight — [hidden_size] (time mixing norm)
blocks.{i}.ln1.bias — [hidden_size]
blocks.{i}.ln2.weight — [hidden_size] (channel mixing norm)
blocks.{i}.ln2.bias — [hidden_size]
blocks.{i}.att.time_mix_r — [1, 1, hidden_size]
blocks.{i}.att.time_mix_k — [1, 1, hidden_size]
blocks.{i}.att.time_mix_v — [1, 1, hidden_size]
blocks.{i}.att.time_mix_g — [1, 1, hidden_size]
blocks.{i}.att.time_decay — [num_heads, head_size]
blocks.{i}.att.time_faaaa — [num_heads, head_size] (initial state)
blocks.{i}.att.receptance.weight — [hidden_size, hidden_size]
blocks.{i}.att.key.weight — [hidden_size, hidden_size]
blocks.{i}.att.value.weight — [hidden_size, hidden_size]
blocks.{i}.att.gate.weight — [hidden_size, hidden_size]
blocks.{i}.att.output.weight — [hidden_size, hidden_size]
blocks.{i}.att.ln_x.weight — [hidden_size] (group norm)
blocks.{i}.att.ln_x.bias — [hidden_size]
blocks.{i}.ffn.time_mix_k — [1, 1, hidden_size]
blocks.{i}.ffn.time_mix_r — [1, 1, hidden_size]
blocks.{i}.ffn.key.weight — [ffn_size, hidden_size]
blocks.{i}.ffn.value.weight — [hidden_size, ffn_size]
blocks.{i}.ffn.receptance.weight — [hidden_size, hidden_size]
func BuildResidualConnection ¶ added in v1.9.0
func BuildResidualConnection[T tensor.Numeric](config ResidualConfig, engine compute.Engine[T]) any
BuildResidualConnection returns a residual handler appropriate for the given config. For "standard" mode (the default), it returns nil — callers should fall through to existing residual-add logic. For "attnres" and "block_attnres" modes, it returns a placeholder (nil for now); the actual implementation will be wired once layers/residual/ ships AttnRes types.
func BuildVoxtralModel ¶ added in v1.35.0
func BuildVoxtralModel(vc VoxtralConfig, tensors map[string]*tensor.TensorNumeric[float32], cfg *gguf.ModelConfig, engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildVoxtralModel constructs the Voxtral computation graph from a weight map. Exported for benchmark and integration tests that construct synthetic weight maps.
func BuildWhisperEncoder ¶ added in v1.5.0
func BuildWhisperEncoder(wc WhisperConfig, tensors map[string]*tensor.TensorNumeric[float32], engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
BuildWhisperEncoder constructs a computation graph for Whisper encoder from a weight map. Exported for benchmark and integration tests that construct synthetic weight maps.
func ConvertGraphToTRT ¶
func ConvertGraphToTRT(g *graph.Graph[float32], workspaceBytes int, fp16 bool, dynamicShapes *DynamicShapeConfig) (*trtConversionResult, error)
ConvertGraphToTRT walks a graph in topological order and maps each node to a TensorRT layer. Returns serialized engine bytes or an UnsupportedOpError if the graph contains operations that cannot be converted. If dynamicShapes is non-nil, an optimization profile is created with the specified min/opt/max dimensions for each input.
func FuseQuaRotWeights ¶ added in v1.29.0
func FuseQuaRotWeights(tensors map[string]*tensor.TensorNumeric[float32], numLayers int) error
FuseQuaRotWeights applies the QuaRot (Quantization with Rotation) technique by fusing a normalized Walsh-Hadamard rotation into model weight matrices. After fusion, inference requires no additional runtime computation — the rotation is baked into the weights.
For each transformer layer, the following projections are rotated:
- Attention: Q, K, V, O projection weights
- FFN: gate, up, down projection weights
The Hadamard matrix H is orthogonal and involutory (H * H = I), so applying it twice recovers the original weights. The rotation improves quantization quality by spreading outlier magnitudes across dimensions (arXiv:2404.00456).
Weight convention: weights are stored as [outDim, inDim] (row-major). The rotation is applied along the input dimension: W_rotated = W * H^T, which is equivalent to rotating the input space. Since H is symmetric (H = H^T for the normalized Hadamard), this simplifies to W_rotated = W * H.
func GenerateDraftTokens ¶ added in v1.28.0
func GenerateDraftTokens[T tensor.Numeric](ctx context.Context, eagleHead *core.EAGLEHead[T], engine compute.Engine[T], penultimateFeatures *tensor.TensorNumeric[T], lmHeadWeight *tensor.TensorNumeric[T], numDrafts int) ([]int, error)
GenerateDraftTokens uses an EAGLEHead to autoregressively generate draft token IDs from the penultimate transformer layer's hidden state.
It feeds penultimateFeatures (shape [1, 1, hidden]) through the EAGLEHead, applies the LM head weight to get logits, takes argmax for a draft token, and repeats numDrafts times. Each iteration feeds the previous EAGLEHead output back as input for the next draft.
Returns a slice of numDrafts draft token IDs.
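The loop can be sketched with toy stand-ins for the EAGLE head and LM head. The closures below are illustrative, not the package's types:

```go
package main

import "fmt"

// draftTokens sketches the GenerateDraftTokens loop: feed a hidden state
// through a (stand-in) draft head, project to logits via a (stand-in) LM
// head, take the argmax as the draft token, then feed the head's output
// back in for the next iteration.
func draftTokens(hidden []float64, head, lmHead func([]float64) []float64, numDrafts int) []int {
	out := make([]int, 0, numDrafts)
	for i := 0; i < numDrafts; i++ {
		hidden = head(hidden)    // next hidden state from the draft head
		logits := lmHead(hidden) // project to vocabulary logits
		best := 0
		for id, l := range logits {
			if l > logits[best] {
				best = id
			}
		}
		out = append(out, best) // greedy (argmax) draft token
	}
	return out
}

func main() {
	head := func(h []float64) []float64 { return []float64{h[0] + 1} }
	lmHead := func(h []float64) []float64 { // 3-token toy vocabulary
		return []float64{0, h[0], -h[0]}
	}
	fmt.Println(draftTokens([]float64{0}, head, lmHead, 3)) // [1 1 1]
}
```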
func HasEAGLEWeights ¶ added in v1.29.0
func HasEAGLEWeights(tensors map[string]*tensor.TensorNumeric[float32]) bool
HasEAGLEWeights returns true if the tensor map contains EAGLE head weights, either under the "eagle." prefix or as unprefixed names.
func IsEncoderArchitecture ¶ added in v1.9.0
func IsEncoderArchitecture(arch string) bool
IsEncoderArchitecture reports whether the given architecture name is an encoder-only model (e.g., BERT, RoBERTa).
func ListArchitectures ¶ added in v1.5.0
func ListArchitectures() []string
ListArchitectures returns a sorted list of all registered architecture names.
func LoadEAGLEWeights ¶ added in v1.29.0
func LoadEAGLEWeights(tensors map[string]*tensor.TensorNumeric[float32], engine compute.Engine[float32], ops numeric.Arithmetic[float32]) (*core.EAGLEHead[float32], error)
LoadEAGLEWeights loads EAGLE head weights from a GGUF tensor map and returns a fully constructed EAGLEHead. It looks for tensors under the "eagle." prefix. If the prefix is not found, it tries unprefixed names (for standalone EAGLE GGUF files).
The function validates that all four weight tensors are present and that their shapes are consistent with each other.
func LoadTRTEngine ¶
func LoadTRTEngine(key string) ([]byte, error)
LoadTRTEngine reads a serialized TensorRT engine from the cache. Returns nil, nil on cache miss (file not found).
func RegisterAlias ¶
func RegisterAlias(shortName, repoID string)
RegisterAlias adds a custom short name -> HuggingFace repo ID mapping.
func RegisterArchitecture ¶ added in v1.5.0
func RegisterArchitecture(name string, builder ArchBuilder)
RegisterArchitecture registers an architecture builder under the given name. Names correspond to GGUF general.architecture values (e.g. "llama", "gemma"). Multiple names can map to the same builder (e.g. "gemma" and "gemma3"). Panics if name is empty or a builder is already registered for that name.
func ResolveAlias ¶
func ResolveAlias(name string) string
ResolveAlias returns the HuggingFace repo ID for a short alias. If the name is not an alias, it is returned unchanged.
func SaveTRTEngine ¶
func SaveTRTEngine(key string, data []byte) error
SaveTRTEngine writes a serialized TensorRT engine to the cache directory.
func SerialExpertDispatch ¶ added in v1.29.0
func SerialExpertDispatch(items []ExpertWork) error
SerialExpertDispatch runs the same expert GEMM work items serially on the calling goroutine. This is the reference implementation used to verify that AsyncExpertDispatcher produces identical results.
func TRTCacheKey ¶
func TRTCacheKey(modelID, precision string) (string, error)
TRTCacheKey builds a deterministic cache key from model ID, precision, and GPU architecture. The key is a hex SHA-256 hash to avoid filesystem issues with long or special-character model IDs.
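The hashing scheme can be sketched as follows. The field separator and ordering are assumptions, since only the hashed inputs are documented:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey hashes the identifying fields into hex SHA-256 so the key is
// filesystem-safe regardless of model ID length or characters, mirroring
// the scheme TRTCacheKey describes.
func cacheKey(modelID, precision, gpuArch string) string {
	sum := sha256.Sum256([]byte(modelID + "|" + precision + "|" + gpuArch))
	return hex.EncodeToString(sum[:])
}

func main() {
	k := cacheKey("org/very-long model:name.gguf", "fp16", "sm_90")
	fmt.Println(len(k)) // always 64 hex characters, deterministic
}
```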
Types ¶
type ArchBuilder ¶ added in v1.5.0
type ArchBuilder func(tensors map[string]*tensor.TensorNumeric[float32], cfg *gguf.ModelConfig, engine compute.Engine[float32]) (*graph.Graph[float32], *tensor.TensorNumeric[float32], error)
ArchBuilder builds a computation graph for a model architecture from pre-loaded GGUF tensors. It returns the graph and the embedding table tensor (needed by the generator for token lookup).
func GetArchitecture ¶ added in v1.5.0
func GetArchitecture(name string) (ArchBuilder, bool)
GetArchitecture returns the builder registered for the given architecture name. Returns nil, false if no builder is registered.
type ArchConfigRegistry ¶
type ArchConfigRegistry struct {
// contains filtered or unexported fields
}
ArchConfigRegistry maps model_type strings to config parsers.
func DefaultArchConfigRegistry ¶
func DefaultArchConfigRegistry() *ArchConfigRegistry
DefaultArchConfigRegistry returns a registry with all built-in parsers registered.
func (*ArchConfigRegistry) Parse ¶
func (r *ArchConfigRegistry) Parse(raw map[string]interface{}) (*ModelMetadata, error)
Parse dispatches to the registered parser for the model_type in raw, or falls back to generic field extraction for unknown types.
func (*ArchConfigRegistry) Register ¶
func (r *ArchConfigRegistry) Register(modelType string, parser ConfigParser)
Register adds a parser for the given model type.
type AsyncExpertDispatcher ¶ added in v1.29.0
type AsyncExpertDispatcher struct {
// contains filtered or unexported fields
}
AsyncExpertDispatcher dispatches CPU-resident expert GEMMs to a goroutine pool so that they execute concurrently with GPU work on shared experts.
Usage:
- Create with NewAsyncExpertDispatcher.
- Call [Dispatch] with CPU expert work items — returns immediately.
- Run GPU shared-expert work concurrently in the calling goroutine.
- Call [Wait] to block until all CPU experts finish and collect errors.
- Call [Shutdown] when the dispatcher is no longer needed.
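The dispatch/wait pattern above can be sketched with a generic worker pool. This illustrates the concurrency pattern only; it is not the package's AsyncExpertDispatcher:

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher submits work to a fixed pool of worker goroutines, lets the
// caller overlap other work, then Wait blocks for completion and returns
// the first error encountered.
type dispatcher struct {
	work chan func() error
	wg   sync.WaitGroup
	mu   sync.Mutex
	err  error
}

func newDispatcher(workers int) *dispatcher {
	d := &dispatcher{work: make(chan func() error)}
	for i := 0; i < workers; i++ {
		go func() {
			for f := range d.work {
				if e := f(); e != nil {
					d.mu.Lock()
					if d.err == nil {
						d.err = e // keep only the first error
					}
					d.mu.Unlock()
				}
				d.wg.Done()
			}
		}()
	}
	return d
}

// Dispatch returns immediately; a feeder goroutine hands items to workers.
func (d *dispatcher) Dispatch(items []func() error) {
	d.wg.Add(len(items))
	go func() {
		for _, f := range items {
			d.work <- f
		}
	}()
}

// Wait blocks until all dispatched items finish.
func (d *dispatcher) Wait() error { d.wg.Wait(); return d.err }

func main() {
	d := newDispatcher(4)
	results := make([]int, 8)
	items := make([]func() error, 8)
	for i := range items {
		i := i
		items[i] = func() error { results[i] = i * i; return nil }
	}
	d.Dispatch(items) // returns immediately
	// ... shared-expert GPU work would run here, concurrently ...
	if err := d.Wait(); err != nil {
		panic(err)
	}
	fmt.Println(results)
}
```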
func NewAsyncExpertDispatcher ¶ added in v1.29.0
func NewAsyncExpertDispatcher(workers int) *AsyncExpertDispatcher
NewAsyncExpertDispatcher creates a dispatcher with the given number of worker goroutines. If workers <= 0, it defaults to 4.
func (*AsyncExpertDispatcher) Dispatch ¶ added in v1.29.0
func (d *AsyncExpertDispatcher) Dispatch(ctx context.Context, items []ExpertWork)
Dispatch submits CPU expert work items for asynchronous execution. It returns immediately. Call [Wait] to block until all items complete. The context is checked before launching each work item; if already cancelled, remaining items are skipped.
func (*AsyncExpertDispatcher) Shutdown ¶ added in v1.29.0
func (d *AsyncExpertDispatcher) Shutdown()
Shutdown releases resources. After Shutdown, the dispatcher must not be reused.
func (*AsyncExpertDispatcher) Wait ¶ added in v1.29.0
func (d *AsyncExpertDispatcher) Wait() error
Wait blocks until all dispatched work items complete. It returns the first error encountered, if any.
type ConfigParser ¶
type ConfigParser func(raw map[string]interface{}) (*ModelMetadata, error)
ConfigParser parses a raw JSON map (from config.json) into ModelMetadata.
type ConstantValueGetter ¶
type ConstantValueGetter interface {
GetValue() *tensor.TensorNumeric[float32]
}
ConstantValueGetter is an interface for nodes that hold constant tensor data.
type DTypeSetter ¶
DTypeSetter is implemented by engines that support setting compute precision.
type DeviceType ¶ added in v1.28.0
type DeviceType int
DeviceType represents a compute device for expert placement.
const (
    // CPU indicates the expert should run on the CPU.
    CPU DeviceType = iota
    // GPU indicates the expert should run on the GPU.
    GPU
)
func (DeviceType) String ¶ added in v1.28.0
func (d DeviceType) String() string
type DynamicShapeConfig ¶
type DynamicShapeConfig struct {
// InputShapes maps input index (0-based) to its shape range.
InputShapes []ShapeRange
}
DynamicShapeConfig specifies per-input shape ranges for TensorRT optimization profiles. When non-nil, the converter creates an optimization profile that allows variable-size inputs within the specified ranges.
type EAGLEConfig ¶ added in v1.28.0
type EAGLEConfig struct {
NumDraftTokens int // number of draft tokens to generate per step
HiddenDim int // hidden dimension of the model
}
EAGLEConfig holds configuration for EAGLE-style self-speculative decoding.
type EncoderModel ¶ added in v1.9.0
type EncoderModel struct {
// contains filtered or unexported fields
}
EncoderModel represents a loaded encoder-only model (BERT, RoBERTa, etc.). Unlike the decoder Model type, EncoderModel has no KV cache, no generator, and no autoregressive decoding loop. It runs a single forward pass over the full input sequence and returns classification logits.
func LoadEncoderFile ¶ added in v1.9.0
func LoadEncoderFile(path string, opts ...Option) (*EncoderModel, error)
LoadEncoderFile loads an encoder-only model from a GGUF file. It verifies the architecture is encoder-only and returns an EncoderModel instead of a Generator-based Model. Returns an error if the architecture is not an encoder type.
func (*EncoderModel) Close ¶ added in v1.9.0
func (m *EncoderModel) Close() error
Close releases resources held by the encoder model.
func (*EncoderModel) Config ¶ added in v1.9.0
func (m *EncoderModel) Config() *gguf.ModelConfig
Config returns the underlying model configuration.
func (*EncoderModel) Engine ¶ added in v1.9.0
func (m *EncoderModel) Engine() compute.Engine[float32]
Engine returns the compute engine.
func (*EncoderModel) Forward ¶ added in v1.9.0
Forward runs the encoder on input token IDs and returns classification logits. The input is a slice of integer token IDs. The returned slice contains logits of shape [1, numClasses] flattened to []float32.
func (*EncoderModel) Graph ¶ added in v1.9.0
func (m *EncoderModel) Graph() *graph.Graph[float32]
Graph returns the computation graph.
func (*EncoderModel) OutputShape ¶ added in v1.9.0
func (m *EncoderModel) OutputShape() []int
OutputShape returns the expected output shape [batch, numClasses].
type ExpertPlacement ¶ added in v1.28.0
type ExpertPlacement struct {
ExpertID int
Device DeviceType
Reason string
}
ExpertPlacement describes the device assignment for a single expert.
type ExpertPlacementPolicy ¶ added in v1.28.0
type ExpertPlacementPolicy struct {
// contains filtered or unexported fields
}
ExpertPlacementPolicy decides which MoE experts run on GPU vs CPU based on routing frequency statistics. Shared experts (frequency 1.0) are always placed on GPU. Routed experts with frequency >= threshold go to GPU; the rest go to CPU.
func NewExpertPlacementPolicy ¶ added in v1.28.0
func NewExpertPlacementPolicy(numExperts int, opts ...PlacementOption) *ExpertPlacementPolicy
NewExpertPlacementPolicy creates a policy for numExperts experts. The default threshold is 0.5; use WithThreshold to override.
func (*ExpertPlacementPolicy) Assign ¶ added in v1.28.0
func (p *ExpertPlacementPolicy) Assign(routingStats map[int]float64) []ExpertPlacement
Assign computes device placements for all experts based on routingStats, which maps expert ID to activation frequency in [0.0, 1.0]. Experts not present in routingStats are assigned to CPU.
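The placement rule reduces to a threshold check per expert. This sketch mirrors the documented behavior, with string device labels standing in for DeviceType:

```go
package main

import "fmt"

// assign places shared experts (frequency 1.0) and routed experts at or
// above the threshold on GPU; everything else, including experts absent
// from the stats, goes to CPU.
func assign(numExperts int, stats map[int]float64, threshold float64) map[int]string {
	placement := make(map[int]string, numExperts)
	for id := 0; id < numExperts; id++ {
		if freq, ok := stats[id]; ok && freq >= threshold {
			placement[id] = "GPU"
		} else {
			placement[id] = "CPU" // missing or infrequent experts stay on CPU
		}
	}
	return placement
}

func main() {
	stats := map[int]float64{0: 1.0, 1: 0.6, 2: 0.1} // expert 3 never observed
	fmt.Println(assign(4, stats, 0.5))
}
```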
func (*ExpertPlacementPolicy) DeviceMap ¶ added in v1.28.0
func (p *ExpertPlacementPolicy) DeviceMap() map[int]DeviceType
DeviceMap returns the current device assignments as a map from expert ID to DeviceType. Returns nil if [Assign] has not been called.
type ExpertPrefetcher ¶ added in v1.29.0
type ExpertPrefetcher struct {
Stats PrefetchStats
// contains filtered or unexported fields
}
ExpertPrefetcher predicts which experts will be needed in the next layer based on routing history and initiates async CPU-to-GPU weight transfers.
The prediction heuristic exploits expert stickiness: tokens tend to route to the same experts across consecutive layers. The prefetcher records routing decisions per layer and predicts that the next layer will use the same set of experts as the current layer.
func NewExpertPrefetcher ¶ added in v1.29.0
func NewExpertPrefetcher(deviceMap *MoEDeviceMap, transfer TransferFunc) *ExpertPrefetcher
NewExpertPrefetcher creates a prefetcher that uses the given device map to determine which experts are CPU-resident and the transfer function to initiate async weight uploads.
func (*ExpertPrefetcher) CheckPrediction ¶ added in v1.29.0
func (p *ExpertPrefetcher) CheckPrediction(layer int, actualIDs []int)
CheckPrediction evaluates how well the prefetch prediction matched the actual routing decision for the given layer. Call this when the real routing for a layer becomes known.
For each expert in actualIDs: if it was predicted (present in the previous layer's routing set), count a hit; otherwise count a miss.
func (*ExpertPrefetcher) ClearHistory ¶ added in v1.29.0
func (p *ExpertPrefetcher) ClearHistory()
ClearHistory removes all stored routing history. Useful between sequences.
func (*ExpertPrefetcher) RecordAndPrefetch ¶ added in v1.29.0
RecordAndPrefetch records the routing decision for the given layer and initiates prefetch transfers for predicted next-layer experts.
The prediction is simple: experts routed in layer L are predicted to also be routed in layer L+1 (sticky routing). Only CPU-resident experts trigger a transfer; GPU-resident experts are already available.
Returns the list of expert IDs for which prefetch was initiated.
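The sticky-routing prediction and hit/miss accounting described for RecordAndPrefetch and CheckPrediction can be sketched as follows. All names here are illustrative stand-ins for the package's types:

```go
package main

import "fmt"

// prefetcher sketches the sticky-routing heuristic: the experts routed
// in layer L are predicted for layer L+1, and hits/misses are counted
// once the real routing for a layer is known.
type prefetcher struct {
	history      map[int][]int // layer -> expert IDs routed in that layer
	hits, misses int64
}

func newPrefetcher() *prefetcher {
	return &prefetcher{history: make(map[int][]int)}
}

// record stores the routing decision for a layer; that same set is the
// prediction for the next layer.
func (p *prefetcher) record(layer int, expertIDs []int) {
	p.history[layer] = append([]int(nil), expertIDs...)
}

// checkPrediction compares the actual routing at `layer` against the
// prediction made from layer-1's routing set.
func (p *prefetcher) checkPrediction(layer int, actualIDs []int) {
	predicted := make(map[int]bool)
	for _, id := range p.history[layer-1] {
		predicted[id] = true
	}
	for _, id := range actualIDs {
		if predicted[id] {
			p.hits++
		} else {
			p.misses++
		}
	}
}

func (p *prefetcher) hitRate() float64 {
	total := p.hits + p.misses
	if total == 0 {
		return 0
	}
	return float64(p.hits) / float64(total)
}

func main() {
	p := newPrefetcher()
	p.record(0, []int{1, 3})
	p.checkPrediction(1, []int{1, 5}) // 1 was predicted (hit), 5 was not (miss)
	fmt.Println(p.hitRate())          // 0.5
}
```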
type ExpertWork ¶ added in v1.29.0
type ExpertWork struct {
ExpertID int
// Weight is the expert FFN weight matrix, row-major [outDim, inDim].
Weight []float32
// Input is the token hidden state, row-major [numTokens, inDim].
Input []float32
// Output is pre-allocated by the caller, row-major [numTokens, outDim].
Output []float32
// M, N, K are the GEMM dimensions: M=numTokens, N=outDim, K=inDim.
M, N, K int
}
ExpertWork describes a single CPU expert GEMM to be dispatched asynchronously. The caller provides the expert weight matrix and input; the dispatcher writes results into Output.
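The GEMM layout documented in ExpertWork (Weight [N, K] row-major, Input [M, K], pre-allocated Output [M, N]) amounts to Output = Input * Weight^T. A minimal reference sketch, using a local struct stand-in:

```go
package main

import "fmt"

// expertWork mirrors the layout documented above: Weight is [N, K]
// row-major, Input is [M, K], Output is pre-allocated [M, N].
type expertWork struct {
	Weight, Input, Output []float32
	M, N, K               int
}

// runGEMM sketches the expert FFN matmul: Output = Input * Weight^T.
func runGEMM(w *expertWork) {
	for m := 0; m < w.M; m++ {
		for n := 0; n < w.N; n++ {
			var acc float32
			for k := 0; k < w.K; k++ {
				acc += w.Input[m*w.K+k] * w.Weight[n*w.K+k]
			}
			w.Output[m*w.N+n] = acc
		}
	}
}

func main() {
	w := &expertWork{
		Weight: []float32{1, 0, 0, 1}, // 2x2 identity
		Input:  []float32{3, 4},       // one token, K=2
		Output: make([]float32, 2),
		M:      1, N: 2, K: 2,
	}
	runGEMM(w)
	fmt.Println(w.Output) // [3 4]
}
```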
type GGUFModel ¶
type GGUFModel struct {
Config *gguf.ModelConfig
Tensors map[string]*tensor.TensorNumeric[float32]
File *gguf.File
}
GGUFModel holds a loaded GGUF model's configuration and tensors. This is an intermediate representation; full inference requires an architecture-specific graph builder to convert these into a computation graph.
func LoadGGUF ¶
LoadGGUF loads a GGUF model file and returns its configuration and tensors. Tensor names are mapped from GGUF convention (blk.N.attn_q.weight) to Zerfoo canonical names (model.layers.N.self_attn.q_proj.weight).
Supports both single-file and split (multi-shard) GGUF files. Split files are detected automatically from the filename pattern (e.g., Model-00001-of-00003.gguf).
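The shard-detection filename pattern can be matched with a regular expression along these lines (an illustrative sketch; `parseShard` and the exact regexp are assumptions, not the package's code):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// splitRe matches the shard naming pattern mentioned above,
// e.g. "Model-00001-of-00003.gguf".
var splitRe = regexp.MustCompile(`^(.+)-(\d{5})-of-(\d{5})\.gguf$`)

// parseShard reports whether name is a split shard and, if so, its
// base name, shard index, and total shard count.
func parseShard(name string) (base string, index, total int, ok bool) {
	m := splitRe.FindStringSubmatch(name)
	if m == nil {
		return "", 0, 0, false
	}
	index, _ = strconv.Atoi(m[2])
	total, _ = strconv.Atoi(m[3])
	return m[1], index, total, true
}

func main() {
	base, idx, total, ok := parseShard("Model-00001-of-00003.gguf")
	fmt.Println(base, idx, total, ok) // Model 1 3 true
}
```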
func LoadGGUFMmap ¶ added in v1.26.0
LoadGGUFMmap loads a GGUF model file using memory-mapped I/O. Instead of reading tensor data into heap-allocated Go slices, the entire file is mmap'd and tensors reference slices of the mapped region via MmapStorage. This gives near-instant startup and keeps tensor data out of the Go heap.
Supports both single-file and split (multi-shard) GGUF files. For split files, each shard is independently mmap'd, allowing models larger than physical RAM to be loaded — the OS pages data in from disk on demand.
The returned io.Closer must be kept alive for the lifetime of the model and closed when the model is no longer needed; closing it releases the memory mapping(s).
func (*GGUFModel) ToModelMetadata ¶
func (m *GGUFModel) ToModelMetadata() *ModelMetadata
ToModelMetadata converts a GGUF model config to inference.ModelMetadata.
type GenerateOption ¶
type GenerateOption func(*generate.SamplingConfig)
GenerateOption configures a generation call.
func WithAdapter ¶ added in v1.29.0
func WithAdapter(name string) GenerateOption
WithAdapter sets the LoRA adapter name for per-request adapter selection.
func WithGrammar ¶
func WithGrammar(g *grammar.Grammar) GenerateOption
WithGrammar sets a grammar state machine for constrained decoding. When set, a token mask is applied at each sampling step to restrict output to tokens that are valid according to the grammar.
func WithMaxTokens ¶
func WithMaxTokens(n int) GenerateOption
WithMaxTokens sets the maximum number of tokens to generate.
func WithRepetitionPenalty ¶
func WithRepetitionPenalty(p float64) GenerateOption
WithRepetitionPenalty sets the repetition penalty factor.
func WithStopStrings ¶
func WithStopStrings(ss ...string) GenerateOption
WithStopStrings sets strings that stop generation.
func WithTemperature ¶
func WithTemperature(t float64) GenerateOption
WithTemperature sets the sampling temperature.
func WithTopP ¶
func WithTopP(p float64) GenerateOption
WithTopP sets the top-P (nucleus) sampling parameter.
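The GenerateOption functions above follow Go's functional options pattern: each returns a closure that mutates a config struct. A self-contained sketch of how such options compose (field and function names here are illustrative stand-ins, not the package's internals):

```go
package main

import "fmt"

// samplingConfig and the options below sketch how GenerateOption-style
// functional options compose.
type samplingConfig struct {
	Temperature float64
	MaxTokens   int
}

type generateOption func(*samplingConfig)

func withTemperature(t float64) generateOption {
	return func(c *samplingConfig) { c.Temperature = t }
}

func withMaxTokens(n int) generateOption {
	return func(c *samplingConfig) { c.MaxTokens = n }
}

// applyOptions starts from defaults and applies each option in order,
// so later options override earlier ones.
func applyOptions(opts ...generateOption) samplingConfig {
	cfg := samplingConfig{Temperature: 1.0, MaxTokens: 128}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := applyOptions(withTemperature(0.7), withMaxTokens(256))
	fmt.Println(cfg.Temperature, cfg.MaxTokens) // 0.7 256
}
```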
type JambaConfig ¶ added in v1.5.0
type JambaConfig struct {
NumLayers int
HiddenSize int
IntermediateSize int
AttnHeads int
KVHeads int
SSMHeads int // number of SSM heads (maps to DState)
AttentionLayerOffset int // attention layers at indices that are multiples of this value
RMSEps float32
VocabSize int
MaxSeqLen int
RopeTheta float64
DConv int // SSM convolution width (default 4)
}
JambaConfig holds Jamba-specific hybrid model configuration.
func JambaConfigFromGGUF ¶ added in v1.5.0
func JambaConfigFromGGUF(cfg *gguf.ModelConfig) JambaConfig
JambaConfigFromGGUF extracts Jamba configuration from GGUF ModelConfig.
type LLaVAConfig ¶ added in v1.7.0
type LLaVAConfig struct {
// Vision encoder config.
ImageSize int
PatchSize int
VisionHiddenDim int
VisionNumHeads int
VisionNumLayers int
NumChannels int
// Multi-modal projector config.
ProjectorType string // "linear" or "mlp" (2-layer MLP is default for LLaVA 1.5+)
}
LLaVAConfig holds LLaVA-specific model configuration.
func LLaVAConfigFromGGUF ¶ added in v1.7.0
func LLaVAConfigFromGGUF(cfg *gguf.ModelConfig) LLaVAConfig
LLaVAConfigFromGGUF extracts LLaVA configuration from GGUF ModelConfig.
type Mamba3Config ¶ added in v1.8.0
type Mamba3Config struct {
NumLayers int
DModel int
DState int
DConv int
DInner int
NumHeads int
VocabSize int
EOSTokenID int
RMSNormEps float32
}
Mamba3Config holds Mamba 3-specific model configuration. Mamba 3 extends Mamba with multi-head MIMO SSM, exponential-trapezoidal discretization, and cross-head mixing.
func Mamba3ConfigFromGGUF ¶ added in v1.8.0
func Mamba3ConfigFromGGUF(cfg *gguf.ModelConfig) Mamba3Config
Mamba3ConfigFromGGUF extracts Mamba 3 configuration from GGUF ModelConfig. Fields are mapped as: HiddenSize -> DModel, NumKVHeads -> DState, IntermediateSize -> DInner, NumHeads -> NumHeads. DConv defaults to 4.
func Mamba3ConfigFromMetadata ¶ added in v1.8.0
func Mamba3ConfigFromMetadata(meta map[string]interface{}) Mamba3Config
Mamba3ConfigFromMetadata extracts Mamba 3 configuration from a raw metadata map.
type MambaConfig ¶ added in v1.5.0
type MambaConfig struct {
NumLayers int
DModel int
DState int
DConv int
DInner int
VocabSize int
EOSTokenID int
RMSNormEps float32
}
MambaConfig holds Mamba-specific model configuration.
func MambaConfigFromGGUF ¶ added in v1.5.0
func MambaConfigFromGGUF(cfg *gguf.ModelConfig) MambaConfig
MambaConfigFromGGUF extracts Mamba configuration from GGUF ModelConfig. Fields are mapped as: HiddenSize -> DModel, NumKVHeads -> DState, IntermediateSize -> DInner. DConv defaults to 4 if not specified.
func MambaConfigFromMetadata ¶ added in v1.5.0
func MambaConfigFromMetadata(meta map[string]interface{}) MambaConfig
MambaConfigFromMetadata extracts Mamba configuration from a raw metadata map.
type Message ¶
type Message struct {
Role string // "system", "user", or "assistant"
Content string
Images [][]byte // optional raw image data for vision models
}
Message represents a chat message.
type MoEDeviceMap ¶ added in v1.29.0
type MoEDeviceMap struct {
// Experts maps expert ID to its assigned device.
Experts map[int]DeviceType
SharedExperts []int
// RoutedExperts lists expert IDs that are conditionally routed.
RoutedExperts []int
}
MoEDeviceMap holds the device assignment for each expert in a model with Mixture of Experts layers. It is built during GGUF loading by SplitMoEWeights and consumed by the graph builder to decide whether expert FFN weights should be uploaded to GPU or kept in CPU memory.
func SplitMoEWeights ¶ added in v1.29.0
func SplitMoEWeights(tensors map[string]*tensor.TensorNumeric[float32], cfg *gguf.ModelConfig) (*MoEDeviceMap, map[string]*tensor.TensorNumeric[float32], map[string]*tensor.TensorNumeric[float32], error)
SplitMoEWeights partitions expert weights between GPU and CPU based on the model configuration. Shared experts (always active) are assigned to GPU. Routed experts are assigned to CPU by default.
The function scans the GGUF tensor map for expert weight patterns (stacked tensors like "blk.N.ffn_gate_exps.weight") and builds a device map that the graph builder can use for placement decisions.
Parameters:
- tensors: the full GGUF tensor map
- cfg: model configuration with NumExperts and NumSharedExperts
Returns the device map and two tensor subsets: gpuTensors contains tensor names that should be uploaded to GPU, cpuTensors contains names that should remain in CPU memory. Non-expert tensors are not included in either subset.
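The name-based partitioning can be sketched as below. The `_shexp.` suffix for shared-expert tensors is an assumption based on common GGUF naming; only the `_exps.` pattern for stacked routed-expert tensors is stated above:

```go
package main

import (
	"fmt"
	"strings"
)

// partitionExpertTensors sketches the scan described above: stacked
// routed-expert tensors (names containing "_exps.") stay on CPU, while
// shared-expert tensors (assumed here to use the "_shexp." suffix) go
// to GPU. Non-expert tensors fall in neither subset.
func partitionExpertTensors(names []string) (gpu, cpu []string) {
	for _, name := range names {
		switch {
		case strings.Contains(name, "_shexp."):
			gpu = append(gpu, name)
		case strings.Contains(name, "_exps."):
			cpu = append(cpu, name)
		}
	}
	return gpu, cpu
}

func main() {
	gpu, cpu := partitionExpertTensors([]string{
		"blk.0.ffn_gate_exps.weight",
		"blk.0.ffn_gate_shexp.weight",
		"blk.0.attn_q.weight", // non-expert: excluded from both
	})
	fmt.Println(gpu, cpu)
}
```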
func (*MoEDeviceMap) CPUExperts ¶ added in v1.29.0
func (m *MoEDeviceMap) CPUExperts() []int
CPUExperts returns the expert IDs assigned to CPU.
func (*MoEDeviceMap) DeviceForExpert ¶ added in v1.29.0
func (m *MoEDeviceMap) DeviceForExpert(expertID int) DeviceType
DeviceForExpert returns the device assignment for the given expert ID. If the expert is not in the map, it returns CPU as a safe default.
func (*MoEDeviceMap) GPUExperts ¶ added in v1.29.0
func (m *MoEDeviceMap) GPUExperts() []int
GPUExperts returns the expert IDs assigned to GPU.
type Model ¶
type Model struct {
// contains filtered or unexported fields
}
Model is a loaded model ready for generation.
func NewTestModel ¶
func NewTestModel(gen *generate.Generator[float32], tok tokenizer.Tokenizer, eng compute.Engine[float32], meta ModelMetadata, info *registry.ModelInfo) *Model
NewTestModel constructs a Model from pre-built components. Intended for use in external test packages that need a Model without going through the full Load pipeline.
func (*Model) Chat ¶
func (m *Model) Chat(ctx context.Context, messages []Message, opts ...GenerateOption) (Response, error)
Chat formats messages using the model's chat template and generates a response. Sessions are pooled to preserve CUDA graph replay.
func (*Model) ChatStream ¶ added in v1.11.0
func (m *Model) ChatStream(ctx context.Context, messages []Message, handler generate.TokenStream, opts ...GenerateOption) error
ChatStream formats messages using the model's chat template and streams the response token-by-token via the provided handler. This is the streaming counterpart of Chat and ensures the same prompt formatting is applied.
func (*Model) Close ¶
Close releases resources held by the model. If the model was loaded on a GPU, this frees the CUDA engine's handles, pool, and stream. If loaded with mmap, this releases the memory mapping.
func (*Model) Embed ¶
Embed returns an L2-normalized embedding vector for the given text by looking up token embeddings from the model's embedding table and mean-pooling them.
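The mean-pool-then-normalize computation described for Embed can be sketched directly (a minimal illustration with a flattened [vocab, hidden] table; `embed` is a local stand-in, not the method's implementation):

```go
package main

import (
	"fmt"
	"math"
)

// embed sketches the documented behaviour: look up each token's row in
// the flattened [vocab, hidden] table, mean-pool the rows, then
// L2-normalize the pooled vector.
func embed(table []float32, hidden int, tokenIDs []int) []float32 {
	out := make([]float32, hidden)
	for _, id := range tokenIDs {
		row := table[id*hidden : (id+1)*hidden]
		for i, v := range row {
			out[i] += v
		}
	}
	for i := range out {
		out[i] /= float32(len(tokenIDs))
	}
	var ss float64
	for _, v := range out {
		ss += float64(v) * float64(v)
	}
	if norm := float32(math.Sqrt(ss)); norm > 0 {
		for i := range out {
			out[i] /= norm
		}
	}
	return out
}

func main() {
	// Two tokens with hidden size 2: rows [1,0] and [0,1].
	table := []float32{1, 0, 0, 1}
	fmt.Println(embed(table, 2, []int{0, 1})) // unit vector, equal components
}
```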
func (*Model) EmbeddingWeights ¶
EmbeddingWeights returns the flattened token embedding table and the hidden dimension. Returns nil, 0 if embeddings are not available.
func (*Model) FormatMessages ¶ added in v1.11.0
FormatMessages converts messages to the model's chat template format. This is useful when callers need the formatted prompt without running inference, e.g. for streaming paths that call GenerateStream separately.
func (*Model) Generate ¶
func (m *Model) Generate(ctx context.Context, prompt string, opts ...GenerateOption) (string, error)
Generate produces text from a prompt. Sessions are pooled to reuse GPU memory addresses, enabling CUDA graph replay across calls. Concurrent Generate calls get separate sessions from the pool.
func (*Model) GenerateBatch ¶
func (m *Model) GenerateBatch(ctx context.Context, prompts []string, opts ...GenerateOption) ([]string, error)
GenerateBatch processes multiple prompts concurrently and returns the generated text for each prompt. Results are returned in the same order as the input prompts. If a prompt fails, its corresponding error is non-nil.
Concurrency is capped at maxBatchConcurrency (default 8) to prevent resource exhaustion on GPU-backed models.
Implementation note: GenerateBatch currently runs prompts in parallel goroutines rather than sharing a PagedKV decode; full multi-sequence batching would require a deeper Generator refactor.
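The bounded fan-out described above (one goroutine per prompt, capped concurrency, results in input order) can be sketched with a semaphore channel; `generateBatch` and `gen` are local stand-ins:

```go
package main

import (
	"fmt"
	"sync"
)

// generateBatch sketches the concurrency pattern: one goroutine per
// prompt, capped by a semaphore, with results written back by index so
// output order matches input order.
func generateBatch(prompts []string, maxConcurrency int, gen func(string) string) []string {
	results := make([]string, len(prompts))
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a concurrency slot
			defer func() { <-sem }() // release it
			results[i] = gen(p)
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	out := generateBatch([]string{"a", "b", "c"}, 2, func(p string) string {
		return p + "!"
	})
	fmt.Println(out) // [a! b! c!]
}
```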
func (*Model) GenerateStream ¶
func (m *Model) GenerateStream(ctx context.Context, prompt string, handler generate.TokenStream, opts ...GenerateOption) error
GenerateStream delivers tokens one at a time via a callback. Sessions are pooled to preserve GPU memory addresses for CUDA graph replay.
func (*Model) SetEmbeddingWeights ¶
SetEmbeddingWeights sets the token embedding table for Embed(). weights is a flattened [vocabSize, hiddenSize] matrix.
func (*Model) SetMaxBatchConcurrency ¶ added in v1.11.0
SetMaxBatchConcurrency sets the maximum number of concurrent goroutines that GenerateBatch will use. Values <= 0 are ignored.
func (*Model) SpeculativeGenerate ¶
func (m *Model) SpeculativeGenerate(ctx context.Context, draft *Model, prompt string, draftLen int, opts ...GenerateOption) (string, error)
SpeculativeGenerate runs speculative decoding using this model as the target and the draft model for token proposal. draftLen controls how many tokens are proposed per verification step.
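The core of one verification step in greedy speculative decoding can be sketched as follows. This is a simplified illustration of the accept/reject logic, not the package's implementation; `verifyDraft` and the toy token slices are assumptions:

```go
package main

import "fmt"

// verifyDraft sketches one verification step: accept the draft's tokens
// up to the first disagreement with the target model's own predictions,
// then append the target's token at that position. If every draft token
// agrees, the target contributes one extra token for free.
func verifyDraft(draftTokens, targetTokens []int) []int {
	accepted := make([]int, 0, len(draftTokens)+1)
	for i, d := range draftTokens {
		if d != targetTokens[i] {
			// Disagreement: take the target's token and stop.
			accepted = append(accepted, targetTokens[i])
			return accepted
		}
		accepted = append(accepted, d)
	}
	// All draft tokens accepted; add the target's next token.
	accepted = append(accepted, targetTokens[len(draftTokens)])
	return accepted
}

func main() {
	// Draft proposed [5, 7, 9]; target predicts [5, 7, 2, 4].
	fmt.Println(verifyDraft([]int{5, 7, 9}, []int{5, 7, 2, 4})) // [5 7 2]
}
```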
func (*Model) Transcribe ¶ added in v1.36.0
func (m *Model) Transcribe(ctx context.Context, audioBytes []byte, opts ...GenerateOption) (string, error)
Transcribe converts audio bytes (WAV format) to text using a speech-to-text model such as Voxtral. It extracts a mel spectrogram from the audio, runs a single forward pass through the model graph, and greedily decodes the output logits at each temporal position until an EOS token is encountered.
This is a parallel (non-autoregressive) decode: the mel spectrogram is processed once through the full encoder+adapter+decoder stack, and each temporal position independently predicts the most likely token. For a causal decoder model this is an approximation — full autoregressive transcription requires a two-phase graph (encode once, decode autoregressively) which is not yet wired.
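The per-position greedy decode described above can be sketched over a flattened [T, vocab] logits matrix (`greedyDecode` is an illustrative stand-in):

```go
package main

import "fmt"

// greedyDecode sketches the parallel decode: each temporal position
// independently takes the argmax token, stopping once EOS is produced.
func greedyDecode(logits []float32, vocab, eosID int) []int {
	var tokens []int
	for t := 0; (t+1)*vocab <= len(logits); t++ {
		row := logits[t*vocab : (t+1)*vocab]
		best := 0
		for i, v := range row {
			if v > row[best] {
				best = i
			}
		}
		if best == eosID {
			break
		}
		tokens = append(tokens, best)
	}
	return tokens
}

func main() {
	// 3 positions, vocab 4, EOS = 3: argmaxes are 1, 2, then EOS.
	logits := []float32{
		0, 9, 0, 0,
		0, 0, 9, 0,
		0, 0, 0, 9,
	}
	fmt.Println(greedyDecode(logits, 4, 3)) // [1 2]
}
```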
type ModelMetadata ¶
type ModelMetadata struct {
Architecture string `json:"architecture"`
VocabSize int `json:"vocab_size"`
HiddenSize int `json:"hidden_size"`
NumLayers int `json:"num_layers"`
MaxPositionEmbeddings int `json:"max_position_embeddings"`
EOSTokenID int `json:"eos_token_id"`
BOSTokenID int `json:"bos_token_id"`
ChatTemplate string `json:"chat_template"`
// Extended fields for multi-architecture support.
IntermediateSize int `json:"intermediate_size"`
NumQueryHeads int `json:"num_attention_heads"`
NumKeyValueHeads int `json:"num_key_value_heads"`
RopeTheta float64 `json:"rope_theta"`
RopeScaling *RopeScalingConfig `json:"rope_scaling,omitempty"`
TieWordEmbeddings bool `json:"tie_word_embeddings"`
SlidingWindow int `json:"sliding_window"`
AttentionBias bool `json:"attention_bias"`
PartialRotaryFactor float64 `json:"partial_rotary_factor"`
LayerNormEps float64 `json:"layer_norm_eps,omitempty"`
// Granite-specific fields.
EmbeddingMultiplier float64 `json:"embedding_multiplier,omitempty"`
ResidualMultiplier float64 `json:"residual_multiplier,omitempty"`
LogitScale float64 `json:"logit_scale,omitempty"`
// Audio model fields.
AudioNumMels int `json:"audio_num_mels,omitempty"` // Number of mel bins; 0 means use architecture default
// DeepSeek MLA and MoE fields.
KVLoRADim int `json:"kv_lora_rank"`
QLoRADim int `json:"q_lora_rank"`
QKRopeHeadDim int `json:"qk_rope_head_dim"`
NumExperts int `json:"num_experts"`
NumExpertsPerToken int `json:"num_experts_per_tok"`
}
ModelMetadata holds model configuration loaded from config.json.
type NemotronHConfig ¶ added in v1.33.0
type NemotronHConfig struct {
NumLayers int
HiddenSize int
IntermediateSize int
AttnHeads int
KVHeads int
SSMStateSize int // SSM state dimension per head
SSMConvKernel int // SSM convolution kernel width (default 4)
SSMNumHeads int // SSM number of heads
RMSEps float32
VocabSize int
MaxSeqLen int
RopeTheta float64
// MoE fields (only used by nemotron_h_moe).
NumExperts int
NumExpertsPerToken int
}
NemotronHConfig holds Nemotron-H-specific hybrid model configuration.
func NemotronHConfigFromGGUF ¶ added in v1.33.0
func NemotronHConfigFromGGUF(cfg *gguf.ModelConfig) NemotronHConfig
NemotronHConfigFromGGUF extracts Nemotron-H configuration from GGUF ModelConfig.
type Option ¶
type Option func(*loadOptions)
Option configures model loading.
func WithBackend ¶
WithBackend selects the inference backend. Supported values: "" or "default" for the standard Engine path, "tensorrt" for TensorRT-optimized inference. TensorRT requires the cuda build tag and a CUDA device.
func WithCacheDir ¶
WithCacheDir sets the model cache directory.
func WithDType ¶
WithDType sets the compute precision for the GPU engine. Supported values: "" or "fp32" for full precision, "fp16" for FP16 compute. FP16 mode converts activations F32->FP16 before GPU kernels and back after. Has no effect on CPU engines.
func WithDevice ¶
WithDevice sets the compute device ("cpu" or "cuda").
func WithKVDtype ¶
WithKVDtype sets the KV cache storage dtype. Supported: "fp32" (default), "fp16". FP16 halves KV cache bandwidth by storing keys/values in half precision.
func WithMaxBatchConcurrency ¶ added in v1.11.0
WithMaxBatchConcurrency sets the maximum number of concurrent goroutines that GenerateBatch will use. Values <= 0 are ignored (the default of 8 is used).
func WithMaxSeqLen ¶
WithMaxSeqLen overrides the model's default max sequence length.
func WithMmap ¶
WithMmap controls memory-mapped model loading. When true (the default), the GGUF file is mapped into memory using mmap instead of reading into heap-allocated slices. This gives near-instant startup, keeps tensor data off the Go heap, and allows loading models larger than physical RAM — the OS pages data in from disk on demand.
func WithPrecision ¶
WithPrecision sets the compute precision for the TensorRT backend. Supported values: "" or "fp32" for full precision, "fp16" for half precision. Has no effect when the backend is not "tensorrt".
func WithQuaRot ¶ added in v1.29.0
WithQuaRot enables QuaRot (Quantization with Rotation) weight fusion. When enabled, a normalized Walsh-Hadamard rotation is fused into Q/K/V/O and FFN gate/up/down weight matrices after loading. This improves quantization quality at zero runtime cost (arXiv:2404.00456).
func WithRegistry ¶
func WithRegistry(r registry.ModelRegistry) Option
WithRegistry provides a custom model registry.
func WithSessionPoolSize ¶ added in v1.12.0
WithSessionPoolSize sets the session pool capacity. The pool buffers inference sessions for reuse so that CUDA graph-captured GPU pointers remain valid across calls. Minimum value is 1; values below 1 are clamped to 1.
type PlacementOption ¶ added in v1.28.0
type PlacementOption func(*ExpertPlacementPolicy)
PlacementOption configures an ExpertPlacementPolicy.
func WithThreshold ¶ added in v1.28.0
func WithThreshold(t float64) PlacementOption
WithThreshold sets the routing frequency threshold above which an expert is placed on GPU. The default threshold is 0.5.
type PrefetchStats ¶ added in v1.29.0
PrefetchStats tracks prediction hit/miss rates for monitoring.
func (*PrefetchStats) HitRate ¶ added in v1.29.0
func (s *PrefetchStats) HitRate() float64
HitRate returns the fraction of predictions that were correct. Returns 0 if no predictions have been evaluated.
func (*PrefetchStats) Reset ¶ added in v1.29.0
func (s *PrefetchStats) Reset()
Reset zeroes the counters.
func (*PrefetchStats) Total ¶ added in v1.29.0
func (s *PrefetchStats) Total() int64
Total returns the total number of predictions evaluated.
type QwenVLConfig ¶ added in v1.8.0
type QwenVLConfig struct {
// Vision encoder config.
ImageSize int
PatchSize int
VisionHiddenDim int
VisionNumHeads int
VisionNumLayers int
NumChannels int
// Multi-modal projector config.
ProjectorType string // "linear" or "mlp"
}
QwenVLConfig holds Qwen-VL-specific model configuration.
func QwenVLConfigFromGGUF ¶ added in v1.8.0
func QwenVLConfigFromGGUF(cfg *gguf.ModelConfig) QwenVLConfig
QwenVLConfigFromGGUF extracts Qwen-VL configuration from GGUF ModelConfig.
type RWKVConfig ¶ added in v1.7.0
type RWKVConfig struct {
NumLayers int
HiddenSize int
VocabSize int
HeadSize int // WKV head size (default 64)
NumHeads int // HiddenSize / HeadSize
LayerNormEps float32
}
RWKVConfig holds RWKV-specific model configuration.
func RWKVConfigFromGGUF ¶ added in v1.7.0
func RWKVConfigFromGGUF(cfg *gguf.ModelConfig) RWKVConfig
RWKVConfigFromGGUF extracts RWKV configuration from GGUF ModelConfig.
type ResidualConfig ¶ added in v1.9.0
type ResidualConfig struct {
Mode string // "standard" (default), "attnres", or "block_attnres"
NumBlocks int // block count for "block_attnres" mode (default 8)
}
ResidualConfig controls the residual connection strategy used by architecture graph builders. The default mode ("standard" or "") preserves existing behaviour. "attnres" and "block_attnres" enable attention-weighted residual connections from the layers/residual package (arXiv:2603.15031).
GGUF metadata convention ¶
Models opt into attention residuals via two GGUF general-metadata keys:
general.residual_mode (string): one of "standard" (default), "attnres", or "block_attnres". When absent or empty, standard additive residuals are used and no extra memory is allocated.
general.attnres_blocks (uint32): number of blocks for the block_attnres variant. Ignored when residual_mode is not "block_attnres". Defaults to 8 when unset, which recovers most of the benefit of full AttnRes.
These keys follow the GGUF general metadata namespace convention (see https://github.com/ggerganov/ggml/blob/master/docs/gguf.md). ResidualConfigFromGGUF parses these values into a ResidualConfig.
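Parsing these two metadata keys with the documented defaults can be sketched as below (a local stand-in for ResidualConfigFromGGUF, assuming the block count is stored as a uint32 per the convention above):

```go
package main

import "fmt"

// residualConfig mirrors the ResidualConfig fields documented above.
type residualConfig struct {
	Mode      string
	NumBlocks int
}

// residualConfigFromMeta sketches parsing the two GGUF metadata keys
// with the documented defaults: a missing or empty mode yields
// "standard", and a missing block count defaults to 8.
func residualConfigFromMeta(meta map[string]interface{}) residualConfig {
	cfg := residualConfig{Mode: "standard", NumBlocks: 8}
	if mode, ok := meta["general.residual_mode"].(string); ok && mode != "" {
		cfg.Mode = mode
	}
	if n, ok := meta["general.attnres_blocks"].(uint32); ok && n > 0 {
		cfg.NumBlocks = int(n)
	}
	return cfg
}

func main() {
	cfg := residualConfigFromMeta(map[string]interface{}{
		"general.residual_mode":  "block_attnres",
		"general.attnres_blocks": uint32(4),
	})
	fmt.Println(cfg.Mode, cfg.NumBlocks) // block_attnres 4
}
```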
func DefaultResidualConfig ¶ added in v1.9.0
func DefaultResidualConfig() ResidualConfig
DefaultResidualConfig returns a ResidualConfig with standard (no-op) residuals.
func ResidualConfigFromGGUF ¶ added in v1.9.0
func ResidualConfigFromGGUF(mode string, numBlocks int) ResidualConfig
ResidualConfigFromGGUF builds a ResidualConfig from GGUF model metadata. Missing keys produce the backward-compatible "standard" default.
type RopeScalingConfig ¶
type RopeScalingConfig struct {
Type string `json:"type"`
Factor float64 `json:"factor"`
OriginalMaxPositionEmbeddings int `json:"original_max_position_embeddings"`
}
RopeScalingConfig holds configuration for RoPE scaling methods (e.g., YaRN).
type ShapeRange ¶
ShapeRange defines min/opt/max dimensions for a single input tensor. Used with DynamicShapeConfig to support variable-size inputs.
type TRTInferenceEngine ¶
type TRTInferenceEngine struct {
// contains filtered or unexported fields
}
TRTInferenceEngine holds a TensorRT engine and execution context for inference. It wraps the serialized engine, providing a Forward method that mirrors the graph forward pass but runs through TensorRT.
func (*TRTInferenceEngine) Close ¶
func (e *TRTInferenceEngine) Close() error
Close releases all TensorRT resources.
func (*TRTInferenceEngine) Forward ¶
func (e *TRTInferenceEngine) Forward(inputs []*tensor.TensorNumeric[float32], outputSize int) (*tensor.TensorNumeric[float32], error)
Forward runs inference through TensorRT with the given input tensors. Input tensors must already be on GPU.
type TrainingPair ¶ added in v1.31.0
type TrainingPair struct {
Input *tensor.TensorNumeric[float32] // [1, 1, hidden]
Target *tensor.TensorNumeric[float32] // [1, 1, hidden]
}
TrainingPair holds a consecutive (input[t], target[t+1]) pair of hidden states from the penultimate transformer layer. These pairs are used to train the EAGLE head to predict the next hidden state.
func CollectPenultimateFeatures ¶ added in v1.31.0
func CollectPenultimateFeatures(modelPath string, corpusTokens []int, maxSamples int) ([]TrainingPair, error)
CollectPenultimateFeatures loads a GGUF model, tokenizes corpus text, runs a forward pass on token chunks, and extracts consecutive penultimate-layer hidden state pairs for EAGLE head training.
The function returns (input[t], target[t+1]) pairs where input[t] is the penultimate hidden state at position t and target[t+1] is the penultimate hidden state at position t+1. maxSamples limits the number of pairs returned.
This requires graph-level instrumentation to capture intermediate node outputs, which is not yet implemented. Use GenerateSyntheticPairs for training-loop validation and GGUF export testing.
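Forming consecutive (input[t], target[t+1]) pairs from a sequence of hidden states, capped at maxSamples, can be sketched as below; `consecutivePairs` is a local stand-in and the hidden states are plain slices for illustration:

```go
package main

import "fmt"

// consecutivePairs sketches pair construction: for each position t,
// input[t] is the hidden state at t and target[t] is the hidden state
// at t+1, stopping once maxSamples pairs have been collected.
func consecutivePairs(hidden [][]float32, maxSamples int) (inputs, targets [][]float32) {
	for t := 0; t+1 < len(hidden) && len(inputs) < maxSamples; t++ {
		inputs = append(inputs, hidden[t])
		targets = append(targets, hidden[t+1])
	}
	return inputs, targets
}

func main() {
	h := [][]float32{{1}, {2}, {3}, {4}}
	in, tg := consecutivePairs(h, 2)
	fmt.Println(len(in), in[0][0], tg[0][0]) // 2 1 2
}
```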
func GenerateSyntheticPairs ¶ added in v1.31.0
func GenerateSyntheticPairs(hiddenDim, count int) ([]TrainingPair, error)
GenerateSyntheticPairs creates random training pairs for EAGLE head training loop validation. Each pair contains random [1, 1, hiddenDim] tensors.
type TransferFunc ¶ added in v1.29.0
TransferFunc is called to begin an async CPU-to-GPU transfer of expert weights. The implementation should copy the expert's weight data into a GPU staging buffer. It receives the expert ID and returns when the transfer has been initiated (not necessarily completed).
type UnsupportedOpError ¶
type UnsupportedOpError struct {
Ops []string
}
UnsupportedOpError lists the operations that cannot be converted to TensorRT.
func (*UnsupportedOpError) Error ¶
func (e *UnsupportedOpError) Error() string
type VoxtralConfig ¶ added in v1.35.0
type VoxtralConfig struct {
// Audio encoder config (Whisper-large-v3 style).
AudioHiddenDim int
AudioNumLayers int
AudioNumHeads int
AudioNumMels int
AudioIntermediateSize int
AudioKernelSize int
// MLP adapter config.
StackFactor int // number of consecutive encoder frames to concatenate (e.g. 4)
}
VoxtralConfig holds Voxtral-specific model configuration.
func VoxtralConfigFromGGUF ¶ added in v1.35.0
func VoxtralConfigFromGGUF(cfg *gguf.ModelConfig) VoxtralConfig
VoxtralConfigFromGGUF extracts Voxtral configuration from GGUF ModelConfig.
type WhisperConfig ¶ added in v1.5.0
WhisperConfig holds Whisper-specific model configuration.
func WhisperConfigFromGGUF ¶ added in v1.5.0
func WhisperConfigFromGGUF(cfg *gguf.ModelConfig) WhisperConfig
WhisperConfigFromGGUF extracts Whisper configuration from GGUF ModelConfig. Fields are mapped as: HiddenSize -> HiddenDim, NumHeads -> NumHeads, NumLayers -> NumLayers. NumMels defaults to 80, KernelSize defaults to 3.
Source Files
¶
- arch_bert.go
- arch_commandr.go
- arch_common.go
- arch_config.go
- arch_dbrx.go
- arch_deepseek.go
- arch_exaone.go
- arch_falcon.go
- arch_gemma.go
- arch_gemma3n.go
- arch_glm.go
- arch_gpt2.go
- arch_granite.go
- arch_internlm2.go
- arch_jamba.go
- arch_kimi.go
- arch_lfm2.go
- arch_llama.go
- arch_llama4.go
- arch_llava.go
- arch_mamba.go
- arch_mamba3.go
- arch_minimax_m2.go
- arch_mistral.go
- arch_mixtral.go
- arch_nemotron_h.go
- arch_olmo2.go
- arch_phi.go
- arch_qwen.go
- arch_qwenvl.go
- arch_rwkv.go
- arch_starcoder2.go
- arch_voxtral.go
- arch_whisper.go
- auto_builder.go
- doc.go
- eagle.go
- eagle_collect.go
- encoder.go
- engine.go
- fused_add_rmsnorm_node.go
- fused_norm_add_node.go
- gguf.go
- inference.go
- load_gguf.go
- mmap_unix.go
- moe_async.go
- moe_loader.go
- moe_placement.go
- moe_prefetch.go
- quarot.go
- registry.go
- registry_init.go
- tensorrt_cache.go
- tensorrt_convert.go
- tensorrt_pipeline.go
- transcribe.go
Directories
¶
| Path | Synopsis |
|---|---|
| guardian | Package guardian implements prompt template rendering for IBM Granite Guardian safety risk evaluation across 13 pre-defined risk categories. |
| lora | Package lora implements loading and validation of LoRA (Low-Rank Adaptation) adapter weights from GGUF files. |
| multimodal | Package multimodal provides audio preprocessing for audio-language model inference. |
| parallel | Package parallel provides tensor and pipeline parallelism for distributing inference across multiple GPUs. |
| sentiment | Package sentiment provides a high-level sentiment classification pipeline that wraps encoder model loading and inference. |
| timeseries | Package timeseries implements time-series model builders. |
| timeseries/features | Package features provides a feature store for the Wolf time-series ML platform. |
| transmla | Package transmla converts standard multi-head attention (MHA) weights into multi-head latent attention (MLA) form via truncated SVD. |