package inference

v1.4.0
Published: Mar 17, 2026 License: Apache-2.0 Imports: 29 Imported by: 0

Documentation

Overview

Package inference provides a high-level API for loading GGUF models and running text generation, chat, embedding, and speculative decoding with minimal boilerplate.

Loading Models

There are two entry points for loading a model:

  • Load resolves a model by name or HuggingFace repo ID, pulling it from the registry if not already cached, and returns a ready-to-use Model.
  • LoadFile loads a model directly from a local GGUF file path.

Both accept functional Option values to configure the compute device, cache directory, sequence length, and other parameters:

m, err := inference.Load("gemma-3-1b-q4",
	inference.WithDevice("cuda"),
	inference.WithMaxSeqLen(4096),
)
if err != nil {
	log.Fatal(err)
}
defer m.Close()

ctx := context.Background()
text, err := m.Generate(ctx, "Explain gradient descent briefly.",
	inference.WithMaxTokens(256),
	inference.WithTemperature(0.7),
)
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)

Model Methods

A loaded Model exposes several generation methods (a usage example follows the list):

  • Model.Generate produces text from a prompt and returns the full result.
  • Model.GenerateStream delivers tokens incrementally via a callback.
  • Model.GenerateBatch processes multiple prompts concurrently.
  • Model.Chat formats a slice of Message values using the model's chat template and generates a Response with token usage statistics.
  • Model.Embed returns an L2-normalized embedding vector for a text input by mean-pooling the model's token embedding table.
  • Model.SpeculativeGenerate runs speculative decoding with a smaller draft model to accelerate generation from a larger target model.
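
For example, a minimal chat round-trip (a sketch; it assumes ctx is a context.Context and m is a loaded Model, and the messages and option values are illustrative):

resp, err := m.Chat(ctx, []inference.Message{
	{Role: "system", Content: "You are a concise assistant."},
	{Role: "user", Content: "What is a KV cache?"},
}, inference.WithMaxTokens(128))
if err != nil {
	log.Fatal(err)
}
fmt.Println(resp.Content)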

Load Options

The following Option functions configure model loading (a combined example follows the list):

  • WithDevice — compute device: "cpu", "cuda", "cuda:N", "rocm", "opencl"
  • WithCacheDir — local directory for cached model files
  • WithMaxSeqLen — override the model's default maximum sequence length
  • WithRegistry — supply a custom model registry
  • WithBackend — select "tensorrt" for TensorRT-optimized inference
  • WithPrecision — set TensorRT compute precision ("fp16")
  • WithDType — set GPU compute precision ("fp16", "fp8")
  • WithKVDtype — set KV cache storage precision ("fp16")
  • WithMmap — enable memory-mapped model loading on unix
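
For instance, a GPU-oriented load might combine several options (a sketch; device and precision availability depend on the build and hardware, and the cache path is illustrative):

m, err := inference.Load("llama-3-8b-q4",
	inference.WithDevice("cuda:0"),
	inference.WithDType("fp16"),
	inference.WithKVDtype("fp16"),
	inference.WithCacheDir("/var/cache/zerfoo"),
)
if err != nil {
	log.Fatal(err)
}
defer m.Close()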

Generate Options

The following GenerateOption functions configure sampling for generation methods (a combined example follows the list):

  • WithMaxTokens — maximum number of tokens to generate
  • WithTemperature — sampling temperature
  • WithTopK — top-K sampling parameter
  • WithTopP — top-P (nucleus) sampling parameter
  • WithRepetitionPenalty — repetition penalty factor
  • WithStopStrings — strings that stop generation
  • WithGrammar — grammar state machine for constrained decoding
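
These options compose across all generation methods; the values below are illustrative:

text, err := m.Generate(ctx, "List three sorting algorithms.",
	inference.WithTemperature(0.2),
	inference.WithTopK(40),
	inference.WithTopP(0.9),
	inference.WithRepetitionPenalty(1.1),
	inference.WithStopStrings("\n\n"),
)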

Model Aliases

Short aliases such as "gemma-3-1b-q4" and "llama-3-8b-q4" map to full HuggingFace repository IDs. Use ResolveAlias to look up the mapping and RegisterAlias to add custom aliases.
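
For example (the repository ID shown is hypothetical):

inference.RegisterAlias("my-model", "example-org/my-model-gguf")
fmt.Println(inference.ResolveAlias("my-model")) // example-org/my-model-gguf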

For lower-level control over text generation, KV caching, and sampling, see the github.com/zerfoo/zerfoo/generate package. For an OpenAI-compatible HTTP server built on top of this package, see github.com/zerfoo/zerfoo/serve.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ConvertGraphToTRT

func ConvertGraphToTRT(g *graph.Graph[float32], workspaceBytes int, fp16 bool, dynamicShapes *DynamicShapeConfig) (*trtConversionResult, error)

ConvertGraphToTRT walks a graph in topological order and maps each node to a TensorRT layer. Returns serialized engine bytes or an UnsupportedOpError if the graph contains operations that cannot be converted. If dynamicShapes is non-nil, an optimization profile is created with the specified min/opt/max dimensions for each input.

func LoadTRTEngine

func LoadTRTEngine(key string) ([]byte, error)

LoadTRTEngine reads a serialized TensorRT engine from the cache. Returns nil, nil on cache miss (file not found).

func RegisterAlias

func RegisterAlias(shortName, repoID string)

RegisterAlias adds a custom short name -> HuggingFace repo ID mapping.

func ResolveAlias

func ResolveAlias(name string) string

ResolveAlias returns the HuggingFace repo ID for a short alias. If the name is not an alias, it is returned unchanged.

func SaveTRTEngine

func SaveTRTEngine(key string, data []byte) error

SaveTRTEngine writes a serialized TensorRT engine to the cache directory.

func TRTCacheKey

func TRTCacheKey(modelID, precision string) (string, error)

TRTCacheKey builds a deterministic cache key from model ID, precision, and GPU architecture. The key is a hex SHA-256 hash to avoid filesystem issues with long or special-character model IDs.
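
A typical cache round-trip looks like this (a sketch; engineBytes stands in for serialized engine output from a prior ConvertGraphToTRT call):

key, err := inference.TRTCacheKey("llama-3-8b-q4", "fp16")
if err != nil {
	log.Fatal(err)
}
data, err := inference.LoadTRTEngine(key)
if err != nil {
	log.Fatal(err)
}
if data == nil {
	// Cache miss: build the engine, then persist it for next time.
	if err := inference.SaveTRTEngine(key, engineBytes); err != nil {
		log.Fatal(err)
	}
}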

Types

type ArchConfigRegistry

type ArchConfigRegistry struct {
	// contains filtered or unexported fields
}

ArchConfigRegistry maps model_type strings to config parsers.

func DefaultArchConfigRegistry

func DefaultArchConfigRegistry() *ArchConfigRegistry

DefaultArchConfigRegistry returns a registry with all built-in parsers registered.

func (*ArchConfigRegistry) Parse

func (r *ArchConfigRegistry) Parse(raw map[string]interface{}) (*ModelMetadata, error)

Parse dispatches to the registered parser for the model_type in raw, or falls back to generic field extraction for unknown types.

func (*ArchConfigRegistry) Register

func (r *ArchConfigRegistry) Register(modelType string, parser ConfigParser)

Register adds a parser for the given model type.
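
A sketch of registering a parser for a hypothetical "mymodel" type:

reg := inference.DefaultArchConfigRegistry()
reg.Register("mymodel", func(raw map[string]interface{}) (*inference.ModelMetadata, error) {
	meta := &inference.ModelMetadata{Architecture: "mymodel"}
	if v, ok := raw["hidden_size"].(float64); ok { // JSON numbers decode as float64
		meta.HiddenSize = int(v)
	}
	return meta, nil
})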

type ConfigParser

type ConfigParser func(raw map[string]interface{}) (*ModelMetadata, error)

ConfigParser parses a raw JSON map (from config.json) into ModelMetadata.

type ConstantValueGetter

type ConstantValueGetter interface {
	GetValue() *tensor.TensorNumeric[float32]
}

ConstantValueGetter is an interface for nodes that hold constant tensor data.

type DTypeSetter

type DTypeSetter interface {
	SetDType(compute.DType)
}

DTypeSetter is implemented by engines that support setting compute precision.

type DynamicShapeConfig

type DynamicShapeConfig struct {
	// InputShapes maps input index (0-based) to its shape range.
	InputShapes []ShapeRange
}

DynamicShapeConfig specifies per-input shape ranges for TensorRT optimization profiles. When non-nil, the converter creates an optimization profile that allows variable-size inputs within the specified ranges.

type GGUFModel

type GGUFModel struct {
	Config  *gguf.ModelConfig
	Tensors map[string]*tensor.TensorNumeric[float32]
	File    *gguf.File
}

GGUFModel holds a loaded GGUF model's configuration and tensors. This is an intermediate representation; full inference requires an architecture-specific graph builder to convert these into a computation graph.

func LoadGGUF

func LoadGGUF(path string) (*GGUFModel, error)

LoadGGUF loads a GGUF model file and returns its configuration and tensors. Tensor names are mapped from GGUF convention (blk.N.attn_q.weight) to Zerfoo canonical names (model.layers.N.self_attn.q_proj.weight).
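
For example (the file path is illustrative):

gm, err := inference.LoadGGUF("/models/gemma-3-1b-q4.gguf")
if err != nil {
	log.Fatal(err)
}
meta := gm.ToModelMetadata()
fmt.Println(meta.Architecture, meta.NumLayers)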

func (*GGUFModel) ToModelMetadata

func (m *GGUFModel) ToModelMetadata() *ModelMetadata

ToModelMetadata converts a GGUF model config to inference.ModelMetadata.

type GenerateOption

type GenerateOption func(*generate.SamplingConfig)

GenerateOption configures a generation call.

func WithGrammar

func WithGrammar(g *grammar.Grammar) GenerateOption

WithGrammar sets a grammar state machine for constrained decoding. When set, a token mask is applied at each sampling step to restrict output to tokens that are valid according to the grammar.

func WithMaxTokens

func WithMaxTokens(n int) GenerateOption

WithMaxTokens sets the maximum number of tokens to generate.

func WithRepetitionPenalty

func WithRepetitionPenalty(p float64) GenerateOption

WithRepetitionPenalty sets the repetition penalty factor.

func WithStopStrings

func WithStopStrings(ss ...string) GenerateOption

WithStopStrings sets strings that stop generation.

func WithTemperature

func WithTemperature(t float64) GenerateOption

WithTemperature sets the sampling temperature.

func WithTopK

func WithTopK(k int) GenerateOption

WithTopK sets the top-K sampling parameter.

func WithTopP

func WithTopP(p float64) GenerateOption

WithTopP sets the top-P (nucleus) sampling parameter.

type Message

type Message struct {
	Role    string // "system", "user", or "assistant"
	Content string
}

Message represents a chat message.

type Model

type Model struct {
	// contains filtered or unexported fields
}

Model is a loaded model ready for generation.

func Load

func Load(modelID string, opts ...Option) (*Model, error)

Load loads a model by ID, pulling it if not cached.

func LoadFile

func LoadFile(path string, opts ...Option) (*Model, error)

LoadFile loads a model from a local GGUF file and returns a ready-to-use Model.

func NewTestModel

func NewTestModel(
	gen *generate.Generator[float32],
	tok tokenizer.Tokenizer,
	eng compute.Engine[float32],
	meta ModelMetadata,
	info *registry.ModelInfo,
) *Model

NewTestModel constructs a Model from pre-built components. Intended for use in external test packages that need a Model without going through the full Load pipeline.

func (*Model) Chat

func (m *Model) Chat(ctx context.Context, messages []Message, opts ...GenerateOption) (Response, error)

Chat formats messages using the model's chat template and generates a response. Sessions are pooled to preserve CUDA graph replay.

func (*Model) Close

func (m *Model) Close() error

Close releases resources held by the model. If the model was loaded on a GPU, this frees the CUDA engine's handles, pool, and stream. If loaded with mmap, this releases the memory mapping.

func (*Model) Config

func (m *Model) Config() ModelMetadata

Config returns the model metadata.

func (*Model) Embed

func (m *Model) Embed(text string) ([]float32, error)

Embed returns an L2-normalized embedding vector for the given text by looking up token embeddings from the model's embedding table and mean-pooling them.
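
Because the returned vectors are L2-normalized, the cosine similarity of two texts reduces to a dot product (sketch):

a, err := m.Embed("machine learning")
if err != nil {
	log.Fatal(err)
}
b, err := m.Embed("statistical learning")
if err != nil {
	log.Fatal(err)
}
var sim float32
for i := range a {
	sim += a[i] * b[i]
}
fmt.Println("cosine similarity:", sim)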

func (*Model) EmbeddingWeights

func (m *Model) EmbeddingWeights() ([]float32, int)

EmbeddingWeights returns the flattened token embedding table and the hidden dimension. Returns nil, 0 if embeddings are not available.

func (*Model) Generate

func (m *Model) Generate(ctx context.Context, prompt string, opts ...GenerateOption) (string, error)

Generate produces text from a prompt. Sessions are pooled to reuse GPU memory addresses, enabling CUDA graph replay across calls. Concurrent Generate calls get separate sessions from the pool.

func (*Model) GenerateBatch

func (m *Model) GenerateBatch(ctx context.Context, prompts []string, opts ...GenerateOption) ([]string, error)

GenerateBatch processes multiple prompts concurrently and returns the generated text for each prompt. Results are returned in the same order as the input prompts. If any prompt fails, the returned error is non-nil.

Note: batching is currently implemented with parallel goroutines rather than a shared PagedKV decode; full multi-sequence batching would require a deeper Generator refactor.
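
For example (illustrative prompts):

results, err := m.GenerateBatch(ctx, []string{
	"Translate 'hello' to French.",
	"Translate 'hello' to Spanish.",
}, inference.WithMaxTokens(32))
if err != nil {
	log.Fatal(err)
}
for i, r := range results {
	fmt.Printf("prompt %d: %s\n", i, r)
}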

func (*Model) GenerateStream

func (m *Model) GenerateStream(ctx context.Context, prompt string, handler generate.TokenStream, opts ...GenerateOption) error

GenerateStream delivers tokens one at a time via a callback. Sessions are pooled to preserve GPU memory addresses for CUDA graph replay.
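
A streaming sketch, assuming generate.TokenStream can be satisfied by a per-token callback (consult the generate package for the exact handler type):

err := m.GenerateStream(ctx, "Tell me about Go.", func(token string) error {
	fmt.Print(token) // assumed callback shape; see generate.TokenStream
	return nil
}, inference.WithMaxTokens(128))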

func (*Model) Generator

func (m *Model) Generator() *generate.Generator[float32]

Generator returns the underlying generator.

func (*Model) Info

func (m *Model) Info() *registry.ModelInfo

Info returns the registry info for this model.

func (*Model) SetEmbeddingWeights

func (m *Model) SetEmbeddingWeights(weights []float32, hiddenSize int)

SetEmbeddingWeights sets the token embedding table for Embed(). weights is a flattened [vocabSize, hiddenSize] matrix.

func (*Model) SpeculativeGenerate

func (m *Model) SpeculativeGenerate(
	ctx context.Context,
	draft *Model,
	prompt string,
	draftLen int,
	opts ...GenerateOption,
) (string, error)

SpeculativeGenerate runs speculative decoding using this model as the target and the draft model for token proposal. draftLen controls how many tokens are proposed per verification step.
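
For example, pairing a small draft model with a larger target (the model name and draftLen are illustrative):

draft, err := inference.Load("gemma-3-1b-q4")
if err != nil {
	log.Fatal(err)
}
defer draft.Close()

text, err := m.SpeculativeGenerate(ctx, draft, "Explain attention in one paragraph.", 4,
	inference.WithMaxTokens(256),
)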

func (*Model) Tokenizer

func (m *Model) Tokenizer() tokenizer.Tokenizer

Tokenizer returns the model's tokenizer for token counting.

type ModelMetadata

type ModelMetadata struct {
	Architecture          string `json:"architecture"`
	VocabSize             int    `json:"vocab_size"`
	HiddenSize            int    `json:"hidden_size"`
	NumLayers             int    `json:"num_layers"`
	MaxPositionEmbeddings int    `json:"max_position_embeddings"`
	EOSTokenID            int    `json:"eos_token_id"`
	BOSTokenID            int    `json:"bos_token_id"`
	ChatTemplate          string `json:"chat_template"`

	// Extended fields for multi-architecture support.
	IntermediateSize    int                `json:"intermediate_size"`
	NumQueryHeads       int                `json:"num_attention_heads"`
	NumKeyValueHeads    int                `json:"num_key_value_heads"`
	RopeTheta           float64            `json:"rope_theta"`
	RopeScaling         *RopeScalingConfig `json:"rope_scaling,omitempty"`
	TieWordEmbeddings   bool               `json:"tie_word_embeddings"`
	SlidingWindow       int                `json:"sliding_window"`
	AttentionBias       bool               `json:"attention_bias"`
	PartialRotaryFactor float64            `json:"partial_rotary_factor"`

	// DeepSeek MLA and MoE fields.
	KVLoRADim          int `json:"kv_lora_rank"`
	QLoRADim           int `json:"q_lora_rank"`
	QKRopeHeadDim      int `json:"qk_rope_head_dim"`
	NumExperts         int `json:"num_experts"`
	NumExpertsPerToken int `json:"num_experts_per_tok"`
	NumSharedExperts   int `json:"n_shared_experts"`
}

ModelMetadata holds model configuration loaded from config.json.

type Option

type Option func(*loadOptions)

Option configures model loading.

func WithBackend

func WithBackend(backend string) Option

WithBackend selects the inference backend. Supported values: "" or "default" for the standard Engine path, "tensorrt" for TensorRT-optimized inference. TensorRT requires the cuda build tag and a CUDA device.

func WithCacheDir

func WithCacheDir(dir string) Option

WithCacheDir sets the model cache directory.

func WithDType

func WithDType(dtype string) Option

WithDType sets the compute precision for the GPU engine. Supported values: "" or "fp32" for full precision, "fp16" for FP16 compute. FP16 mode converts activations FP32->FP16 before GPU kernels and back after. Has no effect on CPU engines.

func WithDevice

func WithDevice(device string) Option

WithDevice sets the compute device ("cpu", "cuda", "cuda:N", "rocm", or "opencl").

func WithKVDtype

func WithKVDtype(dtype string) Option

WithKVDtype sets the KV cache storage dtype. Supported: "fp32" (default), "fp16". FP16 halves KV cache bandwidth by storing keys/values in half precision.

func WithMaxSeqLen

func WithMaxSeqLen(n int) Option

WithMaxSeqLen overrides the model's default max sequence length.

func WithMmap

func WithMmap(enabled bool) Option

WithMmap enables memory-mapped model loading. When true, the ZMF file is mapped into memory using syscall.Mmap instead of os.ReadFile, avoiding heap allocation for model weights. Only supported on unix platforms.

func WithPrecision

func WithPrecision(precision string) Option

WithPrecision sets the compute precision for the TensorRT backend. Supported values: "" or "fp32" for full precision, "fp16" for half precision. Has no effect when the backend is not "tensorrt".

func WithRegistry

func WithRegistry(r registry.ModelRegistry) Option

WithRegistry provides a custom model registry.

type Response

type Response struct {
	Content          string
	TokensUsed       int
	PromptTokens     int
	CompletionTokens int
}

Response holds the result of a chat completion.

type RopeScalingConfig

type RopeScalingConfig struct {
	Type                          string  `json:"type"`
	Factor                        float64 `json:"factor"`
	OriginalMaxPositionEmbeddings int     `json:"original_max_position_embeddings"`
}

RopeScalingConfig holds configuration for RoPE scaling methods (e.g., YaRN).

type ShapeRange

type ShapeRange struct {
	Min []int32
	Opt []int32
	Max []int32
}

ShapeRange defines min/opt/max dimensions for a single input tensor. Used with DynamicShapeConfig to support variable-size inputs.

type TRTInferenceEngine

type TRTInferenceEngine struct {
	// contains filtered or unexported fields
}

TRTInferenceEngine holds a TensorRT engine and execution context for inference. It wraps the serialized engine, providing a Forward method that mirrors the graph forward pass but runs through TensorRT.

func (*TRTInferenceEngine) Close

func (e *TRTInferenceEngine) Close() error

Close releases all TensorRT resources.

func (*TRTInferenceEngine) Forward

func (e *TRTInferenceEngine) Forward(inputs []*tensor.TensorNumeric[float32], outputSize int) (*tensor.TensorNumeric[float32], error)

Forward runs inference through TensorRT with the given input tensors. Input tensors must already be on GPU.

type UnsupportedOpError

type UnsupportedOpError struct {
	Ops []string
}

UnsupportedOpError lists the operations that cannot be converted to TensorRT.
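
Callers can detect conversion failures with errors.As (a sketch; g stands for a previously built *graph.Graph[float32]):

_, err := inference.ConvertGraphToTRT(g, 1<<30, true, nil)
var unsup *inference.UnsupportedOpError
if errors.As(err, &unsup) {
	log.Printf("graph contains unconvertible ops: %v", unsup.Ops)
}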

func (*UnsupportedOpError) Error

func (e *UnsupportedOpError) Error() string
