Documentation ¶
Overview ¶
Package gguf provides GGUF file format parsing and writing. (Stability: stable)
It implements a pure-Go parser for the GGUF v3 model format used by llama.cpp, reading metadata key-value pairs and tensor descriptors from a GGUF file without loading tensor data into memory.
Index ¶
- Constants
- func DetectActualArchitecture(f *File, declared string) string
- func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)
- func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)
- func LoadTensorsMmap(f *File, mapped []byte) (map[string]*tensor.TensorNumeric[float32], error)
- func LoadTensorsMmapSplit(sf *SplitFile, mappedShards [][]byte) (map[string]*tensor.TensorNumeric[float32], error)
- func LoadTensorsSplit(sf *SplitFile, readers []*os.File) (map[string]*tensor.TensorNumeric[float32], error)
- func MapTensorName(arch string, ggufName string) string
- func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)
- func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
- func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
- func TensorByteSize(typ GGMLType, numElements int) (int, error)
- type File
- type GGMLType
- type ModelConfig
- type SplitFile
- type TensorInfo
Constants ¶
const (
	TypeUint8   uint32 = 0
	TypeInt8    uint32 = 1
	TypeUint16  uint32 = 2
	TypeInt16   uint32 = 3
	TypeUint32  uint32 = 4
	TypeInt32   uint32 = 5
	TypeFloat32 uint32 = 6
	TypeBool    uint32 = 7
	TypeString  uint32 = 8
	TypeArray   uint32 = 9
	TypeUint64  uint32 = 10
	TypeInt64   uint32 = 11
	TypeFloat64 uint32 = 12
)
GGUF metadata value types.
const Magic uint32 = 0x46554747 // "GGUF" in little-endian
Magic is the GGUF file magic number ("GGUF" in little-endian).
Variables ¶
This section is empty.
Functions ¶
func DetectActualArchitecture ¶ added in v1.23.0
func DetectActualArchitecture(f *File, declared string) string
DetectActualArchitecture checks GGUF metadata to detect models that declare one architecture but are actually a different model family. For example, Mistral GGUF files declare general.architecture = "llama" but are identified by their model name, tokenizer pre-processor, or vocabulary size.
func ExtractTokenizer ¶
func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)
ExtractTokenizer builds a BPETokenizer from GGUF metadata. GGUF files store tokenizer data under the "tokenizer.ggml.*" metadata keys.
func LoadTensors ¶
func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)
LoadTensors reads tensor data from a parsed GGUF file and returns them as float32 tensors keyed by name. Quantized tensors (Q4_0, Q8_0) are stored using their native quantized storage types for memory efficiency.
func LoadTensorsMmap ¶ added in v1.26.0
func LoadTensorsMmap(f *File, mapped []byte) (map[string]*tensor.TensorNumeric[float32], error)
LoadTensorsMmap creates tensors backed by MmapStorage that reference slices of the memory-mapped GGUF file data. No tensor data is copied; each MmapStorage points directly into the mapped region. Dequantization happens lazily on first access via MmapStorage.Slice().
mapped must be the entire GGUF file memory-mapped into a byte slice. The caller must keep the mapping alive for the lifetime of the returned tensors.
func LoadTensorsMmapSplit ¶ added in v1.34.0
func LoadTensorsMmapSplit(sf *SplitFile, mappedShards [][]byte) (map[string]*tensor.TensorNumeric[float32], error)
LoadTensorsMmapSplit creates tensors backed by mmap'd regions from multiple shard files. Each shard is independently mmap'd. Tensor data references the correct shard's mapped region.
func LoadTensorsSplit ¶ added in v1.34.0
func LoadTensorsSplit(sf *SplitFile, readers []*os.File) (map[string]*tensor.TensorNumeric[float32], error)
LoadTensorsSplit reads tensor data from all shards into heap-allocated tensors, copying the data rather than referencing a memory mapping.
func MapTensorName ¶
func MapTensorName(arch string, ggufName string) string
MapTensorName converts a GGUF tensor name to the Zerfoo/HuggingFace canonical name. The arch parameter selects architecture-specific name mappings (e.g., "gemma3" uses different norm names than "llama"). Unknown names pass through unchanged.
func QuantizeToFP8E4M3 ¶
func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)
QuantizeToFP8E4M3 converts all tensors in the map from their current storage to FP8 E4M3 format with per-tensor absmax scaling. This reduces memory to 1 byte per element (1/4 of F32) at the cost of reduced precision. The tensors are modified in place — the returned map is the same object.
func SplitMergedGateUp ¶ added in v1.4.0
func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
SplitMergedGateUp finds merged gate+up MLP tensors (*.mlp.up_proj.weight) where gate_proj is absent and up_proj has double the expected intermediate size. This handles architectures like Phi that concatenate gate and up projections into a single tensor: ffn_up has shape [2 * intermediate_size, hidden_size]. The first half of rows is the gate projection, the second half is the up projection.
func SplitMergedQKV ¶ added in v1.4.0
func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
SplitMergedQKV finds merged QKV projection tensors (*.self_attn.qkv_proj.weight) in the tensor map and splits each into separate Q, K, V projection tensors. This handles architectures like Phi that store merged QKV weights in GGUF.
For MHA (num_heads == num_kv_heads): each projection gets 1/3 of rows. For GQA (num_heads > num_kv_heads): Q gets num_heads*head_dim rows, K and V each get num_kv_heads*head_dim rows.
Types ¶
type File ¶
type File struct {
Version uint32
Metadata map[string]any
Tensors []TensorInfo
DataOffset int64 // byte offset where tensor data begins
}
File represents a parsed GGUF file.
func Parse ¶
func Parse(r io.ReadSeeker) (*File, error)
Parse reads a GGUF file header, metadata, and tensor info from r. It does not read tensor data. The returned File.DataOffset indicates where tensor data begins in the file.
func (*File) GetFloat32 ¶
GetFloat32 returns a metadata value as float32. Handles float32 natively and also converts float64 values (common in some GGUF converters).
type GGMLType ¶
type GGMLType uint32
GGMLType identifies the quantization type of a tensor.
const (
	GGMLTypeF32     GGMLType = 0
	GGMLTypeF16     GGMLType = 1
	GGMLTypeQ4_0    GGMLType = 2
	GGMLTypeQ4_1    GGMLType = 3
	GGMLTypeQ5_0    GGMLType = 6
	GGMLTypeQ5_1    GGMLType = 7
	GGMLTypeQ8_0    GGMLType = 8
	GGMLTypeQ8_1    GGMLType = 9
	GGMLTypeQ2_K    GGMLType = 10
	GGMLTypeQ3_K    GGMLType = 11
	GGMLTypeQ4_K    GGMLType = 12
	GGMLTypeQ5_K    GGMLType = 13
	GGMLTypeQ6_K    GGMLType = 14
	GGMLTypeQ8_K    GGMLType = 15
	GGMLTypeIQ2_XXS GGMLType = 16 // Importance-weighted 2-bit (E8 lattice codebook)
	GGMLTypeIQ3_S   GGMLType = 21 // Importance-weighted 3-bit with sub-block scales
	GGMLTypeIQ4_NL  GGMLType = 25 // Non-linear 4-bit with lookup table
	GGMLTypeBF16    GGMLType = 30
	GGMLTypeTQ2_0   GGMLType = 35 // Ternary 2-bit: 4 values per byte {-1, 0, 1}
)
Common GGML tensor types.
type ModelConfig ¶
type ModelConfig struct {
Architecture string
Name string
VocabSize int
HiddenSize int
NumLayers int
NumHeads int
NumKVHeads int
IntermediateSize int
MaxSeqLen int
RopeTheta float64
HeadDim int // explicit head dimension (0 = use HiddenSize/NumHeads)
LogitSoftcap float32 // if > 0, apply logit softcapping: cap * tanh(logit/cap)
LocalRopeTheta float64 // RoPE base for local/sliding-window layers (0 = use RopeTheta)
SlidingWindow int // sliding window size for local attention layers
SlidingWindowPattern int // every Nth layer is global (0 = all global)
RMSNormEps float32 // RMSNorm epsilon (0 = use default 1e-5)
PartialRotaryFactor float32 // fraction of head dims to apply RoPE (0 = full rotation)
// DeepSeek MLA (Multi-head Latent Attention) fields.
KVLoRADim int // KV compression rank (attention.kv_lora_rank)
QLoRADim int // Q compression rank (attention.q_lora_rank)
QKRopeHeadDim int // RoPE head dimension for Q/K (attention.qk_rope_head_dim)
// DeepSeek MoE (Mixture of Experts) fields.
NumExperts int // number of routed experts (expert_count)
NumExpertsPerToken int // experts activated per token (expert_used_count)
// TransMLA: converted MHA-to-MLA models (see ADR-069).
TransMLAKVLoraDim int // KV LoRA rank from transmla.kv_lora_dim metadata (0 = not a TransMLA model)
// Residual connection configuration.
ResidualMode string // "standard", "attnres", or "block_attnres" (default: "standard")
AttnResNumBlocks int // number of blocks for block_attnres mode (default: 8)
// BERT encoder-only fields.
NumLabels int // number of output classes for sequence classification
PoolerType string // pooling strategy ("cls" or "mean")
LayerNormEps float32 // LayerNorm epsilon (0 = use default 1e-12)
// Granite-specific fields.
EmbeddingMultiplier float32 // multiply embeddings by this factor (0 = no scaling)
ResidualMultiplier float32 // multiply residual connections by this factor (0 = no scaling)
// Nemotron-H SSM (Mamba-2) fields.
SSMStateSize int // SSM state dimension (ssm.state_size)
SSMConvKernel int // SSM convolution kernel width (ssm.conv_kernel)
SSMNumHeads int // SSM number of heads (ssm.num_heads)
ExpertSharedCount int
// MoE expert gating configuration.
ScoringFunc string // expert gating scoring function ("softmax" or "sigmoid"; default: "softmax")
// Vision encoder fields (LLaVA, multimodal models).
VisionImageSize int // vision encoder input image size (e.g. 336)
VisionPatchSize int // vision encoder patch size (e.g. 14)
VisionHiddenSize int // vision encoder hidden dimension
VisionNumHeads int // vision encoder attention heads
VisionNumLayers int // vision encoder transformer layers
ProjectorType string // multi-modal projector type ("linear" or "mlp")
// Audio encoder fields (Voxtral, multimodal speech-to-text models).
AudioHiddenSize int // audio encoder hidden dimension
AudioNumLayers int // audio encoder transformer layers
AudioNumHeads int // audio encoder attention heads
AudioNumMels int // number of mel spectrogram bins (e.g. 128)
AudioIntermediateSize int // audio encoder FFN intermediate size
AudioProjectorType string // audio projector type (e.g. "mlp")
AudioProjectorStackFactor int // number of consecutive frames to stack (e.g. 4)
}
ModelConfig holds model configuration extracted from GGUF metadata.
func ExtractModelConfig ¶
func ExtractModelConfig(f *File) (*ModelConfig, error)
ExtractModelConfig reads GGUF metadata and returns a ModelConfig. The architecture field (general.architecture) determines which metadata key prefix to use (e.g., "llama." or "gemma."). After extracting config using the declared architecture's key prefix, the actual architecture is detected via DetectActualArchitecture to handle models like Mistral that declare "llama" but need different runtime behavior.
type SplitFile ¶ added in v1.34.0
type SplitFile struct {
// File is the merged view: metadata from shard 0, tensors from all shards.
// DataOffset is not meaningful for split files — use ShardIndex instead.
File *File
// Shards holds the parsed header for each shard, indexed by shard number.
Shards []*File
// ShardPaths holds the file paths for each shard.
ShardPaths []string
// ShardIndex maps tensor name to the shard index that contains its data.
ShardIndex map[string]int
}
SplitFile represents a collection of parsed GGUF shards that together form a single model. Metadata comes from shard 0; tensors are distributed across all shards.
func ParseSplit ¶ added in v1.34.0
ParseSplit detects whether path is a split GGUF file and parses all shards. If path is a single (non-split) file, it returns nil, nil and the caller should fall back to Parse. Split files follow the naming convention:
	Model-00001-of-00003.gguf
	Model-00002-of-00003.gguf
	Model-00003-of-00003.gguf
The path may point to any shard; all sibling shards are discovered automatically.