Documentation ¶
Overview ¶
Package gguf provides GGUF file format parsing. (Stability: stable)
Package gguf implements a pure-Go parser for the GGUF v3 model format used by llama.cpp. It reads metadata key-value pairs and tensor descriptors from a GGUF file without loading tensor data into memory.
Index ¶
- Constants
- func DetectActualArchitecture(f *File, declared string) string
- func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)
- func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)
- func LoadTensorsMmap(f *File, mapped []byte) (map[string]*tensor.TensorNumeric[float32], error)
- func MapTensorName(arch string, ggufName string) string
- func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)
- func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
- func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
- type File
- type GGMLType
- type ModelConfig
- type TensorInfo
Constants ¶
const (
	TypeUint8   uint32 = 0
	TypeInt8    uint32 = 1
	TypeUint16  uint32 = 2
	TypeInt16   uint32 = 3
	TypeUint32  uint32 = 4
	TypeInt32   uint32 = 5
	TypeFloat32 uint32 = 6
	TypeBool    uint32 = 7
	TypeString  uint32 = 8
	TypeArray   uint32 = 9
	TypeUint64  uint32 = 10
	TypeInt64   uint32 = 11
	TypeFloat64 uint32 = 12
)
GGUF metadata value types.
const Magic uint32 = 0x46554747 // "GGUF" in little-endian
Magic is the GGUF file magic number ("GGUF" in little-endian).
Variables ¶
This section is empty.
Functions ¶
func DetectActualArchitecture ¶ added in v1.23.0
func DetectActualArchitecture(f *File, declared string) string
DetectActualArchitecture checks GGUF metadata to detect models that declare one architecture but are actually a different model family. For example, Mistral GGUF files declare general.architecture = "llama" but are identified by their model name, tokenizer pre-processor, or vocabulary size.
func ExtractTokenizer ¶
func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)
ExtractTokenizer builds a BPETokenizer from GGUF metadata. GGUF files store tokenizer data under the "tokenizer.ggml.*" metadata keys.
func LoadTensors ¶
func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)
LoadTensors reads tensor data from a parsed GGUF file and returns them as float32 tensors keyed by name. Quantized tensors (Q4_0, Q8_0) are stored using their native quantized storage types for memory efficiency.
func LoadTensorsMmap ¶ added in v1.26.0
func LoadTensorsMmap(f *File, mapped []byte) (map[string]*tensor.TensorNumeric[float32], error)
LoadTensorsMmap creates tensors backed by MmapStorage that reference slices of the memory-mapped GGUF file data. No tensor data is copied -- each MmapStorage points directly into the mapped region. Dequantization happens lazily on first access via MmapStorage.Slice().
mapped must be the entire GGUF file memory-mapped into a byte slice. The caller must keep the mapping alive for the lifetime of the returned tensors.
func MapTensorName ¶
func MapTensorName(arch string, ggufName string) string
MapTensorName converts a GGUF tensor name to the Zerfoo/HuggingFace canonical name. The arch parameter selects architecture-specific name mappings (e.g., "gemma3" uses different norm names than "llama"). Unknown names pass through unchanged.
func QuantizeToFP8E4M3 ¶
func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)
QuantizeToFP8E4M3 converts all tensors in the map from their current storage to FP8 E4M3 format with per-tensor absmax scaling. This reduces memory to 1 byte per element (1/4 of F32) at the cost of reduced precision. The tensors are modified in place — the returned map is the same object.
func SplitMergedGateUp ¶ added in v1.4.0
func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
SplitMergedGateUp finds merged gate+up MLP tensors (*.mlp.up_proj.weight) where gate_proj is absent and up_proj has double the expected intermediate size. This handles architectures like Phi that concatenate gate and up projections into a single tensor: ffn_up has shape [2 * intermediate_size, hidden_size]. The first half of rows is the gate projection, the second half is the up projection.
func SplitMergedQKV ¶ added in v1.4.0
func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
SplitMergedQKV finds merged QKV projection tensors (*.self_attn.qkv_proj.weight) in the tensor map and splits each into separate Q, K, V projection tensors. This handles architectures like Phi that store merged QKV weights in GGUF.
For MHA (num_heads == num_kv_heads): each projection gets 1/3 of rows. For GQA (num_heads > num_kv_heads): Q gets num_heads*head_dim rows, K and V each get num_kv_heads*head_dim rows.
Types ¶
type File ¶
type File struct {
Version uint32
Metadata map[string]any
Tensors []TensorInfo
DataOffset int64 // byte offset where tensor data begins
}
File represents a parsed GGUF file.
func Parse ¶
func Parse(r io.ReadSeeker) (*File, error)
Parse reads a GGUF file header, metadata, and tensor info from r. It does not read tensor data. The returned File.DataOffset indicates where tensor data begins in the file.
func (*File) GetFloat32 ¶
GetFloat32 returns a metadata value as float32. Handles float32 natively and also converts float64 values (common in some GGUF converters).
type GGMLType ¶
type GGMLType uint32
GGMLType identifies the quantization type of a tensor.
const (
	GGMLTypeF32  GGMLType = 0
	GGMLTypeF16  GGMLType = 1
	GGMLTypeQ4_0 GGMLType = 2
	GGMLTypeQ4_1 GGMLType = 3
	GGMLTypeQ5_0 GGMLType = 6
	GGMLTypeQ5_1 GGMLType = 7
	GGMLTypeQ8_0 GGMLType = 8
	GGMLTypeQ8_1 GGMLType = 9
	GGMLTypeQ2_K GGMLType = 10
	GGMLTypeQ3_K GGMLType = 11
	GGMLTypeQ4_K GGMLType = 12
	GGMLTypeQ5_K GGMLType = 13
	GGMLTypeQ6_K GGMLType = 14
	GGMLTypeQ8_K GGMLType = 15
	GGMLTypeBF16 GGMLType = 30
)
Common GGML tensor types.
type ModelConfig ¶
type ModelConfig struct {
Architecture string
Name string
VocabSize int
HiddenSize int
NumLayers int
NumHeads int
NumKVHeads int
IntermediateSize int
MaxSeqLen int
RopeTheta float64
HeadDim int // explicit head dimension (0 = use HiddenSize/NumHeads)
LogitSoftcap float32 // if > 0, apply logit softcapping: cap * tanh(logit/cap)
LocalRopeTheta float64 // RoPE base for local/sliding-window layers (0 = use RopeTheta)
SlidingWindow int // sliding window size for local attention layers
SlidingWindowPattern int // every Nth layer is global (0 = all global)
RMSNormEps float32 // RMSNorm epsilon (0 = use default 1e-5)
PartialRotaryFactor float32 // fraction of head dims to apply RoPE (0 = full rotation)
// DeepSeek MLA (Multi-head Latent Attention) fields.
KVLoRADim int // KV compression rank (attention.kv_lora_rank)
QLoRADim int // Q compression rank (attention.q_lora_rank)
QKRopeHeadDim int // RoPE head dimension for Q/K (attention.qk_rope_head_dim)
// DeepSeek MoE (Mixture of Experts) fields.
NumExperts int // number of routed experts (expert_count)
NumExpertsPerToken int // experts activated per token (expert_used_count)
// Residual connection configuration.
ResidualMode string // "standard", "attnres", or "block_attnres" (default: "standard")
AttnResNumBlocks int // number of blocks for block_attnres mode (default: 8)
// BERT encoder-only fields.
NumLabels int // number of output classes for sequence classification
PoolerType string // pooling strategy ("cls" or "mean")
LayerNormEps float32 // LayerNorm epsilon (0 = use default 1e-12)
// Granite-specific fields.
EmbeddingMultiplier float32 // multiply embeddings by this factor (0 = no scaling)
ResidualMultiplier float32 // multiply residual connections by this factor (0 = no scaling)
// Vision encoder fields (LLaVA, multimodal models).
VisionImageSize int // vision encoder input image size (e.g. 336)
VisionPatchSize int // vision encoder patch size (e.g. 14)
VisionHiddenSize int // vision encoder hidden dimension
VisionNumHeads int // vision encoder attention heads
VisionNumLayers int // vision encoder transformer layers
ProjectorType string // multi-modal projector type ("linear" or "mlp")
}
ModelConfig holds model configuration extracted from GGUF metadata.
func ExtractModelConfig ¶
func ExtractModelConfig(f *File) (*ModelConfig, error)
ExtractModelConfig reads GGUF metadata and returns a ModelConfig. The architecture field (general.architecture) determines which metadata key prefix to use (e.g., "llama." or "gemma."). After extracting config using the declared architecture's key prefix, the actual architecture is detected via DetectActualArchitecture to handle models like Mistral that declare "llama" but need different runtime behavior.