Documentation ¶
Overview ¶
Package gguf provides GGUF file format parsing and writing. (Stability: stable)
The parser is pure Go and targets the GGUF v3 model format used by llama.cpp. It reads metadata key-value pairs and tensor descriptors from a GGUF file without loading tensor data into memory.
Index ¶
- Constants
- func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)
- func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)
- func MapTensorName(arch string, ggufName string) string
- func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)
- func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
- func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
- type File
- type GGMLType
- type ModelConfig
- type TensorInfo
Constants ¶
const (
	TypeUint8   uint32 = 0
	TypeInt8    uint32 = 1
	TypeUint16  uint32 = 2
	TypeInt16   uint32 = 3
	TypeUint32  uint32 = 4
	TypeInt32   uint32 = 5
	TypeFloat32 uint32 = 6
	TypeBool    uint32 = 7
	TypeString  uint32 = 8
	TypeArray   uint32 = 9
	TypeUint64  uint32 = 10
	TypeInt64   uint32 = 11
	TypeFloat64 uint32 = 12
)
GGUF metadata value types.
const Magic uint32 = 0x46554747 // "GGUF" in little-endian
Magic is the GGUF file magic number ("GGUF" in little-endian).
Variables ¶
This section is empty.
Functions ¶
func ExtractTokenizer ¶
func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)
ExtractTokenizer builds a BPETokenizer from GGUF metadata. GGUF files store tokenizer data under the "tokenizer.ggml.*" metadata keys.
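A sketch of how the token list might be pulled out of parsed metadata. The key name "tokenizer.ggml.tokens" follows the "tokenizer.ggml.*" convention described above; the []any element type is an assumption about how this parser represents metadata arrays.

```go
package main

import "fmt"

// tokens extracts the vocabulary list from GGUF metadata. Array values
// are assumed to arrive as []any of strings.
func tokens(meta map[string]any) ([]string, error) {
	raw, ok := meta["tokenizer.ggml.tokens"].([]any)
	if !ok {
		return nil, fmt.Errorf("tokenizer.ggml.tokens missing or wrong type")
	}
	out := make([]string, 0, len(raw))
	for _, v := range raw {
		s, ok := v.(string)
		if !ok {
			return nil, fmt.Errorf("non-string token entry")
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	meta := map[string]any{"tokenizer.ggml.tokens": []any{"<s>", "hello"}}
	ts, err := tokens(meta)
	fmt.Println(len(ts), err) // 2 <nil>
}
```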
func LoadTensors ¶
func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)
LoadTensors reads tensor data from a parsed GGUF file and returns them as float32 tensors keyed by name. Quantized tensors (Q4_0, Q8_0) are stored using their native quantized storage types for memory efficiency.
func MapTensorName ¶
func MapTensorName(arch string, ggufName string) string
MapTensorName converts a GGUF tensor name to the Zerfoo/HuggingFace canonical name. The arch parameter selects architecture-specific name mappings (e.g., "gemma3" uses different norm names than "llama"). Unknown names pass through unchanged.
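The mapping can be sketched with a hypothetical subset of the table: llama.cpp-style per-layer names like "blk.N.attn_q.weight" map onto HuggingFace's "model.layers.N.self_attn.q_proj.weight" form, and anything unrecognized passes through. The suffix table here is illustrative, not the package's actual table.

```go
package main

import (
	"fmt"
	"regexp"
)

// blkRe matches llama.cpp-style per-layer tensor names, e.g. "blk.0.attn_q.weight".
var blkRe = regexp.MustCompile(`^blk\.(\d+)\.(.+)$`)

// suffixMap is a hypothetical subset of the per-layer suffix mappings;
// the real table is architecture-dependent.
var suffixMap = map[string]string{
	"attn_q.weight": "self_attn.q_proj.weight",
	"attn_k.weight": "self_attn.k_proj.weight",
	"ffn_up.weight": "mlp.up_proj.weight",
}

func mapName(ggufName string) string {
	if m := blkRe.FindStringSubmatch(ggufName); m != nil {
		if hf, ok := suffixMap[m[2]]; ok {
			return "model.layers." + m[1] + "." + hf
		}
	}
	return ggufName // unknown names pass through unchanged
}

func main() {
	fmt.Println(mapName("blk.3.attn_q.weight")) // model.layers.3.self_attn.q_proj.weight
	fmt.Println(mapName("custom.tensor"))       // custom.tensor
}
```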
func QuantizeToFP8E4M3 ¶
func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)
QuantizeToFP8E4M3 converts all tensors in the map from their current storage to FP8 E4M3 format with per-tensor absmax scaling. This reduces memory to 1 byte per element (1/4 of F32) at the cost of reduced precision. The tensors are modified in place — the returned map is the same object.
func SplitMergedGateUp ¶ added in v1.4.0
func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
SplitMergedGateUp finds merged gate+up MLP tensors (*.mlp.up_proj.weight) where gate_proj is absent and up_proj has double the expected intermediate size. This handles architectures like Phi that concatenate gate and up projections into a single tensor: ffn_up has shape [2 * intermediate_size, hidden_size]. The first half of rows is the gate projection, the second half is the up projection.
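The row split above can be sketched on a row-major flat slice: with a merged [2*intermediate_size, hidden_size] weight, the first intermediate_size*hidden_size elements are the gate projection and the rest are the up projection.

```go
package main

import "fmt"

// splitGateUp splits a row-major merged [2*inter, hidden] weight into
// gate (first half of rows) and up (second half of rows).
func splitGateUp(w []float32, inter, hidden int) (gate, up []float32) {
	half := inter * hidden
	return w[:half], w[half:]
}

func main() {
	// Toy example: inter=2, hidden=3, so the merged tensor has 4 rows.
	w := []float32{1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4}
	gate, up := splitGateUp(w, 2, 3)
	fmt.Println(len(gate), len(up), gate[0], up[0]) // 6 6 1 3
}
```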
func SplitMergedQKV ¶ added in v1.4.0
func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error
SplitMergedQKV finds merged QKV projection tensors (*.self_attn.qkv_proj.weight) in the tensor map and splits each into separate Q, K, V projection tensors. This handles architectures like Phi that store merged QKV weights in GGUF.
For MHA (num_heads == num_kv_heads): each projection gets 1/3 of rows. For GQA (num_heads > num_kv_heads): Q gets num_heads*head_dim rows, K and V each get num_kv_heads*head_dim rows.
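The row arithmetic above reduces to a small helper; note that the MHA case (equal head counts) falls out of the GQA formula as equal thirds.

```go
package main

import "fmt"

// qkvRows returns the row counts of the Q, K and V slices of a merged
// QKV weight. With numHeads == numKVHeads this gives equal thirds (MHA);
// otherwise K and V get the smaller numKVHeads*headDim share (GQA).
func qkvRows(numHeads, numKVHeads, headDim int) (q, k, v int) {
	q = numHeads * headDim
	k = numKVHeads * headDim
	v = k
	return
}

func main() {
	// MHA: 32 heads, 32 KV heads, head_dim 128 -> equal thirds.
	fmt.Println(qkvRows(32, 32, 128)) // 4096 4096 4096
	// GQA: 32 heads but only 8 KV heads.
	fmt.Println(qkvRows(32, 8, 128)) // 4096 1024 1024
}
```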
Types ¶
type File ¶
type File struct {
Version uint32
Metadata map[string]any
Tensors []TensorInfo
DataOffset int64 // byte offset where tensor data begins
}
File represents a parsed GGUF file.
func Parse ¶
func Parse(r io.ReadSeeker) (*File, error)
Parse reads a GGUF file header, metadata, and tensor info from r. It does not read tensor data. The returned File.DataOffset indicates where tensor data begins in the file.
func (*File) GetFloat32 ¶
GetFloat32 returns a metadata float32 value.
type GGMLType ¶
type GGMLType uint32
GGMLType identifies the quantization type of a tensor.
const (
	GGMLTypeF32  GGMLType = 0
	GGMLTypeF16  GGMLType = 1
	GGMLTypeQ4_0 GGMLType = 2
	GGMLTypeQ4_1 GGMLType = 3
	GGMLTypeQ5_0 GGMLType = 6
	GGMLTypeQ5_1 GGMLType = 7
	GGMLTypeQ8_0 GGMLType = 8
	GGMLTypeQ8_1 GGMLType = 9
	GGMLTypeQ2_K GGMLType = 10
	GGMLTypeQ3_K GGMLType = 11
	GGMLTypeQ4_K GGMLType = 12
	GGMLTypeQ5_K GGMLType = 13
	GGMLTypeQ6_K GGMLType = 14
	GGMLTypeQ8_K GGMLType = 15
	GGMLTypeBF16 GGMLType = 30
)
Common GGML tensor types.
type ModelConfig ¶
type ModelConfig struct {
Architecture string
Name string
VocabSize int
HiddenSize int
NumLayers int
NumHeads int
NumKVHeads int
IntermediateSize int
MaxSeqLen int
RopeTheta float64
HeadDim int // explicit head dimension (0 = use HiddenSize/NumHeads)
LogitSoftcap float32 // if > 0, apply logit softcapping: cap * tanh(logit/cap)
LocalRopeTheta float64 // RoPE base for local/sliding-window layers (0 = use RopeTheta)
SlidingWindow int // sliding window size for local attention layers
SlidingWindowPattern int // every Nth layer is global (0 = all global)
RMSNormEps float32 // RMSNorm epsilon (0 = use default 1e-5)
PartialRotaryFactor float32 // fraction of head dims to apply RoPE (0 = full rotation)
// DeepSeek MLA (Multi-head Latent Attention) fields.
KVLoRADim int // KV compression rank (attention.kv_lora_rank)
QLoRADim int // Q compression rank (attention.q_lora_rank)
QKRopeHeadDim int // RoPE head dimension for Q/K (attention.qk_rope_head_dim)
// DeepSeek MoE (Mixture of Experts) fields.
NumExperts int // number of routed experts (expert_count)
NumExpertsPerToken int // experts activated per token (expert_used_count)
// Residual connection configuration.
ResidualMode string // "standard", "attnres", or "block_attnres" (default: "standard")
AttnResNumBlocks int // number of blocks for block_attnres mode (default: 8)
// BERT encoder-only fields.
NumLabels int // number of output classes for sequence classification
PoolerType string // pooling strategy ("cls" or "mean")
LayerNormEps float32 // LayerNorm epsilon (0 = use default 1e-12)
// Vision encoder fields (LLaVA, multimodal models).
VisionImageSize int // vision encoder input image size (e.g. 336)
VisionPatchSize int // vision encoder patch size (e.g. 14)
VisionHiddenSize int // vision encoder hidden dimension
VisionNumHeads int // vision encoder attention heads
VisionNumLayers int // vision encoder transformer layers
ProjectorType string // multi-modal projector type ("linear" or "mlp")
}
ModelConfig holds model configuration extracted from GGUF metadata.
func ExtractModelConfig ¶
func ExtractModelConfig(f *File) (*ModelConfig, error)
ExtractModelConfig reads GGUF metadata and returns a ModelConfig. The architecture field (general.architecture) determines which metadata key prefix to use (e.g., "llama." or "gemma.").
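The prefixed-key lookup can be sketched as follows. The key name "llama.embedding_length" and the integer widths handled are assumptions about how this parser represents metadata values; the point is that the architecture string selects the prefix.

```go
package main

import "fmt"

// cfgInt looks up an architecture-prefixed metadata key, e.g.
// "llama.embedding_length" when arch is "llama". GGUF integer metadata
// may arrive in several widths, so a type switch normalizes it.
func cfgInt(meta map[string]any, arch, key string) (int, bool) {
	switch v := meta[arch+"."+key].(type) {
	case uint32:
		return int(v), true
	case int32:
		return int(v), true
	case uint64:
		return int(v), true
	default:
		return 0, false
	}
}

func main() {
	meta := map[string]any{
		"general.architecture":   "llama",
		"llama.embedding_length": uint32(4096),
	}
	arch := meta["general.architecture"].(string)
	n, ok := cfgInt(meta, arch, "embedding_length")
	fmt.Println(n, ok) // 4096 true
}
```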