gguf

package
v1.38.0 Latest
Published: Mar 30, 2026 License: Apache-2.0 Imports: 12 Imported by: 0

Documentation

Overview

Package gguf provides parsing and writing of the GGUF file format. (Stability: stable)

It implements a pure-Go parser for the GGUF v3 model format used by llama.cpp, reading metadata key-value pairs and tensor descriptors from a file without loading tensor data into memory.

Index

Constants

const (
	TypeUint8   uint32 = 0
	TypeInt8    uint32 = 1
	TypeUint16  uint32 = 2
	TypeInt16   uint32 = 3
	TypeUint32  uint32 = 4
	TypeInt32   uint32 = 5
	TypeFloat32 uint32 = 6
	TypeBool    uint32 = 7
	TypeString  uint32 = 8
	TypeArray   uint32 = 9
	TypeUint64  uint32 = 10
	TypeInt64   uint32 = 11
	TypeFloat64 uint32 = 12
)

GGUF metadata value types.

const Magic uint32 = 0x46554747 // "GGUF" in little-endian

Magic is the GGUF file magic number ("GGUF" in little-endian).

Variables

This section is empty.

Functions

func DetectActualArchitecture added in v1.23.0

func DetectActualArchitecture(f *File, declared string) string

DetectActualArchitecture checks GGUF metadata to detect models that declare one architecture but are actually a different model family. For example, Mistral GGUF files declare general.architecture = "llama" but are identified by their model name, tokenizer pre-processor, or vocabulary size.

func ExtractTokenizer

func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)

ExtractTokenizer builds a BPETokenizer from GGUF metadata. GGUF files store tokenizer data under the "tokenizer.ggml.*" metadata keys.

func LoadTensors

func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)

LoadTensors reads tensor data from a parsed GGUF file and returns them as float32 tensors keyed by name. Quantized tensors (Q4_0, Q8_0) are stored using their native quantized storage types for memory efficiency.

func LoadTensorsMmap added in v1.26.0

func LoadTensorsMmap(f *File, mapped []byte) (map[string]*tensor.TensorNumeric[float32], error)

LoadTensorsMmap creates tensors backed by MmapStorage that reference slices of the memory-mapped GGUF file data. No tensor data is copied; each MmapStorage points directly into the mapped region. Dequantization happens lazily on first access via MmapStorage.Slice().

mapped must be the entire GGUF file memory-mapped into a byte slice. The caller must keep the mapping alive for the lifetime of the returned tensors.

func LoadTensorsMmapSplit added in v1.34.0

func LoadTensorsMmapSplit(sf *SplitFile, mappedShards [][]byte) (map[string]*tensor.TensorNumeric[float32], error)

LoadTensorsMmapSplit creates tensors backed by mmap'd regions from multiple shard files. Each shard is independently mmap'd. Tensor data references the correct shard's mapped region.

func LoadTensorsSplit added in v1.34.0

func LoadTensorsSplit(sf *SplitFile, readers []*os.File) (map[string]*tensor.TensorNumeric[float32], error)

LoadTensorsSplit reads tensor data from all shards using heap allocation.

func MapTensorName

func MapTensorName(arch string, ggufName string) string

MapTensorName converts a GGUF tensor name to the Zerfoo/HuggingFace canonical name. The arch parameter selects architecture-specific name mappings (e.g., "gemma3" uses different norm names than "llama"). Unknown names pass through unchanged.
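A hedged sketch of the documented contract, using a hypothetical two-entry mapping table (the package's real tables are architecture-specific and much larger):

```go
package main

import "fmt"

// nameMap holds two illustrative GGUF -> canonical mappings. The entries
// are examples for this sketch, not the package's actual tables.
var nameMap = map[string]string{
	"token_embd.weight":  "model.embed_tokens.weight",
	"output_norm.weight": "model.norm.weight",
}

// mapName mirrors the documented contract: known names are translated,
// unknown names pass through unchanged.
func mapName(ggufName string) string {
	if canon, ok := nameMap[ggufName]; ok {
		return canon
	}
	return ggufName
}

func main() {
	fmt.Println(mapName("token_embd.weight")) // translated
	fmt.Println(mapName("mystery.weight"))    // unknown: passes through
}
```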

func QuantizeToFP8E4M3

func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)

QuantizeToFP8E4M3 converts all tensors in the map from their current storage to FP8 E4M3 format with per-tensor absmax scaling. This reduces memory to 1 byte per element (1/4 of F32) at the cost of reduced precision. The tensors are modified in place — the returned map is the same object.
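The per-tensor absmax scale can be sketched as below, assuming E4M3's maximum finite magnitude of 448; the helper is illustrative, not the package's code:

```go
package main

import (
	"fmt"
	"math"
)

// fp8E4M3Max is the largest finite magnitude representable in FP8 E4M3.
const fp8E4M3Max = 448.0

// absmaxScale computes a per-tensor scale for absmax quantization: values
// divided by the scale map the largest magnitude in the tensor onto the
// format's maximum representable value.
func absmaxScale(vals []float32) float32 {
	var m float64
	for _, v := range vals {
		m = math.Max(m, math.Abs(float64(v)))
	}
	if m == 0 {
		return 1 // all-zero tensor: any scale works, pick 1
	}
	return float32(m / fp8E4M3Max)
}

func main() {
	w := []float32{-0.5, 224, -112}
	s := absmaxScale(w)
	fmt.Println(s)        // 224 / 448 = 0.5
	fmt.Println(w[1] / s) // the absmax element maps to 448
}
```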

func SplitMergedGateUp added in v1.4.0

func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error

SplitMergedGateUp finds merged gate+up MLP tensors (*.mlp.up_proj.weight) where gate_proj is absent and up_proj has double the expected intermediate size, and splits each into separate gate and up projection tensors. This handles architectures like Phi that concatenate gate and up projections into a single tensor: ffn_up has shape [2 * intermediate_size, hidden_size]. The first half of rows is the gate projection, the second half is the up projection.

func SplitMergedQKV added in v1.4.0

func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error

SplitMergedQKV finds merged QKV projection tensors (*.self_attn.qkv_proj.weight) in the tensor map and splits each into separate Q, K, V projection tensors. This handles architectures like Phi that store merged QKV weights in GGUF.

For MHA (num_heads == num_kv_heads): each projection gets 1/3 of rows. For GQA (num_heads > num_kv_heads): Q gets num_heads*head_dim rows, K and V each get num_kv_heads*head_dim rows.
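The row arithmetic above can be sketched with a hypothetical helper (not part of this package's API):

```go
package main

import "fmt"

// qkvRowSplit returns how many rows of a merged QKV weight belong to the
// Q, K, and V projections, following the MHA/GQA rules described above:
// Q gets num_heads*head_dim rows, K and V each get num_kv_heads*head_dim.
func qkvRowSplit(numHeads, numKVHeads, headDim int) (q, k, v int) {
	q = numHeads * headDim
	k = numKVHeads * headDim
	v = numKVHeads * headDim
	return
}

func main() {
	// MHA: 32 heads == 32 KV heads, head_dim 128 -> each gets 4096 rows.
	fmt.Println(qkvRowSplit(32, 32, 128))
	// GQA: 64 heads, 8 KV heads, head_dim 128 -> Q 8192, K/V 1024 each.
	fmt.Println(qkvRowSplit(64, 8, 128))
}
```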

func TensorByteSize added in v1.29.0

func TensorByteSize(typ GGMLType, numElements int) (int, error)

TensorByteSize returns the number of bytes needed for a tensor of the given type and element count.
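Assuming the standard GGML block layouts (e.g. Q4_0 packs 32 elements into an 18-byte block, Q8_0 into a 34-byte block; these figures come from upstream GGML, not from this documentation), the calculation a TensorByteSize-style helper performs looks like:

```go
package main

import "fmt"

// blockLayout describes a GGML quantization block: how many elements it
// packs and how many bytes it occupies. Values are assumptions based on
// the upstream GGML format, not taken from this package.
type blockLayout struct{ elems, bytes int }

var layouts = map[string]blockLayout{
	"F32":  {1, 4},
	"F16":  {1, 2},
	"Q4_0": {32, 18}, // 2-byte f16 scale + 16 bytes of 4-bit values
	"Q8_0": {32, 34}, // 2-byte f16 scale + 32 int8 values
}

// byteSize computes ceil(numElements / elemsPerBlock) * bytesPerBlock.
func byteSize(typ string, numElements int) int {
	l := layouts[typ]
	blocks := (numElements + l.elems - 1) / l.elems
	return blocks * l.bytes
}

func main() {
	fmt.Println(byteSize("F32", 4096))  // 16384
	fmt.Println(byteSize("Q4_0", 4096)) // 128 blocks * 18 = 2304
	fmt.Println(byteSize("Q8_0", 4096)) // 128 blocks * 34 = 4352
}
```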

Types

type File

type File struct {
	Version    uint32
	Metadata   map[string]any
	Tensors    []TensorInfo
	DataOffset int64 // byte offset where tensor data begins
}

File represents a parsed GGUF file.

func Parse

func Parse(r io.ReadSeeker) (*File, error)

Parse reads a GGUF file header, metadata, and tensor info from r. It does not read tensor data. The returned File.DataOffset indicates where tensor data begins in the file.

func (*File) GetFloat32

func (f *File) GetFloat32(key string) (float32, bool)

GetFloat32 returns a metadata value as float32. Handles float32 natively and also converts float64 values (common in some GGUF converters).

func (*File) GetString

func (f *File) GetString(key string) (string, bool)

GetString returns a metadata string value.

func (*File) GetUint32

func (f *File) GetUint32(key string) (uint32, bool)

GetUint32 returns a metadata value as uint32. Handles uint32 natively and also converts uint64, int32, and int64 values (common in HuggingFace GGUFs for Phi3 and Llama 3.1 which store model dimensions as uint64).
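The coercion GetUint32 describes amounts to a type switch over the integer types GGUF metadata can carry. A hypothetical standalone version:

```go
package main

import "fmt"

// asUint32 coerces a metadata value stored as any of the GGUF integer
// types into a uint32, mirroring the conversions GetUint32 documents.
// Hypothetical helper, not the package's actual code.
func asUint32(v any) (uint32, bool) {
	switch n := v.(type) {
	case uint32:
		return n, true
	case uint64:
		return uint32(n), true
	case int32:
		return uint32(n), true
	case int64:
		return uint32(n), true
	default:
		return 0, false
	}
}

func main() {
	meta := map[string]any{
		"llama.embedding_length": uint64(4096), // some converters store dims as uint64
		"llama.block_count":      uint32(32),
	}
	n, ok := asUint32(meta["llama.embedding_length"])
	fmt.Println(n, ok) // 4096 true
	_, ok = asUint32("not a number")
	fmt.Println(ok) // false
}
```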

type GGMLType

type GGMLType uint32

GGMLType identifies the quantization type of a tensor.

const (
	GGMLTypeF32     GGMLType = 0
	GGMLTypeF16     GGMLType = 1
	GGMLTypeQ4_0    GGMLType = 2
	GGMLTypeQ4_1    GGMLType = 3
	GGMLTypeQ5_0    GGMLType = 6
	GGMLTypeQ5_1    GGMLType = 7
	GGMLTypeQ8_0    GGMLType = 8
	GGMLTypeQ8_1    GGMLType = 9
	GGMLTypeQ2_K    GGMLType = 10
	GGMLTypeQ3_K    GGMLType = 11
	GGMLTypeQ4_K    GGMLType = 12
	GGMLTypeQ5_K    GGMLType = 13
	GGMLTypeQ6_K    GGMLType = 14
	GGMLTypeQ8_K    GGMLType = 15
	GGMLTypeIQ2_XXS GGMLType = 16 // Importance-weighted 2-bit (E8 lattice codebook)
	GGMLTypeIQ3_S   GGMLType = 21 // Importance-weighted 3-bit with sub-block scales
	GGMLTypeIQ4_NL  GGMLType = 25 // Non-linear 4-bit with lookup table
	GGMLTypeBF16    GGMLType = 30
	GGMLTypeTQ2_0   GGMLType = 35 // Ternary 2-bit: 4 values per byte {-1, 0, 1}
)

Common GGML tensor types.

type ModelConfig

type ModelConfig struct {
	Architecture         string
	Name                 string
	VocabSize            int
	HiddenSize           int
	NumLayers            int
	NumHeads             int
	NumKVHeads           int
	IntermediateSize     int
	MaxSeqLen            int
	RopeTheta            float64
	HeadDim              int     // explicit head dimension (0 = use HiddenSize/NumHeads)
	LogitSoftcap         float32 // if > 0, apply logit softcapping: cap * tanh(logit/cap)
	LocalRopeTheta       float64 // RoPE base for local/sliding-window layers (0 = use RopeTheta)
	SlidingWindow        int     // sliding window size for local attention layers
	SlidingWindowPattern int     // every Nth layer is global (0 = all global)
	RMSNormEps           float32 // RMSNorm epsilon (0 = use default 1e-5)
	PartialRotaryFactor  float32 // fraction of head dims to apply RoPE (0 = full rotation)

	// DeepSeek MLA (Multi-head Latent Attention) fields.
	KVLoRADim     int // KV compression rank (attention.kv_lora_rank)
	QLoRADim      int // Q compression rank (attention.q_lora_rank)
	QKRopeHeadDim int // RoPE head dimension for Q/K (attention.qk_rope_head_dim)

	// DeepSeek MoE (Mixture of Experts) fields.
	NumExperts         int // number of routed experts (expert_count)
	NumExpertsPerToken int // experts activated per token (expert_used_count)
	NumSharedExperts   int // number of shared experts (expert_shared_count)

	// TransMLA: converted MHA-to-MLA models (see ADR-069).
	TransMLAKVLoraDim int // KV LoRA rank from transmla.kv_lora_dim metadata (0 = not a TransMLA model)

	// Residual connection configuration.
	ResidualMode     string // "standard", "attnres", or "block_attnres" (default: "standard")
	AttnResNumBlocks int    // number of blocks for block_attnres mode (default: 8)

	// BERT encoder-only fields.
	NumLabels    int     // number of output classes for sequence classification
	PoolerType   string  // pooling strategy ("cls" or "mean")
	LayerNormEps float32 // LayerNorm epsilon (0 = use default 1e-12)

	// Granite-specific fields.
	EmbeddingMultiplier float32 // multiply embeddings by this factor (0 = no scaling)
	ResidualMultiplier  float32 // multiply residual connections by this factor (0 = no scaling)

	// Nemotron-H SSM (Mamba-2) fields.
	SSMStateSize  int // SSM state dimension (ssm.state_size)
	SSMConvKernel int // SSM convolution kernel width (ssm.conv_kernel)
	SSMNumHeads   int // SSM number of heads (ssm.num_heads)

	// Nemotron-H MoE shared expert count (expert_shared_count in nemotron_h_moe).
	ExpertSharedCount int
	// MoE expert gating configuration.
	ScoringFunc string // expert gating scoring function ("softmax" or "sigmoid"; default: "softmax")

	// Vision encoder fields (LLaVA, multimodal models).
	VisionImageSize  int    // vision encoder input image size (e.g. 336)
	VisionPatchSize  int    // vision encoder patch size (e.g. 14)
	VisionHiddenSize int    // vision encoder hidden dimension
	VisionNumHeads   int    // vision encoder attention heads
	VisionNumLayers  int    // vision encoder transformer layers
	ProjectorType    string // multi-modal projector type ("linear" or "mlp")

	// Audio encoder fields (Voxtral, multimodal speech-to-text models).
	AudioHiddenSize           int    // audio encoder hidden dimension
	AudioNumLayers            int    // audio encoder transformer layers
	AudioNumHeads             int    // audio encoder attention heads
	AudioNumMels              int    // number of mel spectrogram bins (e.g. 128)
	AudioIntermediateSize     int    // audio encoder FFN intermediate size
	AudioProjectorType        string // audio projector type (e.g. "mlp")
	AudioProjectorStackFactor int    // number of consecutive frames to stack (e.g. 4)
}

ModelConfig holds model configuration extracted from GGUF metadata.

func ExtractModelConfig

func ExtractModelConfig(f *File) (*ModelConfig, error)

ExtractModelConfig reads GGUF metadata and returns a ModelConfig. The architecture field (general.architecture) determines which metadata key prefix to use (e.g., "llama." or "gemma."). After extracting config using the declared architecture's key prefix, the actual architecture is detected via DetectActualArchitecture to handle models like Mistral that declare "llama" but need different runtime behavior.

type SplitFile added in v1.34.0

type SplitFile struct {
	// File is the merged view: metadata from shard 0, tensors from all shards.
	// DataOffset is not meaningful for split files — use ShardIndex instead.
	File *File

	// Shards holds the parsed header for each shard, indexed by shard number.
	Shards []*File

	// ShardPaths holds the file paths for each shard.
	ShardPaths []string

	// ShardIndex maps tensor name to the shard index that contains its data.
	ShardIndex map[string]int
}

SplitFile represents a collection of parsed GGUF shards that together form a single model. Metadata comes from shard 0; tensors are distributed across all shards.

func ParseSplit added in v1.34.0

func ParseSplit(path string) (*SplitFile, error)

ParseSplit detects whether path is a split GGUF file and parses all shards. If path is a single (non-split) file, it returns nil, nil and the caller should fall back to Parse. Split files follow the naming convention:

Model-00001-of-00003.gguf
Model-00002-of-00003.gguf
Model-00003-of-00003.gguf

The path may point to any shard; all sibling shards are discovered automatically.
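The naming convention lends itself to simple string handling. A hedged sketch of shard discovery (not the package's implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// splitRe matches the "-NNNNN-of-NNNNN.gguf" suffix of a split shard path.
var splitRe = regexp.MustCompile(`^(.*)-(\d{5})-of-(\d{5})\.gguf$`)

// shardSiblings returns the full list of shard paths implied by any one
// shard path, or nil if the path does not follow the split convention.
func shardSiblings(path string) []string {
	m := splitRe.FindStringSubmatch(path)
	if m == nil {
		return nil
	}
	var total int
	fmt.Sscanf(m[3], "%d", &total)
	out := make([]string, 0, total)
	for i := 1; i <= total; i++ {
		out = append(out, fmt.Sprintf("%s-%05d-of-%s.gguf", m[1], i, m[3]))
	}
	return out
}

func main() {
	fmt.Println(shardSiblings("Model-00002-of-00003.gguf"))
	fmt.Println(shardSiblings("Model.gguf")) // nil: single file, fall back to Parse
}
```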

type TensorInfo

type TensorInfo struct {
	Name       string
	Dimensions []uint64
	Type       GGMLType
	Offset     uint64 // relative to DataOffset
}

TensorInfo describes a single tensor in the GGUF file.
