gguf

package
v1.26.1
Published: Mar 27, 2026 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Overview

Package gguf provides parsing and writing of the GGUF file format. (Stability: stable)

It implements a pure-Go parser for the GGUF v3 model format used by llama.cpp, reading metadata key-value pairs and tensor descriptors from a GGUF file without loading tensor data into memory.

Index

Constants

const (
	TypeUint8   uint32 = 0
	TypeInt8    uint32 = 1
	TypeUint16  uint32 = 2
	TypeInt16   uint32 = 3
	TypeUint32  uint32 = 4
	TypeInt32   uint32 = 5
	TypeFloat32 uint32 = 6
	TypeBool    uint32 = 7
	TypeString  uint32 = 8
	TypeArray   uint32 = 9
	TypeUint64  uint32 = 10
	TypeInt64   uint32 = 11
	TypeFloat64 uint32 = 12
)

GGUF metadata value types.

const Magic uint32 = 0x46554747 // "GGUF" in little-endian

Magic is the GGUF file magic number ("GGUF" in little-endian).
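A minimal standalone sketch of the magic check (the package's Parse performs this itself): the first four bytes of a GGUF file spell "GGUF" when read as a little-endian uint32.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Magic is the GGUF magic number; bytes 0x47 0x47 0x55 0x46 ("GGUF")
// read as a little-endian uint32.
const Magic uint32 = 0x46554747

// hasGGUFMagic reports whether b starts with the GGUF magic.
func hasGGUFMagic(b []byte) bool {
	return len(b) >= 4 && binary.LittleEndian.Uint32(b) == Magic
}

func main() {
	header := []byte("GGUF")
	fmt.Println(hasGGUFMagic(header)) // true
}
```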

Variables

This section is empty.

Functions

func DetectActualArchitecture added in v1.23.0

func DetectActualArchitecture(f *File, declared string) string

DetectActualArchitecture checks GGUF metadata to detect models that declare one architecture but are actually a different model family. For example, Mistral GGUF files declare general.architecture = "llama" but are identified by their model name, tokenizer pre-processor, or vocabulary size.
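A hypothetical sketch of the kind of heuristic described: a file declares general.architecture = "llama" while other metadata reveals a different family. Checking general.name for a substring is an illustrative assumption, not the package's actual signal set.

```go
package main

import (
	"fmt"
	"strings"
)

// detectActualArchitecture sketches one possible heuristic: when the
// declared architecture is "llama", inspect general.name for evidence
// of a different model family (assumption for illustration only).
func detectActualArchitecture(metadata map[string]any, declared string) string {
	if declared != "llama" {
		return declared
	}
	if name, ok := metadata["general.name"].(string); ok {
		if strings.Contains(strings.ToLower(name), "mistral") {
			return "mistral"
		}
	}
	return declared
}

func main() {
	md := map[string]any{"general.name": "Mistral-7B-Instruct"}
	fmt.Println(detectActualArchitecture(md, "llama")) // mistral
}
```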

func ExtractTokenizer

func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)

ExtractTokenizer builds a BPETokenizer from GGUF metadata. GGUF files store tokenizer data under the "tokenizer.ggml.*" metadata keys.
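A sketch of the first step, under an assumption about representation: GGUF stores the vocabulary under tokenizer.ggml.tokens as an array, modeled here as []any in the parsed Metadata map.

```go
package main

import "fmt"

// vocabFromMetadata reads the vocabulary stored under
// tokenizer.ggml.tokens. The []any representation of GGUF arrays is
// an assumption about how the parsed Metadata map holds them.
func vocabFromMetadata(metadata map[string]any) ([]string, bool) {
	raw, ok := metadata["tokenizer.ggml.tokens"].([]any)
	if !ok {
		return nil, false
	}
	vocab := make([]string, 0, len(raw))
	for _, t := range raw {
		s, ok := t.(string)
		if !ok {
			return nil, false
		}
		vocab = append(vocab, s)
	}
	return vocab, true
}

func main() {
	md := map[string]any{"tokenizer.ggml.tokens": []any{"<s>", "</s>", "hello"}}
	vocab, ok := vocabFromMetadata(md)
	fmt.Println(len(vocab), ok) // 3 true
}
```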

func LoadTensors

func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)

LoadTensors reads tensor data from a parsed GGUF file and returns them as float32 tensors keyed by name. Quantized tensors (Q4_0, Q8_0) are stored using their native quantized storage types for memory efficiency.

func LoadTensorsMmap added in v1.26.0

func LoadTensorsMmap(f *File, mapped []byte) (map[string]*tensor.TensorNumeric[float32], error)

LoadTensorsMmap creates tensors backed by MmapStorage that reference slices of the memory-mapped GGUF file data. No tensor data is copied -- each MmapStorage points directly into the mapped region. Dequantization happens lazily on first access via MmapStorage.Slice().

mapped must be the entire GGUF file memory-mapped into a byte slice. The caller must keep the mapping alive for the lifetime of the returned tensors.

func MapTensorName

func MapTensorName(arch string, ggufName string) string

MapTensorName converts a GGUF tensor name to the Zerfoo/HuggingFace canonical name. The arch parameter selects architecture-specific name mappings (e.g., "gemma3" uses different norm names than "llama"). Unknown names pass through unchanged.
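A sketch of the kind of translation involved. The specific GGUF-to-HuggingFace pairs below are illustrative assumptions, not the package's mapping table; the pass-through behavior for unknown names matches the documentation above.

```go
package main

import (
	"fmt"
	"regexp"
)

var blkNormRe = regexp.MustCompile(`^blk\.(\d+)\.attn_norm\.weight$`)

// mapTensorName sketches a GGUF-to-canonical name translation with two
// hypothetical example mappings; unknown names pass through unchanged.
func mapTensorName(arch, ggufName string) string {
	if ggufName == "token_embd.weight" {
		return "model.embed_tokens.weight"
	}
	if m := blkNormRe.FindStringSubmatch(ggufName); m != nil {
		return "model.layers." + m[1] + ".input_layernorm.weight"
	}
	return ggufName // pass through unchanged
}

func main() {
	fmt.Println(mapTensorName("llama", "blk.3.attn_norm.weight"))
	fmt.Println(mapTensorName("llama", "custom.tensor"))
}
```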

func QuantizeToFP8E4M3

func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)

QuantizeToFP8E4M3 converts all tensors in the map from their current storage to FP8 E4M3 format with per-tensor absmax scaling. This reduces memory to 1 byte per element (1/4 of F32) at the cost of reduced precision. The tensors are modified in place — the returned map is the same object.
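The per-tensor absmax scaling can be sketched as follows. The largest finite FP8 E4M3 value is 448, so the scale maps the tensor's largest magnitude onto that maximum; the actual FP8 bit encoding is omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// fp8E4M3Max is the largest finite FP8 E4M3 value.
const fp8E4M3Max = 448.0

// absmaxScale computes the per-tensor scale for FP8 E4M3 quantization:
// dividing by it maps the largest magnitude to the E4M3 maximum.
// (Sketch of the scaling scheme only, not the bit-level encoding.)
func absmaxScale(vals []float32) float32 {
	var m float64
	for _, v := range vals {
		m = math.Max(m, math.Abs(float64(v)))
	}
	if m == 0 {
		return 1 // all-zero tensor: any scale works
	}
	return float32(m / fp8E4M3Max)
}

func main() {
	scale := absmaxScale([]float32{-896, 224, 100})
	fmt.Println(scale) // absmax 896 / 448 = 2
}
```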

func SplitMergedGateUp added in v1.4.0

func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error

SplitMergedGateUp finds merged gate+up MLP tensors (*.mlp.up_proj.weight) where gate_proj is absent and up_proj has double the expected intermediate size. This handles architectures like Phi that concatenate gate and up projections into a single tensor: ffn_up has shape [2 * intermediate_size, hidden_size]. The first half of rows is the gate projection, the second half is the up projection.
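The row split described above can be sketched on a plain 2-D slice: a merged tensor with 2 * intermediate_size rows yields the gate projection from the first half of rows and the up projection from the second.

```go
package main

import "fmt"

// splitGateUp splits a merged gate+up weight (as rows) into the gate
// projection (first half of rows) and up projection (second half).
func splitGateUp(rows [][]float32, intermediateSize int) (gate, up [][]float32) {
	if len(rows) != 2*intermediateSize {
		return nil, nil // not a merged gate+up tensor
	}
	return rows[:intermediateSize], rows[intermediateSize:]
}

func main() {
	merged := [][]float32{{1}, {2}, {3}, {4}} // intermediate size 2
	gate, up := splitGateUp(merged, 2)
	fmt.Println(len(gate), len(up), gate[0][0], up[0][0]) // 2 2 1 3
}
```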

func SplitMergedQKV added in v1.4.0

func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error

SplitMergedQKV finds merged QKV projection tensors (*.self_attn.qkv_proj.weight) in the tensor map and splits each into separate Q, K, V projection tensors. This handles architectures like Phi that store merged QKV weights in GGUF.

For MHA (num_heads == num_kv_heads): each projection gets 1/3 of rows. For GQA (num_heads > num_kv_heads): Q gets num_heads*head_dim rows, K and V each get num_kv_heads*head_dim rows.
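The split arithmetic above can be sketched directly: Q gets num_heads*head_dim rows while K and V each get num_kv_heads*head_dim rows, which reduces to equal thirds in the MHA case.

```go
package main

import "fmt"

// qkvRowCounts returns the number of rows each projection receives
// when a merged QKV tensor is split.
func qkvRowCounts(numHeads, numKVHeads, headDim int) (q, k, v int) {
	q = numHeads * headDim
	k = numKVHeads * headDim
	v = numKVHeads * headDim
	return
}

func main() {
	// GQA example: 32 query heads, 8 KV heads, head dim 128.
	q, k, v := qkvRowCounts(32, 8, 128)
	fmt.Println(q, k, v) // 4096 1024 1024
}
```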

Types

type File

type File struct {
	Version    uint32
	Metadata   map[string]any
	Tensors    []TensorInfo
	DataOffset int64 // byte offset where tensor data begins
}

File represents a parsed GGUF file.

func Parse

func Parse(r io.ReadSeeker) (*File, error)

Parse reads a GGUF file header, metadata, and tensor info from r. It does not read tensor data. The returned File.DataOffset indicates where tensor data begins in the file.

func (*File) GetFloat32

func (f *File) GetFloat32(key string) (float32, bool)

GetFloat32 returns a metadata float32 value.

func (*File) GetString

func (f *File) GetString(key string) (string, bool)

GetString returns a metadata string value.

func (*File) GetUint32

func (f *File) GetUint32(key string) (uint32, bool)

GetUint32 returns a metadata uint32 value.
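The typed getters likely amount to a type assertion against the Metadata map's `any` values; a standalone sketch of that pattern, with the second return reporting whether the key exists with the expected type:

```go
package main

import "fmt"

// getUint32 sketches a typed metadata lookup: a type assertion on the
// `any` value stored in the map.
func getUint32(metadata map[string]any, key string) (uint32, bool) {
	v, ok := metadata[key].(uint32)
	return v, ok
}

func main() {
	md := map[string]any{
		"llama.block_count": uint32(32),
		"general.name":      "demo",
	}
	n, ok := getUint32(md, "llama.block_count")
	fmt.Println(n, ok) // 32 true
	_, ok = getUint32(md, "general.name") // wrong type
	fmt.Println(ok)    // false
}
```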

type GGMLType

type GGMLType uint32

GGMLType identifies the quantization type of a tensor.

const (
	GGMLTypeF32  GGMLType = 0
	GGMLTypeF16  GGMLType = 1
	GGMLTypeQ4_0 GGMLType = 2
	GGMLTypeQ4_1 GGMLType = 3
	GGMLTypeQ5_0 GGMLType = 6
	GGMLTypeQ5_1 GGMLType = 7
	GGMLTypeQ8_0 GGMLType = 8
	GGMLTypeQ8_1 GGMLType = 9
	GGMLTypeQ2_K GGMLType = 10
	GGMLTypeQ3_K GGMLType = 11
	GGMLTypeQ4_K GGMLType = 12
	GGMLTypeQ5_K GGMLType = 13
	GGMLTypeQ6_K GGMLType = 14
	GGMLTypeQ8_K GGMLType = 15
	GGMLTypeBF16 GGMLType = 30
)

Common GGML tensor types.

type ModelConfig

type ModelConfig struct {
	Architecture         string
	Name                 string
	VocabSize            int
	HiddenSize           int
	NumLayers            int
	NumHeads             int
	NumKVHeads           int
	IntermediateSize     int
	MaxSeqLen            int
	RopeTheta            float64
	HeadDim              int     // explicit head dimension (0 = use HiddenSize/NumHeads)
	LogitSoftcap         float32 // if > 0, apply logit softcapping: cap * tanh(logit/cap)
	LocalRopeTheta       float64 // RoPE base for local/sliding-window layers (0 = use RopeTheta)
	SlidingWindow        int     // sliding window size for local attention layers
	SlidingWindowPattern int     // every Nth layer is global (0 = all global)
	RMSNormEps           float32 // RMSNorm epsilon (0 = use default 1e-5)
	PartialRotaryFactor  float32 // fraction of head dims to apply RoPE (0 = full rotation)

	// DeepSeek MLA (Multi-head Latent Attention) fields.
	KVLoRADim     int // KV compression rank (attention.kv_lora_rank)
	QLoRADim      int // Q compression rank (attention.q_lora_rank)
	QKRopeHeadDim int // RoPE head dimension for Q/K (attention.qk_rope_head_dim)

	// DeepSeek MoE (Mixture of Experts) fields.
	NumExperts         int // number of routed experts (expert_count)
	NumExpertsPerToken int // experts activated per token (expert_used_count)
	NumSharedExperts   int // number of shared experts (expert_shared_count)

	// Residual connection configuration.
	ResidualMode     string // "standard", "attnres", or "block_attnres" (default: "standard")
	AttnResNumBlocks int    // number of blocks for block_attnres mode (default: 8)

	// BERT encoder-only fields.
	NumLabels    int     // number of output classes for sequence classification
	PoolerType   string  // pooling strategy ("cls" or "mean")
	LayerNormEps float32 // LayerNorm epsilon (0 = use default 1e-12)

	// Granite-specific fields.
	EmbeddingMultiplier float32 // multiply embeddings by this factor (0 = no scaling)
	ResidualMultiplier  float32 // multiply residual connections by this factor (0 = no scaling)

	// Vision encoder fields (LLaVA, multimodal models).
	VisionImageSize  int    // vision encoder input image size (e.g. 336)
	VisionPatchSize  int    // vision encoder patch size (e.g. 14)
	VisionHiddenSize int    // vision encoder hidden dimension
	VisionNumHeads   int    // vision encoder attention heads
	VisionNumLayers  int    // vision encoder transformer layers
	ProjectorType    string // multi-modal projector type ("linear" or "mlp")
}

ModelConfig holds model configuration extracted from GGUF metadata.

func ExtractModelConfig

func ExtractModelConfig(f *File) (*ModelConfig, error)

ExtractModelConfig reads GGUF metadata and returns a ModelConfig. The architecture field (general.architecture) determines which metadata key prefix to use (e.g., "llama." or "gemma."). After extracting config using the declared architecture's key prefix, the actual architecture is detected via DetectActualArchitecture to handle models like Mistral that declare "llama" but need different runtime behavior.
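One convention worth making explicit is the HeadDim fallback documented on ModelConfig: an explicit value wins, and zero falls back to HiddenSize / NumHeads. A minimal sketch:

```go
package main

import "fmt"

// effectiveHeadDim applies the HeadDim convention from ModelConfig:
// explicit value if set, else HiddenSize / NumHeads.
func effectiveHeadDim(headDim, hiddenSize, numHeads int) int {
	if headDim > 0 {
		return headDim
	}
	return hiddenSize / numHeads
}

func main() {
	fmt.Println(effectiveHeadDim(0, 4096, 32))   // fallback: 128
	fmt.Println(effectiveHeadDim(256, 4096, 32)) // explicit wins: 256
}
```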

type TensorInfo

type TensorInfo struct {
	Name       string
	Dimensions []uint64
	Type       GGMLType
	Offset     uint64 // relative to DataOffset
}

TensorInfo describes a single tensor in the GGUF file.
