gguf

package
v1.16.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 25, 2026 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Overview

Package gguf provides GGUF file format parsing and writing. (Stability: stable)

It implements a pure-Go parser for the GGUF v3 model format used by llama.cpp, reading metadata key-value pairs and tensor descriptors from a GGUF file without loading tensor data into memory.

Index

Constants

const (
	TypeUint8   uint32 = 0
	TypeInt8    uint32 = 1
	TypeUint16  uint32 = 2
	TypeInt16   uint32 = 3
	TypeUint32  uint32 = 4
	TypeInt32   uint32 = 5
	TypeFloat32 uint32 = 6
	TypeBool    uint32 = 7
	TypeString  uint32 = 8
	TypeArray   uint32 = 9
	TypeUint64  uint32 = 10
	TypeInt64   uint32 = 11
	TypeFloat64 uint32 = 12
)

GGUF metadata value types.

const Magic uint32 = 0x46554747 // "GGUF" in little-endian

Magic is the GGUF file magic number ("GGUF" in little-endian).

Variables

This section is empty.

Functions

func ExtractTokenizer

func ExtractTokenizer(f *File) (*tokenizer.BPETokenizer, error)

ExtractTokenizer builds a BPETokenizer from GGUF metadata. GGUF files store tokenizer data under the "tokenizer.ggml.*" metadata keys.

func LoadTensors

func LoadTensors(f *File, r io.ReadSeeker) (map[string]*tensor.TensorNumeric[float32], error)

LoadTensors reads tensor data from a parsed GGUF file and returns them as float32 tensors keyed by name. Quantized tensors (Q4_0, Q8_0) are stored using their native quantized storage types for memory efficiency.

func MapTensorName

func MapTensorName(arch string, ggufName string) string

MapTensorName converts a GGUF tensor name to the Zerfoo/HuggingFace canonical name. The arch parameter selects architecture-specific name mappings (e.g., "gemma3" uses different norm names than "llama"). Unknown names pass through unchanged.
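The general shape of such a mapping is a per-architecture rename of GGUF-style layer names like blk.N.attn_q.weight to HuggingFace-style names, with unknown names passed through. The table below is purely illustrative, not the package's actual mapping:

```go
package main

import (
	"fmt"
	"regexp"
)

// blkRe matches GGUF layer-tensor names such as "blk.0.attn_q.weight".
var blkRe = regexp.MustCompile(`^blk\.(\d+)\.([a-z_]+)\.weight$`)

// Illustrative suffix table only; the real tables differ per architecture.
var suffix = map[string]string{
	"attn_q": "self_attn.q_proj",
	"attn_k": "self_attn.k_proj",
	"ffn_up": "mlp.up_proj",
}

// mapName renames a GGUF tensor name; unknown names pass through unchanged.
func mapName(ggufName string) string {
	m := blkRe.FindStringSubmatch(ggufName)
	if m == nil {
		return ggufName
	}
	hf, ok := suffix[m[2]]
	if !ok {
		return ggufName
	}
	return fmt.Sprintf("model.layers.%s.%s.weight", m[1], hf)
}

func main() {
	fmt.Println(mapName("blk.0.attn_q.weight"))
	fmt.Println(mapName("output_norm.weight")) // unknown: passes through
}
```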

func QuantizeToFP8E4M3

func QuantizeToFP8E4M3(tensors map[string]*tensor.TensorNumeric[float32]) (map[string]*tensor.TensorNumeric[float32], error)

QuantizeToFP8E4M3 converts all tensors in the map from their current storage to FP8 E4M3 format with per-tensor absmax scaling. This reduces memory to 1 byte per element (1/4 of F32) at the cost of reduced precision. The tensors are modified in place — the returned map is the same object.

func SplitMergedGateUp added in v1.4.0

func SplitMergedGateUp(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error

SplitMergedGateUp finds merged gate+up MLP tensors (*.mlp.up_proj.weight) where gate_proj is absent and up_proj has double the expected intermediate size. This handles architectures like Phi that concatenate gate and up projections into a single tensor: ffn_up has shape [2 * intermediate_size, hidden_size]. The first half of rows is the gate projection, the second half is the up projection.
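The row split described above can be sketched on a flat row-major matrix (sizes here are toy values, not a real model):

```go
package main

import "fmt"

// splitGateUp splits a row-major [2*inter, hidden] merged matrix into
// gate (the first inter rows) and up (the last inter rows).
func splitGateUp(merged []float32, inter, hidden int) (gate, up []float32) {
	mid := inter * hidden
	return merged[:mid], merged[mid:]
}

func main() {
	inter, hidden := 3, 2 // toy sizes
	merged := make([]float32, 2*inter*hidden)
	for i := range merged {
		merged[i] = float32(i)
	}
	gate, up := splitGateUp(merged, inter, hidden)
	fmt.Println(len(gate), len(up), gate[0], up[0])
}
```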

func SplitMergedQKV added in v1.4.0

func SplitMergedQKV(tensors map[string]*tensor.TensorNumeric[float32], cfg *ModelConfig) error

SplitMergedQKV finds merged QKV projection tensors (*.self_attn.qkv_proj.weight) in the tensor map and splits each into separate Q, K, V projection tensors. This handles architectures like Phi that store merged QKV weights in GGUF.

For MHA (num_heads == num_kv_heads): each projection gets 1/3 of rows. For GQA (num_heads > num_kv_heads): Q gets num_heads*head_dim rows, K and V each get num_kv_heads*head_dim rows.
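The row bookkeeping above can be sketched numerically; the model sizes below are illustrative only:

```go
package main

import "fmt"

// qkvRows returns the number of rows each projection occupies in a
// merged QKV weight of shape [qRows+kRows+vRows, hidden].
func qkvRows(numHeads, numKVHeads, headDim int) (qRows, kRows, vRows int) {
	return numHeads * headDim, numKVHeads * headDim, numKVHeads * headDim
}

func main() {
	// GQA example: 32 query heads, 8 KV heads, head_dim 128.
	q, k, v := qkvRows(32, 8, 128)
	fmt.Println(q, k, v) // Q gets 4096 rows; K and V get 1024 each

	// MHA example: equal head counts, so each projection gets 1/3 of the rows.
	q, k, v = qkvRows(16, 16, 64)
	fmt.Println(q, k, v)
}
```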

Types

type File

type File struct {
	Version    uint32
	Metadata   map[string]any
	Tensors    []TensorInfo
	DataOffset int64 // byte offset where tensor data begins
}

File represents a parsed GGUF file.

func Parse

func Parse(r io.ReadSeeker) (*File, error)

Parse reads a GGUF file header, metadata, and tensor info from r. It does not read tensor data. The returned File.DataOffset indicates where tensor data begins in the file.

func (*File) GetFloat32

func (f *File) GetFloat32(key string) (float32, bool)

GetFloat32 returns a metadata float32 value.

func (*File) GetString

func (f *File) GetString(key string) (string, bool)

GetString returns a metadata string value.

func (*File) GetUint32

func (f *File) GetUint32(key string) (uint32, bool)

GetUint32 returns a metadata uint32 value.

type GGMLType

type GGMLType uint32

GGMLType identifies the quantization type of a tensor.

const (
	GGMLTypeF32  GGMLType = 0
	GGMLTypeF16  GGMLType = 1
	GGMLTypeQ4_0 GGMLType = 2
	GGMLTypeQ4_1 GGMLType = 3
	GGMLTypeQ5_0 GGMLType = 6
	GGMLTypeQ5_1 GGMLType = 7
	GGMLTypeQ8_0 GGMLType = 8
	GGMLTypeQ8_1 GGMLType = 9
	GGMLTypeQ2_K GGMLType = 10
	GGMLTypeQ3_K GGMLType = 11
	GGMLTypeQ4_K GGMLType = 12
	GGMLTypeQ5_K GGMLType = 13
	GGMLTypeQ6_K GGMLType = 14
	GGMLTypeQ8_K GGMLType = 15
	GGMLTypeBF16 GGMLType = 30
)

Common GGML tensor types.

type ModelConfig

type ModelConfig struct {
	Architecture         string
	Name                 string
	VocabSize            int
	HiddenSize           int
	NumLayers            int
	NumHeads             int
	NumKVHeads           int
	IntermediateSize     int
	MaxSeqLen            int
	RopeTheta            float64
	HeadDim              int     // explicit head dimension (0 = use HiddenSize/NumHeads)
	LogitSoftcap         float32 // if > 0, apply logit softcapping: cap * tanh(logit/cap)
	LocalRopeTheta       float64 // RoPE base for local/sliding-window layers (0 = use RopeTheta)
	SlidingWindow        int     // sliding window size for local attention layers
	SlidingWindowPattern int     // every Nth layer is global (0 = all global)
	RMSNormEps           float32 // RMSNorm epsilon (0 = use default 1e-5)
	PartialRotaryFactor  float32 // fraction of head dims to apply RoPE (0 = full rotation)

	// DeepSeek MLA (Multi-head Latent Attention) fields.
	KVLoRADim     int // KV compression rank (attention.kv_lora_rank)
	QLoRADim      int // Q compression rank (attention.q_lora_rank)
	QKRopeHeadDim int // RoPE head dimension for Q/K (attention.qk_rope_head_dim)

	// DeepSeek MoE (Mixture of Experts) fields.
	NumExperts         int // number of routed experts (expert_count)
	NumExpertsPerToken int // experts activated per token (expert_used_count)
	NumSharedExperts   int // number of shared experts (expert_shared_count)

	// Residual connection configuration.
	ResidualMode     string // "standard", "attnres", or "block_attnres" (default: "standard")
	AttnResNumBlocks int    // number of blocks for block_attnres mode (default: 8)

	// BERT encoder-only fields.
	NumLabels    int     // number of output classes for sequence classification
	PoolerType   string  // pooling strategy ("cls" or "mean")
	LayerNormEps float32 // LayerNorm epsilon (0 = use default 1e-12)

	// Vision encoder fields (LLaVA, multimodal models).
	VisionImageSize  int    // vision encoder input image size (e.g. 336)
	VisionPatchSize  int    // vision encoder patch size (e.g. 14)
	VisionHiddenSize int    // vision encoder hidden dimension
	VisionNumHeads   int    // vision encoder attention heads
	VisionNumLayers  int    // vision encoder transformer layers
	ProjectorType    string // multi-modal projector type ("linear" or "mlp")
}

ModelConfig holds model configuration extracted from GGUF metadata.

func ExtractModelConfig

func ExtractModelConfig(f *File) (*ModelConfig, error)

ExtractModelConfig reads GGUF metadata and returns a ModelConfig. The architecture field (general.architecture) determines which metadata key prefix to use (e.g., "llama." or "gemma.").
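The prefix selection amounts to reading general.architecture first and then prefixing the remaining lookups with it. A minimal sketch against a plain metadata map (key names follow the GGUF convention; the values are made up):

```go
package main

import "fmt"

func main() {
	// Toy metadata map standing in for File.Metadata.
	meta := map[string]any{
		"general.architecture":   "llama",
		"llama.block_count":      uint32(32),
		"llama.embedding_length": uint32(4096),
	}

	arch := meta["general.architecture"].(string)
	layers := meta[arch+".block_count"].(uint32)
	hidden := meta[arch+".embedding_length"].(uint32)
	fmt.Println(arch, layers, hidden)
}
```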

type TensorInfo

type TensorInfo struct {
	Name       string
	Dimensions []uint64
	Type       GGMLType
	Offset     uint64 // relative to DataOffset
}

TensorInfo describes a single tensor in the GGUF file.
