Documentation ¶
Overview ¶
Package llama provides Go bindings for llama.cpp, enabling efficient LLM inference with GPU acceleration and advanced features like prefix caching and speculative decoding.
This package wraps llama.cpp's C++ API whilst maintaining Go idioms and safety. Heavy computation stays in optimised C++ code, whilst the Go API provides clean concurrency primitives and resource management.
Quick Start ¶
Load a GGUF model and generate text:
model, err := llama.LoadModel("model.gguf")
if err != nil {
log.Fatal(err)
}
defer model.Close()
result, err := model.Generate("Once upon a time")
if err != nil {
log.Fatal(err)
}
fmt.Println(result)
GPU Acceleration ¶
GPU offloading is enabled by default, automatically using CUDA, ROCm, or Metal depending on your build configuration. The library falls back to CPU if GPU resources are unavailable:
// Uses GPU by default (all layers offloaded)
model, err := llama.LoadModel("model.gguf")
// Limit GPU usage (useful for large models)
model, err := llama.LoadModel("model.gguf",
llama.WithGPULayers(20),
)
// Force CPU-only inference
model, err := llama.LoadModel("model.gguf",
llama.WithGPULayers(0),
)
Context Management ¶
The library automatically uses each model's native maximum context length from GGUF metadata, giving you full model capabilities without artificial limits:
// Uses model's native context (e.g. 40960 for Qwen3, 128000 for Gemma 3)
model, err := llama.LoadModel("model.gguf")
// Override for memory savings
model, err := llama.LoadModel("model.gguf",
llama.WithContext(8192),
)
Concurrent Inference ¶
Models are thread-safe and support concurrent generation requests through an internal context pool:
var wg sync.WaitGroup
for i := 0; i < 10; i++ {
wg.Add(1)
go func(prompt string) {
defer wg.Done()
result, _ := model.Generate(prompt)
fmt.Println(result)
}(fmt.Sprintf("Question %d:", i))
}
wg.Wait()
The pool automatically scales between minimum and maximum contexts based on demand, reusing contexts when possible and cleaning up idle ones.
Streaming Generation ¶
Stream tokens as they're generated using a callback:
err := model.GenerateStream("Tell me a story",
func(token string) bool {
fmt.Print(token)
return true // Continue generation
},
)
Return false from the callback to stop generation early.
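The early-stop path can be factored into a reusable callback. The stopAfter helper below is an illustrative sketch, not part of the package; in real use you would pass it to model.GenerateStream:

```go
package main

import "fmt"

// stopAfter returns a GenerateStream-style callback that prints each
// token and stops generation after n tokens have been received.
func stopAfter(n int) func(token string) bool {
	count := 0
	return func(token string) bool {
		fmt.Print(token)
		count++
		return count < n // returning false stops generation early
	}
}

func main() {
	// In real use: err := model.GenerateStream("Tell me a story", stopAfter(50))
	cb := stopAfter(3)
	for _, tok := range []string{"Once", " upon", " a", " time"} {
		if !cb(tok) {
			break
		}
	}
}
```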
Prefix Caching ¶
The library automatically reuses KV cache entries for matching prompt prefixes, significantly improving performance for conversation-style usage:
// First call processes full prompt
model.Generate("You are a helpful assistant.\n\nUser: Hello")
// Second call reuses cached system prompt
model.Generate("You are a helpful assistant.\n\nUser: How are you?")
Prefix caching is enabled by default and includes a last-token refresh optimisation to maintain deterministic generation with minimal overhead (~0.1-0.5ms per call).
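Conceptually, the reusable portion is the longest run of leading tokens two prompts share. The helper below is a minimal sketch of that idea, not the package's internal implementation:

```go
package main

import "fmt"

// commonPrefixLen returns how many leading tokens two token sequences
// share - the quantity prefix caching exploits to skip re-processing.
func commonPrefixLen(a, b []int32) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	// Two conversation turns tokenized (hypothetical token IDs): the
	// shared system prompt occupies the common prefix.
	turn1 := []int32{101, 202, 303, 404}
	turn2 := []int32{101, 202, 303, 505}
	fmt.Println(commonPrefixLen(turn1, turn2)) // tokens that need no re-processing
}
```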
Speculative Decoding ¶
Accelerate generation using a smaller draft model:
target, _ := llama.LoadModel("large-model.gguf")
draft, _ := llama.LoadModel("small-model.gguf")
defer target.Close()
defer draft.Close()
result, err := target.GenerateWithDraft(
"Once upon a time",
draft,
llama.WithDraftTokens(5),
)
The draft model generates candidate tokens that the target model verifies in parallel, reducing overall latency whilst maintaining quality.
Advanced Configuration ¶
Fine-tune generation behaviour with sampling parameters:
result, err := model.Generate("Explain quantum computing",
llama.WithMaxTokens(500),
llama.WithTemperature(0.7),
llama.WithTopP(0.9),
llama.WithTopK(40),
llama.WithSeed(42),
llama.WithStopWords("</answer>", "\n\n"),
)
Thread Safety ¶
All public methods are thread-safe. The Model type uses an internal RWMutex to protect shared state and coordinates access to the context pool. Multiple goroutines can safely call Generate() concurrently.
Resource Cleanup ¶
Always call Close() when finished with a model to free GPU memory and other resources:
model, err := llama.LoadModel("model.gguf")
if err != nil {
return err
}
defer model.Close()
Close() is safe to call multiple times and will block until all active generation requests complete.
Build Requirements ¶
This package requires CGO and a C++ compiler. Pre-built llama.cpp libraries are included in the repository for convenience. See the project README for detailed build instructions and GPU acceleration setup.
Index ¶
- func Bool(v bool) *bool
- func Float32(v float32) *float32
- func InitLogging()
- func Int(v int) *int
- type ChatDelta
- type ChatMessage
- type ChatOptions
- type ChatResponse
- type Context
- func (c *Context) Chat(ctx gocontext.Context, messages []ChatMessage, opts ChatOptions) (*ChatResponse, error)
- func (c *Context) ChatStream(ctx gocontext.Context, messages []ChatMessage, opts ChatOptions) (<-chan ChatDelta, <-chan error)
- func (c *Context) Close() error
- func (c *Context) Generate(prompt string, opts ...GenerateOption) (string, error)
- func (c *Context) GenerateChannel(ctx gocontext.Context, prompt string, opts ...GenerateOption) (<-chan string, <-chan error)
- func (c *Context) GenerateStream(prompt string, callback func(token string) bool, opts ...GenerateOption) error
- func (c *Context) GenerateWithDraft(prompt string, draft *Context, opts ...GenerateOption) (string, error)
- func (c *Context) GenerateWithDraftChannel(ctx gocontext.Context, prompt string, draft *Context, opts ...GenerateOption) (<-chan string, <-chan error)
- func (c *Context) GenerateWithDraftStream(prompt string, draft *Context, callback func(token string) bool, ...) error
- func (c *Context) GenerateWithTokens(tokens []int32, opts ...GenerateOption) (string, error)
- func (c *Context) GenerateWithTokensStream(tokens []int32, callback func(token string) bool, opts ...GenerateOption) error
- func (c *Context) GetCachedTokenCount() (int, error)
- func (c *Context) GetEmbeddings(text string) ([]float32, error)
- func (c *Context) GetEmbeddingsBatch(texts []string) ([][]float32, error)
- func (c *Context) Tokenize(text string) ([]int32, error)
- type ContextOption
- func WithBatch(size int) ContextOption
- func WithContext(size int) ContextOption
- func WithEmbeddings() ContextOption
- func WithF16Memory() ContextOption
- func WithFlashAttn(mode string) ContextOption
- func WithKVCacheType(cacheType string) ContextOption
- func WithParallel(n int) ContextOption
- func WithPrefixCaching(enabled bool) ContextOption
- func WithThreads(n int) ContextOption
- func WithThreadsBatch(n int) ContextOption
- type GPUInfo
- type GenerateOption
- func WithDRYAllowedLength(length int) GenerateOption
- func WithDRYBase(base float32) GenerateOption
- func WithDRYMultiplier(mult float32) GenerateOption
- func WithDRYPenaltyLastN(n int) GenerateOption
- func WithDRYSequenceBreakers(breakers ...string) GenerateOption
- func WithDebug() GenerateOption
- func WithDraftTokens(n int) GenerateOption
- func WithDynamicTemperature(tempRange, exponent float32) GenerateOption
- func WithFrequencyPenalty(penalty float32) GenerateOption
- func WithIgnoreEOS(ignore bool) GenerateOption
- func WithMaxTokens(n int) GenerateOption
- func WithMinKeep(n int) GenerateOption
- func WithMinP(p float32) GenerateOption
- func WithMirostat(version int) GenerateOption
- func WithMirostatEta(eta float32) GenerateOption
- func WithMirostatTau(tau float32) GenerateOption
- func WithNPrev(n int) GenerateOption
- func WithNProbs(n int) GenerateOption
- func WithPenaltyLastN(n int) GenerateOption
- func WithPresencePenalty(penalty float32) GenerateOption
- func WithRepeatPenalty(penalty float32) GenerateOption
- func WithSeed(seed int) GenerateOption
- func WithStopWords(words ...string) GenerateOption
- func WithTemperature(t float32) GenerateOption
- func WithTopK(k int) GenerateOption
- func WithTopNSigma(sigma float32) GenerateOption
- func WithTopP(p float32) GenerateOption
- func WithTypicalP(p float32) GenerateOption
- func WithXTC(probability, threshold float32) GenerateOption
- type Model
- type ModelMetadata
- type ModelOption
- type ModelStats
- type ProgressCallback
- type ReasoningFormat
- type RuntimeInfo
- type Tool
- type ToolCall
- type ToolFunction
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Bool ¶
Bool returns a pointer to the given bool value. This is a convenience helper for setting optional ChatOptions fields.
Example:
opts := llama.ChatOptions{
EnableThinking: llama.Bool(true), // Instead of &true
}
func Float32 ¶
Float32 returns a pointer to the given float32 value. This is a convenience helper for setting optional ChatOptions fields.
Example:
opts := llama.ChatOptions{
Temperature: llama.Float32(0.7), // Instead of &0.7
}
func InitLogging ¶
func InitLogging()
InitLogging (re)initializes the llama.cpp logging system based on the LLAMA_LOG environment variable.
This function is called automatically when the package loads, but can be called again to reconfigure logging after changing the LLAMA_LOG environment variable.
Supported LLAMA_LOG values:
- "none" - No logging
- "error" - Only errors
- "warn" - Warnings and errors (recommended for production)
- "info" - Informational messages (default)
- "debug" - Verbose debug output
Example:
os.Setenv("LLAMA_LOG", "warn") // Quiet mode
llama.InitLogging() // Apply the change
Types ¶
type ChatDelta ¶
type ChatDelta struct {
Content string // Regular content token(s)
ReasoningContent string // Reasoning token(s)
}
ChatDelta represents a streaming chunk from chat completion.
During streaming, deltas arrive progressively. For standard models, only Content is populated with token(s). For reasoning models with extraction enabled, tokens may appear in either Content or ReasoningContent depending on whether they're inside reasoning tags.
Example:
deltaCh, errCh := model.ChatStream(ctx, messages, opts)
for {
select {
case delta, ok := <-deltaCh:
if !ok {
return
}
if delta.Content != "" {
fmt.Print(delta.Content)
}
if delta.ReasoningContent != "" {
fmt.Print("[thinking: ", delta.ReasoningContent, "]")
}
case err := <-errCh:
if err != nil {
log.Fatal(err)
}
}
}
type ChatMessage ¶
type ChatMessage struct {
Role string // Message role (e.g., "system", "user", "assistant")
Content string // Message content
}
ChatMessage represents a message in a chat conversation.
Common roles include "system", "user", "assistant", "tool", and "function". The role is not validated by this library - the model's chat template will handle role interpretation and any unknown roles.
Example:
messages := []llama.ChatMessage{
{Role: "system", Content: "You are a helpful assistant."},
{Role: "user", Content: "What is the capital of France?"},
}
type ChatOptions ¶
type ChatOptions struct {
// Base generation options
MaxTokens *int // Maximum tokens to generate (nil = model default)
Temperature *float32 // Sampling temperature (nil = model default, typically 0.8)
TopP *float32 // Nucleus sampling threshold (nil = model default, typically 0.95)
TopK *int // Top-K sampling (nil = model default, typically 40)
Seed *int // Random seed for reproducible generation (nil = random)
StopWords []string // Additional stop sequences beyond model defaults
// Chat template (Jinja2 template string)
// If empty, uses model's GGUF template. If model has no template, returns error.
// Supports 40+ formats: chatml, llama2, llama3, mistral, gemma, phi3, etc.
// See: https://github.com/ggerganov/llama.cpp/blob/master/common/chat.cpp
ChatTemplate string
// Chat template variables (arbitrary JSON-compatible key-value pairs)
// These are passed to the model's Jinja2 chat template for customisation.
// Common examples: {"add_generation_prompt": true, "tools": [...]}
ChatTemplateKwargs map[string]interface{}
// Reasoning model options (for models like DeepSeek-R1)
EnableThinking *bool // Enable/disable thinking output (nil = model default)
ReasoningBudget *int // Token limit for reasoning (-1 = unlimited, 0 = disabled)
ReasoningFormat ReasoningFormat // How to handle reasoning content
// Streaming configuration
StreamBufferSize int // Buffer size for streaming channels (default: 256)
}
ChatOptions configures chat completion behaviour.
This extends the base generation options with chat-specific settings like template variables and reasoning parameters. All generation options (temperature, top_p, etc.) can be set here, or left nil to use defaults.
Example:
opts := llama.ChatOptions{
MaxTokens: llama.Int(100),
Temperature: llama.Float32(0.7),
TopP: llama.Float32(0.9),
}
type ChatResponse ¶
type ChatResponse struct {
Content string // Regular response content
ReasoningContent string // Extracted reasoning/thinking (if reasoning model)
}
ChatResponse represents the complete response from a chat completion.
For standard models, only Content is populated. For reasoning models (like DeepSeek-R1), ReasoningContent may contain extracted thinking/reasoning tokens that were separated from the main response.
Example:
response, err := model.Chat(ctx, messages, opts)
if err != nil {
log.Fatal(err)
}
fmt.Println("Response:", response.Content)
if response.ReasoningContent != "" {
fmt.Println("Reasoning:", response.ReasoningContent)
}
type Context ¶
type Context struct {
// contains filtered or unexported fields
}
Context represents an execution context for inference operations.
Context instances maintain their own KV cache and state, allowing independent inference operations. Contexts are NOT thread-safe - each context should be used by only one goroutine at a time. For concurrent inference, create multiple contexts from the same model.
Multiple contexts share model weights, making concurrent inference VRAM-efficient (e.g., one 7GB model + 100MB per context vs 7GB per instance).
Resources should be freed with Close() when finished:
ctx, _ := model.NewContext(llama.WithContext(8192))
defer ctx.Close()
See also: Model.NewContext for creating contexts.
func (*Context) Chat ¶
func (c *Context) Chat(ctx gocontext.Context, messages []ChatMessage, opts ChatOptions) (*ChatResponse, error)
Chat performs conversational generation using chat messages.
This method formats messages using a chat template and generates a response. The template can be provided in opts or will be read from the model's GGUF metadata. Supports 40+ template formats including ChatML, Llama-2, Llama-3, Mistral, Gemma, and Phi-3.
See also: ChatStream for streaming responses, Generate for raw prompt completion.
Example:
messages := []llama.ChatMessage{
{Role: "system", Content: "You are a helpful assistant."},
{Role: "user", Content: "Hello!"},
}
response, err := ctx.Chat(context.Background(), messages, llama.ChatOptions{})
func (*Context) ChatStream ¶
func (c *Context) ChatStream(ctx gocontext.Context, messages []ChatMessage, opts ChatOptions) (<-chan ChatDelta, <-chan error)
ChatStream performs conversational generation with streaming output.
Returns channels for chat deltas and errors, similar to GenerateChannel. Supports context cancellation for early termination.
See also: Chat for synchronous chat completion.
Example:
deltas, errs := ctx.ChatStream(context.Background(), messages, llama.ChatOptions{})
for delta := range deltas {
fmt.Print(delta.Content)
}
func (*Context) Close ¶
Close frees the context and its associated resources.
This method is idempotent - multiple calls are safe and subsequent calls return immediately without error.
After Close() is called, all other methods return an error.
Example:
ctx, _ := model.NewContext()
defer ctx.Close()
func (*Context) Generate ¶
func (c *Context) Generate(prompt string, opts ...GenerateOption) (string, error)
Generate generates text from the given prompt.
This method performs synchronous text generation, returning the complete result when finished. The context automatically reuses KV cache entries for matching prompt prefixes (prefix caching), significantly improving performance for conversation-style usage.
Thread safety: Context is NOT thread-safe. Use separate contexts for concurrent generation requests (create multiple contexts from the same Model).
See also: GenerateStream for streaming output, Chat for structured conversations.
Examples:
// Basic generation
result, err := ctx.Generate("Once upon a time")
// With custom parameters
result, err := ctx.Generate("Explain quantum physics",
llama.WithMaxTokens(512),
llama.WithTemperature(0.7),
)
func (*Context) GenerateChannel ¶
func (c *Context) GenerateChannel(ctx gocontext.Context, prompt string, opts ...GenerateOption) (<-chan string, <-chan error)
GenerateChannel generates text with streaming output via channel.
Returns two channels: one for tokens and one for errors. The token channel is closed when generation completes. The error channel receives at most one error before closing.
This method supports context cancellation for stopping generation early.
See also: GenerateStream for callback-based streaming, Generate for synchronous generation.
Example:
tokens, errs := ctx.GenerateChannel(context.Background(), "Write a story")
for token := range tokens {
fmt.Print(token)
}
if err := <-errs; err != nil {
log.Fatal(err)
}
func (*Context) GenerateStream ¶
func (c *Context) GenerateStream(prompt string, callback func(token string) bool, opts ...GenerateOption) error
GenerateStream generates text with streaming output via callback.
The callback receives each generated token as it's produced. Return true to continue generation, or false to stop early.
See also: Generate for synchronous generation, GenerateChannel for channel-based streaming with context cancellation support.
Examples:
// Stream to stdout
err := ctx.GenerateStream("Tell me a story",
func(token string) bool {
fmt.Print(token)
return true
},
)
func (*Context) GenerateWithDraft ¶
func (c *Context) GenerateWithDraft(prompt string, draft *Context, opts ...GenerateOption) (string, error)
GenerateWithDraft performs speculative generation using a draft model.
Speculative decoding uses a smaller draft model to generate candidate tokens that the target model verifies in parallel. This reduces latency whilst maintaining the target model's quality.
Best results come when the draft model is 5-10x smaller than the target and the two models share similar vocabularies. Typical speedup: 1.5-3x.
See also: GenerateWithDraftStream for streaming speculative generation.
Example:
target, _ := llama.LoadModel("large-model.gguf")
draft, _ := llama.LoadModel("small-model.gguf")
targetCtx, _ := target.NewContext()
draftCtx, _ := draft.NewContext()
result, err := targetCtx.GenerateWithDraft("Once upon a time", draftCtx,
llama.WithDraftTokens(8),
)
func (*Context) GenerateWithDraftChannel ¶
func (c *Context) GenerateWithDraftChannel(ctx gocontext.Context, prompt string, draft *Context, opts ...GenerateOption) (<-chan string, <-chan error)
GenerateWithDraftChannel generates text with streaming via channel using a draft model.
Combines GenerateWithDraft and GenerateChannel.
Example:
tokens, errs := targetCtx.GenerateWithDraftChannel(context.Background(),
"Write a story", draftCtx, llama.WithDraftTokens(8))
for token := range tokens {
fmt.Print(token)
}
func (*Context) GenerateWithDraftStream ¶
func (c *Context) GenerateWithDraftStream(prompt string, draft *Context, callback func(token string) bool, opts ...GenerateOption) error
GenerateWithDraftStream performs speculative generation with streaming output.
Combines GenerateWithDraft and GenerateStream.
Example:
targetCtx.GenerateWithDraftStream("Write a story", draftCtx,
func(token string) bool {
fmt.Print(token)
return true
},
llama.WithDraftTokens(8),
)
func (*Context) GenerateWithTokens ¶
func (c *Context) GenerateWithTokens(tokens []int32, opts ...GenerateOption) (string, error)
GenerateWithTokens generates text starting from the given tokens.
This is an advanced method for cases where you've already tokenized the prompt or want to use cached tokens. For normal usage, use Generate() instead.
Example:
tokens, _ := ctx.Tokenize("Once upon a time")
result, _ := ctx.GenerateWithTokens(tokens)
func (*Context) GenerateWithTokensStream ¶
func (c *Context) GenerateWithTokensStream(tokens []int32, callback func(token string) bool, opts ...GenerateOption) error
GenerateWithTokensStream generates text with streaming from tokens.
Combines GenerateWithTokens and GenerateStream.
Example:
tokens, _ := ctx.Tokenize("Write a story")
err := ctx.GenerateWithTokensStream(tokens,
func(token string) bool {
fmt.Print(token)
return true
},
)
func (*Context) GetCachedTokenCount ¶
GetCachedTokenCount returns the number of cached tokens (for debugging/metrics).
This method provides insight into prefix caching behaviour, showing how many tokens from previous prompts are cached.
Example:
ctx.Generate("System prompt: You are helpful.\n\nUser: Hello")
cached, _ := ctx.GetCachedTokenCount()
fmt.Printf("Cached tokens: %d\n", cached)
func (*Context) GetEmbeddings ¶
GetEmbeddings computes embeddings for the given text.
Embeddings are vector representations useful for semantic search, clustering, or similarity tasks. The context must be created with WithEmbeddings() to use this method.
See also: GetEmbeddingsBatch for efficient batch processing of multiple texts.
Example:
ctx, _ := model.NewContext(llama.WithEmbeddings())
emb1, _ := ctx.GetEmbeddings("Hello world")
emb2, _ := ctx.GetEmbeddings("Hi there")
func (*Context) GetEmbeddingsBatch ¶
GetEmbeddingsBatch computes embeddings for multiple texts efficiently.
This method processes multiple texts in a single batch operation, which is significantly more efficient than calling GetEmbeddings repeatedly. Uses parallel sequence processing (configured via WithParallel) to maximise throughput.
The context must be created with WithEmbeddings() to use this method. Batch size is limited by WithParallel setting (default 8 for embedding contexts).
See also: GetEmbeddings for single text processing.
Example:
ctx, _ := model.NewContext(llama.WithEmbeddings())
texts := []string{"First", "Second", "Third"}
embeddings, _ := ctx.GetEmbeddingsBatch(texts)
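Vectors returned by GetEmbeddings or GetEmbeddingsBatch are typically compared with cosine similarity. The helper below is plain Go, independent of the package:

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between two embedding
// vectors: 1.0 for identical direction, 0.0 for orthogonal vectors.
// Both slices are assumed to have the same length.
func cosineSimilarity(a, b []float32) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// In real use: embs, _ := ctx.GetEmbeddingsBatch([]string{"Hello world", "Hi there"})
	a := []float32{1, 0, 0}
	b := []float32{0, 1, 0}
	fmt.Println(cosineSimilarity(a, a)) // identical vectors -> 1
	fmt.Println(cosineSimilarity(a, b)) // orthogonal vectors -> 0
}
```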
func (*Context) Tokenize ¶
Tokenize converts text to tokens.
Tokens are integer IDs representing subword units in the model's vocabulary. This method is useful for advanced use cases like manual prompt construction, token counting, or analysis.
Examples:
// Count tokens in a prompt
tokens, _ := ctx.Tokenize("Hello world")
fmt.Printf("Token count: %d\n", len(tokens))
type ContextOption ¶
type ContextOption func(*contextConfig)
ContextOption configures context creation (context-level settings).
func WithBatch ¶
func WithBatch(size int) ContextOption
WithBatch sets the batch size for prompt processing.
Larger batch sizes improve throughput for long prompts but increase memory usage. The batch size determines how many tokens are processed in parallel during the prompt evaluation phase.
Default: 512
Example:
// Process 1024 tokens at once for faster prompt handling
ctx, err := model.NewContext(llama.WithBatch(1024))
func WithContext ¶
func WithContext(size int) ContextOption
WithContext sets the context window size in tokens.
The context size determines how many tokens (prompt + generation) the context can process. By default, the library uses the model's native maximum context length (e.g. 40960 for Qwen3, 128000 for Gemma 3 models >4B).
Override this if you need to limit memory usage or have specific requirements.
IMPORTANT: Very small context sizes (< 64 tokens) may cause llama.cpp to crash internally. The library provides defensive checks but cannot prevent all edge cases with absurdly small contexts.
Default: 0 (uses model's native maximum from GGUF metadata)
Examples:
// Use model's full capability (default)
ctx, err := model.NewContext()
// Limit to 8K for memory savings
ctx, err := model.NewContext(llama.WithContext(8192))
func WithEmbeddings ¶
func WithEmbeddings() ContextOption
WithEmbeddings enables embedding extraction mode.
When enabled, the context can compute text embeddings via GetEmbeddings(). This mode is required for semantic search, clustering, or similarity tasks. Note that not all models support embeddings - check model documentation.
Default: false (text generation mode)
Example:
ctx, err := model.NewContext(llama.WithEmbeddings())
embeddings, err := ctx.GetEmbeddings("Hello world")
func WithF16Memory ¶
func WithF16Memory() ContextOption
WithF16Memory enables 16-bit floating point memory mode.
When enabled, the context uses FP16 precision for KV cache storage, reducing memory usage at the cost of slight accuracy loss. Most useful when working with very long contexts or memory-constrained environments.
Default: false (uses FP32 for KV cache)
Example:
ctx, err := model.NewContext(llama.WithF16Memory())
func WithFlashAttn ¶
func WithFlashAttn(mode string) ContextOption
WithFlashAttn controls Flash Attention kernel usage for attention computation.
Flash Attention is a GPU-optimized attention implementation that significantly reduces VRAM usage and improves performance, especially for longer contexts. It's required when using quantized KV cache types (q8_0, q4_0).
Available modes:
- "auto" (default): llama.cpp decides based on hardware and model config
- "enabled": Force Flash Attention on (fails if hardware doesn't support it)
- "disabled": Use traditional attention (incompatible with quantized KV cache)
Technical details:
- Requires CUDA compute capability 7.0+ (Volta/Turing or newer)
- With GGML_CUDA_FA_ALL_QUANTS: Supports all KV cache quantization types
- Without flag: Only supports f16, q4_0, and q8_0 (matching K/V types)
- AUTO mode detects if backend scheduler supports the Flash Attention ops
Default: "auto" (llama.cpp chooses optimal path)
Examples:
// Use default auto-detection (recommended)
ctx, err := model.NewContext(llama.WithKVCacheType("q8_0"))
// Force Flash Attention on (errors if unsupported)
ctx, err := model.NewContext(llama.WithFlashAttn("enabled"))
// Disable Flash Attention (requires f16 KV cache)
ctx, err := model.NewContext(
llama.WithKVCacheType("f16"),
llama.WithFlashAttn("disabled"),
)
func WithKVCacheType ¶
func WithKVCacheType(cacheType string) ContextOption
WithKVCacheType sets the quantization type for KV cache storage.
The KV (key-value) cache stores attention states during generation and grows with context length. Quantizing this cache dramatically reduces VRAM usage with minimal quality impact:
- "q8_0" (default): 50% VRAM savings, ~0.1% quality loss (imperceptible)
- "f16": Full precision, no savings, maximum quality
- "q4_0": 75% VRAM savings, noticeable quality loss (models become forgetful)
Memory scaling example for 131K context (DeepSeek-R1 trained capacity):
- f16: 18 GB
- q8_0: 9 GB (recommended)
- q4_0: 4.5 GB (use only for extreme VRAM constraints)
Default: "q8_0" (best balance of memory and quality)
Examples:
// Use default Q8 quantization (recommended)
ctx, err := model.NewContext()
// Maximum quality for VRAM-rich systems
ctx, err := model.NewContext(llama.WithKVCacheType("f16"))
// Extreme memory savings (accept quality loss)
ctx, err := model.NewContext(llama.WithKVCacheType("q4_0"))
func WithParallel ¶
func WithParallel(n int) ContextOption
WithParallel sets the number of parallel sequences for batch processing.
This option controls how many independent sequences can be processed simultaneously in a single batch. Higher values enable larger batch sizes for operations like GetEmbeddingsBatch() but consume more VRAM.
For embedding contexts, the library defaults to n_parallel=8 if not explicitly set. This option allows tuning this value for your specific VRAM constraints and batch sizes.
VRAM usage scales approximately as:
base_model_size + (n_parallel × context_size × kv_cache_bytes)
For example, a 4B Q8 embedding model with 8192 context and q8_0 cache:
- n_parallel=8: ~12 GB VRAM
- n_parallel=4: ~8 GB VRAM
- n_parallel=2: ~6 GB VRAM
- n_parallel=1: ~5 GB VRAM (disables batch processing)
Trade-offs:
- Lower values: Less VRAM usage, slower batch processing, smaller max batch size
- Higher values: More VRAM usage, faster batch processing, larger max batch size
Default: 1 for generation contexts, 8 for embedding contexts (auto-set)
Examples:
// Use default (8 for embeddings, 1 for generation)
ctx, err := model.NewContext(llama.WithEmbeddings())
// Tune down for large embedding model with limited VRAM
ctx, err := model.NewContext(
llama.WithEmbeddings(),
llama.WithParallel(4),
)
// Single sequence (minimal VRAM, no batching)
ctx, err := model.NewContext(
llama.WithEmbeddings(),
llama.WithParallel(1),
)
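The scaling rule above can be turned into a rough numeric estimate. Both the helper and the per-token KV figure below are illustrative assumptions (KV bytes per token depends on the model architecture and cache type), not package APIs:

```go
package main

import "fmt"

// estimateVRAMMB applies the approximate scaling rule
//   base_model_size + (n_parallel x context_size x kv_cache_bytes)
// with the base size in MB and kvBytesPerToken a model-dependent constant.
func estimateVRAMMB(baseModelMB float64, nParallel, contextSize int, kvBytesPerToken float64) float64 {
	kvMB := float64(nParallel) * float64(contextSize) * kvBytesPerToken / (1024 * 1024)
	return baseModelMB + kvMB
}

func main() {
	// Hypothetical figures: a 5000 MB model, 8192-token context, and
	// roughly 0.1 MB of KV cache per token per sequence.
	for _, n := range []int{1, 2, 4, 8} {
		fmt.Printf("n_parallel=%d: ~%.0f MB\n", n, estimateVRAMMB(5000, n, 8192, 0.1*1024*1024))
	}
}
```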
func WithPrefixCaching ¶
func WithPrefixCaching(enabled bool) ContextOption
WithPrefixCaching enables or disables KV cache prefix reuse.
When enabled (default), the context automatically reuses cached KV entries for matching prompt prefixes, significantly improving performance for conversation-style usage where prompts share common beginnings.
Default: true (enabled)
Example:
// Disable prefix caching (not recommended for most use cases)
ctx, err := model.NewContext(llama.WithPrefixCaching(false))
func WithThreads ¶
func WithThreads(n int) ContextOption
WithThreads sets the number of threads for token generation. If not specified, defaults to runtime.NumCPU(). This also sets threadsBatch to the same value unless WithThreadsBatch is used.
func WithThreadsBatch ¶
func WithThreadsBatch(n int) ContextOption
WithThreadsBatch sets the number of threads for batch/prompt processing. If not specified, defaults to the same value as threads. For most use cases, leaving this unset is recommended.
type GPUInfo ¶
type GPUInfo struct {
DeviceID int // CUDA device ID
DeviceName string // GPU model name (e.g., "NVIDIA GeForce RTX 3090")
FreeMemoryMB int // Available VRAM in MB
TotalMemoryMB int // Total VRAM in MB
}
GPUInfo contains information about a CUDA GPU device.
type GenerateOption ¶
type GenerateOption func(*generateConfig)
GenerateOption configures text generation behaviour.
func WithDRYAllowedLength ¶
func WithDRYAllowedLength(length int) GenerateOption
WithDRYAllowedLength sets minimum repeat length before DRY penalty applies.
Repetitions shorter than this many tokens are ignored by DRY sampling. Prevents penalising common short phrases and natural language patterns. Only relevant when DRY multiplier is enabled.
Default: 2
Example:
// Only penalise repetitions of 4+ tokens
text, err := model.Generate("Write text",
llama.WithDRYMultiplier(0.8),
llama.WithDRYAllowedLength(4),
)
func WithDRYBase ¶
func WithDRYBase(base float32) GenerateOption
WithDRYBase sets the base for DRY penalty exponentiation.
Controls how rapidly penalty grows for longer repeated sequences. Higher values penalise longer repetitions more aggressively. Only affects behaviour when DRY multiplier is enabled (> 0.0).
Default: 1.75
Example:
// Stronger penalty for long repeated sequences
text, err := model.Generate("Write text",
llama.WithDRYMultiplier(0.8),
llama.WithDRYBase(2.0),
)
func WithDRYMultiplier ¶
func WithDRYMultiplier(mult float32) GenerateOption
WithDRYMultiplier enables DRY repetition penalty.
DRY sampling uses sophisticated sequence matching to penalise repetitive patterns. The multiplier controls penalty strength (0.0 = disabled, 0.8 = moderate, higher = stronger). More effective than basic repetition penalties for catching phrase-level and structural repetition.
Default: 0.0 (disabled)
Example:
// Prevent repetitive patterns
text, err := model.Generate("Write varied text",
llama.WithDRYMultiplier(0.8),
llama.WithDRYBase(1.75),
)
func WithDRYPenaltyLastN ¶
func WithDRYPenaltyLastN(n int) GenerateOption
WithDRYPenaltyLastN sets how many recent tokens DRY sampling considers.
DRY looks back this many tokens when detecting repetitive patterns. Use -1 for full context size, or specify a smaller window for efficiency. Only affects behaviour when DRY multiplier is enabled.
Default: -1 (context size)
Example:
// Check last 512 tokens for repetition
text, err := model.Generate("Write text",
llama.WithDRYMultiplier(0.8),
llama.WithDRYPenaltyLastN(512),
)
func WithDRYSequenceBreakers ¶
func WithDRYSequenceBreakers(breakers ...string) GenerateOption
WithDRYSequenceBreakers sets sequences that break DRY repetition matching.
When these sequences appear, DRY stops considering earlier tokens as part of a repeated pattern. Default breakers (newline, colon, quote, asterisk) work well for natural text structure. Only affects behaviour when DRY multiplier is enabled.
Default: []string{"\n", ":", "\"", "*"}
Example:
// Custom breakers for code generation
text, err := model.Generate("Write code",
llama.WithDRYMultiplier(0.8),
llama.WithDRYSequenceBreakers("\n", ";", "{", "}"),
)
func WithDebug ¶
func WithDebug() GenerateOption
WithDebug enables verbose logging for generation internals.
When enabled, prints detailed information about token sampling, timing, and internal state to stderr. Useful for debugging generation issues or understanding model behaviour. Not recommended for production use.
Default: false
Example:
text, err := model.Generate("Test prompt",
llama.WithDebug(),
)
func WithDraftTokens ¶
func WithDraftTokens(n int) GenerateOption
WithDraftTokens sets the number of speculative tokens for draft model usage.
When using GenerateWithDraft, the draft model speculatively generates this many tokens per iteration. Higher values increase potential speedup but waste more work if predictions are rejected. Typical range: 4-32 tokens.
Default: 16
Example:
target, _ := llama.LoadModel("large-model.gguf")
draft, _ := llama.LoadModel("small-model.gguf")
text, err := target.GenerateWithDraft("Write a story", draft,
llama.WithDraftTokens(8),
)
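The accept/reject cycle behind speculative decoding can be pictured with a toy sketch. This is not the library's verification logic (which scores all draft tokens in one batched target forward pass); the token values are made up for illustration:

```go
package main

import "fmt"

// acceptPrefix returns how many draft tokens agree with what the target
// model would have produced. Generation resumes from the first mismatch.
func acceptPrefix(draft, target []int) int {
	n := 0
	for n < len(draft) && n < len(target) && draft[n] == target[n] {
		n++
	}
	return n
}

func main() {
	draft := []int{5, 9, 2, 7}  // tokens proposed by the small model
	target := []int{5, 9, 4, 1} // tokens the large model would pick
	// Two tokens accepted for the price of one target forward pass.
	fmt.Println("accepted:", acceptPrefix(draft, target))
}
```

A larger draft window raises the potential accepted prefix but also the work wasted when the models disagree early, which is why WithDraftTokens trades off speedup against rejection cost.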
func WithDynamicTemperature ¶
func WithDynamicTemperature(tempRange, exponent float32) GenerateOption
WithDynamicTemperature enables entropy-based temperature adjustment.
Dynamic temperature adjusts sampling temperature based on prediction entropy (uncertainty). The range parameter controls the adjustment span (0.0 = disabled, higher = more dynamic). The exponent controls how entropy maps to temperature. This adapts creativity to context: more focused when confident, more exploratory when uncertain.
Default: range 0.0 (disabled), exponent 1.0
Example:
// Enable dynamic temperature with range 0.5
text, err := model.Generate("Write adaptively",
llama.WithDynamicTemperature(0.5, 1.0),
)
func WithFrequencyPenalty ¶
func WithFrequencyPenalty(penalty float32) GenerateOption
WithFrequencyPenalty sets the frequency-based repetition penalty.
Penalises tokens proportionally to how often they've appeared. Positive values (e.g. 0.5) discourage repetition, negative values encourage it. Use 0.0 to disable. Unlike repeat penalty, this considers cumulative frequency rather than just presence/absence.
Default: 0.0 (disabled)
Example:
// Discourage frequently used words
text, err := model.Generate("Write varied prose",
llama.WithFrequencyPenalty(0.5),
)
func WithIgnoreEOS ¶
func WithIgnoreEOS(ignore bool) GenerateOption
WithIgnoreEOS continues generation past end-of-sequence tokens.
When enabled, generation continues even after the model produces an EOS token, up to max_tokens limit. Useful for forcing longer outputs or exploring model behaviour beyond natural stopping points. Most applications should leave this disabled.
Default: false
Example:
// Force generation to continue past EOS
text, err := model.Generate("Complete this",
llama.WithIgnoreEOS(true),
llama.WithMaxTokens(512),
)
func WithMaxTokens ¶
func WithMaxTokens(n int) GenerateOption
WithMaxTokens sets the maximum number of tokens to generate.
Generation stops after producing this many tokens, even if the model hasn't emitted an end-of-sequence token. This prevents runaway generation and controls response length.
Default: 128
Example:
// Generate up to 512 tokens
text, err := model.Generate("Write a story",
llama.WithMaxTokens(512),
)
func WithMinKeep ¶
func WithMinKeep(n int) GenerateOption
WithMinKeep sets minimum tokens to keep regardless of other filters.
Ensures at least this many tokens remain available after sampling filters (top-k, top-p, min-p, etc.) are applied. Prevents over-aggressive filtering from leaving no valid tokens. Use 0 for no minimum.
Default: 0
Example:
// Ensure at least 5 token choices remain
text, err := model.Generate("Generate text",
llama.WithTopK(10),
llama.WithMinKeep(5),
)
func WithMinP ¶
func WithMinP(p float32) GenerateOption
WithMinP enables minimum probability threshold sampling.
Min-P sampling filters out tokens with probability below p * max_probability. This is a modern alternative to top-p that adapts dynamically to the confidence of predictions. More effective than top-p for maintaining quality whilst allowing appropriate diversity.
Default: 0.05
Example:
// Stricter filtering for focused output
text, err := model.Generate("Explain quantum physics",
llama.WithMinP(0.1),
)
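The p * max_probability cutoff can be written in a few lines. This is a conceptual sketch of the rule, not the library's sampler:

```go
package main

import "fmt"

// minPFilter keeps token indices whose probability is at least p times
// the highest probability, which is the essence of min-p sampling: the
// cutoff scales with how confident the model is.
func minPFilter(probs []float64, p float64) []int {
	maxProb := 0.0
	for _, v := range probs {
		if v > maxProb {
			maxProb = v
		}
	}
	var kept []int
	for i, v := range probs {
		if v >= p*maxProb {
			kept = append(kept, i)
		}
	}
	return kept
}

func main() {
	probs := []float64{0.50, 0.30, 0.10, 0.06, 0.04}
	// With p=0.1 the cutoff is 0.05, so only the last token is dropped.
	fmt.Println(minPFilter(probs, 0.1))
}
```

When the top probability is high the cutoff is strict; when the distribution is flat the cutoff relaxes, which is why min-p adapts better than a fixed top-p.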
func WithMirostat ¶
func WithMirostat(version int) GenerateOption
WithMirostat enables Mirostat adaptive sampling.
Mirostat dynamically adjusts sampling to maintain consistent perplexity (surprise level). Version 0 = disabled, 1 = Mirostat v1, 2 = Mirostat v2 (recommended). Use WithMirostatTau and WithMirostatEta to control target perplexity and learning rate. Mirostat replaces temperature/top-k/top-p with adaptive control for more consistent quality.
Default: 0 (disabled)
Example:
// Enable Mirostat v2 for consistent quality
text, err := model.Generate("Write text",
llama.WithMirostat(2),
llama.WithMirostatTau(5.0),
llama.WithMirostatEta(0.1),
)
func WithMirostatEta ¶
func WithMirostatEta(eta float32) GenerateOption
WithMirostatEta sets learning rate for Mirostat adaptation.
Eta controls how quickly Mirostat adjusts to maintain target perplexity. Higher values adapt faster but may oscillate, lower values adapt smoothly but slowly. Typical range: 0.05-0.2. Only affects behaviour when Mirostat is enabled (version 1 or 2).
Default: 0.1
Example:
// Faster adaptation
text, err := model.Generate("Write text",
llama.WithMirostat(2),
llama.WithMirostatEta(0.15),
)
func WithMirostatTau ¶
func WithMirostatTau(tau float32) GenerateOption
WithMirostatTau sets target perplexity for Mirostat sampling.
Tau controls the target cross-entropy (surprise level) that Mirostat tries to maintain. Higher values allow more surprise/diversity, lower values produce more focused output. Typical range: 3.0-8.0. Only affects behaviour when Mirostat is enabled (version 1 or 2).
Default: 5.0
Example:
// Lower perplexity for more focused output
text, err := model.Generate("Write precisely",
llama.WithMirostat(2),
llama.WithMirostatTau(3.0),
)
func WithNPrev ¶
func WithNPrev(n int) GenerateOption
WithNPrev sets number of previous tokens to remember for sampling.
Controls internal buffer size for token history used by various sampling methods. Rarely needs adjustment from the default. Larger values may improve long-range coherence at the cost of memory.
Default: 64
Example:
// Larger history buffer
text, err := model.Generate("Write text",
llama.WithNPrev(128),
)
func WithNProbs ¶
func WithNProbs(n int) GenerateOption
WithNProbs enables probability output for top tokens.
When set to n > 0, outputs probabilities for the top n most likely tokens at each step. Use 0 to disable (no probability output). Useful for analysis, debugging, or implementing custom sampling strategies. Note that enabling this may affect performance.
Default: 0 (disabled)
Example:
// Output top 5 token probabilities
text, err := model.Generate("Write text",
llama.WithNProbs(5),
)
func WithPenaltyLastN ¶
func WithPenaltyLastN(n int) GenerateOption
WithPenaltyLastN sets how many recent tokens to consider for penalties.
Repetition penalties (repeat, frequency, presence) only apply to the last n tokens. Use 0 to disable all repetition penalties, -1 to use full context size. Larger values catch longer-range repetition but may over-penalise.
Default: 64
Example:
// Consider last 256 tokens for repetition
text, err := model.Generate("Write text",
llama.WithRepeatPenalty(1.1),
llama.WithPenaltyLastN(256),
)
func WithPresencePenalty ¶
func WithPresencePenalty(penalty float32) GenerateOption
WithPresencePenalty sets the presence-based repetition penalty.
Penalises tokens that have appeared at all, regardless of frequency. Positive values (e.g. 0.6) encourage new topics and vocabulary. Use 0.0 to disable. This is effective for maintaining topic diversity and preventing the model from fixating on specific words.
Default: 0.0 (disabled)
Example:
// Encourage diverse vocabulary
text, err := model.Generate("Write creatively",
llama.WithPresencePenalty(0.6),
)
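Frequency and presence penalties follow the familiar OpenAI-style formulation: a per-occurrence deduction plus a flat one-off deduction. A minimal sketch of that arithmetic (not the library's sampler, and the token IDs are arbitrary):

```go
package main

import "fmt"

// applyPenalties lowers the logit of each seen token by
// count*freq (frequency penalty, scales with repetition) plus
// pres (presence penalty, applied once if the token appeared at all).
func applyPenalties(logits map[int]float64, counts map[int]int, freq, pres float64) {
	for tok, count := range counts {
		if count > 0 {
			logits[tok] -= float64(count)*freq + pres
		}
	}
}

func main() {
	logits := map[int]float64{7: 2.0, 9: 1.5}
	counts := map[int]int{7: 3, 9: 1} // token 7 appeared three times
	applyPenalties(logits, counts, 0.5, 0.25)
	// Token 7 is penalised harder because frequency scales with count.
	fmt.Println(logits[7], logits[9])
}
```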
func WithRepeatPenalty ¶
func WithRepeatPenalty(penalty float32) GenerateOption
WithRepeatPenalty sets the repetition penalty multiplier.
Applies penalty to recently used tokens to reduce repetition. Values > 1.0 penalise repeated tokens (1.1 = mild, 1.5 = strong). Use 1.0 to disable. Applied to last penalty_last_n tokens. This is the classic repetition penalty used in most LLM implementations.
Default: 1.0 (disabled)
Example:
// Reduce repetition in creative writing
text, err := model.Generate("Write a story",
llama.WithRepeatPenalty(1.1),
llama.WithPenaltyLastN(256),
)
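The classic repeat penalty follows the CTRL-style scheme used by most implementations: positive logits are divided by the penalty and negative logits multiplied by it, pushing seen tokens toward "less likely" in both cases. A sketch of that rule:

```go
package main

import "fmt"

// repeatPenalty applies the CTRL-style repetition penalty to a single
// logit. A penalty of 1.0 is a no-op, matching the disabled default.
func repeatPenalty(logit float64, seen bool, penalty float64) float64 {
	if !seen || penalty == 1.0 {
		return logit
	}
	if logit > 0 {
		return logit / penalty // positive logits shrink toward zero
	}
	return logit * penalty // negative logits move further below zero
}

func main() {
	fmt.Println(repeatPenalty(2.2, true, 1.1))  // reduced
	fmt.Println(repeatPenalty(-1.0, true, 1.1)) // pushed lower
}
```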
func WithSeed ¶
func WithSeed(seed int) GenerateOption
WithSeed sets the random seed for reproducible generation.
Using the same seed with identical settings produces deterministic output. Use -1 for random seed (different output each time). Useful for testing, debugging, or when reproducibility is required.
Default: -1 (random)
Example:
// Reproducible generation
text, err := model.Generate("Write a story",
llama.WithSeed(42),
llama.WithTemperature(0.8),
)
func WithStopWords ¶
func WithStopWords(words ...string) GenerateOption
WithStopWords specifies sequences that terminate generation when encountered.
Generation stops immediately when any stop word is produced. Useful for controlling response format (e.g. stopping at newlines) or implementing chat patterns. The stop words themselves are not included in the output.
Default: none
Examples:
// Stop at double newline
text, err := model.Generate("Q: What is AI?",
llama.WithStopWords("\n\n"),
)
// Multiple stop sequences
text, err := model.Generate("User:",
llama.WithStopWords("User:", "Assistant:", "\n\n"),
)
func WithTemperature ¶
func WithTemperature(t float32) GenerateOption
WithTemperature controls randomness in token selection.
Higher values (e.g. 1.2) increase creativity and diversity but may reduce coherence. Lower values (e.g. 0.3) make output more deterministic and focused. Use 0.0 for fully deterministic greedy sampling (always pick the most likely token).
Default: 0.8
Examples:
// Creative writing
text, err := model.Generate("Write a poem",
llama.WithTemperature(1.1),
)
// Precise factual responses
text, err := model.Generate("What is 2+2?",
llama.WithTemperature(0.1),
)
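Temperature is applied as a divisor on the logits before the softmax, which this self-contained sketch makes concrete (illustrative values, not the library's internals):

```go
package main

import (
	"fmt"
	"math"
)

// softmaxT converts logits to probabilities at temperature t.
// Lower temperatures sharpen the distribution; higher ones flatten it.
func softmaxT(logits []float64, t float64) []float64 {
	probs := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		probs[i] = math.Exp(l / t)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	logits := []float64{2.0, 1.0, 0.0}
	fmt.Printf("t=0.3: %.2f\n", softmaxT(logits, 0.3)) // near-greedy
	fmt.Printf("t=1.2: %.2f\n", softmaxT(logits, 1.2)) // more diverse
}
```

At t approaching 0 the top token's probability approaches 1, which is why temperature 0.0 is documented as greedy sampling.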
func WithTopK ¶
func WithTopK(k int) GenerateOption
WithTopK limits token selection to the k most likely candidates.
Top-k sampling considers only the k highest probability tokens at each step. Lower values increase focus and determinism, higher values increase diversity. Use 0 to disable (consider all tokens).
Default: 40
Example:
// Very focused generation
text, err := model.Generate("Complete this",
llama.WithTopK(10),
)
func WithTopNSigma ¶
func WithTopNSigma(sigma float32) GenerateOption
WithTopNSigma enables top-n-sigma statistical filtering.
Filters tokens beyond n standard deviations from the mean log probability. Use -1.0 to disable. This statistical approach removes unlikely outliers whilst preserving the natural probability distribution shape.
Default: -1.0 (disabled)
Example:
// Filter statistical outliers
text, err := model.Generate("Generate text",
llama.WithTopNSigma(2.0),
)
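The filtering rule can be sketched as "keep tokens whose logit lies within n standard deviations of the maximum logit". This is an illustration of the statistical idea, assuming that formulation; the library's exact implementation may differ in details:

```go
package main

import (
	"fmt"
	"math"
)

// topNSigma keeps token indices whose logit is within n standard
// deviations of the maximum logit, removing statistical outliers.
func topNSigma(logits []float64, n float64) []int {
	maxL, mean := math.Inf(-1), 0.0
	for _, l := range logits {
		if l > maxL {
			maxL = l
		}
		mean += l
	}
	mean /= float64(len(logits))
	variance := 0.0
	for _, l := range logits {
		variance += (l - mean) * (l - mean)
	}
	std := math.Sqrt(variance / float64(len(logits)))
	var kept []int
	for i, l := range logits {
		if l >= maxL-n*std {
			kept = append(kept, i)
		}
	}
	return kept
}

func main() {
	// The extreme outlier at -8.0 falls outside two sigma and is dropped.
	fmt.Println(topNSigma([]float64{4.0, 3.5, 3.0, -2.0, -8.0}, 2.0))
}
```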
func WithTopP ¶
func WithTopP(p float32) GenerateOption
WithTopP enables nucleus sampling with the specified cumulative probability.
Top-p sampling (nucleus sampling) considers only the smallest set of tokens whose cumulative probability exceeds p. This balances diversity and quality better than top-k for many tasks. Use 1.0 to disable (consider all tokens).
Default: 0.95
Example:
// More focused sampling
text, err := model.Generate("Complete this",
llama.WithTopP(0.85),
)
func WithTypicalP ¶
func WithTypicalP(p float32) GenerateOption
WithTypicalP enables locally typical sampling.
Typical-p sampling (typ-p) filters tokens based on information content, keeping those with typical entropy. Use 1.0 to disable. This helps avoid both highly predictable and highly surprising tokens, producing more "typical" text that feels natural.
Default: 1.0 (disabled)
Example:
// Enable typical sampling
text, err := model.Generate("Write naturally",
llama.WithTypicalP(0.95),
)
func WithXTC ¶
func WithXTC(probability, threshold float32) GenerateOption
WithXTC enables experimental XTC sampling for diversity.
XTC probabilistically excludes the most likely token to encourage diversity. The probability parameter controls how often exclusion occurs (0.0 = disabled, 0.1 = 10% of the time). The threshold parameter limits when XTC applies (> 0.5 effectively disables). This is an experimental technique for reducing predictability.
Default: probability 0.0 (disabled), threshold 0.1
Example:
// Enable XTC for more surprising outputs
text, err := model.Generate("Write creatively",
llama.WithXTC(0.1, 0.1),
)
type Model ¶
type Model struct {
ProgressCallbackID uintptr // Internal ID for progress callback cleanup (for testing)
// contains filtered or unexported fields
}
Model represents loaded model weights.
Model instances are thread-safe and can be used to create multiple execution contexts with different configurations. The model owns the weights in memory but doesn't perform inference directly - use NewContext() to create execution contexts.
Resources are automatically freed via finaliser, but explicit Close() is recommended for deterministic cleanup:
model, _ := llama.LoadModel("model.gguf")
defer model.Close()
Note: Calling methods after Close() returns an error.
func LoadModel ¶
func LoadModel(path string, opts ...ModelOption) (*Model, error)
LoadModel loads a GGUF model from the specified path.
The path must point to a valid GGUF format model file. Legacy GGML formats are not supported. The function applies the provided options using the functional options pattern, with sensible defaults if none are specified.
Resources are managed automatically via finaliser, but explicit cleanup with Close() is recommended for deterministic resource management:
model, err := llama.LoadModel("model.gguf")
if err != nil {
return err
}
defer model.Close()
Returns an error if the file doesn't exist, is not a valid GGUF model, or if model loading fails.
Examples:
// Load with defaults
model, err := llama.LoadModel("model.gguf")
// Load with custom GPU configuration
model, err := llama.LoadModel("model.gguf",
llama.WithGPULayers(35),
)
func (*Model) ChatTemplate ¶
func (m *Model) ChatTemplate() string
ChatTemplate returns the chat template from the model's GGUF metadata.
Returns an empty string if the model has no embedded chat template. Most modern instruction-tuned models include a template in their GGUF metadata that specifies how to format messages for that specific model.
Example:
template := model.ChatTemplate()
if template == "" {
// Model has no template - user must provide one
}
func (*Model) Close ¶
func (m *Model) Close() error
Close frees the model and its associated resources.
This method is idempotent - multiple calls are safe and subsequent calls return immediately without error.
After Close() is called, all other methods return an error. The method uses a write lock to prevent concurrent operations during cleanup.
Example:
model, _ := llama.LoadModel("model.gguf")
defer model.Close()
func (*Model) FormatChatPrompt ¶
func (m *Model) FormatChatPrompt(messages []ChatMessage, opts ChatOptions) (string, error)
FormatChatPrompt formats chat messages using the model's chat template.
This method applies the chat template to the provided messages and returns the resulting prompt string without performing generation. Useful for:
- Debugging what will be sent to the model
- Pre-computing prompts for caching
- Understanding how the template formats conversations
The template priority is: opts.ChatTemplate > model's GGUF template > error.
See also: Context.Chat for performing chat completion with generation.
Example:
messages := []llama.ChatMessage{
{Role: "system", Content: "You are helpful."},
{Role: "user", Content: "Hello"},
}
prompt, err := model.FormatChatPrompt(messages, llama.ChatOptions{})
fmt.Println("Formatted prompt:", prompt)
func (*Model) NewContext ¶
func (m *Model) NewContext(opts ...ContextOption) (*Context, error)
NewContext creates a new execution context from this model.
This method creates an execution context with the specified configuration. Multiple contexts can be created from the same model to handle different use cases (e.g., small context for tokenization, large context for generation).
Each context maintains its own KV cache and state. For concurrent inference, create multiple contexts from the same model - this is VRAM efficient since contexts share the model weights (e.g., 7GB model + 100MB per context).
Thread safety: Model is thread-safe, but each Context is not. Use one context per goroutine for concurrent inference.
See also: Context.Generate, Context.Chat for inference operations.
Example:
// Load model once
model, _ := llama.LoadModel("model.gguf", llama.WithGPULayers(-1))
defer model.Close()
// Create context for tokenization
tokCtx, _ := model.NewContext(
llama.WithContext(512),
llama.WithKVCacheType("f16"),
)
defer tokCtx.Close()
// Create context for generation
genCtx, _ := model.NewContext(
llama.WithContext(8192),
llama.WithKVCacheType("q8_0"),
)
defer genCtx.Close()
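The one-context-per-goroutine rule pairs naturally with a channel-based checkout pool. This sketch uses a placeholder ctx type and generate method rather than the real *Context, purely to show the pattern:

```go
package main

import (
	"fmt"
	"sync"
)

// ctx stands in for *llama.Context; each instance must only be used by
// one goroutine at a time.
type ctx struct{ id int }

func (c *ctx) generate(prompt string) string {
	return fmt.Sprintf("ctx%d: %s", c.id, prompt)
}

func main() {
	// A buffered channel works as a simple checkout/checkin pool:
	// receiving checks a context out, sending returns it.
	pool := make(chan *ctx, 2)
	pool <- &ctx{id: 0}
	pool <- &ctx{id: 1}

	var wg sync.WaitGroup
	results := make(chan string, 4)
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			c := <-pool                  // check out a context
			defer func() { pool <- c }() // return it when done
			results <- c.generate(fmt.Sprintf("prompt %d", n))
		}(i)
	}
	wg.Wait()
	close(results)
	for r := range results {
		fmt.Println(r)
	}
}
```

With two pooled contexts, at most two generations run concurrently and the other goroutines block on checkout, bounding VRAM use.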
func (*Model) Stats ¶
func (m *Model) Stats() (*ModelStats, error)
Stats returns comprehensive statistics about the model and runtime environment.
This includes:
- GPU device information (name, VRAM)
- Model metadata from GGUF (architecture, name, size, etc.)
- Runtime configuration (context size, batch size, KV cache)
The returned information is useful for:
- Displaying model details to users
- Debugging configuration issues
- Monitoring resource usage
Example:
stats, err := model.Stats()
if err != nil {
log.Fatal(err)
}
fmt.Println(stats)
type ModelMetadata ¶
type ModelMetadata struct {
Architecture string // Model architecture (e.g., "qwen3", "llama")
Name string // Full model name
Basename string // Base model name
QuantizedBy string // Who quantized the model
SizeLabel string // Model size (e.g., "8B", "70B")
RepoURL string // Hugging Face repo URL
}
ModelMetadata contains model information from GGUF metadata.
type ModelOption ¶
type ModelOption func(*modelConfig)
ModelOption configures model loading behaviour (model-level settings).
func WithGPULayers ¶
func WithGPULayers(n int) ModelOption
WithGPULayers sets the number of model layers to offload to GPU.
By default, all layers are offloaded to GPU (-1). If GPU acceleration is unavailable, the library automatically falls back to CPU execution. Set to 0 to force CPU-only execution, or specify a positive number to partially offload layers (useful for models larger than GPU memory).
Default: -1 (offload all layers, with CPU fallback)
Examples:
// Force CPU execution
model, err := llama.LoadModel("model.gguf",
llama.WithGPULayers(0),
)
// Offload 35 layers to GPU, rest on CPU
model, err := llama.LoadModel("model.gguf",
llama.WithGPULayers(35),
)
func WithMLock ¶
func WithMLock() ModelOption
WithMLock forces the model to stay in RAM using mlock().
When enabled, prevents the operating system from swapping model data to disk. Useful for production environments where consistent latency is critical, but requires sufficient physical RAM and may require elevated privileges.
Default: false (allows OS to manage memory)
Example:
model, err := llama.LoadModel("model.gguf",
llama.WithMLock(),
)
func WithMMap ¶
func WithMMap(enabled bool) ModelOption
WithMMap enables or disables memory-mapped file I/O for model loading.
Memory mapping (mmap) allows the OS to load model data on-demand rather than reading the entire file upfront. This significantly reduces startup time and memory usage. Disable only if you encounter platform-specific issues.
Default: true (enabled)
Example:
// Disable mmap for compatibility
model, err := llama.LoadModel("model.gguf",
llama.WithMMap(false),
)
func WithMainGPU ¶
func WithMainGPU(gpu string) ModelOption
WithMainGPU sets the primary GPU device for model execution.
Use this option to select a specific GPU in multi-GPU systems. The device string format depends on the backend (e.g. "0" for CUDA device 0). Most users with single-GPU systems don't need this option.
Default: "" (uses default GPU)
Example:
// Use second GPU
model, err := llama.LoadModel("model.gguf",
llama.WithMainGPU("1"),
)
func WithProgressCallback ¶
func WithProgressCallback(cb ProgressCallback) ModelOption
WithProgressCallback sets a custom progress callback for model loading.
The callback is invoked periodically during model loading with progress values from 0.0 (start) to 1.0 (complete). This allows implementing custom progress indicators, logging, or loading cancellation.
The callback receives:
- progress: float32 from 0.0 to 1.0 indicating loading progress
The callback must return:
- true: continue loading
- false: cancel loading (LoadModel will return an error)
IMPORTANT: The callback is invoked from a C thread during model loading. Ensure any operations are thread-safe. The callback should complete quickly to avoid blocking the loading process.
Default: nil (uses llama.cpp default dot printing)
Examples:
// Simple progress indicator
model, err := llama.LoadModel("model.gguf",
llama.WithProgressCallback(func(progress float32) bool {
fmt.Printf("\rLoading: %.0f%%", progress*100)
return true
}),
)
// Cancel loading after 50%
model, err := llama.LoadModel("model.gguf",
llama.WithProgressCallback(func(progress float32) bool {
if progress > 0.5 {
return false // Cancel
}
return true
}),
)
func WithSilentLoading ¶
func WithSilentLoading() ModelOption
WithSilentLoading disables progress output during model loading.
By default, llama.cpp prints dots to stderr to indicate loading progress. This option suppresses that output completely, useful for clean logs in production environments or when progress output interferes with other output formatting.
Note: The LLAMA_LOG environment variable controls general logging but does not suppress progress dots. Use this option for truly silent loading.
Default: false (shows progress dots)
Example:
model, err := llama.LoadModel("model.gguf",
llama.WithSilentLoading(),
)
func WithTensorSplit ¶
func WithTensorSplit(split string) ModelOption
WithTensorSplit configures tensor distribution across multiple GPUs.
Allows manual control of how model layers are distributed across GPUs in multi-GPU setups. The split string format is backend-specific (e.g. "0.7,0.3" for CUDA to use 70% on GPU 0, 30% on GPU 1). Most users should rely on automatic distribution instead.
Default: "" (automatic distribution)
Example:
// Distribute 60/40 across two GPUs
model, err := llama.LoadModel("model.gguf",
llama.WithTensorSplit("0.6,0.4"),
)
type ModelStats ¶
type ModelStats struct {
GPUs []GPUInfo // Information about available CUDA GPUs
Metadata ModelMetadata // Model metadata from GGUF file
Runtime RuntimeInfo // Runtime configuration and resource usage
}
ModelStats contains comprehensive model statistics and metadata.
This includes GPU information, model metadata from GGUF, and runtime configuration. Use Model.Stats() to retrieve these statistics.
func (*ModelStats) String ¶
func (s *ModelStats) String() string
String returns a formatted summary of model statistics.
The output includes GPU information, model details, and runtime configuration in a human-readable format suitable for display.
Example output:
=== Model Statistics ===
GPU Devices:
GPU 0: NVIDIA GeForce RTX 3090
VRAM: 23733 MB free / 24576 MB total
Model Details:
Name: DeepSeek-R1-0528-Qwen3-8B
Architecture: qwen3 (8B)
Quantized by: Unsloth
Repository: https://huggingface.co/unsloth
Runtime Configuration:
Context: 131,072 tokens | Batch: 512 tokens
KV Cache: q8_0 (9,216 MB)
GPU Layers: 28/28
type ProgressCallback ¶
type ProgressCallback func(progress float32) bool
ProgressCallback is called during model loading with progress 0.0-1.0. Return false to cancel loading, true to continue.
type ReasoningFormat ¶
type ReasoningFormat int
ReasoningFormat specifies how reasoning content is handled for models that emit thinking/reasoning tokens (like DeepSeek-R1).
Reasoning models typically emit content within special tags like <think>...</think>. These formats control whether that content is extracted into separate ReasoningContent fields or left inline.
const (
	// ReasoningFormatNone leaves reasoning content inline with regular content.
	// All tokens appear in Content/delta.Content fields.
	ReasoningFormatNone ReasoningFormat = iota

	// ReasoningFormatAuto extracts reasoning to ReasoningContent field.
	// Tokens inside reasoning tags go to ReasoningContent, others to Content.
	// This is the recommended format for reasoning models.
	ReasoningFormatAuto

	// ReasoningFormatDeepSeekLegacy extracts in non-streaming mode only.
	// For streaming: reasoning stays inline. For Chat(): extracted.
	// This matches DeepSeek's original API behaviour.
	ReasoningFormatDeepSeekLegacy

	// ReasoningFormatDeepSeek extracts reasoning in all modes.
	// Always separates reasoning content from regular content.
	ReasoningFormatDeepSeek
)
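What extraction does for a DeepSeek-R1-style response can be pictured with a small tag-splitting sketch. The real extraction happens inside the bindings and the tag names vary by model; <think> is used here as the illustrative case:

```go
package main

import (
	"fmt"
	"strings"
)

// splitReasoning separates <think>...</think> content from the rest,
// mimicking what reasoning extraction does for the Auto format.
func splitReasoning(s string) (reasoning, content string) {
	start := strings.Index(s, "<think>")
	end := strings.Index(s, "</think>")
	if start == -1 || end == -1 || end < start {
		return "", s // no complete reasoning block: leave content inline
	}
	reasoning = s[start+len("<think>") : end]
	content = s[:start] + s[end+len("</think>"):]
	return strings.TrimSpace(reasoning), strings.TrimSpace(content)
}

func main() {
	out := "<think>2+2 is basic arithmetic.</think>The answer is 4."
	r, c := splitReasoning(out)
	fmt.Println("reasoning:", r)
	fmt.Println("content:", c)
}
```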
func (ReasoningFormat) String ¶
func (r ReasoningFormat) String() string
String returns the string representation of a ReasoningFormat.
type RuntimeInfo ¶
type RuntimeInfo struct {
ContextSize int // Context window size in tokens
BatchSize int // Batch processing size
KVCacheType string // KV cache quantization type ("f16", "q8_0", "q4_0")
KVCacheSizeMB int // Estimated KV cache memory usage in MB
GPULayersLoaded int // Number of layers offloaded to GPU
TotalLayers int // Total number of layers in model
}
RuntimeInfo contains current runtime configuration and resource usage.
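The KVCacheSizeMB figure follows the usual back-of-envelope estimate: two tensors (K and V) per layer, one vector of kv-heads times head-dimension elements per cached token. The sketch below uses hypothetical model dimensions; actual usage depends on architecture details and cache quantisation:

```go
package main

import "fmt"

// kvCacheMB estimates KV cache size in MB for given model dimensions.
// All parameters here are hypothetical examples, not read from a model.
func kvCacheMB(layers, ctx, nKVHeads, headDim int, bytesPerElem float64) float64 {
	// 2 tensors (K and V) * layers * context length * elements per token.
	elems := 2.0 * float64(layers) * float64(ctx) * float64(nKVHeads*headDim)
	return elems * bytesPerElem / (1024 * 1024)
}

func main() {
	// Hypothetical 28-layer model, 8192-token context, f16 cache (2 bytes).
	fmt.Printf("%.0f MB\n", kvCacheMB(28, 8192, 8, 128, 2))
}
```

Quantised cache types shrink bytesPerElem (roughly 1 for q8_0), which is why WithKVCacheType("q8_0") halves this estimate relative to f16.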
type Tool ¶
type Tool struct {
Type string `json:"type"` // "function"
Function ToolFunction `json:"function"` // Function definition
}
Tool represents a tool/function that can be called by the model.
Note: Tool calling is not yet implemented in the Go API, but these types are defined for future compatibility with models that support function calling (like GPT-4, Claude, etc.).
When implemented, tools will be passed via ChatOptions and the model may return ToolCall objects in ChatResponse/ChatDelta.
Example (future usage):
tool := llama.Tool{
Type: "function",
Function: llama.ToolFunction{
Name: "get_weather",
Description: "Get the current weather in a location",
Parameters: map[string]interface{}{
"type": "object",
"properties": map[string]interface{}{
"location": map[string]interface{}{
"type": "string",
"description": "City name",
},
},
"required": []string{"location"},
},
},
}
type ToolCall ¶
type ToolCall struct {
ID string `json:"id"` // Unique identifier for this call
Type string `json:"type"` // "function"
Function struct {
Name string `json:"name"` // Function name being called
Arguments string `json:"arguments"` // JSON string of arguments
} `json:"function"`
}
ToolCall represents a function call made by the model.
When a model decides to call a function, it returns a ToolCall with the function name and arguments (as a JSON string). The application should execute the function and return the result in a subsequent message with role "tool".
Example (future usage):
// Model returns tool call
if len(response.ToolCalls) > 0 {
call := response.ToolCalls[0]
result := executeFunction(call.Function.Name, call.Function.Arguments)
// Send result back to model
messages = append(messages, llama.ChatMessage{
Role: "tool",
Content: result,
ToolCallID: call.ID,
})
}
type ToolFunction ¶
type ToolFunction struct {
Name string `json:"name"` // Function name (must be valid identifier)
Description string `json:"description"` // Human-readable description
Parameters map[string]interface{} `json:"parameters"` // JSON Schema for parameters
}
ToolFunction defines a function that can be called by the model.
The Parameters field should contain a JSON Schema describing the function's parameters. This follows the OpenAI function calling format.
Source Files
¶
Directories
¶
| Path | Synopsis |
|---|---|
| examples | |
| examples/chat (command) | Chat example demonstrates non-streaming chat completion. |
| examples/chat-streaming (command) | Chat streaming example demonstrates real-time chat completion with visual distinction between regular content and reasoning output. |
| examples/embedding (command) | Embedding example demonstrates generating text embeddings for semantic tasks. |
| examples/simple (command) | Simple example demonstrates basic text generation using llama-go. |
| examples/speculative (command) | Speculative example demonstrates speculative decoding for faster generation. |
| examples/streaming (command) | Streaming example demonstrates both callback and channel-based token streaming. |
| internal | |
| internal/exampleui | Package exampleui provides terminal UI utilities for llama-go examples. |