llama

package module
v0.1.8
Published: Jan 27, 2026 License: MIT Imports: 8 Imported by: 0

README

gollama: Run LLMs locally with Go


Go bindings for llama.cpp, enabling you to run large language models locally with GPU acceleration. Production-ready library with thread-safe concurrent inference and comprehensive test coverage. Integrate LLM inference directly into Go applications with a clean, idiomatic API.

This is an active fork of go-skynet/go-llama.cpp, which hasn't been maintained since October 2023. The goal is to keep Go developers up-to-date with llama.cpp whilst offering a lighter, more performant alternative to Python-based ML stacks like PyTorch or vLLM.


Quick start

# Clone
git clone https://github.com/godeps/gollama
cd gollama

# Download a test model
wget https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

# Run an example (Linux). On macOS use DYLD_LIBRARY_PATH instead.
export LD_LIBRARY_PATH=$PWD/prebuilt/$(go env GOOS)_$(go env GOARCH)
go run ./examples/simple -m Qwen3-0.6B-Q8_0.gguf -p "Hello world" -n 50

Prebuilt libraries

CPU-only prebuilt libraries are stored under prebuilt/<os>_<arch> for:

  • linux/amd64
  • linux/arm64
  • darwin/amd64
  • darwin/arm64
  • windows/amd64

The Go build links against these directories by default. If you move the binaries:

  • Linux: set LD_LIBRARY_PATH or keep the libs next to the built binary.
  • macOS: keep the .dylib files next to the binary (rpath uses @loader_path).
  • Windows: copy the .dll files next to your .exe or add the folder to PATH.

To (re)build and stage the current platform's artifacts into prebuilt/, run:

./scripts/build-prebuilt.sh

Basic usage

package main

import (
    "context"
    "fmt"
    llama "github.com/godeps/gollama"
)

func main() {
    // Load model weights (ModelOption: WithGPULayers, WithMLock, etc.)
    model, err := llama.LoadModel(
        "/path/to/model.gguf",
        llama.WithGPULayers(-1), // Offload all layers to GPU
    )
    if err != nil {
        panic(err)
    }
    defer model.Close()

    // Create execution context (ContextOption: WithContext, WithBatch, etc.)
    ctx, err := model.NewContext(
        llama.WithContext(2048),
        llama.WithF16Memory(),
    )
    if err != nil {
        panic(err)
    }
    defer ctx.Close()

    // Chat completion (uses model's chat template)
    messages := []llama.ChatMessage{
        {Role: "system", Content: "You are a helpful assistant."},
        {Role: "user", Content: "What is the capital of France?"},
    }
    response, err := ctx.Chat(context.Background(), messages, llama.ChatOptions{
        MaxTokens: llama.Int(100),
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(response.Content)

    // Or raw text generation
    text, err := ctx.Generate("Hello world", llama.WithMaxTokens(50))
    if err != nil {
        panic(err)
    }
    fmt.Println(text)
}

When building, set these environment variables:

export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD

Key capabilities

Text generation and chat: Generate text with LLMs using native chat completion (with automatic chat template formatting) or raw text generation. Extract embeddings for semantic search, clustering, and similarity tasks.

GPU acceleration: Supports NVIDIA (CUDA), AMD (ROCm), Apple Silicon (Metal), Intel (SYCL), and cross-platform acceleration (Vulkan, OpenCL). Eight backend options cover virtually all modern GPU hardware, plus distributed inference via RPC.

Production ready: Comprehensive test suite with almost 400 test cases and CI validation including CUDA builds. Active development tracks llama.cpp releases; the library is maintained for production use, not as a demo project.

Advanced features: Model/Context separation enables efficient VRAM usage: load model weights once, create multiple contexts with different configurations. Cache common prompt prefixes to avoid recomputing system prompts across thousands of generations. Serve multiple concurrent requests with a single model loaded in VRAM (no weight duplication). Stream tokens via callbacks or buffered channels (decouples GPU inference from slow processing). Speculative decoding for 2-3× generation speedup.

Architecture

The library bridges Go and C++ using CGO, keeping the heavy computation in llama.cpp's optimised C++ code whilst providing a clean Go API. This minimises CGO overhead whilst maximising performance.

Model/Context separation: The API separates model weights (Model) from execution state (Context). Load model weights once, create multiple contexts with different configurations. Each context maintains its own KV cache and state for independent inference operations.

Key components:

  • wrapper.cpp/wrapper.h - CGO interface to llama.cpp
  • model.go - Model loading and weight management (thread-safe)
  • context.go - Execution contexts for inference (one per goroutine)
  • Clean Go API with comprehensive godoc comments
  • llama.cpp/ - Vendored upstream (git subtree) tracking llama.cpp releases

The design uses functional options for configuration (ModelOption vs ContextOption), explicit context creation for thread safety, automatic KV cache prefix reuse for performance, resource management with finalizers, and streaming callbacks via cgo.Handle for safe Go-C interaction.
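The functional-options pattern mentioned above can be sketched in isolation. This is a minimal illustration of the technique, not the library's actual internals; the config struct and its field names here are hypothetical, since the real ones are unexported:

```go
package main

import "fmt"

// modelConfig stands in for the kind of unexported config struct that
// options mutate (illustrative fields, not the library's actual ones).
type modelConfig struct {
	gpuLayers int
	mlock     bool
}

// ModelOption is a function that adjusts the config before loading.
type ModelOption func(*modelConfig)

func WithGPULayers(n int) ModelOption {
	return func(c *modelConfig) { c.gpuLayers = n }
}

func WithMLock() ModelOption {
	return func(c *modelConfig) { c.mlock = true }
}

// loadModel applies options over defaults, the way LoadModel would.
func loadModel(opts ...ModelOption) modelConfig {
	cfg := modelConfig{gpuLayers: -1} // default: offload all layers
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := loadModel(WithGPULayers(20), WithMLock())
	fmt.Println(cfg.gpuLayers, cfg.mlock) // 20 true
}
```

The pattern keeps the exported API small (one variadic constructor) while letting new settings be added without breaking callers.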

Licence

MIT

Documentation

Overview

Package llama provides Go bindings for llama.cpp, enabling efficient LLM inference with GPU acceleration and advanced features like prefix caching and speculative decoding.

This package wraps llama.cpp's C++ API whilst maintaining Go idioms and safety. Heavy computation stays in optimised C++ code, whilst the Go API provides clean concurrency primitives and resource management.

Quick Start

Load a GGUF model and generate text:

model, err := llama.LoadModel("model.gguf")
if err != nil {
    log.Fatal(err)
}
defer model.Close()

result, err := model.Generate("Once upon a time")
if err != nil {
    log.Fatal(err)
}
fmt.Println(result)

GPU Acceleration

GPU offloading is enabled by default, automatically using CUDA, ROCm, or Metal depending on your build configuration. The library falls back to CPU if GPU resources are unavailable:

// Uses GPU by default (all layers offloaded)
model, err := llama.LoadModel("model.gguf")

// Limit GPU usage (useful for large models)
model, err := llama.LoadModel("model.gguf",
    llama.WithGPULayers(20),
)

// Force CPU-only inference
model, err := llama.LoadModel("model.gguf",
    llama.WithGPULayers(0),
)

Context Management

The library automatically uses each model's native maximum context length from GGUF metadata, giving you full model capabilities without artificial limits:

// Uses model's native context (e.g. 40960 for Qwen3, 128000 for Gemma 3)
model, err := llama.LoadModel("model.gguf")

// Override for memory savings
model, err := llama.LoadModel("model.gguf",
    llama.WithContext(8192),
)

Concurrent Inference

Models are thread-safe and support concurrent generation requests through an internal context pool:

var wg sync.WaitGroup
for i := 0; i < 10; i++ {
    wg.Add(1)
    go func(prompt string) {
        defer wg.Done()
        result, _ := model.Generate(prompt)
        fmt.Println(result)
    }(fmt.Sprintf("Question %d:", i))
}
wg.Wait()

The pool automatically scales between minimum and maximum contexts based on demand, reusing contexts when possible and cleaning up idle ones.
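The acquire/release behaviour of such a pool can be sketched with a fixed-size, channel-backed pool and a stub Context type. This is a simplified model of the pattern only; the library's actual pool also scales between minimum and maximum sizes and evicts idle contexts:

```go
package main

import "fmt"

// Context stands in for llama's execution context in this sketch.
type Context struct{ id int }

// pool hands out contexts up to a fixed maximum; callers block when
// every context is in use, and return contexts when done.
type pool struct{ free chan *Context }

func newPool(max int) *pool {
	p := &pool{free: make(chan *Context, max)}
	for i := 0; i < max; i++ {
		p.free <- &Context{id: i}
	}
	return p
}

func (p *pool) acquire() *Context  { return <-p.free }
func (p *pool) release(c *Context) { p.free <- c }

func main() {
	p := newPool(2)
	a := p.acquire()
	b := p.acquire() // pool now empty; a third acquire would block
	p.release(a)
	c := p.acquire() // reuses the context released above
	fmt.Println(a.id, b.id, c == a) // 0 1 true
}
```

Because acquisition blocks on a channel, backpressure on concurrent callers falls out of the design for free.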

Streaming Generation

Stream tokens as they're generated using a callback:

err := model.GenerateStream("Tell me a story",
    func(token string) bool {
        fmt.Print(token)
        return true  // Continue generation
    },
)

Return false from the callback to stop generation early.
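The stop-early contract can be demonstrated without a model by driving the callback from a simulated token source (streamTokens below is a stand-in for GenerateStream, not the library's API):

```go
package main

import (
	"fmt"
	"strings"
)

// streamTokens drives a callback the way a streaming generator does:
// generation stops as soon as the callback returns false.
func streamTokens(tokens []string, cb func(token string) bool) {
	for _, t := range tokens {
		if !cb(t) {
			return
		}
	}
}

func main() {
	var b strings.Builder
	count := 0
	streamTokens([]string{"Once", " upon", " a", " time", " there"},
		func(token string) bool {
			b.WriteString(token)
			count++
			return count < 3 // stop after three tokens
		})
	fmt.Println(b.String()) // Once upon a
}
```

Note the token that triggered the stop is still delivered to the callback; returning false only prevents further tokens.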

Prefix Caching

The library automatically reuses KV cache entries for matching prompt prefixes, significantly improving performance for conversation-style usage:

// First call processes full prompt
model.Generate("You are a helpful assistant.\n\nUser: Hello")

// Second call reuses cached system prompt
model.Generate("You are a helpful assistant.\n\nUser: How are you?")

Prefix caching is enabled by default and includes a last-token refresh optimisation to maintain deterministic generation with minimal overhead (~0.1-0.5ms per call).

Speculative Decoding

Accelerate generation using a smaller draft model:

target, _ := llama.LoadModel("large-model.gguf")
draft, _ := llama.LoadModel("small-model.gguf")
defer target.Close()
defer draft.Close()

result, err := target.GenerateWithDraft(
    "Once upon a time",
    draft,
    llama.WithDraftTokens(5),
)

The draft model generates candidate tokens that the target model verifies in parallel, reducing overall latency whilst maintaining quality.
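The verify step can be illustrated with toy deterministic "models": accept the longest prefix of draft tokens that agrees with the target, then take the target's own next token. This is a pure simulation of the idea, not the library's implementation:

```go
package main

import "fmt"

// verifyDraft compares draft-proposed tokens against what the target
// model would emit, keeping the longest agreeing prefix plus the
// target's first disagreeing token.
func verifyDraft(draft, target []string) []string {
	accepted := []string{}
	for i, tok := range draft {
		if i < len(target) && target[i] == tok {
			accepted = append(accepted, tok)
		} else {
			break
		}
	}
	// The target always contributes the token after the accepted prefix,
	// so each round emits at least one verified token.
	if len(accepted) < len(target) {
		accepted = append(accepted, target[len(accepted)])
	}
	return accepted
}

func main() {
	draft := []string{"the", "cat", "sat", "down"} // cheap proposals
	target := []string{"the", "cat", "sat", "on"}  // target's verdict
	fmt.Println(verifyDraft(draft, target))        // [the cat sat on]
}
```

When the draft agrees often, several tokens are committed per target-model pass, which is where the speedup comes from; output quality is unchanged because every emitted token is one the target approved.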

Advanced Configuration

Fine-tune generation behaviour with sampling parameters:

result, err := model.Generate("Explain quantum computing",
    llama.WithMaxTokens(500),
    llama.WithTemperature(0.7),
    llama.WithTopP(0.9),
    llama.WithTopK(40),
    llama.WithSeed(42),
    llama.WithStopWords("</answer>", "\n\n"),
)

Thread Safety

All public methods are thread-safe. The Model type uses an internal RWMutex to protect shared state and coordinates access to the context pool. Multiple goroutines can safely call Generate() concurrently.

Resource Cleanup

Always call Close() when finished with a model to free GPU memory and other resources:

model, err := llama.LoadModel("model.gguf")
if err != nil {
    return err
}
defer model.Close()

Close() is safe to call multiple times and will block until all active generation requests complete.

Build Requirements

This package requires CGO and a C++ compiler. Pre-built llama.cpp libraries are included in the repository for convenience. See the project README for detailed build instructions and GPU acceleration setup.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Bool

func Bool(v bool) *bool

Bool returns a pointer to the given bool value. This is a convenience helper for setting optional ChatOptions fields.

Example:

opts := llama.ChatOptions{
    EnableThinking: llama.Bool(true),  // Instead of &true
}

func Float32

func Float32(v float32) *float32

Float32 returns a pointer to the given float32 value. This is a convenience helper for setting optional ChatOptions fields.

Example:

opts := llama.ChatOptions{
    Temperature: llama.Float32(0.7),  // Instead of &0.7
}

func InitLogging

func InitLogging()

InitLogging (re)initializes the llama.cpp logging system based on the LLAMA_LOG environment variable.


This function is called automatically when the package loads, but can be called again to reconfigure logging after changing the LLAMA_LOG environment variable.

Supported LLAMA_LOG values:

  • "none" - No logging
  • "error" - Only errors
  • "warn" - Warnings and errors (recommended for production)
  • "info" - Informational messages (default)
  • "debug" - Verbose debug output

Example:

os.Setenv("LLAMA_LOG", "warn")  // Quiet mode
llama.InitLogging()             // Apply the change

func Int

func Int(v int) *int

Int returns a pointer to the given int value. This is a convenience helper for setting optional ChatOptions fields.

Example:

opts := llama.ChatOptions{
    MaxTokens: llama.Int(100),  // Instead of &100
}

Types

type ChatDelta

type ChatDelta struct {
	Content          string // Regular content token(s)
	ReasoningContent string // Reasoning token(s)
}

ChatDelta represents a streaming chunk from chat completion.

During streaming, deltas arrive progressively. For standard models, only Content is populated with token(s). For reasoning models with extraction enabled, tokens may appear in either Content or ReasoningContent depending on whether they're inside reasoning tags.

Example:

deltaCh, errCh := model.ChatStream(ctx, messages, opts)
for {
    select {
    case delta, ok := <-deltaCh:
        if !ok {
            return
        }
        if delta.Content != "" {
            fmt.Print(delta.Content)
        }
        if delta.ReasoningContent != "" {
            fmt.Print("[thinking: ", delta.ReasoningContent, "]")
        }
    case err := <-errCh:
        if err != nil {
            log.Fatal(err)
        }
    }
}
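The routing of tokens between Content and ReasoningContent described above can be sketched with a toy splitter. This assumes, purely for illustration, that the reasoning tags arrive as whole tokens; real extraction must also handle tags split across tokens:

```go
package main

import "fmt"

// ChatDelta mirrors the struct documented above.
type ChatDelta struct {
	Content          string
	ReasoningContent string
}

// routeTokens is a toy illustration of reasoning extraction: tokens
// between <think> and </think> go to ReasoningContent, everything
// else to Content.
func routeTokens(tokens []string) []ChatDelta {
	var deltas []ChatDelta
	thinking := false
	for _, t := range tokens {
		switch t {
		case "<think>":
			thinking = true
		case "</think>":
			thinking = false
		default:
			if thinking {
				deltas = append(deltas, ChatDelta{ReasoningContent: t})
			} else {
				deltas = append(deltas, ChatDelta{Content: t})
			}
		}
	}
	return deltas
}

func main() {
	for _, d := range routeTokens([]string{"<think>", "hmm", "</think>", "Paris"}) {
		fmt.Printf("%+v\n", d)
	}
}
```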

type ChatMessage

type ChatMessage struct {
	Role    string // Message role (e.g., "system", "user", "assistant")
	Content string // Message content
}

ChatMessage represents a message in a chat conversation.

Common roles include "system", "user", "assistant", "tool", and "function". The role is not validated by this library - the model's chat template will handle role interpretation and any unknown roles.

Example:

messages := []llama.ChatMessage{
    {Role: "system", Content: "You are a helpful assistant."},
    {Role: "user", Content: "What is the capital of France?"},
}

type ChatOptions

type ChatOptions struct {
	// Base generation options
	MaxTokens   *int     // Maximum tokens to generate (nil = model default)
	Temperature *float32 // Sampling temperature (nil = model default, typically 0.8)
	TopP        *float32 // Nucleus sampling threshold (nil = model default, typically 0.95)
	TopK        *int     // Top-K sampling (nil = model default, typically 40)
	Seed        *int     // Random seed for reproducible generation (nil = random)
	StopWords   []string // Additional stop sequences beyond model defaults

	// Chat template (Jinja2 template string)
	// If empty, uses model's GGUF template. If model has no template, returns error.
	// Supports 40+ formats: chatml, llama2, llama3, mistral, gemma, phi3, etc.
	// See: https://github.com/ggerganov/llama.cpp/blob/master/common/chat.cpp
	ChatTemplate string

	// Chat template variables (arbitrary JSON-compatible key-value pairs)
	// These are passed to the model's Jinja2 chat template for customisation.
	// Common examples: {"add_generation_prompt": true, "tools": [...]}
	ChatTemplateKwargs map[string]interface{}

	// Reasoning model options (for models like DeepSeek-R1)
	EnableThinking  *bool           // Enable/disable thinking output (nil = model default)
	ReasoningBudget *int            // Token limit for reasoning (-1 = unlimited, 0 = disabled)
	ReasoningFormat ReasoningFormat // How to handle reasoning content

	// Streaming configuration
	StreamBufferSize int // Buffer size for streaming channels (default: 256)
}

ChatOptions configures chat completion behaviour.

This extends the base generation options with chat-specific settings like template variables and reasoning parameters. All generation options (temperature, top_p, etc.) can be set here, or left nil to use defaults.

Example:

opts := llama.ChatOptions{
    MaxTokens:   llama.Int(100),
    Temperature: llama.Float32(0.7),
    TopP:        llama.Float32(0.9),
}

type ChatResponse

type ChatResponse struct {
	Content          string // Regular response content
	ReasoningContent string // Extracted reasoning/thinking (if reasoning model)
}

ChatResponse represents the complete response from a chat completion.

For standard models, only Content is populated. For reasoning models (like DeepSeek-R1), ReasoningContent may contain extracted thinking/reasoning tokens that were separated from the main response.

Example:

response, err := model.Chat(ctx, messages, opts)
if err != nil {
    log.Fatal(err)
}
fmt.Println("Response:", response.Content)
if response.ReasoningContent != "" {
    fmt.Println("Reasoning:", response.ReasoningContent)
}

type Context

type Context struct {
	// contains filtered or unexported fields
}

Context represents an execution context for inference operations.

Context instances maintain their own KV cache and state, allowing independent inference operations. Contexts are NOT thread-safe - each context should be used by only one goroutine at a time. For concurrent inference, create multiple contexts from the same model.

Multiple contexts share model weights, making concurrent inference VRAM-efficient (e.g., one 7GB model + 100MB per context vs 7GB per instance).

Resources should be freed with Close() when finished:

ctx, _ := model.NewContext(llama.WithContext(8192))
defer ctx.Close()

See also: Model.NewContext for creating contexts.

func (*Context) Chat

func (c *Context) Chat(ctx gocontext.Context, messages []ChatMessage, opts ChatOptions) (*ChatResponse, error)

Chat performs conversational generation using chat messages.

This method formats messages using a chat template and generates a response. The template can be provided in opts or will be read from the model's GGUF metadata. Supports 40+ template formats including ChatML, Llama-2, Llama-3, Mistral, Gemma, and Phi-3.

See also: ChatStream for streaming responses, Generate for raw prompt completion.

Example:

messages := []llama.ChatMessage{
    {Role: "system", Content: "You are a helpful assistant."},
    {Role: "user", Content: "Hello!"},
}
response, err := ctx.Chat(context.Background(), messages, llama.ChatOptions{})

func (*Context) ChatStream

func (c *Context) ChatStream(ctx gocontext.Context, messages []ChatMessage, opts ChatOptions) (<-chan ChatDelta, <-chan error)

ChatStream performs conversational generation with streaming output.

Returns channels for chat deltas and errors, similar to GenerateChannel. Supports context cancellation for early termination.

See also: Chat for synchronous chat completion.

Example:

deltas, errs := ctx.ChatStream(context.Background(), messages, llama.ChatOptions{})
for delta := range deltas {
    fmt.Print(delta.Content)
}

func (*Context) Close

func (c *Context) Close() error

Close frees the context and its associated resources.

This method is idempotent - multiple calls are safe and subsequent calls return immediately without error.

After Close() is called, all other methods return an error.

Example:

ctx, _ := model.NewContext()
defer ctx.Close()

func (*Context) Generate

func (c *Context) Generate(prompt string, opts ...GenerateOption) (string, error)

Generate generates text from the given prompt.

This method performs synchronous text generation, returning the complete result when finished. The context automatically reuses KV cache entries for matching prompt prefixes (prefix caching), significantly improving performance for conversation-style usage.

Thread safety: Context is NOT thread-safe. Use separate contexts for concurrent generation requests (create multiple contexts from the same Model).

See also: GenerateStream for streaming output, Chat for structured conversations.

Examples:

// Basic generation
result, err := ctx.Generate("Once upon a time")

// With custom parameters
result, err := ctx.Generate("Explain quantum physics",
    llama.WithMaxTokens(512),
    llama.WithTemperature(0.7),
)

func (*Context) GenerateChannel

func (c *Context) GenerateChannel(ctx gocontext.Context, prompt string, opts ...GenerateOption) (<-chan string, <-chan error)

GenerateChannel generates text with streaming output via channel.

Returns two channels: one for tokens and one for errors. The token channel is closed when generation completes. The error channel receives at most one error before closing.

This method supports context cancellation for stopping generation early.

See also: GenerateStream for callback-based streaming, Generate for synchronous generation.

Example:

tokens, errs := ctx.GenerateChannel(context.Background(), "Write a story")
for token := range tokens {
    fmt.Print(token)
}
if err := <-errs; err != nil {
    log.Fatal(err)
}

func (*Context) GenerateStream

func (c *Context) GenerateStream(prompt string, callback func(token string) bool, opts ...GenerateOption) error

GenerateStream generates text with streaming output via callback.

The callback receives each generated token as it's produced. Return true to continue generation, or false to stop early.

See also: Generate for synchronous generation, GenerateChannel for channel-based streaming with context cancellation support.

Examples:

// Stream to stdout
err := ctx.GenerateStream("Tell me a story",
    func(token string) bool {
        fmt.Print(token)
        return true
    },
)

func (*Context) GenerateWithDraft

func (c *Context) GenerateWithDraft(prompt string, draft *Context, opts ...GenerateOption) (string, error)

GenerateWithDraft performs speculative generation using a draft model.

Speculative decoding uses a smaller draft model to generate candidate tokens that the target model verifies in parallel. This reduces latency whilst maintaining the target model's quality.

Best results when draft model is 5-10x smaller than target and models share similar vocabularies. Typical speedup: 1.5-3x.

See also: GenerateWithDraftStream for streaming speculative generation.

Example:

target, _ := llama.LoadModel("large-model.gguf")
draft, _ := llama.LoadModel("small-model.gguf")
targetCtx, _ := target.NewContext()
draftCtx, _ := draft.NewContext()

result, err := targetCtx.GenerateWithDraft("Once upon a time", draftCtx,
    llama.WithDraftTokens(8),
)

func (*Context) GenerateWithDraftChannel

func (c *Context) GenerateWithDraftChannel(ctx gocontext.Context, prompt string, draft *Context, opts ...GenerateOption) (<-chan string, <-chan error)

GenerateWithDraftChannel generates text with streaming via channel using a draft model.

Combines GenerateWithDraft and GenerateChannel.

Example:

tokens, errs := targetCtx.GenerateWithDraftChannel(context.Background(),
    "Write a story", draftCtx, llama.WithDraftTokens(8))
for token := range tokens {
    fmt.Print(token)
}

func (*Context) GenerateWithDraftStream

func (c *Context) GenerateWithDraftStream(prompt string, draft *Context, callback func(token string) bool, opts ...GenerateOption) error

GenerateWithDraftStream performs speculative generation with streaming output.

Combines GenerateWithDraft and GenerateStream.

Example:

targetCtx.GenerateWithDraftStream("Write a story", draftCtx,
    func(token string) bool {
        fmt.Print(token)
        return true
    },
    llama.WithDraftTokens(8),
)

func (*Context) GenerateWithTokens

func (c *Context) GenerateWithTokens(tokens []int32, opts ...GenerateOption) (string, error)

GenerateWithTokens generates text starting from the given tokens.

This is an advanced method for cases where you've already tokenized the prompt or want to use cached tokens. For normal usage, use Generate() instead.

Example:

tokens, _ := ctx.Tokenize("Once upon a time")
result, _ := ctx.GenerateWithTokens(tokens)

func (*Context) GenerateWithTokensStream

func (c *Context) GenerateWithTokensStream(tokens []int32, callback func(token string) bool, opts ...GenerateOption) error

GenerateWithTokensStream generates text with streaming from tokens.

Combines GenerateWithTokens and GenerateStream.

Example:

tokens, _ := ctx.Tokenize("Write a story")
err := ctx.GenerateWithTokensStream(tokens,
    func(token string) bool {
        fmt.Print(token)
        return true
    },
)

func (*Context) GetCachedTokenCount

func (c *Context) GetCachedTokenCount() (int, error)

GetCachedTokenCount returns the number of cached tokens (for debugging/metrics).

This method provides insight into prefix caching behaviour, showing how many tokens from previous prompts are cached.

Example:

ctx.Generate("System prompt: You are helpful.\n\nUser: Hello")
cached, _ := ctx.GetCachedTokenCount()
fmt.Printf("Cached tokens: %d\n", cached)

func (*Context) GetEmbeddings

func (c *Context) GetEmbeddings(text string) ([]float32, error)

GetEmbeddings computes embeddings for the given text.

Embeddings are vector representations useful for semantic search, clustering, or similarity tasks. The context must be created with WithEmbeddings() to use this method.

See also: GetEmbeddingsBatch for efficient batch processing of multiple texts.

Example:

ctx, _ := model.NewContext(llama.WithEmbeddings())
emb1, _ := ctx.GetEmbeddings("Hello world")
emb2, _ := ctx.GetEmbeddings("Hi there")
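Comparing the returned vectors typically uses cosine similarity. A small self-contained helper (independent of the library) for the []float32 values GetEmbeddings returns:

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors:
// the dot product divided by the product of the vector norms.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(cosine([]float32{1, 0}, []float32{1, 0})) // 1
	fmt.Println(cosine([]float32{1, 0}, []float32{0, 1})) // 0
}
```

Values near 1 indicate semantically similar texts; near 0, unrelated ones.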

func (*Context) GetEmbeddingsBatch

func (c *Context) GetEmbeddingsBatch(texts []string) ([][]float32, error)

GetEmbeddingsBatch computes embeddings for multiple texts efficiently.

This method processes multiple texts in a single batch operation, which is significantly more efficient than calling GetEmbeddings repeatedly. Uses parallel sequence processing (configured via WithParallel) to maximise throughput.

The context must be created with WithEmbeddings() to use this method. Batch size is limited by WithParallel setting (default 8 for embedding contexts).

See also: GetEmbeddings for single text processing.

Example:

ctx, _ := model.NewContext(llama.WithEmbeddings())
texts := []string{"First", "Second", "Third"}
embeddings, _ := ctx.GetEmbeddingsBatch(texts)

func (*Context) Tokenize

func (c *Context) Tokenize(text string) ([]int32, error)

Tokenize converts text to tokens.

Tokens are integer IDs representing subword units in the model's vocabulary. This method is useful for advanced use cases like manual prompt construction, token counting, or analysis.

Examples:

// Count tokens in a prompt
tokens, _ := ctx.Tokenize("Hello world")
fmt.Printf("Token count: %d\n", len(tokens))

type ContextOption

type ContextOption func(*contextConfig)

ContextOption configures context creation (context-level settings).

func WithBatch

func WithBatch(size int) ContextOption

WithBatch sets the batch size for prompt processing.

Larger batch sizes improve throughput for long prompts but increase memory usage. The batch size determines how many tokens are processed in parallel during the prompt evaluation phase.

Default: 512

Example:

// Process 1024 tokens at once for faster prompt handling
ctx, err := model.NewContext(llama.WithBatch(1024))

func WithContext

func WithContext(size int) ContextOption

WithContext sets the context window size in tokens.

The context size determines how many tokens (prompt + generation) the context can process. By default, the library uses the model's native maximum context length (e.g. 32768 for Qwen3, 128000 for Gemma 3 models >4B).

Override this if you need to limit memory usage or have specific requirements.

IMPORTANT: Very small context sizes (< 64 tokens) may cause llama.cpp to crash internally. The library provides defensive checks but cannot prevent all edge cases with absurdly small contexts.

Default: 0 (uses model's native maximum from GGUF metadata)

Examples:

// Use model's full capability (default)
ctx, err := model.NewContext()

// Limit to 8K for memory savings
ctx, err := model.NewContext(llama.WithContext(8192))

func WithEmbeddings

func WithEmbeddings() ContextOption

WithEmbeddings enables embedding extraction mode.

When enabled, the context can compute text embeddings via GetEmbeddings(). This mode is required for semantic search, clustering, or similarity tasks. Note that not all models support embeddings - check model documentation.

Default: false (text generation mode)

Example:

ctx, err := model.NewContext(llama.WithEmbeddings())
embeddings, err := ctx.GetEmbeddings("Hello world")

func WithF16Memory

func WithF16Memory() ContextOption

WithF16Memory enables 16-bit floating point memory mode.

When enabled, the context uses FP16 precision for KV cache storage, reducing memory usage at the cost of slight accuracy loss. Most useful when working with very long contexts or memory-constrained environments.

Default: false (uses FP32 for KV cache)

Example:

ctx, err := model.NewContext(llama.WithF16Memory())

func WithFlashAttn

func WithFlashAttn(mode string) ContextOption

WithFlashAttn controls Flash Attention kernel usage for attention computation.

Flash Attention is a GPU-optimized attention implementation that significantly reduces VRAM usage and improves performance, especially for longer contexts. It's required when using quantized KV cache types (q8_0, q4_0).

Available modes:

  • "auto" (default): llama.cpp decides based on hardware and model config
  • "enabled": Force Flash Attention on (fails if hardware doesn't support it)
  • "disabled": Use traditional attention (incompatible with quantized KV cache)

Technical details:

  • Requires CUDA compute capability 7.0+ (Volta/Turing or newer)
  • With GGML_CUDA_FA_ALL_QUANTS: Supports all KV cache quantization types
  • Without flag: Only supports f16, q4_0, and q8_0 (matching K/V types)
  • "auto" mode detects if the backend scheduler supports the Flash Attention ops

Default: "auto" (llama.cpp chooses optimal path)

Examples:

// Use default auto-detection (recommended)
ctx, err := model.NewContext(llama.WithKVCacheType("q8_0"))

// Force Flash Attention on (errors if unsupported)
ctx, err := model.NewContext(llama.WithFlashAttn("enabled"))

// Disable Flash Attention (requires f16 KV cache)
ctx, err := model.NewContext(
    llama.WithKVCacheType("f16"),
    llama.WithFlashAttn("disabled"),
)

func WithKVCacheType

func WithKVCacheType(cacheType string) ContextOption

WithKVCacheType sets the quantization type for KV cache storage.

The KV (key-value) cache stores attention states during generation and grows with context length. Quantizing this cache dramatically reduces VRAM usage with minimal quality impact:

  • "q8_0" (default): 50% VRAM savings, ~0.1% quality loss (imperceptible)
  • "f16": Full precision, no savings, maximum quality
  • "q4_0": 75% VRAM savings, noticeable quality loss (models become forgetful)

Memory scaling example for 131K context (DeepSeek-R1 trained capacity):

  • f16: 18 GB
  • q8_0: 9 GB (recommended)
  • q4_0: 4.5 GB (use only for extreme VRAM constraints)

Default: "q8_0" (best balance of memory and quality)

Examples:

// Use default Q8 quantization (recommended)
ctx, err := model.NewContext()

// Maximum quality for VRAM-rich systems
ctx, err := model.NewContext(llama.WithKVCacheType("f16"))

// Extreme memory savings (accept quality loss)
ctx, err := model.NewContext(llama.WithKVCacheType("q4_0"))

func WithParallel

func WithParallel(n int) ContextOption

WithParallel sets the number of parallel sequences for batch processing.

This option controls how many independent sequences can be processed simultaneously in a single batch. Higher values enable larger batch sizes for operations like GetEmbeddingsBatch() but consume more VRAM.

For embedding contexts, the library defaults to n_parallel=8 if not explicitly set. This option allows tuning this value for your specific VRAM constraints and batch sizes.

VRAM usage scales approximately as:

base_model_size + (n_parallel × context_size × kv_cache_bytes)

For example, a 4B Q8 embedding model with 8192 context and q8_0 cache:

  • n_parallel=8: ~12 GB VRAM
  • n_parallel=4: ~8 GB VRAM
  • n_parallel=2: ~6 GB VRAM
  • n_parallel=1: ~5 GB VRAM (disables batch processing)
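Plugging numbers into the scaling formula reproduces the figures above. The constants here are illustrative, chosen to match this 4B/8192-context example; the real per-token KV cost depends on model dimensions and cache type:

```go
package main

import "fmt"

// estimateVRAMGB applies base + n_parallel * context * kv_bytes with
// illustrative constants (~4 GB base, ~128 KiB KV per token at q8_0).
func estimateVRAMGB(nParallel int) float64 {
	const baseModelGB = 4.0
	const contextSize = 8192
	const kvBytesPerToken = 128 * 1024
	perSlotGB := float64(contextSize*kvBytesPerToken) / (1 << 30)
	return baseModelGB + float64(nParallel)*perSlotGB
}

func main() {
	for _, n := range []int{1, 2, 4, 8} {
		fmt.Printf("n_parallel=%d: ~%.0f GB\n", n, estimateVRAMGB(n))
	}
	// n_parallel=1: ~5 GB ... n_parallel=8: ~12 GB
}
```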

Trade-offs:

  • Lower values: Less VRAM usage, slower batch processing, smaller max batch size
  • Higher values: More VRAM usage, faster batch processing, larger max batch size

Default: 1 for generation contexts, 8 for embedding contexts (auto-set)

Examples:

// Use default (8 for embeddings, 1 for generation)
ctx, err := model.NewContext(llama.WithEmbeddings())

// Tune down for large embedding model with limited VRAM
ctx, err := model.NewContext(
    llama.WithEmbeddings(),
    llama.WithParallel(4),
)

// Single sequence (minimal VRAM, no batching)
ctx, err := model.NewContext(
    llama.WithEmbeddings(),
    llama.WithParallel(1),
)

func WithPrefixCaching

func WithPrefixCaching(enabled bool) ContextOption

WithPrefixCaching enables or disables KV cache prefix reuse.

When enabled (default), the context automatically reuses cached KV entries for matching prompt prefixes, significantly improving performance for conversation-style usage where prompts share common beginnings.

Default: true (enabled)

Example:

// Disable prefix caching (not recommended for most use cases)
ctx, err := model.NewContext(llama.WithPrefixCaching(false))
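The reuse decision boils down to the longest matching token prefix between the new prompt and what is already in the KV cache. A sketch of that comparison (not the library's internals):

```go
package main

import "fmt"

// commonPrefixLen returns how many leading tokens two token sequences
// share — the number of KV cache entries a new prompt could reuse.
func commonPrefixLen(cached, prompt []int32) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n] == prompt[n] {
		n++
	}
	return n
}

func main() {
	cached := []int32{1, 42, 7, 99, 5} // tokens from the previous prompt
	prompt := []int32{1, 42, 7, 3, 8}  // new prompt, 3-token shared prefix
	reuse := commonPrefixLen(cached, prompt)
	fmt.Printf("reuse %d cached tokens, recompute %d\n", reuse, len(prompt)-reuse)
	// reuse 3 cached tokens, recompute 2
}
```

When prompts share a long system prompt, only the suffix needs evaluation, which is where the conversation-style speedup comes from.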

func WithThreads

func WithThreads(n int) ContextOption

WithThreads sets the number of threads for token generation. If not specified, defaults to runtime.NumCPU(). This also sets threadsBatch to the same value unless WithThreadsBatch is used.

func WithThreadsBatch

func WithThreadsBatch(n int) ContextOption

WithThreadsBatch sets the number of threads for batch/prompt processing. If not specified, defaults to the same value as threads. For most use cases, leaving this unset is recommended.
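
Neither option has an example above. A caller-side sketch of choosing a thread count, with the (assumed) option calls shown in comments — the clamp is a reasonable caller heuristic, not something the library does for you:

```go
package main

import (
	"fmt"
	"runtime"
)

// pickThreads clamps a requested thread count to [1, numCPU].
// A non-positive request falls back to numCPU, mirroring the
// documented default of runtime.NumCPU().
func pickThreads(requested, numCPU int) int {
	if requested < 1 {
		return numCPU
	}
	if requested > numCPU {
		return numCPU
	}
	return requested
}

func main() {
	n := pickThreads(6, runtime.NumCPU())
	fmt.Println("generation threads:", n)
	// ctx, err := model.NewContext(
	//     llama.WithThreads(n),                      // token generation
	//     llama.WithThreadsBatch(runtime.NumCPU()),  // prompt processing parallelises well
	// )
}
```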

type GPUInfo

type GPUInfo struct {
	DeviceID      int    // CUDA device ID
	DeviceName    string // GPU model name (e.g., "NVIDIA GeForce RTX 3090")
	FreeMemoryMB  int    // Available VRAM in MB
	TotalMemoryMB int    // Total VRAM in MB
}

GPUInfo contains information about a CUDA GPU device.
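
One use for these fields is picking a device for WithMainGPU. The sketch below mirrors the GPUInfo fields locally so it runs standalone; in real code you would iterate llama.GPUInfo values from Model.Stats().

```go
package main

import "fmt"

// gpuInfo mirrors the GPUInfo fields documented above.
type gpuInfo struct {
	DeviceID     int
	DeviceName   string
	FreeMemoryMB int
}

// mostFreeGPU returns the DeviceID with the most free VRAM, or -1 if no
// GPUs are present. The result can be passed to WithMainGPU as a string.
func mostFreeGPU(gpus []gpuInfo) int {
	best, bestFree := -1, -1
	for _, g := range gpus {
		if g.FreeMemoryMB > bestFree {
			best, bestFree = g.DeviceID, g.FreeMemoryMB
		}
	}
	return best
}

func main() {
	gpus := []gpuInfo{
		{DeviceID: 0, DeviceName: "NVIDIA GeForce RTX 3090", FreeMemoryMB: 4200},
		{DeviceID: 1, DeviceName: "NVIDIA GeForce RTX 3090", FreeMemoryMB: 23000},
	}
	fmt.Println("main GPU:", mostFreeGPU(gpus)) // device 1 has the most free VRAM
}
```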

type GenerateOption

type GenerateOption func(*generateConfig)

GenerateOption configures text generation behaviour.

func WithDRYAllowedLength

func WithDRYAllowedLength(length int) GenerateOption

WithDRYAllowedLength sets minimum repeat length before DRY penalty applies.

Repetitions shorter than this many tokens are ignored by DRY sampling. Prevents penalising common short phrases and natural language patterns. Only relevant when DRY multiplier is enabled.

Default: 2

Example:

// Only penalise repetitions of 4+ tokens
text, err := model.Generate("Write text",
    llama.WithDRYMultiplier(0.8),
    llama.WithDRYAllowedLength(4),
)

func WithDRYBase

func WithDRYBase(base float32) GenerateOption

WithDRYBase sets the base for DRY penalty exponentiation.

Controls how rapidly penalty grows for longer repeated sequences. Higher values penalise longer repetitions more aggressively. Only affects behaviour when DRY multiplier is enabled (> 0.0).

Default: 1.75

Example:

// Stronger penalty for long repeated sequences
text, err := model.Generate("Write text",
    llama.WithDRYMultiplier(0.8),
    llama.WithDRYBase(2.0),
)

func WithDRYMultiplier

func WithDRYMultiplier(mult float32) GenerateOption

WithDRYMultiplier enables DRY repetition penalty.

DRY sampling uses sophisticated sequence matching to penalise repetitive patterns. The multiplier controls penalty strength (0.0 = disabled, 0.8 = moderate, higher = stronger). More effective than basic repetition penalties for catching phrase-level and structural repetition.

Default: 0.0 (disabled)

Example:

// Prevent repetitive patterns
text, err := model.Generate("Write varied text",
    llama.WithDRYMultiplier(0.8),
    llama.WithDRYBase(1.75),
)

func WithDRYPenaltyLastN

func WithDRYPenaltyLastN(n int) GenerateOption

WithDRYPenaltyLastN sets how many recent tokens DRY sampling considers.

DRY looks back this many tokens when detecting repetitive patterns. Use -1 for full context size, or specify a smaller window for efficiency. Only affects behaviour when DRY multiplier is enabled.

Default: -1 (context size)

Example:

// Check last 512 tokens for repetition
text, err := model.Generate("Write text",
    llama.WithDRYMultiplier(0.8),
    llama.WithDRYPenaltyLastN(512),
)

func WithDRYSequenceBreakers

func WithDRYSequenceBreakers(breakers ...string) GenerateOption

WithDRYSequenceBreakers sets sequences that break DRY repetition matching.

When these sequences appear, DRY stops considering earlier tokens as part of a repeated pattern. Default breakers (newline, colon, quote, asterisk) work well for natural text structure. Only affects behaviour when DRY multiplier is enabled.

Default: []string{"\n", ":", "\"", "*"}

Example:

// Custom breakers for code generation
text, err := model.Generate("Write code",
    llama.WithDRYMultiplier(0.8),
    llama.WithDRYSequenceBreakers("\n", ";", "{", "}"),
)

func WithDebug

func WithDebug() GenerateOption

WithDebug enables verbose logging for generation internals.

When enabled, prints detailed information about token sampling, timing, and internal state to stderr. Useful for debugging generation issues or understanding model behaviour. Not recommended for production use.

Default: false

Example:

text, err := model.Generate("Test prompt",
    llama.WithDebug(),
)

func WithDraftTokens

func WithDraftTokens(n int) GenerateOption

WithDraftTokens sets the number of speculative tokens for draft model usage.

When using GenerateWithDraft, the draft model speculatively generates this many tokens per iteration. Higher values increase potential speedup but waste more work if predictions are rejected. Typical range: 4-32 tokens.

Default: 16

Example:

target, _ := llama.LoadModel("large-model.gguf")
draft, _ := llama.LoadModel("small-model.gguf")
text, err := target.GenerateWithDraft("Write a story", draft,
    llama.WithDraftTokens(8),
)

func WithDynamicTemperature

func WithDynamicTemperature(tempRange, exponent float32) GenerateOption

WithDynamicTemperature enables entropy-based temperature adjustment.

Dynamic temperature adjusts sampling temperature based on prediction entropy (uncertainty). The range parameter controls the adjustment span (0.0 = disabled, higher = more dynamic). The exponent controls how entropy maps to temperature. This adapts creativity to context: more focused when confident, more exploratory when uncertain.

Default: range 0.0 (disabled), exponent 1.0

Example:

// Enable dynamic temperature with range 0.5
text, err := model.Generate("Write adaptively",
    llama.WithDynamicTemperature(0.5, 1.0),
)

func WithFrequencyPenalty

func WithFrequencyPenalty(penalty float32) GenerateOption

WithFrequencyPenalty sets the frequency-based repetition penalty.

Penalises tokens proportionally to how often they've appeared. Positive values (e.g. 0.5) discourage repetition, negative values encourage it. Use 0.0 to disable. Unlike repeat penalty, this considers cumulative frequency rather than just presence/absence.

Default: 0.0 (disabled)

Example:

// Discourage frequently used words
text, err := model.Generate("Write varied prose",
    llama.WithFrequencyPenalty(0.5),
)

func WithIgnoreEOS

func WithIgnoreEOS(ignore bool) GenerateOption

WithIgnoreEOS continues generation past end-of-sequence tokens.

When enabled, generation continues even after the model produces an EOS token, up to max_tokens limit. Useful for forcing longer outputs or exploring model behaviour beyond natural stopping points. Most applications should leave this disabled.

Default: false

Example:

// Force generation to continue past EOS
text, err := model.Generate("Complete this",
    llama.WithIgnoreEOS(true),
    llama.WithMaxTokens(512),
)

func WithMaxTokens

func WithMaxTokens(n int) GenerateOption

WithMaxTokens sets the maximum number of tokens to generate.

Generation stops after producing this many tokens, even if the model hasn't emitted an end-of-sequence token. This prevents runaway generation and controls response length.

Default: 128

Example:

// Generate up to 512 tokens
text, err := model.Generate("Write a story",
    llama.WithMaxTokens(512),
)

func WithMinKeep

func WithMinKeep(n int) GenerateOption

WithMinKeep sets minimum tokens to keep regardless of other filters.

Ensures at least this many tokens remain available after sampling filters (top-k, top-p, min-p, etc.) are applied. Prevents over-aggressive filtering from leaving no valid tokens. Use 0 for no minimum.

Default: 0

Example:

// Ensure at least 5 token choices remain
text, err := model.Generate("Generate text",
    llama.WithTopK(10),
    llama.WithMinKeep(5),
)

func WithMinP

func WithMinP(p float32) GenerateOption

WithMinP enables minimum probability threshold sampling.

Min-P sampling filters out tokens with probability below p * max_probability. This is a modern alternative to top-p that adapts dynamically to the confidence of predictions. More effective than top-p for maintaining quality whilst allowing appropriate diversity.

Default: 0.05

Example:

// Stricter filtering for focused output
text, err := model.Generate("Explain quantum physics",
    llama.WithMinP(0.1),
)

func WithMirostat

func WithMirostat(version int) GenerateOption

WithMirostat enables Mirostat adaptive sampling.

Mirostat dynamically adjusts sampling to maintain consistent perplexity (surprise level). Version 0 = disabled, 1 = Mirostat v1, 2 = Mirostat v2 (recommended). Use WithMirostatTau and WithMirostatEta to control target perplexity and learning rate. Mirostat replaces temperature/top-k/top-p with adaptive control for more consistent quality.

Default: 0 (disabled)

Example:

// Enable Mirostat v2 for consistent quality
text, err := model.Generate("Write text",
    llama.WithMirostat(2),
    llama.WithMirostatTau(5.0),
    llama.WithMirostatEta(0.1),
)

func WithMirostatEta

func WithMirostatEta(eta float32) GenerateOption

WithMirostatEta sets learning rate for Mirostat adaptation.

Eta controls how quickly Mirostat adjusts to maintain target perplexity. Higher values adapt faster but may oscillate, lower values adapt smoothly but slowly. Typical range: 0.05-0.2. Only affects behaviour when Mirostat is enabled (version 1 or 2).

Default: 0.1

Example:

// Faster adaptation
text, err := model.Generate("Write text",
    llama.WithMirostat(2),
    llama.WithMirostatEta(0.15),
)

func WithMirostatTau

func WithMirostatTau(tau float32) GenerateOption

WithMirostatTau sets target perplexity for Mirostat sampling.

Tau controls the target cross-entropy (surprise level) that Mirostat tries to maintain. Higher values allow more surprise/diversity, lower values produce more focused output. Typical range: 3.0-8.0. Only affects behaviour when Mirostat is enabled (version 1 or 2).

Default: 5.0

Example:

// Lower perplexity for more focused output
text, err := model.Generate("Write precisely",
    llama.WithMirostat(2),
    llama.WithMirostatTau(3.0),
)

func WithNPrev

func WithNPrev(n int) GenerateOption

WithNPrev sets number of previous tokens to remember for sampling.

Controls internal buffer size for token history used by various sampling methods. Rarely needs adjustment from the default. Larger values may improve long-range coherence at the cost of memory.

Default: 64

Example:

// Larger history buffer
text, err := model.Generate("Write text",
    llama.WithNPrev(128),
)

func WithNProbs

func WithNProbs(n int) GenerateOption

WithNProbs enables probability output for top tokens.

When set to n > 0, outputs probabilities for the top n most likely tokens at each step. Use 0 to disable (no probability output). Useful for analysis, debugging, or implementing custom sampling strategies. Note that enabling this may affect performance.

Default: 0 (disabled)

Example:

// Output top 5 token probabilities
text, err := model.Generate("Write text",
    llama.WithNProbs(5),
)

func WithPenaltyLastN

func WithPenaltyLastN(n int) GenerateOption

WithPenaltyLastN sets how many recent tokens to consider for penalties.

Repetition penalties (repeat, frequency, presence) only apply to the last n tokens. Use 0 to disable all repetition penalties, -1 to use full context size. Larger values catch longer-range repetition but may over-penalise.

Default: 64

Example:

// Consider last 256 tokens for repetition
text, err := model.Generate("Write text",
    llama.WithRepeatPenalty(1.1),
    llama.WithPenaltyLastN(256),
)

func WithPresencePenalty

func WithPresencePenalty(penalty float32) GenerateOption

WithPresencePenalty sets the presence-based repetition penalty.

Penalises tokens that have appeared at all, regardless of frequency. Positive values (e.g. 0.6) encourage new topics and vocabulary. Use 0.0 to disable. This is effective for maintaining topic diversity and preventing the model from fixating on specific words.

Default: 0.0 (disabled)

Example:

// Encourage diverse vocabulary
text, err := model.Generate("Write creatively",
    llama.WithPresencePenalty(0.6),
)

func WithRepeatPenalty

func WithRepeatPenalty(penalty float32) GenerateOption

WithRepeatPenalty sets the repetition penalty multiplier.

Applies penalty to recently used tokens to reduce repetition. Values > 1.0 penalise repeated tokens (1.1 = mild, 1.5 = strong). Use 1.0 to disable. Applied to last penalty_last_n tokens. This is the classic repetition penalty used in most LLM implementations.

Default: 1.0 (disabled)

Example:

// Reduce repetition in creative writing
text, err := model.Generate("Write a story",
    llama.WithRepeatPenalty(1.1),
    llama.WithPenaltyLastN(256),
)

func WithSeed

func WithSeed(seed int) GenerateOption

WithSeed sets the random seed for reproducible generation.

Using the same seed with identical settings produces deterministic output. Use -1 for random seed (different output each time). Useful for testing, debugging, or when reproducibility is required.

Default: -1 (random)

Example:

// Reproducible generation
text, err := model.Generate("Write a story",
    llama.WithSeed(42),
    llama.WithTemperature(0.8),
)

func WithStopWords

func WithStopWords(words ...string) GenerateOption

WithStopWords specifies sequences that terminate generation when encountered.

Generation stops immediately when any stop word is produced. Useful for controlling response format (e.g. stopping at newlines) or implementing chat patterns. The stop words themselves are not included in the output.

Default: none

Examples:

// Stop at double newline
text, err := model.Generate("Q: What is AI?",
    llama.WithStopWords("\n\n"),
)

// Multiple stop sequences
text, err := model.Generate("User:",
    llama.WithStopWords("User:", "Assistant:", "\n\n"),
)

func WithTemperature

func WithTemperature(t float32) GenerateOption

WithTemperature controls randomness in token selection.

Higher values (e.g. 1.2) increase creativity and diversity but may reduce coherence. Lower values (e.g. 0.3) make output more deterministic and focused. Use 0.0 for fully deterministic greedy sampling (always pick the most likely token).

Default: 0.8

Examples:

// Creative writing
text, err := model.Generate("Write a poem",
    llama.WithTemperature(1.1),
)

// Precise factual responses
text, err := model.Generate("What is 2+2?",
    llama.WithTemperature(0.1),
)

func WithTopK

func WithTopK(k int) GenerateOption

WithTopK limits token selection to the k most likely candidates.

Top-k sampling considers only the k highest probability tokens at each step. Lower values increase focus and determinism, higher values increase diversity. Use 0 to disable (consider all tokens).

Default: 40

Example:

// Very focused generation
text, err := model.Generate("Complete this",
    llama.WithTopK(10),
)

func WithTopNSigma

func WithTopNSigma(sigma float32) GenerateOption

WithTopNSigma enables top-n-sigma statistical filtering.

Filters tokens beyond n standard deviations from the mean log probability. Use -1.0 to disable. This statistical approach removes unlikely outliers whilst preserving the natural probability distribution shape.

Default: -1.0 (disabled)

Example:

// Filter statistical outliers
text, err := model.Generate("Generate text",
    llama.WithTopNSigma(2.0),
)

func WithTopP

func WithTopP(p float32) GenerateOption

WithTopP enables nucleus sampling with the specified cumulative probability.

Top-p sampling (nucleus sampling) considers only the smallest set of tokens whose cumulative probability exceeds p. This balances diversity and quality better than top-k for many tasks. Use 1.0 to disable (consider all tokens).

Default: 0.95

Example:

// More focused sampling
text, err := model.Generate("Complete this",
    llama.WithTopP(0.85),
)

func WithTypicalP

func WithTypicalP(p float32) GenerateOption

WithTypicalP enables locally typical sampling.

Typical-p sampling (typ-p) filters tokens based on information content, keeping those with typical entropy. Use 1.0 to disable. This helps avoid both highly predictable and highly surprising tokens, producing more "typical" text that feels natural.

Default: 1.0 (disabled)

Example:

// Enable typical sampling
text, err := model.Generate("Write naturally",
    llama.WithTypicalP(0.95),
)

func WithXTC

func WithXTC(probability, threshold float32) GenerateOption

WithXTC enables experimental XTC sampling for diversity.

XTC probabilistically excludes the most likely token to encourage diversity. The probability parameter controls how often exclusion occurs (0.0 = disabled, 0.1 = 10% of the time). The threshold parameter limits when XTC applies (> 0.5 effectively disables). This is an experimental technique for reducing predictability.

Default: probability 0.0 (disabled), threshold 0.1

Example:

// Enable XTC for more surprising outputs
text, err := model.Generate("Write creatively",
    llama.WithXTC(0.1, 0.1),
)

type Model

type Model struct {
	ProgressCallbackID uintptr // Internal ID for progress callback cleanup (for testing)
	// contains filtered or unexported fields
}

Model represents loaded model weights.

Model instances are thread-safe and can be used to create multiple execution contexts with different configurations. The model owns the weights in memory but doesn't perform inference directly - use NewContext() to create execution contexts.

Resources are automatically freed via finaliser, but explicit Close() is recommended for deterministic cleanup:

model, _ := llama.LoadModel("model.gguf")
defer model.Close()

Note: Calling methods after Close() returns an error.

func LoadModel

func LoadModel(path string, opts ...ModelOption) (*Model, error)

LoadModel loads a GGUF model from the specified path.

The path must point to a valid GGUF format model file. Legacy GGML formats are not supported. The function applies the provided options using the functional options pattern, with sensible defaults if none are specified.

Resources are managed automatically via finaliser, but explicit cleanup with Close() is recommended for deterministic resource management:

model, err := llama.LoadModel("model.gguf")
if err != nil {
    return err
}
defer model.Close()

Returns an error if the file doesn't exist, is not a valid GGUF model, or if model loading fails.

Examples:

// Load with defaults
model, err := llama.LoadModel("model.gguf")

// Load with custom GPU configuration
model, err := llama.LoadModel("model.gguf",
    llama.WithGPULayers(35),
)

func (*Model) ChatTemplate

func (m *Model) ChatTemplate() string

ChatTemplate returns the chat template from the model's GGUF metadata.

Returns an empty string if the model has no embedded chat template. Most modern instruction-tuned models include a template in their GGUF metadata that specifies how to format messages for that specific model.

Example:

template := model.ChatTemplate()
if template == "" {
    // Model has no template - user must provide one
}

func (*Model) Close

func (m *Model) Close() error

Close frees the model and its associated resources.

This method is idempotent - multiple calls are safe and subsequent calls return immediately without error.

After Close() is called, all other methods return an error. The method uses a write lock to prevent concurrent operations during cleanup.

Example:

model, _ := llama.LoadModel("model.gguf")
defer model.Close()

func (*Model) FormatChatPrompt

func (m *Model) FormatChatPrompt(messages []ChatMessage, opts ChatOptions) (string, error)

FormatChatPrompt formats chat messages using the model's chat template.

This method applies the chat template to the provided messages and returns the resulting prompt string without performing generation. Useful for:

  • Debugging what will be sent to the model
  • Pre-computing prompts for caching
  • Understanding how the template formats conversations

The template priority is: opts.ChatTemplate > model's GGUF template > error.

See also: Context.Chat for performing chat completion with generation.

Example:

messages := []llama.ChatMessage{
    {Role: "system", Content: "You are helpful."},
    {Role: "user", Content: "Hello"},
}
prompt, err := model.FormatChatPrompt(messages, llama.ChatOptions{})
fmt.Println("Formatted prompt:", prompt)

func (*Model) NewContext

func (m *Model) NewContext(opts ...ContextOption) (*Context, error)

NewContext creates a new execution context from this model.

This method creates an execution context with the specified configuration. Multiple contexts can be created from the same model to handle different use cases (e.g., small context for tokenization, large context for generation).

Each context maintains its own KV cache and state. For concurrent inference, create multiple contexts from the same model - this is VRAM efficient since contexts share the model weights (e.g., 7GB model + 100MB per context).

Thread safety: Model is thread-safe, but each Context is not. Use one context per goroutine for concurrent inference.

See also: Context.Generate, Context.Chat for inference operations.

Example:

// Load model once
model, _ := llama.LoadModel("model.gguf", llama.WithGPULayers(-1))
defer model.Close()

// Create context for tokenization
tokCtx, _ := model.NewContext(
    llama.WithContext(512),
    llama.WithKVCacheType("f16"),
)
defer tokCtx.Close()

// Create context for generation
genCtx, _ := model.NewContext(
    llama.WithContext(8192),
    llama.WithKVCacheType("q8_0"),
)
defer genCtx.Close()
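
The one-context-per-goroutine rule above can be sketched structurally. Here the context is stubbed as a plain function so the pattern runs standalone; in real code newCtx would wrap model.NewContext and the goroutine would call ctx.Generate.

```go
package main

import (
	"fmt"
	"sync"
)

// generator stands in for a *llama.Context's generate call.
type generator func(prompt string) string

// fanOut creates one generator per goroutine, honouring the rule that a
// Context must not be shared across goroutines. Each goroutine writes to
// a distinct slice index, so no extra locking is needed.
func fanOut(newCtx func() generator, prompts []string) []string {
	out := make([]string, len(prompts))
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		go func(i int, p string, gen generator) {
			defer wg.Done()
			out[i] = gen(p)
		}(i, p, newCtx()) // fresh context per goroutine
	}
	wg.Wait()
	return out
}

func main() {
	// Stub: real code would return a closure over model.NewContext(...).
	newCtx := func() generator { return func(p string) string { return "echo: " + p } }
	fmt.Println(fanOut(newCtx, []string{"a", "b"}))
}
```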

func (*Model) Stats

func (m *Model) Stats() (*ModelStats, error)

Stats returns comprehensive statistics about the model and runtime environment.

This includes:

  • GPU device information (name, VRAM)
  • Model metadata from GGUF (architecture, name, size, etc.)
  • Runtime configuration (context size, batch size, KV cache)

The returned information is useful for:

  • Displaying model details to users
  • Debugging configuration issues
  • Monitoring resource usage

Example:

stats, err := model.Stats()
if err != nil {
    log.Fatal(err)
}
fmt.Println(stats)

type ModelMetadata

type ModelMetadata struct {
	Architecture string // Model architecture (e.g., "qwen3", "llama")
	Name         string // Full model name
	Basename     string // Base model name
	QuantizedBy  string // Who quantized the model
	SizeLabel    string // Model size (e.g., "8B", "70B")
	RepoURL      string // Hugging Face repo URL
}

ModelMetadata contains model information from GGUF metadata.

type ModelOption

type ModelOption func(*modelConfig)

ModelOption configures model loading behaviour (model-level settings).

func WithGPULayers

func WithGPULayers(n int) ModelOption

WithGPULayers sets the number of model layers to offload to GPU.

By default, all layers are offloaded to GPU (-1). If GPU acceleration is unavailable, the library automatically falls back to CPU execution. Set to 0 to force CPU-only execution, or specify a positive number to partially offload layers (useful for models larger than GPU memory).

Default: -1 (offload all layers, with CPU fallback)

Examples:

// Force CPU execution
model, err := llama.LoadModel("model.gguf",
    llama.WithGPULayers(0),
)

// Offload 35 layers to GPU, rest on CPU
model, err := llama.LoadModel("model.gguf",
    llama.WithGPULayers(35),
)

func WithMLock

func WithMLock() ModelOption

WithMLock forces the model to stay in RAM using mlock().

When enabled, prevents the operating system from swapping model data to disk. Useful for production environments where consistent latency is critical, but requires sufficient physical RAM and may require elevated privileges.

Default: false (allows OS to manage memory)

Example:

model, err := llama.LoadModel("model.gguf",
    llama.WithMLock(),
)

func WithMMap

func WithMMap(enabled bool) ModelOption

WithMMap enables or disables memory-mapped file I/O for model loading.

Memory mapping (mmap) allows the OS to load model data on-demand rather than reading the entire file upfront. This significantly reduces startup time and memory usage. Disable only if you encounter platform-specific issues.

Default: true (enabled)

Example:

// Disable mmap for compatibility
model, err := llama.LoadModel("model.gguf",
    llama.WithMMap(false),
)

func WithMainGPU

func WithMainGPU(gpu string) ModelOption

WithMainGPU sets the primary GPU device for model execution.

Use this option to select a specific GPU in multi-GPU systems. The device string format depends on the backend (e.g. "0" for CUDA device 0). Most users with single-GPU systems don't need this option.

Default: "" (uses default GPU)

Example:

// Use second GPU
model, err := llama.LoadModel("model.gguf",
    llama.WithMainGPU("1"),
)

func WithProgressCallback

func WithProgressCallback(cb ProgressCallback) ModelOption

WithProgressCallback sets a custom progress callback for model loading.

The callback is invoked periodically during model loading with progress values from 0.0 (start) to 1.0 (complete). This allows implementing custom progress indicators, logging, or loading cancellation.

The callback receives:

  • progress: float32 from 0.0 to 1.0 indicating loading progress

The callback must return:

  • true: continue loading
  • false: cancel loading (LoadModel will return an error)

IMPORTANT: The callback is invoked from a C thread during model loading. Ensure any operations are thread-safe. The callback should complete quickly to avoid blocking the loading process.

Default: nil (uses llama.cpp default dot printing)

Examples:

// Simple progress indicator
model, err := llama.LoadModel("model.gguf",
    llama.WithProgressCallback(func(progress float32) bool {
        fmt.Printf("\rLoading: %.0f%%", progress*100)
        return true
    }),
)

// Cancel loading after 50%
model, err := llama.LoadModel("model.gguf",
    llama.WithProgressCallback(func(progress float32) bool {
        if progress > 0.5 {
            return false // Cancel
        }
        return true
    }),
)

func WithSilentLoading

func WithSilentLoading() ModelOption

WithSilentLoading disables progress output during model loading.

By default, llama.cpp prints dots to stderr to indicate loading progress. This option suppresses that output completely, useful for clean logs in production environments or when progress output interferes with other output formatting.

Note: The LLAMA_LOG environment variable controls general logging but does not suppress progress dots. Use this option for truly silent loading.

Default: false (shows progress dots)

Example:

model, err := llama.LoadModel("model.gguf",
    llama.WithSilentLoading(),
)

func WithTensorSplit

func WithTensorSplit(split string) ModelOption

WithTensorSplit configures tensor distribution across multiple GPUs.

Allows manual control of how model layers are distributed across GPUs in multi-GPU setups. The split string format is backend-specific (e.g. "0.7,0.3" for CUDA to use 70% on GPU 0, 30% on GPU 1). Most users should rely on automatic distribution instead.

Default: "" (automatic distribution)

Example:

// Distribute 60/40 across two GPUs
model, err := llama.LoadModel("model.gguf",
    llama.WithTensorSplit("0.6,0.4"),
)

type ModelStats

type ModelStats struct {
	GPUs     []GPUInfo     // Information about available CUDA GPUs
	Metadata ModelMetadata // Model metadata from GGUF file
	Runtime  RuntimeInfo   // Runtime configuration and resource usage
}

ModelStats contains comprehensive model statistics and metadata.

This includes GPU information, model metadata from GGUF, and runtime configuration. Use Model.Stats() to retrieve these statistics.

func (*ModelStats) String

func (s *ModelStats) String() string

String returns a formatted summary of model statistics.

The output includes GPU information, model details, and runtime configuration in a human-readable format suitable for display.

Example output:

=== Model Statistics ===

GPU Devices:
  GPU 0: NVIDIA GeForce RTX 3090
    VRAM: 23733 MB free / 24576 MB total

Model Details:
  Name: DeepSeek-R1-0528-Qwen3-8B
  Architecture: qwen3 (8B)
  Quantized by: Unsloth
  Repository: https://huggingface.co/unsloth

Runtime Configuration:
  Context: 131,072 tokens | Batch: 512 tokens
  KV Cache: q8_0 (9,216 MB)
  GPU Layers: 28/28

type ProgressCallback

type ProgressCallback func(progress float32) bool

ProgressCallback is called during model loading with progress 0.0-1.0. Return false to cancel loading, true to continue.

type ReasoningFormat

type ReasoningFormat int

ReasoningFormat specifies how reasoning content is handled for models that emit thinking/reasoning tokens (like DeepSeek-R1).

Reasoning models typically emit content within special tags like <think>...</think>. These formats control whether that content is extracted into separate ReasoningContent fields or left inline.

const (
	// ReasoningFormatNone leaves reasoning content inline with regular content.
	// All tokens appear in Content/delta.Content fields.
	ReasoningFormatNone ReasoningFormat = iota

	// ReasoningFormatAuto extracts reasoning to ReasoningContent field.
	// Tokens inside reasoning tags go to ReasoningContent, others to Content.
	// This is the recommended format for reasoning models.
	ReasoningFormatAuto

	// ReasoningFormatDeepSeekLegacy extracts in non-streaming mode only.
	// For streaming: reasoning stays inline. For Chat(): extracted.
	// This matches DeepSeek's original API behaviour.
	ReasoningFormatDeepSeekLegacy

	// ReasoningFormatDeepSeek extracts reasoning in all modes.
	// Always separates reasoning content from regular content.
	ReasoningFormatDeepSeek
)
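
Conceptually, the extracting formats above split <think>...</think> content into a separate field. This standalone sketch illustrates that split only; it is not the library's parser, which also handles streaming and model-specific tags.

```go
package main

import (
	"fmt"
	"strings"
)

// splitReasoning pulls the first <think>...</think> span out of s,
// returning the reasoning text and the remaining content. Input with no
// complete tag pair is returned unchanged as content.
func splitReasoning(s string) (reasoning, content string) {
	start := strings.Index(s, "<think>")
	end := strings.Index(s, "</think>")
	if start == -1 || end == -1 || end < start {
		return "", s
	}
	reasoning = s[start+len("<think>") : end]
	content = s[:start] + s[end+len("</think>"):]
	return reasoning, strings.TrimSpace(content)
}

func main() {
	r, c := splitReasoning("<think>2+2 is 4</think>The answer is 4.")
	fmt.Printf("reasoning=%q content=%q\n", r, c)
}
```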

func (ReasoningFormat) String

func (r ReasoningFormat) String() string

String returns the string representation of a ReasoningFormat.

type RuntimeInfo

type RuntimeInfo struct {
	ContextSize     int    // Context window size in tokens
	BatchSize       int    // Batch processing size
	KVCacheType     string // KV cache quantization type ("f16", "q8_0", "q4_0")
	KVCacheSizeMB   int    // Estimated KV cache memory usage in MB
	GPULayersLoaded int    // Number of layers offloaded to GPU
	TotalLayers     int    // Total number of layers in model
}

RuntimeInfo contains current runtime configuration and resource usage.

type Tool

type Tool struct {
	Type     string       `json:"type"`     // "function"
	Function ToolFunction `json:"function"` // Function definition
}

Tool represents a tool/function that can be called by the model.

Note: Tool calling is not yet implemented in the Go API, but these types are defined for future compatibility with models that support function calling (like GPT-4, Claude, etc.).

When implemented, tools will be passed via ChatOptions and the model may return ToolCall objects in ChatResponse/ChatDelta.

Example (future usage):

tool := llama.Tool{
    Type: "function",
    Function: llama.ToolFunction{
        Name:        "get_weather",
        Description: "Get the current weather in a location",
        Parameters: map[string]interface{}{
            "type": "object",
            "properties": map[string]interface{}{
                "location": map[string]interface{}{
                    "type":        "string",
                    "description": "City name",
                },
            },
            "required": []string{"location"},
        },
    },
}

type ToolCall

type ToolCall struct {
	ID       string `json:"id"`   // Unique identifier for this call
	Type     string `json:"type"` // "function"
	Function struct {
		Name      string `json:"name"`      // Function name being called
		Arguments string `json:"arguments"` // JSON string of arguments
	} `json:"function"`
}

ToolCall represents a function call made by the model.

When a model decides to call a function, it returns a ToolCall with the function name and arguments (as a JSON string). The application should execute the function and return the result in a subsequent message with role "tool".

Example (future usage):

// Model returns tool call
if len(response.ToolCalls) > 0 {
    call := response.ToolCalls[0]
    result := executeFunction(call.Function.Name, call.Function.Arguments)

    // Send result back to model
    messages = append(messages, llama.ChatMessage{
        Role:    "tool",
        Content: result,
        ToolCallID: call.ID,
    })
}

type ToolFunction

type ToolFunction struct {
	Name        string                 `json:"name"`        // Function name (must be valid identifier)
	Description string                 `json:"description"` // Human-readable description
	Parameters  map[string]interface{} `json:"parameters"`  // JSON Schema for parameters
}

ToolFunction defines a function that can be called by the model.

The Parameters field should contain a JSON Schema describing the function's parameters. This follows the OpenAI function calling format.

Directories

Path Synopsis
examples
chat command
Chat example demonstrates non-streaming chat completion.
chat-streaming command
Chat streaming example demonstrates real-time chat completion with visual distinction between regular content and reasoning output.
embedding command
Embedding example demonstrates generating text embeddings for semantic tasks.
simple command
Simple example demonstrates basic text generation using llama-go.
speculative command
Speculative example demonstrates speculative decoding for faster generation.
streaming command
Streaming example demonstrates both callback and channel-based token streaming.
internal
exampleui
Package exampleui provides terminal UI utilities for llama-go examples.
