Documentation ¶
Overview ¶
Package generate implements autoregressive text generation for transformer models loaded by the inference package. It provides the core decode loop, KV caching, token sampling, streaming output, batch generation, and speculative decoding. (Stability: stable)
Generator ¶
Generator is the primary entry point. It takes a compiled computation graph, a tokenizer, a compute engine, and a ModelConfig, then drives the prefill-and-decode loop:
gen := generate.NewGenerator[float32](graph, tok, engine, cfg)
text, err := gen.Generate(ctx, "Once upon a time", generate.DefaultSamplingConfig())
The Generator compiles the graph into a graph.ExecutionPlan after the first decode step, optionally capturing a CUDA graph for near-zero kernel launch overhead. A megakernel code generator may further fuse the plan's instructions into a single GPU kernel.
Generator options include WithPagedKV (block-allocated KV cache) and WithGeneratorKVDtype (FP16 KV cache storage for reduced memory bandwidth).
Sampling ¶
SamplingConfig controls token selection: temperature scaling, top-K filtering, nucleus (top-P) sampling, repetition penalty, stop tokens, stop strings, and grammar-constrained decoding. DefaultSamplingConfig returns sensible defaults (temperature 1.0, no filtering, 256 max tokens). When Temperature is zero, greedy argmax is used with an optimized GPU fast path that copies only 4 bytes instead of the full vocabulary logits.
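The selection pipeline above can be sketched as a small self-contained program. This is an illustration only: the real SamplingConfig fields, the random draw, and the GPU fast path differ, and `sample`/`argmax` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sample picks a token id from logits. Temperature 0 takes the greedy
// argmax fast path; otherwise logits are scaled by temperature and
// top-K filtered, and the highest surviving logit is returned (a
// deterministic stand-in for the random draw a real sampler performs).
func sample(logits []float32, temperature float64, topK int) int {
	if temperature == 0 {
		return argmax(logits) // greedy fast path
	}
	scaled := make([]float32, len(logits))
	for i, l := range logits {
		scaled[i] = float32(float64(l) / temperature)
	}
	// Top-K: mask everything below the K-th largest scaled logit.
	if topK > 0 && topK < len(scaled) {
		sorted := append([]float32(nil), scaled...)
		sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
		threshold := sorted[topK-1]
		for i := range scaled {
			if scaled[i] < threshold {
				scaled[i] = float32(math.Inf(-1))
			}
		}
	}
	return argmax(scaled)
}

func argmax(xs []float32) int {
	best := 0
	for i, x := range xs {
		if x > xs[best] {
			best = i
		}
	}
	return best
}

func main() {
	logits := []float32{0.1, 3.2, 1.7, 0.4}
	fmt.Println(sample(logits, 0, 0)) // prints 1 (greedy argmax)
}
```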
KV Cache Variants ¶
The package provides three KV cache implementations behind the CacheProvider interface — KVCache, TensorCache, and PagedKVCache — plus a fourth, GPUKVCache, that manages raw device pointers outside that interface:
KVCache pre-allocates flat CPU buffers sized to maxSeqLen on first use. Zero-copy views are returned for batch=1. Suitable for simple CPU inference.
TensorCache is the default for GPU-accelerated inference. It pre-allocates GPU-resident buffers and uses direct D2D memcpy for KV appends. It supports FP16 storage mode (halving memory bandwidth), GPU-resident position counters for CUDA graph capture compatibility, and the FullBufferProvider interface for flash attention decode.
PagedKVCache allocates fixed-size blocks on demand from a shared BlockPool, reducing memory waste for concurrent sequences of varying length. Blocks are shared across layers and recycled via Alloc/Free.
GPUKVCache manages raw GPU device pointers for megakernel inference. It uses offset_memcpy and increment_counter CUDA kernels to write KV data at GPU-counter-derived offsets, making the entire append path capturable in CUDA graphs.
Caches are attached to a context via WithCache and retrieved by attention layers via GetCache.
Streaming ¶
[GenerateStream] delivers tokens incrementally through a TokenStream callback as they are decoded. TokenStreamFunc adapts a plain function to the interface. Stop strings are checked incrementally and any text preceding a match is emitted before the done signal.
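The stop-string behavior can be sketched as follows. `checkStop` is a hypothetical helper, simplified to a whole-buffer scan; the real implementation works incrementally as tokens arrive.

```go
package main

import (
	"fmt"
	"strings"
)

// checkStop scans accumulated output for the first stop string.
// On a match it returns only the text preceding the match and
// done=true, mirroring the documented behavior of emitting the
// preceding text before the done signal.
func checkStop(acc string, stops []string) (emit string, done bool) {
	for _, s := range stops {
		if i := strings.Index(acc, s); i >= 0 {
			return acc[:i], true
		}
	}
	return acc, false
}

func main() {
	emit, done := checkStop("Hello world\n### trailing", []string{"###"})
	fmt.Printf("%q %v\n", emit, done) // text before the stop string, done=true
}
```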
Batch Generation ¶
Generator.BatchGenerate and Generator.BatchGenerateStream accept multiple prompts and run them concurrently (request-level parallelism). True batched tensor operations (batch dimension > 1) require native batch support in the model graph, which is planned but not yet implemented.
Speculative Decoding ¶
SpeculativeGenerator pairs a small draft model with a large target model. The draft proposes N tokens greedily, the target verifies all N in a single batched forward pass, and accepted tokens are emitted. On mismatch the target's token is used. An adaptive draft length tracker adjusts N based on rolling acceptance rate (increasing when acceptance > 80%, decreasing when < 40%).
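The verification rule can be sketched in a few lines. `verifyDraft` is an illustrative helper that assumes the target produced a greedy token for each of the N draft positions; the real verifier also handles sampling and bonus tokens.

```go
package main

import "fmt"

// verifyDraft compares draft tokens against the target model's greedy
// choices at the same positions and returns the accepted sequence:
// the longest matching prefix, plus the target's token at the first
// mismatch (the documented "on mismatch the target's token is used"
// rule). Assumes len(target) >= len(draft).
func verifyDraft(draft, target []int) (accepted []int) {
	for i, d := range draft {
		if d == target[i] {
			accepted = append(accepted, d)
			continue
		}
		return append(accepted, target[i]) // substitute target's token, stop
	}
	return accepted // all drafts accepted
}

func main() {
	fmt.Println(verifyDraft([]int{5, 9, 2, 7}, []int{5, 9, 4, 7})) // [5 9 4]
}
```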
Constrained Decoding ¶
When [SamplingConfig.GrammarState] is set, a token mask is computed from the grammar at each step, restricting sampling to tokens that produce valid continuations. The grammar state advances through the bytes of each sampled token. Generation stops early when the grammar reaches a complete state.
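Mask application can be sketched as below. `applyMask` and `argmax` are hypothetical names; the point is that disallowed tokens are driven to negative infinity so any downstream selection can only pick grammar-valid continuations.

```go
package main

import (
	"fmt"
	"math"
)

// applyMask sets the logits of disallowed tokens to -Inf, restricting
// sampling to grammar-valid continuations. allowed[i] reports whether
// token i is a valid next token in the current grammar state.
func applyMask(logits []float32, allowed []bool) {
	for i := range logits {
		if !allowed[i] {
			logits[i] = float32(math.Inf(-1))
		}
	}
}

func argmax(xs []float32) int {
	best := 0
	for i, x := range xs {
		if x > xs[best] {
			best = i
		}
	}
	return best
}

func main() {
	logits := []float32{2.0, 5.0, 1.0}
	applyMask(logits, []bool{true, false, true}) // token 1 is grammar-invalid
	fmt.Println(argmax(logits))                  // highest *allowed* logit wins: 0
}
```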
Tracing ¶
TracingCacheProvider wraps a real CacheProvider and records KV cache operations into a compute.Tracer during compilation tracing passes, capturing the full attention dataflow including cache reads and writes.
Index ¶
- func WithCache[T tensor.Numeric](ctx context.Context, cache CacheProvider[T]) context.Context
- func WithKVCache[T tensor.Numeric](ctx context.Context, cache *KVCache[T]) context.Context (deprecated)
- func WithTokenUsage(ctx context.Context, usage *TokenUsage) context.Context
- type BatchRequest
- type BatchResult
- type Block
- type BlockPool
- type CacheProvider
- type CompressedKVCache
- func (c *CompressedKVCache[T]) Get(layer int) (*LayerKV[T], bool)
- func (c *CompressedKVCache[T]) NumLayers() int
- func (c *CompressedKVCache[T]) Reset()
- func (c *CompressedKVCache[T]) SeqLen() int
- func (c *CompressedKVCache[T]) Truncate(newSeqLen int)
- func (c *CompressedKVCache[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
- type EAGLEForwardFunc
- type EAGLEForwardResult
- type EAGLEGenerator
- type FullBufferProvider
- type GPUAllocator
- type GPUKVCache
- func (c *GPUKVCache) Append(layerIdx int, k, v []float32, seqPos int) error
- func (c *GPUKVCache) AppendGPU(layerIdx int, kSrc, vSrc unsafe.Pointer, stream unsafe.Pointer) error
- func (c *GPUKVCache) Close() error
- func (c *GPUKVCache) DevicePointerArrays() (kPtrs, vPtrs unsafe.Pointer, err error)
- func (c *GPUKVCache) GPUCounterPtr() unsafe.Pointer
- func (c *GPUKVCache) Pointers(layerIdx int) (kPtr, vPtr unsafe.Pointer, seqLen int)
- func (c *GPUKVCache) Reset()
- func (c *GPUKVCache) SeqLen() int
- func (c *GPUKVCache) SyncCounterFromGPU() error
- type Generator
- func (gen *Generator[T]) BatchGenerate(ctx context.Context, requests []BatchRequest) []BatchResult
- func (gen *Generator[T]) BatchGenerateStream(ctx context.Context, requests []BatchRequest, streams []TokenStream) []error
- func (gen *Generator[T]) Config() ModelConfig
- func (gen *Generator[T]) EAGLEEnabled() bool
- func (gen *Generator[T]) EAGLEWeightsPath() string
- func (gen *Generator[T]) Engine() compute.Engine[T]
- func (gen *Generator[T]) Generate(ctx context.Context, prompt string, sc SamplingConfig) (string, error)
- func (gen *Generator[T]) GenerateStream(ctx context.Context, prompt string, sc SamplingConfig, stream TokenStream) error
- func (gen *Generator[T]) GetPrefixCache() *PrefixCache[T]
- func (gen *Generator[T]) Graph() *graph.Graph[T]
- func (gen *Generator[T]) NewSession() *InferenceSession[T]
- func (gen *Generator[T]) Tokenizer() tokenizer.Tokenizer
- type GeneratorOption
- func WithCompressedKV(chunkSize int) GeneratorOption
- func WithEAGLE(headWeightsPath string) GeneratorOption
- func WithGeneratorKVDtype(dtype string) GeneratorOption
- func WithMetrics(c runtime.Collector) GeneratorOption
- func WithPagedKV(maxMemoryMB, headDim int) GeneratorOption
- func WithPrefixCache(capacityBlocks int) GeneratorOption
- func WithSpeculativeDraft[T tensor.Numeric](draftGraph *graph.Graph[T], draftCfg ModelConfig, draftLen int) GeneratorOption
- func WithTieredKV(cfg TieredKVStoreConfig) GeneratorOption
- type InferenceSession
- type KVCache
- type KVCacheFP8
- type KVCacheFP16
- func (c *KVCacheFP16) Get(layer int) (*LayerKV[float32], bool)
- func (c *KVCacheFP16) NumLayers() int
- func (c *KVCacheFP16) Reset()
- func (c *KVCacheFP16) SeqLen() int
- func (c *KVCacheFP16) Truncate(newSeqLen int)
- func (c *KVCacheFP16) Update(layer int, newK, newV *tensor.TensorNumeric[float32]) error
- type KVCacheQ3
- type KVCacheQ4
- type LayerKV
- type ModelConfig
- type PagedKVCache
- func (c *PagedKVCache[T]) Append(layer int, newK, newV *tensor.TensorNumeric[T]) error
- func (c *PagedKVCache[T]) BlockTable() []*Block[T]
- func (c *PagedKVCache[T]) Free()
- func (c *PagedKVCache[T]) Get(layer int) (*LayerKV[T], bool)
- func (c *PagedKVCache[T]) GetKV(layer int) (*LayerKV[T], bool)
- func (c *PagedKVCache[T]) InjectBlocks(blocks []*Block[T], seqLen int)
- func (c *PagedKVCache[T]) NumLayers() int
- func (c *PagedKVCache[T]) Reset()
- func (c *PagedKVCache[T]) SeqLen() int
- func (c *PagedKVCache[T]) Truncate(newSeqLen int)
- func (c *PagedKVCache[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
- type PrefixCache
- type RadixCache
- type RadixNode
- type SSMState
- type SamplingConfig
- type SpeculativeGenerator
- type TensorCache
- func (c *TensorCache[T]) Free()
- func (c *TensorCache[T]) GPUCounterPtr() unsafe.Pointer
- func (c *TensorCache[T]) Get(layer int) (*LayerKV[T], bool)
- func (c *TensorCache[T]) GetFullBuffer(layer int) (k, v *tensor.TensorNumeric[T])
- func (c *TensorCache[T]) KVSeqLenPtr() unsafe.Pointer
- func (c *TensorCache[T]) MaxSeqLen() int
- func (c *TensorCache[T]) Reset()
- func (c *TensorCache[T]) SeqLen() int
- func (c *TensorCache[T]) SyncCounterFromGPU() error
- func (c *TensorCache[T]) Truncate(newSeqLen int)
- func (c *TensorCache[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
- type TensorCacheOption
- type Tier
- type TieredKVStore
- func (s *TieredKVStore[T]) AccessCount(layer int) int
- func (s *TieredKVStore[T]) Close() error
- func (s *TieredKVStore[T]) Demote(layer int) error
- func (s *TieredKVStore[T]) Get(layer int) (*LayerKV[T], bool)
- func (s *TieredKVStore[T]) GetPrefetched(layer int) (*LayerKV[T], bool)
- func (s *TieredKVStore[T]) ManageTiers() error
- func (s *TieredKVStore[T]) NumLayers() int
- func (s *TieredKVStore[T]) PrefetchAsync(positions []int)
- func (s *TieredKVStore[T]) Promote(layer int) error
- func (s *TieredKVStore[T]) Reset()
- func (s *TieredKVStore[T]) SeqLen() int
- func (s *TieredKVStore[T]) Tier(layer int) Tier
- func (s *TieredKVStore[T]) Truncate(newSeqLen int)
- func (s *TieredKVStore[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
- type TieredKVStoreConfig
- type TokenStream
- type TokenStreamFunc
- type TokenUsage
- type TracingCacheProvider
- func (t *TracingCacheProvider[T]) Get(layer int) (*LayerKV[T], bool)
- func (t *TracingCacheProvider[T]) Reset()
- func (t *TracingCacheProvider[T]) SeqLen() int
- func (t *TracingCacheProvider[T]) Truncate(newSeqLen int)
- func (t *TracingCacheProvider[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func WithTokenUsage ¶ added in v1.12.0
func WithTokenUsage(ctx context.Context, usage *TokenUsage) context.Context
WithTokenUsage returns a new context carrying the given TokenUsage. Billing middleware should call this before dispatching to the handler, then read back the counts after the handler returns.
Types ¶
type BatchRequest ¶
type BatchRequest struct {
Prompt string
Sampling SamplingConfig
}
BatchRequest represents a single generation request in a batch.
type BatchResult ¶
BatchResult holds the output for a single request in a batch.
type Block ¶
type Block[T tensor.Numeric] struct {
K []T
V []T
Used int // number of token positions written (0..blockSize)
}
Block holds pre-allocated key and value data for a fixed number of token positions across all layers. K and V each have numLayers * blockSize * headDim elements laid out as [layer][position][headDim] in row-major order.
type BlockPool ¶
BlockPool manages a fixed-size pool of pre-allocated KV cache blocks. Blocks are allocated at startup and recycled via Alloc/Free. All methods are safe for concurrent use.
func NewBlockPool ¶
func NewBlockPool[T tensor.Numeric](numLayers, blockSize, headDim, maxMemoryMB int) (*BlockPool[T], error)
NewBlockPool creates a pool of blocks sized to fit within maxMemoryMB. Each block holds K and V data for blockSize token positions across numLayers, with headDim elements per position per layer. The element size is assumed to be 4 bytes (float32).
func (*BlockPool[T]) Alloc ¶
Alloc returns a free block from the pool. Returns an error if the pool is exhausted. The returned block has Used reset to 0.
func (*BlockPool[T]) FragmentationRatio ¶ added in v1.5.0
FragmentationRatio returns the fraction of allocated block capacity that is wasted (allocated but unused token positions). A value of 0 means every allocated block is fully used; higher values indicate internal fragmentation from partially-filled blocks.
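The definition above can be computed directly from per-block usage counts. This is a free-standing sketch (`fragmentationRatio` over a slice of Used counts), not the method's actual implementation.

```go
package main

import "fmt"

// fragmentationRatio computes the wasted fraction of allocated block
// capacity: allocated-but-unused token positions divided by total
// allocated positions. used[i] is the Used count of allocated block i.
func fragmentationRatio(used []int, blockSize int) float64 {
	if len(used) == 0 {
		return 0 // nothing allocated, nothing wasted
	}
	total := len(used) * blockSize
	occupied := 0
	for _, u := range used {
		occupied += u
	}
	return float64(total-occupied) / float64(total)
}

func main() {
	// Two blocks of 16 positions: one full, one half full -> 25% waste.
	fmt.Println(fragmentationRatio([]int{16, 8}, 16)) // prints 0.25
}
```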
type CacheProvider ¶
type CacheProvider[T tensor.Numeric] interface {
Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
Get(layer int) (*LayerKV[T], bool)
SeqLen() int
Reset()
Truncate(newSeqLen int)
}
CacheProvider is the interface implemented by the package's KV caches, including KVCache (pre-allocated), TensorCache (GPU-resident), and PagedKVCache (block-based). Attention layers use this interface to store and retrieve cached key-value tensors during generation.
type CompressedKVCache ¶ added in v1.28.0
CompressedKVCache stores key-value tensors with chunk-wise mean pooling compression. When a chunk of chunkSize tokens is full, it is compressed into a single vector by averaging (ReduceMean over the sequence axis). Recent tokens within the current chunk are stored uncompressed. Get() returns the compressed chunks concatenated with recent tokens.
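The core compression step is a mean over the sequence axis of a full chunk. The sketch below shows only that step on plain slices (`compressChunk` is an illustrative helper, not the cache's actual code, which uses engine tensors):

```go
package main

import "fmt"

// compressChunk mean-pools a full chunk of token vectors (each with
// dim elements) into a single vector, as the documented chunk-wise
// compression does with ReduceMean over the sequence axis.
func compressChunk(chunk [][]float32) []float32 {
	dim := len(chunk[0])
	out := make([]float32, dim)
	for _, vec := range chunk {
		for d, v := range vec {
			out[d] += v
		}
	}
	n := float32(len(chunk))
	for d := range out {
		out[d] /= n
	}
	return out
}

func main() {
	chunk := [][]float32{{1, 2}, {3, 6}} // two tokens, dim=2
	fmt.Println(compressChunk(chunk))    // element-wise mean: [2 4]
}
```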
func NewCompressedKVCache ¶ added in v1.28.0
func NewCompressedKVCache[T tensor.Numeric](engine compute.Engine[T], layers, heads, dim, chunkSize int) *CompressedKVCache[T]
NewCompressedKVCache creates a CompressedKVCache. layers: number of attention layers. heads: number of attention heads (reserved for future use). dim: feature dimension per token. chunkSize: number of tokens per chunk before compression.
func (*CompressedKVCache[T]) Get ¶ added in v1.28.0
func (c *CompressedKVCache[T]) Get(layer int) (*LayerKV[T], bool)
Get returns the cached key-value pair for the given layer. The returned tensors have shape [batch, numCompressedChunks + recentTokens, dim], with compressed chunks first followed by uncompressed recent tokens.
func (*CompressedKVCache[T]) NumLayers ¶ added in v1.28.0
func (c *CompressedKVCache[T]) NumLayers() int
NumLayers returns the number of layers in the cache.
func (*CompressedKVCache[T]) Reset ¶ added in v1.28.0
func (c *CompressedKVCache[T]) Reset()
Reset clears all cached data.
func (*CompressedKVCache[T]) SeqLen ¶ added in v1.28.0
func (c *CompressedKVCache[T]) SeqLen() int
SeqLen returns the total number of tokens stored (compressed + recent). Compressed chunks each represent chunkSize tokens; recent tokens are counted directly.
func (*CompressedKVCache[T]) Truncate ¶ added in v1.28.0
func (c *CompressedKVCache[T]) Truncate(newSeqLen int)
Truncate reduces the cache to newSeqLen original tokens. Compressed chunks that fall entirely within newSeqLen are kept; the recent buffer is trimmed to cover the remainder. If newSeqLen falls in the middle of a compressed chunk, that chunk and all subsequent data are discarded (lossy truncation).
func (*CompressedKVCache[T]) Update ¶ added in v1.28.0
func (c *CompressedKVCache[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
Update appends new key and value tensors for the given layer. Tensors must have shape [batch, seq_len, dim]. When the current chunk fills up, it is compressed via mean pooling and moved to the compressed store.
type EAGLEForwardFunc ¶ added in v1.29.0
type EAGLEForwardFunc[T tensor.Numeric] func(ctx context.Context, input *tensor.TensorNumeric[T]) (*EAGLEForwardResult[T], error)
EAGLEForwardFunc runs a target model forward pass and returns both logits and penultimate layer hidden states. The caller is responsible for wiring this to capture the output of the second-to-last transformer layer.
type EAGLEForwardResult ¶ added in v1.29.0
type EAGLEForwardResult[T tensor.Numeric] struct {
Logits *tensor.TensorNumeric[T] // [1, seqLen, vocabSize]
PenultimateFeatures *tensor.TensorNumeric[T] // [1, seqLen, hiddenDim]
}
EAGLEForwardResult holds both the logits and penultimate hidden states returned by a target model forward pass for EAGLE speculative decoding.
type EAGLEGenerator ¶ added in v1.29.0
EAGLEGenerator implements EAGLE-style self-speculative decoding. It uses the target model for verification and a lightweight EAGLEHead to draft tokens from the target's penultimate layer features — no separate draft model is needed.
The decode loop:
- Run target forward, capture penultimate features and logits.
- Feed penultimate features to EAGLEHead for N draft tokens.
- Verify all N draft tokens in a single batched target forward pass.
- Accept the matching prefix, reject the rest.
- Adaptively adjust N based on acceptance rate.
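The adaptive step in the loop above (and the thresholds documented for SpeculativeGenerator) can be sketched as a small tracker. The window size of 8 and the bounds on N are illustrative choices, not the package's actual values.

```go
package main

import "fmt"

// adaptiveDraft tracks a rolling acceptance rate and adjusts the draft
// length N: grow when the rolling rate exceeds 0.8, shrink when it
// drops below 0.4 (the documented thresholds).
type adaptiveDraft struct {
	n       int       // current draft length N
	history []float64 // recent per-step acceptance rates
}

// record logs one verify step (accepted of proposed tokens matched)
// and adjusts N based on the rolling average.
func (a *adaptiveDraft) record(accepted, proposed int) {
	a.history = append(a.history, float64(accepted)/float64(proposed))
	if len(a.history) > 8 { // rolling window
		a.history = a.history[1:]
	}
	sum := 0.0
	for _, r := range a.history {
		sum += r
	}
	switch avg := sum / float64(len(a.history)); {
	case avg > 0.8 && a.n < 16:
		a.n++
	case avg < 0.4 && a.n > 1:
		a.n--
	}
}

func main() {
	a := &adaptiveDraft{n: 4}
	a.record(4, 4) // perfect acceptance -> grow
	fmt.Println(a.n)
	a.record(0, 5)
	a.record(0, 5) // repeated rejections drag the average down -> shrink
	fmt.Println(a.n)
}
```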
func NewEAGLEGenerator ¶ added in v1.29.0
func NewEAGLEGenerator[T tensor.Numeric](
forwardFn EAGLEForwardFunc[T],
eagleHead *core.EAGLEHead[T],
tok tokenizer.Tokenizer,
engine compute.Engine[T],
lmHeadWeight *tensor.TensorNumeric[T],
cfg ModelConfig,
draftLen int,
) *EAGLEGenerator[T]
NewEAGLEGenerator creates an EAGLE speculative generator.
Parameters:
- forwardFn: target model forward that returns logits + penultimate features
- eagleHead: lightweight FFN that predicts next hidden state from penultimate features
- tok: tokenizer for encoding prompts and decoding output
- engine: compute engine for tensor operations
- lmHeadWeight: LM head weight tensor [vocabSize, hiddenDim] for draft token projection
- cfg: model configuration
- draftLen: initial number of draft tokens per step (default 4)
func (*EAGLEGenerator[T]) Generate ¶ added in v1.29.0
func (eg *EAGLEGenerator[T]) Generate(ctx context.Context, prompt string, sc SamplingConfig) (string, error)
Generate produces text from a prompt using EAGLE speculative decoding with greedy sampling; its output is identical to vanilla greedy autoregressive decoding.
func (*EAGLEGenerator[T]) WithAdaptive ¶ added in v1.29.0
func (eg *EAGLEGenerator[T]) WithAdaptive(enabled bool) *EAGLEGenerator[T]
WithAdaptive enables or disables adaptive draft length adjustment. When enabled (default), the draft length is adjusted based on acceptance rate.
type FullBufferProvider ¶
type FullBufferProvider[T tensor.Numeric] interface {
// GetFullBuffer returns GPU-backed KV tensors spanning the full
// pre-allocated buffer (maxSeqLen capacity) for the given layer.
// Shape is [batch, maxSeqLen, dim]. Returns nil if the layer is
// CPU-backed or not yet initialized.
GetFullBuffer(layer int) (k, v *tensor.TensorNumeric[T])
// MaxSeqLen returns the maximum sequence length (buffer capacity).
MaxSeqLen() int
// KVSeqLenPtr returns the device pointer to the GPU-resident int32
// KV sequence length counter. Returns nil if not allocated.
KVSeqLenPtr() unsafe.Pointer
}
FullBufferProvider is an optional interface for caches that support fixed-size (maxSeqLen) KV buffer access. This enables CUDA graph capture for the decode attention loop: the FlashAttentionDecode kernel reads the actual KV length from a GPU-resident counter (KVSeqLenPtr), so tensor shapes stay fixed across graph replays.
type GPUAllocator ¶
type GPUAllocator interface {
// Alloc allocates size bytes of device memory and returns a device pointer.
Alloc(size int) (unsafe.Pointer, error)
// Free releases device memory previously returned by Alloc.
Free(ptr unsafe.Pointer) error
// Memcpy copies size bytes between host and device memory.
// kind follows the gpuapi convention: 0 = HostToDevice, 1 = DeviceToHost.
Memcpy(dst, src unsafe.Pointer, size int, kind int) error
}
GPUAllocator abstracts GPU memory operations so that GPUKVCache can be tested without a real GPU device. Production code passes a thin wrapper around gpuapi.Runtime; tests supply a mock.
type GPUKVCache ¶
type GPUKVCache struct {
// contains filtered or unexported fields
}
GPUKVCache manages GPU-resident key/value buffers for all attention layers during megakernel inference. Memory is allocated once at construction and reused across generation steps.
func NewGPUKVCache ¶
func NewGPUKVCache(alloc GPUAllocator, numLayers, maxSeqLen, numHeads, headDim int) (*GPUKVCache, error)
NewGPUKVCache allocates GPU buffers for numLayers attention layers. Each layer gets two buffers (K and V) of size maxSeqLen * numHeads * headDim float32 elements.
func (*GPUKVCache) Append ¶
func (c *GPUKVCache) Append(layerIdx int, k, v []float32, seqPos int) error
Append copies new K/V float32 data to the correct position in the GPU buffer for the given layer. k and v must each have length numHeads * headDim (one token's worth of data). seqPos is the sequence position to write at; it must equal the current seqLen (enforcing sequential append).
func (*GPUKVCache) AppendGPU ¶
func (c *GPUKVCache) AppendGPU(layerIdx int, kSrc, vSrc unsafe.Pointer, stream unsafe.Pointer) error
AppendGPU copies one token's K/V data from GPU-resident src pointers into the KV cache using the offset_memcpy kernel. The kernel reads gpuCounter on the GPU to compute the write offset, eliminating any D2H copy per token. After writing K and V for the last layer, it increments the GPU counter via the increment_counter kernel and advances the CPU seqLen for compatibility.
kSrc and vSrc must each point to numHeads*headDim float32 values on the GPU. stream is the CUDA stream for async execution.
func (*GPUKVCache) Close ¶
func (c *GPUKVCache) Close() error
Close frees all GPU memory held by the cache. The cache must not be used after Close is called.
func (*GPUKVCache) DevicePointerArrays ¶
func (c *GPUKVCache) DevicePointerArrays() (kPtrs, vPtrs unsafe.Pointer, err error)
DevicePointerArrays returns GPU-resident arrays of float* pointers for K and V buffers across all layers. These can be passed directly to the megakernel. The arrays are allocated once and cached.
func (*GPUKVCache) GPUCounterPtr ¶
func (c *GPUKVCache) GPUCounterPtr() unsafe.Pointer
GPUCounterPtr returns the device pointer to the GPU-resident int32 position counter. Kernels (offset_memcpy, rope_select, increment_counter) use this pointer to read/write the current sequence position on the GPU, enabling CUDA graph capture of the decode loop.
func (*GPUKVCache) Pointers ¶
func (c *GPUKVCache) Pointers(layerIdx int) (kPtr, vPtr unsafe.Pointer, seqLen int)
Pointers returns the device pointers for the K and V buffers of the given layer, along with the current sequence length. The megakernel reads from these pointers directly.
func (*GPUKVCache) Reset ¶
func (c *GPUKVCache) Reset()
Reset resets the sequence position to zero without freeing GPU memory. Buffers are reused for the next generation. The GPU counter is also zeroed so that GPU-side kernels see the reset position.
func (*GPUKVCache) SeqLen ¶
func (c *GPUKVCache) SeqLen() int
SeqLen returns the current cached sequence length.
func (*GPUKVCache) SyncCounterFromGPU ¶
func (c *GPUKVCache) SyncCounterFromGPU() error
SyncCounterFromGPU performs a D2H copy of the GPU counter to update the CPU seqLen. Call this after the decode loop completes, not per token.
type Generator ¶
Generator produces text autoregressively using a loaded model graph.
func NewGenerator ¶
func NewGenerator[T tensor.Numeric](
g *graph.Graph[T],
tok tokenizer.Tokenizer,
eng compute.Engine[T],
cfg ModelConfig,
opts ...GeneratorOption,
) *Generator[T]
NewGenerator creates a Generator from a model graph, tokenizer, engine, and config.
func (*Generator[T]) BatchGenerate ¶
func (gen *Generator[T]) BatchGenerate(ctx context.Context, requests []BatchRequest) []BatchResult
BatchGenerate runs multiple generation requests concurrently. Each request gets its own KV cache and sampling state. This provides throughput gains when the model graph is configured with WithParallel(true) or when generation is I/O bound.
For true batched tensor operations (batch dimension > 1 in a single forward pass), the model graph and attention layers need native batch support, which is not yet implemented. This function provides request-level parallelism as an interim solution.
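Request-level parallelism of this kind boils down to one goroutine per request with results collected in request order. A minimal sketch, with a stub `generate` function standing in for the real per-request decode loop:

```go
package main

import (
	"fmt"
	"sync"
)

// runBatch executes one generation function per prompt concurrently
// and returns results in request order. Each goroutine writes only
// its own slot of the results slice, so no extra locking is needed.
func runBatch(prompts []string, generate func(string) string) []string {
	results := make([]string, len(prompts))
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			results[i] = generate(p)
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	out := runBatch([]string{"a", "b"}, func(p string) string { return p + "!" })
	fmt.Println(out) // results in request order
}
```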
func (*Generator[T]) BatchGenerateStream ¶
func (gen *Generator[T]) BatchGenerateStream(ctx context.Context, requests []BatchRequest, streams []TokenStream) []error
BatchGenerateStream runs multiple streaming generation requests concurrently. Each request gets its own KV cache, sampling state, and token stream.
func (*Generator[T]) Config ¶
func (gen *Generator[T]) Config() ModelConfig
Config returns the model configuration.
func (*Generator[T]) EAGLEEnabled ¶ added in v1.29.0
EAGLEEnabled reports whether EAGLE speculative decoding should be used. It returns true only when a weights path was configured and the file exists.
func (*Generator[T]) EAGLEWeightsPath ¶ added in v1.29.0
EAGLEWeightsPath returns the configured EAGLE head weights path, or empty if EAGLE was not requested.
func (*Generator[T]) Generate ¶
func (gen *Generator[T]) Generate(ctx context.Context, prompt string, sc SamplingConfig) (string, error)
Generate produces text from a prompt using the given sampling configuration. It tokenizes the prompt, runs the autoregressive loop with KV caching, and returns the generated text (excluding the prompt).
When WithSpeculativeDraft is configured, Generate uses speculative decoding: the draft model proposes draftLen tokens, and the target model verifies them in one forward pass. If the rolling acceptance rate (alpha) drops below 0.4, generation falls back to standard autoregressive decoding.
func (*Generator[T]) GenerateStream ¶
func (gen *Generator[T]) GenerateStream(ctx context.Context, prompt string, sc SamplingConfig, stream TokenStream) error
GenerateStream produces text from a prompt, delivering each token to the stream as it is generated. The final output matches what Generate would return.
func (*Generator[T]) GetPrefixCache ¶ added in v1.5.0
func (gen *Generator[T]) GetPrefixCache() *PrefixCache[T]
GetPrefixCache returns the prefix cache, or nil if prefix caching is disabled.
func (*Generator[T]) NewSession ¶ added in v1.4.0
func (gen *Generator[T]) NewSession() *InferenceSession[T]
NewSession creates a new InferenceSession with its own KV cache. The session shares the Generator's graph, tokenizer, and engine but maintains independent KV cache state for isolation.
type GeneratorOption ¶
type GeneratorOption func(*generatorOptions)
GeneratorOption configures a Generator.
func WithCompressedKV ¶ added in v1.28.0
func WithCompressedKV(chunkSize int) GeneratorOption
WithCompressedKV enables compressed KV caching using chunk-wise mean pooling. When a chunk of chunkSize tokens fills up, it is compressed into a single vector by averaging. If chunkSize <= 0, it defaults to 64.
func WithEAGLE ¶ added in v1.29.0
func WithEAGLE(headWeightsPath string) GeneratorOption
WithEAGLE enables EAGLE-style self-speculative decoding. headWeightsPath points to a GGUF file containing the EAGLE head weights. When the file exists at generation time the generator uses the EAGLE decode loop; if the file is missing or the path is empty, it falls back to vanilla autoregressive decoding.
func WithGeneratorKVDtype ¶
func WithGeneratorKVDtype(dtype string) GeneratorOption
WithGeneratorKVDtype sets the KV cache storage dtype. Supported: "fp32" (default), "fp16", "q4", "q3".
func WithMetrics ¶ added in v1.5.0
func WithMetrics(c runtime.Collector) GeneratorOption
WithMetrics attaches a metrics collector to the generator. When speculative decoding is active, the generator updates a "speculative_acceptance_rate" gauge after each verify step.
func WithPagedKV ¶
func WithPagedKV(maxMemoryMB, headDim int) GeneratorOption
WithPagedKV enables paged KV caching with the given memory budget in MB. When enabled, the Generator allocates blocks from a shared BlockPool instead of pre-allocating the full maxSeqLen per sequence. headDim is the per-position storage size: for GQA models pass numKVHeads * actualHeadDim so the pool can store all KV heads per position.
func WithPrefixCache ¶ added in v1.5.0
func WithPrefixCache(capacityBlocks int) GeneratorOption
WithPrefixCache enables prefix caching with the given capacity in blocks. When enabled and paged KV is active, sessions that share the same system prompt prefix reuse cached KV blocks instead of re-running prefill.
func WithSpeculativeDraft ¶ added in v1.5.0
func WithSpeculativeDraft[T tensor.Numeric](draftGraph *graph.Graph[T], draftCfg ModelConfig, draftLen int) GeneratorOption
WithSpeculativeDraft enables speculative decoding using a separate draft model graph. The draft model proposes draftLen tokens greedily per step, then the target model verifies them in a single batched forward pass. If the rolling acceptance rate drops below 0.4, generation falls back to standard autoregressive decoding for the remainder.
func WithTieredKV ¶ added in v1.36.0
func WithTieredKV(cfg TieredKVStoreConfig) GeneratorOption
WithTieredKV enables tiered KV caching with hot/warm/cold storage tiers. When enabled, a TieredKVStore is created per generation call using the provided configuration. NumLayers and MaxSeqLen are filled from the model config if left at zero.
type InferenceSession ¶ added in v1.4.0
InferenceSession holds per-session state for independent, concurrent inference. Each session owns its own KV cache and position tracking, allowing multiple sessions to generate simultaneously without data races.
func (*InferenceSession[T]) Cache ¶ added in v1.4.0
func (s *InferenceSession[T]) Cache() CacheProvider[T]
Cache returns the session's KV cache provider.
func (*InferenceSession[T]) Generate ¶ added in v1.4.0
func (s *InferenceSession[T]) Generate(ctx context.Context, prompt string, sc SamplingConfig) (string, error)
Generate produces text from a prompt using the session's own KV cache. Multiple sessions can Generate concurrently without data races, though calls within a single session are serialized.
func (*InferenceSession[T]) GenerateStream ¶ added in v1.4.0
func (s *InferenceSession[T]) GenerateStream(ctx context.Context, prompt string, sc SamplingConfig, stream TokenStream) error
GenerateStream produces text from a prompt using the session's own KV cache, delivering each token to the stream as it is generated.
type KVCache ¶
KVCache stores key-value tensors for all attention layers during autoregressive generation. Buffers are pre-allocated to maxSeqLen on first Update, and subsequent Updates copy data at the cursor position with zero allocation.
func NewKVCache ¶
NewKVCache creates a KVCache for the specified number of layers and maximum sequence length. Backing buffers are lazily allocated on the first Update call for each layer (when batch and dim become known).
func (*KVCache[T]) Get ¶
Get returns the cached key-value pair for the given layer as tensors covering [0:cursor] on the sequence axis. For batch=1, the returned tensors are zero-copy views over the pre-allocated buffer. For batch>1, data is compacted into a contiguous slice. Returns false if the layer has not been populated yet.
func (*KVCache[T]) Reset ¶
func (c *KVCache[T]) Reset()
Reset clears all cached data and resets cursors to zero. The pre-allocated buffers are retained for reuse.
func (*KVCache[T]) SeqLen ¶
SeqLen returns the current cached sequence length. Returns 0 if the cache is empty.
func (*KVCache[T]) Truncate ¶
Truncate rolls back the cache to the given sequence length. If newSeqLen >= current SeqLen, this is a no-op.
func (*KVCache[T]) Update ¶
func (c *KVCache[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
Update appends new key and value tensors to the cache for the given layer. Tensors are expected to have shape [batch, seq_len, dim]. Data is copied into the pre-allocated buffer at the current cursor position. After the initial allocation, Update performs zero heap allocations.
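The cursor mechanics can be sketched on plain slices for a single layer and batch=1. `layerBuf` is an illustrative type, not the package's internal representation:

```go
package main

import "fmt"

// layerBuf is a minimal cursor-based KV buffer for one layer: storage
// sized to maxSeqLen*dim is allocated once, appends copy data at the
// cursor with no further allocation, and views slice the buffer up to
// the cursor (zero-copy for batch=1).
type layerBuf struct {
	data   []float32 // [maxSeqLen * dim], allocated once
	dim    int
	cursor int // tokens written so far
}

func newLayerBuf(maxSeqLen, dim int) *layerBuf {
	return &layerBuf{data: make([]float32, maxSeqLen*dim), dim: dim}
}

// append copies new token vectors at the cursor position and advances it.
func (b *layerBuf) append(newData []float32) {
	copy(b.data[b.cursor*b.dim:], newData)
	b.cursor += len(newData) / b.dim
}

// view returns a zero-copy slice over the populated region [0:cursor].
func (b *layerBuf) view() []float32 {
	return b.data[:b.cursor*b.dim]
}

func main() {
	b := newLayerBuf(8, 2)
	b.append([]float32{1, 2, 3, 4}) // two tokens (prefill)
	b.append([]float32{5, 6})       // one more (decode step)
	fmt.Println(b.cursor, b.view())
}
```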
type KVCacheFP8 ¶ added in v1.8.0
type KVCacheFP8 struct {
// contains filtered or unexported fields
}
KVCacheFP8 stores key-value tensors for all attention layers using FP8 E4M3 storage, reducing memory by ~4x compared to float32 and ~2x compared to FP16. On Update, float32 values are quantized to FP8; on Get, FP8 values are dequantized back to float32.
FP8 E4M3 has lower precision than FP16 (~1.5 decimal digits vs ~3) but the perplexity impact is typically within 0.5 for attention KV values.
func NewKVCacheFP8 ¶ added in v1.8.0
func NewKVCacheFP8(numLayers, maxSeqLen int) *KVCacheFP8
NewKVCacheFP8 creates a KVCacheFP8 for the specified number of layers and maximum sequence length. FP8 backing buffers are lazily allocated on the first Update call for each layer.
func (*KVCacheFP8) Get ¶ added in v1.8.0
func (c *KVCacheFP8) Get(layer int) (*LayerKV[float32], bool)
Get returns the cached key-value pair for the given layer as float32 tensors covering [0:cursor] on the sequence axis. FP8 data is dequantized to float32 on the fly. Returns false if the layer has not been populated yet.
func (*KVCacheFP8) NumLayers ¶ added in v1.8.0
func (c *KVCacheFP8) NumLayers() int
NumLayers returns the number of layers in the cache.
func (*KVCacheFP8) Reset ¶ added in v1.8.0
func (c *KVCacheFP8) Reset()
Reset clears all cached data and resets cursors to zero. The pre-allocated FP8 buffers are retained for reuse.
func (*KVCacheFP8) SeqLen ¶ added in v1.8.0
func (c *KVCacheFP8) SeqLen() int
SeqLen returns the current cached sequence length. Returns 0 if the cache is empty.
func (*KVCacheFP8) Truncate ¶ added in v1.8.0
func (c *KVCacheFP8) Truncate(newSeqLen int)
Truncate rolls back the cache to the given sequence length. If newSeqLen >= current SeqLen, this is a no-op.
func (*KVCacheFP8) Update ¶ added in v1.8.0
func (c *KVCacheFP8) Update(layer int, newK, newV *tensor.TensorNumeric[float32]) error
Update appends new key and value float32 tensors to the FP8 cache for the given layer. Tensors are expected to have shape [batch, seq_len, dim]. Data is converted from float32 to FP8 and copied into the pre-allocated buffer at the current cursor position.
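The E4M3 quantize/dequantize round trip described above can be sketched as follows (a simplified conversion with hypothetical helper names; no NaN handling, and inputs are clamped to the E4M3 maximum normal magnitude of 448):

```go
package main

import (
	"fmt"
	"math"
)

// encodeE4M3 converts a float32 to an 8-bit E4M3 value (1 sign, 4 exponent,
// 3 mantissa bits, exponent bias 7).
func encodeE4M3(f float32) byte {
	var sign byte
	if f < 0 {
		sign = 0x80
		f = -f
	}
	if f == 0 {
		return sign
	}
	if f > 448 {
		f = 448 // clamp to max normal
	}
	frac, exp := math.Frexp(float64(f)) // f = frac * 2^exp, frac in [0.5, 1)
	e := exp - 1 + 7                    // biased exponent for 1.m * 2^(exp-1)
	m := int(math.Round((frac*2 - 1) * 8))
	if m == 8 { // mantissa rounding overflowed into the exponent
		m, e = 0, e+1
	}
	if e <= 0 { // subnormal: value = m/8 * 2^-6
		m = int(math.Round(float64(f) / math.Ldexp(1, -6) * 8))
		e = 0
	}
	return sign | byte(e)<<3 | byte(m)
}

func decodeE4M3(b byte) float32 {
	sign := float64(1)
	if b&0x80 != 0 {
		sign = -1
	}
	e, m := int(b>>3&0x0F), float64(b&0x07)
	if e == 0 { // subnormal
		return float32(sign * m / 8 * math.Ldexp(1, -6))
	}
	return float32(sign * (1 + m/8) * math.Ldexp(1, e-7))
}

func main() {
	// These values are exactly representable in E4M3 and survive the round trip.
	for _, f := range []float32{1.5, -0.25, 448, 0} {
		fmt.Println(decodeE4M3(encodeE4M3(f)))
	}
}
```

The quantized byte occupies a quarter of a float32, which is where the ~4x memory reduction comes from.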
type KVCacheFP16 ¶ added in v1.7.0
type KVCacheFP16 struct {
// contains filtered or unexported fields
}
KVCacheFP16 stores key-value tensors for all attention layers using FP16 storage, halving the memory bandwidth compared to float32. On Update, float32 values are converted to FP16; on Get, FP16 values are converted back to float32.
This is a drop-in replacement for KVCache[float32] with 2x bandwidth reduction at the cost of slight precision loss (FP16 has ~3 decimal digits of precision).
func NewKVCacheFP16 ¶ added in v1.7.0
func NewKVCacheFP16(numLayers, maxSeqLen int) *KVCacheFP16
NewKVCacheFP16 creates a KVCacheFP16 for the specified number of layers and maximum sequence length. FP16 backing buffers are lazily allocated on the first Update call for each layer.
func (*KVCacheFP16) Get ¶ added in v1.7.0
func (c *KVCacheFP16) Get(layer int) (*LayerKV[float32], bool)
Get returns the cached key-value pair for the given layer as float32 tensors covering [0:cursor] on the sequence axis. FP16 data is decoded to float32 on the fly. Returns false if the layer has not been populated yet.
func (*KVCacheFP16) NumLayers ¶ added in v1.7.0
func (c *KVCacheFP16) NumLayers() int
NumLayers returns the number of layers in the cache.
func (*KVCacheFP16) Reset ¶ added in v1.7.0
func (c *KVCacheFP16) Reset()
Reset clears all cached data and resets cursors to zero. The pre-allocated FP16 buffers are retained for reuse.
func (*KVCacheFP16) SeqLen ¶ added in v1.7.0
func (c *KVCacheFP16) SeqLen() int
SeqLen returns the current cached sequence length. Returns 0 if the cache is empty.
func (*KVCacheFP16) Truncate ¶ added in v1.7.0
func (c *KVCacheFP16) Truncate(newSeqLen int)
Truncate rolls back the cache to the given sequence length. If newSeqLen >= current SeqLen, this is a no-op.
func (*KVCacheFP16) Update ¶ added in v1.7.0
func (c *KVCacheFP16) Update(layer int, newK, newV *tensor.TensorNumeric[float32]) error
Update appends new key and value float32 tensors to the FP16 cache for the given layer. Tensors are expected to have shape [batch, seq_len, dim]. Data is converted from float32 to FP16 and copied into the pre-allocated buffer at the current cursor position.
type KVCacheQ3 ¶ added in v1.29.0
type KVCacheQ3 struct {
// contains filtered or unexported fields
}
KVCacheQ3 stores key-value tensors for all attention layers using 3-bit non-uniform codebook quantization. Each group of q3GroupSize elements gets 8 centroids computed via sensitivity-weighted k-means, where larger-magnitude values receive higher weight. This is a KVQuant-style approach that quantizes keys pre-RoPE to preserve rotary position information. Memory reduction is ~6.4x compared to float32.
func NewKVCacheQ3 ¶ added in v1.29.0
NewKVCacheQ3 creates a KVCacheQ3 for the specified number of layers and maximum sequence length. Q3 backing buffers are lazily allocated on the first Update call for each layer.
func (*KVCacheQ3) Get ¶ added in v1.29.0
Get returns the cached key-value pair for the given layer as float32 tensors covering [0:cursor] on the sequence axis. Q3 data is dequantized to float32 on the fly via codebook lookup. Returns false if the layer has not been populated yet.
func (*KVCacheQ3) Reset ¶ added in v1.29.0
func (c *KVCacheQ3) Reset()
Reset clears all cached data and resets cursors to zero. The pre-allocated Q3 buffers are retained for reuse.
func (*KVCacheQ3) SeqLen ¶ added in v1.29.0
SeqLen returns the current cached sequence length. Returns 0 if the cache is empty.
func (*KVCacheQ3) Truncate ¶ added in v1.29.0
Truncate rolls back the cache to the given sequence length. If newSeqLen >= current SeqLen, this is a no-op.
func (*KVCacheQ3) Update ¶ added in v1.29.0
Update appends new key and value float32 tensors to the Q3 cache for the given layer. Tensors are expected to have shape [batch, seq_len, dim]. Data is converted from float32 to Q3 via sensitivity-weighted k-means codebook quantization and stored in the pre-allocated buffer at the current cursor position.
type KVCacheQ4 ¶ added in v1.28.0
type KVCacheQ4 struct {
// contains filtered or unexported fields
}
KVCacheQ4 stores key-value tensors for all attention layers using 4-bit group quantization, reducing memory by ~8x compared to float32. On Update, float32 values are quantized to Q4 with per-group (group_size=128) absmax scaling. On Get, Q4 values are dequantized back to float32.
func NewKVCacheQ4 ¶ added in v1.28.0
NewKVCacheQ4 creates a KVCacheQ4 for the specified number of layers and maximum sequence length. Q4 backing buffers are lazily allocated on the first Update call for each layer.
func (*KVCacheQ4) Get ¶ added in v1.28.0
Get returns the cached key-value pair for the given layer as float32 tensors covering [0:cursor] on the sequence axis. Q4 data is dequantized to float32 on the fly. Returns false if the layer has not been populated yet.
func (*KVCacheQ4) Reset ¶ added in v1.28.0
func (c *KVCacheQ4) Reset()
Reset clears all cached data and resets cursors to zero. The pre-allocated Q4 buffers are retained for reuse.
func (*KVCacheQ4) SeqLen ¶ added in v1.28.0
SeqLen returns the current cached sequence length. Returns 0 if the cache is empty.
func (*KVCacheQ4) Truncate ¶ added in v1.28.0
Truncate rolls back the cache to the given sequence length. If newSeqLen >= current SeqLen, this is a no-op.
type LayerKV ¶
type LayerKV[T tensor.Numeric] struct {
	Key   *tensor.TensorNumeric[T]
	Value *tensor.TensorNumeric[T]
}
type LayerKV[T tensor.Numeric] struct {
	Key   *tensor.TensorNumeric[T]
	Value *tensor.TensorNumeric[T]
}
LayerKV holds the cached key and value tensors for a single attention layer.
type ModelConfig ¶
type ModelConfig struct {
VocabSize int // Total tokens in vocabulary
MaxSeqLen int // Maximum sequence length the model supports
EOSTokenID int // End-of-sequence token ID
BOSTokenID int // Beginning-of-sequence token ID
NumLayers int // Number of transformer layers (for KV cache sizing)
}
ModelConfig holds model architecture parameters needed for generation.
type PagedKVCache ¶
PagedKVCache stores key-value tensors for autoregressive generation using block-level allocation from a BlockPool. Instead of pre-allocating the full maxSeqLen per sequence, blocks of blockSize tokens are allocated on demand, reducing memory waste for concurrent sequences of varying length.
Each sequence gets its own PagedKVCache. The cache accepts tensors with an arbitrary first dimension (channels). GQA attention stores KV as [batchSize*numKVHeads, seqLen, headDim]; the pool's headDim must equal channels * dim to accommodate the full per-position data.
func NewPagedKVCache ¶
func NewPagedKVCache[T tensor.Numeric](pool *BlockPool[T], numLayers int) *PagedKVCache[T]
NewPagedKVCache creates a paged KV cache backed by the given block pool.
func (*PagedKVCache[T]) Append ¶
func (c *PagedKVCache[T]) Append(layer int, newK, newV *tensor.TensorNumeric[T]) error
Append writes new key and value data for the given layer. The tensors must have shape [channels, seqLen, dim] where channels*dim equals the pool's headDim. For standard caching channels=1; for GQA caching channels equals batchSize*numKVHeads. Data is written into the current block; a new block is allocated from the pool when the current one fills up.
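The on-demand block allocation behind Append can be sketched with a free list and a block table (a simplified single-layer, channels=1 model with hypothetical names):

```go
package main

import "fmt"

// pagedKV sketches block-level allocation: instead of one flat buffer of
// maxSeqLen positions, fixed-size blocks are taken from a free list on
// demand, so short sequences hold only the blocks they actually fill.
type pagedKV struct {
	blockSize int
	dim       int
	free      [][]float32 // pool of recycled blocks
	table     [][]float32 // block table for this sequence
	seqLen    int
}

// appendPos writes dim values for one token position, allocating a new
// block from the pool whenever the current block fills up.
func (c *pagedKV) appendPos(kv []float32) {
	off := c.seqLen % c.blockSize
	if off == 0 { // current block full (or first write): grab a block
		var b []float32
		if n := len(c.free); n > 0 {
			b, c.free = c.free[n-1], c.free[:n-1]
		} else {
			b = make([]float32, c.blockSize*c.dim)
		}
		c.table = append(c.table, b)
	}
	copy(c.table[len(c.table)-1][off*c.dim:], kv)
	c.seqLen++
}

func main() {
	c := &pagedKV{blockSize: 4, dim: 2}
	for i := 0; i < 9; i++ { // 9 positions => ceil(9/4) = 3 blocks
		c.appendPos([]float32{float32(i), float32(-i)})
	}
	fmt.Println(len(c.table), c.seqLen) // 3 9
}
```

Freeing a sequence returns its blocks to the free list, which is how concurrent sequences of varying length avoid reserving maxSeqLen each.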
func (*PagedKVCache[T]) BlockTable ¶ added in v1.5.0
func (c *PagedKVCache[T]) BlockTable() []*Block[T]
BlockTable returns the cache's current block table. This is used by the prefix cache to snapshot blocks after prefill for caching.
func (*PagedKVCache[T]) Free ¶
func (c *PagedKVCache[T]) Free()
Free returns all allocated blocks to the pool and resets the cache.
func (*PagedKVCache[T]) Get ¶
func (c *PagedKVCache[T]) Get(layer int) (*LayerKV[T], bool)
Get returns the cached KV for the given layer. This is an alias for GetKV that satisfies the CacheProvider interface.
func (*PagedKVCache[T]) GetKV ¶
func (c *PagedKVCache[T]) GetKV(layer int) (*LayerKV[T], bool)
GetKV returns the cached key and value tensors for the given layer, gathered into contiguous [channels, seqLen, dim] tensors. Returns false if the layer is out of range or the cache is empty for that layer.
func (*PagedKVCache[T]) InjectBlocks ¶ added in v1.5.0
func (c *PagedKVCache[T]) InjectBlocks(blocks []*Block[T], seqLen int)
InjectBlocks sets the cache's block table to the given pre-populated blocks and advances all layer cursors to seqLen. This is used by the prefix cache to inject cached KV data without running a forward pass.
func (*PagedKVCache[T]) NumLayers ¶
func (c *PagedKVCache[T]) NumLayers() int
NumLayers returns the number of layers in the cache.
func (*PagedKVCache[T]) Reset ¶
func (c *PagedKVCache[T]) Reset()
Reset clears the cache and returns all blocks to the pool.
func (*PagedKVCache[T]) SeqLen ¶
func (c *PagedKVCache[T]) SeqLen() int
SeqLen returns the number of token positions stored in the cache, based on layer 0's cursor. Returns 0 if the cache is empty.
func (*PagedKVCache[T]) Truncate ¶
func (c *PagedKVCache[T]) Truncate(newSeqLen int)
Truncate rolls back the cache to the given sequence length. Blocks beyond the new length are returned to the pool.
func (*PagedKVCache[T]) Update ¶
func (c *PagedKVCache[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
Update appends new key and value data for the given layer. This is an alias for Append that satisfies the CacheProvider interface.
type PrefixCache ¶ added in v1.5.0
PrefixCache wraps a radix tree to cache KV blocks for shared prompt prefixes. When multiple sessions share the same system prompt, the second session can skip prefill for the prefix by copying cached block data instead of running the forward pass. PrefixCache is safe for concurrent use.
func NewPrefixCache ¶ added in v1.5.0
func NewPrefixCache[T tensor.Numeric](capacity int, pool *BlockPool[T]) *PrefixCache[T]
NewPrefixCache creates a prefix cache that stores up to capacity KV blocks in a radix tree. The pool is used to allocate blocks when inserting cached prefix data.
func (*PrefixCache[T]) Insert ¶ added in v1.5.0
func (pc *PrefixCache[T]) Insert(tokenIDs []int32, blocks []*Block[T])
Insert stores the KV blocks associated with a token prefix in the cache. The block data is copied into kv.Block[T] instances owned by the radix tree.
func (*PrefixCache[T]) Match ¶ added in v1.5.0
func (pc *PrefixCache[T]) Match(prefix []int32) ([]*Block[T], int)
Match returns the cached blocks for the longest matching prefix and the number of tokens matched. The returned blocks are freshly allocated from the pool with data copied from the cache, so the caller owns them. Returns nil, 0 if no prefix matches or block allocation fails.
func (*PrefixCache[T]) Size ¶ added in v1.5.0
func (pc *PrefixCache[T]) Size() int
Size returns the number of blocks currently cached in the tree.
type RadixCache ¶ added in v1.28.0
type RadixCache struct {
// contains filtered or unexported fields
}
RadixCache implements a hash-based radix tree for KV block prefix matching. Token sequences are divided into fixed-size blocks, each hashed with FNV-1a. Tree traversal matches one block hash per level, giving O(prefix_length / blockSize) matching complexity. LRU eviction removes the coldest leaf when the block pool is exhausted.
RadixCache is safe for concurrent use.
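The chained block-hashing idea can be sketched with the standard library's FNV-1a (an illustrative model with hypothetical names; the package's exact hashing scheme is internal):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// blockHashes chains FNV-1a over fixed-size token blocks: each block's hash
// also covers the previous block's hash, so equal hashes at level i imply
// (with high probability) equal prefixes of i+1 blocks.
func blockHashes(tokens []int, blockSize int) []uint64 {
	var hashes []uint64
	var prev uint64
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		h := fnv.New64a()
		binary.Write(h, binary.LittleEndian, prev)
		for _, t := range tokens[i : i+blockSize] {
			binary.Write(h, binary.LittleEndian, int64(t))
		}
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

// matchLen compares two hash chains and returns the number of matched
// tokens: one comparison per block rather than per token.
func matchLen(cached, query []uint64, blockSize int) int {
	n := 0
	for n < len(cached) && n < len(query) && cached[n] == query[n] {
		n++
	}
	return n * blockSize
}

func main() {
	cached := blockHashes([]int{1, 2, 3, 4, 5, 6, 7, 8}, 4)
	query := blockHashes([]int{1, 2, 3, 4, 9, 9, 9, 9}, 4)
	fmt.Println(matchLen(cached, query, 4)) // 4
}
```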
func NewRadixCache ¶ added in v1.28.0
func NewRadixCache(blockSize, maxBlocks int) *RadixCache
NewRadixCache creates a radix cache that hashes token sequences in chunks of blockSize tokens and stores up to maxBlocks block entries.
func (*RadixCache) Evict ¶ added in v1.28.0
func (rc *RadixCache) Evict()
Evict removes the least-recently-used leaf node and frees its block.
func (*RadixCache) Insert ¶ added in v1.28.0
func (rc *RadixCache) Insert(tokens []int) []int
Insert divides tokens into blocks of blockSize, hashes each block, and inserts them into the tree. Returns the block IDs assigned to each block. Partial trailing blocks (len < blockSize) are included. When the cache is full, LRU eviction frees space before allocating new blocks.
func (*RadixCache) Match ¶ added in v1.28.0
func (rc *RadixCache) Match(tokens []int) (matchLen int, blockIDs []int)
Match finds the longest prefix match for the given token sequence. Returns the number of tokens matched (a multiple of blockSize, or the full length if the last block is partial) and the block IDs for each matched block.
func (*RadixCache) Stats ¶ added in v1.28.0
func (rc *RadixCache) Stats() (hits, misses, evictions int)
Stats returns cumulative hit, miss, and eviction counts.
type RadixNode ¶ added in v1.28.0
type RadixNode struct {
// contains filtered or unexported fields
}
RadixNode is a node in the hash-based radix tree. Each node stores a hash of the token block it represents, enabling O(1) comparison per block rather than per-token matching.
type SSMState ¶ added in v1.5.0
type SSMState[T tensor.Numeric] struct {
	States    []*tensor.TensorNumeric[T] // one per layer: [1, d_inner, d_state]
	NumLayers int
	DInner    int
	DState    int
}
SSMState holds the recurrent hidden state h_t for each MambaBlock layer. Unlike a KV cache, which grows with sequence length as O(seq_len * d_model), SSM state is O(d_state * d_inner) per layer — constant regardless of sequence length.
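The constant-memory property can be checked with a quick sizing comparison (hypothetical dimensions; four bytes per float32 element):

```go
package main

import "fmt"

// kvBytes: keys plus values, one [seqLen, dModel] pair per layer, grows
// linearly with the cached sequence length.
func kvBytes(numLayers, seqLen, dModel int) int {
	return numLayers * 2 * seqLen * dModel * 4
}

// ssmBytes: one [dInner, dState] hidden state per layer, independent of
// how many tokens have been processed.
func ssmBytes(numLayers, dInner, dState int) int {
	return numLayers * dInner * dState * 4
}

func main() {
	for _, seqLen := range []int{128, 4096} {
		fmt.Println(seqLen, kvBytes(24, seqLen, 2048), ssmBytes(24, 4096, 16))
	}
}
```

A 32x longer sequence costs 32x more KV memory while the SSM state stays at ~6 MB in this example.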
func NewSSMState ¶ added in v1.5.0
NewSSMState creates an SSMState for the specified number of layers, with each layer's hidden state initialized to zeros.
func (*SSMState[T]) GetLayer ¶ added in v1.5.0
func (s *SSMState[T]) GetLayer(i int) (*tensor.TensorNumeric[T], error)
GetLayer returns the hidden state tensor for the given layer.
func (*SSMState[T]) MemoryBytes ¶ added in v1.5.0
MemoryBytes returns the total memory used by all layer states in bytes. This is O(numLayers * dInner * dState) — independent of sequence length.
type SamplingConfig ¶
type SamplingConfig struct {
Temperature float64 // Divide logits by this value; 0 = greedy
TopK int // Keep only top K tokens; 0 = disabled
TopP float64 // Keep the smallest set of tokens whose cumulative prob >= P; 1.0 = disabled
RepetitionPenalty float64 // Penalize repeated tokens; 1.0 = disabled
MaxNewTokens int // Maximum number of tokens to generate
StopTokenIDs []int // Stop when any of these token IDs are generated
StopStrings []string // Stop when output contains any of these strings
GrammarState *grammar.Grammar // Optional grammar for constrained decoding
AdapterName string // Optional LoRA adapter name for per-request selection
// contains filtered or unexported fields
}
SamplingConfig controls how tokens are selected during generation.
func DefaultSamplingConfig ¶
func DefaultSamplingConfig() SamplingConfig
DefaultSamplingConfig returns a SamplingConfig with sensible defaults.
type SpeculativeGenerator ¶
SpeculativeGenerator implements speculative decoding using a small draft model and a large target model. The draft model proposes N tokens greedily, then the target model verifies all N in a single batched forward pass. Accepted tokens are emitted; on first mismatch the target's token is used.
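The greedy accept rule can be sketched as follows (hypothetical helper name; the real generator compares tokens against target logits produced by one batched forward pass):

```go
package main

import "fmt"

// acceptDraft keeps the prefix of draft tokens that agrees with the target
// model's argmax predictions, then substitutes the target's token at the
// first mismatch. targetArgmax has len(draft)+1 entries: one prediction per
// draft position plus one for the position after the full draft.
func acceptDraft(draft, targetArgmax []int) []int {
	out := make([]int, 0, len(draft)+1)
	for i, d := range draft {
		if d != targetArgmax[i] {
			return append(out, targetArgmax[i]) // correction token
		}
		out = append(out, d)
	}
	// All drafts accepted: the target's extra prediction comes free.
	return append(out, targetArgmax[len(draft)])
}

func main() {
	// Draft proposed 4 tokens; the target agrees on the first two.
	fmt.Println(acceptDraft([]int{10, 11, 12, 13}, []int{10, 11, 99, 13, 14}))
	// [10 11 99]
}
```

Every verification step therefore emits at least one token, so speculative decoding never falls below plain decoding throughput in tokens per target forward pass.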
func NewSpeculativeGenerator ¶
func NewSpeculativeGenerator[T tensor.Numeric](
	draftGraph, targetGraph *graph.Graph[T],
	tok tokenizer.Tokenizer,
	engine compute.Engine[T],
	draftCfg, targetCfg ModelConfig,
	draftLen int,
) *SpeculativeGenerator[T]
NewSpeculativeGenerator creates a speculative generator with separate draft and target model graphs. draftLen controls how many tokens the draft model proposes per verification step (typically 2-8).
func (*SpeculativeGenerator[T]) Generate ¶
func (sg *SpeculativeGenerator[T]) Generate(ctx context.Context, prompt string, sc SamplingConfig) (string, error)
Generate produces text from a prompt using speculative decoding with greedy sampling. The draft model proposes tokens, the target model verifies them.
func (*SpeculativeGenerator[T]) WithAdaptive ¶
func (sg *SpeculativeGenerator[T]) WithAdaptive(enabled bool) *SpeculativeGenerator[T]
WithAdaptive enables or disables adaptive draft length adjustment. When enabled (default), the draft length is adjusted based on acceptance rate.
type TensorCache ¶
TensorCache is a KV cache that keeps tensors in pre-allocated buffers. On the first Update for a layer, it allocates [batch, maxSeqLen, dim] memory (GPU or CPU depending on the source tensor). Subsequent Updates append new K/V data via direct memcpy at the correct offset, avoiding per-token allocation overhead.
func NewTensorCache ¶
func NewTensorCache[T tensor.Numeric](engine compute.Engine[T], numLayers, maxSeqLen int, opts ...TensorCacheOption) *TensorCache[T]
NewTensorCache creates a TensorCache backed by the given engine. numLayers should match the model's transformer layer count. maxSeqLen limits the total cached sequence length. If the engine implements GPUStreamAccessor, async memcpy is used for KV cache updates (required for CUDA graph capture compatibility).
func (*TensorCache[T]) Free ¶
func (c *TensorCache[T]) Free()
Free releases all pre-allocated GPU buffers. CPU buffers are left to GC.
func (*TensorCache[T]) GPUCounterPtr ¶
func (c *TensorCache[T]) GPUCounterPtr() unsafe.Pointer
GPUCounterPtr returns the device pointer to the GPU-resident int32 position counter. Returns nil if no GPU counter is allocated (CPU-only cache). Kernels (offset_memcpy, rope_select, increment_counter) use this pointer to read/write the current sequence position on the GPU, enabling CUDA graph capture of the decode loop.
func (*TensorCache[T]) Get ¶
func (c *TensorCache[T]) Get(layer int) (*LayerKV[T], bool)
Get returns the cached key-value pair for the given layer. For GPU-backed layers, returns a view into the pre-allocated buffer. For CPU-backed layers, returns a tensor wrapping the buffer slice. Returns false if the layer index is out of range or the layer is empty.
func (*TensorCache[T]) GetFullBuffer ¶
func (c *TensorCache[T]) GetFullBuffer(layer int) (k, v *tensor.TensorNumeric[T])
GetFullBuffer returns GPU-backed KV tensors spanning the full pre-allocated buffer (maxSeqLen capacity) for the given layer. The shape is [batch, maxSeqLen, dim]. This is used by flash_attention_decode which reads the actual KV length from a GPU-resident counter, so it needs the buffer with its full stride rather than a seqLen-trimmed view. Returns nil if the layer is CPU-backed or not yet initialized.
func (*TensorCache[T]) KVSeqLenPtr ¶
func (c *TensorCache[T]) KVSeqLenPtr() unsafe.Pointer
KVSeqLenPtr returns the device pointer to the GPU-resident int32 KV sequence length counter. Returns nil if not allocated (CPU-only cache). The flash_attention_decode kernel reads this pointer at runtime so the KV length is not frozen by CUDA graph capture.
func (*TensorCache[T]) MaxSeqLen ¶
func (c *TensorCache[T]) MaxSeqLen() int
MaxSeqLen returns the maximum sequence length (buffer capacity).
func (*TensorCache[T]) Reset ¶
func (c *TensorCache[T]) Reset()
Reset clears sequence lengths to zero. Pre-allocated buffers are kept for reuse; only data pointers are logically invalidated. The GPU counters are also zeroed so that GPU-side kernels see the reset position.
func (*TensorCache[T]) SeqLen ¶
func (c *TensorCache[T]) SeqLen() int
SeqLen returns the current cached sequence length (from layer 0).
func (*TensorCache[T]) SyncCounterFromGPU ¶
func (c *TensorCache[T]) SyncCounterFromGPU() error
SyncCounterFromGPU performs a D2H copy of the GPU counter to update the CPU seqLen across all layers. Call this after the decode loop completes to bring the CPU-side cursor back in sync with the GPU counter.
func (*TensorCache[T]) Truncate ¶
func (c *TensorCache[T]) Truncate(newSeqLen int)
Truncate rolls back the cache to the given sequence length. Pre-allocated buffers are kept; the data beyond newSeqLen is simply ignored. GPU-resident counters (gpuCounter and kvSeqLenCounter) are also reset to match newSeqLen so that GPU-side kernels (offset_memcpy, rope_select, flash_attention_decode) see the correct position after rollback.
func (*TensorCache[T]) Update ¶
func (c *TensorCache[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
Update appends new key and value tensors to the cache for the given layer. Tensors must be 3D with shape [batch, seqLen, dim]. On the first call for a layer, pre-allocated buffers are created. Subsequent calls copy new data directly into the buffers at the current sequence offset.
type TensorCacheOption ¶
type TensorCacheOption func(*tensorCacheOptions)
TensorCacheOption configures a TensorCache.
func WithKVDtype ¶
func WithKVDtype(dtype string) TensorCacheOption
WithKVDtype sets the KV cache storage dtype. Supported values: "fp32" (default), "fp16". FP16 mode halves KV memory bandwidth but requires GPU and CUDA conversion kernels.
type Tier ¶ added in v1.29.0
type Tier int
Tier identifies which storage tier a layer's KV data resides in.
type TieredKVStore ¶ added in v1.29.0
TieredKVStore provides multi-tier KV cache storage with automatic promotion and demotion based on access patterns.
Hot tier: recent tokens stored uncompressed (KVCache).
Warm tier: compressed tokens in CPU memory (CompressedKVCache).
Cold tier: evicted tokens serialized to disk.
Tokens are initially stored in the hot tier. When a layer's access count drops below demoteThreshold, it is moved to the warm tier (compressed). If demoted again, data moves to the cold tier (disk). Accessing a cold or warm layer promotes it back toward the hot tier.
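The demote/promote policy can be sketched as a small state machine (illustrative thresholds and hypothetical names; the real store also moves the underlying KV data between caches when a layer changes tier):

```go
package main

import "fmt"

type tier int

const (
	hot tier = iota
	warm
	cold
)

// manageTiers applies the access-count policy: below demoteThreshold a
// layer moves one tier down, above promoteThreshold it moves one tier up,
// and counts reset afterwards.
func manageTiers(tiers []tier, counts []int, demoteThreshold, promoteThreshold int) {
	for i := range tiers {
		switch {
		case counts[i] < demoteThreshold && tiers[i] < cold:
			tiers[i]++ // demote
		case counts[i] > promoteThreshold && tiers[i] > hot:
			tiers[i]-- // promote
		}
		counts[i] = 0
	}
}

func main() {
	tiers := []tier{hot, hot, warm, cold}
	counts := []int{0, 10, 10, 1}
	manageTiers(tiers, counts, 2, 5)
	fmt.Println(tiers) // [1 0 0 2]
}
```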
func NewTieredKVStore ¶ added in v1.29.0
func NewTieredKVStore[T tensor.Numeric](engine compute.Engine[T], cfg TieredKVStoreConfig) (*TieredKVStore[T], error)
NewTieredKVStore creates a TieredKVStore. All layers start in the hot tier. If cfg.ColdDir is empty, a temporary directory is created.
func (*TieredKVStore[T]) AccessCount ¶ added in v1.29.0
func (s *TieredKVStore[T]) AccessCount(layer int) int
AccessCount returns the current access count for a layer.
func (*TieredKVStore[T]) Close ¶ added in v1.29.0
func (s *TieredKVStore[T]) Close() error
Close stops the prefetch goroutine and cleans up cold-tier files. When the cold directory was created by NewTieredKVStore (cfg.ColdDir was empty), Close removes it. When the caller supplied their own ColdDir, Close leaves the directory untouched so the caller can reuse it across generation calls.
func (*TieredKVStore[T]) Demote ¶ added in v1.29.0
func (s *TieredKVStore[T]) Demote(layer int) error
Demote moves a layer one tier down (hot→warm, warm→cold).
func (*TieredKVStore[T]) Get ¶ added in v1.29.0
func (s *TieredKVStore[T]) Get(layer int) (*LayerKV[T], bool)
Get returns the cached key-value pair for the given layer. The data is retrieved from whichever tier the layer currently resides in. Accessing a layer increments its access count.
func (*TieredKVStore[T]) GetPrefetched ¶ added in v1.29.0
func (s *TieredKVStore[T]) GetPrefetched(layer int) (*LayerKV[T], bool)
GetPrefetched returns a previously prefetched LayerKV for the given layer and removes it from the prefetch cache. Returns nil, false if no prefetched data is available.
func (*TieredKVStore[T]) ManageTiers ¶ added in v1.29.0
func (s *TieredKVStore[T]) ManageTiers() error
ManageTiers evaluates access patterns and moves layers between tiers. Layers with low access counts are demoted; layers with high access counts are promoted. Access counts are reset after management.
func (*TieredKVStore[T]) NumLayers ¶ added in v1.29.0
func (s *TieredKVStore[T]) NumLayers() int
NumLayers returns the number of layers in the store.
func (*TieredKVStore[T]) PrefetchAsync ¶ added in v1.29.0
func (s *TieredKVStore[T]) PrefetchAsync(positions []int)
PrefetchAsync queues the given layer positions for asynchronous prefetch from cold/warm tiers to the hot tier. Layers already in the hot tier or already prefetched are skipped. The caller continues execution immediately.
func (*TieredKVStore[T]) Promote ¶ added in v1.29.0
func (s *TieredKVStore[T]) Promote(layer int) error
Promote moves a layer one tier up (cold→warm, warm→hot).
func (*TieredKVStore[T]) Reset ¶ added in v1.29.0
func (s *TieredKVStore[T]) Reset()
Reset clears all tiers, resets access counts, and drains prefetched data.
func (*TieredKVStore[T]) SeqLen ¶ added in v1.29.0
func (s *TieredKVStore[T]) SeqLen() int
SeqLen returns the current sequence length from the hot tier.
func (*TieredKVStore[T]) Tier ¶ added in v1.29.0
func (s *TieredKVStore[T]) Tier(layer int) Tier
Tier returns the current storage tier for the given layer.
func (*TieredKVStore[T]) Truncate ¶ added in v1.29.0
func (s *TieredKVStore[T]) Truncate(newSeqLen int)
Truncate reduces the hot tier cache to the given sequence length.
func (*TieredKVStore[T]) Update ¶ added in v1.29.0
func (s *TieredKVStore[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
Update appends new key and value tensors for the given layer. Data is always written to the hot tier. If the layer was in a lower tier, it is promoted to hot first.
type TieredKVStoreConfig ¶ added in v1.29.0
type TieredKVStoreConfig struct {
NumLayers int
MaxSeqLen int
ChunkSize int // compression chunk size for warm tier
DemoteThreshold int // access count below which layers are demoted
PromoteThreshold int // access count above which layers are promoted
ColdDir string
}
TieredKVStoreConfig holds configuration for a TieredKVStore.
type TokenStream ¶
type TokenStream interface {
// OnToken is called for each decoded token during streaming generation.
// When done is true, generation is complete (token may be empty).
// Returning a non-nil error stops generation.
OnToken(token string, done bool) error
}
TokenStream receives tokens as they are generated.
type TokenStreamFunc ¶
TokenStreamFunc adapts a function to the TokenStream interface.
type TokenUsage ¶ added in v1.12.0
type TokenUsage struct {
// contains filtered or unexported fields
}
TokenUsage tracks prompt and completion token counts during generation. It is safe for concurrent reads via atomic loads and is written by the generation session after prefill and decode complete.
func TokenUsageFromContext ¶ added in v1.12.0
func TokenUsageFromContext(ctx context.Context) *TokenUsage
TokenUsageFromContext extracts the TokenUsage from the context, or nil if none is present.
func (*TokenUsage) CompletionTokens ¶ added in v1.12.0
func (u *TokenUsage) CompletionTokens() int
CompletionTokens returns the completion token count.
func (*TokenUsage) PromptTokens ¶ added in v1.12.0
func (u *TokenUsage) PromptTokens() int
PromptTokens returns the prompt token count.
func (*TokenUsage) SetCompletionTokens ¶ added in v1.12.0
func (u *TokenUsage) SetCompletionTokens(n int)
SetCompletionTokens stores the completion token count.
func (*TokenUsage) SetPromptTokens ¶ added in v1.12.0
func (u *TokenUsage) SetPromptTokens(n int)
SetPromptTokens stores the prompt token count.
type TracingCacheProvider ¶
TracingCacheProvider wraps a real CacheProvider and records KV cache operations into a Tracer during a tracing compilation pass. This allows the tracing compiler to capture the full attention dataflow including cache reads and writes.
func NewTracingCacheProvider ¶
func NewTracingCacheProvider[T tensor.Numeric](real CacheProvider[T], tracer *compute.Tracer[T]) *TracingCacheProvider[T]
NewTracingCacheProvider creates a TracingCacheProvider wrapping the given real cache and recording ops into the tracer.
func (*TracingCacheProvider[T]) Get ¶
func (t *TracingCacheProvider[T]) Get(layer int) (*LayerKV[T], bool)
Get delegates to the real cache and records KVCacheGetK/V ops.
func (*TracingCacheProvider[T]) Reset ¶
func (t *TracingCacheProvider[T]) Reset()
Reset delegates to the real cache.
func (*TracingCacheProvider[T]) SeqLen ¶
func (t *TracingCacheProvider[T]) SeqLen() int
SeqLen delegates to the real cache.
func (*TracingCacheProvider[T]) Truncate ¶
func (t *TracingCacheProvider[T]) Truncate(newSeqLen int)
Truncate delegates to the real cache.
func (*TracingCacheProvider[T]) Update ¶
func (t *TracingCacheProvider[T]) Update(layer int, newK, newV *tensor.TensorNumeric[T]) error
Update delegates to the real cache and records KVCacheAppendK/V ops.
Source Files ¶
- adaptive.go
- batch.go
- block_pool.go
- compressed_kv_cache.go
- context.go
- cuda_allocator.go
- doc.go
- eagle_speculative.go
- generator.go
- gpu_kv_cache.go
- grammar_mask.go
- kvcache.go
- kvcache_fp16.go
- kvcache_fp8.go
- kvcache_q3.go
- kvcache_q4.go
- megakernel.go
- megakernel_kv.go
- paged_kv.go
- prefix_cache.go
- radix_cache.go
- sampling.go
- session.go
- speculative.go
- ssm_state.go
- stream.go
- tensor_cache.go
- tiered_kv_adapter.go
- tiered_kv_store.go
- tracing_cache.go
- usage.go
Directories ¶
| Path | Synopsis |
|---|---|
| agent | Package agent implements the agentic tool-use loop for multi-step reasoning. |
| grammar | Package grammar converts a subset of JSON Schema into a context-free grammar for constrained decoding. |
| speculative | Package speculative implements speculative decoding strategies for accelerated generation. |