Documentation
¶
Overview ¶
Package kv provides paged-attention KV cache primitives for transformer inference. The BlockPool pre-allocates fixed-size blocks of key/value storage so that the hot path (Alloc / Free) is allocation-free.
Index ¶
- Constants
- type Block
- type BlockPool
- func (p *BlockPool[T]) Alloc() (*Block[T], error)
- func (p *BlockPool[T]) Available() int
- func (p *BlockPool[T]) BlockSize() int
- func (p *BlockPool[T]) Cap() int
- func (p *BlockPool[T]) FragmentationRatio() float64
- func (p *BlockPool[T]) Free(b *Block[T])
- func (p *BlockPool[T]) HeadDim() int
- func (p *BlockPool[T]) NumLayers() int
- type BlockTable
- func (bt *BlockTable[T]) Append(tokenCount int) error
- func (bt *BlockTable[T]) BlockCount() int
- func (bt *BlockTable[T]) Blocks() []*Block[T]
- func (bt *BlockTable[T]) Free()
- func (bt *BlockTable[T]) Lookup(logicalPos int) (*Block[T], int, error)
- func (bt *BlockTable[T]) Pool() *BlockPool[T]
- func (bt *BlockTable[T]) TokenCount() int
- type PagedKVCache
- type RadixNode
- type RadixTree
Constants ¶
const DefaultBlockSize = 16
DefaultBlockSize is the number of token positions per block.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Block ¶
type Block[T tensor.Numeric] struct { K []T V []T Used int // number of token positions written (0..blockSize) }
Block holds pre-allocated key and value data for a fixed number of token positions across all layers. K and V each have numLayers * blockSize * headDim elements laid out as [layer][position][headDim] in row-major order.
type BlockPool ¶
BlockPool manages a fixed-size pool of pre-allocated KV cache blocks. Blocks are allocated at startup and recycled via Alloc/Free. All methods are safe for concurrent use.
func NewBlockPool ¶
func NewBlockPool[T tensor.Numeric](numBlocks, numLayers, blockSize, headDim int) (*BlockPool[T], error)
NewBlockPool creates a pool with numBlocks pre-allocated blocks. Each block holds K and V data for blockSize token positions across numLayers, with headDim elements per position per layer. blockSize defaults to DefaultBlockSize (16) when zero.
func (*BlockPool[T]) Alloc ¶
Alloc returns a free block from the pool. Returns an error if the pool is exhausted. The returned block has Used reset to 0.
func (*BlockPool[T]) FragmentationRatio ¶
FragmentationRatio returns the fraction of allocated blocks that are partially used (0 < Used < blockSize). A ratio of 0.0 means all allocated blocks are either empty or full; 1.0 means every allocated block is partially filled. Returns 0.0 when no blocks are allocated.
func (*BlockPool[T]) Free ¶
Free returns a block to the pool. The block's Used counter is reset to 0.
type BlockTable ¶
BlockTable maps a single sequence's logical token positions to physical blocks allocated from a BlockPool. It grows automatically as tokens are appended and returns all blocks to the pool on Free.
func NewBlockTable ¶
func NewBlockTable[T tensor.Numeric](pool *BlockPool[T]) *BlockTable[T]
NewBlockTable creates a BlockTable backed by the given pool.
func (*BlockTable[T]) Append ¶
func (bt *BlockTable[T]) Append(tokenCount int) error
Append adds tokenCount new token positions to the table, allocating additional blocks from the pool as needed.
func (*BlockTable[T]) BlockCount ¶
func (bt *BlockTable[T]) BlockCount() int
BlockCount returns the number of blocks currently held.
func (*BlockTable[T]) Blocks ¶
func (bt *BlockTable[T]) Blocks() []*Block[T]
Blocks returns the ordered slice of physical blocks held by this table.
func (*BlockTable[T]) Free ¶
func (bt *BlockTable[T]) Free()
Free returns all blocks to the pool and resets the table.
func (*BlockTable[T]) Lookup ¶
func (bt *BlockTable[T]) Lookup(logicalPos int) (*Block[T], int, error)
Lookup returns the physical block and offset within that block for a logical token position.
func (*BlockTable[T]) Pool ¶
func (bt *BlockTable[T]) Pool() *BlockPool[T]
Pool returns the BlockPool backing this table.
func (*BlockTable[T]) TokenCount ¶
func (bt *BlockTable[T]) TokenCount() int
TokenCount returns the total number of token positions tracked.
type PagedKVCache ¶
type PagedKVCache[T tensor.Numeric] struct { Table *BlockTable[T] }
PagedKVCache bundles a BlockTable and its backing BlockPool into a single value that can be passed to the compute engine's paged GQA method.
func NewPagedKVCache ¶
func NewPagedKVCache[T tensor.Numeric](table *BlockTable[T]) *PagedKVCache[T]
NewPagedKVCache creates a PagedKVCache wrapping the given BlockTable.
func (*PagedKVCache[T]) BlockSize ¶
func (c *PagedKVCache[T]) BlockSize() int
BlockSize returns the number of token positions per block.
func (*PagedKVCache[T]) HeadDim ¶
func (c *PagedKVCache[T]) HeadDim() int
HeadDim returns the head dimension from the backing pool.
func (*PagedKVCache[T]) SeqLen ¶
func (c *PagedKVCache[T]) SeqLen() int
SeqLen returns the number of token positions in the cache.
type RadixNode ¶
RadixNode is a node in the radix tree. Each edge stores a slice of token IDs and a corresponding slice of KV blocks. Children are keyed by the first token ID of their edge.
type RadixTree ¶
RadixTree stores shared prompt prefixes so that KV blocks can be reused across sequences that share a common prefix. It is safe for concurrent use.
func NewRadixTree ¶
NewRadixTree creates a radix tree that caches up to capacity KV blocks.
func (*RadixTree[T]) Evict ¶
func (rt *RadixTree[T]) Evict()
Evict removes the least recently used leaf node from the tree.
func (*RadixTree[T]) Insert ¶
Insert adds a token prefix and its associated KV blocks into the tree. If the prefix already exists, the blocks are updated. When the tree exceeds capacity, LRU leaf nodes are evicted.