Documentation ¶
Overview ¶
Package speculative implements speculative decoding strategies for accelerating autoregressive text generation.
Stability: beta
Index ¶
- func AcceptTokens(draftTokens []int32, draftProbs [][]float32, targetProbs [][]float32, ...) ([]int32, float32)
- type ExternalDraft
- type ForwardFunc
- type SelfDraft
- func (sd *SelfDraft[T]) AcceptanceRate(ctx context.Context, prompt []int, K int) (float64, error)
- func (sd *SelfDraft[T]) DraftDepth() int
- func (sd *SelfDraft[T]) Generate(ctx context.Context, tokens []int, K int) ([]int, error)
- func (sd *SelfDraft[T]) Verify(ctx context.Context, draftTokens []int) (accepted int, correction int, err error)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AcceptTokens ¶
func AcceptTokens(draftTokens []int32, draftProbs [][]float32, targetProbs [][]float32, rng *rand.Rand) ([]int32, float32)
AcceptTokens implements the rejection sampling algorithm from Leviathan et al. 2023 ("Fast Inference from Transformers via Speculative Decoding").
For each draft token x_i with draft probability q(x_i) and target probability p(x_i):
- Accept with probability min(1, p(x_i) / q(x_i))
- If rejected at position i, sample a correction token from the renormalized max(0, p - q) distribution
- If all K tokens accepted, sample a bonus token from the target distribution at position K
Parameters:
- draftTokens: K token IDs proposed by the draft model
- draftProbs: full probability distributions from the draft model, one []float32 per draft position (each slice has length vocabSize)
- targetProbs: full probability distributions from the target model, one []float32 per draft position (each slice has length vocabSize)
- rng: random number generator (if nil, acceptance is deterministic with r = 0, equivalent to greedy / temperature-0 behavior)
Returns the accepted token sequence (up to K+1 including a possible bonus token) and the acceptance rate (accepted / proposed).
Types ¶
type ExternalDraft ¶
ExternalDraft uses a smaller external model to generate draft tokens for speculative decoding. The draft and target models share a compute engine and block manager so GPU memory is not duplicated.
func NewExternalDraft ¶
func NewExternalDraft[T tensor.Numeric](draftGraph *graph.Graph[T], engine compute.Engine[T], blockPool *generate.BlockPool[T], config generate.ModelConfig) *ExternalDraft[T]
NewExternalDraft creates an ExternalDraft that uses draftGraph as the draft model. The engine and blockPool are shared with the target model. blockPool may be nil if paged KV caching is not used.
func (*ExternalDraft[T]) Generate ¶
func (ed *ExternalDraft[T]) Generate(ctx context.Context, tokens []int32, K int) ([]int32, []float32, error)
Generate runs K greedy decoding steps on the draft model, starting from the given input tokens. It returns up to K draft token IDs and their corresponding log probabilities. Generation stops early if an EOS token is produced. The returned slices have equal length (<= K).
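The contract of Generate (greedy steps, early EOS stop, equal-length result slices) can be sketched against a stand-in forward function. The `nextProbs` callback and `draftGreedy` helper are hypothetical; the real method runs a graph on a shared compute engine:

```go
package main

import (
	"fmt"
	"math"
)

// draftGreedy runs up to K greedy steps over a next-token distribution,
// stops early on EOS, and returns the drafted token IDs with their log
// probabilities. The returned slices always have equal length (<= K).
func draftGreedy(tokens []int32, K int, eos int32,
	nextProbs func(ctx []int32) []float32) ([]int32, []float32) {

	ctx := append([]int32(nil), tokens...)
	var ids []int32
	var logps []float32
	for step := 0; step < K; step++ {
		probs := nextProbs(ctx)
		best, bestP := int32(0), float32(-1)
		for i, p := range probs {
			if p > bestP {
				best, bestP = int32(i), p
			}
		}
		ids = append(ids, best)
		logps = append(logps, float32(math.Log(float64(bestP))))
		if best == eos {
			break // early stop; slices stay aligned
		}
		ctx = append(ctx, best)
	}
	return ids, logps
}

func main() {
	// Toy "model": always predicts (last token + 1) mod 4; token 3 is EOS.
	model := func(ctx []int32) []float32 {
		p := make([]float32, 4)
		p[(ctx[len(ctx)-1]+1)%4] = 1
		return p
	}
	ids, logps := draftGreedy([]int32{0}, 8, 3, model)
	fmt.Println(ids, len(logps)) // drafts 1, 2, then hits EOS=3 and stops
}
```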
type ForwardFunc ¶
type ForwardFunc[T tensor.Numeric] func(ctx context.Context, input *tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
ForwardFunc runs a model forward pass on the given input tokens and returns logits shaped [1, seqLen, vocabSize]. The implementation decides how many transformer layers to execute: callers use a partial-layer function for drafting and the full model for verification.
type SelfDraft ¶
SelfDraft implements self-speculative decoding: the same model is used for both drafting and verification. Drafting runs only the first N/2 layers (early exit), producing cheap approximate tokens. Verification runs the full model to accept or reject draft tokens.
This avoids loading a separate draft model, reducing memory at the cost of draft quality (typically alpha > 0.4 for well-trained models).
func NewSelfDraft ¶
func NewSelfDraft[T tensor.Numeric](draftFn, verifyFn ForwardFunc[T], vocabSize, numLayers, draftDepth int) *SelfDraft[T]
NewSelfDraft creates a SelfDraft speculative decoder.
Parameters:
- draftFn: forward function using only the first draftDepth layers
- verifyFn: forward function using all layers
- vocabSize: model vocabulary size
- numLayers: total transformer layers in the full model
- draftDepth: number of layers to use for drafting (typically numLayers/2)
func (*SelfDraft[T]) AcceptanceRate ¶
AcceptanceRate measures the fraction of draft tokens accepted by the full model (alpha). It generates K draft tokens from the prompt, then verifies them. Returns alpha in [0, 1].
func (*SelfDraft[T]) DraftDepth ¶
DraftDepth returns the number of layers used for drafting.
func (*SelfDraft[T]) Generate ¶
Generate produces K draft tokens using the partial-layer forward function. Each draft step feeds the previous draft token back as input. The returned slice contains up to K token IDs.
func (*SelfDraft[T]) Verify ¶
func (sd *SelfDraft[T]) Verify(ctx context.Context, draftTokens []int) (accepted int, correction int, err error)
Verify checks draft tokens against the full model. It runs the full model on the draft token sequence and returns the number of accepted tokens (the length of the prefix on which the full model's greedy prediction matches the draft), together with a correction token: the full model's prediction at the first rejected position.
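The greedy accept/correct contract reduces to a prefix match. A minimal sketch, where `fullPreds` stands in for the full model's argmax at each draft position and `verifyGreedy` is a hypothetical helper (the real method runs the full forward pass itself):

```go
package main

import "fmt"

// verifyGreedy compares draft tokens against the full model's greedy
// predictions. It returns the length of the accepted prefix and a correction
// token: the full model's prediction at the first mismatch, or -1 when every
// draft token was accepted.
func verifyGreedy(draft, fullPreds []int) (accepted int, correction int) {
	for i, t := range draft {
		if fullPreds[i] != t {
			return i, fullPreds[i]
		}
	}
	return len(draft), -1 // all accepted; no correction needed
}

func main() {
	draft := []int{5, 9, 2, 7}
	full := []int{5, 9, 4, 7} // full model disagrees at position 2
	acc, corr := verifyGreedy(draft, full)
	fmt.Println(acc, corr) // 2 4
}
```

A single full-model pass scores every draft position at once, which is why verification of K tokens costs about the same as generating one token autoregressively.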