Documentation ¶
Overview ¶
Package speculative implements speculative decoding strategies for accelerating autoregressive text generation.
Stability: beta
Index ¶
- func AcceptTokens(draftTokens []int32, draftProbs [][]float32, targetProbs [][]float32, ...) ([]int32, float32)
- type ExternalDraft
- type ForwardFunc
- type SelfDraft
- func (sd *SelfDraft[T]) AcceptanceRate(ctx context.Context, prompt []int, K int) (float64, error)
- func (sd *SelfDraft[T]) DraftDepth() int
- func (sd *SelfDraft[T]) Generate(ctx context.Context, tokens []int, K int) ([]int, error)
- func (sd *SelfDraft[T]) Verify(ctx context.Context, draftTokens []int) (accepted int, correction int, err error)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AcceptTokens ¶
func AcceptTokens(draftTokens []int32, draftProbs [][]float32, targetProbs [][]float32, rng *rand.Rand) ([]int32, float32)
AcceptTokens implements the rejection sampling algorithm from Leviathan et al. 2023 ("Fast Inference from Transformers via Speculative Decoding").
For each draft token x_i with draft probability q(x_i) and target probability p(x_i):
- Accept with probability min(1, p(x_i) / q(x_i))
- If rejected at position i, sample a correction token from the renormalized max(0, p - q) distribution
- If all K tokens accepted, sample a bonus token from the target distribution at position K
Parameters:
- draftTokens: K token IDs proposed by the draft model
- draftProbs: full probability distributions from the draft model, one []float32 per draft position (each slice has length vocabSize)
- targetProbs: full probability distributions from the target model, one []float32 per draft position (each slice has length vocabSize)
- rng: random number generator (if nil, acceptance is deterministic with r = 0, equivalent to greedy / temperature-0 behavior)
Returns the accepted token sequence (up to K+1 including a possible bonus token) and the acceptance rate (accepted / proposed).
Types ¶
type ExternalDraft ¶
ExternalDraft uses a smaller external model to generate draft tokens for speculative decoding. The draft and target models share a compute engine and block manager so GPU memory is not duplicated.
func NewExternalDraft ¶
func NewExternalDraft[T tensor.Numeric](draftGraph *graph.Graph[T], engine compute.Engine[T], blockPool *generate.BlockPool[T], config generate.ModelConfig) *ExternalDraft[T]
NewExternalDraft creates an ExternalDraft that uses draftGraph as the draft model. The engine and blockPool are shared with the target model. blockPool may be nil if paged KV caching is not used.
func (*ExternalDraft[T]) Generate ¶
func (ed *ExternalDraft[T]) Generate(ctx context.Context, tokens []int32, K int) ([]int32, []float32, error)
Generate runs K greedy decoding steps on the draft model, starting from the given input tokens. It returns up to K draft token IDs and their corresponding log probabilities. Generation stops early if an EOS token is produced. The returned slices have equal length (<= K).
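The contract of Generate (greedy steps, early EOS stop, equal-length result slices) can be sketched against a stand-in forward function. The `nextProbs` callback and `draftGreedy` helper are hypothetical; the real method runs a graph on a shared compute engine:

```go
package main

import (
	"fmt"
	"math"
)

// draftGreedy runs up to K greedy steps over a next-token distribution,
// stops early on EOS, and returns the drafted token IDs with their log
// probabilities. The returned slices always have equal length (<= K).
func draftGreedy(tokens []int32, K int, eos int32,
	nextProbs func(ctx []int32) []float32) ([]int32, []float32) {

	ctx := append([]int32(nil), tokens...)
	var ids []int32
	var logps []float32
	for step := 0; step < K; step++ {
		probs := nextProbs(ctx)
		best, bestP := int32(0), float32(-1)
		for i, p := range probs {
			if p > bestP {
				best, bestP = int32(i), p
			}
		}
		ids = append(ids, best)
		logps = append(logps, float32(math.Log(float64(bestP))))
		if best == eos {
			break // early stop; slices stay aligned
		}
		ctx = append(ctx, best)
	}
	return ids, logps
}

func main() {
	// Toy "model": always predicts (last token + 1) mod 4; token 3 is EOS.
	model := func(ctx []int32) []float32 {
		p := make([]float32, 4)
		p[(ctx[len(ctx)-1]+1)%4] = 1
		return p
	}
	ids, logps := draftGreedy([]int32{0}, 8, 3, model)
	fmt.Println(ids, len(logps)) // drafts 1, 2, then hits EOS=3 and stops
}
```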
type ForwardFunc ¶
type ForwardFunc[T tensor.Numeric] func(ctx context.Context, input *tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
ForwardFunc runs a model forward pass on the given input tokens and returns logits shaped [1, seqLen, vocabSize]. The implementation decides how many transformer layers to execute: callers use a partial-layer function for drafting and the full model for verification.
type SelfDraft ¶
SelfDraft implements self-speculative decoding: the same model is used for both drafting and verification. Drafting runs only the first N/2 layers (early exit), producing cheap approximate tokens. Verification runs the full model to accept or reject draft tokens.
This avoids loading a separate draft model, reducing memory at the cost of draft quality (typically alpha > 0.4 for well-trained models).
func NewSelfDraft ¶
func NewSelfDraft[T tensor.Numeric](draftFn, verifyFn ForwardFunc[T], vocabSize, numLayers, draftDepth int) *SelfDraft[T]
NewSelfDraft creates a SelfDraft speculative decoder.
Parameters:
- draftFn: forward function using only the first draftDepth layers
- verifyFn: forward function using all layers
- vocabSize: model vocabulary size
- numLayers: total transformer layers in the full model
- draftDepth: number of layers to use for drafting (typically numLayers/2)
func (*SelfDraft[T]) AcceptanceRate ¶
AcceptanceRate measures the fraction of draft tokens accepted by the full model (alpha). It generates K draft tokens from the prompt, then verifies them. Returns alpha in [0, 1].
func (*SelfDraft[T]) DraftDepth ¶
DraftDepth returns the number of layers used for drafting.
func (*SelfDraft[T]) Generate ¶
Generate produces K draft tokens using the partial-layer forward function. Each draft step feeds the previous draft token back as input. The returned slice contains up to K token IDs.
func (*SelfDraft[T]) Verify ¶
func (sd *SelfDraft[T]) Verify(ctx context.Context, draftTokens []int) (accepted int, correction int, err error)
Verify checks draft tokens against the full model. It runs the full model on the draft token sequence and returns the number of accepted tokens (the length of the prefix on which the full model's greedy prediction matches the draft), together with a correction token: the full model's prediction at the first rejected position.
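The greedy accept/correct contract reduces to a prefix match. A minimal sketch, where `fullPreds` stands in for the full model's argmax at each draft position and `verifyGreedy` is a hypothetical helper (the real method runs the full forward pass itself):

```go
package main

import "fmt"

// verifyGreedy compares draft tokens against the full model's greedy
// predictions. It returns the length of the accepted prefix and a correction
// token: the full model's prediction at the first mismatch, or -1 when every
// draft token was accepted.
func verifyGreedy(draft, fullPreds []int) (accepted int, correction int) {
	for i, t := range draft {
		if fullPreds[i] != t {
			return i, fullPreds[i]
		}
	}
	return len(draft), -1 // all accepted; no correction needed
}

func main() {
	draft := []int{5, 9, 2, 7}
	full := []int{5, 9, 4, 7} // full model disagrees at position 2
	acc, corr := verifyGreedy(draft, full)
	fmt.Println(acc, corr) // 2 4
}
```

A single full-model pass scores every draft position at once, which is why verification of K tokens costs about the same as generating one token autoregressively.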