Documentation ¶
Overview ¶
Package residual provides residual connection layers for neural networks.
Standard transformer architectures use additive residual connections: each layer's output is added to its input before being passed forward. While effective, this fixed scheme weights all previous layers equally, limiting the network's ability to route information across depth.
This package implements Attention Residuals (arXiv:2603.15031, Kimi Team, 2026), which replace fixed additive residuals with learned, softmax-weighted aggregation over depth. Each layer carries a small pseudo-query vector that attends over RMSNorm-projected keys from all preceding layer outputs, producing per-layer attention weights that dynamically control how much each earlier representation contributes to the current layer's input.
Two variants are provided:
AttnRes: Full Attention Residuals. Every layer attends over every previous layer output. This gives maximum expressiveness but requires O(L*d) memory to store all L layer outputs of dimension d.
BlockAttnRes: Block Attention Residuals. Layers are partitioned into N blocks of S layers each. Within a block, outputs are accumulated via standard addition. At block boundaries, softmax attention aggregates block-level representations. This reduces memory from O(L*d) to O(N*d) while recovering most of the benefit of full AttnRes. The paper shows that N=8 blocks recovers the majority of full AttnRes gains.
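To make the partitioning concrete, here is a small stand-alone sketch of the block arithmetic. The helper below is illustrative only, not part of this package: with L layers and N blocks of S = L/N layers, attention is applied at each block boundary and only N block representations need to be retained.

```go
package main

import "fmt"

// blockBoundaries is a hypothetical helper (not part of this package)
// showing the partitioning: numLayers layers split into numBlocks blocks
// of blockSize layers, with a boundary before each block after the first.
func blockBoundaries(numLayers, numBlocks int) (blockSize int, boundaries []int) {
	blockSize = numLayers / numBlocks
	for i := blockSize; i < numLayers; i += blockSize {
		boundaries = append(boundaries, i)
	}
	return blockSize, boundaries
}

func main() {
	size, bounds := blockBoundaries(32, 8)
	fmt.Println(size)   // 4
	fmt.Println(bounds) // [4 8 12 16 20 24 28]
}
```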
Usage: AttnRes in a transformer layer loop ¶
For full attention residuals, create one AttnRes per layer and collect all layer outputs:
// During graph construction (error handling elided for brevity):
hidden := embedOutput
layerOutputs := []*tensor.TensorNumeric[float32]{embedOutput} // layer 0 = embedding
for i := 0; i < numLayers; i++ {
	ar, _ := residual.NewAttnRes[float32](
		fmt.Sprintf("layer_%d_attnres", i), engine, ops, modelDim,
	)
	// Aggregate all previous layer outputs via softmax attention.
	hidden, _ = ar.Forward(ctx, layerOutputs...)
	// Run attention + FFN on the aggregated hidden state.
	hidden = transformerLayer(ctx, hidden, i)
	layerOutputs = append(layerOutputs, hidden)
}
Usage: BlockAttnRes with block boundaries ¶
For block attention residuals, accumulate within blocks and attend across block representations:
blockSize := numLayers / 8 // N=8 blocks
bar, _ := residual.NewBlockAttnRes[float32](engine, ops, blockSize, modelDim, 1e-5)
var completedBlocks []*tensor.TensorNumeric[float32]
var blockAccum *tensor.TensorNumeric[float32] // running sum within current block
hidden := embedOutput                         // start from the embedding output
for i := 0; i < numLayers; i++ {
	// At each block boundary (except the first layer), finalize the block.
	if i > 0 && i%blockSize == 0 {
		completedBlocks = append(completedBlocks, blockAccum)
		blockAccum = nil
	}
	// Compute the attention-weighted residual over completed blocks
	// plus the current (partial) block.
	partial := blockAccum
	if partial == nil {
		partial = hidden
	}
	hidden, _ = bar.Forward(ctx, hidden, completedBlocks, partial)
	// Run the transformer layer.
	hidden = transformerLayer(ctx, hidden, i)
	// Accumulate into the current block.
	if blockAccum == nil {
		blockAccum = hidden
	} else {
		blockAccum, _ = engine.Add(ctx, blockAccum, hidden)
	}
}
GGUF metadata ¶
Residual mode is configured via GGUF metadata keys (see inference package):
- general.residual_mode: "standard" (default), "attnres", or "block_attnres"
- general.attnres_blocks: number of blocks for block_attnres (default 8)
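For illustration, a loader might interpret these keys as follows. This is a hedged sketch: the metadata map shape, the helper name, and the fallback logic here are assumptions, not the inference package's actual code; only the key names and defaults come from the documentation above.

```go
package main

import "fmt"

// residualMode is a hypothetical sketch of reading the GGUF keys above
// from an already-parsed metadata map, applying the documented defaults
// when a key is absent.
func residualMode(meta map[string]any) (mode string, blocks int) {
	mode = "standard" // default residual mode
	if m, ok := meta["general.residual_mode"].(string); ok {
		mode = m
	}
	blocks = 8 // default block count for block_attnres
	if n, ok := meta["general.attnres_blocks"].(int); ok {
		blocks = n
	}
	return mode, blocks
}

func main() {
	mode, blocks := residualMode(map[string]any{
		"general.residual_mode": "block_attnres",
	})
	fmt.Println(mode, blocks) // block_attnres 8
}
```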
Stability ¶
This package is experimental. The API may change as the approach matures.
Index ¶
- type AttnRes
- func (a *AttnRes[T]) Attributes() map[string]interface{}
- func (a *AttnRes[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], ...) ([]*tensor.TensorNumeric[T], error)
- func (a *AttnRes[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- func (a *AttnRes[T]) OpType() string
- func (a *AttnRes[T]) OutputShape() []int
- func (a *AttnRes[T]) Parameters() []*graph.Parameter[T]
- type BlockAttnRes
- func (b *BlockAttnRes[T]) AttentionWeights(ctx context.Context, query *tensor.TensorNumeric[T], ...) (*tensor.TensorNumeric[T], error)
- func (b *BlockAttnRes[T]) BlockSize() int
- func (b *BlockAttnRes[T]) Forward(ctx context.Context, query *tensor.TensorNumeric[T], ...) (*tensor.TensorNumeric[T], error)
- func (b *BlockAttnRes[T]) Parameters() []*graph.Parameter[T]
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type AttnRes ¶
AttnRes implements full Attention Residuals (arXiv:2603.15031). Each layer has a learned pseudo-query w_l that attends over all previous layer outputs via softmax attention, replacing fixed additive residuals.
Forward computes:
keys_i  = RMSNorm(layerOutput_i)        // for each previous layer output
logit_i = dot(w_l, keys_i)              // per-layer scalar logit
alpha   = softmax(logits)               // attention weights over depth
h_l     = sum(alpha_i * layerOutput_i)
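As a concrete illustration, the same computation can be written on plain float64 slices. This is a minimal sketch, not the package's tensor-based implementation: the learned RMSNorm gain is omitted, and the helper names are invented for this example.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm normalizes x by its root-mean-square (learned gain omitted).
func rmsNorm(x []float64, eps float64) []float64 {
	var ss float64
	for _, v := range x {
		ss += v * v
	}
	scale := 1 / math.Sqrt(ss/float64(len(x))+eps)
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = v * scale
	}
	return out
}

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// attnResForward aggregates previous layer outputs using softmax weights
// computed from a pseudo-query w against RMSNorm'd keys.
func attnResForward(w []float64, layerOutputs [][]float64) []float64 {
	logits := make([]float64, len(layerOutputs))
	for i, h := range layerOutputs {
		logits[i] = dot(w, rmsNorm(h, 1e-5)) // keys_i = RMSNorm(layerOutput_i)
	}
	// alpha = softmax(logits), stabilized by subtracting the max logit.
	maxL := logits[0]
	for _, l := range logits {
		if l > maxL {
			maxL = l
		}
	}
	alpha := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		alpha[i] = math.Exp(l - maxL)
		sum += alpha[i]
	}
	// h_l = sum(alpha_i * layerOutput_i)
	out := make([]float64, len(w))
	for i, h := range layerOutputs {
		alpha[i] /= sum
		for j, v := range h {
			out[j] += alpha[i] * v
		}
	}
	return out
}

func main() {
	w := []float64{1, 0}
	outs := [][]float64{{1, 2}, {3, 4}}
	fmt.Println(attnResForward(w, outs))
}
```

Because the weights are a softmax, the result is a convex combination of the layer outputs: each output coordinate lies between the smallest and largest corresponding input coordinates.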
func NewAttnRes ¶
func NewAttnRes[T tensor.Numeric](name string, engine compute.Engine[T], ops numeric.Arithmetic[T], modelDim int) (*AttnRes[T], error)
NewAttnRes creates a new AttnRes layer. modelDim is the hidden dimension of the model.
func (*AttnRes[T]) Attributes ¶
Attributes returns the attributes of the AttnRes layer.
func (*AttnRes[T]) Backward ¶
func (a *AttnRes[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], _ ...*tensor.TensorNumeric[T]) ([]*tensor.TensorNumeric[T], error)
Backward computes the backward pass of the AttnRes layer.
func (*AttnRes[T]) Forward ¶
func (a *AttnRes[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
Forward computes the attention-weighted residual combination. inputs contains the outputs of all previous layers, each with a shape compatible for a dot product with the query (typically [1, modelDim] or [modelDim]).
func (*AttnRes[T]) OutputShape ¶
OutputShape returns the output shape of the AttnRes layer.
func (*AttnRes[T]) Parameters ¶
Parameters returns the trainable parameters of the AttnRes layer.
type BlockAttnRes ¶
BlockAttnRes implements Block Attention Residuals (arXiv:2603.15031). Partitions L layers into N blocks of S layers each. Intra-block: standard residual accumulation (sum of layer outputs). Inter-block: softmax attention over N block-level representations.
Forward implements Fig 2 from the paper:
- Stack block representations + partial block into value matrix V
- Apply RMSNorm to get keys K
- Compute logits = query^T * K (dot product)
- alpha = softmax(logits)
- h = sum(alpha_i * V_i)
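The steps above can be sketched on plain float64 slices. This is an illustrative stand-in, not the package implementation: for brevity the keys here are the raw values rather than RMSNorm-projected ones, and the function name is invented. It shows the essential bookkeeping, namely that the current partial block joins the completed blocks as the last row of V, so softmax runs over n = len(blocks)+1 entries.

```go
package main

import (
	"fmt"
	"math"
)

// blockAttn stacks completed blocks plus the partial block into V,
// computes logits = query^T * V_i, softmaxes them, and returns the
// weighted sum along with the attention weights.
func blockAttn(query []float64, blocks [][]float64, partial []float64) ([]float64, []float64) {
	// V = completed blocks + current partial block as the last row.
	V := append(append([][]float64{}, blocks...), partial)
	logits := make([]float64, len(V))
	for i, v := range V {
		for j := range query {
			logits[i] += query[j] * v[j] // logits = query^T * K
		}
	}
	// alpha = softmax(logits), stabilized by subtracting the max logit.
	maxL := logits[0]
	for _, l := range logits {
		if l > maxL {
			maxL = l
		}
	}
	alpha := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		alpha[i] = math.Exp(l - maxL)
		sum += alpha[i]
	}
	// h = sum(alpha_i * V_i)
	out := make([]float64, len(query))
	for i := range alpha {
		alpha[i] /= sum
		for j, v := range V[i] {
			out[j] += alpha[i] * v
		}
	}
	return out, alpha
}

func main() {
	out, alpha := blockAttn(
		[]float64{1, 0},
		[][]float64{{2, 0}, {0, 2}},
		[]float64{1, 1},
	)
	fmt.Println(len(alpha), len(out)) // 3 2
}
```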
func NewBlockAttnRes ¶
func NewBlockAttnRes[T tensor.Numeric](engine compute.Engine[T], ops numeric.Arithmetic[T], blockSize, modelDim int, epsilon T) (*BlockAttnRes[T], error)
NewBlockAttnRes creates a new BlockAttnRes layer.
Parameters:
- engine: the compute engine for all arithmetic
- ops: arithmetic operations for type T
- blockSize: number of layers per block (S)
- modelDim: hidden dimension size (for RMSNorm initialization)
- epsilon: small constant for RMSNorm numerical stability
func (*BlockAttnRes[T]) AttentionWeights ¶
func (b *BlockAttnRes[T]) AttentionWeights(
	ctx context.Context,
	query *tensor.TensorNumeric[T],
	blocks []*tensor.TensorNumeric[T],
	partialBlock *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error)
AttentionWeights computes and returns the softmax attention weights over blocks. This is useful for inspection/debugging. Returns weights [1, n] where n = len(blocks) + 1.
func (*BlockAttnRes[T]) BlockSize ¶
func (b *BlockAttnRes[T]) BlockSize() int
BlockSize returns the number of layers per block.
func (*BlockAttnRes[T]) Forward ¶
func (b *BlockAttnRes[T]) Forward(
	ctx context.Context,
	query *tensor.TensorNumeric[T],
	blocks []*tensor.TensorNumeric[T],
	partialBlock *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error)
Forward computes the block attention residual.
Parameters:
- ctx: context for cancellation
- query: current layer hidden state [dim] or [1, dim]
- blocks: completed block representations, each [dim] or [1, dim]
- partialBlock: sum of layer outputs in the current (incomplete) block, [dim] or [1, dim]
Returns the weighted combination of all block representations via softmax attention.
func (*BlockAttnRes[T]) Parameters ¶
func (b *BlockAttnRes[T]) Parameters() []*graph.Parameter[T]
Parameters returns the learnable parameters. The RMSNorm gain is initialized to ones and is not trained by BlockAttnRes, but is exposed for completeness.