residual

package
v1.45.0
Published: Apr 11, 2026 License: Apache-2.0 Imports: 8 Imported by: 0

Documentation

Overview

Package residual provides residual connection layers for neural networks.

Standard transformer architectures use additive residual connections: each layer's output is added to its input before being passed forward. While effective, this fixed scheme weights all previous layers equally, limiting the network's ability to route information across depth.

This package implements Attention Residuals (arXiv:2603.15031, Kimi Team, 2026), which replace fixed additive residuals with learned, softmax-weighted aggregation over depth. Each layer carries a small pseudo-query vector that attends over RMSNorm-projected keys from all preceding layer outputs, producing per-layer attention weights that dynamically control how much each earlier representation contributes to the current layer's input.

Two variants are provided:

  • AttnRes: Full Attention Residuals. Every layer attends over every previous layer output. This gives maximum expressiveness but requires O(L*d) memory to store all L layer outputs of dimension d.

  • BlockAttnRes: Block Attention Residuals. Layers are partitioned into N blocks of S layers each. Within a block, outputs are accumulated via standard addition. At block boundaries, softmax attention aggregates block-level representations. This reduces memory from O(L*d) to O(N*d) while recovering most of the benefit of full AttnRes. The paper shows that N=8 blocks recovers the majority of full AttnRes gains.
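As an illustration of the partitioning (plain Go, not part of this package's API), the mapping from layer index to block and the boundary test used later in the usage example look like:

```go
package main

import "fmt"

func main() {
	const numLayers, numBlocks = 32, 8 // N=8 blocks, as in the paper
	blockSize := numLayers / numBlocks // S = 4 layers per block

	for i := 0; i < numLayers; i++ {
		// A boundary is reached when a block of S layers has just completed.
		if i > 0 && i%blockSize == 0 {
			fmt.Printf("layer %d: block %d finalized\n", i, i/blockSize-1)
		}
	}
	// Residual state: full AttnRes stores numLayers vectors of dimension d,
	// BlockAttnRes stores only numBlocks block representations.
	fmt.Println(numLayers, "vectors vs", numBlocks, "block representations")
}
```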

Usage: AttnRes in a transformer layer loop

For full attention residuals, create one AttnRes per layer and collect all layer outputs:

// During graph construction:
var layerOutputs []*tensor.TensorNumeric[float32]
layerOutputs = append(layerOutputs, embedOutput) // layer 0 = embedding
var hidden *tensor.TensorNumeric[float32]

for i := 0; i < numLayers; i++ {
    ar, _ := residual.NewAttnRes[float32](
        fmt.Sprintf("layer_%d_attnres", i), engine, ops, modelDim,
    )
    // Aggregate all previous layer outputs via softmax attention.
    hidden, _ = ar.Forward(ctx, layerOutputs...)
    // Run attention + FFN on the aggregated hidden state.
    hidden = transformerLayer(ctx, hidden, i)
    layerOutputs = append(layerOutputs, hidden)
}

Usage: BlockAttnRes with block boundaries

For block attention residuals, accumulate within blocks and attend across block representations:

hidden := embedOutput      // current hidden state, starting from the embedding
blockSize := numLayers / 8 // N=8 blocks
bar, _ := residual.NewBlockAttnRes[float32](engine, ops, blockSize, modelDim, 1e-5)

var completedBlocks []*tensor.TensorNumeric[float32]
var blockAccum *tensor.TensorNumeric[float32] // running sum within current block

for i := 0; i < numLayers; i++ {
    // At each block boundary (except the first layer), finalize the block.
    if i > 0 && i%blockSize == 0 {
        completedBlocks = append(completedBlocks, blockAccum)
        blockAccum = nil
    }
    // Compute the attention-weighted residual over completed blocks
    // plus the current partial block (or the hidden state if none).
    partial := blockAccum
    if partial == nil {
        partial = hidden
    }
    hidden, _ = bar.Forward(ctx, hidden, completedBlocks, partial)
    // Run the transformer layer on the aggregated state.
    hidden = transformerLayer(ctx, hidden, i)
    // Accumulate the layer output into the current block.
    if blockAccum == nil {
        blockAccum = hidden
    } else {
        blockAccum, _ = engine.Add(ctx, blockAccum, hidden)
    }
}

GGUF metadata

Residual mode is configured via GGUF metadata keys (see inference package):

  • general.residual_mode: "standard" (default), "attnres", or "block_attnres"
  • general.attnres_blocks: number of blocks for block_attnres (default 8)
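A loader might dispatch on these keys roughly as follows. This is a hedged sketch: the metadata map type and the helper name residualModeFromMetadata are hypothetical, not the inference package's actual API; only the key names and defaults come from the documentation above.

```go
package main

import "fmt"

// residualModeFromMetadata is a hypothetical helper illustrating the
// documented keys and defaults; the real inference package may differ.
func residualModeFromMetadata(md map[string]interface{}) (mode string, blocks int) {
	mode = "standard" // default residual mode
	if v, ok := md["general.residual_mode"].(string); ok {
		mode = v
	}
	blocks = 8 // default number of blocks for block_attnres
	if v, ok := md["general.attnres_blocks"].(int); ok {
		blocks = v
	}
	return mode, blocks
}

func main() {
	md := map[string]interface{}{
		"general.residual_mode":  "block_attnres",
		"general.attnres_blocks": 4,
	}
	mode, blocks := residualModeFromMetadata(md)
	fmt.Println(mode, blocks) // block_attnres 4
}
```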

Stability

This package is experimental. The API may change as the approach matures.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type AttnRes

type AttnRes[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

AttnRes implements full Attention Residuals (arXiv:2603.15031). Each layer has a learned pseudo-query w_l that attends over all previous layer outputs via softmax attention, replacing fixed additive residuals.

Forward computes:

keys_i  = RMSNorm(layerOutput_i)   for each previous layer output
logit_i = dot(w_l, keys_i)         per-layer scalar logit
alpha   = softmax(logits)          attention weights over depth
h_l     = sum(alpha_i * layerOutput_i)

func NewAttnRes

func NewAttnRes[T tensor.Numeric](name string, engine compute.Engine[T], ops numeric.Arithmetic[T], modelDim int) (*AttnRes[T], error)

NewAttnRes creates a new AttnRes layer. modelDim is the hidden dimension of the model.

func (*AttnRes[T]) Attributes

func (a *AttnRes[T]) Attributes() map[string]interface{}

Attributes returns the attributes of the AttnRes layer.

func (*AttnRes[T]) Backward

func (a *AttnRes[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], _ ...*tensor.TensorNumeric[T]) ([]*tensor.TensorNumeric[T], error)

Backward computes the backward pass of the AttnRes layer.

func (*AttnRes[T]) Forward

func (a *AttnRes[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)

Forward computes the attention-weighted residual combination. The variadic inputs are the outputs of all previous layers, each with a shape compatible for a dot product with the query (typically [1, modelDim] or [modelDim]).

func (*AttnRes[T]) OpType

func (a *AttnRes[T]) OpType() string

OpType returns the operation type of the AttnRes layer.

func (*AttnRes[T]) OutputShape

func (a *AttnRes[T]) OutputShape() []int

OutputShape returns the output shape of the AttnRes layer.

func (*AttnRes[T]) Parameters

func (a *AttnRes[T]) Parameters() []*graph.Parameter[T]

Parameters returns the trainable parameters of the AttnRes layer.

type BlockAttnRes

type BlockAttnRes[T tensor.Numeric] struct {
	// contains filtered or unexported fields
}

BlockAttnRes implements Block Attention Residuals (arXiv:2603.15031). Partitions L layers into N blocks of S layers each. Intra-block: standard residual accumulation (sum of layer outputs). Inter-block: softmax attention over N block-level representations.

Forward implements Fig 2 from the paper:

  1. Stack block representations + partial block into value matrix V
  2. Apply RMSNorm to get keys K
  3. Compute logits = query^T * K (dot product)
  4. alpha = softmax(logits)
  5. h = sum(alpha_i * V_i)

func NewBlockAttnRes

func NewBlockAttnRes[T tensor.Numeric](engine compute.Engine[T], ops numeric.Arithmetic[T], blockSize, modelDim int, epsilon T) (*BlockAttnRes[T], error)

NewBlockAttnRes creates a new BlockAttnRes layer.

Parameters:

  • engine: the compute engine for all arithmetic
  • ops: arithmetic operations for type T
  • blockSize: number of layers per block (S)
  • modelDim: hidden dimension size (for RMSNorm initialization)
  • epsilon: small constant for RMSNorm numerical stability

func (*BlockAttnRes[T]) AttentionWeights

func (b *BlockAttnRes[T]) AttentionWeights(
	ctx context.Context,
	query *tensor.TensorNumeric[T],
	blocks []*tensor.TensorNumeric[T],
	partialBlock *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error)

AttentionWeights computes and returns the softmax attention weights over blocks. This is useful for inspection/debugging. Returns weights [1, n] where n = len(blocks) + 1.

func (*BlockAttnRes[T]) BlockSize

func (b *BlockAttnRes[T]) BlockSize() int

BlockSize returns the number of layers per block.

func (*BlockAttnRes[T]) Forward

func (b *BlockAttnRes[T]) Forward(
	ctx context.Context,
	query *tensor.TensorNumeric[T],
	blocks []*tensor.TensorNumeric[T],
	partialBlock *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error)

Forward computes the block attention residual.

Parameters:

  • ctx: context for cancellation
  • query: current layer hidden state [dim] or [1, dim]
  • blocks: completed block representations, each [dim] or [1, dim]
  • partialBlock: sum of layer outputs in the current (incomplete) block, [dim] or [1, dim]

Returns the weighted combination of all block representations via softmax attention.

func (*BlockAttnRes[T]) Parameters

func (b *BlockAttnRes[T]) Parameters() []*graph.Parameter[T]

Parameters returns the learnable parameters. The RMSNorm gain is initialized to ones and is not trained by BlockAttnRes, but it is exposed for completeness.
