Documentation ¶
Overview ¶
Package residual provides residual connection layers for neural networks.
Standard transformer architectures use additive residual connections: each layer's output is added to its input before being passed forward. While effective, this fixed scheme weights all previous layers equally, limiting the network's ability to route information across depth.
This package implements Attention Residuals (arXiv:2603.15031, Kimi Team, 2026), which replace fixed additive residuals with learned, softmax-weighted aggregation over depth. Each layer carries a small pseudo-query vector that attends over RMSNorm-projected keys from all preceding layer outputs, producing per-layer attention weights that dynamically control how much each earlier representation contributes to the current layer's input.
Two variants are provided:
AttnRes: Full Attention Residuals. Every layer attends over every previous layer output. This gives maximum expressiveness but requires O(L*d) memory to store all L layer outputs of dimension d.
BlockAttnRes: Block Attention Residuals. Layers are partitioned into N blocks of S layers each. Within a block, outputs are accumulated via standard addition. At block boundaries, softmax attention aggregates block-level representations. This reduces memory from O(L*d) to O(N*d) while recovering most of the benefit of full AttnRes. The paper shows that N=8 blocks recovers the majority of full AttnRes gains.
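To make the partitioning concrete, here is a small stand-alone sketch of the block arithmetic. The helper below is illustrative only, not part of this package: with L layers and N blocks of S = L/N layers, attention is applied at each block boundary and only N block representations need to be retained.

```go
package main

import "fmt"

// blockBoundaries is a hypothetical helper (not part of this package)
// showing the partitioning: numLayers layers split into numBlocks blocks
// of blockSize layers, with a boundary before each block after the first.
func blockBoundaries(numLayers, numBlocks int) (blockSize int, boundaries []int) {
	blockSize = numLayers / numBlocks
	for i := blockSize; i < numLayers; i += blockSize {
		boundaries = append(boundaries, i)
	}
	return blockSize, boundaries
}

func main() {
	size, bounds := blockBoundaries(32, 8)
	fmt.Println(size)   // 4
	fmt.Println(bounds) // [4 8 12 16 20 24 28]
}
```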
Usage: AttnRes in a transformer layer loop ¶
For full attention residuals, create one AttnRes per layer and collect all layer outputs:
// During graph construction (error handling elided for brevity):
hidden := embedOutput
layerOutputs := []*tensor.TensorNumeric[float32]{embedOutput} // layer 0 = embedding
for i := 0; i < numLayers; i++ {
	ar, _ := residual.NewAttnRes[float32](
		fmt.Sprintf("layer_%d_attnres", i), engine, ops, modelDim,
	)
	// Aggregate all previous layer outputs via softmax attention.
	hidden, _ = ar.Forward(ctx, layerOutputs...)
	// Run attention + FFN on the aggregated hidden state.
	hidden = transformerLayer(ctx, hidden, i)
	layerOutputs = append(layerOutputs, hidden)
}
Usage: BlockAttnRes with block boundaries ¶
For block attention residuals, accumulate within blocks and attend across block representations:
blockSize := numLayers / 8 // N=8 blocks
bar, _ := residual.NewBlockAttnRes[float32](engine, ops, blockSize, modelDim, 1e-5)
var completedBlocks []*tensor.TensorNumeric[float32]
var blockAccum *tensor.TensorNumeric[float32] // running sum within current block
hidden := embedOutput                         // start from the embedding output
for i := 0; i < numLayers; i++ {
	// At each block boundary (except the first layer), finalize the block.
	if i > 0 && i%blockSize == 0 {
		completedBlocks = append(completedBlocks, blockAccum)
		blockAccum = nil
	}
	// Compute the attention-weighted residual over completed blocks
	// plus the current (partial) block.
	partial := blockAccum
	if partial == nil {
		partial = hidden
	}
	hidden, _ = bar.Forward(ctx, hidden, completedBlocks, partial)
	// Run the transformer layer.
	hidden = transformerLayer(ctx, hidden, i)
	// Accumulate into the current block.
	if blockAccum == nil {
		blockAccum = hidden
	} else {
		blockAccum, _ = engine.Add(ctx, blockAccum, hidden)
	}
}
GGUF metadata ¶
Residual mode is configured via GGUF metadata keys (see inference package):
- general.residual_mode: "standard" (default), "attnres", or "block_attnres"
- general.attnres_blocks: number of blocks for block_attnres (default 8)
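For illustration, a loader might interpret these keys as follows. This is a hedged sketch: the metadata map shape, the helper name, and the fallback logic here are assumptions, not the inference package's actual code; only the key names and defaults come from the documentation above.

```go
package main

import "fmt"

// residualMode is a hypothetical sketch of reading the GGUF keys above
// from an already-parsed metadata map, applying the documented defaults
// when a key is absent.
func residualMode(meta map[string]any) (mode string, blocks int) {
	mode = "standard" // default residual mode
	if m, ok := meta["general.residual_mode"].(string); ok {
		mode = m
	}
	blocks = 8 // default block count for block_attnres
	if n, ok := meta["general.attnres_blocks"].(int); ok {
		blocks = n
	}
	return mode, blocks
}

func main() {
	mode, blocks := residualMode(map[string]any{
		"general.residual_mode": "block_attnres",
	})
	fmt.Println(mode, blocks) // block_attnres 8
}
```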
Stability ¶
This package is experimental. The API may change as the approach matures.
Index ¶
- type AttnRes
- func (a *AttnRes[T]) Attributes() map[string]interface{}
- func (a *AttnRes[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], ...) ([]*tensor.TensorNumeric[T], error)
- func (a *AttnRes[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- func (a *AttnRes[T]) OpType() string
- func (a *AttnRes[T]) OutputShape() []int
- func (a *AttnRes[T]) Parameters() []*graph.Parameter[T]
- type BlockAttnRes
- func (b *BlockAttnRes[T]) AttentionWeights(ctx context.Context, query *tensor.TensorNumeric[T], ...) (*tensor.TensorNumeric[T], error)
- func (b *BlockAttnRes[T]) BlockSize() int
- func (b *BlockAttnRes[T]) Forward(ctx context.Context, query *tensor.TensorNumeric[T], ...) (*tensor.TensorNumeric[T], error)
- func (b *BlockAttnRes[T]) Parameters() []*graph.Parameter[T]
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type AttnRes ¶
AttnRes implements full Attention Residuals (arXiv:2603.15031). Each layer has a learned pseudo-query w_l that attends over all previous layer outputs via softmax attention, replacing fixed additive residuals.
Forward computes:
keys_i  = RMSNorm(layerOutput_i)        // for each previous layer output
logit_i = dot(w_l, keys_i)              // per-layer scalar logit
alpha   = softmax(logits)               // attention weights over depth
h_l     = sum(alpha_i * layerOutput_i)
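As a concrete illustration, the same computation can be written on plain float64 slices. This is a minimal sketch, not the package's tensor-based implementation: the learned RMSNorm gain is omitted, and the helper names are invented for this example.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm normalizes x by its root-mean-square (learned gain omitted).
func rmsNorm(x []float64, eps float64) []float64 {
	var ss float64
	for _, v := range x {
		ss += v * v
	}
	scale := 1 / math.Sqrt(ss/float64(len(x))+eps)
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = v * scale
	}
	return out
}

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// attnResForward aggregates previous layer outputs using softmax weights
// computed from a pseudo-query w against RMSNorm'd keys.
func attnResForward(w []float64, layerOutputs [][]float64) []float64 {
	logits := make([]float64, len(layerOutputs))
	for i, h := range layerOutputs {
		logits[i] = dot(w, rmsNorm(h, 1e-5)) // keys_i = RMSNorm(layerOutput_i)
	}
	// alpha = softmax(logits), stabilized by subtracting the max logit.
	maxL := logits[0]
	for _, l := range logits {
		if l > maxL {
			maxL = l
		}
	}
	alpha := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		alpha[i] = math.Exp(l - maxL)
		sum += alpha[i]
	}
	// h_l = sum(alpha_i * layerOutput_i)
	out := make([]float64, len(w))
	for i, h := range layerOutputs {
		alpha[i] /= sum
		for j, v := range h {
			out[j] += alpha[i] * v
		}
	}
	return out
}

func main() {
	w := []float64{1, 0}
	outs := [][]float64{{1, 2}, {3, 4}}
	fmt.Println(attnResForward(w, outs))
}
```

Because the weights are a softmax, the result is a convex combination of the layer outputs: each output coordinate lies between the smallest and largest corresponding input coordinates.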
func NewAttnRes ¶
func NewAttnRes[T tensor.Numeric](name string, engine compute.Engine[T], ops numeric.Arithmetic[T], modelDim int) (*AttnRes[T], error)
NewAttnRes creates a new AttnRes layer. modelDim is the hidden dimension of the model.
func (*AttnRes[T]) Attributes ¶
Attributes returns the attributes of the AttnRes layer.
func (*AttnRes[T]) Backward ¶
func (a *AttnRes[T]) Backward(_ context.Context, _ types.BackwardMode, _ *tensor.TensorNumeric[T], _ ...*tensor.TensorNumeric[T]) ([]*tensor.TensorNumeric[T], error)
Backward computes the backward pass of the AttnRes layer.
func (*AttnRes[T]) Forward ¶
func (a *AttnRes[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
Forward computes the attention-weighted residual combination. inputs contains the outputs of all previous layers, each with a shape compatible for a dot product with the query (typically [1, modelDim] or [modelDim]).
func (*AttnRes[T]) OutputShape ¶
OutputShape returns the output shape of the AttnRes layer.
func (*AttnRes[T]) Parameters ¶
Parameters returns the trainable parameters of the AttnRes layer.
type BlockAttnRes ¶
BlockAttnRes implements Block Attention Residuals (arXiv:2603.15031). Partitions L layers into N blocks of S layers each. Intra-block: standard residual accumulation (sum of layer outputs). Inter-block: softmax attention over N block-level representations.
Forward implements Fig 2 from the paper:
- Stack block representations + partial block into value matrix V
- Apply RMSNorm to get keys K
- Compute logits = query^T * K (dot product)
- alpha = softmax(logits)
- h = sum(alpha_i * V_i)
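The steps above can be sketched on plain float64 slices. This is an illustrative stand-in, not the package implementation: for brevity the keys here are the raw values rather than RMSNorm-projected ones, and the function name is invented. It shows the essential bookkeeping, namely that the current partial block joins the completed blocks as the last row of V, so softmax runs over n = len(blocks)+1 entries.

```go
package main

import (
	"fmt"
	"math"
)

// blockAttn stacks completed blocks plus the partial block into V,
// computes logits = query^T * V_i, softmaxes them, and returns the
// weighted sum along with the attention weights.
func blockAttn(query []float64, blocks [][]float64, partial []float64) ([]float64, []float64) {
	// V = completed blocks + current partial block as the last row.
	V := append(append([][]float64{}, blocks...), partial)
	logits := make([]float64, len(V))
	for i, v := range V {
		for j := range query {
			logits[i] += query[j] * v[j] // logits = query^T * K
		}
	}
	// alpha = softmax(logits), stabilized by subtracting the max logit.
	maxL := logits[0]
	for _, l := range logits {
		if l > maxL {
			maxL = l
		}
	}
	alpha := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		alpha[i] = math.Exp(l - maxL)
		sum += alpha[i]
	}
	// h = sum(alpha_i * V_i)
	out := make([]float64, len(query))
	for i := range alpha {
		alpha[i] /= sum
		for j, v := range V[i] {
			out[j] += alpha[i] * v
		}
	}
	return out, alpha
}

func main() {
	out, alpha := blockAttn(
		[]float64{1, 0},
		[][]float64{{2, 0}, {0, 2}},
		[]float64{1, 1},
	)
	fmt.Println(len(alpha), len(out)) // 3 2
}
```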
func NewBlockAttnRes ¶
func NewBlockAttnRes[T tensor.Numeric](engine compute.Engine[T], ops numeric.Arithmetic[T], blockSize, modelDim int, epsilon T) (*BlockAttnRes[T], error)
NewBlockAttnRes creates a new BlockAttnRes layer.
Parameters:
- engine: the compute engine for all arithmetic
- ops: arithmetic operations for type T
- blockSize: number of layers per block (S)
- modelDim: hidden dimension size (for RMSNorm initialization)
- epsilon: small constant for RMSNorm numerical stability
func (*BlockAttnRes[T]) AttentionWeights ¶
func (b *BlockAttnRes[T]) AttentionWeights(
	ctx context.Context,
	query *tensor.TensorNumeric[T],
	blocks []*tensor.TensorNumeric[T],
	partialBlock *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error)
AttentionWeights computes and returns the softmax attention weights over blocks. This is useful for inspection/debugging. Returns weights [1, n] where n = len(blocks) + 1.
func (*BlockAttnRes[T]) BlockSize ¶
func (b *BlockAttnRes[T]) BlockSize() int
BlockSize returns the number of layers per block.
func (*BlockAttnRes[T]) Forward ¶
func (b *BlockAttnRes[T]) Forward(
	ctx context.Context,
	query *tensor.TensorNumeric[T],
	blocks []*tensor.TensorNumeric[T],
	partialBlock *tensor.TensorNumeric[T],
) (*tensor.TensorNumeric[T], error)
Forward computes the block attention residual.
Parameters:
- ctx: context for cancellation
- query: current layer hidden state [dim] or [1, dim]
- blocks: completed block representations, each [dim] or [1, dim]
- partialBlock: sum of layer outputs in the current (incomplete) block, [dim] or [1, dim]
Returns the weighted combination of all block representations via softmax attention.
func (*BlockAttnRes[T]) Parameters ¶
func (b *BlockAttnRes[T]) Parameters() []*graph.Parameter[T]
Parameters returns the learnable parameters. The RMSNorm gain is initialized to ones and is not trained by BlockAttnRes, but is exposed for completeness.