blake3asm

package
v0.1.2 Latest
Warning

This package is not in the latest version of its module.

Published: May 22, 2026 License: MIT Imports: 4 Imported by: 0

Documentation

Overview

Package blake3asm holds the AVX-512 + VL fused chain-absorb kernel implementation of BLAKE3 for the parent hashes/ package. The chain kernels are specialised at three input widths (20 / 36 / 68 bytes — covering the ITB SetNonceBits 128 / 256 / 512 buf shapes) and hold BLAKE3 state in ZMM registers across the absorb rounds, eliminating the per-round memory round-trip taken by the upstream BLAKE3 path.

The 4-pixel-batched lane-parallel layout matches blake2{b,s}asm — 16 ZMMs hold v[0..15] across all rounds (lanes 0..3 active in dword positions 0..3), 16 more ZMMs hold m[0..15]. The round body uses VPADDD/VPXORD/VPRORD with the BLAKE3-specific 7-round message schedule. Below the AVX-512 + VL tier the parent package falls through to github.com/zeebo/blake3, which already carries its own hand-written AVX-512 assembly for the compression — so the realistic uplift target here is amortising per-call overhead across 4 lanes rather than competing with kernel-internal SIMD work.

Reference layout: github.com/saucecontrol/Blake2Fast (MIT) — specifically src/Blake2Fast/Blake3/Blake3Scalar.g.cs for the round structure and message schedule, and Blake3HashState.cs for the flag set (CHUNK_START / CHUNK_END / ROOT / KEYED_HASH).

Constants

const (
	FlagChunkStart = 0x01
	FlagChunkEnd   = 0x02
	FlagRoot       = 0x08
	FlagKeyedHash  = 0x10
)

BLAKE3 domain-separation flags (RFC §2.1).

FlagsSingleBlock (FlagKeyedHash | FlagChunkStart | FlagChunkEnd | FlagRoot, i.e. 0x1B) is set by the 20- and 36-byte kernels, where the keyed-hash chunk is a single block that is simultaneously chunk start, chunk end, and root.

const FlagsTwoBlockFinal = FlagKeyedHash | FlagChunkEnd | FlagRoot

FlagsTwoBlockFinal — set by block 2 of the 68-byte kernel (chunk end and root, but no longer chunk start).

const FlagsTwoBlockFirst = FlagKeyedHash | FlagChunkStart

FlagsTwoBlockFirst — set by block 1 of the 68-byte kernel (chunk start, but neither chunk end nor root yet).
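
As a quick sanity check on the three combined flag sets, the sketch below recomputes them from the four base flag values listed above. The FlagsSingleBlock composition is taken from the description in the text (keyed hash, chunk start, chunk end, root); the rest mirrors the declarations shown on this page.

```go
package main

import "fmt"

// Base flag values, as listed in the constants section.
const (
	FlagChunkStart = 0x01
	FlagChunkEnd   = 0x02
	FlagRoot       = 0x08
	FlagKeyedHash  = 0x10
)

// Combined flag sets. FlagsSingleBlock is composed from the prose
// description; the two-block sets match the declarations on the page.
const (
	FlagsSingleBlock   = FlagKeyedHash | FlagChunkStart | FlagChunkEnd | FlagRoot
	FlagsTwoBlockFirst = FlagKeyedHash | FlagChunkStart
	FlagsTwoBlockFinal = FlagKeyedHash | FlagChunkEnd | FlagRoot
)

func main() {
	fmt.Printf("single=%#x first=%#x final=%#x\n",
		FlagsSingleBlock, FlagsTwoBlockFirst, FlagsTwoBlockFinal)
	// prints: single=0x1b first=0x11 final=0x1a
}
```

Note that the two-block sets together cover exactly the bits of the single-block set: the first block carries chunk start, the final block carries chunk end and root, and both carry keyed hash.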

Variables

var Blake3IV = [8]uint32{
	0x6a09e667,
	0xbb67ae85,
	0x3c6ef372,
	0xa54ff53a,
	0x510e527f,
	0x9b05688c,
	0x1f83d9ab,
	0x5be0cd19,
}

Blake3IV is the BLAKE3 initialization vector (RFC §2.4). The constants are bit-identical to the BLAKE2s IV (and the SHA-256 IV). BLAKE3 reuses the first four words, Blake3IV[0..3], for v[8..11] of the compression state init; the slots that would hold the remaining four IV words, v[12..15], are instead set to (t_lo, t_hi, block_len, flags) per BLAKE3's compression contract.

Unlike BLAKE2{b,s}, BLAKE3 has no parameter-block XOR'd into h[0] — the digest length is not encoded in the chaining value at init time. So no Blake3IV256Param companion variable is needed.
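
The state layout described above can be sketched in scalar Go as follows. This is an illustration only, not the package's code: initState and the lowercase iv variable are hypothetical names, and the kernel itself holds these words in ZMM registers rather than in an array.

```go
package main

import "fmt"

// iv mirrors Blake3IV from this page.
var iv = [8]uint32{
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
	0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
}

// initState (hypothetical helper) builds the 16-word compression state:
// v[0..7] is the chaining value (the key, in keyed-hash mode),
// v[8..11] is IV[0..3], and v[12..15] is (t_lo, t_hi, block_len, flags).
func initState(cv [8]uint32, counter uint64, blockLen, flags uint32) [16]uint32 {
	var v [16]uint32
	copy(v[0:8], cv[:])
	copy(v[8:12], iv[0:4])
	v[12] = uint32(counter)       // t_lo
	v[13] = uint32(counter >> 32) // t_hi
	v[14] = blockLen
	v[15] = flags
	return v
}

func main() {
	key := [8]uint32{1, 2, 3, 4, 5, 6, 7, 8} // placeholder key words
	v := initState(key, 0, 32, 0x1B)
	fmt.Printf("%08x\n", v)
}
```

No digest-length parameter appears anywhere in this layout, which is why no parameter-block companion to the IV is needed.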

var HasAVX512Fused = cpu.X86.HasAVX512F

HasAVX512Fused reports whether the runtime CPU supports the fused AVX-512 + VL chain-absorb kernels. Same derivation as blake2{b,s}asm — only AVX-512F is needed at the CPUID-bit level (VPRORD is AVX-512F + VL, but on every shipping silicon where AVX-512F is present the rest of the AVX-512 baseline ships with it).

Functions

func Blake3256ChainAbsorb20x4

func Blake3256ChainAbsorb20x4(
	key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)

Blake3256ChainAbsorb20x4 is the public 4-pixel-batched entry point for the BLAKE3-256 chain-absorb at the 20-byte data shape (ITB SetNonceBits(128) buf shape).

On amd64 + AVX-512 + VL hosts (HasAVX512Fused == true), dispatches to the fused ZMM-batched ASM kernel; otherwise falls through to the scalar batched reference path (which delegates to upstream github.com/zeebo/blake3).

Buffer construction is identical between the two paths and matches the bit-exact behaviour of the existing hashes.BLAKE3 closure applied to each of the four pixel inputs:

per lane:
  mixed[0:20]  = data[lane]            (per-lane, 20 bytes)
  mixed[20:32] = zero pad
  then for i in 0..3:
    mixed[i*8 : i*8+8] ^= seeds[lane][i]   (LE uint64; straddles
                                            two BLAKE3 message
                                            words m[2i], m[2i+1])

The keyed-hash mode key (32 bytes, shared across all 4 lanes) is consumed by BLAKE3's state init (v[0..7] = KEY broadcast) and is NOT written into the mixed buffer. Each lane then runs one BLAKE3 compression with block_len=32 and flags=0x1B (KEYED_HASH|CHUNK_START|CHUNK_END|ROOT).
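
The per-lane buffer recipe above can be written out in scalar Go roughly as below. This is a sketch under stated assumptions: buildMixed20 and packWords are hypothetical helper names, and the real kernel performs the equivalent packing directly into ZMM message registers for four lanes at once.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildMixed20 (hypothetical) follows the recipe for one lane:
// 20 data bytes, zero pad to 32, then XOR four little-endian
// uint64 seeds across the full 32 bytes.
func buildMixed20(data *[20]byte, seeds *[4]uint64) [32]byte {
	var mixed [32]byte
	copy(mixed[:20], data[:])
	for i, s := range seeds {
		off := i * 8
		w := binary.LittleEndian.Uint64(mixed[off:off+8]) ^ s
		binary.LittleEndian.PutUint64(mixed[off:off+8], w)
	}
	return mixed
}

// packWords (hypothetical) loads a block of up to 64 bytes into the
// 16 little-endian message words; words past the block stay zero.
func packWords(block []byte) [16]uint32 {
	var padded [64]byte
	copy(padded[:], block)
	var m [16]uint32
	for i := range m {
		m[i] = binary.LittleEndian.Uint32(padded[i*4:])
	}
	return m
}

func main() {
	data := [20]byte{0xFF}
	seeds := [4]uint64{0x01, 0, 0, 1 << 63}
	mixed := buildMixed20(&data, &seeds)
	m := packWords(mixed[:])
	// seeds[i] lands in m[2i] (low half) and m[2i+1] (high half).
	fmt.Printf("m[0]=%#x m[7]=%#x\n", m[0], m[7])
}
```

Because the seed XOR runs over all 32 bytes, the zero-pad region mixed[20:32] ends up carrying seed bits, which is why block_len is 32 rather than 20.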

func Blake3256ChainAbsorb36x4

func Blake3256ChainAbsorb36x4(
	key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)

Blake3256ChainAbsorb36x4 — 36-byte BLAKE3-256 batched dispatcher (ITB SetNonceBits(256) buf shape). Single compression block per lane (mixed=36 ≤ 64-byte BLAKE3 block size); same flag set as the 20-byte case but block_len=36 and the m-pack covers 9 dwords of data (m[0..8]) instead of 5.

func Blake3256ChainAbsorb68x4

func Blake3256ChainAbsorb68x4(
	key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)

Blake3256ChainAbsorb68x4 — 68-byte BLAKE3-256 batched dispatcher (ITB SetNonceBits(512) buf shape). Two compression blocks per lane (mixed=68 > 64):

Block 1 (block_len=64, flags=KEYED_HASH|CHUNK_START):
    m[0..7]  = data[0:32] ⊕ seed
    m[8..15] = data[32:64]
    Output cv1 = v[k] ⊕ v[k+8] (k in 0..7) becomes block 2's v[0..7].

Block 2 (block_len=4, flags=KEYED_HASH|CHUNK_END|ROOT):
    m[0]     = data[64:68]
    m[1..15] = 0
    Final out[k] = v[k] ⊕ v[k+8] (no ⊕ chaining_value; BLAKE3's
                                   output mixing differs from
                                   BLAKE2's h[k] ⊕ v[k] ⊕ v[k+8]).
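
The two-block flow can be sketched for a single lane in scalar Go as below. This is a reference-style illustration, not the package's AVX-512 kernel: the names g, compress, absorb68, and msgPerm are hypothetical, and the quarter-round and message permutation are taken from the BLAKE3 specification rather than from this page. The seed XOR is assumed to have been applied to the 68-byte buffer already.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	flagChunkStart = 0x01
	flagChunkEnd   = 0x02
	flagRoot       = 0x08
	flagKeyedHash  = 0x10
)

var iv = [8]uint32{
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
	0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
}

// msgPerm is the BLAKE3 message permutation applied between rounds.
var msgPerm = [16]int{2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8}

// g is the BLAKE3 quarter-round (rotations 16, 12, 8, 7).
func g(v *[16]uint32, a, b, c, d int, mx, my uint32) {
	v[a] += v[b] + mx
	v[d] = bits.RotateLeft32(v[d]^v[a], -16)
	v[c] += v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -12)
	v[a] += v[b] + my
	v[d] = bits.RotateLeft32(v[d]^v[a], -8)
	v[c] += v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -7)
}

// compress runs 7 rounds and returns v[k] ^ v[k+8] for k in 0..7.
func compress(cv [8]uint32, m [16]uint32, counter uint64, blockLen, flags uint32) [8]uint32 {
	v := [16]uint32{
		cv[0], cv[1], cv[2], cv[3], cv[4], cv[5], cv[6], cv[7],
		iv[0], iv[1], iv[2], iv[3],
		uint32(counter), uint32(counter >> 32), blockLen, flags,
	}
	for r := 0; r < 7; r++ {
		g(&v, 0, 4, 8, 12, m[0], m[1]) // columns
		g(&v, 1, 5, 9, 13, m[2], m[3])
		g(&v, 2, 6, 10, 14, m[4], m[5])
		g(&v, 3, 7, 11, 15, m[6], m[7])
		g(&v, 0, 5, 10, 15, m[8], m[9]) // diagonals
		g(&v, 1, 6, 11, 12, m[10], m[11])
		g(&v, 2, 7, 8, 13, m[12], m[13])
		g(&v, 3, 4, 9, 14, m[14], m[15])
		var p [16]uint32
		for i := range p {
			p[i] = m[msgPerm[i]]
		}
		m = p
	}
	var out [8]uint32
	for k := range out {
		out[k] = v[k] ^ v[k+8]
	}
	return out
}

// absorb68 hashes one lane's 68-byte mixed buffer in two blocks.
func absorb68(key [8]uint32, mixed *[68]byte) [8]uint32 {
	var m1, m2 [16]uint32
	for i := range m1 {
		m1[i] = binary.LittleEndian.Uint32(mixed[i*4:])
	}
	m2[0] = binary.LittleEndian.Uint32(mixed[64:]) // m2[1..15] stay zero
	cv1 := compress(key, m1, 0, 64, flagKeyedHash|flagChunkStart)
	return compress(cv1, m2, 0, 4, flagKeyedHash|flagChunkEnd|flagRoot)
}

func main() {
	var mixed [68]byte // all-zero placeholder input
	fmt.Printf("%08x\n", absorb68(iv, &mixed))
}
```

The chunk counter stays at 0 for both blocks because they belong to the same chunk; only block_len and the flags change between the two compressions.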

Types

This section is empty.
