Documentation
Overview
Package blake3asm holds the AVX-512 + VL fused chain-absorb kernel implementation of BLAKE3 for the parent hashes/ package. The chain kernels are specialised at three input widths (20 / 36 / 68 bytes — covering the ITB SetNonceBits 128 / 256 / 512 buf shapes) and hold BLAKE3 state in ZMM registers across the absorb rounds, eliminating the per-round memory round-trip taken by the upstream BLAKE3 path.
The 4-pixel-batched lane-parallel layout matches blake2{b,s}asm: 16 ZMMs hold v[0..15] across all rounds (lanes 0..3 active in dword positions 0..3), and 16 more ZMMs hold m[0..15]. The round body uses VPADDD/VPXORD/VPRORD with the BLAKE3-specific 7-round message schedule. Below the AVX-512 + VL tier the parent package falls through to github.com/zeebo/blake3, which already carries its own hand-written AVX2 and SSE4.1 assembly for the compression, so the realistic uplift target here is amortising per-call overhead across 4 lanes rather than competing with kernel-internal SIMD work.
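The word-major layout described above can be modelled in plain Go. This is a minimal sketch, not the kernel itself: the `zmm` type, `transpose`, and `addXorRor` are hypothetical names introduced here, and each 512-bit register is modelled as 16 dwords with only positions 0..3 carrying lane data.

```go
package main

import (
	"fmt"
	"math/bits"
)

// zmm models one 512-bit register as 16 dwords (hypothetical helper type).
// In the layout described above only dword positions 0..3 carry lane data.
type zmm [16]uint32

// transpose turns four per-lane 16-word arrays into 16 word-major registers,
// so that reg[j][lane] == lanes[lane][j].
func transpose(lanes *[4][16]uint32) (reg [16]zmm) {
	for j := 0; j < 16; j++ {
		for lane := 0; lane < 4; lane++ {
			reg[j][lane] = lanes[lane][j]
		}
	}
	return reg
}

// addXorRor models one add / xor / rotate-right step applied to every dword
// position at once: a += b; d = ror(d ^ a, n).
func addXorRor(a, b, d *zmm, n int) {
	for i := 0; i < 16; i++ {
		a[i] += b[i]
		d[i] = bits.RotateLeft32(d[i]^a[i], -n)
	}
}

func main() {
	var lanes [4][16]uint32
	for lane := range lanes {
		for j := range lanes[lane] {
			lanes[lane][j] = uint32(lane*100 + j)
		}
	}
	m := transpose(&lanes)
	// Register m[5] holds message word 5 of lanes 0..3 in positions 0..3.
	fmt.Println(m[5][:4])
}
```

Because every register holds the same state or message word for all four lanes, one vector instruction advances four independent hashes by one step.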
Reference layout: github.com/saucecontrol/Blake2Fast (MIT) — specifically src/Blake2Fast/Blake3/Blake3Scalar.g.cs for the round structure and message schedule, and Blake3HashState.cs for the flag set (CHUNK_START / CHUNK_END / ROOT / KEYED_HASH).
Index
- Constants
- Variables
- func Blake3256ChainAbsorb20x4(key *[32]byte, seeds *[4][4]uint64, dataPtrs *[4]*byte, out *[4][8]uint32)
- func Blake3256ChainAbsorb36x4(key *[32]byte, seeds *[4][4]uint64, dataPtrs *[4]*byte, out *[4][8]uint32)
- func Blake3256ChainAbsorb68x4(key *[32]byte, seeds *[4][4]uint64, dataPtrs *[4]*byte, out *[4][8]uint32)
Constants
const (
	FlagChunkStart = 0x01
	FlagChunkEnd   = 0x02
	FlagRoot       = 0x08
	FlagKeyedHash  = 0x10
)
BLAKE3 domain-separation flags (RFC §2.1).
const FlagsSingleBlock = FlagKeyedHash | FlagChunkStart | FlagChunkEnd | FlagRoot
FlagsSingleBlock — set by the 20- and 36-byte kernels (where the keyed-hash chunk is a single block, and that block is simultaneously chunk-start, chunk-end, and root).
const FlagsTwoBlockFinal = FlagKeyedHash | FlagChunkEnd | FlagRoot
FlagsTwoBlockFinal — set by block 2 of the 68-byte kernel (chunk end and root, but no longer chunk start).
const FlagsTwoBlockFirst = FlagKeyedHash | FlagChunkStart
FlagsTwoBlockFirst — set by block 1 of the 68-byte kernel (chunk start, but neither chunk end nor root yet).
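As a quick check of how the three composite flag words resolve numerically, the following minimal sketch re-declares the constants locally (so it does not depend on this package's import path) and prints each combination.

```go
package main

import "fmt"

// Local copies of the documented flag constants, re-declared so the sketch
// compiles without importing blake3asm.
const (
	FlagChunkStart = 0x01
	FlagChunkEnd   = 0x02
	FlagRoot       = 0x08
	FlagKeyedHash  = 0x10

	FlagsSingleBlock   = FlagKeyedHash | FlagChunkStart | FlagChunkEnd | FlagRoot
	FlagsTwoBlockFirst = FlagKeyedHash | FlagChunkStart
	FlagsTwoBlockFinal = FlagKeyedHash | FlagChunkEnd | FlagRoot
)

func main() {
	fmt.Printf("FlagsSingleBlock   = %#x\n", FlagsSingleBlock)   // 0x1b
	fmt.Printf("FlagsTwoBlockFirst = %#x\n", FlagsTwoBlockFirst) // 0x11
	fmt.Printf("FlagsTwoBlockFinal = %#x\n", FlagsTwoBlockFinal) // 0x1a
}
```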
Variables
var Blake3IV = [8]uint32{
0x6a09e667,
0xbb67ae85,
0x3c6ef372,
0xa54ff53a,
0x510e527f,
0x9b05688c,
0x1f83d9ab,
0x5be0cd19,
}
Blake3IV is the BLAKE3 initialization vector (RFC §2.4). The constants are bit-identical to the BLAKE2s IV (and the SHA-256 IV). BLAKE3 loads only the first four words, IV[0..3], into v[8..11] of the compression state; v[12..15], where BLAKE2s would place the remaining four IV words, are filled with (t_lo, t_hi, block_len, flags) per BLAKE3's compression contract.
Unlike BLAKE2{b,s}, BLAKE3 has no parameter-block XOR'd into h[0] — the digest length is not encoded in the chaining value at init time. So no Blake3IV256Param companion variable is needed.
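The state layout described above can be written out as a short scalar sketch. `initState` is a hypothetical helper name used only for illustration; it shows where the chaining value (the key words, in keyed-hash mode), the four IV words, the counter, the block length, and the flags land in the 16-word state.

```go
package main

import "fmt"

// blake3IV is a local copy of the documented Blake3IV values.
var blake3IV = [8]uint32{
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
	0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
}

// initState builds the 16-word compression state for one lane.
// cv is the input chaining value (the key words for the first block in
// keyed-hash mode); counter is split into t_lo and t_hi.
func initState(cv [8]uint32, counter uint64, blockLen, flags uint32) [16]uint32 {
	var v [16]uint32
	copy(v[0:8], cv[:])          // v[0..7]  = chaining value / key
	copy(v[8:12], blake3IV[0:4]) // v[8..11] = IV[0..3]
	v[12] = uint32(counter)      // t_lo
	v[13] = uint32(counter >> 32) // t_hi
	v[14] = blockLen
	v[15] = flags
	return v
}

func main() {
	var key [8]uint32 // all-zero key words, for illustration only
	v := initState(key, 0, 32, 0x1B)
	fmt.Printf("%08x\n", v[8:])
}
```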
var HasAVX512Fused = cpu.X86.HasAVX512F
HasAVX512Fused reports whether the runtime CPU supports the fused AVX-512 + VL chain-absorb kernels. Same derivation as blake2{b,s}asm — only AVX-512F is needed at the CPUID-bit level (VPRORD is AVX-512F + VL, but on every shipping silicon where AVX-512F is present the rest of the AVX-512 baseline ships with it).
Functions
func Blake3256ChainAbsorb20x4
func Blake3256ChainAbsorb20x4(
	key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)
Blake3256ChainAbsorb20x4 is the public 4-pixel-batched entry point for the BLAKE3-256 chain-absorb at the 20-byte data shape (ITB SetNonceBits(128) buf shape).
On amd64 + AVX-512 + VL hosts (HasAVX512Fused == true), dispatches to the fused ZMM-batched ASM kernel; otherwise falls through to the scalar batched reference path (which delegates to upstream github.com/zeebo/blake3).
Buffer construction is identical between the two paths and matches the bit-exact behaviour of the existing hashes.BLAKE3 closure applied to each of the four pixel inputs:
per lane:
    mixed[0:20]  = data[lane]   (per-lane, 20 bytes)
    mixed[20:32] = zero pad
    then for i in 0..3:
        mixed[i*8 : i*8+8] ^= seeds[lane][i]   (LE uint64; straddles two BLAKE3 message words m[2i], m[2i+1])
The keyed-hash mode key (32 bytes shared across all 4 lanes) is consumed by BLAKE3's state init (v[0..7] = KEY broadcast), NOT written into the mixed buffer. Each lane then runs one BLAKE3 compression with block_len=32 and flags=0x1B (KEYED_HASH|CHUNK_START|CHUNK_END|ROOT).
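The per-lane buffer construction above translates directly into scalar Go. This is a sketch for a single lane under the documented rules; `buildMixed20` is a hypothetical name, not a function exported by this package.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildMixed20 constructs the 32-byte mixed buffer for one lane:
// 20 data bytes, 12 zero-pad bytes, then each 8-byte group XORed with the
// matching seed word interpreted as a little-endian uint64.
func buildMixed20(data *[20]byte, seeds *[4]uint64) [32]byte {
	var mixed [32]byte
	copy(mixed[0:20], data[:]) // mixed[20:32] stays zero
	for i := 0; i < 4; i++ {
		w := binary.LittleEndian.Uint64(mixed[i*8:])
		binary.LittleEndian.PutUint64(mixed[i*8:], w^seeds[i])
	}
	return mixed
}

func main() {
	data := [20]byte{0xde, 0xad, 0xbe, 0xef}
	seeds := [4]uint64{1, 2, 3, 4}
	mixed := buildMixed20(&data, &seeds)
	fmt.Printf("%x\n", mixed)
}
```

Note that the seed XOR also covers the zero-padded tail, so mixed[20:32] generally ends up non-zero.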
func Blake3256ChainAbsorb36x4
func Blake3256ChainAbsorb36x4(
	key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)
Blake3256ChainAbsorb36x4 — 36-byte BLAKE3-256 batched dispatcher (ITB SetNonceBits(256) buf shape). Single compression block per lane (mixed=36 ≤ 64-byte BLAKE3 block size); same flag set as the 20-byte case but block_len=36 and the m-pack covers 9 dwords of data (m[0..8]) instead of 5.
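A scalar sketch of the 36-byte message packing follows. It assumes the seed XOR covers the first 32 bytes exactly as in the 20-byte and 68-byte descriptions, which the text for this kernel does not spell out; `packWords36` is a hypothetical helper name.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// packWords36 builds the 16 message words for one lane of the 36-byte shape.
// Assumption: seeds are XORed into bytes 0..31 as little-endian uint64s,
// mirroring the other two kernels. Words m[9..15] are zero padding.
func packWords36(data *[36]byte, seeds *[4]uint64) (m [16]uint32, blockLen uint32) {
	var block [64]byte // one BLAKE3 block, zero-padded past byte 36
	copy(block[:36], data[:])
	for i := 0; i < 4; i++ {
		w := binary.LittleEndian.Uint64(block[i*8:])
		binary.LittleEndian.PutUint64(block[i*8:], w^seeds[i])
	}
	for j := range m {
		m[j] = binary.LittleEndian.Uint32(block[j*4:])
	}
	return m, 36
}

func main() {
	var data [36]byte
	for i := range data {
		data[i] = byte(i)
	}
	seeds := [4]uint64{}
	m, n := packWords36(&data, &seeds)
	fmt.Printf("block_len=%d m[0..8]=%08x\n", n, m[:9])
}
```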
func Blake3256ChainAbsorb68x4
func Blake3256ChainAbsorb68x4(
	key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)
Blake3256ChainAbsorb68x4 — 68-byte BLAKE3-256 batched dispatcher (ITB SetNonceBits(512) buf shape). Two compression blocks per lane (mixed=68 > 64):
Block 1 (block_len=64, flags=KEYED_HASH|CHUNK_START):
    m[0..7]  = data[0:32] ⊕ seed
    m[8..15] = data[32:64]
    Output cv1 = v[k] ⊕ v[k+8] (k in 0..7) becomes block 2's v[0..7].
Block 2 (block_len=4, flags=KEYED_HASH|CHUNK_END|ROOT):
    m[0]     = data[64:68]
    m[1..15] = 0
    Final out[k] = v[k] ⊕ v[k+8] (no ⊕ chaining_value; BLAKE3's output mixing differs from BLAKE2's h0 ⊕ v[k] ⊕ v[k+8]).
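The two-block flow can be modelled for a single lane with a plain scalar BLAKE3 compression. This is a reference-style sketch, not the fused kernel: `compress`, `words`, and `absorb68` are hypothetical names, and the counter is assumed to be 0 for both blocks (single chunk, first root output block).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

var iv = [8]uint32{0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
	0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19}

// perm is the BLAKE3 message permutation applied between rounds.
var perm = [16]int{2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8}

// g is the BLAKE3 quarter-round (rotations 16, 12, 8, 7).
func g(v *[16]uint32, a, b, c, d int, mx, my uint32) {
	v[a] += v[b] + mx
	v[d] = bits.RotateLeft32(v[d]^v[a], -16)
	v[c] += v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -12)
	v[a] += v[b] + my
	v[d] = bits.RotateLeft32(v[d]^v[a], -8)
	v[c] += v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -7)
}

// compress runs the 7 rounds and returns the truncated output v[k] ^ v[k+8].
func compress(cv [8]uint32, m [16]uint32, counter uint64, blockLen, flags uint32) [8]uint32 {
	v := [16]uint32{cv[0], cv[1], cv[2], cv[3], cv[4], cv[5], cv[6], cv[7],
		iv[0], iv[1], iv[2], iv[3], uint32(counter), uint32(counter >> 32), blockLen, flags}
	for r := 0; r < 7; r++ {
		g(&v, 0, 4, 8, 12, m[0], m[1])
		g(&v, 1, 5, 9, 13, m[2], m[3])
		g(&v, 2, 6, 10, 14, m[4], m[5])
		g(&v, 3, 7, 11, 15, m[6], m[7])
		g(&v, 0, 5, 10, 15, m[8], m[9])
		g(&v, 1, 6, 11, 12, m[10], m[11])
		g(&v, 2, 7, 8, 13, m[12], m[13])
		g(&v, 3, 4, 9, 14, m[14], m[15])
		var p [16]uint32 // permute the message words for the next round
		for i := range p {
			p[i] = m[perm[i]]
		}
		m = p
	}
	var out [8]uint32
	for k := range out {
		out[k] = v[k] ^ v[k+8]
	}
	return out
}

// words zero-pads b to 64 bytes and reads it as 16 little-endian dwords.
func words(b []byte) (m [16]uint32) {
	var block [64]byte
	copy(block[:], b)
	for i := range m {
		m[i] = binary.LittleEndian.Uint32(block[i*4:])
	}
	return m
}

// absorb68 models one lane of the 68-byte two-block chain-absorb.
func absorb68(key *[32]byte, seeds *[4]uint64, data *[68]byte) [8]uint32 {
	var cv [8]uint32
	for i := range cv {
		cv[i] = binary.LittleEndian.Uint32(key[i*4:])
	}
	mixed := *data
	for i, s := range seeds { // seed XOR over bytes 0..31
		w := binary.LittleEndian.Uint64(mixed[i*8:])
		binary.LittleEndian.PutUint64(mixed[i*8:], w^s)
	}
	cv1 := compress(cv, words(mixed[:64]), 0, 64, 0x10|0x01)     // block 1
	return compress(cv1, words(mixed[64:]), 0, 4, 0x10|0x02|0x08) // block 2
}

func main() {
	var key [32]byte
	var data [68]byte
	seeds := [4]uint64{1, 2, 3, 4}
	fmt.Printf("%08x\n", absorb68(&key, &seeds, &data))
}
```

A scalar model like this is mainly useful as a cross-check when testing a batched kernel lane by lane.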
Types
This section is empty.