blake2sasm

package
v0.1.0
Published: May 13, 2026 | License: MIT | Imports: 4 | Imported by: 0

Documentation

Overview

Package blake2sasm holds the AVX-512 + VL fused chain-absorb kernel implementation of BLAKE2s for the parent hashes/ package. The chain kernels are specialised at three input widths (20 / 36 / 68 bytes — covering the ITB SetNonceBits 128 / 256 / 512 buf shapes) and hold BLAKE2s state in ZMM registers across the absorb rounds, eliminating the per-round memory round-trip taken by the upstream `golang.org/x/crypto/blake2s` path.

Below the AVX-512 + VL tier, the parent package's dispatch falls through to the existing `golang.org/x/crypto/blake2s` AVX2 / SSE2 / scalar paths. No AVX2 / SSE4 / SSSE3 ASM is provided here — the upstream library already covers those tiers, and the fused chain-absorb trick that motivates this package is AVX-512-only by construction (state-residency requires ZMM; VPRORD requires AVX-512 + VL).

Reference layout: github.com/saucecontrol/Blake2Fast (MIT) — specifically src/Blake2Fast/Blake2s/Blake2sScalar.g.cs for the sigma table, IV constants, G rotates, and round structure. The AVX-512 round body is structured as a 4-pixel-batched lane-parallel design (no DIAG/UNDIAG permutations) rather than the upstream single-state Bernstein layout.

Constants

This section is empty.

Variables

var Blake2sIV = [8]uint32{
	0x6a09e667,
	0xbb67ae85,
	0x3c6ef372,
	0xa54ff53a,
	0x510e527f,
	0x9b05688c,
	0x1f83d9ab,
	0x5be0cd19,
}

Blake2sIV is the BLAKE2s initialization vector from RFC 7693 §2.6. The 32-bit word constants are the same as the upper 32 bits of the corresponding BLAKE2b IV entries. The compression function's initial state for BLAKE2s-256 is this IV XOR'd with the parameter block. For the parameter set used here, only the first parameter word is non-zero, so it changes h[0] alone and h[1..7] remain equal to IV[1..7].
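The relationship to the BLAKE2b IV can be checked directly. The sketch below is illustrative only: blake2bIV and upperHalvesMatch are names introduced here, not part of this package, and the BLAKE2b constants are the standard SHA-512 initial hash values.

```go
package main

import "fmt"

// blake2bIV holds the BLAKE2b initialization vector (the SHA-512 initial
// hash values). Declared here for illustration; not exported by this package.
var blake2bIV = [8]uint64{
	0x6a09e667f3bcc908, 0xbb67ae8584caa73b,
	0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1,
	0x510e527fade682d1, 0x9b05688c2b3e6c1f,
	0x1f83d9abfb41bd6b, 0x5be0cd19137e2179,
}

// blake2sIV mirrors the Blake2sIV table documented above.
var blake2sIV = [8]uint32{
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
	0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
}

// upperHalvesMatch reports whether every BLAKE2s IV word equals the upper
// 32 bits of the corresponding BLAKE2b IV word.
func upperHalvesMatch() bool {
	for i, w := range blake2bIV {
		if uint32(w>>32) != blake2sIV[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(upperHalvesMatch()) // prints "true"
}
```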

var Blake2sIV256Param = [8]uint32{
	0x6b08e647,
	0xbb67ae85,
	0x3c6ef372,
	0xa54ff53a,
	0x510e527f,
	0x9b05688c,
	0x1f83d9ab,
	0x5be0cd19,
}

Blake2sIV256Param is the precomputed initial state for the hashes.BLAKE2s256 prefix-MAC construction:

paramBlock = digestLength=32, fanout=1, depth=1, keylen=0
           = 0x01010020 (LE uint32)
h[0]       = IV[0] ⊕ paramBlock = 0x6b08e647
h[1..7]    = IV[1..7] (unchanged)

The hashes.BLAKE2s256 closure passes a pointer to this array as the h0 parameter of the chain-absorb kernels. Caller-side IV setup keeps the ASM kernel digest-width-agnostic — though BLAKE2s ships only at 32-byte digest width in this repo, the same shape leaves room for a future -224 parameter set to slot in without a kernel rewrite.
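The h[0] derivation above can be reproduced in a few lines. This is a minimal sketch: paramWord0 is a hypothetical helper (not part of this package) that packs the first four parameter-block bytes in the order digest length, key length, fanout, depth.

```go
package main

import "fmt"

// blake2sIV0 is IV[0] from the Blake2sIV table.
const blake2sIV0 uint32 = 0x6a09e667

// paramWord0 packs the first four parameter-block bytes into a little-endian
// uint32: digest length in the lowest byte, then key length, fanout, depth.
// Hypothetical helper for illustration only.
func paramWord0(digestLen, keyLen, fanout, depth uint8) uint32 {
	return uint32(digestLen) | uint32(keyLen)<<8 | uint32(fanout)<<16 | uint32(depth)<<24
}

func main() {
	p := paramWord0(32, 0, 1, 1)
	fmt.Printf("paramBlock = 0x%08x\n", p)            // paramBlock = 0x01010020
	fmt.Printf("h[0]       = 0x%08x\n", blake2sIV0^p) // h[0]       = 0x6b08e647
}
```

Calling paramWord0 with a digest length of 28 would give the first word for a -224 parameter set, which is why keeping the IV setup on the caller side leaves the kernel unchanged.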

var HasAVX512Fused = cpu.X86.HasAVX512F

HasAVX512Fused reports whether the runtime CPU supports the fused AVX-512 + VL chain-absorb kernels. The kernels are pure ARX (no AES rounds), so the detection requirement is weaker than the Areion-SoEM flag in `internal/areionasm` — only AVX-512F is needed at the CPUID-bit level. VPRORD requires AVX-512F + VL, but on every shipping silicon where AVX-512F is present the rest of the AVX-512 baseline (F + CD + BW + DQ + VL) ships with it (Intel Skylake-X+, AMD Zen 4+). The only chips with AVX-512F but no VL — Knights Landing / Knights Mill (Xeon Phi 2nd gen) — are extinct accelerator products that no Go runtime targets in practice.
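For callers who prefer not to rely on the "F implies VL" observation, a stricter gate is easy to express. The sketch below is an assumption-laden illustration, not this package's dispatch code: selectTier and its tier names are hypothetical, and it assumes the cpu package referenced above is golang.org/x/sys/cpu, whose cpu.X86.HasAVX512F and cpu.X86.HasAVX512VL fields would supply the two flags.

```go
package main

import "fmt"

// selectTier is a hypothetical helper. It picks the fused kernel only when
// both AVX-512F and AVX-512VL are reported; everything else uses the
// upstream golang.org/x/crypto/blake2s path.
func selectTier(hasAVX512F, hasAVX512VL bool) string {
	if hasAVX512F && hasAVX512VL {
		return "avx512-fused"
	}
	return "upstream"
}

func main() {
	// In real code the flags would be read from the CPU feature package
	// rather than hard-coded.
	fmt.Println(selectTier(true, true))   // avx512-fused (Skylake-X / Zen 4 class)
	fmt.Println(selectTier(true, false))  // upstream (Knights Landing class)
	fmt.Println(selectTier(false, false)) // upstream (no AVX-512)
}
```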

Functions

func Blake2s256ChainAbsorb20x4

func Blake2s256ChainAbsorb20x4(
	h0 *[8]uint32,
	b2key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)

Blake2s256ChainAbsorb20x4 is the public 4-pixel-batched entry point for the BLAKE2s-256 chain-absorb at the 20-byte data shape (ITB SetNonceBits(128) buf shape).

On amd64 hosts with AVX-512 + VL (HasAVX512Fused == true), the call dispatches to the fused ZMM-batched ASM kernel, which holds four lane-isolated BLAKE2s states in 16 ZMM registers (one ZMM per v[k], 4 of 16 dword lanes used) across all 10 internal mixing rounds. No DIAG/UNDIAG permutations are required, since the four states are lane-parallel rather than shuffled into one. On hosts without AVX-512 + VL, the call falls through to the scalar batched reference path, which loops the per-lane scalar reference (delegating to upstream golang.org/x/crypto/blake2s).
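The lane-parallel layout can be modelled in plain Go. The sketch below is a scalar model under stated assumptions, not the ASM kernel: vec stands in for the four used dword lanes of one ZMM register, and gLanes, add, and xorRotr are hypothetical names. It applies the BLAKE2s G function (RFC 7693 §3.1, right-rotations by 16, 12, 8, 7) to four states at once and checks the result against four independent scalar runs. The diagonal step only changes which v[k] indices are passed, which is why no DIAG/UNDIAG shuffle appears.

```go
package main

import (
	"fmt"
	"math/bits"
)

// vec models the four used dword lanes of one register: one word per state.
type vec [4]uint32

// add is a lane-wise 32-bit add (one vector instruction in a SIMD kernel).
func add(a, b vec) vec {
	for i := range a {
		a[i] += b[i]
	}
	return a
}

// xorRotr is a lane-wise XOR followed by a right-rotate by n bits.
func xorRotr(a, b vec, n int) vec {
	for i := range a {
		a[i] = bits.RotateLeft32(a[i]^b[i], -n)
	}
	return a
}

// gLanes applies G to four lane-parallel states; v[k][lane] is word k of
// that lane's working vector.
func gLanes(v *[16]vec, a, b, c, d int, x, y vec) {
	v[a] = add(add(v[a], v[b]), x)
	v[d] = xorRotr(v[d], v[a], 16)
	v[c] = add(v[c], v[d])
	v[b] = xorRotr(v[b], v[c], 12)
	v[a] = add(add(v[a], v[b]), y)
	v[d] = xorRotr(v[d], v[a], 8)
	v[c] = add(v[c], v[d])
	v[b] = xorRotr(v[b], v[c], 7)
}

// g is the single-state scalar G used as the reference.
func g(v *[16]uint32, a, b, c, d int, x, y uint32) {
	v[a] = v[a] + v[b] + x
	v[d] = bits.RotateLeft32(v[d]^v[a], -16)
	v[c] = v[c] + v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -12)
	v[a] = v[a] + v[b] + y
	v[d] = bits.RotateLeft32(v[d]^v[a], -8)
	v[c] = v[c] + v[d]
	v[b] = bits.RotateLeft32(v[b]^v[c], -7)
}

// lanesMatchScalar runs one column G and one diagonal G both ways and
// reports whether every word agrees.
func lanesMatchScalar() bool {
	var scalar [4][16]uint32
	var lanes [16]vec
	for lane := 0; lane < 4; lane++ {
		for k := 0; k < 16; k++ {
			w := uint32(lane*16+k) * 0x9e3779b9 // arbitrary test pattern
			scalar[lane][k], lanes[k][lane] = w, w
		}
	}
	x, y := vec{1, 2, 3, 4}, vec{5, 6, 7, 8}
	gLanes(&lanes, 0, 4, 8, 12, x, y)  // column step
	gLanes(&lanes, 0, 5, 10, 15, x, y) // diagonal step: indices only, no shuffle
	for lane := 0; lane < 4; lane++ {
		g(&scalar[lane], 0, 4, 8, 12, x[lane], y[lane])
		g(&scalar[lane], 0, 5, 10, 15, x[lane], y[lane])
		for k := 0; k < 16; k++ {
			if scalar[lane][k] != lanes[k][lane] {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println("lanes match scalar:", lanesMatchScalar()) // lanes match scalar: true
}
```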

Buffer construction is identical between the two paths and matches the bit-exact behaviour of the existing hashes.BLAKE2s256 closure applied to each of the four pixel inputs:

per lane:
  buf[0:32]   = b2key
  buf[32:52]  = data[lane]
  buf[52:64]  = zero
  then for i in 0..3: buf[32+i*8 : 40+i*8] ^= seeds[lane][i] (LE uint64)

Each lane then runs one BLAKE2s compression with t=64 and f=^0 (final block). Output is the 8 × uint32 BLAKE2s state per lane (32 bytes of digest).

h0 selects the parameter-block-XOR'd initial state (digestLength=32 for hashes.BLAKE2s256). Pass &Blake2sIV256Param.
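The per-lane buffer layout described above can be sketched as follows. buildBuf20 is a hypothetical helper, not part of this package; it takes the lane's data as a fixed-size array for safety, whereas the real entry point receives raw *byte pointers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildBuf20 builds one lane's 64-byte compression block for the 20-byte
// data shape. Hypothetical helper for illustration only.
func buildBuf20(key *[32]byte, data *[20]byte, seed *[4]uint64) [64]byte {
	var buf [64]byte
	copy(buf[0:32], key[:])   // buf[0:32]  = b2key
	copy(buf[32:52], data[:]) // buf[32:52] = data[lane]; buf[52:64] stays zero
	// XOR the four seed words over buf[32:64] as little-endian uint64s.
	for i := 0; i < 4; i++ {
		off := 32 + i*8
		w := binary.LittleEndian.Uint64(buf[off : off+8])
		binary.LittleEndian.PutUint64(buf[off:off+8], w^seed[i])
	}
	return buf
}

func main() {
	var key [32]byte
	var data [20]byte
	for i := range data {
		data[i] = byte(i + 1)
	}
	seed := [4]uint64{0x01, 0, 0, 0xff00000000000000}
	buf := buildBuf20(&key, &data, &seed)
	fmt.Printf("%x\n", buf[32:])
}
```

Note that the seed XOR spans all of buf[32:64], so the 12 padding bytes are zero only before the seed is mixed in.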

func Blake2s256ChainAbsorb36x4

func Blake2s256ChainAbsorb36x4(
	h0 *[8]uint32,
	b2key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)

Blake2s256ChainAbsorb36x4 is the 4-pixel-batched entry point for the BLAKE2s-256 chain-absorb at the 36-byte data shape (ITB SetNonceBits(256) buf shape). Each lane takes two compression blocks:

Block 1 (t=64,  f=0):  buf[0:64]   = b2key + (data[0:32] ⊕ seed)
Block 2 (t=68,  f=^0): buf[64:128] = data[32:36] + 60 zero pad

The ASM kernel holds all four lanes' BLAKE2s states in ZMM registers across both compressions; the inter-block fold runs in-register lane-parallel, with the block-1 chaining hash spilled to stack so the block-2 final fold can reload it.

func Blake2s256ChainAbsorb68x4

func Blake2s256ChainAbsorb68x4(
	h0 *[8]uint32,
	b2key *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint32,
)

Blake2s256ChainAbsorb68x4 is the 4-pixel-batched entry point for the BLAKE2s-256 chain-absorb at the 68-byte data shape (ITB SetNonceBits(512) buf shape). Each lane takes two compression blocks:

Block 1 (t=64,  f=0):  buf[0:64]   = b2key + (data[0:32] ⊕ seed)
Block 2 (t=100, f=^0): buf[64:128] = data[32:68] + 28 zero pad

Same structure as the 36-byte two-block kernel; only the block-2 data fill is wider (36 bytes instead of 4), populating m[0..8] instead of m[0] alone.
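The difference in block-2 fill between the two shapes can be seen by loading the message words. In this sketch, block2Words and populated are hypothetical helpers, not part of this package; they zero-pad the tail to 64 bytes, read the sixteen little-endian message words, and count how many leading words carry data.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// block2Words zero-pads tail to one 64-byte block, loads m[0..15] as
// little-endian uint32 words, and returns the final byte counter t
// (64 bytes from block 1 plus the tail length).
func block2Words(tail []byte) (m [16]uint32, t uint32) {
	var block [64]byte
	copy(block[:], tail)
	for i := range m {
		m[i] = binary.LittleEndian.Uint32(block[i*4 : i*4+4])
	}
	return m, 64 + uint32(len(tail))
}

// populated returns one past the index of the last non-zero message word.
func populated(m [16]uint32) int {
	n := 0
	for i, w := range m {
		if w != 0 {
			n = i + 1
		}
	}
	return n
}

// filled returns n bytes of 0xff so every covered word is non-zero.
func filled(n int) []byte {
	b := make([]byte, n)
	for i := range b {
		b[i] = 0xff
	}
	return b
}

func main() {
	m36, t36 := block2Words(filled(4))  // 36-byte shape: data[32:36]
	m68, t68 := block2Words(filled(36)) // 68-byte shape: data[32:68]
	fmt.Println(populated(m36), t36)    // 1 68
	fmt.Println(populated(m68), t68)    // 9 100
}
```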

Types

This section is empty.
