areionasm

package

Version: v0.1.2 | Published: May 22, 2026 | License: MIT | Imports: 1 | Imported by: 0

Repository

github.com/everanium/itb

Documentation ¶

Overview ¶

Package areionasm holds the AVX-512 + VAES assembly implementation (with an AVX2 + VAES fallback) of the 4-way batched Areion family for the parent `itb` package. It lives in an internal subpackage because `itb` uses cgo, and the Go build system does not allow Go assembly files in packages that use cgo.

Exported kernels:

Areion256Permutex4 / Areion512Permutex4: per-half AVX-512 + VAES permutations. On amd64 production hot paths these are reached only via the AVX2 sibling (see below) and the fused / chained-absorb kernels; the per-half AVX-512 entries remain primarily as the fast, known-good reference for parity tests.
Areion256Permutex4Avx2 / Areion512Permutex4Avx2: AVX2 + VAES fallbacks for hosts with VAES but no AVX-512 (some Alder Lake / Raptor Lake E-core configurations, certain Zen 3 SKUs).
Areion256SoEMPermutex4Interleaved / Areion512SoEMPermutex4Interleaved: fused per-half kernels that interleave the state1 and state2 permutations on independent ZMM dependency chains and fold the SoEM output XOR (and Areion512's final cyclic rotation) into the writeback.
Areion256ChainAbsorb20x4 / 36x4 / 68x4 (and the Areion-SoEM-512 trio): specialised CBC-MAC chained-absorb kernels for the three ITB SetNonceBits buf shapes (1, 2 or 3 absorb rounds on -256; 1, 1 or 2 on -512). State is held in ZMM registers across all absorb rounds; the broadcast fixedKey and the SoA-packed seedKey are loaded once at function entry.

Also exported: the pre-broadcast round-constant table `AreionRC4x` and the Areion-SoEM-256 domain-separation constant `AreionSoEMDomainSep256`. AoS <-> SoA pack/unpack, runtime dispatch, and the Go-side hash closures live in the parent `itb` package.

Index ¶

Variables
func Areion256ChainAbsorb20x4(fixedKey *[32]byte, seeds *[4][4]uint64, dataPtrs *[4]*byte, out *[4][4]uint64)
func Areion256ChainAbsorb36x4(fixedKey *[32]byte, seeds *[4][4]uint64, dataPtrs *[4]*byte, out *[4][4]uint64)
func Areion256ChainAbsorb68x4(fixedKey *[32]byte, seeds *[4][4]uint64, dataPtrs *[4]*byte, out *[4][4]uint64)
func Areion256Permutex4(x0, x1 *aes.Block4)
func Areion256Permutex4Avx2(x0, x1 *aes.Block4)
func Areion256SoEMPermutex4Interleaved(s1b0, s1b1, s2b0, s2b1 *aes.Block4)
func Areion512ChainAbsorb20x4(fixedKey *[64]byte, seeds *[4][8]uint64, dataPtrs *[4]*byte, out *[4][8]uint64)
func Areion512ChainAbsorb36x4(fixedKey *[64]byte, seeds *[4][8]uint64, dataPtrs *[4]*byte, out *[4][8]uint64)
func Areion512ChainAbsorb68x4(fixedKey *[64]byte, seeds *[4][8]uint64, dataPtrs *[4]*byte, out *[4][8]uint64)
func Areion512Permutex4(x0, x1, x2, x3 *aes.Block4)
func Areion512Permutex4Avx2(x0, x1, x2, x3 *aes.Block4)
func Areion512SoEMPermutex4Interleaved(a1, b1, c1, d1, a2, b2, c2, d2 *aes.Block4)

Constants ¶

This section is empty.

Variables ¶

var AreionRC4x [15 * 64]byte

AreionRC4x holds the 15 Areion round constants in pre-broadcast form (each 16-byte constant replicated four times to fill a 64-byte ZMM register). Layout: rc[r] occupies bytes [r*64 : (r+1)*64], with the 16-byte constant copied at offsets {0, 16, 32, 48} within each block.

Initialised by `init()` from the canonical 16-byte constants in `Constants`. The assembly file `areion_amd64.s` references this symbol as `·AreionRC4x(SB)`.
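The layout described above can be reproduced in a few lines of Go. The following is a minimal sketch, not the package's actual `init()`: the helper name `broadcastRC` and the two demo constants are hypothetical and exist only to make the 4x replication visible.

```go
package main

import "fmt"

// broadcastRC replicates each 16-byte round constant four times, so that
// constant r fills bytes [r*64 : (r+1)*64] of the result with copies at
// offsets 0, 16, 32 and 48. The name is hypothetical, not the package's.
func broadcastRC(constants [][16]byte) []byte {
	out := make([]byte, len(constants)*64)
	for r, c := range constants {
		for slot := 0; slot < 4; slot++ {
			copy(out[r*64+slot*16:], c[:])
		}
	}
	return out
}

func main() {
	// Two made-up constants, used only to make the layout visible.
	var c0, c1 [16]byte
	for i := range c0 {
		c0[i] = byte(i)
		c1[i] = byte(0x10 + i)
	}
	table := broadcastRC([][16]byte{c0, c1})

	fmt.Printf("% x\n", table[48:52]) // first bytes of slot 3 in block 0
	fmt.Printf("% x\n", table[80:84]) // first bytes of slot 1 in block 1
}
```

With the 15 real constants the result would be 15 * 64 = 960 bytes, matching the declared size of `AreionRC4x`.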

var AreionSoEMDomainSep256 = [64]byte{
	0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
}

AreionSoEMDomainSep256 is the SoEM-256 domain-separation constant pre-broadcast to SoA Block4 layout: 0x01 in byte[0] of each 16-byte lane slot, zero elsewhere. Used by the chained-absorb kernels to XOR `d` into state2's first u64 word per SoEM construction.

var Constants = [15][16]byte{
	{0x44, 0x73, 0x70, 0x03, 0x2e, 0x8a, 0x19, 0x13, 0xd3, 0x08, 0xa3, 0x85, 0x88, 0x6a, 0x3f, 0x24},
	{0x89, 0x6c, 0x4e, 0xec, 0x98, 0xfa, 0x2e, 0x08, 0xd0, 0x31, 0x9f, 0x29, 0x22, 0x38, 0x09, 0xa4},
	{0x6c, 0x0c, 0xe9, 0x34, 0xcf, 0x66, 0x54, 0xbe, 0x77, 0x13, 0xd0, 0x38, 0xe6, 0x21, 0x28, 0x45},
	{0x17, 0x09, 0x47, 0xb5, 0xb5, 0xd5, 0x84, 0x3f, 0xdd, 0x50, 0x7c, 0xc9, 0xb7, 0x29, 0xac, 0xc0},
	{0xac, 0xb5, 0xdf, 0x98, 0xa6, 0x0b, 0x31, 0xd1, 0x1b, 0xfb, 0x79, 0x89, 0xd9, 0xd5, 0x16, 0x92},
	{0x96, 0x7e, 0x26, 0x6a, 0xed, 0xaf, 0xe1, 0xb8, 0xb7, 0xdf, 0x1a, 0xd0, 0xdb, 0x72, 0xfd, 0x2f},
	{0xf7, 0x6c, 0x91, 0xb3, 0x47, 0x99, 0xa1, 0x24, 0x99, 0x7f, 0x2c, 0xf1, 0x45, 0x90, 0x7c, 0xba},
	{0x90, 0xe6, 0x74, 0x15, 0x87, 0x0d, 0x92, 0x36, 0x66, 0xc1, 0xef, 0x58, 0x28, 0x2e, 0x1f, 0x80},
	{0x58, 0xb6, 0x8e, 0x72, 0x8f, 0x74, 0x95, 0x0d, 0x7e, 0x3d, 0x93, 0xf4, 0xa3, 0xfe, 0x58, 0xa4},
	{0xb5, 0x59, 0x5a, 0xc2, 0x1d, 0xa4, 0x54, 0x7b, 0xee, 0x4a, 0x15, 0x82, 0x58, 0xcd, 0x8b, 0x71},
	{0xf0, 0x85, 0x60, 0x28, 0x23, 0xb0, 0xd1, 0xc5, 0x13, 0x60, 0xf2, 0x2a, 0x39, 0xd5, 0x30, 0x9c},
	{0x0e, 0x18, 0x3a, 0x60, 0xb0, 0xdc, 0x79, 0x8e, 0xef, 0x38, 0xdb, 0xb8, 0x18, 0x79, 0x41, 0xca},
	{0x27, 0x4b, 0x31, 0xbd, 0xc1, 0x77, 0x15, 0xd7, 0x3e, 0x8a, 0x1e, 0xb0, 0x8b, 0x0e, 0x9e, 0x6c},
	{0x94, 0xab, 0x55, 0xaa, 0xf3, 0x25, 0x55, 0xe6, 0x60, 0x5c, 0x60, 0x55, 0xda, 0x2f, 0xaf, 0x78},
	{0xb6, 0x10, 0xab, 0x2a, 0x6a, 0x39, 0xca, 0x55, 0x40, 0x14, 0xe8, 0x63, 0x62, 0x98, 0x48, 0x57},
}

Constants is the canonical 15-entry round constant table — digits of pi in little-endian byte order, copied verbatim from `github.com/jedisct1/go-aes/areion.go:areionRoundConstants`. Areion256 uses entries 0..9; Areion512 uses entries 0..14.
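The little-endian claim is easy to check by hand: reversing the bytes of entry 0 yields the first 32 hexadecimal digits of the fractional part of pi. A small sketch, where `reverse16` is a hypothetical helper introduced only for this illustration:

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// reverse16 returns the 16 bytes of b in reverse order.
func reverse16(b [16]byte) [16]byte {
	var r [16]byte
	for i := range b {
		r[15-i] = b[i]
	}
	return r
}

func main() {
	// Entry 0 of the Constants table, copied from the listing above.
	rc0 := [16]byte{
		0x44, 0x73, 0x70, 0x03, 0x2e, 0x8a, 0x19, 0x13,
		0xd3, 0x08, 0xa3, 0x85, 0x88, 0x6a, 0x3f, 0x24,
	}
	be := reverse16(rc0)
	fmt.Println(hex.EncodeToString(be[:]))
	// prints 243f6a8885a308d313198a2e03707344
}
```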

var HasARMAESBatched = false

HasARMAESBatched is always false on amd64 builds — this is the ARM Crypto Extension batched flag set by areionasm_arm64.go on arm64 hosts. Declared here so the parent itb package's gates compile uniformly across architectures without per-arch build tag fences inside the gate expression.

var HasVAESAVX2NoAVX512 = aes.CPU.HasVAES && aes.CPU.HasAVX2 && !aes.CPU.HasAVX512

HasVAESAVX2NoAVX512 is true for x86_64 CPUs that have VAES + AVX2 but lack AVX-512. The runtime dispatcher in the parent itb package picks this path when HasVAESAVX512 is false but VAES is still available, so the YMM assembly variants run instead of falling all the way back to the portable Go path.

var HasVAESAVX512 = aes.CPU.HasVAES && aes.CPU.HasAVX512

HasVAESAVX512 caches whether the runtime CPU supports VAES + AVX-512. Resolved once at init time from the upstream `aes` package's CPUID-driven detection. Both flags must be set for the AVX-512 path to be selected.
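The selection order implied by the two flags can be sketched as follows. This is an illustration under stated assumptions, not the parent package's dispatcher: `cpuFeatures` is a hypothetical stand-in for the upstream `aes.CPU` value, and `selectPath` and its string results are invented for the example.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for the upstream aes.CPU flags.
type cpuFeatures struct {
	HasVAES, HasAVX2, HasAVX512 bool
}

// selectPath mirrors the documented flag expressions: prefer the ZMM
// kernels, then the YMM kernels, then the portable Go path.
func selectPath(c cpuFeatures) string {
	hasVAESAVX512 := c.HasVAES && c.HasAVX512
	hasVAESAVX2NoAVX512 := c.HasVAES && c.HasAVX2 && !c.HasAVX512
	switch {
	case hasVAESAVX512:
		return "avx512"
	case hasVAESAVX2NoAVX512:
		return "avx2"
	default:
		return "portable"
	}
}

func main() {
	fmt.Println(selectPath(cpuFeatures{HasVAES: true, HasAVX2: true, HasAVX512: true}))
	fmt.Println(selectPath(cpuFeatures{HasVAES: true, HasAVX2: true}))
	fmt.Println(selectPath(cpuFeatures{HasAVX2: true}))
}
```

Because the second flag already includes `!HasAVX512`, the two flags are mutually exclusive and the order of the first two cases does not change the result.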

Functions ¶

func Areion256ChainAbsorb20x4 ¶

func Areion256ChainAbsorb20x4(
	fixedKey *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][4]uint64,
)

Areion256ChainAbsorb20x4 is the single-round specialised fused chained-absorb VAES kernel for Areion-SoEM-256 with 20-byte per-lane data shape (the ITB SetNonceBits(128) buf shape — default config).

20 bytes ≤ 24-byte Areion-SoEM-256 chunkSize, so the absorb is one SoEM round; the kernel runs the 10-round Areion256 permutation interleaved on state1 and state2, computes the SoEM XOR `state1' ⊕ state2'` in registers, and writes the 32-byte digest per lane.

Inputs:

fixedKey: shared 32-byte fixed key (Areion-SoEM-256 uses key1 = 32 B).
seeds: per-lane seed components (4 lanes × 4 uint64 = 32 B per lane).
dataPtrs: 4 pointers, each to ≥20 bytes.
out: output buffer; lane i's 32-byte digest at out[i] as 4 little-endian uint64 words.
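The round counts quoted for the chained-absorb kernels are consistent with a ceiling division of the per-lane data length by the chunk size (24 bytes for Areion-SoEM-256, 56 bytes for Areion-SoEM-512). The sketch below assumes that relationship; `absorbRounds` is a hypothetical helper, and the real kernels hard-code the count per shape rather than computing it.

```go
package main

import "fmt"

// absorbRounds returns how many chunkSize-byte absorb rounds are needed to
// consume dataLen bytes, assuming a plain ceiling division.
func absorbRounds(dataLen, chunkSize int) int {
	return (dataLen + chunkSize - 1) / chunkSize
}

func main() {
	for _, chunk := range []int{24, 56} {
		for _, n := range []int{20, 36, 68} {
			fmt.Printf("chunkSize=%d len=%d rounds=%d\n", chunk, n, absorbRounds(n, chunk))
		}
	}
}
```

This reproduces the 1, 2, 3 rounds documented for the -256 kernels and the 1, 1, 2 rounds documented for the -512 kernels.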

func Areion256ChainAbsorb36x4 ¶

func Areion256ChainAbsorb36x4(
	fixedKey *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][4]uint64,
)

Areion256ChainAbsorb36x4 is the 2-round specialisation for the 36-byte per-lane data shape (ITB SetNonceBits(256)). State is held in the (Z14, Z15) ZMM registers across both CBC-MAC absorb rounds, with no memory round trip between rounds.

func Areion256ChainAbsorb68x4 ¶

func Areion256ChainAbsorb68x4(
	fixedKey *[32]byte,
	seeds *[4][4]uint64,
	dataPtrs *[4]*byte,
	out *[4][4]uint64,
)

Areion256ChainAbsorb68x4 is the 3-round specialisation for the 68-byte per-lane data shape (ITB SetNonceBits(512)). State is held in the (Z14, Z15) ZMM registers across all three CBC-MAC absorb rounds, with no memory round trip between rounds.

func Areion256Permutex4 ¶

func Areion256Permutex4(x0, x1 *aes.Block4)

Areion256Permutex4 applies the 10-round Areion256 permutation to four independent states packed in SoA layout: `*x0` holds the four lanes' first 16-byte AES blocks (Block4 = 64 bytes), `*x1` holds the second 16-byte blocks. Implemented in `areion_amd64.s` using AVX-512 + VAES instructions on ZMM registers.
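To make the SoA layout concrete, here is a minimal pack/unpack sketch for four 32-byte Areion256 states. It is an assumption-laden illustration: the local `Block4` type stands in for `aes.Block4` (taken to be a 64-byte array, as stated above), and `packSoA256` / `unpackSoA256` are hypothetical names, since the real AoS <-> SoA helpers live in the parent `itb` package.

```go
package main

import "fmt"

// Block4 is a local stand-in for aes.Block4: four 16-byte AES blocks,
// one per lane, stored back to back.
type Block4 [64]byte

// packSoA256 moves lane i's first 16 bytes into slot i of x0 and its
// second 16 bytes into slot i of x1.
func packSoA256(lanes *[4][32]byte) (x0, x1 Block4) {
	for i := 0; i < 4; i++ {
		copy(x0[i*16:(i+1)*16], lanes[i][0:16])
		copy(x1[i*16:(i+1)*16], lanes[i][16:32])
	}
	return x0, x1
}

// unpackSoA256 is the inverse of packSoA256.
func unpackSoA256(x0, x1 *Block4) (lanes [4][32]byte) {
	for i := 0; i < 4; i++ {
		copy(lanes[i][0:16], x0[i*16:(i+1)*16])
		copy(lanes[i][16:32], x1[i*16:(i+1)*16])
	}
	return lanes
}

func main() {
	var lanes [4][32]byte
	for i := range lanes {
		for j := range lanes[i] {
			lanes[i][j] = byte(i*32 + j)
		}
	}
	x0, x1 := packSoA256(&lanes)
	fmt.Printf("x0 slot 1: % x\n", x0[16:20])
	fmt.Printf("x1 slot 1: % x\n", x1[16:20])
	fmt.Println("round trip ok:", unpackSoA256(&x0, &x1) == lanes)
}
```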

func Areion256Permutex4Avx2 ¶

func Areion256Permutex4Avx2(x0, x1 *aes.Block4)

Areion256Permutex4Avx2 is the AVX2 + VAES variant of Areion256Permutex4, written for x86_64 CPUs that have VAES but no AVX-512 (some Intel Alder Lake / Raptor Lake E-core configurations when isolated, certain AMD Zen 3 SKUs). Same SoA layout and bit-exact parity invariant as the AVX-512 path; the only difference is the internal VAESENC instructions operate on YMM registers (2 AES blocks per call) instead of ZMM (4 blocks per call), so each Areion round body runs twice — once for lanes 0-1 and once for lanes 2-3.

func Areion256SoEMPermutex4Interleaved ¶

func Areion256SoEMPermutex4Interleaved(s1b0, s1b1, s2b0, s2b1 *aes.Block4)

Areion256SoEMPermutex4Interleaved runs the Areion-SoEM-256 4-way batched PRF in a single fused VAES kernel. The caller is responsible for preparing the SoEM half-states in SoA Block4 layout (lane i's 16-byte AES sub-block occupies bytes [i*16 : i*16+16] of each Block4):

s1b0, s1b1 = input ⊕ key1
s2b0, s2b1 = input ⊕ key2 ⊕ domainSep

The kernel runs both 10-round Areion256 permutations interleaved (one VAESENC of each state per critical-path step, masking the 5-cycle VAESENC latency on Intel Sunny Cove / Cypress Cove and AMD Zen 4), then computes the SoEM output `state1' ⊕ state2'` in registers and writes the result back into (s1b0, s1b1). The (s2b0, s2b1) buffers are scratch and their contents after the call are unspecified.

Compared with two back-to-back `Areion256Permutex4` calls plus a Go-side per-lane uint64 XOR loop, the fused path saves:

10 round-constant loads (RC pre-load runs once instead of twice)
the function-call boundary between the two permutes
the post-permute XOR + unpack loop in the AoS-side caller
per-round dependency stalls (interleaved VAESENC pairs hide latency)
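For orientation, the data flow that the fused kernel implements for a single lane can be written as a portable reference. This is a sketch only: `soem256` and `xor32` are hypothetical names, and `toyPermute` is a deliberately trivial stand-in bijection, not the Areion256 permutation.

```go
package main

import "fmt"

type state256 [32]byte

// xor32 returns the byte-wise XOR of two 32-byte states.
func xor32(a, b state256) state256 {
	var r state256
	for i := range r {
		r[i] = a[i] ^ b[i]
	}
	return r
}

// soem256 computes P(input ^ key1) ^ P(input ^ key2 ^ domainSep) for one
// lane, with the permutation P supplied by the caller.
func soem256(input, key1, key2, domainSep state256, permute func(state256) state256) state256 {
	s1 := permute(xor32(input, key1))
	s2 := permute(xor32(xor32(input, key2), domainSep))
	return xor32(s1, s2)
}

// toyPermute is NOT Areion: it rotates the state by one byte and is used
// only so the example has some bijection to call.
func toyPermute(s state256) state256 {
	var r state256
	for i := range s {
		r[(i+1)%len(s)] = s[i]
	}
	return r
}

func main() {
	var input, key1, key2, domainSep state256
	input[0], key1[0], key2[0] = 0x55, 0x0f, 0xf0
	domainSep[0] = 0x01 // 0x01 in byte 0, zero elsewhere, as documented above

	out := soem256(input, key1, key2, domainSep, toyPermute)
	fmt.Printf("%x\n", out[:4])
}
```

The assembly kernel performs the same three steps for four lanes at once, but keeps both states in registers and interleaves the two permutations instead of running them back to back.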

func Areion512ChainAbsorb20x4 ¶

func Areion512ChainAbsorb20x4(
	fixedKey *[64]byte,
	seeds *[4][8]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint64,
)

Areion512ChainAbsorb20x4 is the single-round specialised fused chained-absorb VAES kernel for Areion-SoEM-512 with 20-byte per-lane data shape (the ITB SetNonceBits(128) buf shape — default config).

20 bytes ≤ 56-byte SoEM-512 chunkSize, so the absorb is one SoEM round; the kernel runs the 15-round Areion512 permutation interleaved on state1 and state2, applies the cyclic rotation `(x0,x1,x2,x3) → (x3,x0,x1,x2)` fused with the SoEM XOR, and writes the 64-byte digest per lane.

Inputs:

fixedKey: shared 64-byte fixed key (Areion-SoEM-512 uses key1 = 64 B).
seeds: per-lane seed components (4 lanes × 8 uint64 = 64 B per lane).
dataPtrs: 4 pointers, each to ≥20 bytes.
out: output buffer; lane i's 64-byte digest at out[i] as 8 little-endian uint64 words.

func Areion512ChainAbsorb36x4 ¶

func Areion512ChainAbsorb36x4(
	fixedKey *[64]byte,
	seeds *[4][8]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint64,
)

Areion512ChainAbsorb36x4 is the single-round specialisation for the 36-byte per-lane data shape (ITB SetNonceBits(256)). 36 bytes ≤ 56-byte chunkSize, so the absorb is still one round; only the data layout in the initial state differs from the 20-byte case.

func Areion512ChainAbsorb68x4 ¶

func Areion512ChainAbsorb68x4(
	fixedKey *[64]byte,
	seeds *[4][8]uint64,
	dataPtrs *[4]*byte,
	out *[4][8]uint64,
)

Areion512ChainAbsorb68x4 is the 2-round specialisation for the 68-byte per-lane data shape (ITB SetNonceBits(512)). State is held in the (Z14, Z15, Z16, Z17) ZMM registers across both CBC-MAC absorb rounds, with no memory round trip between rounds.

func Areion512Permutex4 ¶

func Areion512Permutex4(x0, x1, x2, x3 *aes.Block4)

Areion512Permutex4 applies the 15-round Areion512 permutation to four independent states packed in SoA layout: each `*xN` holds the four lanes' N-th 16-byte AES block (Block4 = 64 bytes). Includes the final cyclic state rotation `(x0,x1,x2,x3) → (x3,x0,x1,x2)` documented in the Areion paper / `areion512PermuteSoftware`. Implemented in `areion_amd64.s`.

func Areion512Permutex4Avx2 ¶

func Areion512Permutex4Avx2(x0, x1, x2, x3 *aes.Block4)

Areion512Permutex4Avx2 is the AVX2 + VAES counterpart for the 512-bit permutation. Same constraints as Areion256Permutex4Avx2 — VAES on YMM, no AVX-512 required, 2 AES blocks per VAES instruction. Each of the 15 rounds runs twice (one body per lane pair), plus the final cyclic state rotation. Bit-exact parity invariant identical to the AVX-512 path.

func Areion512SoEMPermutex4Interleaved ¶

func Areion512SoEMPermutex4Interleaved(a1, b1, c1, d1, a2, b2, c2, d2 *aes.Block4)

Areion512SoEMPermutex4Interleaved runs the Areion-SoEM-512 4-way batched PRF in a single fused VAES kernel. Caller is responsible for the SoEM input setup (in SoA Block4 layout):

(a1, b1, c1, d1) = input ⊕ key1
(a2, b2, c2, d2) = input ⊕ key2 ⊕ domainSep

Each Block4 is 64 bytes and holds the same 16-byte AES sub-block across the 4 lanes (Areion-SoEM-512's state is 64 bytes = 4 AES blocks per lane, hence 4 Block4 buffers per state).

The kernel runs both 15-round Areion512 permutations interleaved (one VAESENC of each state per critical-path step, masking the 5-cycle VAESENC latency on Intel Sunny Cove / Cypress Cove and AMD Zen 4), applies the cyclic state rotation `(x0,x1,x2,x3) → (x3,x0,x1,x2)` fused with the SoEM output XOR `state1' ⊕ state2'`, and writes the result to (a1, b1, c1, d1). The (a2, b2, c2, d2) buffers are scratch and their contents after the call are unspecified.

Compared with two back-to-back `Areion512Permutex4` calls plus a Go-side per-Block4 XOR loop, the fused path saves:

15 round-constant loads (RC pre-load runs once instead of twice)
the function-call boundary between the two permutes
the post-permute XOR + unpack work in the AoS-side caller
per-round VAESENC dependency stalls (interleaved chains hide the 5-cycle latency)
8 VMOVDQA64 final-rotation moves (rotation is folded into the SoEM XOR pattern by routing register contents directly to the correct output slots)
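The last saving rests on a simple identity: rotating the four blocks after the XOR gives the same result as writing each XOR directly to its rotated output slot. The sketch below checks that identity in plain Go. The types and the function names `xorThenRotate` and `fusedRotateXOR` are hypothetical, and the code models only the data movement, not the permutation itself.

```go
package main

import "fmt"

// Block4 is a local stand-in for aes.Block4 (64 bytes).
type Block4 [64]byte

// state512 holds the four Block4 values (x0, x1, x2, x3) of one state.
type state512 [4]Block4

func xorBlock(a, b Block4) Block4 {
	var r Block4
	for i := range r {
		r[i] = a[i] ^ b[i]
	}
	return r
}

// xorThenRotate is the unfused reference: XOR the two states, then apply
// the rotation (x0,x1,x2,x3) -> (x3,x0,x1,x2) as a separate step.
func xorThenRotate(s1, s2 state512) state512 {
	var x state512
	for i := range x {
		x[i] = xorBlock(s1[i], s2[i])
	}
	return state512{x[3], x[0], x[1], x[2]}
}

// fusedRotateXOR writes each XOR result straight to its rotated slot, so
// no separate move pass is needed.
func fusedRotateXOR(s1, s2 state512) state512 {
	var out state512
	for i := range out {
		src := (i + 3) % 4 // output slot i takes input block i-1 (mod 4)
		out[i] = xorBlock(s1[src], s2[src])
	}
	return out
}

func main() {
	var s1, s2 state512
	for i := range s1 {
		for j := range s1[i] {
			s1[i][j] = byte(i*64 + j)
			s2[i][j] = byte(7*i + 3*j + 1)
		}
	}
	fmt.Println("fused == reference:", fusedRotateXOR(s1, s2) == xorThenRotate(s1, s2))
}
```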

Types ¶

This section is empty.
