blake3sum
A fast BLAKE3 hashing library and a
b3sum-compatible command-line tool, written in Go and structured as a
goforge Polylith workspace.
Built for content and "essentia" hashing of compiler intermediate
representations: alongside the usual streaming and one-shot APIs it provides a
parallel batch API (HashMany) tuned for hashing many small, independent
nodes at once.
go install goforge.dev/blake3sum@latest # installs the blake3sum CLI
import "goforge.dev/blake3sum/components/blake3"
Library
sum := blake3.Sum256(data) // [32]byte, one-shot
sum := blake3.Hash(data) // alias for Sum256
ext := blake3.Sum512(data) // [64]byte
h := blake3.New(32, nil) // streaming, hash.Hash
h.Write(p); digest := h.Sum(nil)
kh := blake3.NewKeyed(key32) // keyed mode (key is 32 bytes)
dk := blake3.NewDeriveKey("app 2025 ctx"); dk.Write(material)
blake3.DeriveKey(out, "app 2025 ctx", material) // one-shot KDF
xof := h.XOF() // seekable extendable output
xof.Seek(1<<20, io.SeekStart); xof.Read(buf)
// Batch: hash many independent inputs in parallel — the compiler hot path.
digests := blake3.HashMany(inputs) // [][32]byte, order-preserving
blake3.HashManyInto(dst, inputs) // no allocation of the result
keyed := blake3.HashManyKeyed(key32, inputs)
Measured on an 11th-gen Intel i7-11800H (AVX-512, 8 cores / 16 threads),
go test -bench:
| Workload | Result |
| --- | --- |
| One-shot, 16 MiB (multithreaded AVX-512) | ~12.3 GB/s |
| One-shot, 1 MiB | ~4.7 GB/s |
| One-shot, 64 B (latency) | ~170 ns |
| HashMany, 100k × 48 B vs sequential loop | 3.3× faster (5.2 ms vs 17.4 ms) |
Single-stream throughput is on par with or faster than lukechampine.com/blake3,
the fastest existing Go implementation, on the same machine. The HashMany
speedup over a sequential Sum256 loop scales with core count and is the main
win for the IR-hashing use case; no reference implementation offers a batch
API.
When BLAKE3 is already SIMD-saturated and multithreaded (optimized Rust b3sum,
lukechampine), single-stream throughput is bound by memory bandwidth and the
compression function, so there is no 2× headroom left in any language. The
real win for compiler workloads is the batch path.
Architecture dispatch
The compression core is selected at build/run time:
| Target | Path | Status |
| --- | --- | --- |
| amd64 | AVX-512 → AVX2 → portable | implemented, assembly core |
| arm64 | NEON → portable | implemented (4-wide NEON kernel), verified under emulation |
| wasm | host SIMD128 bridge → portable | bridge implemented (opt-in), portable default |
| other | portable Go | implemented |
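The fallback chains in the table follow a common pattern: probe CPU features once, then bind the widest available kernel. The sketch below illustrates that pattern for the amd64 row only. The types `cpuFeatures` and `kernel` and the function `selectKernel` are hypothetical names, the kernels are placeholders that merely report which path ran, and none of this is taken from the library's source.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical feature summary. A real build would fill it
// from CPU detection at startup.
type cpuFeatures struct{ avx512, avx2 bool }

// kernel pairs a path name with a compression routine.
type kernel struct {
	name     string
	compress func(block *[64]byte) string
}

// placeholder returns a stand-in routine that only reports its path name.
func placeholder(name string) func(*[64]byte) string {
	return func(*[64]byte) string { return name }
}

// selectKernel walks the chain AVX-512, then AVX2, then portable, and returns
// the first path the CPU supports.
func selectKernel(f cpuFeatures) kernel {
	switch {
	case f.avx512:
		return kernel{"avx512", placeholder("avx512")}
	case f.avx2:
		return kernel{"avx2", placeholder("avx2")}
	default:
		return kernel{"portable", placeholder("portable")}
	}
}

func main() {
	k := selectKernel(cpuFeatures{avx2: true})
	var block [64]byte
	fmt.Println(k.name, k.compress(&block))
}
```

Selecting once and storing the result keeps the feature check out of the hot loop; build tags can then remove whole branches on targets where a path can never apply.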
arm64 / NEON
blake3_arm64.s carries two hand-written NEON kernels:
- Multi-message (4-chunk) kernel (compressChunksNEON): the throughput path.
  Compresses 4 chunks in parallel with vertical 4-lane SIMD (one lane per
  chunk, no diagonalization shuffles): the block is cross-chunk transposed
  (VTRN1/VTRN2) into 16 message vectors, then 7 rounds of 8 G functions run
  purely vertically. compressBuffer feeds groups of 4 full chunks through it.
- Single-compression kernel (compressNodeNEON): row-based
  diagonalize/undiagonalize, message words gathered via 4-register VTBL.
  Used for parents, partial/trailing chunks, and XOF output blocks.
Together they cover the entire arm64 path. Both are verified bit-exact against
the portable core (TestNEONMatchesGeneric, TestChunksNEONMatchesGeneric,
and TestSIMDMatchesGeneric comparing the 4-way compressBuffer to the
portable buffer for every length), and the full official-vector suite passes
under arm64 emulation. To our knowledge, no other Go BLAKE3 library ships
arm64 SIMD, which should matter on Apple Silicon and Graviton.
On-hardware benchmark numbers are still pending an ARM CI runner; so far
correctness has been established under emulation.
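The vertical layout of the multi-message kernel can be modeled in plain Go. The sketch below is a scalar model and not the assembly: `lanes`, `transpose`, and `gVertical` are hypothetical names, and a 4-element array stands in for one NEON register. It shows the cross-chunk transpose and why the diagonal step needs no shuffles: with one lane per chunk, a diagonal G is the same lane-wise operation applied to a different set of state vectors.

```go
package main

import (
	"fmt"
	"math/bits"
)

// lanes holds the same word for 4 independent chunks, one chunk per lane.
type lanes [4]uint32

// g is the scalar BLAKE3 G function on four state words.
func g(a, b, c, d, mx, my uint32) (uint32, uint32, uint32, uint32) {
	a = a + b + mx
	d = bits.RotateLeft32(d^a, -16) // rotate right by 16
	c = c + d
	b = bits.RotateLeft32(b^c, -12)
	a = a + b + my
	d = bits.RotateLeft32(d^a, -8)
	c = c + d
	b = bits.RotateLeft32(b^c, -7)
	return a, b, c, d
}

// gVertical applies g to each lane independently, as a 4-lane SIMD
// instruction sequence would.
func gVertical(a, b, c, d, mx, my *lanes) {
	for l := 0; l < 4; l++ {
		a[l], b[l], c[l], d[l] = g(a[l], b[l], c[l], d[l], mx[l], my[l])
	}
}

// transpose turns 4 blocks of 16 message words (one block per chunk) into
// 16 vectors of 4 lanes, so vector w holds word w of every chunk.
func transpose(blocks *[4][16]uint32) (m [16]lanes) {
	for c := 0; c < 4; c++ {
		for w := 0; w < 16; w++ {
			m[w][c] = blocks[c][w]
		}
	}
	return m
}

func main() {
	var blocks [4][16]uint32
	for c := range blocks {
		for w := range blocks[c] {
			blocks[c][w] = uint32(c<<8 | w)
		}
	}
	m := transpose(&blocks)

	// State vectors are zero here; a real kernel loads the chaining value,
	// IV, counter, block length, and flags first.
	var v [16]lanes
	gVertical(&v[0], &v[4], &v[8], &v[12], &m[0], &m[1])  // a column step
	gVertical(&v[0], &v[5], &v[10], &v[15], &m[8], &m[9]) // a diagonal step
	fmt.Println(v[0])
}
```

The row-based single-compression kernel makes the opposite trade: it keeps one compression's state in four registers and pays for diagonalize/undiagonalize shuffles, which is why it serves the cases where four full chunks are not available.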
wasm host bridge
Go's wasm backend emits no SIMD128, so an in-wasm Go hash is scalar. The
default GOARCH=wasm build is portable and instantiates anywhere. To offload
hashing to a host-provided SIMD128 implementation, build with:
GOARCH=wasm -tags blake3_wasm_host
and supply the import blake3.hash(inputPtr, inputLen, outPtr, outLen) at
instantiation (see components/blake3/blake3_wasm_host.go). The bridge wires
the unkeyed one-shot fast path to the host; streaming, keyed, and derive-key
stay on the portable core.
CLI (blake3sum)
Mirrors the reference b3sum:
blake3sum [OPTIONS] [FILE]...
--keyed keyed mode; 32-byte key read from stdin
--derive-key CTX key-derivation mode with context string CTX
-l, --length LEN output bytes before hex (default 32)
--seek SEEK starting output offset before hex
--num-threads NUM max worker threads (default: logical CPUs)
--no-mmap do not memory-map inputs
--no-names omit file names
--raw raw output bytes instead of hex (single input)
--tag BSD-style: BLAKE3 (FILE) = HASH
-c, --check verify sums read from FILEs
--quiet with --check, suppress per-file OK lines
Large regular files are memory-mapped (unix) and hashed in one parallel pass.
With --no-mmap, or for non-regular inputs, data is streamed in 4 MiB blocks.
Workspace layout
components/blake3 the hashing library (public API + arch dispatch + asm)
components/checkfile b3sum checksum-line parse/format
bases/cli the blake3sum CLI
projects/blake3sum project wiring; root main.go shares the same entry
Building & testing
make # goforge check + vet + host tests
make build # build the blake3sum binary
make bench # benchmarks
make test-arm64 # cross-build + run arm64 (NEON) tests under emulation
make test-wasm # cross-build wasm (scalar + host-bridge)
make ci # full local CI: host + arm64(qemu) + wasm
make test-arm64 runs the arm64 build (including the NEON assembly kernels) on
an x86 host via static qemu-aarch64 user-mode emulation, with no root and no
binfmt setup required. make qemu fetches a static emulator into .tools/;
alternatively, install qemu-user-static from your package manager. Under the
hood it is just:
GOOS=linux GOARCH=arm64 go test -exec qemu-aarch64-static ./components/blake3/
Offline builds: export GOPROXY=off GOFLAGS=-mod=mod (deps are in the module
cache).
Correctness
Verified against the official BLAKE3 test vectors
(components/blake3/testdata/test_vectors.json) for the regular, keyed, and
derive-key modes including extended output, plus chunked-streaming, batch,
XOF-seek, and SIMD-vs-portable equivalence tests. The arm64 NEON kernels are
checked bit-exact against the portable core, and the full vector suite runs
under emulation (make test-arm64).
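The equivalence tests share one shape: compute a digest two ways for every input length and fail on the first disagreement. The sketch below shows that shape for the chunked-streaming case. `firstMismatch` is a hypothetical helper, SHA-256 stands in for the hash under test, and the repository's actual tests may be organized differently.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// firstMismatch compares a one-shot digest with a chunked streaming digest for
// every input length in [0, maxLen] and every write size in chunkSizes.
// It returns ok=true if all agree, otherwise the first failing pair.
// Every chunk size must be positive.
func firstMismatch(maxLen int, chunkSizes []int) (length, chunk int, ok bool) {
	data := make([]byte, maxLen)
	for i := range data {
		data[i] = byte(i % 251) // repeating 0..250 pattern, easy to reproduce
	}
	for n := 0; n <= maxLen; n++ {
		want := sha256.Sum256(data[:n])
		for _, cs := range chunkSizes {
			h := sha256.New()
			for off := 0; off < n; off += cs {
				end := off + cs
				if end > n {
					end = n
				}
				h.Write(data[off:end])
			}
			var got [32]byte
			copy(got[:], h.Sum(nil))
			if got != want {
				return n, cs, false
			}
		}
	}
	return 0, 0, true
}

func main() {
	n, cs, ok := firstMismatch(2048, []int{1, 63, 64, 65, 1024})
	fmt.Println(n, cs, ok)
}
```

Sweeping every length, including sizes just below and above block and chunk boundaries, is what catches off-by-one errors in partial-block handling; a SIMD-vs-portable check would keep the loop and swap the two sides of the comparison.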
Attribution
The amd64 AVX-512/AVX2 assembly core (blake3_amd64.s) and its avo generator
(_avo/gen.go) are adapted from
lukechampine.com/blake3 (MIT,
Copyright 2020 Luke Champine). The library API, batch hashing, tree/streaming
driver, architecture dispatch, wasm bridge, and CLI are original to this
project.
go vet reports benign asmdecl warnings on the generated blake3_amd64.s
(high-dword access of the 64-bit counter); these are present in the upstream
generated assembly and do not affect correctness — the test suite validates the
assembly against the official vectors.