blake3sum

command module

v1.0.0 (latest) · Published: Jun 9, 2026 · License: MIT · Imports: 2 · Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/brain-fuel/blake3sum

README

blake3sum

A fast BLAKE3 hashing library and a b3sum-compatible command-line tool, written in Go and structured as a goforge Polylith workspace.

Built for content and "essentia" hashing of compiler intermediate representations: alongside the usual streaming and one-shot APIs it provides a parallel batch API (HashMany) tuned for hashing many small, independent nodes at once.

go install goforge.dev/blake3sum@latest      # installs the blake3sum CLI

import "goforge.dev/blake3sum/components/blake3"

Library

sum := blake3.Sum256(data)            // [32]byte, one-shot
sum = blake3.Hash(data)               // alias for Sum256 (same [32]byte)
ext := blake3.Sum512(data)            // [64]byte

h := blake3.New(32, nil)              // streaming, hash.Hash
h.Write(p); digest := h.Sum(nil)

kh := blake3.NewKeyed(key32)          // keyed mode (key is 32 bytes)
dk := blake3.NewDeriveKey("app 2025 ctx"); dk.Write(material)
blake3.DeriveKey(out, "app 2025 ctx", material)  // one-shot KDF

xof := h.XOF()                        // seekable extendable output
xof.Seek(1<<20, io.SeekStart); xof.Read(buf)

// Batch: hash many independent inputs in parallel — the compiler hot path.
digests := blake3.HashMany(inputs)            // [][32]byte, order-preserving
blake3.HashManyInto(dst, inputs)              // no allocation of the result
keyed := blake3.HashManyKeyed(key32, inputs)
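XOF seeks are cheap because BLAKE3 produces extended output in 64-byte blocks, each an independent compression of the root node with the output-block counter set to that block's index, so no earlier output has to be generated first. A minimal sketch of the arithmetic behind a call like xof.Seek(1<<20, io.SeekStart), assuming nothing about this package's internals (the helper xofPosition is hypothetical, not part of the blake3 API):

```go
package main

import "fmt"

// outBlockLen is the size of one BLAKE3 root-output block in bytes.
const outBlockLen = 64

// xofPosition (hypothetical helper) maps an absolute XOF byte offset to
// the output-block counter to compress with and the number of bytes to
// discard from the start of that block.
func xofPosition(offset uint64) (counter uint64, skip int) {
	return offset / outBlockLen, int(offset % outBlockLen)
}

func main() {
	counter, skip := xofPosition(1 << 20)
	fmt.Println(counter, skip) // 16384 0
	counter, skip = xofPosition(100)
	fmt.Println(counter, skip) // 1 36
}
```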

Performance

Measured with go test -bench on an 11th-gen Intel Core i7-11800H (AVX-512, 8 cores / 16 threads):

Workload	Result
One-shot, 16 MiB (multithreaded AVX-512)	~12.3 GB/s
One-shot, 1 MiB	~4.7 GB/s
One-shot, 64 B (latency)	~170 ns
`HashMany`, 100k × 48 B vs sequential loop	3.3× faster (5.2 ms vs 17.4 ms)

Single-stream throughput is on par with or faster than lukechampine.com/blake3, the fastest existing Go implementation, on the same machine. The HashMany speedup over a sequential Sum256 loop scales with core count and is the main win for the IR-hashing use case; none of the reference implementations offers a batch API.

Against implementations that are already SIMD-saturated and multithreaded (the optimized Rust b3sum, lukechampine.com/blake3), single-stream throughput is bounded by memory bandwidth and the cost of the compression function, so there is no 2× headroom left in any language. For compiler workloads, the large, honest win is the batch path.

Architecture dispatch

The compression core is selected at build/run time:

Target	Path	Status
amd64	AVX-512 → AVX2 → portable	implemented, assembly core
arm64	NEON → portable	implemented (4-wide NEON kernel), verified under emulation
wasm	host SIMD128 bridge → portable	bridge implemented (opt-in), portable default
other	portable Go	implemented
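The fallback chain in the table amounts to a first-match selection per architecture. A minimal sketch, assuming the CPU features are already known (cpuFeatures and selectCore are hypothetical names; real detection would come from something like golang.org/x/sys/cpu, and the wasm host path is actually chosen by the blake3_wasm_host build tag rather than at run time):

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for real feature detection.
// WasmHost models the opt-in blake3_wasm_host build tag as a bool.
type cpuFeatures struct {
	AVX512, AVX2, NEON, WasmHost bool
}

// selectCore returns the first available path in the dispatch table's
// order: fastest SIMD path first, portable Go as the universal fallback.
func selectCore(goarch string, f cpuFeatures) string {
	switch goarch {
	case "amd64":
		if f.AVX512 {
			return "avx512"
		}
		if f.AVX2 {
			return "avx2"
		}
	case "arm64":
		if f.NEON {
			return "neon"
		}
	case "wasm":
		if f.WasmHost {
			return "wasm-host"
		}
	}
	return "portable"
}

func main() {
	fmt.Println(selectCore("amd64", cpuFeatures{AVX2: true})) // avx2
	fmt.Println(selectCore("arm64", cpuFeatures{NEON: true})) // neon
	fmt.Println(selectCore("riscv64", cpuFeatures{}))         // portable
}
```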

arm64 / NEON

blake3_arm64.s carries two hand-written NEON kernels:

Multi-message (4-chunk) kernel (compressChunksNEON) — the throughput path. Compresses 4 chunks in parallel with vertical 4-lane SIMD (one lane per chunk, no diagonalization shuffles): the block is cross-chunk transposed (VTRN1/VTRN2) into 16 message vectors, then 7 rounds of 8 G functions run purely vertically. compressBuffer feeds groups of 4 full chunks through it.
Single-compression kernel (compressNodeNEON) — row-based diagonalize/undiagonalize, message words gathered via 4-register VTBL. Used for parents, partial/trailing chunks, and XOF output blocks.

Together they cover the entire arm64 path. Both are verified bit-exact against the portable core (TestNEONMatchesGeneric, TestChunksNEONMatchesGeneric, and TestSIMDMatchesGeneric, which compares the 4-way compressBuffer with the portable buffer path for every length), and the full official-vector suite passes under arm64 emulation. No other Go BLAKE3 library ships arm64 SIMD, so this should be a real win on Apple Silicon and Graviton.

Real on-hardware benchmark numbers are still pending an ARM CI runner; correctness is fully established under emulation.
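The vertical layout is what removes the diagonalization shuffles: each vector holds the same state word from 4 different chunks, so the diagonal half of a round simply addresses different vectors instead of rotating lanes. The scalar Go sketch below models that layout with one [4]uint32 per vector and checks that one vertical round equals four independent scalar rounds. It is an illustration, not the package's assembly; vec, gVec, and roundVec are hypothetical names, g follows the standard BLAKE3 G function, and the message permutation applied between rounds is omitted.

```go
package main

import (
	"fmt"
	"math/bits"
)

// vec holds one 32-bit word from each of 4 chunks: lane k is chunk k.
type vec [4]uint32

// transpose turns 4 chunks' 16-word blocks into 16 vectors, so m[w]
// holds word w of every chunk (the role VTRN1/VTRN2 play in assembly).
func transpose(blocks [4][16]uint32) (m [16]vec) {
	for w := 0; w < 16; w++ {
		for lane := 0; lane < 4; lane++ {
			m[w][lane] = blocks[lane][w]
		}
	}
	return m
}

// g is the scalar BLAKE3 G function on a 16-word state.
func g(s *[16]uint32, a, b, c, d int, mx, my uint32) {
	s[a] += s[b] + mx
	s[d] = bits.RotateLeft32(s[d]^s[a], -16)
	s[c] += s[d]
	s[b] = bits.RotateLeft32(s[b]^s[c], -12)
	s[a] += s[b] + my
	s[d] = bits.RotateLeft32(s[d]^s[a], -8)
	s[c] += s[d]
	s[b] = bits.RotateLeft32(s[b]^s[c], -7)
}

// gVec applies exactly the same operations lane-wise to all 4 chunks.
func gVec(s *[16]vec, a, b, c, d int, mx, my vec) {
	for l := range mx {
		s[a][l] += s[b][l] + mx[l]
		s[d][l] = bits.RotateLeft32(s[d][l]^s[a][l], -16)
		s[c][l] += s[d][l]
		s[b][l] = bits.RotateLeft32(s[b][l]^s[c][l], -12)
		s[a][l] += s[b][l] + my[l]
		s[d][l] = bits.RotateLeft32(s[d][l]^s[a][l], -8)
		s[c][l] += s[d][l]
		s[b][l] = bits.RotateLeft32(s[b][l]^s[c][l], -7)
	}
}

// idx lists the 8 G calls of one round: 4 columns, then 4 diagonals.
var idx = [8][4]int{
	{0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15},
	{0, 5, 10, 15}, {1, 6, 11, 12}, {2, 7, 8, 13}, {3, 4, 9, 14},
}

// round runs one scalar round; G call i consumes message words 2i, 2i+1.
func round(s *[16]uint32, m [16]uint32) {
	for i, q := range idx {
		g(s, q[0], q[1], q[2], q[3], m[2*i], m[2*i+1])
	}
}

// roundVec runs the same round on 4 chunks at once, with no shuffles.
func roundVec(s *[16]vec, m [16]vec) {
	for i, q := range idx {
		gVec(s, q[0], q[1], q[2], q[3], m[2*i], m[2*i+1])
	}
}

func main() {
	var blocks, states [4][16]uint32
	for lane := 0; lane < 4; lane++ {
		for w := 0; w < 16; w++ {
			blocks[lane][w] = uint32(lane*1000 + w*7 + 1)
			states[lane][w] = uint32(w)*0x9E3779B9 + uint32(lane)*31
		}
	}
	// State and message both get the "word w of every chunk" layout.
	sv := transpose(states)
	roundVec(&sv, transpose(blocks))
	for lane := 0; lane < 4; lane++ {
		round(&states[lane], blocks[lane])
		for w := 0; w < 16; w++ {
			if sv[w][lane] != states[lane][w] {
				panic("vertical round disagrees with scalar round")
			}
		}
	}
	fmt.Println("one vertical round matches four scalar rounds")
}
```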

wasm host bridge

Go's wasm backend emits no SIMD128, so an in-wasm Go hash is scalar. The default GOARCH=wasm build is portable and instantiates anywhere. To offload hashing to a host-provided SIMD128 implementation, build with:

GOARCH=wasm -tags blake3_wasm_host

and supply the import blake3.hash(inputPtr, inputLen, outPtr, outLen) at instantiation (see components/blake3/blake3_wasm_host.go). The bridge wires the unkeyed one-shot fast path to the host; streaming, keyed, and derive-key stay on the portable core.

CLI (`blake3sum`)

Mirrors the reference b3sum:

blake3sum [OPTIONS] [FILE]...

--keyed             keyed mode; 32-byte key read from stdin
--derive-key CTX    key-derivation mode with context string CTX
-l, --length LEN    number of output bytes, before hex encoding (default 32)
--seek SEEK         starting output byte offset, before hex encoding
--num-threads NUM   max worker threads (default: logical CPUs)
--no-mmap           do not memory-map inputs
--no-names          omit file names
--raw               raw output bytes instead of hex (single input)
--tag               BSD-style: BLAKE3 (FILE) = HASH
-c, --check         verify sums read from FILEs
--quiet             with --check, suppress per-file OK lines
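The two output formats are simple to produce and the default one is simple to parse back for --check. A minimal sketch, assuming the usual b3sum layout of hex digest, two spaces, file name (formatLine and parseLine are hypothetical names, not the checkfile package's API, and b3sum's escaping of file names containing backslashes or newlines is deliberately not handled):

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// formatLine renders one output line: "HASH  FILE" by default, or the
// BSD-style "BLAKE3 (FILE) = HASH" form used with --tag.
func formatLine(sum []byte, name string, tag bool) string {
	h := hex.EncodeToString(sum)
	if tag {
		return fmt.Sprintf("BLAKE3 (%s) = %s", name, h)
	}
	return h + "  " + name
}

// parseLine splits a default-format line back into digest and file name.
// The split happens at the first double space, so names may contain spaces.
func parseLine(line string) ([]byte, string, error) {
	hexSum, name, ok := strings.Cut(line, "  ")
	if !ok || name == "" {
		return nil, "", errors.New("malformed checksum line")
	}
	sum, err := hex.DecodeString(hexSum)
	if err != nil {
		return nil, "", fmt.Errorf("bad hex digest: %w", err)
	}
	return sum, name, nil
}

func main() {
	sum := []byte{0xde, 0xad, 0xbe, 0xef}
	fmt.Println(formatLine(sum, "a.txt", false)) // deadbeef  a.txt
	fmt.Println(formatLine(sum, "a.txt", true))  // BLAKE3 (a.txt) = deadbeef
	got, name, err := parseLine("deadbeef  a.txt")
	fmt.Println(hex.EncodeToString(got), name, err) // deadbeef a.txt <nil>
}
```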

Large regular files are memory-mapped (on Unix) and hashed in one parallel pass; with --no-mmap, and for non-regular inputs such as pipes, input is streamed in 4 MiB blocks.

Workspace layout

components/blake3      the hashing library (public API + arch dispatch + asm)
components/checkfile    b3sum checksum-line parse/format
bases/cli               the blake3sum CLI
projects/blake3sum      project wiring; the root main.go shares the same entry point

Building & testing

make            # goforge check + vet + host tests
make build      # build the blake3sum binary
make bench      # benchmarks
make test-arm64 # cross-build + run arm64 (NEON) tests under emulation
make test-wasm  # cross-build wasm (scalar + host-bridge)
make ci         # full local CI: host + arm64(qemu) + wasm

make test-arm64 runs the arm64 build (including the NEON assembly kernels) on an x86 host via static qemu-aarch64 user-mode emulation, with no root and no binfmt setup required. make qemu fetches a static emulator into .tools/; alternatively, install qemu-user-static from your package manager. Under the hood it is just:

GOOS=linux GOARCH=arm64 go test -exec qemu-aarch64-static ./components/blake3/

Offline builds: export GOPROXY=off GOFLAGS=-mod=mod (this works only when all dependencies are already in the module cache).

Correctness

Verified against the official BLAKE3 test vectors (components/blake3/testdata/test_vectors.json) for the regular, keyed, and derive-key modes, including extended output, plus chunked-streaming, batch, XOF-seek, and SIMD-vs-portable equivalence tests. The arm64 NEON kernels are checked bit-exact against the portable core, and the full vector suite runs under emulation (make test-arm64).

Attribution

The amd64 AVX-512/AVX2 assembly core (blake3_amd64.s) and its avo generator (_avo/gen.go) are adapted from lukechampine.com/blake3 (MIT, Copyright 2020 Luke Champine). The library API, batch hashing, tree/streaming driver, architecture dispatch, wasm bridge, and CLI are original to this project.

go vet reports benign asmdecl warnings on the generated blake3_amd64.s (high-dword access of the 64-bit counter); these are present in the upstream generated assembly and do not affect correctness — the test suite validates the assembly against the official vectors.

Documentation

Overview

Command blake3sum is a BLAKE3 checksum tool that mirrors the reference b3sum CLI. It hashes files or standard input, supports keyed and key-derivation modes, extendable/seekable output, and --check verification.

This root entry point exists so `go install goforge.dev/blake3sum@latest` works. The canonical project lives under projects/blake3sum (goforge Polylith layout); both share the same bases/cli entry point.

Source Files

main.go

Directories

Path	Synopsis
bases
cli	Package cli is the blake3sum command: a BLAKE3 checksum tool that mirrors the reference b3sum CLI (hash files or stdin, keyed and key-derivation modes, extendable/seekable output, and --check verification).
components
blake3	Package blake3 implements the BLAKE3 cryptographic hash function.
checkfile	Package checkfile parses and formats BLAKE3 checksum lines, matching the output and --check input of the reference b3sum tool.
development	Package development wires every brick into one place for IDE and REPL use.
projects
blake3sum (command)	Command blake3sum is the blake3sum project entry point.
