Documentation ¶
Overview ¶
Package buzhash implements the buzhash rolling hash algorithm for content-defined chunking.
Content-defined chunking splits a byte stream into variable-size chunks based on the data content rather than fixed boundaries. This enables deduplication: identical data produces identical chunk boundaries, so unchanged regions between backups yield the same chunks.
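The effect of content-based boundaries can be seen in a toy sketch. The code below does not use the package at all: `toyWindow`, `cutPoints`, and `pseudoRandom` are hypothetical names, and the boundary rule (sum of the last 8 bytes divisible by 16) is a deliberately simple stand-in for buzhash. It only illustrates that when a byte is inserted at the front of the data, the later boundaries move with the content instead of staying at fixed offsets.

```go
package main

import "fmt"

// toyWindow is the number of trailing bytes that decide a boundary.
// It is an illustrative value, not the package's WindowSize.
const toyWindow = 8

// cutPoints returns the end offsets (exclusive) of the chunks found in data.
// A boundary is declared after byte i when the sum of the last toyWindow
// bytes is divisible by 16, so the decision depends only on local content.
func cutPoints(data []byte) []int {
	var cuts []int
	var sum uint32
	for i, b := range data {
		sum += uint32(b)
		if i >= toyWindow {
			sum -= uint32(data[i-toyWindow]) // drop the byte leaving the window
		}
		if i >= toyWindow-1 && sum%16 == 0 {
			cuts = append(cuts, i+1)
		}
	}
	return cuts
}

// pseudoRandom returns n deterministic bytes from a small linear
// congruential generator, so the demo needs no external input.
func pseudoRandom(n int) []byte {
	out := make([]byte, n)
	state := uint32(12345)
	for i := range out {
		state = state*1664525 + 1013904223
		out[i] = byte(state >> 24)
	}
	return out
}

func main() {
	original := pseudoRandom(512)
	shifted := append([]byte{0xFF}, original...) // one byte inserted at the front

	fmt.Println("original:", cutPoints(original))
	fmt.Println("shifted: ", cutPoints(shifted))
}
```

Every boundary found in `original` reappears in `shifted` exactly one position later, so the chunks between those boundaries are byte-for-byte identical. A fixed-size splitter would instead change every chunk after the insertion point.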
Configuration ¶
Use NewConfig to create a Config from a desired average chunk size. The config determines minimum, maximum, and average chunk sizes as well as the hash mask and threshold used for boundary detection:
cfg, err := buzhash.NewConfig(4096) // ~4 KiB average chunks
if err != nil {
	log.Fatal(err)
}
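To make the relationship between the fields concrete, here is a minimal sketch of one way such parameters could be derived from an average size. Everything in it is an assumption for illustration: `sketchConfig`, `newSketchConfig`, the power-of-two requirement, the min/max ratios, and the `h&Mask >= Threshold` test are hypothetical and may differ from what NewConfig actually computes.

```go
package main

import (
	"errors"
	"fmt"
)

// sketchConfig reuses the field names of Config, but the derivation in
// newSketchConfig is an assumption, not the package's real logic.
type sketchConfig struct {
	AvgChunkSize, MinChunkSize, MaxChunkSize int
	Mask, Threshold                          uint32
}

// newSketchConfig derives hypothetical parameters from a power-of-two average.
func newSketchConfig(avg int) (sketchConfig, error) {
	if avg < 4 || avg > 1<<28 || avg&(avg-1) != 0 {
		return sketchConfig{}, errors.New("average must be a power of two in [4, 1<<28]")
	}
	mask := uint32(avg) - 1
	return sketchConfig{
		AvgChunkSize: avg,
		MinChunkSize: avg / 4, // assumed ratio
		MaxChunkSize: avg * 4, // assumed ratio
		Mask:         mask,
		Threshold:    mask, // only one masked value passes the test
	}, nil
}

// isBoundary reports whether hash value h would end a chunk under the
// assumed rule. For a uniformly distributed h, one value in AvgChunkSize passes.
func (c sketchConfig) isBoundary(h uint32) bool {
	return h&c.Mask >= c.Threshold
}

func main() {
	cfg, err := newSketchConfig(4096)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
	fmt.Println("boundary at h=4095:", cfg.isBoundary(4095))
}
```

Under this assumed rule, the mask and threshold together set the probability that any given position ends a chunk, while the minimum and maximum sizes bound how far a chunk can deviate from the average.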
Chunking a Stream ¶
Create a Chunker from an io.Reader and call Next repeatedly:
chunker := buzhash.NewChunker(reader, cfg)
for {
	chunk, err := chunker.Next()
	if err == io.EOF {
		break
	}
	if err != nil {
		log.Fatal(err)
	}
	// process chunk
}
Low-level Hasher ¶
For custom chunking logic, use Hasher directly. It provides a 32-bit rolling hash over a 64-byte sliding window:
h := buzhash.NewHasher()
h.Update(b) // b is the next input byte
sum := h.Sum()
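For readers unfamiliar with how a rolling hash slides over its window, the following is a minimal sketch of the general buzhash (cyclic polynomial) technique with a 32-bit state and a 64-byte window. It is not the package's implementation: `rollingHash`, `update`, `hashFromScratch`, and the xorshift-generated substitution table are hypothetical, and the real Hasher almost certainly uses a different table.

```go
package main

import (
	"fmt"
	"math/bits"
)

const window = 64 // same value as WindowSize

// table maps each byte value to a pseudo-random 32-bit word.
// The contents are an assumption made for this sketch.
var table = makeTable()

func makeTable() [256]uint32 {
	var t [256]uint32
	state := uint32(2463534242)
	for i := range t {
		// xorshift32 step, used only to fill the table deterministically
		state ^= state << 13
		state ^= state >> 17
		state ^= state << 5
		t[i] = state
	}
	return t
}

// rollingHash keeps the hash of the most recent window bytes.
type rollingHash struct {
	h    uint32
	ring [window]byte // last window bytes, used to find the outgoing byte
	n    int          // total bytes fed so far
}

// update slides the window forward by one byte in constant time.
func (r *rollingHash) update(in byte) {
	slot := r.n % window
	r.h = bits.RotateLeft32(r.h, 1) ^ table[in]
	if r.n >= window {
		// The outgoing byte has been rotated once per step, window times in total.
		out := r.ring[slot]
		r.h ^= bits.RotateLeft32(table[out], window)
	}
	r.ring[slot] = in
	r.n++
}

// hashFromScratch computes the same hash over data without rolling.
func hashFromScratch(data []byte) uint32 {
	var h uint32
	for _, b := range data {
		h = bits.RotateLeft32(h, 1) ^ table[b]
	}
	return h
}

func main() {
	data := make([]byte, 300)
	for i := range data {
		data[i] = byte(i*7 + 3)
	}
	var r rollingHash
	for _, b := range data {
		r.update(b)
	}
	fmt.Println(r.h == hashFromScratch(data[len(data)-window:]))
}
```

The key property is that after any number of updates, the rolled value equals the hash computed directly over the last 64 bytes, which is what lets a chunker test every byte position for a boundary at constant cost per byte.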
Index ¶
Constants ¶
const WindowSize = 64
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Chunker ¶
type Chunker struct {
// contains filtered or unexported fields
}
Chunker splits a data stream into variable-size chunks using buzhash content-defined chunking. It performs zero heap allocations during scanning.
The returned slice from Next references internal buffers and is valid only until the next call to Next. Callers must copy the data if they need to retain it.
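The copy requirement is easy to overlook, so here is a small self-contained sketch of what goes wrong without it. `bufferReuser` is a hypothetical stand-in that, like the Chunker described above, returns slices of a single internal buffer; it is not part of the package and splits at fixed 4-byte steps purely to keep the demo short.

```go
package main

import "fmt"

// bufferReuser hands out 4-byte pieces of src, always through the same
// internal buffer, so each returned slice is overwritten by the next call.
type bufferReuser struct {
	src []byte
	buf [4]byte
}

// next returns the next piece, or nil when src is exhausted.
func (r *bufferReuser) next() []byte {
	if len(r.src) == 0 {
		return nil
	}
	n := copy(r.buf[:], r.src)
	r.src = r.src[n:]
	return r.buf[:n]
}

// collect gathers all pieces, optionally copying each one before keeping it.
func collect(src []byte, copyEach bool) [][]byte {
	r := &bufferReuser{src: src}
	var out [][]byte
	for c := r.next(); c != nil; c = r.next() {
		if copyEach {
			c = append([]byte(nil), c...) // detach from the shared buffer
		}
		out = append(out, c)
	}
	return out
}

func main() {
	aliased := collect([]byte("aaaabbbb"), false)
	copied := collect([]byte("aaaabbbb"), true)
	fmt.Printf("without copy: %q %q\n", aliased[0], aliased[1])
	fmt.Printf("with copy:    %q %q\n", copied[0], copied[1])
}
```

Without the copy, both retained slices alias the same buffer and show only the most recent piece. The same pattern applies to slices returned by Next if they must outlive the following call.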
func NewChunker ¶
NewChunker creates a chunker that reads from r with the given config.
type Config ¶
type Config struct {
AvgChunkSize int
MinChunkSize int
MaxChunkSize int
Mask uint32
Threshold uint32
}
Config holds chunking parameters derived from an average chunk size.
func DefaultConfig ¶
func DefaultConfig() Config
DefaultConfig returns the standard configuration, which uses a 4 MB average chunk size.
type Hasher ¶
type Hasher struct {
// contains filtered or unexported fields
}
Hasher implements the buzhash rolling hash.
func (*Hasher) BytesProcessed ¶
BytesProcessed returns the total number of bytes fed to the hasher.
func (*Hasher) InitFromData ¶
InitFromData initializes the hash from the first WindowSize bytes of data. If data has fewer than WindowSize bytes, all bytes are consumed.
type Scanner ¶ added in v0.28.5
type Scanner struct {
Config Config
// contains filtered or unexported fields
}
Scanner scans a byte buffer for content-defined chunk boundaries using the same buzhash rolling hash as Chunker, but without owning a reader. This mirrors the Rust pbs-datastore `Chunker` trait / `ChunkerImpl` used by the payload ChunkStream. The caller controls buffering and feeds data slices via Scan, which returns the offset of the next boundary, or 0 when the provided data contains no boundary and more data must be supplied. State persists across calls until Reset.
func NewScanner ¶ added in v0.28.5
NewScanner creates a Scanner for the given config.
func (*Scanner) Reset ¶ added in v0.28.5
func (s *Scanner) Reset()
Reset returns the scanner to its initial state, mirroring ChunkerImpl::reset (h=0, chunk_size=0, window_size=0). Call it after each chunk boundary (natural or forced) so that the next chunk starts hashing from scratch.
func (*Scanner) Scan ¶ added in v0.28.5
Scan scans data for a chunk boundary, mirroring ChunkerImpl::scan exactly. It returns 0 if no boundary was found within data (call again with more data), or a value greater than 0 giving the position of the boundary relative to the start of data. State persists across calls, so consecutive slices are treated as one contiguous stream until Reset is called.
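The calling convention described here (0 means "feed more data", a positive value is an offset into the slice just passed) implies a particular driver loop on the caller's side. The sketch below shows one way to write it. The method signatures `Scan(data []byte) int` and `Reset()` are assumptions, since the listing does not show Scan's signature, and `newlineScanner` is a hypothetical toy that cuts after each newline instead of using a rolling hash.

```go
package main

import (
	"bytes"
	"fmt"
)

// boundaryScanner captures the assumed shape of Scanner's methods.
type boundaryScanner interface {
	Scan(data []byte) int // 0: no boundary in data; >0: boundary offset in data
	Reset()
}

// newlineScanner is a toy implementation: a boundary follows every '\n'.
// Unlike a real hash-based scanner it carries no state between calls.
type newlineScanner struct{}

func (newlineScanner) Scan(data []byte) int { return bytes.IndexByte(data, '\n') + 1 }
func (newlineScanner) Reset()               {}

// splitStream feeds successive input slices to s and assembles whole chunks.
func splitStream(s boundaryScanner, pieces [][]byte) [][]byte {
	var chunks [][]byte
	var pending []byte // bytes of the current, still-open chunk
	for _, piece := range pieces {
		for len(piece) > 0 {
			n := s.Scan(piece)
			if n == 0 {
				// No boundary yet: keep everything and wait for more data.
				pending = append(pending, piece...)
				break
			}
			pending = append(pending, piece[:n]...)
			chunks = append(chunks, pending)
			pending = nil
			s.Reset() // the next chunk starts from a clean state
			piece = piece[n:]
		}
	}
	if len(pending) > 0 {
		chunks = append(chunks, pending) // trailing partial chunk at end of stream
	}
	return chunks
}

func main() {
	pieces := [][]byte{[]byte("ab\ncd"), []byte("e\n\nf")}
	for _, c := range splitStream(newlineScanner{}, pieces) {
		fmt.Printf("%q\n", c)
	}
}
```

Note that a chunk can span several input slices: the bytes "cd" from the first slice are held until the boundary arrives in the second. A caller that also enforces a maximum chunk size would add a forced cut and a Reset at that point.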