buzhash

package

v0.31.4 (latest) Published: Jun 19, 2026 License: MIT Imports: 2 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/pbs-plus/pxar


Documentation ¶

Overview ¶

Package buzhash implements the buzhash rolling hash algorithm for content-defined chunking.

Content-defined chunking splits a byte stream into variable-size chunks based on the data content rather than fixed boundaries. This enables deduplication: identical data produces identical chunk boundaries, so unchanged regions between backups yield the same chunks.
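The deduplication property can be illustrated with a toy chunker that uses only the standard library. This is a minimal sketch, not this package's algorithm: the names `split`, `windowHash`, `sharedChunks`, and `pseudoRandom` are hypothetical, the window hash is plain FNV-1a recomputed at every position instead of a rolling buzhash, and there are no minimum or maximum chunk sizes.

```go
package main

import "fmt"

const window = 16 // bytes of context that decide whether a boundary falls here

// windowHash is FNV-1a over one window. It is recomputed from scratch at
// every position, which is simple but slow; a rolling hash avoids that cost.
func windowHash(w []byte) uint32 {
	h := uint32(2166136261)
	for _, b := range w {
		h ^= uint32(b)
		h *= 16777619
	}
	return h
}

// split cuts data wherever the top 10 bits of the window hash are zero,
// which happens at roughly 1 position in 1024.
func split(data []byte) [][]byte {
	var chunks [][]byte
	start := 0
	for end := window; end <= len(data); end++ {
		if windowHash(data[end-window:end])>>22 == 0 {
			chunks = append(chunks, data[start:end])
			start = end
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// sharedChunks counts the chunks of b whose content also occurs in a.
func sharedChunks(a, b [][]byte) int {
	seen := make(map[string]bool, len(a))
	for _, c := range a {
		seen[string(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[string(c)] {
			n++
		}
	}
	return n
}

// pseudoRandom returns n deterministic bytes from a xorshift generator.
func pseudoRandom(n int) []byte {
	out := make([]byte, n)
	x := uint32(2463534242)
	for i := range out {
		x ^= x << 13
		x ^= x >> 17
		x ^= x << 5
		out[i] = byte(x >> 8)
	}
	return out
}

func main() {
	original := pseudoRandom(64 << 10)
	edited := append([]byte("PREFIX!"), original...) // shift every byte by 7

	a, b := split(original), split(edited)
	fmt.Printf("%d of %d chunks reused after the insertion\n", sharedChunks(a, b), len(b))
}
```

Because each boundary depends only on the bytes immediately before it, inserting a prefix disturbs the first chunk only; a fixed-size splitter would shift every boundary and share nothing.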

Configuration ¶

Use NewConfig to create a Config from a desired average chunk size. The config determines minimum, maximum, and average chunk sizes as well as the hash mask and threshold used for boundary detection:

cfg, err := buzhash.NewConfig(4096) // ~4 KiB average chunks
if err != nil {
    log.Fatal(err)
}

Chunking a Stream ¶

Create a Chunker from an io.Reader and call Next repeatedly:

chunker := buzhash.NewChunker(reader, cfg)
for {
    chunk, err := chunker.Next()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    // process chunk
}

Low-level Hasher ¶

For custom chunking logic, use Hasher directly. It provides a 32-bit rolling hash over a 64-byte sliding window:

h := buzhash.NewHasher()
h.Update(b) // b is the next input byte
sum := h.Sum()
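For readers unfamiliar with buzhash, the rolling update can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's source: the type `rollingHash`, the helper `hashWindow`, and the substitution table are made up here, and the package's real table and internal layout are not shown in this documentation. Only the 32-bit width and the 64-byte window come from the text above.

```go
package main

import (
	"fmt"
	"math/bits"
)

const windowSize = 64 // same value as the documented WindowSize constant

// table maps each byte value to a pseudo-random 32-bit word.
// Assumption: these values are invented for the sketch.
var table = func() [256]uint32 {
	var t [256]uint32
	x := uint32(0x9E3779B9)
	for i := range t {
		x ^= x << 13
		x ^= x >> 17
		x ^= x << 5
		t[i] = x
	}
	return t
}()

// rollingHash keeps the last windowSize bytes in a ring buffer.
type rollingHash struct {
	window [windowSize]byte
	pos    int // slot of the oldest byte once the window is full
	filled int
	sum    uint32
}

// update slides the window by one byte: rotate the sum, cancel the byte
// that leaves the window, then mix in the byte that enters it.
func (r *rollingHash) update(in byte) {
	r.sum = bits.RotateLeft32(r.sum, 1)
	if r.filled == windowSize {
		out := r.window[r.pos]
		// The outgoing byte has been rotated windowSize times in total.
		r.sum ^= bits.RotateLeft32(table[out], windowSize)
	} else {
		r.filled++
	}
	r.sum ^= table[in]
	r.window[r.pos] = in
	r.pos = (r.pos + 1) % windowSize
}

// hashWindow computes the same hash from scratch, for comparison.
func hashWindow(w []byte) uint32 {
	var h uint32
	for _, b := range w {
		h = bits.RotateLeft32(h, 1) ^ table[b]
	}
	return h
}

func main() {
	data := make([]byte, 200)
	for i := range data {
		data[i] = byte(i*31 + 7)
	}
	var r rollingHash
	for _, b := range data {
		r.update(b)
	}
	tail := data[len(data)-windowSize:]
	fmt.Println("rolling matches from-scratch:", r.sum == hashWindow(tail))
}
```

Each update costs a constant number of operations regardless of window size, which is what makes it practical to test every byte position for a boundary.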

Index ¶

Constants
type Chunker
- func NewChunker(r io.Reader, config Config) *Chunker
- func (c *Chunker) Next() ([]byte, error)
- func (c *Chunker) Reset(r io.Reader)
type Config
- func DefaultConfig() Config
- func NewConfig(avgChunkSize int) (Config, error)
type Hasher
- func NewHasher() *Hasher
- func (h *Hasher) BytesProcessed() int
- func (h *Hasher) InitFromData(data []byte)
- func (h *Hasher) Reset()
- func (h *Hasher) Sum() uint32
- func (h *Hasher) Update(in byte)
type Scanner
- func NewScanner(config Config) *Scanner
- func (s *Scanner) Reset()
- func (s *Scanner) Scan(data []byte) int

Constants ¶


const WindowSize = 64

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Chunker ¶

type Chunker struct {
	// contains filtered or unexported fields
}

Chunker splits a data stream into variable-size chunks using buzhash content-defined chunking. It performs zero heap allocations during scanning.

The returned slice from Next references internal buffers and is valid only until the next call to Next. Callers must copy the data if they need to retain it.
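The copy requirement can be sketched as follows. The interface `nexter` and the helper `collect` are hypothetical names; they rely only on the documented `Next() ([]byte, error)` signature. `reusingSource` is a stand-in that reuses one buffer so the example runs without the package.

```go
package main

import (
	"fmt"
	"io"
)

// nexter is the part of the documented Chunker API this sketch needs.
type nexter interface {
	Next() ([]byte, error)
}

// collect drains c and returns an owned copy of every chunk.
func collect(c nexter) ([][]byte, error) {
	var out [][]byte
	for {
		chunk, err := c.Next()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		// Copy before the next call to Next invalidates chunk.
		out = append(out, append([]byte(nil), chunk...))
	}
}

// reusingSource hands out slices of a single shared buffer, so a caller
// that skips the copy would see earlier chunks overwritten.
type reusingSource struct {
	words []string
	buf   []byte
}

func (s *reusingSource) Next() ([]byte, error) {
	if len(s.words) == 0 {
		return nil, io.EOF
	}
	s.buf = append(s.buf[:0], s.words[0]...)
	s.words = s.words[1:]
	return s.buf, nil
}

func main() {
	chunks, err := collect(&reusingSource{words: []string{"alpha", "beta", "gamma"}})
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("%s\n", c)
	}
}
```

If chunks are consumed immediately (hashed, written out, or compared), the copy is unnecessary and the zero-allocation behavior is preserved.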

func NewChunker ¶

func NewChunker(r io.Reader, config Config) *Chunker

NewChunker creates a chunker that reads from r with the given config.

func (*Chunker) Next ¶

func (c *Chunker) Next() ([]byte, error)

Next returns the next chunk of data. The returned slice references internal buffers and is valid only until the next call to Next. Returns io.EOF when there is no more data.

func (*Chunker) Reset ¶

func (c *Chunker) Reset(r io.Reader)

Reset resets the chunker to process a new stream.

type Config ¶

type Config struct {
	AvgChunkSize int
	MinChunkSize int
	MaxChunkSize int
	Mask         uint32
	Threshold    uint32
}

Config holds chunking parameters derived from an average chunk size.

func DefaultConfig ¶

func DefaultConfig() Config

DefaultConfig returns the standard configuration with a 4 MiB average chunk size.

func NewConfig ¶

func NewConfig(avgChunkSize int) (Config, error)

NewConfig creates a Config from an average chunk size, which must be a power of two.
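Callers that take the average size from user input may want to validate or round it before calling NewConfig. A minimal sketch, assuming the caller prefers rounding up to getting an error; `isPowerOfTwo` and `roundUpPowerOfTwo` are hypothetical helpers and not part of this package.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isPowerOfTwo reports whether n is a positive power of two.
func isPowerOfTwo(n int) bool {
	return n > 0 && n&(n-1) == 0
}

// roundUpPowerOfTwo returns the smallest power of two that is >= n.
// Values below 1 are treated as 1.
func roundUpPowerOfTwo(n int) int {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len(uint(n-1))
}

func main() {
	for _, n := range []int{4096, 5000, 1 << 22} {
		fmt.Println(n, isPowerOfTwo(n), roundUpPowerOfTwo(n))
	}
}
```

The rounded value can then be passed as `avgChunkSize`; the error returned by NewConfig should still be checked.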

type Hasher ¶

type Hasher struct {
	// contains filtered or unexported fields
}

Hasher implements the buzhash rolling hash.

func NewHasher ¶

func NewHasher() *Hasher

NewHasher creates a new Hasher.

func (*Hasher) BytesProcessed ¶

func (h *Hasher) BytesProcessed() int

BytesProcessed returns the total number of bytes fed to the hasher.

func (*Hasher) InitFromData ¶

func (h *Hasher) InitFromData(data []byte)

InitFromData initializes the hash from the first WindowSize bytes of data. If data has fewer than WindowSize bytes, all bytes are consumed.

func (*Hasher) Reset ¶

func (h *Hasher) Reset()

Reset returns the hasher to its initial state.

func (*Hasher) Sum ¶

func (h *Hasher) Sum() uint32

Sum returns the current hash value.

func (*Hasher) Update ¶

func (h *Hasher) Update(in byte)

Update slides the window by one byte.

type Scanner ¶ added in v0.28.5

type Scanner struct {
	Config Config
	// contains filtered or unexported fields
}

Scanner scans a byte buffer for content-defined chunk boundaries using the same buzhash rolling hash as Chunker, but without owning a reader. It mirrors the Rust pbs-datastore `Chunker` trait / `ChunkerImpl` used by the payload ChunkStream. The caller controls buffering and feeds data slices via Scan, which returns the offset of the next boundary, or 0 when the provided data contains no boundary and more data should be supplied. State persists across calls until Reset.

func NewScanner ¶ added in v0.28.5

func NewScanner(config Config) *Scanner

NewScanner creates a Scanner for the given config.

func (*Scanner) Reset ¶ added in v0.28.5

func (s *Scanner) Reset()

Reset returns the scanner to its initial state, mirroring ChunkerImpl::reset (h=0, chunk_size=0, window_size=0). Called after a chunk boundary (natural or forced) so the next chunk starts hashing from scratch.

func (*Scanner) Scan ¶ added in v0.28.5

func (s *Scanner) Scan(data []byte) int

Scan scans data for a chunk boundary, mirroring ChunkerImpl::scan exactly. Returns 0 if no boundary was found within data (call again with more data), or a value > 0 indicating the position of the boundary relative to the start of data. State persists across calls so consecutive slices are treated as a contiguous stream until Reset is called.
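A caller-side loop for this contract might look like the sketch below. It depends only on the documented `Scan` and `Reset` signatures; the interface `boundaryScanner`, the type `splitter`, and the stand-in `zeroScanner` (which places a boundary after every zero byte) are hypothetical. Calling Reset after each boundary is an assumption based on the Reset description above.

```go
package main

import "fmt"

// boundaryScanner is the part of the documented Scanner API used here.
type boundaryScanner interface {
	Scan(data []byte) int
	Reset()
}

// splitter keeps the bytes that were scanned but not yet emitted.
type splitter struct {
	s       boundaryScanner
	pending []byte
}

// feed scans block and returns every chunk it completes.
func (sp *splitter) feed(block []byte) [][]byte {
	var chunks [][]byte
	for len(block) > 0 {
		n := sp.s.Scan(block)
		if n == 0 {
			// No boundary yet: the scanner has consumed these bytes, so
			// store them and wait for the next block.
			sp.pending = append(sp.pending, block...)
			break
		}
		// n is relative to the start of block, not to the chunk start.
		chunks = append(chunks, append(sp.pending, block[:n]...))
		sp.pending = nil
		sp.s.Reset()
		block = block[n:]
	}
	return chunks
}

// zeroScanner is a toy scanner: a boundary falls right after a zero byte.
type zeroScanner struct{}

func (zeroScanner) Scan(data []byte) int {
	for i, b := range data {
		if b == 0 {
			return i + 1
		}
	}
	return 0
}

func (zeroScanner) Reset() {}

func main() {
	sp := &splitter{s: zeroScanner{}}
	for _, block := range []string{"ab\x00cd", "e\x00f"} {
		for _, c := range sp.feed([]byte(block)) {
			fmt.Printf("%q\n", c)
		}
	}
	fmt.Printf("left over: %q\n", sp.pending)
}
```

At end of input the bytes left in `pending` form the final chunk, since no further boundary can arrive.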
