chunker

package module
Version: v0.4.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 20, 2020 License: BSD-2-Clause Imports: 8 Imported by: 41

README

GoDoc Build Status

The package chunker implements content-defined-chunking (CDC) based on a rolling Rabin Hash. The library is part of the restic backup program.

An introduction to Content Defined Chunking can be found in the restic blog post Foundation - Introducing Content Defined Chunking (CDC).

You can find the API documentation at https://godoc.org/github.com/restic/chunker

Documentation

Overview

Package chunker implements Content Defined Chunking (CDC) based on a rolling Rabin Checksum.

Choosing a Random Irreducible Polynomial

The function RandomPolynomial() returns a new random polynomial of degree 53 for use with the chunker. The degree 53 is chosen because it is the largest prime below 64-8 = 56, so that the top 8 bits of an uint64 can be used for optimising calculations in the chunker.

A random polynomial is chosen selecting 64 random bits, masking away bits 64..54 and setting bit 53 to one (otherwise the polynomial is not of the desired degree) and bit 0 to one (otherwise the polynomial is trivially reducible), so that 51 bits are chosen at random.

This process is repeated until Irreducible() returns true, then this polynomials is returned. If this doesn't happen after 1 million tries, the function returns an error. The probability for selecting an irreducible polynomial at random is about 7.5% ( (2^53-2)/53 / 2^51), so the probability that no irreducible polynomial has been found after 100 tries is lower than 0.04%.

Verifying Irreducible Polynomials

During development the results have been verified using the computational discrete algebra system GAP, which can be obtained from the website at http://www.gap-system.org/.

For filtering a given list of polynomials in hexadecimal coefficient notation, the following script can be used:

# create x over F_2 = GF(2)
x := Indeterminate(GF(2), "x");

# test if polynomial is irreducible, i.e. the number of factors is one
IrredPoly := function (poly)
	return (Length(Factors(poly)) = 1);
end;;

# create a polynomial in x from the hexadecimal representation of the
# coefficients
Hex2Poly := function (s)
	return ValuePol(CoefficientsQadic(IntHexString(s), 2), x);
end;;

# list of candidates, in hex
candidates := [ "3DA3358B4DC173" ];

# create real polynomials
L := List(candidates, Hex2Poly);

# filter and display the list of irreducible polynomials contained in L
Display(Filtered(L, x -> (IrredPoly(x))));

All irreducible polynomials from the list are written to the output.

Background Literature

An introduction to Rabin Fingerprints/Checksums can be found in the following articles:

Michael O. Rabin (1981): "Fingerprinting by Random Polynomials" http://www.xmailserver.org/rabin.pdf

Ross N. Williams (1993): "A Painless Guide to CRC Error Detection Algorithms" http://www.zlib.net/crc_v3.txt

Andrei Z. Broder (1993): "Some Applications of Rabin's Fingerprinting Method" http://www.xmailserver.org/rabin_apps.pdf

Shuhong Gao and Daniel Panario (1997): "Tests and Constructions of Irreducible Polynomials over Finite Fields" http://www.math.clemson.edu/~sgao/papers/GP97a.pdf

Andrew Kadatch, Bob Jenkins (2007): "Everything we know about CRC but afraid to forget" http://crcutil.googlecode.com/files/crc-doc.1.0.pdf

Index

Examples

Constants

View Source
const (

	// MinSize is the default minimal size of a chunk.
	MinSize = 512 * kiB
	// MaxSize is the default maximal size of a chunk.
	MaxSize = 8 * miB
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Chunk

type Chunk struct {
	Start  uint
	Length uint
	Cut    uint64
	Data   []byte
}

Chunk is one content-dependent chunk of bytes whose end was cut when the Rabin Fingerprint had the value stored in Cut.

type Chunker

type Chunker struct {
	// contains filtered or unexported fields
}

Chunker splits content with Rabin Fingerprints.

Example
// generate 32MiB of deterministic pseudo-random data
data := getRandom(23, 32*1024*1024)

// create a chunker
chunker := New(bytes.NewReader(data), Pol(0x3DA3358B4DC173))

// reuse this buffer
buf := make([]byte, 8*1024*1024)

for i := 0; i < 5; i++ {
	chunk, err := chunker.Next(buf)
	if err == io.EOF {
		break
	}

	if err != nil {
		panic(err)
	}

	fmt.Printf("%d %02x\n", chunk.Length, sha256.Sum256(chunk.Data))
}
Output:

2163460 4b94cb2cf293855ea43bf766731c74969b91aa6bf3c078719aabdd19860d590d
643703 5727a63c0964f365ab8ed2ccf604912f2ea7be29759a2b53ede4d6841e397407
1528956 a73759636a1e7a2758767791c69e81b69fb49236c6929e5d1b654e06e37674ba
1955808 c955fb059409b25f07e5ae09defbbc2aadf117c97a3724e06ad4abd2787e6824
2222372 6ba5e9f7e1b310722be3627716cf469be941f7f3e39a4c3bcefea492ec31ee56

func New

func New(rd io.Reader, pol Pol) *Chunker

New returns a new Chunker based on polynomial p that reads from rd.

func NewWithBoundaries added in v0.2.0

func NewWithBoundaries(rd io.Reader, pol Pol, min, max uint) *Chunker

NewWithBoundaries returns a new Chunker based on polynomial p that reads from rd and custom min and max size boundaries.

func (*Chunker) Next

func (c *Chunker) Next(data []byte) (Chunk, error)

Next returns the position and length of the next chunk of data. If an error occurs while reading, the error is returned. Afterwards, the state of the current chunk is undefined. When the last chunk has been returned, all subsequent calls yield an io.EOF error.

func (*Chunker) Reset

func (c *Chunker) Reset(rd io.Reader, pol Pol)

Reset reinitializes the chunker with a new reader and polynomial.

func (*Chunker) ResetWithBoundaries added in v0.2.0

func (c *Chunker) ResetWithBoundaries(rd io.Reader, pol Pol, min, max uint)

ResetWithBoundaries reinitializes the chunker with a new reader, polynomial and custom min and max size boundaries.

func (*Chunker) SetAverageBits added in v0.2.0

func (c *Chunker) SetAverageBits(averageBits int)

SetAverageBits allows to control the frequency of chunk discovery: the lower averageBits, the higher amount of chunks will be identified. The default value is 20 bits, so chunks will be of 1MiB size on average.

type Pol

type Pol uint64

Pol is a polynomial from F_2[X].

func DerivePolynomial

func DerivePolynomial(source io.Reader) (Pol, error)

DerivePolynomial returns an irreducible polynomial of degree 53 (largest prime number below 64-8) by reading bytes from source. There are (2^53-2/53) irreducible polynomials of degree 53 in F_2[X], c.f. Michael O. Rabin (1981): "Fingerprinting by Random Polynomials", page 4. If no polynomial could be found in one million tries, an error is returned.

func RandomPolynomial

func RandomPolynomial() (Pol, error)

RandomPolynomial returns a new random irreducible polynomial of degree 53 using the default System CSPRNG as source. It is equivalent to calling DerivePolynomial(rand.Reader).

func (Pol) Add

func (x Pol) Add(y Pol) Pol

Add returns x+y.

func (Pol) Deg

func (x Pol) Deg() int

Deg returns the degree of the polynomial x. If x is zero, -1 is returned.

func (Pol) Div

func (x Pol) Div(d Pol) Pol

Div returns the integer division result x / d.

func (Pol) DivMod

func (x Pol) DivMod(d Pol) (Pol, Pol)

DivMod returns x / d = q, and remainder r, see https://en.wikipedia.org/wiki/Division_algorithm

func (Pol) Expand

func (x Pol) Expand() string

Expand returns the string representation of the polynomial x.

func (Pol) GCD

func (x Pol) GCD(f Pol) Pol

GCD computes the Greatest Common Divisor x and f.

func (Pol) Irreducible

func (x Pol) Irreducible() bool

Irreducible returns true iff x is irreducible over F_2. This function uses Ben Or's reducibility test.

For details see "Tests and Constructions of Irreducible Polynomials over Finite Fields".

func (Pol) MarshalJSON

func (x Pol) MarshalJSON() ([]byte, error)

MarshalJSON returns the JSON representation of the Pol.

func (Pol) Mod

func (x Pol) Mod(d Pol) Pol

Mod returns the remainder of x / d

func (Pol) Mul

func (x Pol) Mul(y Pol) Pol

Mul returns x*y. When an overflow occurs, Mul panics.

func (Pol) MulMod

func (x Pol) MulMod(f, g Pol) Pol

MulMod computes x*f mod g

func (Pol) String

func (x Pol) String() string

String returns the coefficients in hex.

func (*Pol) UnmarshalJSON

func (x *Pol) UnmarshalJSON(data []byte) error

UnmarshalJSON parses a Pol from the JSON data.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL