fastcdc

package module
v0.2.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 5, 2020 License: Apache-2.0 Imports: 3 Imported by: 5

README

FastCDC-Go

Docs Build Status codecov Go Report Card

FastCDC-Go is a Go library implementing the FastCDC content-defined chunking algorithm.

Install:

go get -u github.com/jotfs/fastcdc-go

Example

import (
  "bytes"
  "fmt"
  "log"
  "math/rand"
  "io"

  "github.com/jotfs/fastcdc-go"
)

opts := fastcdc.Options{
  MinSize:     256 * 1024
  AverageSize: 1 * 1024 * 1024
  MaxSize:     4 * 1024 * 1024
}

data := make([]byte, 10 * 1024 * 1024)
rand.Read(data)
chunker, _ := fastcdc.NewChunker(bytes.NewReader(data), opts)

for {
  chunk, err := chunker.Next()
  if err == io.EOF {
    break
  }
  if err != nil {
    log.Fatal(err)
  }

  fmt.Printf("%x  %d\n", chunk.Data[:10], chunk.Length)
}

Command line tool

This package also includes a useful CLI for testing the chunking output. Install it by running:

go install ./cmd/fastcdc

Example:

# Outputs the position and size of each chunk to stdout 
fastcdc -csv -file random.txt

Performance

FastCDC-Go is fast. Chunking speed on an Intel i5 7200U is >1GiB/s. Compared to restic/chunker, another CDC library for Go, it's about 2.9 times faster.

Benchmark (code):

BenchmarkRestic-4     23384482467 ns/op	 448.41 MB/s	 8943320 B/op   15 allocs/op
BenchmarkFastCDC-4    8080957045 ns/op	1297.59 MB/s	16777336 B/op    3 allocs/op

Normalization

A key feature of FastCDC is chunk size normalization. Normalization helps to improve the distribution of chunk sizes, increasing the number of chunks close to the target average size and reducing the number of chunks clipped by the maximum chunk size, as compared to the Rabin-based chunking algorithm used in restic/chunker.

The histograms below show the chunk size distribution for fastcdc-go and restic/chunker on 1GiB of random data, each with average chunk size 1MiB, minimum chunk size 256 KiB and maximum chunk size 4MiB. The normalization level for fastcdc-go is set to 2.

Compared the restic/chunker, the distribution of fastcdc-go is less skewed (standard deviation 345KiB vs. 964KiB).

License

FastCDC-Go is licensed unser the Apache 2.0 License. See LICENSE for details.

References

  • Xia, Wen, et al. "Fastcdc: a fast and efficient content-defined chunking approach for data deduplication." 2016 USENIX Annual Technical Conference pdf

Documentation

Overview

Package fastcdc is a Go implementation of the FastCDC content defined chunking algorithm. See https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf for details.

Example (Basic)
data := make([]byte, 10*1024*1024)
rand.Seed(4542)
rand.Read(data)
rd := bytes.NewReader(data)

chunker, err := fastcdc.NewChunker(rd, fastcdc.Options{
	AverageSize: 1024 * 1024, // target 1 MiB average chunk size
})
if err != nil {
	log.Fatal(err)
}

fmt.Printf("%-32s  %s\n", "CHECKSUM", "CHUNK SIZE")

for {
	chunk, err := chunker.Next()
	if err == io.EOF {
		break
	}
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%x  %d\n", md5.Sum(chunk.Data), chunk.Length)
}
Output:

CHECKSUM                          CHUNK SIZE
d5bb40f862d68f4c3a2682e6d433f0d7  1788060
113a0aa2023d7dce6a2aac1f807b5bd2  1117240
5b9147b10d4fe6f96282da481ce848ca  1180487
dcc4644befb599fa644635b0c5a1ea2c  1655501
224db3de422ad0dd2c840e3e24e0cb03  363172
e071658eccda587789f1dabb6f773851  1227750
215868103f0b4ea7f715e179e5b9a6c7  1451026
21e65e40970ec22f5b13ddf60493b746  1150129
b8209a1dbef955ef64636af796450252  552395

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Chunk

type Chunk struct {
	// Offset is the number of bytes from the start of the reader to the beginning of
	// the chunk.
	Offset int

	// Length is the length of the chunk in bytes. Same as len(Data).
	Length int

	// Data is the chunk data.
	Data []byte

	// Fingerprint is the value of the rolling hash algorithm for the chunk data.
	Fingerprint uint64
}

Chunk stores a content-defined chunk returned by a Chunker.

type Chunker

type Chunker struct {
	// contains filtered or unexported fields
}

Chunker implements the FastCDC content defined chunking algorithm. See https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf.

func NewChunker

func NewChunker(rd io.Reader, opts Options) (*Chunker, error)

NewChunker returns a Chunker with the given Options.

func (*Chunker) Next

func (c *Chunker) Next() (Chunk, error)

Next returns the next Chunk from the reader or io.EOF after the last chunk has been read. The chunk data is invalidated when Next is called again.

type Options

type Options struct {
	// NormalSize is the target chunk size. Typically a power of 2. It must be in the
	// range 64B to 1GiB.
	AverageSize int

	// (Optional) MinSize is the minimum allowed chunk size. By default, it's set to
	// AverageSize / 4.
	MinSize int

	// (Optional) MaxSize is the maximum allowed chunk size. By default, it's set to
	// AverageSize * 4.
	MaxSize int

	// (Optional) Sets the chunk normalization level. It may be set to 1, 2 or 3,
	// unless DisableNormalization is set, in which case it's ignored. By default,
	// it's set to 2.
	Normalization int

	// (Optional) DisableNormalization turns normalization off. By default, it's set to
	// false.
	DisableNormalization bool

	// (Optional) Seed alters the lookup table of the rolling hash algorithm to mitigate
	// chunk-size based fingerprinting attacks. It may be set to a random uint64.
	Seed uint64

	// (Optional) BufSize is the size of the internal buffer used while chunking. It has
	// no effect on the chuking output, but performance is improved with larger buffers.
	// It must be at least MaxSize. Recommended values are 1 to 3 times MaxSize. By
	// default it is set to MaxSize * 2.
	BufSize int
}

Options configures the options for the Chunker.

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL