fastcdc

package module
v1.0.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 4, 2023 License: BSD-2-Clause Imports: 3 Imported by: 0

README

FastCDC

This is a Go implementation of the FastCDC algorithm for content-defined chunking. CDC is a technique used in data deduplication and data storage systems to break data into variable-sized chunks based on its content rather than fixed block sizes. This approach aims to improve the efficiency of deduplication.

Usage

go get -u codeberg.org/mhofmann/fastcdc

Evaluation

The implementation can be used with the same parameters as in the FastCDC paper as well as user-provided values for minimum, average and maximum sizes of chunks. For comparison, the following table shows statistics about the number and size of chunks generated by chunking sets of test files with different parameters. The numbers in the chunker names refer to the parameters used. For example "2k-8k-64k" is a chunker with 2KB minSize 8KB avgSize and 64k maxSize. The test corpus had a total uncompressed size of 8182081670 bytes (~7.6GB) and consisted of technical manuals and drawings in PDF format and tarballs containing the source code of 5 different versions of the Linux kernel.

Results
Chunker Num. of Chunks Avg. chunk size Deduplicated size Deduplication ratio
reference 480942 9831 4727992736 1.73
2k-16k-64k 271140 19136 5188457451 1.58
2k-32k-64k 150041 37254 5589600419 1.46
2k-64k-128k 80946 73123 5919028334 1.38
4k-8k-64k 471195 10107 4762223233 1.72
4k-16k-64k 266596 19503 5199438463 1.57
4k-32k-64k 148487 37669 5593355619 1.46
4k-64k-128k 80332 73701 5920577574 1.38

In terms of pure deduplication performance, the reference parameters (2k-8k-64k) yielded the best result on the test dataset. For storage systems where chunks are stored compressed, El-Shimi et al. suggest that using larger chunk sizes for CDC can improve the performance of the compression algorithm and thereby reduce the effective storage size. If and how far this applies to the FastCDC algorithm remains to be tested in the future.

License

BSD-2-Clause. See LICENSE for details.

Documentation

Overview

Package fastcdc implements the FastCDC algorithm for content-defined chunking.

Example
package main

import (
	"fmt"
	"os"

	"codeberg.org/mhofmann/fastcdc"
)

func main() {
	f, err := os.Open("testdata/testfile.bin")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var nChunks, nBytes int

	chunker := fastcdc.NewRefChunker(f)

	for chunker.Next() {
		chunk := chunker.Chunk()
		nBytes += len(chunk)
		nChunks++
	}

	if err = chunker.Err(); err != nil {
		panic(err)
	}

	fmt.Printf("read %d chunks with %d bytes total\n", nChunks, nBytes)

}
Output:

read 14 chunks with 130873 bytes total

Index

Examples

Constants

This section is empty.

Variables

View Source
var ErrParam = errors.New("invalid parameter")

ErrParam is returned by NewChunker when one or more of the size parameters are invalid.

Functions

This section is empty.

Types

type Chunker

type Chunker struct {
	// contains filtered or unexported fields
}

Chunker wraps an io.Reader and returns data in chunks of bytes. Successive calls to Next will split the input data into variable sized chunks according to the FastCDC algorithm. The Chunk method returns the current chunk. The process stops at EOF or the first I/O error. Use the Err method to see if a non-EOF error occurred.

func NewChunker

func NewChunker(r io.Reader, minSize, avgSize, maxSize int) (*Chunker, error)

NewChunker returns a new chunker that wraps r and uses the given parameters for FastCDC. minSize is the minimum size, in bytes, of chunks, that the Chunker should return. Likewise, avgSize and maxSize set the average (expected) and maximum size of chunks in bytes. avgSize must be a power of two and the parameters must be ordered so that 1 <= minSize <= avgSize <= maxSize.

func NewRefChunker

func NewRefChunker(r io.Reader) *Chunker

NewRefChunker returns a new chunker that wraps r and uses the reference parameters for FastCDC. The minimal, maximal and average sizes of chunks returned by this chunker are 2KB, 64KB and 8KB respectively.

func (*Chunker) Chunk

func (c *Chunker) Chunk() []byte

Chunk returns the most recent chunk of data generated by a call to Next. The underlying array points to data that might be overwritten by subsequent calls to Next.

func (*Chunker) Err

func (c *Chunker) Err() error

Err returns the first non-EOF error that was encountered by the Chunker.

func (*Chunker) Next

func (c *Chunker) Next() bool

Next retrieves the next chunk of data, which will then be available through the Chunk method. It returns false when the chunking process stops, either by reaching the end of the input stream or an error. After Next returns false, the Err method will return any error that occurred during scanning, except that if it was io.EOF, Err will return nil.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL