dejavu

package module
v1.0.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 14, 2017 License: MIT Imports: 7 Imported by: 2

README

Build Status Issues Go Report Card License GoDoc Donate PayPal Donate Bitcoin Avaivable For Hire

Déjà vu

Quickly detect already witnessed data, ideal for deduplication.

Limited memory of witnessed data, oldest are forgotten. Library is thread safe. Offers deterministic and probabilistic (over an order of magnitude less memory consuming) implementation. The probabilistic implementation uses bloom filters, meaning false positives are possible but not false negatives.

Command line usage

$ dejavu -h
Usage: dejavu [OPTION]... [FILE]...

Concatenate FILE(s) and filter or output duplicate lines.

With no FILE, or when FILE is -, read standard input.

Options:
  -D	use deterministic mode instead of probabilistic
	WARNING requires order of magnitude more memory
  -d	output only duplicates instead of filtering
  -f float
    	chance of false positive, between 0.0 and 1.0
	only for probabilistic mode (default 1e-06)
  -l uint
    	limit after which entries are forgotton (default 1000000)
  -o string
    	output file, defaults to stdout
  -v	output version information and exit

Examples:
  dejavu
	default probabilistic deduplication from stdin to std out with
	1mil entry limit and 1/1mil chance of false positive (~8M mem usage)
  dejavu -o s f - g
	deduplicat f, then stdin, then g, to output s
  dejavu -l 10000000 -fp 0.000000001
	probabilistic deduplication with 10mil entry limit
	and 1/1bil chance of false positive (~70M mem usage)
  dejavu -d -D -l 65536
	output duplicates and avoid false positives with deterministic mode
	lower entry limit to avoid excessive memory usage

Implementation:
  Efficient probabilistic and deterministic duplicate detection with O(1) 
  detection time and O(n) memory usage in relation to entry limit. Default
  probabilistic implementation uses bloom filters, meaning false
  positives are possible but not false negatives.

Author: Fabian Barkhau <f483@protonmail.com>
Project: https://github.com/f483/dejavu
License: MIT https://raw.githubusercontent.com/f483/dejavu/master/LICENSE

Library usage (golang)

Probabilistic example

package main

import (
	"fmt"
	"github.com/f483/dejavu"
)

func main() {

	// probably remembers last 65536 with 0.000001 chance of false positive
	p := dejavu.NewProbabilistic(65536, 0.000001)

	fmt.Println(p.Witness([]byte("bar"))) // entry added
	fmt.Println(p.Witness([]byte("bar"))) // probably remembers entry
}

Deterministic example

package main

import (
	"fmt"
	"github.com/f483/dejavu"
)

func main() {

	// always remembers last 1024 entries
	d := dejavu.NewDeterministic(1024)

	fmt.Println(d.Witness([]byte("foo"))) // entry added
	fmt.Println(d.Witness([]byte("foo"))) // remembers entry
}

Performance

Linear memory usage: O(n)

Probabilistic

0.000001 chance of false positive.

Benchmark Memory

Deterministic

Benchmark Memory

Constant witness time: O(1)

Probabilistic

0.000001 chance of false positive.

Benchmark Time

Deterministic

Benchmark Time

Documentation

Overview

Package dejavu offers quick detection of already witnessed data.

Limited memory of witnessed data, oldest are forgotten. Library is thread safe. Offers deterministic and probabilistic (over an order of magnatude less memory consuming) implementation.

Index

Constants

View Source
const Version string = "1.0.0"

Version information

Variables

This section is empty.

Functions

func Process

func Process(d DejaVu, filter bool, out io.Writer, inputs ...io.Reader)

Process given inputs as text to output with dejavu instance. If filter is true duplicates are filtered, otherwise only duplicates sent to output.

func ProcessPaths

func ProcessPaths(d DejaVu, filter bool, out string, inputs ...string) error

ProcessPaths is equivalent to Process, only that file paths are given. If - in inputs to use stdin and empty out to use stdout.

Types

type DejaVu

type DejaVu interface {

	// Witness data and add to memory. Returns true if previously seen.
	Witness(data []byte) bool

	// WitnessDigest is equivalent to the Winness method but bypasses
	// hashing the data. Use this to improve performance if you already
	// happen to have the sha256 digest.
	WitnessDigest(digest [sha256.Size]byte) bool
}

DejaVu witnesses data and recalls if seen before.

func New

func New(probabilistic bool, limit uint32, fpRatio float64) DejaVu

New creates a probabilistic or deterministic DejaVu memory with given entrie limit and false positive ratio (only used for probabilistic).

func NewDeterministic

func NewDeterministic(limit uint32) DejaVu

NewDeterministic creates a deterministic DejaVu memory. Will remember most recent entries within given entrie limit and forget older entries.

func NewProbabilistic

func NewProbabilistic(limit uint32, falsePositiveRatio float64) DejaVu

NewProbabilistic creates a probabilistic DejaVu memory. Probably remembers most recent entries within given entrie limit and false positive ratio. False positive ratio should be between 0.0 and 1.0.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL