bbhash

package module
v0.1.0
Published: Jun 29, 2019 License: GPL-2.0 Imports: 21 Imported by: 1

README


go-bbhash - Fast, Scalable Minimal Perfect Hash for Large Sets

What is it?

A library to create, query and serialize/de-serialize minimal perfect hash functions over very large key sets.

This is an implementation of the BBHash paper (https://arxiv.org/abs/1702.03154). It is in part inspired by Damian Gryski's Boomphf; this implementation differs from Boomphf in one significant way: it adds an efficient serialization & deserialization API.

The library exposes the following types:

  • BBHash: Represents an instance of a minimal perfect hash function as described in the paper above.
  • DBWriter: Used to construct a constant database of key-value pairs, where the lookup of a given key is done in constant time using BBHash. Essentially, this type serializes a collection of key-value pairs using BBHash as the underlying index.
  • DBReader: Used for looking up key-values from a previously constructed (serialized) database.

NOTE Minimal perfect hash functions take a fixed input set and generate a mapping that allows looking up those items in constant time. In particular, they are NOT a replacement for a traditional hash table; querying with a key that was not present during construction may yield a false positive. In concrete terms:

Let S = {k0, k1, ..., kn} be your input key set.

If H: S -> {0, ..., n} is a minimal perfect hash function, then H(kx) for some kx NOT in S may still yield an integer result, falsely indicating that kx was "looked up" successfully.

Thus, if users of BBHash are unsure whether a key passed to such a Lookup() function was part of the original input, they should verify the result with an additional comparison against the actual key, as sketched below. Look at dbreader.go:Find() for an example.
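
A minimal sketch of such a guarded lookup follows. The reverse index byIndex is hypothetical bookkeeping maintained by the caller; it is not part of this library:

    // Build a reverse index once, right after constructing 'bb'
    // from the original 'keys' ([]uint64). Find() is 1-based.
    byIndex := make([]uint64, len(keys))
    for _, k := range keys {
        byIndex[bb.Find(k)-1] = k
    }

    // Later: a guarded lookup for an arbitrary 'candidate' key.
    idx := bb.Find(candidate)
    if idx < 1 || idx > uint64(len(byIndex)) || byIndex[idx-1] != candidate {
        // candidate was not part of the original key set
    }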

How do I use it?

Like any other Go library: go get github.com/opencoff/go-bbhash.

Example Program

There is a working example of the DBWriter and DBReader interfaces in the file example/mphdb.go. This example demonstrates the following functionality:

  • Add one or more space-delimited key/value files (first field is the key, second field is the value)
  • Add one or more CSV files (first field is the key, second field is the value)
  • Write the resulting MPH DB to disk
  • Read the DB back and verify its integrity

First, let's run the tests to make sure bbhash works correctly:


  $ git clone https://github.com/opencoff/go-bbhash
  $ cd go-bbhash
  $ make test

Now, let's build and run the example program:


  $ make
  $ ./mphdb -h

There is a helper Python script to generate a very large text file of hostnames and IP addresses: genhosts.py. You can run it like so:


  $ python ./example/genhosts.py 192.168.0.0/16 > a.txt

The above example generates 65535 hostnames and corresponding IP addresses; each IP address is drawn sequentially from the subnet.

NOTE If you use a "/8" subnet mask you will generate a lot of data (~430MB in size).

Once you have the input generated, you can feed it to the example program above to generate an MPH DB:


  $ ./mphdb foo.db a.txt
  $ ./mphdb -V foo.db

It is possible that "mphdb" fails to construct a DB and complains that gamma is too small. In that case, try increasing "g" like so:

  $ ./mphdb -g 2.75 foo.db a.txt

Basic Usage of BBHash

Assuming you have read your keys and hashed each one into a uint64, this is how you can use the library:


        bb, err := bbhash.New(2.0, keys)
        if err != nil { panic(err) }

        // Now, call Find() with each key to get its unique mapping.
        // Note: Find() returns values in the closed interval [1, len(keys)].
        for i, k := range keys {
                j := bb.Find(k)
                fmt.Printf("%d: %#x maps to %d\n", i, k, j)
        }

Writing a DB Once, Looking It Up Many Times

One can construct an on-disk, constant-time lookup DB using BBHash as the underlying indexing mechanism. Such a DB is useful in situations where the key/value pairs are NOT changed frequently, i.e., read-dominant workloads. The typical pattern is to build the constant DB once for efficient retrieval, then do lookups many times.

Step-1: Construct the DB from multiple sources

For example, let us suppose that files a.txt and b.csv contain many key/value pairs. We will build a constant DB from them.


    wr, err := bbhash.NewDBWriter("file.db")
    if err != nil { panic(err) }

    // add a.txt and b.csv to this db

    // txt file delimited by white space;
    // first token is the key, second token is the value
    n, err := wr.AddTextFile("a.txt", " \t")
    if err != nil { panic(err) }
    fmt.Printf("a.txt: %d records added\n", n)

    // CSV file - comma delimited
    // lines starting with '#' are considered comments
    // field 0 is the key; and field 1 is the value.
    // The first line is assumed to be a header and ignored.
    n, err = wr.AddCSVFile("b.csv", ',', '#', 0, 1)
    if err != nil { panic(err) }
    fmt.Printf("b.csv: %d records added\n", n)

    // Now, freeze the DB and write to disk.
    // We will use a larger "gamma" value to increase chances of
    // finding a minimal perfect hash function.
    err = wr.Freeze(3.0)
    if err != nil { panic(err) }

Now, file.db has the key/value pairs from the two input files stored in an efficient format for constant-time retrieval.

Step-2: Looking up Keys in the DB

Continuing the above example, suppose that you want to use the constructed DB for repeated lookups of various keys and retrieve their corresponding values:


    // read 'file.db' and cache up to 10,000
    // records in memory.
    rd, err := bbhash.NewDBReader("file.db", 10000)
    if err != nil { panic(err) }

Now, given a key k, we can use rd to look up the corresponding value:


    val, err := rd.Find(k)

    if err != nil {
        if err == bbhash.ErrNoKey {
            fmt.Printf("Key %x is not in the DB\n", k)
        } else {
            fmt.Printf("Error: %s\n", err)
        }
        return
    }

    fmt.Printf("Key %x => Value %x\n", k, val)

Implementation Notes

  • For constructing the BBHash, keys are uint64; the DBWriter implementation uses Zi Long Tan's superfast hash function to transform arbitrary bytes into uint64 (see the sketch after this list).

  • The perfect-hash index for each key is "1" based (i.e., it is in the closed interval [1, len(keys)]).
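
If you construct a BBHash directly (rather than through DBWriter), you must reduce arbitrary byte-string keys to uint64 yourself. Below is a sketch using the standard library's FNV-1a as a stand-in; this is not the hash DBWriter uses internally, and any well-distributed 64-bit hash will do:

    import "hash/fnv"

    // hashKey reduces an arbitrary byte key to a uint64
    // suitable for passing to bbhash.New().
    func hashKey(b []byte) uint64 {
        h := fnv.New64a()
        h.Write(b)
        return h.Sum64()
    }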

License

GPL v2.0

Documentation

Overview

Package bbhash implements BBHash - a new algorithm for creating fast, minimal perfect hash functions as described in: https://arxiv.org/abs/1702.03154. This implementation builds the perfect hash table concurrently if the number of keys is larger than 'MinParallelKeys'. Additionally, BBHash instances can be marshaled to and unmarshaled from byte buffers. This package also implements a constant (read-only) database built on top of BBHash. The DB supports constant-time lookups of arbitrary keys from disk.
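
For instance, a sketch of a marshal/unmarshal round trip through an in-memory buffer, assuming 'bb' is a previously constructed *BBHash and the bytes package is imported:

    var buf bytes.Buffer

    // Serialize the constructed MPH to the buffer ...
    if err := bb.MarshalBinary(&buf); err != nil {
        panic(err)
    }

    // ... and reconstruct an equivalent instance from it.
    bb2, err := bbhash.UnmarshalBBHash(&buf)
    if err != nil {
        panic(err)
    }
    _ = bb2 // bb2 now answers Find() identically to bb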


Constants

const Gamma float64 = 2.0

Gamma is an expansion factor for each of the bitvectors we build. Empirically, 2.0 is found to be a good balance between speed and space usage. See the paper for more details.

const MaxLevel uint = 200

MaxLevel is the maximum number of attempts (levels) at building a perfect hash function. Per the paper, each successive level exponentially reduces the probability of collision.

const MinParallelKeys int = 20000

MinParallelKeys is the minimum number of keys before we use a concurrent construction algorithm.

Variables

var ErrFrozen = errors.New("DB already frozen")

ErrFrozen is returned when attempting to add new records to an already frozen DB. It is also returned when trying to freeze a DB that is already frozen.

var ErrMPHFail = errors.New("failed to build MPH; gamma possibly small")

ErrMPHFail is returned when the gamma value provided to Freeze() is too small to build a minimal perfect hash table.

var ErrNoKey = errors.New("No such key")

ErrNoKey is returned when a key cannot be found in the DB.

Functions

This section is empty.

Types

type BBHash

type BBHash struct {
	// contains filtered or unexported fields
}

BBHash represents a computed minimal perfect hash for a given set of keys.

func New

func New(g float64, keys []uint64) (*BBHash, error)

New creates a new minimal hash function to represent the keys in 'keys'. This constructor selects a faster concurrent algorithm if the number of keys is greater than 'MinParallelKeys'. Once construction is complete, callers can use Find() to obtain the unique mapping for each key in 'keys'.

func NewConcurrent

func NewConcurrent(g float64, keys []uint64) (*BBHash, error)

NewConcurrent creates a new minimal hash function to represent the keys in 'keys'. This gives callers explicit control over when to use a concurrent algorithm vs. serial.

func NewSerial

func NewSerial(g float64, keys []uint64) (*BBHash, error)

NewSerial creates a new minimal hash function to represent the keys in 'keys'. This constructor explicitly uses a single-threaded (non-concurrent) construction.
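
For instance, to pin the construction strategy instead of letting New() choose based on MinParallelKeys (a sketch; 'keys' is your []uint64 key set):

    // Force a single-threaded build with the package's
    // default expansion factor ...
    bb, err := bbhash.NewSerial(bbhash.Gamma, keys)
    if err != nil { panic(err) }

    // ... or always use the concurrent builder.
    bb, err = bbhash.NewConcurrent(bbhash.Gamma, keys)
    if err != nil { panic(err) }
    _ = bb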

func UnmarshalBBHash

func UnmarshalBBHash(r io.Reader) (*BBHash, error)

UnmarshalBBHash reads a previously marshaled binary stream from 'r' and recreates the in-memory instance of BBHash.

func (*BBHash) Find

func (bb *BBHash) Find(k uint64) uint64

Find returns a unique integer representing the minimal hash for key 'k'. The return value is meaningful ONLY for keys in the original key set (provided at the time of construction of the minimal hash). If the key is in the original key set, the return value lies in the closed interval [1, len(keys)].

func (*BBHash) MarshalBinary

func (bb *BBHash) MarshalBinary(w io.Writer) error

MarshalBinary encodes the hash into a binary form suitable for durable storage. A subsequent call to UnmarshalBBHash() will reconstruct the BBHash instance.

func (*BBHash) MarshalBinarySize

func (bb *BBHash) MarshalBinarySize() uint64

MarshalBinarySize returns the size of the marshaled bbhash (in bytes).

func (BBHash) String

func (bb BBHash) String() string

String implements the fmt.Stringer interface for BBHash.

type DBReader

type DBReader struct {
	// contains filtered or unexported fields
}

DBReader represents the query interface for a previously constructed constant database (built using NewDBWriter()). The only meaningful operations on such a database are lookups via Find() or Lookup().

func NewDBReader

func NewDBReader(fn string, cache int) (rd *DBReader, err error)

NewDBReader reads a previously constructed database from file 'fn' and prepares it for querying. Records are opportunistically cached after reading from disk. We retain up to 'cache' records in memory (default 128).

func (*DBReader) Close

func (rd *DBReader) Close()

Close closes the DB.

func (*DBReader) Find

func (rd *DBReader) Find(key []byte) ([]byte, error)

Find looks up 'key' in the table and returns the corresponding value. It returns an error if the key is not found, the disk I/O fails, or the record checksum does not match.

func (*DBReader) Lookup

func (rd *DBReader) Lookup(key []byte) ([]byte, bool)

Lookup looks up 'key' in the table and returns the corresponding value. If the key is not found, the returned value is nil and the boolean is false.
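
A sketch of the comma-ok form, which avoids error handling when only presence matters (assuming 'rd' is an open *DBReader):

    if val, ok := rd.Lookup(key); ok {
        fmt.Printf("Key %x => Value %x\n", key, val)
    }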

func (*DBReader) TotalKeys

func (rd *DBReader) TotalKeys() int

TotalKeys returns the total number of distinct keys in the DB.

type DBWriter

type DBWriter struct {
	// contains filtered or unexported fields
}

DBWriter represents an abstraction to construct a read-only constant database. This database uses BBHash as the underlying mechanism for constant-time lookups of keys; keys and values are represented as arbitrary byte sequences ([]byte). The DB meta-data is protected by a strong checksum (SHA512-256), and each key/value record is protected by a distinct siphash-2-4. Records can be added to the DB via plain delimited text files or CSV files. Once all key/value additions are complete, the DB is written to disk via the Freeze() function.

The DB has the following general structure (a struct sketch of the header follows the list):

  • 64-byte file header:
      • magic [4]byte: "BBHH"
      • flags uint32: for now, all zeros
      • salt uint64: random salt for hash functions
      • nkeys uint64: number of keys in the DB
      • offtbl uint64: file offset where the key/val offsets start
  • Contiguous series of records; each record is a key/value pair:
      • keylen uint16: length of the key
      • vallen uint32: length of the value
      • cksum uint64: siphash checksum of key, value, offset
      • key []byte: keylen bytes of key
      • val []byte: vallen bytes of value
  • Possibly a gap until the next PageSize boundary (4096 bytes)
  • Offset table: nkeys worth of file offsets. Entry 'i' is the perfect hash index for some key 'k', and offset[i] is the offset in the DB where the key and value can be found.
  • Marshaled BBHash bytes (BBHash:MarshalBinary())
  • 32 bytes of strong checksum (SHA512-256); this checksum is computed over the file header, offset table, and marshaled bbhash.
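
Illustratively, the fixed 64-byte header corresponds to a struct like the following sketch. This is for exposition only and is not the package's actual (unexported) type; the trailing padding brings the five fields up to 64 bytes:

    type header struct {
        Magic  [4]byte  // "BBHH"
        Flags  uint32   // for now, all zeros
        Salt   uint64   // random salt for hash functions
        Nkeys  uint64   // number of keys in the DB
        Offtbl uint64   // file offset where the key/val offsets start
        _      [32]byte // pad to the fixed 64-byte header size
    }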

func NewDBWriter

func NewDBWriter(fn string) (*DBWriter, error)

NewDBWriter prepares file 'fn' to hold a constant DB built using the BBHash minimal perfect hash function. Once written, the DB is "frozen", and readers open it using NewDBReader() to do constant-time key-to-value lookups.

func (*DBWriter) Abort

func (w *DBWriter) Abort()

Abort stops the construction of the perfect hash DB.

func (*DBWriter) AddCSVFile

func (w *DBWriter) AddCSVFile(fn string, comma, comment rune, kwfield, valfield int) (uint64, error)

AddCSVFile adds contents from CSV file 'fn'. If 'kwfield' and 'valfield' are non-negative, they indicate the field numbers of the key and value respectively; they default to 0 and 1. If 'comma' is 0, the CSV delimiter defaults to ','. If 'comment' is not 0, lines beginning with that rune are discarded. Records where the 'kwfield' or 'valfield' can't be evaluated are discarded. Returns the number of records added.

func (*DBWriter) AddCSVStream

func (w *DBWriter) AddCSVStream(fd io.Reader, comma, comment rune, kwfield, valfield int) (uint64, error)

AddCSVStream adds contents from the CSV stream 'fd'. If 'kwfield' and 'valfield' are non-negative, they indicate the field numbers of the key and value respectively; they default to 0 and 1. If 'comma' is 0, the CSV delimiter defaults to ','. If 'comment' is not 0, lines beginning with that rune are discarded. Records where the 'kwfield' or 'valfield' can't be evaluated are discarded. Returns the number of records added.

func (*DBWriter) AddKeyVals

func (w *DBWriter) AddKeyVals(keys [][]byte, vals [][]byte) (uint64, error)

AddKeyVals adds a series of matched key-value pairs to the DB. If the slices are of unequal length, only the smaller length is used. Records with duplicate keys are discarded. Returns the number of records added.
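
In addition to file-based input, in-memory pairs can be added directly. A sketch, assuming 'wr' is an unfrozen *DBWriter:

    keys := [][]byte{[]byte("alpha"), []byte("beta")}
    vals := [][]byte{[]byte("one"), []byte("two")}

    n, err := wr.AddKeyVals(keys, vals)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%d records added\n", n)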

func (*DBWriter) AddTextFile

func (w *DBWriter) AddTextFile(fn string, delim string) (uint64, error)

AddTextFile adds contents from text file 'fn' where key and value are separated by one of the characters in 'delim'. Duplicate keys, empty lines, and lines with no value are skipped. This function just opens the file and calls AddTextStream(). Returns the number of records added.

func (*DBWriter) AddTextStream

func (w *DBWriter) AddTextStream(fd io.Reader, delim string) (uint64, error)

AddTextStream adds contents from text stream 'fd' where key and value are separated by one of the characters in 'delim'. Duplicate keys, empty lines, and lines with no value are skipped. Returns the number of records added.

func (*DBWriter) Freeze

func (w *DBWriter) Freeze(g float64) error

Freeze builds the minimal perfect hash, writes the DB and closes it. For very large key spaces, a higher 'g' value is recommended (2.5~4.0); otherwise, the Freeze() function will fail to generate an MPH.

func (*DBWriter) TotalKeys

func (w *DBWriter) TotalKeys() int

TotalKeys returns the total number of distinct keys in the DB.
