chd

package module
v0.3.1
Published: Jan 28, 2022 License: GPL-2.0 Imports: 14 Imported by: 0

README


go-chd - Minimal Perfect Hash Function using Compress Hash Displace

What is it?

A library to create, query and serialize/de-serialize minimal perfect hash function ("MPHF").

This is an implementation of CHD - inspired by this gist.

The library exposes the following types:

  • ChdBuilder: Represents the construction phase of the MPHF, as described in the paper above.
  • Chd: Represents a frozen MPHF over a given set of keys. You can only do lookups on this type.
  • DBWriter: Used to construct a constant database of key-value pairs - where the lookup of a given key is done in constant time using ChdBuilder. Essentially, this type serializes a collection of key-value pairs using ChdBuilder as the underlying index.
  • DBReader: Used for looking up key-values from a previously constructed (serialized) database.

NOTE Minimal Perfect Hash functions take a fixed input set and generate a mapping to look up those items in constant time. In particular, they are NOT a replacement for a traditional hash table; i.e., a lookup may yield false positives when queried with keys not present during construction. In concrete terms:

Let S = {k0, k1, ... kn} be your input key set.

If H: S -> {0, ..., n} is a minimal perfect hash function, then H(kx) for kx NOT in S may still yield a valid-looking integer result (falsely indicating that kx was successfully "looked up").

Thus, if users of Chd are unsure of the input being passed to such a Lookup() function, they should additionally compare the result against the actual key to verify the match. See dbreader.go:Find() for an example.
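The guard can be illustrated with a toy stand-in for the MPHF (k % 3 happens to be a perfect hash for the specific set {9, 10, 11}); this is purely illustrative, but the real Chd.Find() exhibits the same "always answers" behavior for unknown keys:

```go
package main

import "fmt"

// find is a toy stand-in for Chd.Find: a perfect hash for the
// specific key set {9, 10, 11}. Like a real MPHF, it returns an
// index for ANY input, including keys never added.
func find(k uint64) uint64 { return k % 3 }

func main() {
	keys := []uint64{9, 10, 11} // construction-time key set

	for _, k := range []uint64{10, 14} { // 14 was never added
		j := find(k)
		// The guard: compare the stored key at the returned index
		// against the queried key.
		if keys[j] == k {
			fmt.Printf("%d -> index %d (verified)\n", k, j)
		} else {
			fmt.Printf("%d -> index %d (false positive)\n", k, j)
		}
	}
}
```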

DBWriter optimizes the database if there are no values present - i.e., keys-only. This optimization significantly reduces the file-size.

How do I use it?

Like any other Go library: go get github.com/opencoff/go-chd.

Example Program

There is a working example of the DBWriter and DBReader interfaces in the file example/mphdb.go. This example demonstrates the following functionality:

  • Add one or more space-delimited key/value files (first field is the key, second field is the value)
  • Add one or more CSV files (first field is the key, second field is the value)
  • Write the resulting MPH DB to disk
  • Read the DB back and verify its integrity

First, let's run some tests and make sure chd is working:


  $ git clone https://github.com/opencoff/go-chd
  $ cd go-chd
  $ make test

Now, let's build and run the example program:


  $ make
  $ ./mphdb -h

There is a helper python script to generate a very large text file of hostnames and IP addresses: genhosts.py. You can run it like so:


  $ python ./example/genhosts.py 192.168.0.0/16 > a.txt

The above example generates 65535 hostnames and corresponding IP addresses; each of the IP addresses is sequentially drawn from the given subnet.

NOTE If you use a "/8" subnet mask you will generate a lot of data (~430MB in size).

Once you have the input generated, you can feed it to the example program above to generate a MPH DB:


  $ ./mphdb foo.db a.txt
  $ ./mphdb -V foo.db

It is possible that "mphdb" fails to construct a DB even after 1,000,000 attempts. In that case, try lowering the "load" factor (default is 0.85):

  $ ./mphdb -l 0.75 foo.db a.txt

Basic Usage of ChdBuilder

Assuming you have read your keys and hashed them into uint64 values, this is how you can use the library:


        builder, err := chd.New()
        if err != nil {
            panic(err)
        }

        for i := range keys {
            builder.Add(keys[i])
        }

        // Freeze takes the load factor; suggested values are 0.75-0.9.
        lookup, err := builder.Freeze(0.9)
        if err != nil {
            panic(err)
        }

        // Now, call Find() with each key to get its unique mapping.
        // Note: Find() returns values in the closed interval [1, len(keys)]
        for i, k := range keys {
                j := lookup.Find(k)
                fmt.Printf("%d: %#x maps to %d\n", i, k, j)
        }

Writing a DB Once, but lookup many times

One can construct an on-disk constant-time lookup using ChdBuilder as the underlying indexing mechanism. Such a DB is useful in situations where the key/value pairs are NOT changed frequently; i.e., read-dominant workloads. The typical pattern in such situations is to build the constant-DB once for efficient retrieval and do lookups multiple times.

The example program in example/ has helper routines to add from a text or CSV delimited file: see example/text.go.

Implementation Notes

  • chd.go: The main implementation of the CHD algorithm. It has two types: one to construct and freeze a MPHF (ChdBuilder) and another to do constant time lookups from a frozen CHD MPHF (Chd).

  • dbwriter.go: Create a read-only, constant-time MPH lookup DB. It can store arbitrary byte stream "values" - each of which is identified by a unique uint64 key. The DB structure is optimized for reading on the most common architectures - little-endian: amd64, arm64 etc.

  • dbreader.go: Provides a constant-time lookup of a previously constructed CHD MPH DB. DB reads use mmap(2) to reduce I/O bottlenecks. For little-endian architectures, there is no data "parsing" of the lookup tables, offset tables etc. They are interpreted in-situ from the mmap'd data. To keep the code generic, every multi-byte int is converted to little-endian order before use. These conversion routines are in endian_XX.go.

  • mmap.go: Utility functions to map byte-slices to uintXX slices and vice versa.

  • marshal.go: Marshal/Unmarshal the CHD MPH.

License

GPL v2.0

Documentation

Overview

Package chd implements ChdBuilder - to create fast, minimal perfect hash functions from a given set of keys. This is an implementation of the CHD algorithm described in http://cmph.sourceforge.net/papers/esa09.pdf.

Additionally, DBWriter enables creating a fast, constant-time DB for read-only workloads. It serializes the key,value pairs and builds a CHD minimal perfect hash function over the given keys. The serialized DB can be read back via DBReader for constant time lookups of the MPH DB.

Index

Constants

This section is empty.

Variables

var (
	// ErrMPHFail is returned when the gamma value provided to Freeze() is too small to
	// build a minimal perfect hash table.
	ErrMPHFail = errors.New("failed to build MPH")

	// ErrFrozen is returned when attempting to add new records to an already frozen DB
	// It is also returned when trying to freeze a DB that's already frozen.
	ErrFrozen = errors.New("DB already frozen")

	// ErrValueTooLarge is returned if the value-length is larger than 2^32-1 bytes
	ErrValueTooLarge = errors.New("value is larger than 2^32-1 bytes")

	// ErrExists is returned if a duplicate key is added to the DB
	ErrExists = errors.New("key exists in DB")

	// ErrNoKey is returned when a key cannot be found in the DB
	ErrNoKey = errors.New("No such key")
)

Functions

This section is empty.

Types

type Chd

type Chd struct {
	// contains filtered or unexported fields
}

Chd represents a frozen PHF for the given set of keys

func (*Chd) DumpMeta

func (c *Chd) DumpMeta(w io.Writer)

Dump CHD meta-data to io.Writer 'w'

func (*Chd) Find

func (c *Chd) Find(k uint64) uint64

Find returns a unique integer representing the minimal hash for key 'k'. The return value is meaningful ONLY for keys in the original key set (provided at the time of construction of the minimal-hash). Callers should verify that the key at the returned index == k.

func (*Chd) Len

func (c *Chd) Len() int

Len returns the actual length of the PHF lookup table

func (*Chd) MarshalBinary

func (c *Chd) MarshalBinary(w io.Writer) (int, error)

MarshalBinary encodes the hash into a binary form suitable for durable storage. A subsequent call to UnmarshalBinary() will reconstruct the CHD instance.

func (*Chd) SeedSize

func (c *Chd) SeedSize() byte

func (*Chd) UnmarshalBinaryMmap

func (c *Chd) UnmarshalBinaryMmap(buf []byte) error

UnmarshalBinaryMmap reads a previously marshalled Chd instance and returns a lookup table. It assumes that buf is memory-mapped and aligned at the right boundaries.

type ChdBuilder

type ChdBuilder struct {
	// contains filtered or unexported fields
}

ChdBuilder is used to create a MPHF from a given set of uint64 keys

func New

func New() (*ChdBuilder, error)

New enables creation of a minimal perfect hash function via the Compress Hash Displace algorithm. Once created, callers can add keys to it before Freezing the MPH and generating a constant time lookup table. This implementation of CHD uses uint64 keys. Callers can use any good hash function (murmur hash etc.) to map their data into these keys. Once the construction is frozen, callers can use "Find()" to find the unique mapping for each key in 'keys'.
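Since the builder only accepts uint64 keys, string or byte-slice data must first be hashed down to that domain. A minimal sketch using the standard library's FNV-1a; this is one reasonable choice, not something the library mandates:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKey maps an arbitrary string key into the uint64 domain the
// builder expects. FNV-1a from the stdlib is used here for
// illustration; any good 64-bit hash (murmur etc.) works.
func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // Write on an fnv hash never returns an error
	return h.Sum64()
}

func main() {
	for _, k := range []string{"alpha", "beta", "gamma"} {
		fmt.Printf("%s -> %#x\n", k, hashKey(k))
	}
}
```

The resulting uint64 values are what you pass to ChdBuilder.Add() and later to Chd.Find().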

func (*ChdBuilder) Add

func (c *ChdBuilder) Add(key uint64) error

Add a new key to the MPH builder

func (*ChdBuilder) Freeze

func (c *ChdBuilder) Freeze(load float64) (*Chd, error)

Freeze builds a constant-time lookup table using the CHD algorithm and the given load factor. Lower load factors speed up the construction of the MPHF. Suggested values for load are between 0.75 and 0.9.

type DBReader

type DBReader struct {
	// contains filtered or unexported fields
}

DBReader represents the query interface for a previously constructed constant database (built using NewDBWriter()). The only meaningful operation on such a database is Lookup().

func NewDBReader

func NewDBReader(fn string, cache int) (rd *DBReader, err error)

NewDBReader reads a previously constructed database in file 'fn' and prepares it for querying. Records are opportunistically cached after being read from disk. We retain up to 'cache' records in memory (default 128).

func (*DBReader) Close

func (rd *DBReader) Close()

Close closes the db

func (*DBReader) DumpMeta

func (rd *DBReader) DumpMeta(w io.Writer)

Dump the metadata to io.Writer 'w'

func (*DBReader) Find

func (rd *DBReader) Find(key uint64) ([]byte, error)

Find looks up 'key' in the table and returns the corresponding value. It returns an error if the key is not found or the disk i/o failed or the record checksum failed.

func (*DBReader) Len

func (rd *DBReader) Len() int

Len returns the total number of distinct keys in the DB

func (*DBReader) Lookup

func (rd *DBReader) Lookup(key uint64) ([]byte, bool)

Lookup looks up 'key' in the table and returns the corresponding value. If the key is not found, the returned value is nil and the boolean is false.

type DBWriter

type DBWriter struct {
	// contains filtered or unexported fields
}

DBWriter represents an abstraction to construct a read-only constant database. This database uses CHD as the underlying mechanism for constant time lookups of keys; keys and values are represented as arbitrary byte sequences ([]byte). The DB meta-data is protected by strong checksum (SHA512-256) and each stored value is protected by a distinct siphash-2-4. Once all addition of key/val is complete, the DB is written to disk via the Freeze() function.

We don't want to use SHA512-256 over the entire file - because it would mean reading a potentially large file in DBReader(). By checksumming each record separately, we increase the overhead a bit - but speed up DBReader initialization for the common case; actual records are verified opportunistically.
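The per-record scheme can be sketched as follows. SHA-256 stands in here for the keyed siphash-2-4 the real format uses (an assumption purely for a self-contained illustration); the point is that each record carries its own checksum and can be verified in isolation:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// seal prefixes a value with its 32-byte checksum, forming one record.
// (SHA-256 is a stand-in; the real DB uses a salted siphash-2-4.)
func seal(val []byte) []byte {
	sum := sha256.Sum256(val)
	return append(sum[:], val...)
}

// verify checks a single record without touching any other record.
func verify(rec []byte) ([]byte, bool) {
	sum := sha256.Sum256(rec[32:])
	return rec[32:], bytes.Equal(sum[:], rec[:32])
}

func main() {
	rec := seal([]byte("10.0.0.1"))
	val, ok := verify(rec)
	fmt.Println(ok, string(val)) // true 10.0.0.1
}
```

Because verification is per record, a reader can open the DB cheaply and only pay the checksum cost for records it actually fetches.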

The DB has the following general structure:

  • 64-byte file header (big-endian encoding of all multibyte ints):

      • magic [4]byte: "CHDB"
      • flags uint32: for now, all zeros
      • salt [16]byte: random salt for siphash record integrity
      • nkeys uint64: number of keys in the DB
      • offtbl uint64: file offset of the <offset, hash> table

  • Contiguous series of records; each record is a key/value pair:

      • cksum uint64: siphash checksum of value, offset (big-endian)
      • val []byte: value bytes

  • Possibly a gap until the next PageSize boundary (4096 bytes)

  • Offset table: nkeys worth of <offset, hash> pairs. Everything in this table is little-endian encoded so we can mmap() it into memory. Entry 'i' has two 64-bit words:

      • offset in the file where the corresponding value can be found
      • hash key corresponding to the value

  • Val_len table: nkeys worth of value lengths corresponding to each key

  • Marshaled Chd bytes (Chd.MarshalBinary())

  • 32 bytes of strong checksum (SHA512-256); this checksum is computed over the file header, offset table and marshaled Chd

func NewDBWriter

func NewDBWriter(fn string) (*DBWriter, error)

NewDBWriter prepares file 'fn' to hold a constant DB built using CHD minimal perfect hash function. Once written, the DB is "frozen" and readers will open it using NewDBReader() to do constant time lookups of key to value.

func (*DBWriter) Abort

func (w *DBWriter) Abort()

Abort stops the construction of the perfect hash db

func (*DBWriter) Add

func (w *DBWriter) Add(key uint64, val []byte) error

Add adds a single key/value pair.

func (*DBWriter) AddKeyVals

func (w *DBWriter) AddKeyVals(keys []uint64, vals [][]byte) (int, error)

AddKeyVals adds a series of matched key-value pairs to the db. If the slices are of unequal length, only the smaller length is used. Records with duplicate keys are discarded. It returns the number of records added.

func (*DBWriter) Freeze

func (w *DBWriter) Freeze(load float64) (err error)

Freeze builds the minimal perfect hash, writes the DB and closes it. The parameter 'load' controls the MPHF table size (load): 0 < load < 1. If space is not an issue, use a lower value of load. Typical values are between 0.75 and 0.9.

func (*DBWriter) Len

func (w *DBWriter) Len() int

Len returns the total number of distinct keys in the DB
