blobloom

package module
v0.8.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 2, 2024 License: Apache-2.0 Imports: 10 Imported by: 6

README

Blobloom

A Bloom filter package for Go (golang) with no compile-time dependencies.

This package implements a version of Bloom filters called blocked Bloom filters, which get a speed boost from using the CPU cache more efficiently than regular Bloom filters.

Unlike most Bloom filter packages for Go, this one doesn't run a hash function for you. That's a benefit if you need a custom hash or you want pick the fastest one for an application.

Usage

To construct a Bloom filter, you need to know how many keys you want to store and what rate of false positives you find acceptable.

f := blobloom.NewOptimized(blobloom.Config{
	Capacity: nkeys, // Expected number of keys.
	FPRate:   1e-4,  // Accept one false positive per 10,000 lookups.
})

To add a key:

// import "github.com/cespare/xxhash/v2"
f.Add(xxhash.Sum64(key))

To test for the presence of a key in the filter:

if f.Has(xxhash.Sum64(key)) {
	// Key is probably in f.
} else {
	// Key is certainly not in f.
}

The false positive rate is defined as usual: if you look up 10,000 random keys in a Bloom filter filled to capacity, an expected one of those is a false positive for FPRate 1e-4.

See the examples/ directory and the package documentation for further usage information and examples.

Hash functions

Blobloom does not provide hash functions. Instead, it requires client code to represent each key as a single 64-bit hash value, leaving it to the user to pick the right hash function for a particular problem. Here are some general suggestions:

  • If you use Bloom filters to speed up access to a key-value store, you might want to look at xxh3 or xxhash.
  • If your keys are cryptographic hashes, consider using the first 8 bytes of those hashes.
  • If you use Bloom filters to make probabilistic decisions, a randomized hash function such as maphash should prevent the same false positives occurring every time.

When evaluating a hash function, or designing a custom one, make sure it is a 64-bit hash that properly mixes its input bits. Casting a 32-bit hash to uint64 gives suboptimal results. So does passing integer keys in without running them through a mixing function.

License

Copyright © 2020-2023 the Blobloom authors

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Documentation

Overview

Package blobloom implements blocked Bloom filters.

Blocked Bloom filters are an approximate set data structure: if a key has been added to a filter, a lookup of that key returns true, but if the key has not been added, there is a non-zero probability that the lookup still returns true (a false positive). False negatives are impossible: if the lookup for a key returns false, that key has not been added.

In this package, keys are represented exclusively as hashes. Client code is responsible for supplying a 64-bit hash value.

Compared to standard Bloom filters, blocked Bloom filters use the CPU cache more efficiently. A blocked Bloom filter is an array of ordinary Bloom filters of fixed size BlockBits (the blocks). The lower half of the hash selects the block to use.

To achieve the same false positive rate (FPR) as a standard Bloom filter, a blocked Bloom filter requires more memory. For an FPR of at most 2e-6 (two in a million), it uses ~20% more memory. At 1e-10, the space required is double that of standard Bloom filter.

For more details, see the 2010 paper by Putze, Sanders and Singler, https://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf.

Example (Fnv)
package main

import (
	"fmt"
	"hash/fnv"
	"io"

	"github.com/greatroar/blobloom"
)

func main() {
	// This example uses the hash/fnv package from the standard Go library.

	f := blobloom.New(10000, 5)
	h := fnv.New64()

	messages := []string{
		"Hello!",
		"Welcome!",
		"Mind your step!",
		"Have fun!",
		"Goodbye!",
	}

	for _, msg := range messages {
		h.Reset()
		io.WriteString(h, msg)
		f.Add(h.Sum64())
	}

	for _, msg := range messages {
		h.Reset()
		io.WriteString(h, msg)
		if f.Has(h.Sum64()) {
			fmt.Println(msg)
		} else {
			panic("Bloom filter didn't get the message")
		}
	}

}
Output:

Hello!
Welcome!
Mind your step!
Have fun!
Goodbye!
Example (Sha224)
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"

	"github.com/greatroar/blobloom"
)

func main() {
	// If you have items addressed by a cryptographic hash,
	// you can use a prefix of it as the hash value for a Bloom filter.
	//
	// If the cryptohashes denote objects from an untrusted source,
	// the Bloom filter can be tricked into giving false positives for
	// chosen objects, because it only uses a small part of the hash
	// that can easily be broken (by a birthday attack). If that can
	// cause problems in your application, first run SipHash on the
	// full cryptohash to get the hash value for the Bloom filter:
	//
	//	import "github.com/dchest/siphash"
	//	h := siphash.Hash(secret1, secret2, key[:])

	// A list of files, identified by their SHA-224.
	files := []string{
		"\x85\x52\xd8\xb7\xa7\xdc\x54\x76\xcb\x9e\x25\xde\xe6\x9a\x80\x91\x29\x07\x64\xb7\xf2\xa6\x4f\xe6\xe7\x8e\x95\x68",
		"\xa0\xad\x8f\x63\x90\x72\x74\x7b\xc3\x43\x09\x45\x94\x0e\x7c\x73\xb8\x34\x93\xf1\x77\x90\x0f\xd2\x7d\x09\x65\x94",
		"\x7b\xd3\xdb\x48\x1e\x7b\x05\x2c\x88\x18\x68\xcc\x13\xc3\x04\x34\x43\x2d\x7b\x49\x24\x74\x70\x33\xd2\xe8\x6e\x73",
	}

	// first64 extracts the first 64 bits of a key as a uint64.
	// The choice of big vs. little-endian is arbitrary.
	first64 := func(key []byte) uint64 {
		return binary.BigEndian.Uint64(key[:8])
	}

	f := blobloom.NewOptimized(blobloom.Config{Capacity: 600, FPRate: .002})

	for _, filehash := range files {
		f.Add(first64([]byte(filehash)))
	}

	for _, s := range []string{"Hello, world!", "Goodbye"} {
		h := sha256.Sum224([]byte(s))
		found := f.Has(first64(h[:]))
		if found {
			fmt.Printf("Found: %v\n", s)
		}
	}

}
Output:

Found: Hello, world!

Index

Examples

Constants

View Source
const BlockBits = 512

BlockBits is the number of bits per block and the minimum number of bits in a Filter.

The value of this constant is chosen to match the L1 cache line size of popular architectures (386, amd64, arm64).

View Source
const MaxBits = BlockBits << 32 // 256GiB.

MaxBits is the maximum number of bits supported by a Filter.

Variables

This section is empty.

Functions

func Dump added in v0.8.0

func Dump(w io.Writer, f *Filter, comment string) (int64, error)

Dump writes f to w, with an optional comment string, in the binary format that a Loader accepts. It returns the number of bytes written to w.

The comment may contain arbitrary data, within the limits layed out by the format description. It can be used to record the hash function to be used with a Filter.

func DumpSync added in v0.8.0

func DumpSync(w io.Writer, f *SyncFilter, comment string) (n int64, err error)

DumpSync is like Dump, but for SyncFilters.

If other goroutines are simultaneously modifying f, their modifications may not be reflected in the dump. Separate synchronization is required to prevent this.

The format produced is the same as Dump's. The fact that the argument is a SyncFilter is not encoded in the dump.

func FPRate

func FPRate(nkeys, nbits uint64, nhashes int) float64

FPRate computes an estimate of the false positive rate of a Bloom filter after nkeys distinct keys have been added.

func Optimize

func Optimize(config Config) (nbits uint64, nhashes int)

Optimize returns numbers of keys and hash functions that achieve the desired false positive described by config.

Optimize panics when config.FPRate is invalid.

The estimated number of bits is imprecise for false positives rates below ca. 1e-15.

Example
package main

import (
	"fmt"

	"github.com/greatroar/blobloom"
)

func main() {
	cfg := blobloom.Config{
		// We want to insert a billion keys and get a false positive rate of
		// one in a million, but we only have 2GiB (= 2^31 bytes) to spare.
		Capacity: 1e9,
		FPRate:   1e-6,
		MaxBits:  8 * 1 << 31,
	}
	nbits, nhashes := blobloom.Optimize(cfg)
	fpr := blobloom.FPRate(cfg.Capacity, nbits, nhashes)

	// How big will the filter be and what FP rate will we achieve?
	fmt.Printf("size = %dMiB\nfpr = %.3f\n", nbits/(8<<20), fpr)

}
Output:

size = 2048MiB
fpr = 0.001

Types

type Config

type Config struct {

	// Capacity is the expected number of distinct keys to be added.
	// More keys can always be added, but the false positive rate can be
	// expected to drop below FPRate if their number exceeds the Capacity.
	Capacity uint64

	// Desired lower bound on the false positive rate when the Bloom filter
	// has been filled to its capacity. FPRate must be between zero
	// (exclusive) and one (inclusive).
	FPRate float64

	// Maximum size of the Bloom filter in bits. Zero means the global
	// MaxBits constant. A value less than BlockBits means BlockBits.
	MaxBits uint64
	// contains filtered or unexported fields
}

A Config holds parameters for Optimize or NewOptimized.

type Filter

type Filter struct {
	// contains filtered or unexported fields
}

A Filter is a blocked Bloom filter.

func New

func New(nbits uint64, nhashes int) *Filter

New constructs a Bloom filter with given numbers of bits and hash functions.

The number of bits should be at least BlockBits; smaller values are silently increased.

The number of hashes reflects the number of hashes synthesized from the single hash passed in by the client. It is silently increased to two if a lower value is given.

func NewOptimized

func NewOptimized(config Config) *Filter

NewOptimized is shorthand for New(Optimize(config)).

func (*Filter) Add

func (f *Filter) Add(h uint64)

Add insert a key with hash value h into f.

func (*Filter) Cardinality added in v0.2.0

func (f *Filter) Cardinality() float64

Cardinality estimates the number of distinct keys added to f.

The estimate is most reliable when f is filled to roughly its capacity. It gets worse as f gets more densely filled. When one of the blocks is entirely filled, the estimate becomes +Inf.

The return value is the maximum likelihood estimate of Papapetrou, Siberski and Nejdl, summed over the blocks (https://www.win.tue.nl/~opapapetrou/papers/Bloomfilters-DAPD.pdf).

Example (Infinity)
package main

import (
	"fmt"
	"math"

	"github.com/greatroar/blobloom"
)

var hashes [200]uint64

func main() {
	// To handle the case of Cardinality returning +Inf, track the number of
	// calls to Add and compute the minimum.

	// This Bloom filter is constructed with too many hash functions
	// to force +Inf.
	f := blobloom.New(512, 100)
	var numAdded int

	add := func(h uint64) {
		f.Add(h)
		numAdded++
	}

	for _, h := range hashes {
		add(h)
	}

	estimate := f.Cardinality()
	fmt.Printf("blobloom's estimate:    %.2f\n", estimate)
	fmt.Printf("number of calls to Add: %d\n", numAdded)
	estimate = math.Min(estimate, float64(numAdded))
	fmt.Printf("combined estimate:      %.2f\n", estimate)

}
Output:

blobloom's estimate:    +Inf
number of calls to Add: 200
combined estimate:      200.00

func (*Filter) Clear

func (f *Filter) Clear()

Clear resets f to its empty state.

func (*Filter) Empty added in v0.6.0

func (f *Filter) Empty() bool

Empty reports whether f contains no keys.

func (*Filter) Equals added in v0.8.0

func (f *Filter) Equals(g *Filter) bool

Equals returns true if f and g contain the same keys (in terms of Has) when used with the same hash function.

func (*Filter) FPRate

func (f *Filter) FPRate(nkeys uint64) float64

FPRate computes an estimate of f's false positive rate after nkeys distinct keys have been added.

func (*Filter) Fill added in v0.6.0

func (f *Filter) Fill()

Fill set f to a completely full filter. After Fill, Has returns true for any key.

func (*Filter) Has

func (f *Filter) Has(h uint64) bool

Has reports whether a key with hash value h has been added. It may return a false positive.

func (*Filter) Intersect added in v0.4.0

func (f *Filter) Intersect(g *Filter)

Intersect sets f to the intersection of f and g.

Intersect panics when f and g do not have the same number of bits and hash functions. Both Filters must be using the same hash function(s), but Intersect cannot check this.

Since Bloom filters may return false positives, Has may return true for a key that was not in both f and g.

After Intersect, the estimates from f.Cardinality and f.FPRate should be considered unreliable.

func (*Filter) NumBits

func (f *Filter) NumBits() uint64

NumBits returns the number of bits of f.

func (*Filter) Union

func (f *Filter) Union(g *Filter)

Union sets f to the union of f and g.

Union panics when f and g do not have the same number of bits and hash functions. Both Filters must be using the same hash function(s), but Union cannot check this.

Example
package main

import (
	"hash/fnv"
	"io"

	"github.com/greatroar/blobloom"
)

const nworkers = 4

func getKeys(keys chan<- string) {
	keys <- "hello"
	keys <- "goodbye"
	close(keys)
}

func hash(key string) uint64 {
	h := fnv.New64()
	io.WriteString(h, key)
	return h.Sum64()
}

func main() {
	// Union can be used to fill a Bloom filter using multiple goroutines.
	//
	// Each goroutine allocates a filter, so the memory use increases
	// by a factor nworkers-1 compared to a sequential version
	// or a SyncFilter.

	keys := make(chan string, nworkers)
	filters := make(chan *blobloom.Filter, nworkers)

	go getKeys(keys)

	for i := 0; i < nworkers; i++ {
		go func() {
			f := blobloom.New(1<<20, 6)
			for key := range keys {
				f.Add(hash(key))
			}

			filters <- f
		}()
	}

	f := <-filters
	for i := 1; i < nworkers; i++ {
		f.Union(<-filters)
	}

}
Output:

type Loader added in v0.8.0

type Loader struct {
	Comment string // Comment field. Filled in by NewLoader.
	// contains filtered or unexported fields
}

A Loader reads a Filter or SyncFilter from an io.Reader.

A Loader accepts the binary format produced by Dump. The format starts with a 64-byte header:

  • the string "blobloom", in ASCII;
  • a four-byte version number, which must be zero;
  • the number of Bloom filter blocks, minus one, as a 32-bit integer;
  • the number of hashes, as a 32-bit integer;
  • a comment of at most 44 non-zero bytes, padded to 44 bytes with zeros.

After the header come the 512-bit blocks, divided into sixteen 32-bit limbs. All integers are little-endian.

func NewLoader added in v0.8.0

func NewLoader(r io.Reader) (*Loader, error)

NewLoader parses the format header from r and returns a Loader that can be used to load a Filter from it.

func (*Loader) Load added in v0.8.0

func (l *Loader) Load(f *Filter) (*Filter, error)

Load sets f to the union of f and the Loader's filter, then returns f. If f is nil, a new Filter of the appropriate size is constructed.

If f is not nil and an error occurs while reading from the Loader, f may end up in an inconsistent state.

func (*Loader) LoadSync added in v0.8.0

func (l *Loader) LoadSync(f *SyncFilter) (*SyncFilter, error)

Load sets f to the union of f and the Loader's filter, then returns f. If f is nil, a new SyncFilter of the appropriate size is constructed. Else, LoadSync may run concurrently with other modifications to f.

If f is not nil and an error occurs while reading from the Loader, f may end up in an inconsistent state.

type SyncFilter added in v0.7.0

type SyncFilter struct {
	// contains filtered or unexported fields
}

A SyncFilter is a Bloom filter that can be accessed and updated by multiple goroutines concurrently.

A SyncFilter mostly behaves as a regular filter protected by a lock,

type SyncFilter struct {
	Filter
	lock sync.Mutex
}

with each method taking and releasing the lock, but is implemented much more efficiently. See the method descriptions for exceptions to the previous rule.

Example
package main

import (
	"fmt"
	"sync"

	"github.com/greatroar/blobloom"
)

var hashes [200]uint64

func main() {
	// Multiple goroutines can Add to a SyncFilter concurrently,
	// without requiring separate synchronization.

	f := blobloom.NewSync(1<<20, 6)
	var wg sync.WaitGroup

	add := func(hs []uint64) {
		for _, h := range hs {
			f.Add(h)
		}
		wg.Done()
	}

	wg.Add(2)
	half := len(hashes) / 2
	go add(hashes[:half])
	go add(hashes[half:])

	wg.Wait() // Wait for updating goroutines to complete.

	for _, h := range hashes {
		if !f.Has(h) {
			fmt.Printf("hash %d added but not retrieved\n", h)
		}
	}

}
Output:

func NewSync added in v0.7.0

func NewSync(nbits uint64, nhashes int) *SyncFilter

NewSync constructs a Bloom filter with given numbers of bits and hash functions.

The number of bits should be at least BlockBits; smaller values are silently increased.

The number of hashes reflects the number of hashes synthesized from the single hash passed in by the client. It is silently increased to two if a lower value is given.

func NewSyncOptimized added in v0.7.0

func NewSyncOptimized(config Config) *SyncFilter

NewSyncOptimized is shorthand for New(Optimize(config)).

func (*SyncFilter) Add added in v0.7.0

func (f *SyncFilter) Add(h uint64)

Add insert a key with hash value h into f.

func (*SyncFilter) Cardinality added in v0.7.0

func (f *SyncFilter) Cardinality() float64

Cardinality estimates the number of distinct keys added to f.

The estimate is most reliable when f is filled to roughly its capacity. It gets worse as f gets more densely filled. When one of the blocks is entirely filled, the estimate becomes +Inf.

The return value is the maximum likelihood estimate of Papapetrou, Siberski and Nejdl, summed over the blocks (https://www.win.tue.nl/~opapapetrou/papers/Bloomfilters-DAPD.pdf).

If other goroutines are concurrently adding keys, the estimate may lie in between what would have been returned before the concurrent updates started and what is returned after the updates complete.

func (*SyncFilter) Empty added in v0.7.0

func (f *SyncFilter) Empty() bool

Empty reports whether f contains no keys.

If other goroutines are concurrently adding keys, Empty may return a false positive.

func (*SyncFilter) Fill added in v0.7.0

func (f *SyncFilter) Fill()

Fill sets f to a completely full filter. After Fill, Has returns true for any key.

func (*SyncFilter) Has added in v0.7.0

func (f *SyncFilter) Has(h uint64) bool

Has reports whether a key with hash value h has been added. It may return a false positive.

Directories

Path Synopsis
benchmarks module
examples
bloomstat
Bloomstat is a utility for estimating Bloom filter sizes.
Bloomstat is a utility for estimating Bloom filter sizes.
spellcheck
This package implements a toy interactive spell checker.
This package implements a toy interactive spell checker.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL