bloom

package module
v1.0.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 25, 2022 License: BSD-2-Clause Imports: 10 Imported by: 0

README

Redis Bloom filters

Go Reference

A Bloom filter is a concise/compressed representation of a set, where the main requirement is to make membership queries; i.e., whether an item is a member of a set. A Bloom filter will always correctly report the presence of an element in the set when the element is indeed present. A Bloom filter can use much less storage than the original set, but it allows for some 'false positives': it may sometimes report that an element is in the set whereas it is not.

When you construct, you need to know how many elements you have (the desired capacity), and what is the desired false positive rate you are willing to tolerate. A common false-positive rate is 1%. The lower the false-positive rate, the more memory you are going to require. Similarly, the higher the capacity, the more memory you will use. You may construct the Bloom filter capable of receiving 1 million elements with a false-positive rate of 1% in the following manner.

    redisClient := redis.NewUniversalClient(&redis.UniversalOptions{Addrs: []string{":6379"}})
    bitset := NewRedisBitSet(redisClient, uuid.New().String(), time.Minute)
    filter := bloom.NewWithEstimates(1000000, 0.01, bitset) 

You should call NewWithEstimates conservatively: if you specify a number of elements that it is too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure: you must know ahead of time what your desired capacity is.

Our implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

    filter.Add([]byte("Love"))

Similarly, to test if "Love" is in bloom:

    if filter.Test([]byte("Love"))

For numerical data, we recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

    i := uint32(100)
    n1 := make([]byte, 4)
    binary.BigEndian.PutUint32(n1, i)
    filter.Add(n1)

Godoc documentation: https://pkg.go.dev/github.com/HoangViet144/bloom

Installation

go get -u github.com/HoangViet144/bloom

Verifying the False Positive Rate

Sometimes, the actual false positive rate may differ (slightly) from the theoretical false positive rate. We have a function to estimate the false positive rate of a Bloom filter with m bits and k hashing functions for a set of size n:

    if bloom.EstimateFalsePositiveRate(20*n, 5, n) > 0.001 ...

You can use it to validate the computed m, k parameters:

    m, k := bloom.EstimateParameters(n, fp)
    ActualfpRate := bloom.EstimateFalsePositiveRate(m, k, n, bitset)

or

    f := bloom.NewWithEstimates(n, fp, bitset)
    ActualfpRate := bloom.EstimateFalsePositiveRate(f.m, f.k, n, bitset)

You would expect ActualfpRate to be close to the desired false-positive rate fp in these cases.

The EstimateFalsePositiveRate function creates a temporary Bloom filter. It is also relatively expensive and only meant for validation.

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

Documentation

Overview

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func EstimateFalsePositiveRate

func EstimateFalsePositiveRate(m, k, n uint, b BitSet) (fpRate float64)

EstimateFalsePositiveRate returns, for a BloomFilter of m bits and k hash functions, an estimation of the false positive rate when

storing n entries. This is an empirical, relatively slow

test using integers as keys. This function is useful to validate the implementation.

func EstimateParameters

func EstimateParameters(n uint, p float64) (m uint, k uint)

EstimateParameters estimates requirements for m and k. Based on https://bitbucket.org/ww/bloom/src/829aa19d01d9/bloom.go used with permission.

func Locations

func Locations(data []byte, k uint) []uint64

Locations returns a list of hash locations representing a data item.

Types

type BitSet

type BitSet interface {
	// Init allocate bit set based on bit length
	Init(length uint) BitSet
	// Set bit i to 1, the capacity of the bitset is automatically
	// increased accordingly.
	// If i>= Cap(), this function will panic.
	// Warning: using a very large value for 'i'
	// may lead to a memory shortage and a panic: the caller is responsible
	// for providing sensible parameters in line with their memory capacity.
	Set(i uint) BitSet
	// UnSet bit i to 0
	UnSet(i uint) BitSet
	// InPlaceUnion creates the destructive union of base set and compare set.
	// This is the BitSet equivalent of | (or).
	InPlaceUnion(compare BitSet)
	// Test whether bit i is set.
	Test(i uint) bool
	// ClearAll clears the entire BitSet
	ClearAll() BitSet
	// Count (number of set bits).
	// Also known as "popcount" or "population count".
	Count() uint
	// WriteTo writes a BitSet to a stream
	WriteTo(stream io.Writer) (int64, error)
	// Equal tests the equivalence of two BitSets.
	// False if they are of different sizes, otherwise true
	// only if all the same bits are set
	Equal(c BitSet) bool
	// GetBitSetKey returns the key of redis bitset. It is used only for redis bitset
	GetBitSetKey() string
	// ReadFrom reads a BitSet from a stream written using WriteTo
	ReadFrom(stream io.Reader) (int64, error)
	// From is a constructor used to create a BitSet from an array of integers
	From(buf []uint64) BitSet
}

func NewRedisBitSet

func NewRedisBitSet(redisClient redis.UniversalClient, bitsetKey string, expiration time.Duration) BitSet

type BloomFilter

type BloomFilter interface {
	// Cap returns the capacity, _m_, of a Bloom filter
	Cap() uint
	// K returns the number of hash functions used in the BloomFilter
	K() uint
	// BitSet returns the underlying bitset for this filter.
	BitSet() BitSet
	// Add data to the Bloom Filter. Returns the filter (allows chaining)
	Add(data []byte) BloomFilter
	// AddString to the Bloom Filter. Returns the filter (allows chaining)
	AddString(data string) BloomFilter
	// Test returns true if the data is in the BloomFilter, false otherwise.
	// If true, the result might be a false positive. If false, the data
	// is definitely not in the set.
	Test(data []byte) bool
	// TestString returns true if the string is in the BloomFilter, false otherwise.
	// If true, the result might be a false positive. If false, the data
	// is definitely not in the set.
	TestString(data string) bool
	// TestLocations returns true if all locations are set in the BloomFilter, false
	// otherwise.
	TestLocations(locs []uint64) bool
	// TestAndAdd is the equivalent to calling Test(data) then Add(data).
	// Returns the result of Test.
	TestAndAdd(data []byte) bool
	// TestAndAddString is the equivalent to calling Test(string) then Add(string).
	// Returns the result of Test.
	TestAndAddString(data string) bool
	// TestOrAdd is the equivalent to calling Test(data) then if not present Add(data).
	// Returns the result of Test.
	TestOrAdd(data []byte) bool
	// TestOrAddString is the equivalent to calling Test(string) then if not present Add(string).
	// Returns the result of Test.
	TestOrAddString(data string) bool
	// ClearAll clears all the data in a Bloom filter, removing all keys
	ClearAll() BloomFilter
	// ApproximatedSize approximates the number of items
	// https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter
	ApproximatedSize() uint32
	// MarshalJSON implements json.Marshaler interface.
	MarshalJSON() ([]byte, error)
	// UnmarshalJSON implements json.Unmarshaler interface.
	UnmarshalJSON(data []byte) error
	// WriteTo writes a binary representation of the BloomFilter to an i/o stream.
	// It returns the number of bytes written.
	WriteTo(stream io.Writer) (int64, error)
	// ReadFrom reads a binary representation of the BloomFilter (such as might
	// have been written by WriteTo()) from an i/o stream. It returns the number
	// of bytes read.
	ReadFrom(stream io.Reader) (int64, error)
	// GobEncode implements gob.GobEncoder interface.
	GobEncode() ([]byte, error)
	// GobDecode implements gob.GobDecoder interface.
	GobDecode(data []byte) error
	// Equal tests for the equality of two Bloom filters
	Equal(g BloomFilter) bool
}

func From

func From(data []uint64, k uint, b BitSet) BloomFilter

From creates a new Bloom filter with len(_data_) * 64 bits and _k_ hashing functions. The data slice is not going to be reset.

func FromWithM

func FromWithM(data []uint64, m, k uint, b BitSet) BloomFilter

FromWithM creates a new Bloom filter with _m_ length, _k_ hashing functions. The data slice is not going to be reset.

func New

func New(m uint, k uint, b BitSet) BloomFilter

New creates a new Bloom filter with _m_ bits and _k_ hashing functions We force _m_ and _k_ to be at least one to avoid panics.

func NewWithEstimates

func NewWithEstimates(n uint, fp float64, b BitSet) BloomFilter

NewWithEstimates creates a new Bloom filter for about n items with fp false positive rate

type RedisBitSet

type RedisBitSet struct {
	// contains filtered or unexported fields
}

func (*RedisBitSet) ClearAll

func (r *RedisBitSet) ClearAll() BitSet

func (*RedisBitSet) Count

func (r *RedisBitSet) Count() uint

func (*RedisBitSet) Equal

func (r *RedisBitSet) Equal(c BitSet) bool

func (*RedisBitSet) From

func (r *RedisBitSet) From(buf []uint64) BitSet

func (*RedisBitSet) GetBitSetKey

func (r *RedisBitSet) GetBitSetKey() string

func (*RedisBitSet) InPlaceUnion

func (r *RedisBitSet) InPlaceUnion(compare BitSet)

func (*RedisBitSet) Init

func (r *RedisBitSet) Init(length uint) BitSet

func (*RedisBitSet) ReadFrom

func (r *RedisBitSet) ReadFrom(stream io.Reader) (int64, error)

func (*RedisBitSet) Set

func (r *RedisBitSet) Set(i uint) BitSet

func (*RedisBitSet) Test

func (r *RedisBitSet) Test(i uint) bool

func (*RedisBitSet) UnSet

func (r *RedisBitSet) UnSet(i uint) BitSet

func (*RedisBitSet) WriteTo

func (r *RedisBitSet) WriteTo(stream io.Writer) (int64, error)

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL