Documentation
¶
Overview ¶
Package minhash provides probabilistic data structures for computing MinHash signatures for streaming data from a min-wise independent parametric family of hash functions.
The reader should also conf https://github.com/dgryski/go-minhash and https://github.com/tylertreat/BoomFilters. In fact, most of the implementation details of this package are based off the former.
MinHash signatures can be used to estimate the Jaccard index J(A, B) := |A & B| / |A || B| of two sets that are subsets of some finite, totally ordered set U. If s is a permutation of U chosen uniformly at random, then x := argmin s(A || B) is just a random element chosen uniformly from A || B. It's clear that P(x in A & B) = J(A, B). Since min s(A) = min s(B) if and only if x is in A & B, we have just shown that P(min s(A) = min S(B)) = J(A, B).
The central idea of minhash signatures is to repeatedly perform the above experiment with different permutations as a way to estimate the underlying probability J(A, B) = P(an element x in A || B is also in A & B).
A length k minhash signature S(A) is theoretically generated by randomly choosing k permutations si (i=1, ..., k) in the symmetric group of U (group of bijective endofunctions on U) and computing hi(A) := min si(A) for each permutation. We take S(A) := [h1(A), ..., hk(A)]. Since each permutation is a bijection, min si(A) = min si(B) if and only if argmin si(A) = argmin si(B) and so we could just as well use these argmins, which is sometimes how the signature S(A) is defined.
Specifying permutations for large U is not efficient, and so we often take a family of integer-valued hash functions that are minwise independent, in the sense that for most sets A, min h(A) ! = min g(A) for two distinct hash functions in the family. Frequently this family is parametrically generated.
For more information,
http://research.neustar.biz/2012/07/09/sketch-of-the-day-k-minimum-values/ MinHashing: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf https://en.wikipedia.org/wiki/MinHash BottomK: http://www.math.tau.ac.il/~haimk/papers/p225-cohen.pdf http://cohenwang.org/edith/Papers/metrics394-cohen.pdf http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer07distinct.pdf
This package works best when provided with a strong 64-bit hash function, such as CityHash, Spooky, Murmur3, or SipHash.
Index ¶
- Constants
- func Cardinality(m Interface) int
- func IntersectionCardinality(m1, m2 Interface) int
- func IsEmpty(m Interface) bool
- func LessCardinality(m1, m2 Interface) int
- func Similarity(x, y Interface) float64
- func SymmetricDifferenceCardinality(m1, m2 Interface) int
- func UnionCardinality(m1, m2 Interface) int
- type HashFunc
- type Interface
- type MinHash
- func (m *MinHash) Cardinality() int
- func (m *MinHash) Copy() *MinHash
- func (m *MinHash) IntersectionCardinality(m2 Interface) int
- func (m *MinHash) IsEmpty() bool
- func (m *MinHash) LessCardinality(m2 Interface) int
- func (m *MinHash) Merge(m2 Interface) error
- func (m *MinHash) Push(x interface{})
- func (m *MinHash) PushBytes(b []byte)
- func (m *MinHash) PushString(s string)
- func (m *MinHash) PushStringInt(s string)
- func (m *MinHash) PushStrings(ss ...string)
- func (m *MinHash) SetHashes(h1 HashFunc, h2 HashFunc)
- func (m *MinHash) SetSignature(signature []uint64)
- func (m *MinHash) Signature() []uint64
- func (m *MinHash) Similarity(m2 Interface) float64
- func (m *MinHash) SymmetricDifferenceCardinality(m2 Interface) int
- func (m *MinHash) UnionCardinality(m2 Interface) int
Constants ¶
Variables ¶
This section is empty.
Functions ¶
func Cardinality ¶
Cardinality estimates the cardinality of the set. Both the signature for the empty set and the zero signature have cardinality 0.
func IntersectionCardinality ¶
IntersectionCardinality estimates the cardinality of the intersection.
func LessCardinality ¶
LessCardinality estimates the cardinality of the left set minus the right set. This operator is not symmetric.
func Similarity ¶
MinHashSimilarity computes an estimate for the Jaccard similarity of two sets given their MinHash signatures.
func SymmetricDifferenceCardinality ¶
SymmetricDifferenceCardinality estimates the difference between the cardinality of the union and intersection.
func UnionCardinality ¶
UnionCardinality estimates the cardinality of the union.
Types ¶
type Interface ¶
type Interface interface { // Signature returns the signature itself. Signature() []uint64 }
Interface is an a probabilistic data structure used to compute a similarity preserving signature for a set. It ingests a stream of the set's elements and continuously updates the signature.
type MinHash ¶
type MinHash struct {
// contains filtered or unexported fields
}
MinHash is a data structure for generating a min-wise independent parametric family of hash functions of the form h1 + i*h2 for i=1, ..., k in order to to compute a MinHash signature. Each instance is tied to a single streamed set and hence single signature. As the instance ingests elements it keeps track of the minimum for the ith hash function.
func NewFromSignature ¶
NewFromSignature creates a MinHash initialzed with the input signature.
func NewMinHash ¶
NewMinHash is deprecated. Use New instead.
func NewMinHashFromSignature ¶
NewMinHashFromSignature is deprecated. Use NewFromSignature instead.
func (*MinHash) Cardinality ¶
Cardinality estimates the cardinality of the set. Both the signature for the empty set and the zero signature have 0 cardinality.
func (*MinHash) IntersectionCardinality ¶
IntersectionCardinality estimates the cardinality of the intersection.
func (*MinHash) IsEmpty ¶
IsEmpty reports whether the MinHash instance represents a signature of the empty set. Note that it's possible for a signature of a non-empty set to equal the signature for the empty set in rare circumstances (e.g., when the hash family is not min-wise independent).
func (*MinHash) LessCardinality ¶
LessCardinality estimates the cardinality of the left set minus the right set. This operator is not symmetric.
func (*MinHash) Merge ¶
Merge combines the signatures of the second set, creating the signature of their union.
func (*MinHash) Push ¶
func (m *MinHash) Push(x interface{})
Push deals with generic data by handling byte conversion. It first hashes the input with each function in the instance's family, and then compares these values to the set of current mins, updating them as necessary.
func (*MinHash) PushBytes ¶
PushBytes updates the set's signature. It hashes the input with each function in the family and compares these values with the current set of mins, replacing them as necessary.
func (*MinHash) PushString ¶
PushString casts the input as a []byte and pushes the element.
func (*MinHash) PushStringInt ¶
PushStringInt first converts a string into a uint64 before pushing.
func (*MinHash) PushStrings ¶
func (*MinHash) SetHashes ¶
SetHashes sets the two hash functions the MinHash uses to generate a parametric family of hash functions from which the set signature is derived.
func (*MinHash) SetSignature ¶
SetSignature sets the instance's underlying signature. This can change the type of the MinHash if the input signature length differs from the length of the instance's signature.
func (*MinHash) Similarity ¶
Similarity computes the similarity of two signatures represented as MinHash instances. This estimates the Jaccard index of the two underlying sets.
func (*MinHash) SymmetricDifferenceCardinality ¶
SymmetricDifferenceCardinality estimates the difference between the cardinality of the union and intersection.
func (*MinHash) UnionCardinality ¶
UnionCardinality estimates the cardinality of the union.