lshensemble

package module

v1.1.0 Latest Latest Go to latest Published: Dec 27, 2018 License: MIT Imports: 12 Imported by: 3

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/ekzhu/lshensemble

Links

Open Source Insights

README ¶

LSH Ensemble

Documentation

Please cite this paper if you use this library in your work:

Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller: LSH Ensemble: Internet-Scale Domain Search. PVLDB 9(12): 1185-1196 (2016) Link

Presentation slides @ VLDB 2016, New Delhi.

Datasets

We used two datasets for evaluation. The datasets are all from public domains and can be downloaded directly from the original publisher.

Canadian Open Data, tabular domains only (as of June 2015): Each file corresponds to a single domain extracted from a column of a table, which could be a spreadsheet or an CSV file. The filenames follow the <data file name>.<column id> format.
2015 WDC Web Tables, English Relational, 51 compressed files: See the data format here.

By using these datasets you agree to use them for academic research purpose only, and we shall not be held responisble for any inaccuracy or error that may exist in the datasets, nor we shall be responsible for any consequence of usage of these datasets.

Quick Start Guide

Install this library by running:

go get github.com/ekzhu/lshensemble

Import the library in your import:

import (
	"github.com/ekzhu/lshensemble"
)

First you need to obtain the domains, and each domain should have a string ID. For simplicity we represent a domain as map[string]bool, whose keys are distinct values. Assuming you have obtained the domains from some dataset, you can generate the MinHash signatures from the domains.

domains []map[string]bool
keys []string // each key corresponds to the domain at the same index

// ... 
// obtaining domains and keys
// ...

// initializing the domain records to hold the MinHash signatures
domainRecords := make([]*lshensemble.DomainRecord, len(domains))

// set the minhash seed
seed := 42

// set the number of hash functions
numHash := 256

// create the domain records with the signatures
for i := range domains {
	mh := lshensemble.NewMinhash(seed, numHash)
	for v := range domains[i] {
		mh.Push([]byte(v))
	}
	domainRecords[i] := &lshensemble.DomainRecord {
		Key       : keys[i],
		Size      : len(domains[i]),
		Signature : mh.Signature()
	}
}

Before you can index the domains, you need to sort them in increasing order by their sizes. BySize wrapper allows the domains to tbe sorted using the build-in sort package.

sort.Sort(lshensemble.BySize(domainRecords))

Now you can use BootstrapLshEnsembleOptimal/BootstrapLshEnsembleEquiDepth (or BootstrapLshEnsemblePlusOptimal/BootstrapLshEnsemblePlusEquiDepth) for better accuracy at higher memory cost*) to create an LSH Ensemble index. BootstrapLshEnsembleOptimal uses dynamic programming to create partitions that are optimal in the sense that the total number of false positives generated from all the partitions are minimized. This method can be a bit slower due to the dynamic programming overhead, however, it creates optimized partitions for any kind of data distribution. BootstrapLshEnsembleEquiDepth uses simple equi-depth -- same number of domains in every partition. This is method is described in the original paper as suitable for power-law distributed domain sizes, which is common in real-world domains. You need to specify the number of partitions to use and some other parameters. The LSH parameter K (number of hash functions per band) is dynamically tuned at query-time, but the maximum value should be specified here.

* See explanation for the reason for the "Plus" version.

// Set the number of partitions
numPart := 8

// Set the maximum value for the MinHash LSH parameter K 
// (number of hash functions per band).
maxK := 4

// Create index using equi-depth partitioning
// You can also use BootstrapLshEnsemblePlusEquiDepth for better accuracy
index_eqd, err := lshensemble.BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, 
    len(domainRecords), lshensemble.Recs2Chan(domainRecords))
if err != nil {
	panic(err)
}

// Create index using optimal partitioning
// You can also use BootstrapLshEnsemblePlusOptimal for better accuracy
index_opt, err := lshensemble.BootstrapLshEnsembleOptimal(numPart, numHash, maxK,
    func () <-chan *lshensemble.DomainRecord { 
        return lshensemble.Recs2Chan(domainRecords); 
    })
if err != nil {
	panic(err)
}

For better memory efficiency when the number of domains is large, it's wiser to use Golang channels and goroutines to pipeline the generation of the signatures, and then use disk-based sorting to sort the domain records. This is why BootstrapLshEnsembleEquiDepth accepts a channel of *DomainRecord as input. For a small number of domains, you simply use Recs2Chan to convert the sorted slice of *DomainRecord into a chan *DomainRecord. To help serializing the domain records to disk, you can use SerializeSignature to serialize the signatures. You need to come up with your own serialization schema for the keys and sizes.

Lastly, you can use Query function, which returns a Golang channel of the keys of the candidates domains, which may contain false positives - domains that do not meet the containment threshold. Therefore, you can optionally include a post-processing step to remove the false positive domains using the original domain values.

// pick a domain to use as the query
querySig := domainRecords[0].Signature
querySize := domainRecords[0].Size

// set the containment threshold
threshold := 0.5

// get the keys of the candidate domains (may contain false positives)
// through a channel with option to cancel early.
done := make(chan struct{})
defer close(done) // Important!!
results := index.Query(querySig, querySize, threshold, done)

for key := range results {
	// ...
	// You may want to include a post-processing step here to remove 
	// false positive domains using the actual domain values.
	// ...
	// You can call break here to stop processing results.
}

Run Canadian Open Data Benchmark

First you need to download the Canadian Open Data domains and extract the domain files into a directory called _cod_domains by running the following command.

tar xzf canadian_open_data_tabular_domains_only.tar.gz

Use Golang's test tool to start the benchmark:

go test -bench=Benchmark_CanadianOpenData -timeout=24h

The benchmark process is in the following order:

Read the domain files into memory
Run Linear Scan to get the ground truth
Run LSH Ensemble to get the query results
Run the accuracy analysis to generate a report on precisions and recalls

Explanation for the Parameter `MaxK` and Bootstrap Options

MinHash LSH has two parameters K and L (in the paper I used r and b respectively). L is the number of "bands" and K is the number of hash functions per band. The details about the two parameters can be found in Chapter 3 of the textbook, Mining of Massive Datasets.

In LSH Ensemble, we want to allow the K and L of the LSH index in every partition to vary at query time, in order to optimize them for any given query (see Section 5.5 of the paper). We can use achive this by using multiple MinHash LSH, one for each value of K. This allows us to vary the parameter K and L in the following space:

K * L <= number of hash functions (let this be H)
1 <= K <= H

However, this means that for every possible value of K from 1 to H, we need to create a MinHash LSH -- very expensive. So it is not wise to allow K to vary from 1 to H, and that's why we have a MaxK parameter, which bounds K and saves memory. So the new parameter space is:

K * L <= H
1 <= K <= MaxK

It is important to note that it is not the case for L, because we can choose how many "bands" to use at query time.

Now, if we use LSH Forest, we can vary the parameter K from 1 to MaxK at query time with just one LSH. You can read the paper to understand how this can be done (hint: prefix tree). This comes at a price -- the parameter space is more restricted:

MaxK * L <= H
1 <= K <= MaxK

Essentially, we have less freedom in varying L, as 1 <= L <= min{H / MaxK, H} base on the above constraints.

In this library for LSH Ensemble, we provide both implmentations (LSH Forest and "vanilla" MinHash LSH ). Specifically,

BootstrapLshEnsembleEquiDepth and BootstrapLshEnsembleOptimal build the index using the LSH Forest implementation, which use less memory but with a more restricted parameter space for optimization.
BootstrapLshEnsemblePlusEquiDepth and BootstrapLshEnsemblePlusOptimal build the index using the "vanilla" MinHash LSH implementation (one LSH for every K), which uses more memory (bounded by MaxK) but with no restriction on L.

We found that the optimal K for most queries are less than 4. So in practice you can just set MaxK to 4.

Documentation ¶

Index ¶

Constants
Variables
func BytesToSig(data []byte) ([]uint64, error)
func Containment(q, x []uint64, qSize, xSize int) float64
func Recs2Chan(recs []*DomainRecord) <-chan *DomainRecord
func SigToBytes(sig []uint64) []byte
type BySize
- func (rs BySize) Len() int
- func (rs BySize) Less(i, j int) bool
- func (rs BySize) Subset(lower, upper int) []*DomainRecord
- func (rs BySize) Swap(i, j int)
type DomainRecord
type Lsh
type LshEnsemble
- func BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, totalNumDomains int, ...) (*LshEnsemble, error)
- func BootstrapLshEnsembleOptimal(numPart, numHash, maxK int, sortedDomainFactory func() <-chan *DomainRecord) (*LshEnsemble, error)
- func BootstrapLshEnsemblePlusEquiDepth(numPart, numHash, maxK, totalNumDomains int, ...) (*LshEnsemble, error)
- func BootstrapLshEnsemblePlusOptimal(numPart, numHash, maxK int, sortedDomainFactory func() <-chan *DomainRecord) (*LshEnsemble, error)
- func NewLshEnsemble(parts []Partition, numHash, maxK int) *LshEnsemble
- func NewLshEnsemblePlus(parts []Partition, numHash, maxK int) *LshEnsemble
- func (e *LshEnsemble) Add(key interface{}, sig []uint64, partInd int)
- func (e *LshEnsemble) Index()
- func (e *LshEnsemble) Prepare(key interface{}, sig []uint64, size int) error
- func (e *LshEnsemble) Query(sig []uint64, size int, threshold float64, done <-chan struct{}) <-chan interface{}
- func (e *LshEnsemble) QueryTimed(sig []uint64, size int, threshold float64) (result []interface{}, dur time.Duration)
type LshForest
- func NewLshForest16(k, l int) *LshForest
- func NewLshForest32(k, l int) *LshForest
- func NewLshForest64(k, l int) *LshForest
- func (f *LshForest) Add(key interface{}, sig []uint64)
- func (f *LshForest) Index()
- func (f *LshForest) OptimalKL(x, q int, t float64) (optK, optL int, fp, fn float64)
- func (f *LshForest) Query(sig []uint64, K, L int, out chan<- interface{}, done <-chan struct{})
type LshForestArray
- func NewLshForestArray(maxK, numHash int) *LshForestArray
- func (a *LshForestArray) Add(key interface{}, sig []uint64)
- func (a *LshForestArray) Index()
- func (a *LshForestArray) OptimalKL(x, q int, t float64) (optK, optL int, fp, fn float64)
- func (a *LshForestArray) Query(sig []uint64, K, L int, out chan<- interface{}, done <-chan struct{})
type Minhash
- func NewMinhash(seed int64, numHash int) *Minhash
- func (m *Minhash) Push(b []byte)
- func (m *Minhash) Signature() []uint64
type Partition

Constants ¶

View Source

const HashValueSize = 8

HashValueSize is 8, the number of byte used for each hash value

Variables ¶

View Source

var NewLshForest = NewLshForest32

NewLshForest default constructor uses 32 bit hash value

Functions ¶

func BytesToSig ¶

func BytesToSig(data []byte) ([]uint64, error)

BytesToSig converts a byte slice into a signature

func Containment ¶

func Containment(q, x []uint64, qSize, xSize int) float64

Containment returns the estimated containment of |Q \intersect X| / |Q|. q and x are the signatures of Q and X respectively. If either size is 0, the result is defined to be 0.

func Recs2Chan ¶

func Recs2Chan(recs []*DomainRecord) <-chan *DomainRecord

Recs2Chan is a utility function that converts a DomainRecord slice in memory to a DomainRecord channel.

func SigToBytes ¶

func SigToBytes(sig []uint64) []byte

SigToBytes serializes the signature into byte slice

Types ¶

type BySize ¶

type BySize []*DomainRecord

BySize is a wrapper for sorting domains.

func (BySize) Len ¶

func (rs BySize) Len() int

func (BySize) Less ¶

func (rs BySize) Less(i, j int) bool

func (BySize) Subset ¶

func (rs BySize) Subset(lower, upper int) []*DomainRecord

Subset returns a subset of the domains given the size lower bound and upper bound.

func (BySize) Swap ¶

func (rs BySize) Swap(i, j int)

type DomainRecord ¶

type DomainRecord struct {
	// The unique key of this domain.
	Key interface{}
	// The domain size.
	Size int
	// The MinHash signature of this domain.
	Signature []uint64
}

DomainRecord represents a domain record.

type Lsh ¶

type Lsh interface {
	// Add addes a new key into the index, it won't be searchable
	// until the next time Index() is called since the add.
	Add(key interface{}, sig []uint64)
	// Index makes all keys added so far searchable.
	Index()
	// Query searches the index given a minhash signature, and
	// the LSH parameters k and l. Result keys will be written to
	// the channel out.
	// Closing channel done will cancels the query execution.
	Query(sig []uint64, k, l int, out chan<- interface{}, done <-chan struct{})
	// OptimalKL computes the optimal LSH parameters k and l given
	// x, the index domain size, q, the query domain size, and t,
	// the containment threshold. The resulting false positive (fp)
	// and false negative (fn) probabilities are returned as well.
	OptimalKL(x, q int, t float64) (optK, optL int, fp, fn float64)
}

Lsh interface is implemented by LshForst and LshForestArray.

type LshEnsemble ¶

type LshEnsemble struct {
	Partitions []Partition
	// contains filtered or unexported fields
}

LshEnsemble represents an LSH Ensemble index.

func BootstrapLshEnsembleEquiDepth ¶

func BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, totalNumDomains int,
	sortedDomains <-chan *DomainRecord) (*LshEnsemble, error)

BootstrapLshEnsembleEquiDepth builds an index from a channel of domains using equi-depth partitions -- partitions have approximately the same number of domains. The returned index consists of MinHash LSH implemented using LshForest. numPart is the number of partitions to create. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". sortedDomains is a DomainRecord channel emitting domains in sorted order by their sizes.

func BootstrapLshEnsembleOptimal ¶

func BootstrapLshEnsembleOptimal(numPart, numHash, maxK int,
	sortedDomainFactory func() <-chan *DomainRecord) (*LshEnsemble, error)

BootstrapLshEnsembleOptimal builds an index from domains using optimal partitioning. The returned index consists of MinHash LSH implemented using LshForest. numPart is the number of partitions to create. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". sortedDomainFactory is factory function that returns a DomainRecord channel emitting domains in sorted order by their sizes.

func BootstrapLshEnsemblePlusEquiDepth ¶

func BootstrapLshEnsemblePlusEquiDepth(numPart, numHash, maxK,
	totalNumDomains int, sortedDomains <-chan *DomainRecord) (*LshEnsemble, error)

BootstrapLshEnsemblePlusEquiDepth builds an index from a channel of domains using equi-depth partitions -- partitions have approximately the same number of domains. The returned index consists of MinHash LSH implemented using LshForestArray. numPart is the number of partitions to create. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". sortedDomains is a DomainRecord channel emitting domains in sorted order by their sizes.

func BootstrapLshEnsemblePlusOptimal ¶

func BootstrapLshEnsemblePlusOptimal(numPart, numHash, maxK int,
	sortedDomainFactory func() <-chan *DomainRecord) (*LshEnsemble, error)

BootstrapLshEnsemblePlusOptimal builds an index from domains using optimal partitioning. The returned index consists of MinHash LSH implemented using LshForestArray. numPart is the number of partitions to create. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". sortedDomainFactory is factory function that returns a DomainRecord channel emitting domains in sorted order by their sizes.

func NewLshEnsemble ¶

func NewLshEnsemble(parts []Partition, numHash, maxK int) *LshEnsemble

NewLshEnsemble initializes a new index consists of MinHash LSH implemented using LshForest. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band".

func NewLshEnsemblePlus ¶

func NewLshEnsemblePlus(parts []Partition, numHash, maxK int) *LshEnsemble

NewLshEnsemblePlus initializes a new index consists of MinHash LSH implemented using LshForestArray. numHash is the number of hash functions in MinHash. maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band".

func (*LshEnsemble) Add ¶

func (e *LshEnsemble) Add(key interface{}, sig []uint64, partInd int)

Add a new domain to the index given its partition ID - the index of the partition. The added domain won't be searchable until the Index() function is called.

func (*LshEnsemble) Index ¶

func (e *LshEnsemble) Index()

Index makes all added domains searchable.

func (*LshEnsemble) Prepare ¶

func (e *LshEnsemble) Prepare(key interface{}, sig []uint64, size int) error

Prepare adds a new domain to the index given its size, and partition will be selected automatically. It could be more efficient to use Add(). The added domain won't be searchable until the Index() function is called.

func (*LshEnsemble) Query ¶

func (e *LshEnsemble) Query(sig []uint64, size int, threshold float64, done <-chan struct{}) <-chan interface{}

Query returns the candidate domain keys in a channel. This function is given the MinHash signature of the query domain, sig, the domain size, the containment threshold, and a cancellation channel. Closing channel done will cancel the query execution. The query signature must be generated using the same seed as the signatures of the indexed domains, and have the same number of hash functions.

func (*LshEnsemble) QueryTimed ¶

func (e *LshEnsemble) QueryTimed(sig []uint64, size int, threshold float64) (result []interface{}, dur time.Duration)

QueryTimed is similar to Query, returns the candidate domain keys in a slice as well as the running time.

type LshForest ¶

type LshForest struct {
	// contains filtered or unexported fields
}

LshForest represents a MinHash LSH implemented using LSH Forest (http://ilpubs.stanford.edu:8090/678/1/2005-14.pdf). It supports query-time setting of the MinHash LSH parameters L (number of bands) and K (number of hash functions per band).

func NewLshForest16 ¶

func NewLshForest16(k, l int) *LshForest

NewLshForest16 uses 16-bit hash values. MinHash signatures with 64 or 32 bit hash values will have their hash values trimed.

func NewLshForest32 ¶

func NewLshForest32(k, l int) *LshForest

NewLshForest32 uses 32-bit hash values. MinHash signatures with 64 bit hash values will have their hash values trimed.

func NewLshForest64 ¶

func NewLshForest64(k, l int) *LshForest

NewLshForest64 uses 64-bit hash values.

func (*LshForest) Add ¶

func (f *LshForest) Add(key interface{}, sig []uint64)

Add a key with MinHash signature into the index. The key won't be searchable until Index() is called.

func (*LshForest) Index ¶

func (f *LshForest) Index()

Index makes all the keys added searchable.

func (*LshForest) OptimalKL ¶

func (f *LshForest) OptimalKL(x, q int, t float64) (optK, optL int, fp, fn float64)

OptimalKL returns the optimal K and L for containment search, and the false positive and negative probabilities. where x is the indexed domain size, q is the query domain size, and t is the containment threshold.

func (*LshForest) Query ¶

func (f *LshForest) Query(sig []uint64, K, L int, out chan<- interface{}, done <-chan struct{})

Query returns candidate keys given the query signature and parameters.

type LshForestArray ¶

type LshForestArray struct {
	// contains filtered or unexported fields
}

LshForestArray represents a MinHash LSH implemented using an array of LshForest. It allows a wider range for the K and L parameters.

func NewLshForestArray ¶

func NewLshForestArray(maxK, numHash int) *LshForestArray

NewLshForestArray initializes with parameters: maxK is the maximum value for the MinHash parameter K - the number of hash functions per "band". numHash is the number of hash functions in MinHash.

func (*LshForestArray) Add ¶

func (a *LshForestArray) Add(key interface{}, sig []uint64)