hll

package module
v0.0.0-...-0e45c28 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 4, 2016 License: Apache-2.0 Imports: 12 Imported by: 0

README

HyperLogLog++ for Go

This is a Go implementation of the HyperLogLog++ algorithm from "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" by Heule, Nunkesser and Hall of Google. This is a cardinality estimation algorithm: given a stream of input elements, it will estimate the number of unique items in the stream. The estimation error can be controlled by choosing how much memory to use. HyperLogLog++ improves on the basic HyperLogLog algorithm by using less space, improving accuracy, and correcting bias.

This code is a translation of the pseudocode contained in Figures 6 and 7 of the Google paper. Not all algorithms are provided in the paper, but we've tried our best to be true to the authors' intent when writing the omitted algorithms. We're not trying to be creative, we're just implementing the algorithm described in the paper as directly as possible. Our deviations are described here.

The HyperLogLog++ paper is available here

Instructions

See the docs.

Documentation

Overview

This is a Go implementation of the HyperLogLog++ algorithm from "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" by Heule, Nunkesser and Hall of Google. This is a cardinality estimation algorithm: given a stream of input elements, it will estimate the number of unique items in the stream. The estimation error can be controlled by choosing how much memory to use. HyperLogLog++ improves on the basic HyperLogLog algorithm by using less space, improving accuracy, and correcting bias.

This code is a translation of the pseudocode contained in Figures 6 and 7 of the Google paper. Not all algorithms are provided in the paper, but we've tried our best to be true to the authors' intent when writing the omitted algorithms. We're not trying to be creative, we're just implementing the algorithm described in the paper as directly as possible.

The HyperLogLog++ paper is available at http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf

Package hll is a generated protocol buffer package.

It is generated from these files:
	hll.proto

It has these top-level messages:
	HllPb
Example
const (
	p           = 14 // Max memory usage is 0.75 * 2^p bytes
	pPrime      = 25 // Setting this is a bit more complicated, Google recommends 25.
	numToInsert = 1000000
)

// You can use any good hash function, truncated to 8 bytes to fit in a uint64.
hashU64 := func(s string) uint64 {
	sha1Hash := sha1.Sum([]byte(s))
	return binary.LittleEndian.Uint64(sha1Hash[0:8])
}

hll := NewHll(p, pPrime)

// For this example, our inputs will just be strings, e.g. "1", "2"
for i := 0; i < numToInsert; i++ {
	inputString := strconv.Itoa(i)

	// To use HLL, you hash your item, convert the hash to uint64, and pass it to Add().
	hll.Add(hashU64(inputString))
}

// Duplicates do not affect the cardinality. The following loop has no effect.
for i := 0; i < 10000; i++ {
	hll.Add(hashU64("1"))
}

// We inserted 1M unique elements, the cardinality should be roughly 1M.
fmt.Printf("%d\n", hll.Cardinality())
Output:

989546

Index

Examples

Constants

This section is empty.

Variables

View Source
var (
	ErrInvalidLengthHll = fmt.Errorf("proto: negative length found during unmarshaling")
)

Functions

This section is empty.

Types

type Hll

type Hll struct {
	// contains filtered or unexported fields
}

func NewHll

func NewHll(p, pPrime uint) *Hll

Initialize a new hyper-log-log struct based on inputs p and p'. Google recommends that p be set to 14, and p' to equal either 20 or 25.

func (*Hll) Add

func (h *Hll) Add(x uint64)

Add takes a hash and updates the cardinality estimation data structures.

The input should be a hash of whatever type you're estimating of. For example, if you're estimating the cardinality of a stream of strings, you'd pass the hash of each string to this function.

func (*Hll) Cardinality

func (h *Hll) Cardinality() uint64

Returns the estimated cardinality (the number of unique inputs seen so far).

func (*Hll) Combine

func (h *Hll) Combine(other *Hll)

Combine() merges two HyperLogLog++ calculations. This allows you to parallelize cardinality estimation: each thread can process a shard of the input, then the results can be merged later to give the cardinality of the entire data set (the union of the shards).

WARNING: The "other" parameter may be mutated during this call! It may be converted from a sparse to dense representation, which may affect its space usage and precision. This is a deliberate design decision that helps to minimize memory consumption.

The inputs must have the same p and pPrime or this function will panic. The Google paper doesn't give an algorithm for this operation, but its existence is implied, and the ability to do this combine operation is one of the main benefits of using a HyperLogLog-type algorithm in the first place.

func (*Hll) GetBSON

func (h *Hll) GetBSON() (interface{}, error)

func (*Hll) GobDecode

func (h *Hll) GobDecode(data []byte) error

func (*Hll) GobEncode

func (h *Hll) GobEncode() ([]byte, error)

custom gob encoder and decoders

func (*Hll) MarshalJSON

func (h *Hll) MarshalJSON() ([]byte, error)

func (*Hll) MarshalPb

func (h *Hll) MarshalPb() ([]byte, error)

func (*Hll) SetBSON

func (h *Hll) SetBSON(raw bson.Raw) error

func (*Hll) UnmarshalJSON

func (h *Hll) UnmarshalJSON(buf []byte) error

func (*Hll) UnmarshalPb

func (h *Hll) UnmarshalPb(buf []byte) error

type HllPb

type HllPb struct {
	P                *int32       `protobuf:"varint,1,req,name=p" json:"p,omitempty"`
	Pp               *int32       `protobuf:"varint,2,req,name=pp" json:"pp,omitempty"`
	M                []byte       `protobuf:"bytes,3,opt" json:"M,omitempty"`
	S                *HllPbSparse `protobuf:"bytes,4,opt,name=s" json:"s,omitempty"`
	XXX_unrecognized []byte       `json:"-"`
}

func (*HllPb) GetM

func (m *HllPb) GetM() []byte

func (*HllPb) GetP

func (m *HllPb) GetP() int32

func (*HllPb) GetPp

func (m *HllPb) GetPp() int32

func (*HllPb) GetS

func (m *HllPb) GetS() *HllPbSparse

func (*HllPb) Marshal

func (m *HllPb) Marshal() (data []byte, err error)

func (*HllPb) MarshalTo

func (m *HllPb) MarshalTo(data []byte) (int, error)

func (*HllPb) ProtoMessage

func (*HllPb) ProtoMessage()

func (*HllPb) Reset

func (m *HllPb) Reset()

func (*HllPb) Size

func (m *HllPb) Size() (n int)

func (*HllPb) String

func (m *HllPb) String() string

func (*HllPb) Unmarshal

func (m *HllPb) Unmarshal(data []byte) error

type HllPbSparse

type HllPbSparse struct {
	Buf              []byte  `protobuf:"bytes,1,opt,name=buf" json:"buf,omitempty"`
	LastVal          *uint64 `protobuf:"varint,2,req,name=lastVal" json:"lastVal,omitempty"`
	NumElements      *uint64 `protobuf:"varint,3,req,name=numElements" json:"numElements,omitempty"`
	XXX_unrecognized []byte  `json:"-"`
}

func (*HllPbSparse) GetBuf

func (m *HllPbSparse) GetBuf() []byte

func (*HllPbSparse) GetLastVal

func (m *HllPbSparse) GetLastVal() uint64

func (*HllPbSparse) GetNumElements

func (m *HllPbSparse) GetNumElements() uint64

func (*HllPbSparse) Marshal

func (m *HllPbSparse) Marshal() (data []byte, err error)

func (*HllPbSparse) MarshalTo

func (m *HllPbSparse) MarshalTo(data []byte) (int, error)

func (*HllPbSparse) ProtoMessage

func (*HllPbSparse) ProtoMessage()

func (*HllPbSparse) Reset

func (m *HllPbSparse) Reset()

func (*HllPbSparse) Size

func (m *HllPbSparse) Size() (n int)

func (*HllPbSparse) String

func (m *HllPbSparse) String() string

func (*HllPbSparse) Unmarshal

func (m *HllPbSparse) Unmarshal(data []byte) error

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL