tdigest

README

T-Digest

A fast map-reduce and parallel streaming friendly data-structure for accurate quantile approximation.

This package provides an implementation of Ted Dunning's t-digest data structure in Go.


Project Status

This project is actively maintained. We are happy to collaborate on features and issues if/when they arise.

Installation

Our releases are tagged and signed following the Semantic Versioning scheme. If you are using a dependency manager such as dep, the recommended approach is to go about your business as usual:

go get github.com/caio/go-tdigest

Otherwise, we recommend using the following so that you don't risk breaking your build because of an API change:

go get gopkg.in/caio/go-tdigest.v2

Example Usage

package main

import (
	"fmt"
	"math/rand"

	"github.com/caio/go-tdigest"
)

func main() {
	// Equivalent to tdigest.New(tdigest.Compression(100))
	t, _ := tdigest.New()

	for i := 0; i < 10000; i++ {
		// Equivalent to t.AddWeighted(rand.Float64(), 1)
		t.Add(rand.Float64())
	}

	fmt.Printf("p(.5) = %.6f\n", t.Quantile(0.5))
	fmt.Printf("CDF(Quantile(.5)) = %.6f\n", t.CDF(t.Quantile(0.5)))
}

Configuration

You can configure your digest upon creation with options documented at options.go. Example:

// Construct a digest with compression=200 and its own
// (thread-unsafe) RNG seeded with 0xCA10:
digest, _ := tdigest.New(
        tdigest.Compression(200),
        tdigest.LocalRandomNumberGenerator(0xCA10),
)

Porting Existing Code to the v2 API

It's very easy to migrate to the new API; a short before/after sketch follows this list:

  • Replace tdigest.New(100) with tdigest.New()
  • Replace tdigest.New(number) with tdigest.New(tdigest.Compression(number))
  • Replace Add(x,1) with Add(x)
  • Replace Add(x, weight) with AddWeighted(x, weight)
  • Remove any use of tdigest.Len() (or open an issue)
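
For illustration, here is a hedged before/after sketch of the replacements above (error handling omitted; the sample values are made up):

// v1:
//   t := tdigest.New(100)
//   t.Add(42.0, 1)
//   t.Add(37.5, 3)

// v2 equivalent:
t, _ := tdigest.New(tdigest.Compression(100))
t.Add(42.0)
t.AddWeighted(37.5, 3)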

References

This is a port of the reference implementation with some ideas borrowed from the Python version. If you want a quick grasp of how it works and why it's useful, this video and its companion article are pretty helpful.

Documentation

Overview

Package tdigest provides a highly accurate mergeable data-structure for quantile estimation.

Typical T-Digest use cases involve accumulating metrics on several distinct nodes of a cluster and then merging them together to get a system-wide quantile overview. Think of sensor data from IoT devices, quantiles over enormous document datasets (think Elasticsearch), performance metrics for distributed systems, and so on.

After you create (and configure, if desired) the digest:

digest, err := tdigest.New(tdigest.Compression(100))

You can then use it for registering measurements:

digest.Add(number)

Estimating quantiles:

digest.Quantile(0.99)

And merging with another digest:

digest.Merge(otherDigest)
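
To make that flow concrete, here is a minimal sketch (assuming fmt and math/rand are imported, with made-up sample data) that merges two independently built digests and queries a global quantile:

nodeA, _ := tdigest.New()
nodeB, _ := tdigest.New()

for i := 0; i < 1000; i++ {
	nodeA.Add(rand.Float64())     // samples collected on node A
	nodeB.Add(2 * rand.Float64()) // samples collected on node B
}

global, _ := tdigest.New()
global.Merge(nodeA)
global.Merge(nodeB)

fmt.Printf("global p99 = %.6f\n", global.Quantile(0.99))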

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Compression

func Compression(compression float64) tdigestOption

Compression sets the digest compression

The compression parameter governs the threshold at which samples are merged together - the more often distinct samples are merged, the more precision is lost. Compression should be tuned according to your data distribution, but a value of 100 (the default) is often good enough.

A higher compression value means holding more centroids in memory (thus: better precision), which means a bigger serialization payload, higher memory footprint and slower addition of new samples.

Compression must be a value greater than or equal to 1; any other value will yield an error.
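
A minimal illustrative sketch of both points (the values are arbitrary; fmt import assumed):

// compression below 1 is rejected
if _, err := tdigest.New(tdigest.Compression(0.5)); err != nil {
	fmt.Println("invalid compression:", err)
}

// a larger compression keeps more centroids: better accuracy, more memory
precise, _ := tdigest.New(tdigest.Compression(200))
fmt.Println(precise.Compression()) // 200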

func LocalRandomNumberGenerator

func LocalRandomNumberGenerator(seed int64) tdigestOption

LocalRandomNumberGenerator makes the TDigest use the default `math/rand` functions but with an unshared source that is seeded with the given `seed` parameter.

func RandomNumberGenerator

func RandomNumberGenerator(rng RNG) tdigestOption

RandomNumberGenerator sets the RNG to be used internally

This allows changing which random number source the TDigest structure uses (the RNG is used when deciding which candidate centroid to merge with and when compressing or merging with another digest, since randomization increases accuracy). This is particularly useful for testing, or when you want to disconnect your sample collection from the (default) shared random source to minimize lock contention.
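
A minimal sketch of plugging in a custom source. fixedRNG is a hypothetical, deterministic implementation used only for this example; it satisfies the RNG interface documented below (note that *rand.Rand from math/rand also satisfies it):

// fixedRNG is a hypothetical deterministic RNG for reproducible tests.
// It implements the RNG interface (Float32 and Intn) documented below.
type fixedRNG struct{}

func (fixedRNG) Float32() float32 { return 0.5 }
func (fixedRNG) Intn(n int) int   { return n / 2 }

// ...

digest, _ := tdigest.New(tdigest.RandomNumberGenerator(fixedRNG{}))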

Types

type RNG

type RNG interface {
	Float32() float32
	Intn(int) int
}

RNG is an interface that wraps the needed random number generator calls that tdigest uses during its runtime

type TDigest

type TDigest struct {
	// contains filtered or unexported fields
}

TDigest is a quantile approximation data structure.

func FromBytes

func FromBytes(buf *bytes.Reader, options ...tdigestOption) (*TDigest, error)

FromBytes reads a byte buffer with a serialized digest (from AsBytes) and deserializes it.

This function creates a new tdigest instance with the provided options, but ignores the compression setting since the correct value comes from the buffer.
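
A round-trip sketch through AsBytes and FromBytes (assuming the bytes package is imported and a populated digest is in scope; error handling kept minimal):

payload, _ := digest.AsBytes()

restored, err := tdigest.FromBytes(bytes.NewReader(payload))
if err != nil {
	panic(err) // e.g. a corrupt or truncated buffer
}
fmt.Println(restored.Quantile(0.5))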

func New

func New(options ...tdigestOption) (*TDigest, error)

New creates a new digest.

By default the digest is constructed with a configuration that should be useful for most use-cases. It comes with compression set to 100 and uses a local random number generator for performance reasons.

func (*TDigest) Add added in v0.3.0

func (t *TDigest) Add(value float64) error

Add is an alias for AddWeighted(x,1). Read the documentation for AddWeighted for more details.

func (*TDigest) AddWeighted

func (t *TDigest) AddWeighted(value float64, count uint64) (err error)

AddWeighted registers a new sample in the digest.

It's the main entry point for the digest and very likely the only method to be used for collecting samples. The count parameter is for when you are registering a sample that occurred multiple times - the most common value for this is 1.

This will yield an error if `value` is NaN or if `count` is zero.
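
For example, when registering pre-aggregated observations (a minimal sketch; the values and counts below are made up):

// the value 120 was observed 57 times, 250 was observed 3 times
if err := digest.AddWeighted(120, 57); err != nil {
	panic(err) // NaN value or zero count
}
digest.AddWeighted(250, 3)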

func (TDigest) AsBytes

func (t TDigest) AsBytes() ([]byte, error)

AsBytes serializes the digest into a byte array so it can be saved to disk or sent over the wire.

func (*TDigest) CDF

func (t *TDigest) CDF(value float64) float64

CDF computes the fraction of samples that are less than or equal to the given value.

func (*TDigest) Clone

func (t *TDigest) Clone() *TDigest

Clone returns a deep copy of a TDigest.

func (*TDigest) Compress

func (t *TDigest) Compress() (err error)

Compress tries to reduce the number of individual centroids stored in the digest.

Compression trades off accuracy for performance and happens automatically after a certain number of distinct samples have been stored.

At any point in time you may call Compress on a digest, but you may also ignore it completely: the digest compresses itself automatically once it grows too much. If you are trying to minimize network traffic it might be a good idea to compress before serializing.
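
A sketch of compressing right before shipping a digest over the wire (sendToAggregator is a hypothetical placeholder for your own transport):

digest.Compress() // shrink the centroid list before serializing
payload, _ := digest.AsBytes()
sendToAggregator(payload) // hypothetical transport function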

func (*TDigest) Compression

func (t *TDigest) Compression() float64

Compression returns the TDigest compression.

func (TDigest) Count

func (t TDigest) Count() uint64

Count returns the total number of samples this digest represents

The result represents how many times Add() was called on a digest plus how many samples the digests it has been merged with had. This is useful mainly for two scenarios:

- Knowing if there is enough data so you can trust the quantiles

- Knowing if you've registered too many samples already and deciding what to do about it.

For the second case, one approach would be to create an empty side digest, register new samples on it as well as on the old (big) one, and then discard the bigger one after a certain criterion is reached (say, a minimum number of samples or a small relative error between the new and old digests).
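
A rough sketch of that rotation idea, assuming big, side and sample already exist in scope and using an arbitrary placeholder threshold:

const maxSamples = 10000000 // arbitrary placeholder threshold

big.Add(sample)
if big.Count() > maxSamples && side == nil {
	side, _ = tdigest.New() // start collecting in parallel
}
if side != nil {
	side.Add(sample)
	// once side has seen enough samples (or agrees closely with big
	// on the quantiles you care about), swap: big, side = side, nil
}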

func (*TDigest) ForEachCentroid added in v1.1.0

func (t *TDigest) ForEachCentroid(f func(mean float64, count uint64) bool)

ForEachCentroid calls the specified function for each centroid.

Iteration stops when the supplied function returns false, or when all centroids have been iterated.
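
For instance, summing the stored weight across all centroids (a minimal sketch; returning false from the callback would stop early):

var total uint64
digest.ForEachCentroid(func(mean float64, count uint64) bool {
	total += count
	return true // return false here to stop the iteration early
})
fmt.Println("total weight across centroids:", total)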

func (*TDigest) FromBytes

func (t *TDigest) FromBytes(buf []byte) error

FromBytes deserializes into the supplied TDigest struct, re-using and overwriting any existing buffers.

This method reinitializes the digest from the provided buffer, discarding any previously collected data. Notice that in case of errors this may leave the digest in an unusable state.

func (*TDigest) Merge

func (t *TDigest) Merge(other *TDigest) (err error)

Merge joins a given digest into itself.

Merging is useful when you have multiple TDigest instances running in separate threads and you want to compute quantiles over all the samples. This is particularly important in a scatter-gather/map-reduce scenario.
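
A minimal scatter-gather sketch, assuming each worker owns its own digest so no locking is needed until the final merge (fmt and math/rand imports assumed):

results := make(chan *tdigest.TDigest, 4)

for w := 0; w < 4; w++ {
	go func() {
		local, _ := tdigest.New()
		for i := 0; i < 100000; i++ {
			local.Add(rand.Float64()) // each worker samples independently
		}
		results <- local
	}()
}

global, _ := tdigest.New()
for w := 0; w < 4; w++ {
	global.Merge(<-results)
}
fmt.Println("p95 across all workers:", global.Quantile(0.95))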

func (*TDigest) MergeDestructive

func (t *TDigest) MergeDestructive(other *TDigest) (err error)

MergeDestructive joins a given digest into itself rendering the other digest invalid.

This works like Merge above but is faster. Using this method requires caution as it renders 'other' unusable - you must make sure you discard it without making further use of it.

func (*TDigest) Quantile added in v1.0.0

func (t *TDigest) Quantile(q float64) float64

Quantile returns the desired percentile estimation.

Values of q must be between 0 and 1 (inclusive); it will panic otherwise.

func (*TDigest) ToBytes

func (t *TDigest) ToBytes(b []byte) []byte

ToBytes serializes into the supplied slice, avoiding allocation if the slice is large enough. The result slice is returned.
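
A sketch of reusing a scratch buffer across serializations (the capacity is a guess; an undersized buffer simply triggers a fresh allocation):

scratch := make([]byte, 0, 4096)
payload := digest.ToBytes(scratch) // reuses scratch's backing array if it fits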

func (*TDigest) TrimmedMean

func (t *TDigest) TrimmedMean(p1, p2 float64) float64

TrimmedMean returns the mean of the distribution between the two percentiles p1 and p2.

Values of p1 and p2 must be between 0 and 1 (inclusive) and p1 must be less than p2; it will panic otherwise.
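
For example, an interquartile mean that ignores everything below p25 and above p75 (fmt import assumed):

iqm := digest.TrimmedMean(0.25, 0.75)
fmt.Printf("interquartile mean = %.6f\n", iqm)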
