polyarray

package module
v0.1.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 13, 2020 License: Apache-2.0 Imports: 7 Imported by: 0

README

polyarray

Travis AppVeyor test

Report card Coverage Status

GoDoc Sourcegraph

PolyArray: space efficient uint32 array. It uses polynomial to compress and store an array. A uint32 costs only 5 bits in a sorted array of a million number in range [0, 1000*1000](17% of original data).

Why

  • Space efficient: In a sorted array, an elt only takes about 10 bits to store a 32-bit int.
== Memory cost stats of sorted random uint array ==

n=1000 rng=[0, 1000]:

           n: 1000
   mem_total: 824
    bits/elt: 6

n=1000000 rng=[0, 1000000]:

           n: 1000000
   mem_total: 702624
    bits/elt: 5

n=1000000 rng=[0, 1000000000]:

           n: 1000000
   mem_total: 2078304
    bits/elt: 16
  • Fast access: A Get takes 10 ns. Run and see the benchmark: go test . -bench=..

  • Adaptive: It does not require the data to be totally sorted to compress it. E.g., PolyArray is perfect to store online user histogram data.

What It Is And What It Is Not

Another space efficient data structure to store uint32 array is trie(Aka prefix tree or radix tree). It is possible to use bitmap-based btree like structure to reduce space(very likely in such case it provides higher compression rate). But it requires the array to be sorted.

PolyArray does not have such restriction. It is more adaptive with data layout. To achieve high compression rate, it only requires the data has a overall trend, e.g., roughly sorted.

Additionally, it also accept duplicated element in the array, which a bitmap based or tree-like data structure does not allow.

Install

go get github.com/openacid/polyarray

Synopsis

Build a PolyArray

package polyarray_test

import (
	"fmt"

	"github.com/openacid/polyarray"
)

func ExamplePolyArray() {

	nums := []uint32{
		0, 16, 32, 48, 64, 79, 95, 111, 126, 142, 158, 174, 190, 206, 222, 236,
		252, 268, 275, 278, 281, 283, 285, 289, 296, 301, 304, 307, 311, 313, 318,
		321, 325, 328, 335, 339, 344, 348, 353, 357, 360, 364, 369, 372, 377, 383,
		387, 393, 399, 404, 407, 410, 415, 418, 420, 422, 426, 430, 434, 439, 444,
		446, 448, 451, 456, 459, 462, 465, 470, 473, 479, 482, 488, 490, 494, 500,
		506, 509, 513, 519, 521, 528, 530, 534, 537, 540, 544, 546, 551, 556, 560,
		566, 568, 572, 574, 576, 580, 585, 588, 592, 594, 600, 603, 606, 608, 610,
		614, 620, 623, 628, 630, 632, 638, 644, 647, 653, 658, 660, 662, 665, 670,
		672, 676, 681, 683, 687, 689, 691, 693, 695, 697, 703, 706, 710, 715, 719,
		722, 726, 731, 735, 737, 741, 748, 750, 753, 757, 763, 766, 768, 775, 777,
		782, 785, 791, 795, 798, 800, 806, 811, 815, 818, 821, 824, 829, 832, 836,
		838, 842, 846, 850, 855, 860, 865, 870, 875, 878, 882, 886, 890, 895, 900,
		906, 910, 913, 916, 921, 925, 929, 932, 937, 940, 942, 944, 946, 952, 954,
		956, 958, 962, 966, 968, 971, 975, 979, 983, 987, 989, 994, 997, 1000,
	}

	a := polyarray.NewPolyArray(nums)

	fmt.Println("last elt is:", a.Get(int32(a.Len()-1)))

	st := a.Stat()
	for _, k := range []string{
		"elt_width",
		"mem_elts",
		"bits/elt"} {
		fmt.Printf("%10s : %d\n", k, st[k])
	}

	// Unordered output:
	// last elt is: 1000
	//  elt_width : 3
	//   mem_elts : 112
	//   bits/elt : 14
}

How it works

package polyarray uses polynomial to compress and store an array of uint32. A uint32 costs only 5 bits in a sorted array of a million number in range [0, 1000*1000].

The General Idea

We use a polynomial y = a + bx + cx² to describe the overall trend of the numbers. And for every number i we add a residual to fit the gap between y(i) and nums[i]. E.g. If there are 4 numbers: 0, 15, 33, 50 The polynomial and residuals are:

y = 16x
0, -1, 1, 2

In this case the residuals require 3 bits for each of them. To retrieve the numbers, we evaluate y(i) and add the residual to it:

get(0) = y(0) + 0 = 16 * 0 + 0 = 0
get(1) = y(1) - 1 = 16 * 1 - 1 = 15
get(2) = y(2) + 1 = 16 * 2 + 1 = 33
get(3) = y(3) + 2 = 16 * 3 + 2 = 50
What It Is And What It Is Not

Another space efficient data structure to store uint32 array is trie or prefix tree or radix tree. It is possible to use bitmap-based btree like structure to reduce space(very likely in such case it provides higher compression rate). But it requires the array to be sorted.

PolyArray does not have such restriction. It is more adaptive with data layout. To achieve high compression rate, it only requires the data has a overall trend, e.g., roughly sorted, as seen in the above 4 integers examples. Additionally, it also accept duplicated element in the array, which a bitmap based or tree-like data structure does not allow.

Data Structure

PolyArray splits the entire array into segments(Seg), each of which has 1024 numbers. And then it splits every segment into several spans. Every span has its own polynomial. A span has 16*k numbers. A segment has at most 64 spans.

        seg[0]                      seg[1]
        1024 nums                   1024 nums
|-------+---------------+---|---------------------------|...
 span[0]    span[1]
 16 nums    32 nums      ..
Uncompacted Data Structures

A PolyArray is a compacted data structure. The original data structures are defined as follow(assumes original user data is nums []uint32):

Seg strcut {
  SpansBitmap   uint64      // describe span layout
  OnesCount     uint64      // count `1` in preceding Seg.
  Spans       []Span
}

Span struct {
  width         int32       // is retrieved from SpansBitmap

  Polynomial [3]double      //
  Config strcut {           //
    Offset        int32     // residual offset
    ResidualWidth int32     // number of bits a residual requires
  }
  Residuals  [width][ResidualWidth]bit // pack into PolyArray.Residuals
}

A span stores 16*k int32 in it, where k ∈ [1, 64).

Seg.SpansBitmap describes the layout of Span-s in a Seg. A "1" at i-th bit and a "1" at j-th bit means a Span stores nums[i*16:j*16], e.g.:

100101110000......
<-- least significant bit

In the above example:

span[0] has 16*3 nums in it.
span[1] has 16*2 nums in it.
span[2] has 16*1 nums in it.

Seg.OnesCount caches the total count of "1" in all preceding Seg.SpansBitmap. This accelerate locating a Span in the packed field PolyArray.Polynomials .

Span.width is the count of numbers stored in this span. It does not need to be stored because it can be calculated by counting the "0" between two "1" in Seg.SpansBitmap.

Span.Polynomial stores 3 coefficients of the polynomial describing the overall trend of this span. I.e. the [a, b, c] in y = a + bx + cx²

Span.Config.Offset adjust the offset to locate a residual. In a span we want to have that:

residual position = Config.Offset + (i%1024) * Config.ResidualWidth

But if the preceding span has smaller residual width, the "offset" could be negative, e.g.: span[0] has residual of width 0 and 16 residuals, span[1] has residual of width 4. Then the "offset" of span[1] is -16*4 in order to satisify: (-16*4) + i * 4 is the correct residual position, for i in [16, 32).

Span.Config.ResidualWidth specifies the number of bits to store every residual in this span, it must be a power of 2: 2^k.

Span.Residuals is an array of residuals of length Span.width. Every elt in it is a ResidualWidth-bits integers.

Compact

PolyArray compact Seg into a dense format:

PolyArray.Bitmap = [
  Seg[0].SpansBitmap,
  Seg[0].OnesCount,
  Seg[1].SpansBitmap,
  Seg[1].OnesCount,
  ... ]

PolyArray.Polynomials = [
  Seg[0].Spans[0].Polynomials,
  Seg[0].Spans[1].Polynomials,
  ...
  Seg[1].Spans[0].Polynomials,
  Seg[1].Spans[1].Polynomials,
  ...
]

PolyArray.Configs = [
  Seg[0].Spans[0].Config
  Seg[0].Spans[1].Config
  ...
  Seg[1].Spans[0].Config
  Seg[1].Spans[1].Config
  ...
]

PolyArray.Residuals simply packs the residuals of every nums[i] together.

Documentation

Overview

package polyarray uses polynomial to compress and store an array of uint32. A uint32 costs only 5 bits in a sorted array of a million number in range [0, 1000*1000].

The General Idea

We use a polynomial y = a + bx + cx² to describe the overall trend of the numbers. And for every number i we add a residual to fit the gap between y(i) and nums[i]. E.g. If there are 4 numbers: 0, 15, 33, 50 The polynomial and residuals are:

y = 16x
0, -1, 1, 2

In this case the residuals require 3 bits for each of them. To retrieve the numbers, we evaluate y(i) and add the residual to it:

get(0) = y(0) + 0 = 16 * 0 + 0 = 0
get(1) = y(1) - 1 = 16 * 1 - 1 = 15
get(2) = y(2) + 1 = 16 * 2 + 1 = 33
get(3) = y(3) + 2 = 16 * 3 + 2 = 50

What It Is And What It Is Not

Another space efficient data structure to store uint32 array is trie or prefix tree or radix tree. It is possible to use bitmap-based btree like structure to reduce space(very likely in such case it provides higher compression rate). But it requires the array to be sorted.

PolyArray does not have such restriction. It is more adaptive with data layout. To achieve high compression rate, it only requires the data has a overall trend, e.g., roughly sorted, as seen in the above 4 integers examples. Additionally, it also accept duplicated element in the array, which a bitmap based or tree-like data structure does not allow.

Data Structure

PolyArray splits the entire array into segments(Seg), each of which has 1024 numbers. And then it splits every segment into several spans. Every span has its own polynomial. A span has 16*k numbers. A segment has at most 64 spans.

        seg[0]                      seg[1]
        1024 nums                   1024 nums
|-------+---------------+---|---------------------------|...
 span[0]    span[1]
 16 nums    32 nums      ..

Uncompacted Data Structures

A PolyArray is a compacted data structure. The original data structures are defined as follow(assumes original user data is `nums []uint32`):

Seg strcut {
  SpansBitmap   uint64      // describe span layout
  OnesCount     uint64      // count `1` in preceding Seg.
  Spans       []Span
}

Span struct {
  width         int32       // is retrieved from SpansBitmap

  Polynomial [3]double      //
  Config strcut {           //
    Offset        int32     // residual offset
    ResidualWidth int32     // number of bits a residual requires
  }
  Residuals  [width][ResidualWidth]bit // pack into PolyArray.Residuals
}

A span stores 16*k int32 in it, where k ∈ [1, 64).

`Seg.SpansBitmap` describes the layout of Span-s in a Seg. A "1" at i-th bit and a "1" at j-th bit means a Span stores `nums[i*16:j*16]`, e.g.:

100101110000......
<-- least significant bit

In the above example:

span[0] has 16*3 nums in it.
span[1] has 16*2 nums in it.
span[2] has 16*1 nums in it.

`Seg.OnesCount` caches the total count of "1" in all preceding Seg.SpansBitmap. This accelerate locating a Span in the packed field PolyArray.Polynomials .

`Span.width` is the count of numbers stored in this span. It does not need to be stored because it can be calculated by counting the "0" between two "1" in `Seg.SpansBitmap`.

`Span.Polynomial` stores 3 coefficients of the polynomial describing the overall trend of this span. I.e. the `[a, b, c]` in `y = a + bx + cx²`

`Span.Config.Offset` adjust the offset to locate a residual. In a span we want to have that:

residual position = Config.Offset + (i%1024) * Config.ResidualWidth

But if the preceding span has smaller residual width, the "offset" could be negative, e.g.: span[0] has residual of width 0 and 16 residuals, span[1] has residual of width 4. Then the "offset" of span[1] is `-16*4` in order to satisify: `(-16*4) + i * 4` is the correct residual position, for i in [16, 32).

`Span.Config.ResidualWidth` specifies the number of bits to store every residual in this span, it must be a power of 2: `2^k`.

`Span.Residuals` is an array of residuals of length `Span.width`. Every elt in it is a `ResidualWidth`-bits integers.

Compact

PolyArray compact `Seg` into a dense format:

PolyArray.Bitmap = [
  Seg[0].SpansBitmap,
  Seg[0].OnesCount,
  Seg[1].SpansBitmap,
  Seg[1].OnesCount,
  ... ]

PolyArray.Polynomials = [
  Seg[0].Spans[0].Polynomials,
  Seg[0].Spans[1].Polynomials,
  ...
  Seg[1].Spans[0].Polynomials,
  Seg[1].Spans[1].Polynomials,
  ...
]

PolyArray.Configs = [
  Seg[0].Spans[0].Config
  Seg[0].Spans[1].Config
  ...
  Seg[1].Spans[0].Config
  Seg[1].Spans[1].Config
  ...
]

`PolyArray.Residuals` simply packs the residuals of every nums[i] together.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type PolyArray

type PolyArray struct {
	// N is the count of elts
	N int32 `protobuf:"varint,10,opt,name=N,proto3" json:"N,omitempty"`
	// Every 1024 elts segment has a 64-bit bitmap to describe the spans in it,
	// and another 64-bit rank: the count of `1` in preceding bitmaps.
	Bitmap []uint64 `protobuf:"varint,20,rep,packed,name=Bitmap,proto3" json:"Bitmap,omitempty"`
	// Polynomial and config of every span.
	// 3 doubles to represent a polynomial;
	Polynomials []float64 `protobuf:"fixed64,21,rep,packed,name=Polynomials,proto3" json:"Polynomials,omitempty"`
	// Config stores the offset of residuals in Residuals and the bit width to
	// store a residual in a span.
	Configs []int64 `protobuf:"varint,22,rep,packed,name=Configs,proto3" json:"Configs,omitempty"`
	// packed residuals for every elt.
	Residuals            []uint64 `protobuf:"varint,23,rep,packed,name=Residuals,proto3" json:"Residuals,omitempty"`
	XXX_NoUnkeyedLiteral struct{} `json:"-"`
	XXX_unrecognized     []byte   `json:"-"`
	XXX_sizecache        int32    `json:"-"`
}

PolyArray compresses a uint32 array with overall trend by describing the trend with a polynomial, e.g., to store a sorted array is very common in practice. Such as an block-list of IP addresses, or a series of var-length record position on disk.

E.g. a uint32 costs only 5 bits in average in a sorted array of a million number in range [0, 1000*1000].

In addition to the unbelievable low memory footprint, a `Get` access is also very fast: it takes only 10 nano second in our benchmark.

PolyArray is also ready for transport since it is defined with protobuf. E.g.:

a := polyarray.NewPolyArray([]uint32{1, 2, 3})
bytes, err := proto.Marshal(a)

Since 0.1.1

Example
package main

import (
	"fmt"

	"github.com/openacid/polyarray"
)

func main() {

	nums := []uint32{
		0, 16, 32, 48, 64, 79, 95, 111, 126, 142, 158, 174, 190, 206, 222, 236,
		252, 268, 275, 278, 281, 283, 285, 289, 296, 301, 304, 307, 311, 313, 318,
		321, 325, 328, 335, 339, 344, 348, 353, 357, 360, 364, 369, 372, 377, 383,
		387, 393, 399, 404, 407, 410, 415, 418, 420, 422, 426, 430, 434, 439, 444,
		446, 448, 451, 456, 459, 462, 465, 470, 473, 479, 482, 488, 490, 494, 500,
		506, 509, 513, 519, 521, 528, 530, 534, 537, 540, 544, 546, 551, 556, 560,
		566, 568, 572, 574, 576, 580, 585, 588, 592, 594, 600, 603, 606, 608, 610,
		614, 620, 623, 628, 630, 632, 638, 644, 647, 653, 658, 660, 662, 665, 670,
		672, 676, 681, 683, 687, 689, 691, 693, 695, 697, 703, 706, 710, 715, 719,
		722, 726, 731, 735, 737, 741, 748, 750, 753, 757, 763, 766, 768, 775, 777,
		782, 785, 791, 795, 798, 800, 806, 811, 815, 818, 821, 824, 829, 832, 836,
		838, 842, 846, 850, 855, 860, 865, 870, 875, 878, 882, 886, 890, 895, 900,
		906, 910, 913, 916, 921, 925, 929, 932, 937, 940, 942, 944, 946, 952, 954,
		956, 958, 962, 966, 968, 971, 975, 979, 983, 987, 989, 994, 997, 1000,
	}

	a := polyarray.NewPolyArray(nums)

	fmt.Println("last elt is:", a.Get(int32(a.Len()-1)))

	st := a.Stat()
	for _, k := range []string{
		"elt_width",
		"mem_elts",
		"bits/elt"} {
		fmt.Printf("%10s : %d\n", k, st[k])
	}

}
Output:

last elt is: 1000
 elt_width : 3
  mem_elts : 112
  bits/elt : 14

func NewPolyArray

func NewPolyArray(nums []uint32) *PolyArray

NewPolyArray creates a "PolyArray" array from a slice of uint32.

Since 0.1.1

func (*PolyArray) Descriptor

func (*PolyArray) Descriptor() ([]byte, []int)

func (*PolyArray) Get

func (m *PolyArray) Get(i int32) uint32

Get returns the uncompressed uint32 value. A Get() costs about 10 ns

Since 0.1.1

func (*PolyArray) GetBitmap

func (m *PolyArray) GetBitmap() []uint64

func (*PolyArray) GetConfigs

func (m *PolyArray) GetConfigs() []int64

func (*PolyArray) GetN

func (m *PolyArray) GetN() int32

func (*PolyArray) GetPolynomials

func (m *PolyArray) GetPolynomials() []float64

func (*PolyArray) GetResiduals

func (m *PolyArray) GetResiduals() []uint64

func (*PolyArray) Len

func (m *PolyArray) Len() int

Len returns number of elements.

Since 0.1.1

func (*PolyArray) ProtoMessage

func (*PolyArray) ProtoMessage()

func (*PolyArray) Reset

func (m *PolyArray) Reset()

func (*PolyArray) Stat

func (m *PolyArray) Stat() map[string]int32

Stat returns a map describing memory usage.

seg_cnt   :512         // segment count
elt_width :8           // average bits count per elt
span_cnt  :12          // total count of spans
spans/seg :7           // average span count per segment
mem_elts  :1048576     // memory cost for residuals
mem_total :1195245     // total memory cost
bits/elt  :9           // average memory cost per elt
n         :10          // total elt count

Since 0.1.1

Example
package main

import (
	"fmt"
	"math/rand"
	"sort"

	"github.com/openacid/polyarray"
)

func main() {

	fmt.Println("== Memory cost stats of sorted random uint array ==")

	cases := []struct {
		n   int
		rng uint32
	}{
		{1000, 1000},
		{1000 * 1000, 1000 * 1000},
		{1000 * 1000, 1000 * 1000 * 1000},
	}

	for _, c := range cases {
		n := c.n
		rng := c.rng

		nums := []uint32{}
		rnd := rand.New(rand.NewSource(int64(n) * int64(rng)))

		for i := 0; i < n; i++ {
			s := uint32(rnd.Float64() * float64(rng))
			nums = append(nums, s)
		}

		sort.Slice(nums, func(i, j int) bool { return nums[i] < nums[j] })

		a := polyarray.NewPolyArray(nums)

		st := a.Stat()
		fmt.Printf("\nn=%d rng=[0, %d]:\n\n", n, rng)

		for _, k := range []string{
			// "mem_elts", "span_cnt", "spans/seg", "elt_width",
			"n", "mem_total", "bits/elt",
		} {
			fmt.Printf("  %10s: %d\n", k, st[k])
		}
	}

}
Output:

== Memory cost stats of sorted random uint array ==

n=1000 rng=[0, 1000]:

           n: 1000
   mem_total: 824
    bits/elt: 6

n=1000000 rng=[0, 1000000]:

           n: 1000000
   mem_total: 702624
    bits/elt: 5

n=1000000 rng=[0, 1000000000]:

           n: 1000000
   mem_total: 2078304
    bits/elt: 16

func (*PolyArray) String

func (m *PolyArray) String() string

func (*PolyArray) XXX_DiscardUnknown

func (m *PolyArray) XXX_DiscardUnknown()

func (*PolyArray) XXX_Marshal

func (m *PolyArray) XXX_Marshal(b []byte, deterministic bool) ([]byte, error)

func (*PolyArray) XXX_Merge

func (dst *PolyArray) XXX_Merge(src proto.Message)

func (*PolyArray) XXX_Size

func (m *PolyArray) XXX_Size() int

func (*PolyArray) XXX_Unmarshal

func (m *PolyArray) XXX_Unmarshal(b []byte) error

Directories

Path Synopsis
Package polyfit models a polynomial y from sample points xs and ys, to minimize the squared residuals.
Package polyfit models a polynomial y from sample points xs and ys, to minimize the squared residuals.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL