Documentation
¶
Overview ¶
Package slimarray uses polynomial to compress and store an array of uint32. A uint32 costs only 5 bits in a sorted array of a million number in range [0, 1000*1000].
The General Idea ¶
We use a polynomial y = a + bx + cx² to describe the overall trend of the numbers. And for every number i we add a residual to fit the gap between y(i) and nums[i]. E.g. If there are 4 numbers: 0, 15, 33, 50 The polynomial and residuals are:
y = 16x 0, -1, 1, 2
In this case the residuals require 3 bits for each of them. To retrieve the numbers, we evaluate y(i) and add the residual to it:
get(0) = y(0) + 0 = 16 * 0 + 0 = 0 get(1) = y(1) - 1 = 16 * 1 - 1 = 15 get(2) = y(2) + 1 = 16 * 2 + 1 = 33 get(3) = y(3) + 2 = 16 * 3 + 2 = 50
What It Is And What It Is Not ¶
Another space efficient data structure to store uint32 array is trie or prefix tree or radix tree. It is possible to use bitmap-based btree like structure to reduce space(very likely in such case it provides higher compression rate). But it requires the array to be sorted.
SlimArray does not have such restriction. It is more adaptive with data layout. To achieve high compression rate, it only requires the data has a overall trend, e.g., roughly sorted, as seen in the above 4 integers examples. Additionally, it also accept duplicated element in the array, which a bitmap based or tree-like data structure does not allow.
Data Structure ¶
SlimArray splits the entire array into segments(Seg), each of which has 1024 numbers. And then it splits every segment into several spans. Every span has its own polynomial. A span has 16*k numbers. A segment has at most 64 spans.
seg[0] seg[1] 1024 nums 1024 nums |-------+---------------+---|---------------------------|... span[0] span[1] 16 nums 32 nums ..
Uncompressed Data Structures ¶
A SlimArray is a compacted data structure. The original data structures are defined as follow(assumes original user data is `nums []uint32`):
Seg struct { SpansBitmap uint64 // describe span layout Rank uint64 // count `1` in preceding Seg. Spans []Span } Span struct { width int32 // is retrieved from SpansBitmap Polynomial [3]double // Config struct { // Offset int32 // residual offset ResidualWidth int32 // number of bits a residual requires } Residuals [width][ResidualWidth]bit // pack into SlimArray.Residuals }
A span stores 16*k int32 in it, where k ∈ [1, 64).
`Seg.SpansBitmap` describes the layout of Span-s in a Seg. The i-th "1" indicates where the last 16 numbers are in the i-th Span. e.g.:
001011110000...... <-- least significant bit
In the above example:
span[0] has 16*3 nums in it. span[1] has 16*2 nums in it. span[2] has 16*1 nums in it.
`Seg.Rank` caches the total count of "1" in all preceding Seg.SpansBitmap. This accelerate locating a Span in the packed field SlimArray.Polynomials .
`Span.width` is the count of numbers stored in this span. It does not need to be stored because it can be calculated by counting the "0" between two "1" in `Seg.SpansBitmap`.
`Span.Polynomial` stores 3 coefficients of the polynomial describing the overall trend of this span. I.e. the `[a, b, c]` in `y = a + bx + cx²`
`Span.Config.Offset` adjust the offset to locate a residual. In a span we want to have that:
residual position = Config.Offset + (i%1024) * Config.ResidualWidth
But if the preceding span has smaller residual width, the "offset" could be negative, e.g.: span[0] has residual of width 0 and 16 residuals, span[1] has residual of width 4. Then the "offset" of span[1] is `-16*4` in order to satisfy: `(-16*4) + i * 4` is the correct residual position, for i in [16, 32).
`Span.Config.ResidualWidth` specifies the number of bits to store every residual in this span, it must be a power of 2: `2^k`.
`Span.Residuals` is an array of residuals of length `Span.width`. Every elt in it is a `ResidualWidth`-bits integers.
Compact ¶
SlimArray compact `Seg` into a dense format:
SlimArray.Bitmap = [ Seg[0].SpansBitmap, Seg[1].SpansBitmap, ... ] SlimArray.Polynomials = [ Seg[0].Spans[0].Polynomials, Seg[0].Spans[1].Polynomials, ... Seg[1].Spans[0].Polynomials, Seg[1].Spans[1].Polynomials, ... ] SlimArray.Configs = [ Seg[0].Spans[0].Config Seg[0].Spans[1].Config ... Seg[1].Spans[0].Config Seg[1].Spans[1].Config ... ]
`SlimArray.Residuals` simply packs the residuals of every nums[i] together.
Index ¶
- Variables
- type SlimArray
- func (*SlimArray) Descriptor() ([]byte, []int)deprecated
- func (sm *SlimArray) Get(i int32) uint32
- func (x *SlimArray) GetBitmap() []uint64
- func (x *SlimArray) GetConfigs() []int64
- func (x *SlimArray) GetN() int32
- func (x *SlimArray) GetPolynomials() []float64
- func (x *SlimArray) GetRank() []uint64
- func (x *SlimArray) GetResiduals() []uint64
- func (sm *SlimArray) Len() int
- func (*SlimArray) ProtoMessage()
- func (x *SlimArray) ProtoReflect() protoreflect.Message
- func (x *SlimArray) Reset()
- func (sm *SlimArray) Slice(start int32, end int32, rst []uint32)
- func (sm *SlimArray) Stat() map[string]int32
- func (x *SlimArray) String() string
Examples ¶
Constants ¶
This section is empty.
Variables ¶
var File_slimarray_proto protoreflect.FileDescriptor
Functions ¶
This section is empty.
Types ¶
type SlimArray ¶
type SlimArray struct { // N is the count of elts N int32 `protobuf:"varint,10,opt,name=N,proto3" json:"N,omitempty"` Rank []uint64 `protobuf:"varint,19,rep,packed,name=Rank,proto3" json:"Rank,omitempty"` // Every 1024 elts segment has a 64-bit bitmap to describe the spans in it, // and another 64-bit rank: the count of `1` in preceding bitmaps. Bitmap []uint64 `protobuf:"varint,20,rep,packed,name=Bitmap,proto3" json:"Bitmap,omitempty"` // Polynomial and config of every span. // 3 doubles to represent a polynomial; Polynomials []float64 `protobuf:"fixed64,21,rep,packed,name=Polynomials,proto3" json:"Polynomials,omitempty"` // Config stores the offset of residuals in Residuals and the bit width to // store a residual in a span. Configs []int64 `protobuf:"varint,22,rep,packed,name=Configs,proto3" json:"Configs,omitempty"` // packed residuals for every elt. Residuals []uint64 `protobuf:"varint,23,rep,packed,name=Residuals,proto3" json:"Residuals,omitempty"` // contains filtered or unexported fields }
SlimArray compresses a uint32 array with overall trend by describing the trend with a polynomial, e.g., to store a sorted array is very common in practice. Such as an block-list of IP addresses, or a series of var-length record position on disk.
E.g. a uint32 costs only 5 bits in average in a sorted array of a million number in range [0, 1000*1000].
In addition to the unbelievable low memory footprint, a `Get` access is also very fast: it takes only 10 nano second in our benchmark.
SlimArray is also ready for transport since it is defined with protobuf. E.g.:
a := slimarray.NewU32([]uint32{1, 2, 3}) bytes, err := proto.Marshal(a)
Since 0.1.1
Example ¶
package main import ( "fmt" "github.com/openacid/slimarray" ) func main() { nums := []uint32{ 0, 16, 32, 48, 64, 79, 95, 111, 126, 142, 158, 174, 190, 206, 222, 236, 252, 268, 275, 278, 281, 283, 285, 289, 296, 301, 304, 307, 311, 313, 318, 321, 325, 328, 335, 339, 344, 348, 353, 357, 360, 364, 369, 372, 377, 383, 387, 393, 399, 404, 407, 410, 415, 418, 420, 422, 426, 430, 434, 439, 444, 446, 448, 451, 456, 459, 462, 465, 470, 473, 479, 482, 488, 490, 494, 500, 506, 509, 513, 519, 521, 528, 530, 534, 537, 540, 544, 546, 551, 556, 560, 566, 568, 572, 574, 576, 580, 585, 588, 592, 594, 600, 603, 606, 608, 610, 614, 620, 623, 628, 630, 632, 638, 644, 647, 653, 658, 660, 662, 665, 670, 672, 676, 681, 683, 687, 689, 691, 693, 695, 697, 703, 706, 710, 715, 719, 722, 726, 731, 735, 737, 741, 748, 750, 753, 757, 763, 766, 768, 775, 777, 782, 785, 791, 795, 798, 800, 806, 811, 815, 818, 821, 824, 829, 832, 836, 838, 842, 846, 850, 855, 860, 865, 870, 875, 878, 882, 886, 890, 895, 900, 906, 910, 913, 916, 921, 925, 929, 932, 937, 940, 942, 944, 946, 952, 954, 956, 958, 962, 966, 968, 971, 975, 979, 983, 987, 989, 994, 997, 1000, } a := slimarray.NewU32(nums) fmt.Println("last elt is:", a.Get(int32(a.Len()-1))) st := a.Stat() for _, k := range []string{ "elt_width", "mem_elts", "bits/elt"} { fmt.Printf("%10s : %d\n", k, st[k]) } }
Output: last elt is: 1000 elt_width : 3 mem_elts : 112 bits/elt : 16
func NewU32 ¶
NewU32 creates a "SlimArray" array from a slice of uint32.
A NewU32() costs about 110 ns/elt.
Since 0.1.1
func (*SlimArray) Descriptor
deprecated
func (*SlimArray) Get ¶
Get returns the uncompressed uint32 value. A Get() costs about 7 ns
Since 0.1.1
func (*SlimArray) GetConfigs ¶
func (*SlimArray) GetPolynomials ¶
func (*SlimArray) GetResiduals ¶
func (*SlimArray) ProtoMessage ¶
func (*SlimArray) ProtoMessage()
func (*SlimArray) ProtoReflect ¶
func (x *SlimArray) ProtoReflect() protoreflect.Message
func (*SlimArray) Slice ¶
Slice returns a slice of uncompressed uint32, e.g., similar to foo := nums[start:end]. `rst` is used to store returned values, it has to have at least `end-start` elt in it.
A Slice() costs about 3.8 ns, when retrieving 100 or more values a time.
Since 0.1.3
func (*SlimArray) Stat ¶
Stat returns a map describing memory usage.
seg_cnt :512 // segment count elt_width :8 // average bits count per elt span_cnt :12 // total count of spans spans/seg :7 // average span count per segment mem_elts :1048576 // memory cost for residuals mem_total :1195245 // total memory cost bits/elt :9 // average memory cost per elt n :10 // total elt count
Since 0.1.1
Example ¶
package main import ( "fmt" "math/rand" "sort" "github.com/openacid/slimarray" ) func main() { fmt.Println("== Memory cost stats of sorted random uint array ==") cases := []struct { n int rng uint32 }{ {1000, 1000}, {1000 * 1000, 1000 * 1000}, {1000 * 1000, 1000 * 1000 * 1000}, } for _, c := range cases { n := c.n rng := c.rng nums := []uint32{} rnd := rand.New(rand.NewSource(int64(n) * int64(rng))) for i := 0; i < n; i++ { s := uint32(rnd.Float64() * float64(rng)) nums = append(nums, s) } sort.Slice(nums, func(i, j int) bool { return nums[i] < nums[j] }) a := slimarray.NewU32(nums) st := a.Stat() fmt.Printf("\nn=%d rng=[0, %d]:\n\n", n, rng) for _, k := range []string{ // "mem_elts", "span_cnt", "spans/seg", "elt_width", "n", "mem_total", "bits/elt", } { fmt.Printf(" %10s: %d\n", k, st[k]) } } }
Output: == Memory cost stats of sorted random uint array == n=1000 rng=[0, 1000]: n: 1000 mem_total: 856 bits/elt: 6 n=1000000 rng=[0, 1000000]: n: 1000000 mem_total: 705720 bits/elt: 5 n=1000000 rng=[0, 1000000000]: n: 1000000 mem_total: 2078336 bits/elt: 16