succinct

package module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 4, 2021 License: MIT Imports: 1 Imported by: 0

README

succinct

Travis test

Report card Coverage Status

GoDoc PkgGoDev Sourcegraph

succinct provides several static succinct data types

Succinct Set

中文介绍: 100行代码的压缩前缀树: 50% smaller

Set is a succinct, sorted and static string set impl with compacted trie as storage. The space cost is about half lower than the original data.

Synopsis

package succinct

import "fmt"

func ExampleNewSet() {
	keys := []string{
		"A", "Aani", "Aaron", "Aaronic", "Aaronical", "Aaronite",
		"Aaronitic", "Aaru", "Ab", "Ababdeh", "Ababua", "Abadite",
	}
	s := NewSet(keys)
	for _, k := range []string{"Aani", "Foo", "Ababdeh"} {
		found := s.Has(k)
		fmt.Printf("lookup %10s, found: %v\n", k, found)
	}

	// Output:
	//
	// lookup       Aani, found: true
	// lookup        Foo, found: false
	// lookup    Ababdeh, found: true
}

Performance

  • 200 kilo real-world words collected from web:

    • the space costs is 57% of original data size.
    • And a Has() costs about 350 ns with a zip-f workload.

    Original size: 2204 KB

    With comparison with string array bsearch and google btree :

    Data Engine Size(KB) Size/original ns/op
    200kweb2 bsearch 5890 267% 229
    200kweb2 succinct.Set 1258 57% 356
    200kweb2 btree 12191 553% 483

    A string in go has two fields: a pointer to the text content and a length. Thus the space overhead is quite high with small strings. btree internally has more pointers and indirections(interface).

  • 870 kilo real-world ipv4:

    • the space costs is 67% of original data size.
    • And a Has() costs about 500 ns with a zip-f workload.

    Original size: 6823 KB

    Data Engine Size(KB) Size/original ns/op
    870k_ip4_hex bsearch 17057 500% 276
    870k_ip4_hex succinct.Set 2316 67% 496
    870k_ip4_hex btree 40388 1183% 577

Implementation

It stores sorted strings in a compacted trie(AKA prefix tree). A trie node has at most 256 outgoing labels. A label is just a single byte. E.g., [ab, abc, abcd, axy, buv] is represented with a trie like the following: (Numbers are node id)

^ -a-> 1 -b-> 3 $
  |      |      `c-> 6 $
  |      |             `d-> 9 $
  |      `x-> 4 -y-> 7 $
  `b-> 2 -u-> 5 -v-> 8 $

Internally it uses a packed []byte and a bitmap with len([]byte) bits to describe the outgoing labels of a node,:

^: ab  00
1: bx  00
2: u   0
3: c   0
4: y   0
5: v   0
6: d   0
7: ø
8: ø
9: ø

In storage it packs labels together and bitmaps joined with separator 1:

labels(ignore space): "ab bx u c y v d"
label bitmap:          0010010101010101111

Finally leaf nodes are indicated by another bitmap leaves, in which a 1 at i-th bit indicates the i-th node is a leaf:

leaves: 0001001111

License

This project is licensed under the MIT License - see the LICENSE file for details.

Documentation

Overview

Package succinct provides several succinct data types.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Set

type Set struct {
	// contains filtered or unexported fields
}

Set is a succinct, sorted and static string set impl with compacted trie as storage. The space cost is about half lower than the original data.

Implementation

It stores sorted strings in a compacted trie(AKA prefix tree). A trie node has at most 256 outgoing labels. A label is just a single byte. E.g., [ab, abc, abcd, axy, buv] is represented with a trie like the following: (Numbers are node id)

^ -a-> 1 -b-> 3 $
  |      |      `c-> 6 $
  |      |             `d-> 9 $
  |      `x-> 4 -y-> 7 $
  `b-> 2 -u-> 5 -v-> 8 $

Internally it uses a packed []byte and a bitmap with `len([]byte)` bits to describe the outgoing labels of a node,:

^: ab  00
1: bx  00
2: u   0
3: c   0
4: y   0
5: v   0
6: d   0
7: ø
8: ø
9: ø

In storage it packs labels together and bitmaps joined with separator `1`:

labels(ignore space): "ab bx u c y v d"
label bitmap:          0010010101010101111

Finally leaf nodes are indicated by another bitmap `leaves`, in which a `1` at i-th bit indicates the i-th node is a leaf:

leaves: 0001001111

func NewSet

func NewSet(keys []string) *Set

NewSet creates a new *Set struct, from a slice of sorted strings.

Example
keys := []string{
	"A", "Aani", "Aaron", "Aaronic", "Aaronical", "Aaronite",
	"Aaronitic", "Aaru", "Ab", "Ababdeh", "Ababua", "Abadite",
}
s := NewSet(keys)
for _, k := range []string{"Aani", "Foo", "Ababdeh"} {
	found := s.Has(k)
	fmt.Printf("lookup %10s, found: %v\n", k, found)
}
Output:


lookup       Aani, found: true
lookup        Foo, found: false
lookup    Ababdeh, found: true
Example (Memory)
keys := testkeys.Load("200kweb2")

original := 0
for _, k := range keys {
	original += len(k)
}

s := NewSet(keys)

fmt.Println("With", len(keys)/1000, "thousands keys:")
fmt.Println("  Original size:", original/1024, "KB")
fmt.Println("  Compressed size:",
	size.Of(s)/1024, "KB, ratio:",
	size.Of(s)*100/original, "%")
fmt.Println("Memory layout:")
fmt.Println(size.Stat(s, 10, 1))
Output:


With 235 thousands keys:
  Original size: 2204 KB
  Compressed size: 1258 KB, ratio: 57 %
Memory layout:
*succinct.Set: 1288412
    succinct.Set: 1288404
        leaves: []uint64: 99128
            0: uint64: 8
        labelBitmap: []uint64: 198224
            0: uint64: 8
        labels: []uint8: 792800
            0: uint8: 1
        ranks: []int32: 99128
            0: int32: 4
        selects: []int32: 99124
            0: int32: 4

func (*Set) Has

func (ss *Set) Has(key string) bool

Has query for a key and return whether it presents in the Set.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL