Documentation
¶
Overview ¶
Offheap
An off-heap hash-table for Go (golang). Originally called go-offheap-hashtable, but now shortened to just offheap.
The purpose here is to have a hash table that can work away from Go's Garbage Collector, to avoid long GC pause times.
We accomplish this by writing our own Malloc() and Free() implementation (see malloc.go) which requests memory directly from the OS.
The keys, values, and entire hash table is kept on off-heap storage. This storage can also optionally be backed by memory mapped file for speedy persistence and fast startup times.
Initial HashTable implementation inspired by the public domain C++ code of
https://github.com/preshing/CompareIntegerMaps
See also
http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array/
for performance studies of the C++ code.
HashTable ¶
The implementation is mostly in offheap.go, read that to start.
Maps pointer-sized integers to Cell structures, which in turn hold Val_t as well as Key_t structures.
Uses open addressing with linear probing. This makes it very cache friendly and thus very fast.
In the t.Cells array, UnHashedKey = 0 is reserved to indicate an unused cell. Actual value for key 0 (if any) is stored in t.ZeroCell. The hash table automatically doubles in size when it becomes 75% full. The hash table never shrinks in size, even after Clear(), unless you explicitly call Compact().
Basic operations: Lookup(), Insert(), DeleteKey(). These are the equivalent of the builtin map[uint64]interface{}.
As an example of how to specialize for a map[string]*Cell equivalent, see the following functions in the bytekey.go file:
func (t *HashTable) InsertStringKey(strkey string, value interface{}) bool func (t *HashTable) LookupStringKey(strkey string) (Val_t, bool) func (t *HashTable) DeleteStringKey(strkey string) bool
Example use:
h := offheap.NewHashTable(2) // basic three operations are: h.InsertStringKey("My number", 43) val, ok := h.LookupStringKey("My number") h.DeleteStringKey("My number")
Note that this library is only a starting point of source code, and not intended to be used without customization. Users of the HashTable will have to customize it by changing the definitions of Key_t and Val_t to suite their needs.
On Save(), serialization of the HashTable itself is done using msgpack to write bytes to the first page (4k bytes) of the memory mapped file. This uses github.com/tinylib/msgp which is a blazing fast msgpack serialization library. It is fast because it avoids reflection and pre-computes the serializations (using go generate based inspection of your go source). If you need to serialize your values into the Val_t, I would suggest evaluating the msgp for serialization and deserialization. The author, Philip Hofer, has done a terrific job and put alot of effort into tuning it for performance. If you are still pressed for speed, consider also ommitting the field labels using the '//msgp:tuple MyValueType' annotation. As Mr. Hofer says, "For smaller objects, tuple encoding can yield serious performance improvements." [https://github.com/tinylib/msgp/wiki/Preprocessor-Directives].
Related ideas:
https://gist.github.com/mish15/9822474 (using CGO)
CGO note: the cgo-malloc branch of this github repo has an implementation that uses CGO to call the malloc/calloc/free functions in the C stdlib. Using CGO gives up the save-to-disk instantly feature and creates a portability issue where you have linked against a specific version of the C stdlib. However if you are making/destroying alot of tables, the CGO apporach may be faster. This is because calling malloc and free in the standard C library are much faster than making repeated system calls to mmap().
more related ideas:
https://groups.google.com/forum/#!topic/golang-nuts/kCQP6S6ZGh0
not fully off-heap, but using a slice instead of a map appears to help GC quite alot too:
Index ¶
- Constants
- type ByteKeyHashTable
- type Cell
- func (z *Cell) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *Cell) EncodeMsg(en *msgp.Writer) (err error)
- func (cell *Cell) GetInt() int
- func (cell *Cell) GetString() string
- func (z *Cell) MarshalMsg(b []byte) (o []byte, err error)
- func (z *Cell) Msgsize() (s int)
- func (cell *Cell) SetInt(n int)
- func (cell *Cell) SetString(s string)
- func (cell *Cell) SetValue(v interface{})
- func (z *Cell) UnmarshalMsg(bts []byte) (o []byte, err error)
- func (cell *Cell) ZeroValue()
- type HashTable
- func (t *HashTable) CellAt(pos uint64) *Cell
- func (t *HashTable) Clear()
- func (t *HashTable) Compact()
- func (z *HashTable) DecodeMsg(dc *msgp.Reader) (err error)
- func (t *HashTable) DeleteCell(cell *Cell)
- func (t *HashTable) DeleteKey(key uint64)
- func (t *HashTable) DestroyHashTable()
- func (t *HashTable) Dump()
- func (z *HashTable) EncodeMsg(en *msgp.Writer) (err error)
- func (t *HashTable) Insert(key uint64) (*Cell, bool)
- func (t *HashTable) InsertBK(bytekey []byte, value interface{}) bool
- func (t *HashTable) InsertIntValue(key uint64, value int) bool
- func (t *HashTable) Lookup(key uint64) *Cell
- func (t *HashTable) LookupBK(bytekey []byte) (Val_t, bool)
- func (t *HashTable) LookupBKInt(bytekey []byte) (int, bool)
- func (z *HashTable) MarshalMsg(b []byte) (o []byte, err error)
- func (z *HashTable) Msgsize() (s int)
- func (tab *HashTable) NewIterator() *Iterator
- func (t *HashTable) Repopulate(desiredSize uint64)
- func (t *HashTable) Save(background bool) error
- func (z *HashTable) UnmarshalMsg(bts []byte) (o []byte, err error)
- type Iterator
- func (z *Iterator) DecodeMsg(dc *msgp.Reader) (err error)
- func (it *Iterator) Done() bool
- func (z *Iterator) EncodeMsg(en *msgp.Writer) (err error)
- func (z *Iterator) MarshalMsg(b []byte) (o []byte, err error)
- func (z *Iterator) Msgsize() (s int)
- func (it *Iterator) Next() *Cell
- func (z *Iterator) UnmarshalMsg(bts []byte) (o []byte, err error)
- type Key_t
- type MmapMalloc
- type StringHashTable
- type Val_t
- func (z *Val_t) DecodeMsg(dc *msgp.Reader) (err error)
- func (z *Val_t) EncodeMsg(en *msgp.Writer) (err error)
- func (v *Val_t) GetInt() int
- func (v *Val_t) GetString() string
- func (z *Val_t) MarshalMsg(b []byte) (o []byte, err error)
- func (z *Val_t) Msgsize() (s int)
- func (v *Val_t) SetInt(n int)
- func (v *Val_t) SetString(s string)
- func (z *Val_t) UnmarshalMsg(bts []byte) (o []byte, err error)
Constants ¶
const MetadataHeaderMaxBytes = 4096
metadata serialization size can never grow bigger than MetadataHeaderMaxBytes (currently one 4K page), without impacting backwards compatibility. We reserve this many bytes at the beginning of the memory mapped file for the metadata.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ByteKeyHashTable ¶
type ByteKeyHashTable HashTable
ByteKeyHashTable shows how to specialize HashTable to handle the []byte type as a key. At present, given the current definition of Key_t (you should redefine Key_t if needed -- see offheap.go), only the first 64 bytes are used in the hash of the key.
func NewByteKeyHashTable ¶
func NewByteKeyHashTable(initialSize int64) *ByteKeyHashTable
NewByteKeyHashTable produces a new ByteKeyHashTable, one specialized for handling []byte as keys.
func (*ByteKeyHashTable) DeleteBK ¶
func (t *ByteKeyHashTable) DeleteBK(bytekey []byte) bool
DeleteBK removes an entry with a []byte key. By default only len(Key_t) bytes are used in the key.
func (*ByteKeyHashTable) InsertBK ¶
func (t *ByteKeyHashTable) InsertBK(bytekey []byte, value interface{}) bool
InsertBK is the insert function for []byte keys. By default only len(Key_t) bytes are used in the key.
type Cell ¶
type Cell struct { UnHashedKey uint64 ByteKey Key_t Value Val_t // customize this to hold your value's data type entirely here. }
Cell is the basic payload struct, stored inline in the HashTable. The cell is returned by the fundamental Lookup() function. The member Value is where the value that corresponds to the key (in ByteKey) is stored. Both the key (in ByteKey) and the value (in Value) are stored inline inside the hashtable, so that all storage for the hashtable is in the same offheap segment. The uint64 key given to fundamental Insert() method is stored in UnHashedKey. The hashed value of the UnHashedKey is not stored in the Cell, but rather computed as needed by the basic Insert() and Lookup() methods.
func (*Cell) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (*Cell) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*Cell) SetString ¶
SetString stores string s (up to val_t length, currently 56 bytes) in cell.Value.
func (*Cell) SetValue ¶
func (cell *Cell) SetValue(v interface{})
SetValue stores any value v in the Cell. Note that users of the library will need to extend this for their type. Only strings of length less than 56, and integers are handled by default.
func (*Cell) UnmarshalMsg ¶
UnmarshalMsg implements msgp.Unmarshaler
type HashTable ¶
type HashTable struct { MagicNumber int // distinguish between reading from empty file versus an on-disk HashTable. Cells uintptr `msg:"-"` CellSz uint64 ArraySize uint64 Population uint64 ZeroUsed bool ZeroCell Cell OffheapHeader []byte `msg:"-"` OffheapCells []byte `msg:"-"` Mmm MmapMalloc `msg:"-"` }
HashTable represents the off-heap hash table. Create a new one with NewHashTable(), then use Lookup(), Insert(), and DeleteKey() on it. HashTable is meant to be customized by the user, to reflect your choice of key and value types. See StringHashTable and ByteKeyHashTable for examples of this specialization process.
func NewHashFileBacked ¶
func NewHashTable ¶
Create a new hash table, able to hold initialSize count of keys.
func (*HashTable) CellAt ¶
CellAt: fetch the cell at a given index. E.g. t.CellAt(pos) replaces t.Cells[pos]
func (*HashTable) Clear ¶
func (t *HashTable) Clear()
Clear does not resize the table, but zeroes-out all entries.
func (*HashTable) Compact ¶
func (t *HashTable) Compact()
Compact will compress the hashtable so that it is at most 75% full.
func (*HashTable) DeleteCell ¶
DeleteCell deletes the cell pointed to by cell.
func (*HashTable) DestroyHashTable ¶
func (t *HashTable) DestroyHashTable()
DestroyHashTable frees the memory-mapping, returning the memory containing the hash table and its cells to the OS. By default the save-to-file-on-disk functionality in malloc.go is not used, but that can be easily activated. See malloc.go. Deferencing any cells/pointers into the hash table after destruction will result in crashing your process, almost surely.
func (*HashTable) Dump ¶
func (t *HashTable) Dump()
Dump provides a diagnostic dump of the full HashTable contents.
func (*HashTable) Insert ¶
Insert a key and get back the Cell for that key, so as to enable assignment of Value within that Cell, for the specified key. The 2nd return value is false if key already existed (and thus required no addition); if the key already existed you can inspect the existing value in the *Cell returned.
func (*HashTable) InsertBK ¶
InsertBK is the insert function for []byte keys. By default only len(Key_t) bytes are used in the key.
func (*HashTable) InsertIntValue ¶
InsertIntValue inserts value under key in the table.
func (*HashTable) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (*HashTable) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
func (*HashTable) NewIterator ¶
NewIterator creates a new iterator for HashTable tab.
func (*HashTable) Repopulate ¶
Repopulate expands the hashtable to the desiredSize count of cells.
type Iterator ¶
type Iterator struct { Tab *HashTable Pos int64 Cur *Cell // will be set to nil when done with iteration. }
Iterator
sample use: given a HashTable h, enumerate h's contents with:
for it := offheap.NewIterator(h); it.Cur != nil; it.Next() { found = append(found, it.Cur.UnHashedKey) }
func (*Iterator) Done ¶
Done checks to see if we have already iterated through all cells in the table. Equivalent to checking it.Cur == nil.
func (*Iterator) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (*Iterator) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message
type Key_t ¶
type Key_t [64]byte
Key_t is the basic type for keys. Users of the library will probably redefine this.
func (*Key_t) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
type MmapMalloc ¶
type MmapMalloc struct { Path string File *os.File Fd int FileBytesLen int64 BytesAlloc int64 MMap gommap.MMap // equiv to Mem, just avoids casts everywhere. Mem []byte // equiv to Mmap }
The MmapMalloc struct represents either an anonymous, private region of memory (if path was "", or a memory mapped file if path was supplied to Malloc() at creation.
Malloc() creates and returns an MmapMalloc struct, which can then be later Free()-ed. Malloc() calls request memory directly from the kernel via mmap(). Memory can optionally be backed by a file for simplicity/efficiency of saving to disk.
For use when the Go GC overhead is too large, and you need to move the hash table off-heap.
func Growmap ¶
func Growmap(oldmap *MmapMalloc, newSize int64) (newmap *MmapMalloc, err error)
Growmap grows a memory mapping
func Malloc ¶
func Malloc(numBytes int64, path string) *MmapMalloc
Malloc() creates a new memory region that is provided directly by OS via the mmap() call, and is thus not scanned by the Go garbage collector.
If path is not empty then we memory map to the given path. Otherwise it is just like a call to malloc(): an anonymous memory allocation, outside the realm of the Go Garbage Collector. If numBytes is -1, then we take the size from the path file's size. Otherwise the file is expanded or truncated to be numBytes in size. If numBytes is -1 then a path must be provided; otherwise we have no way of knowing the size to allocate, and the function will panic.
The returned value's .Mem member holds a []byte pointing to the returned memory (as does .MMap, for use in other gommap calls).
func (*MmapMalloc) BackgroundSync ¶
func (mm *MmapMalloc) BackgroundSync()
BackgroundSync() schedules a sync to disk, but may return before it is done. Without a call to either BackgroundSync() or BlockUntilSync(), there is no guarantee that file has ever been written to disk at any point before the munmap() call that happens during Free(). See the man pages msync(2) and mmap(2) for details.
func (*MmapMalloc) BlockUntilSync ¶
func (mm *MmapMalloc) BlockUntilSync()
BlockUntilSync() returns only once the file is synced to disk.
func (*MmapMalloc) Free ¶
func (mm *MmapMalloc) Free()
Free releases the memory allocation back to the OS by removing the (possibly anonymous and private) memroy mapped file that was backing it. Warning: any pointers still remaining will crash the program if dereferenced.
func (*MmapMalloc) TruncateTo ¶
func (mm *MmapMalloc) TruncateTo(newSize int64)
TruncateTo enlarges or shortens the file backing the memory map to be size newSize bytes. It only impacts the file underlying the mapping, not the mapping itself at this point.
type StringHashTable ¶
type StringHashTable HashTable
StringHashTable shows how to specialize HashTable to handle strings as keys. At present, given the current definition of Key_t (you should redefine Key_t if needed -- see offheap.go), only the first 64 bytes are used in the hash of the key.
func NewStringHashTable ¶
func NewStringHashTable(initialSize int64) *StringHashTable
NewStringHashTable produces a new StringHashTable, one specialized for handling keys of type string.
func (*StringHashTable) DeleteStringKey ¶
func (t *StringHashTable) DeleteStringKey(strkey string) bool
DeleteStringKey deletes the cell (if there is one) that has been previously inserted with the given strkey string key.
func (*StringHashTable) DumpStringKey ¶
func (t *StringHashTable) DumpStringKey()
DumpStringKey provides a diagnostic printout of the contents of a hashtable that is using strings as keys.
func (*StringHashTable) InsertStringKey ¶
func (t *StringHashTable) InsertStringKey(strkey string, value interface{}) bool
InsertStringKey inserts a value with a key that is a string.
func (*StringHashTable) LookupStringKey ¶
func (t *StringHashTable) LookupStringKey(strkey string) (Val_t, bool)
LookupStringKey looks up a value based on a key that is a string.
type Val_t ¶
type Val_t [56]byte
Val_t is the basic type for values stored in the cells in the table. Users of the library will probably redefine this to be a different size at the very least.
func (*Val_t) MarshalMsg ¶
MarshalMsg implements msgp.Marshaler
func (*Val_t) Msgsize ¶
Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message