package module
Version: v0.0.0-...-5b47ed6 Latest Latest

This package is not in the latest version of its module.

Go to latest
Published: Nov 17, 2014 License: MIT Imports: 13 Imported by: 27



gkvlite is a simple, ordered, ACID, key-value persistence library for Go.

GoDoc Build Status Coverage Status


gkvlite is a library that provides a simple key-value persistence store, inspired by SQLite and CouchDB/Couchstore.

gkvlite has the following features...

  • 100% implemented in the Go Language (golang).
  • Open source license - MIT.
  • Keys are ordered, so range scans are supported.
  • On-disk storage for a "Store" is a single file.
  • ACID properties are supported via a simple, append-only, copy-on-write storage design.
  • Concurrent goroutine support (multiple readers, single mutator).

Key concepts

  • A Store is held in a single file.
  • A Store can have zero or more Collections.
  • A Collection can have zero or more Items.
  • An Item is a key and value.
  • A key is a []byte, max length 64KB (length is uint16).
  • A value is a []byte, max length 4GB (length is uint32).

ACID properties

  • Atomicity - all unpersisted changes from all Collections during a Store.Flush() will be persisted atomically.
  • Consistency - simple key-value level consistency is supported.
  • Isolation - mutations won't affect concurrent readers or snapshots.
  • Durability - you control when you want to Flush() & fsync so your application can address its performance-vs-safety tradeoffs appropriately.


  • In general, performance is similar to probabilistic balanced binary tree performance.
  • O(log N) performance for item retrieval, insert, update, delete.
  • O(log N) performance to find the smallest or largest items (by key).
  • Range iteration performance is same as binary tree traversal performance.
  • You can optionally retrieve just keys only, to save I/O & memory resources.


  • Read-only Store snapshots are supported.
  • Mutations on the original Store won't be seen by snapshots.
  • Snapshot creation is a fast O(1) operation per Collection.


The simplest way to use gkvlite is in single-threaded fashion, such as by using a go channel or other application-provided locking to serialize access to a Store.

More advanced users may want to use gkvlite's support for concurrent goroutines. The idea is that multiple read-only goroutines, a single read-write (mutation) goroutine, and a single persistence (flusher) goroutine do not need to block each other.

More specifically, you should have only a single read-write (or mutation) goroutine per Store, and should have only a single persistence goroutine per Store (doing Flush()'s). And, you may have multiple, concurrent read-only goroutines per Store (doing read-only operations like Get()'s, Visit()'s, Snapshot()'s, CopyTo()'s, etc).

A read-only goroutine that performs a long or slow read operation, like a Get() that must retrieve from disk or a range scan, will see a consistent, isolated view of the collection. That is, mutations that happened after the slow read started will not be seen by the reader.

IMPORTANT: In concurrent usage, the user must provide a StoreFile implementation that is concurrent safe.

Note that os.File is not a concurrent safe implementation of the StoreFile interface. You will need to provide your own implementation of the StoreFile interface, such as by using a channel to serialize StoreFile method invocations.

Finally, advanced users may also use a read-write (mutation) goroutine per Collection instead of per Store. There should only be, though, only a single persistence (flusher) goroutine per Store.

Other features

  • In-memory-only mode is supported, where you can use the same API but without any persistence.
  • You provide the os.File - this library just uses the os.File you provide.
  • You provide the os.File.Sync() - if you want to fsync your file, call file.Sync() after you do a Flush().
  • Similar to SQLite's VFS feature, you can supply your own StoreFile interface implementation instead of an actual os.File, for your own advanced testing or I/O interposing needs (e.g., compression, checksums, I/O statistics, caching, enabling concurrency, etc).
  • You can specify your own KeyCompare function. The default is bytes.Compare(). See also the StoreCallbacks.KeyCompareForCollection() callback function.
  • Collections are written to file sorted by Collection name. This allows users with advanced concurrency needs to reason about how concurrent flushes interact with concurrent mutations. For example, if you have a main data collection and a secondary-index collection, with clever collection naming you can know that the main collection will always be "ahead of" the secondary-index collection even with concurrent flushing.
  • A Store can be reverted using the FlushRevert() API to revert the last Flush(). This brings the state of a Store back to where it was as of the next-to-last Flush(). This allows the application to rollback or undo changes on a persisted file.
  • Reverted snapshots (calling FlushRevert() on a snapshot) does not affect (is isolated from) the original Store and does not affect the underlying file. Calling FlushRevert() on the main Store, however, will adversely affect any active snapshots; where the application should stop using any snapshots that were created before the FlushRevert() invocation on the main Store.
  • To evict O(log N) number of items from memory, call Collection.EvictSomeItems(), which traverses a random tree branch and evicts any clean (already persisted) items found during that traversal. Eviction means clearing out references to those clean items, which means those items can be candidates for GC.
  • You can control item priority to access hotter items faster by shuffling them closer to the top of balanced binary trees (warning: intricate/advanced tradeoffs here).
  • Tree depth is provided by using the VisitItemsAscendEx() or VisitItemsDescendEx() methods.
  • You can associate transient, ephemeral (non-persisted) data with your items. If you do use the Item.Transient field, you should use sync/atomic pointer functions for concurrency correctness. In general, Item's should be treated as immutable, except for the Item.Transient field.
  • Application-level Item.Val buffer management is possible via the optional ItemValAddRef/ItemValDecRef() store callbacks, to help reduce garbage memory. Libraries like github.com/steveyen/go-slab may be helpful here.
  • Errors from file operations are propagated all the way back to your code, so your application can respond appropriately.
  • Tested - "go test" unit tests.
  • Docs - "go doc" documentation.


Open source - MIT licensed.


import (

f, err := os.Create("/tmp/test.gkvlite")
s, err := gkvlite.NewStore(f)
c := s.SetCollection("cars", nil)

// You can also retrieve the collection, where c == cc.
cc := s.GetCollection("cars")

// Insert values.
c.Set([]byte("tesla"), []byte("$$$"))
c.Set([]byte("mercedes"), []byte("$$"))
c.Set([]byte("bmw"), []byte("$"))

// Replace values.
c.Set([]byte("tesla"), []byte("$$$$"))

// Retrieve values.
mercedesPrice, err := c.Get([]byte("mercedes"))

// One of the most priceless vehicles is not in the collection.
thisIsNil, err := c.Get([]byte("the-apollo-15-moon-buggy"))

// Iterate through items.
c.VisitItemsAscend([]byte("ford"), true, func(i *gkvlite.Item) bool {
    // This visitor callback will be invoked with every item
    // with key "ford" and onwards, in key-sorted order.
    // So: "mercedes", "tesla" are visited, in that ascending order,
    // but not "bmw".
    // If we want to stop visiting, return false;
    // otherwise return true to keep visiting.
    return true

// Let's get a snapshot.
snap := s.Snapshot()
snapCars := snap.GetCollection("cars")

// The snapshot won't see modifications against the original Store.
mercedesIsNil, err := c.Get([]byte("mercedes"))
mercedesPriceFromSnapshot, err := snapCars.Get([]byte("mercedes"))

// Persist all the changes to disk.

f.Sync() // Some applications may also want to fsync the underlying file.

// Now, other file readers can see the data, too.
f2, err := os.Open("/tmp/test.gkvlite")
s2, err := gkvlite.NewStore(f2)
c2 := s.GetCollection("cars")

bmwPrice, err := c2.Get([]byte("bmw"))


Because all collections are persisted atomically when you flush a store to disk, you can implement consistent secondary indexes by maintaining additional collections per store. For example, a "users" collection can hold a JSON document per user, keyed by userId. Another "userEmails" collection can be used like a secondary index, keyed by "emailAddress:userId", with empty values (e.g., []byte{}).

Bulk inserts or batched mutations are roughly supported in gkvlite where your application should only occasionally invoke Flush() after N mutations or after a given amount of time, as opposed to invoking a Flush() after every Set/Delete().

Implementation / design

The fundamental data structure is an immutable treap (tree + heap).

When used with random heap item priorities, treaps have probabilistic balanced tree behavior with the usual O(log N) performance bounds expected of balanced binary trees.

The persistence design is append-only, using ideas from Apache CouchDB and Couchstore / Couchbase, providing a simple approach to reaching ACID properties in the face of process or machine crashes. On re-opening a file, the implementation scans the file backwards looking for the last good root record and logically "truncates" the file at that point. New mutations are appended from that last good root location. This follows the MVCC (multi-version concurrency control) and "the log is the database" approach of CouchDB / Couchstore / Couchbase.

TRADEOFF: the append-only persistence design means file sizes will grow until there's a compaction. To get a compacted file, use CopyTo() with a high "flushEvery" argument.

The append-only file format allows the FlushRevert() API (undo the changes on a file) to have a simple implementation of scanning backwards in the file for the next-to-last good root record and physically truncating the file at that point.

TRADEOFF: the append-only design means it's possible for an advanced adversary to corrupt a gkvlite file by cleverly storing the bytes of a valid gkvlite root record as a value; however, they would need to know the size of the containing gkvlite database file in order to compute a valid gkvlite root record and be able to force a process or machine crash after their fake root record is written but before the next good root record is written/sync'ed.

The immutable, copy-on-write treap plus the append-only persistence design allows for fast and efficient MVCC snapshots.

TRADEOFF: the immutable, copy-on-write design means more memory garbage may be created than other designs, meaning more work for the garbage collector (GC).

TODO / ideas

  • TODO: Performance: consider splitting item storage from node storage, so we're not mixing metadata and data in same cache pages. Need to measure how much win this could be in cases like compaction. Tradeoff as this could mean no more single file simplicity.

  • TODO: Performance: persist items as log, and don't write treap nodes on every Flush().

  • TODO: Keep stats on misses, disk fetches & writes, etc.

  • TODO: Provide public API for O(log N) collection spliting & joining.

  • TODO: Provide O(1) MidItem() or TopItem() implementation, so that users can split collections at decent points.

  • TODO: Provide item priority shifting during CopyTo().

  • TODO: Build up perfectly balanced treap from large batch of (potentially externally) sorted items.

  • TODO: Allow users to retrieve an item's value size (in bytes) without having to first fetch the item into memory.

  • TODO: Allow more fine-grained cached item and node eviction. Node and item objects are currently cached in-memory by gkvlite for higher retrieval performance, only for the nodes & items that you either have updated or have fetched from disk. Certain applications may need that memory instead, though, for more important tasks. The current coarse workaround is to drop all your references to any relevant Stores and Collections, start brand new Store/Collection instances, and let GC reclaim memory.

  • See more TODO's throughout codebase / grep.



Package gkvlite provides a simple, ordered, ACID, key-value persistence library. It provides persistent, immutable data structure abstrations.



View Source
const VERSION = uint32(4)


View Source
var MAGIC_BEG []byte = []byte("0g1t2r")
View Source
var MAGIC_END []byte = []byte("3e4a5p")


This section is empty.


type AllocStats

type AllocStats struct {
	MkNodes      int64
	FreeNodes    int64 // Number of invocations of the freeNode() API.
	AllocNodes   int64
	CurFreeNodes int64 // Current length of freeNodes list.

	MkNodeLocs      int64
	FreeNodeLocs    int64 // Number of invocations of the freeNodeLoc() API.
	AllocNodeLocs   int64
	CurFreeNodeLocs int64 // Current length of freeNodeLocs list.

	MkRootNodeLocs      int64
	FreeRootNodeLocs    int64 // Number of invocations of the freeRootNodeLoc() API.
	AllocRootNodeLocs   int64
	CurFreeRootNodeLocs int64 // Current length of freeRootNodeLocs list.

type Collection

type Collection struct {
	AppData unsafe.Pointer // For app-specific data; atomic CAS recommended.
	// contains filtered or unexported fields

A persistable collection of ordered key-values (Item's).

func (*Collection) AllocStats

func (t *Collection) AllocStats() (res AllocStats)

func (*Collection) Delete

func (t *Collection) Delete(key []byte) (wasDeleted bool, err error)

Deletes an item of a given key.

func (*Collection) EvictSomeItems

func (t *Collection) EvictSomeItems() (numEvicted uint64)

Evict some clean items found by randomly walking a tree branch. For concurrent users, only the single mutator thread should call EvictSomeItems(), making it serialized with mutations.

func (*Collection) Get

func (t *Collection) Get(key []byte) (val []byte, err error)

Retrieve a value by its key. Returns nil if the item is not in the collection. The returned value should be treated as immutable.

func (*Collection) GetItem

func (t *Collection) GetItem(key []byte, withValue bool) (i *Item, err error)

Retrieve an item by its key. Use withValue of false if you don't need the item's value (Item.Val may be nil), which might be able to save on I/O and memory resources, especially for large values. The returned Item should be treated as immutable.

func (*Collection) GetTotals

func (t *Collection) GetTotals() (numItems uint64, numBytes uint64, err error)

Returns total number of items and total key bytes plus value bytes.

func (*Collection) MarshalJSON

func (t *Collection) MarshalJSON() ([]byte, error)

Returns JSON representation of root node file location.

func (*Collection) MaxItem

func (t *Collection) MaxItem(withValue bool) (*Item, error)

Retrieves the item with the "largest" key. The returned item should be treated as immutable.

func (*Collection) MinItem

func (t *Collection) MinItem(withValue bool) (*Item, error)

Retrieves the item with the "smallest" key. The returned item should be treated as immutable.

func (*Collection) Name

func (t *Collection) Name() string

func (*Collection) Set

func (t *Collection) Set(key []byte, val []byte) error

Replace or insert an item of a given key.

func (*Collection) SetItem

func (t *Collection) SetItem(item *Item) (err error)

Replace or insert an item of a given key. A random item Priority (e.g., rand.Int31()) will usually work well, but advanced users may consider using non-random item priorities at the risk of unbalancing the lookup tree. The input Item instance should be considered immutable and owned by the Collection.

func (*Collection) UnmarshalJSON

func (t *Collection) UnmarshalJSON(d []byte) error

Unmarshals JSON representation of root node file location.

func (*Collection) VisitItemsAscend

func (t *Collection) VisitItemsAscend(target []byte, withValue bool, v ItemVisitor) error

Visit items greater-than-or-equal to the target key in ascending order.

func (*Collection) VisitItemsAscendEx

func (t *Collection) VisitItemsAscendEx(target []byte, withValue bool,
	visitor ItemVisitorEx) error

Visit items greater-than-or-equal to the target key in ascending order; with depth info.

func (*Collection) VisitItemsDescend

func (t *Collection) VisitItemsDescend(target []byte, withValue bool, v ItemVisitor) error

Visit items less-than the target key in descending order.

func (*Collection) VisitItemsDescendEx

func (t *Collection) VisitItemsDescendEx(target []byte, withValue bool,
	visitor ItemVisitorEx) error

Visit items less-than the target key in descending order; with depth info.

func (*Collection) Write

func (t *Collection) Write() error

Writes dirty items of a collection BUT (WARNING) does NOT write new root records. Use Store.Flush() to write root records, which would make these writes visible to the next file re-opening/re-loading.

type Item

type Item struct {
	Transient unsafe.Pointer // For any ephemeral data; atomic CAS recommended.
	Key, Val  []byte         // Val may be nil if not fetched into memory yet.
	Priority  int32          // Use rand.Int31() for probabilistic balancing.

A persistable item.

func (*Item) Copy

func (i *Item) Copy() *Item

The returned Item will not have been allocated through the optional StoreCallbacks.ItemAlloc() callback.

func (*Item) NumBytes

func (i *Item) NumBytes(c *Collection) int

Number of Key bytes plus number of Val bytes.

func (*Item) NumValBytes

func (i *Item) NumValBytes(c *Collection) int

type ItemCallback

type ItemCallback func(*Collection, *Item) (*Item, error)

type ItemVisitor

type ItemVisitor func(i *Item) bool

type ItemVisitorEx

type ItemVisitorEx func(i *Item, depth uint64) bool

type KeyCompare

type KeyCompare func(a, b []byte) int

User-supplied key comparison func should return 0 if a == b, -1 if a < b, and +1 if a > b. For example: bytes.Compare()

type Store

type Store struct {
	// contains filtered or unexported fields

A persistable store holding collections of ordered keys & values.

func NewStore

func NewStore(file StoreFile) (*Store, error)

Provide a nil StoreFile for in-memory-only (non-persistent) usage.

func NewStoreEx

func NewStoreEx(file StoreFile,
	callbacks StoreCallbacks) (*Store, error)

func (*Store) Close

func (s *Store) Close()

func (*Store) CopyTo

func (s *Store) CopyTo(dstFile StoreFile, flushEvery int) (res *Store, err error)

Copy all active collections and their items to a different file. If flushEvery > 0, then during the item copying, Flush() will be invoked at every flushEvery'th item and at the end of the item copying. The copy will not include any old items or nodes so the copy should be more compact if flushEvery is relatively large.

func (*Store) Flush

func (s *Store) Flush() error

Writes (appends) any dirty, unpersisted data to file. As a greater-window-of-data-loss versus higher-performance tradeoff, consider having many mutations (Set()'s & Delete()'s) and then have a less occasional Flush() instead of Flush()'ing after every mutation. Users may also wish to file.Sync() after a Flush() for extra data-loss protection.

func (*Store) FlushRevert

func (s *Store) FlushRevert() error

Reverts the last Flush(), bringing the Store back to its state at its next-to-last Flush() or to an empty Store (with no Collections) if there were no next-to-last Flush(). This call will truncate the Store file.

func (*Store) GetCollection

func (s *Store) GetCollection(name string) *Collection

Retrieves a named Collection.

func (*Store) GetCollectionNames

func (s *Store) GetCollectionNames() []string

func (*Store) ItemAddRef

func (o *Store) ItemAddRef(c *Collection, i *Item)

func (*Store) ItemAlloc

func (o *Store) ItemAlloc(c *Collection, keyLength uint16) *Item

func (*Store) ItemDecRef

func (o *Store) ItemDecRef(c *Collection, i *Item)

func (*Store) ItemValRead

func (o *Store) ItemValRead(c *Collection, i *Item,
	r io.ReaderAt, offset int64, valLength uint32) error

func (*Store) ItemValWrite

func (o *Store) ItemValWrite(c *Collection, i *Item, w io.WriterAt, offset int64) error

func (*Store) MakePrivateCollection

func (s *Store) MakePrivateCollection(compare KeyCompare) *Collection

Returns a new, unregistered (non-named) collection. This allows advanced users to manage collections of private collections.

func (*Store) RemoveCollection

func (s *Store) RemoveCollection(name string)

The Collection removal won't be reflected into persistence until you do a Flush(). Invoking RemoveCollection(x) and then SetCollection(x) is a fast way to empty a Collection.

func (*Store) SetCollection

func (s *Store) SetCollection(name string, compare KeyCompare) *Collection

SetCollection() is used to create a named Collection, or to modify the KeyCompare function on an existing Collection. In either case, a new Collection to use is returned. A newly created Collection and any mutations on it won't be persisted until you do a Flush().

func (*Store) Snapshot

func (s *Store) Snapshot() (snapshot *Store)

Returns a read-only snapshot, including any mutations on the original Store that have not been Flush()'ed to disk yet. The snapshot has its mutations and Flush() operations disabled because the original store "owns" writes to the StoreFile.

func (*Store) Stats

func (s *Store) Stats(out map[string]uint64)

Updates the provided map with statistics.

type StoreCallbacks

type StoreCallbacks struct {
	BeforeItemWrite, AfterItemRead ItemCallback

	// Optional callback to allocate an Item with an Item.Key.  If
	// your app uses ref-counting, the returned Item should have
	// logical ref-count of 1.
	ItemAlloc func(c *Collection, keyLength uint16) *Item

	// Optional callback to allow you to track gkvlite's ref-counts on
	// an Item.  Apps might use this for buffer management and putting
	// Item's on a free-list.
	ItemAddRef func(c *Collection, i *Item)

	// Optional callback to allow you to track gkvlite's ref-counts on
	// an Item.  Apps might use this for buffer management and putting
	// Item's on a free-list.
	ItemDecRef func(c *Collection, i *Item)

	// Optional callback to control on-disk size, in bytes, of an item's value.
	ItemValLength func(c *Collection, i *Item) int

	// Optional callback to allow you to write item bytes differently.
	ItemValWrite func(c *Collection, i *Item,
		w io.WriterAt, offset int64) error

	// Optional callback to read item bytes differently.  For example,
	// the app might have an optimization to just remember the reader
	// & file offsets in the item.Transient field for lazy reading.
	ItemValRead func(c *Collection, i *Item,
		r io.ReaderAt, offset int64, valLength uint32) error

	// Invoked when a Store is reloaded (during NewStoreEx()) from
	// disk, this callback allows the user to optionally supply a key
	// comparison func for each collection.  Otherwise, the default is
	// the bytes.Compare func.
	KeyCompareForCollection func(collName string) KeyCompare

Allows applications to override or interpose before/after events.

type StoreFile

type StoreFile interface {
	Stat() (os.FileInfo, error)
	Truncate(size int64) error

The StoreFile interface is implemented by os.File. Application specific implementations may add concurrency, caching, stats, etc.


Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL