
CaskDB - Disk-based Log-Structured Hash Table Store


[architecture diagram]

CaskDB is a disk-based, embedded, persistent key-value store based on Riak's Bitcask paper, written in Go. It is focused more on educational value than production use. The file format is platform, machine, and programming-language independent: for example, a database file created with Go on macOS should be readable with Rust on Windows.

This project aims to help anyone, even a beginner in databases, build a persistent database in a few hours. There are no external dependencies; the Go standard library is all you need.

If you are interested in writing the database yourself, head to the workshop section.

Features

  • Low latency for reads and writes
  • High throughput
  • Easy to back up / restore
  • Simple and easy to understand
  • Store data much larger than RAM

Limitations

Most of the following limitations are specific to CaskDB; some, however, stem from design constraints in the Bitcask paper.

  • A single file stores all the data, and deleted keys still take up space
  • CaskDB does not offer range scans
  • CaskDB keeps all the keys in memory, so with a lot of keys, RAM usage will be high
  • Startup is slow since all the keys need to be loaded into memory

Community

CaskDB Discord

Consider joining the Discord community to build and learn about KV stores with peers.

Dependencies

CaskDB does not require any external libraries to run; the Go standard library is enough.

Installation

go get github.com/avinassh/go-caskdb

Usage

store, _ := NewDiskStore("books.db") // open (or create) the database file
store.Set("othello", "shakespeare")  // append the pair to disk and index it
author := store.Get("othello")       // one map lookup, one disk read
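
In real code you would check the error and close the store when done. A minimal sketch, assuming it runs alongside the package (note that Close reports success as a bool rather than returning an error, per the API documented below):

package main

import (
	"fmt"
	"log"
)

func main() {
	store, err := NewDiskStore("books.db")
	if err != nil {
		log.Fatalf("could not open store: %v", err)
	}
	// Close flushes and closes the underlying file; it returns a bool.
	defer store.Close()

	store.Set("othello", "shakespeare")
	fmt.Println(store.Get("othello")) // prints "shakespeare"
}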

Cask DB (Python)

This project is a Go port of the same project in Python.

Prerequisites

The workshop is for intermediate to advanced programmers. Knowing the basics of Go helps, but you can build the database in any language you wish.

Not sure where you stand? You are ready if you have done the following in any language:

  • Used a dictionary or hash table data structure
  • Converted an object (class, struct, or dict) to JSON and converted JSON back to an object
  • Opened a file to write or read anything; a common task is dumping a dictionary's contents to disk and reading it back

Workshop

NOTE: I don't have any workshops scheduled in the near future. Follow me on Twitter for updates. Drop me an email if you wish to arrange a workshop for your team/company.

CaskDB comes with a full test suite and a wide range of tools to help you write a database quickly. A GitHub Action runs the tests automatically. Fork the repo, push your code, and pass the tests!

Throughout the workshop, you will implement the following:

  • Write serialiser methods that turn objects into bytes, and deserialiser methods that turn bytes back into objects
  • Come up with a data format with a header and data to store the bytes on disk. The header contains metadata like the timestamp, key size, and value size
  • Store and retrieve data from the disk
  • Read an existing CaskDB file to load all keys

Tasks

  1. Read the paper. Fork this repo and check out the start-here branch
  2. Implement the fixed-size header, which encodes the timestamp (uint32, 4 bytes), key size (uint32, 4 bytes), and value size (uint32, 4 bytes) together (see the sketch after this list)
  3. Implement the key and value serialisers, and pass the tests from format_test.go
  4. Figure out how to store the data on disk and the row pointer in memory. Implement the get/set operations. Tests for these are in disk_store_test.go
  5. Code from tasks #2 and #3 should be enough to read an existing CaskDB file and load the keys into memory
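
For task 2, here is a minimal sketch of what the fixed-size header encoding could look like. The helper names and the little-endian byte order are assumptions for illustration; the tests in format_test.go define the canonical format:

package main

import (
	"encoding/binary"
	"fmt"
)

// The header is 12 bytes: timestamp, key size, and value size, each a uint32.
const headerSize = 12

// encodeHeader packs the three fields into a fixed-size byte slice.
func encodeHeader(timestamp, keySize, valueSize uint32) []byte {
	buf := make([]byte, headerSize)
	binary.LittleEndian.PutUint32(buf[0:4], timestamp)
	binary.LittleEndian.PutUint32(buf[4:8], keySize)
	binary.LittleEndian.PutUint32(buf[8:12], valueSize)
	return buf
}

// decodeHeader is the inverse of encodeHeader.
func decodeHeader(buf []byte) (timestamp, keySize, valueSize uint32) {
	return binary.LittleEndian.Uint32(buf[0:4]),
		binary.LittleEndian.Uint32(buf[4:8]),
		binary.LittleEndian.Uint32(buf[8:12])
}

func main() {
	header := encodeHeader(1709337600, 7, 11) // "othello" -> "shakespeare"
	ts, ks, vs := decodeHeader(header)
	fmt.Println(ts, ks, vs) // 1709337600 7 11
}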

Run make test to run the tests locally. Push the code to GitHub, and the tests will run on different OSes: Ubuntu, macOS, and Windows.

Not sure how to proceed? Then check the hints file, which contains more details on the tasks.

Hints

  • Not sure how to come up with a file format? Read the comment in the format file

What next?

I often get questions about what comes next after the basic implementation. Here are some challenges (at different difficulty levels):

Level 1:
  • Crash safety: the Bitcask paper stores a CRC in each row and verifies the data while fetching the row back (see the sketch after this list)
  • Key deletion: CaskDB does not have a delete API. Read the paper and implement it
  • Instead of using a hash table, use a data structure like a red-black tree to support range scans
  • CaskDB accepts only strings as keys and values. Make it generic and accept other data types like int or bytes
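
For the crash-safety challenge, the idea is to prefix each row with its checksum on write and verify it on read. A minimal sketch using the standard library's hash/crc32 (the helper names and the row layout are illustrative, not CaskDB's format):

package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// appendChecksum prefixes a row with its CRC-32 so a torn or corrupted
// write can be detected when the row is read back.
func appendChecksum(row []byte) []byte {
	buf := make([]byte, 4+len(row))
	binary.LittleEndian.PutUint32(buf[0:4], crc32.ChecksumIEEE(row))
	copy(buf[4:], row)
	return buf
}

// verifyChecksum returns the row only if the stored CRC matches its contents.
func verifyChecksum(buf []byte) ([]byte, error) {
	stored := binary.LittleEndian.Uint32(buf[0:4])
	row := buf[4:]
	if crc32.ChecksumIEEE(row) != stored {
		return nil, errors.New("checksum mismatch: row is corrupted")
	}
	return row, nil
}

func main() {
	row := appendChecksum([]byte("othello|shakespeare"))
	if data, err := verifyChecksum(row); err == nil {
		fmt.Println(string(data)) // othello|shakespeare
	}
}
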
Level 2:
  • Hint file to improve the startup time. The paper has more details on it
  • Implement an internal cache which stores some of the key-value pairs. You may explore and experiment with different cache eviction strategies like LRU, LFU, and FIFO
  • Split the data into multiple files when a file hits a specific capacity
Level 3:
  • Support for multiple processes
  • Garbage collector: keys which were updated or deleted remain in the file and take up space. Write a garbage collector to remove such stale data
  • Add a SQL query engine layer
  • Store JSON in values and explore making CaskDB a document database like MongoDB
  • Make CaskDB distributed by exploring algorithms like Raft, Paxos, or consistent hashing

Line Count

$ tokei -f format.go disk_store.go
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Go                      2          320          133          168           19
-------------------------------------------------------------------------------
 format.go                          111           35           67            9
 disk_store.go                      209           98          101           10
===============================================================================
 Total                   2          320          133          168           19
===============================================================================

Contributing

All contributions are welcome. Please check CONTRIBUTING.md for more details.

Community Contributions

Author         Feature     PR
PaulisMatrix   Delete Op   #6
PaulisMatrix   Checksum    #7

License

The MIT license. Please check LICENSE for more details.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DiskStore

type DiskStore struct {
	// contains filtered or unexported fields
}

DiskStore is a Log-Structured Hash Table as described in the Bitcask paper. We keep appending data to a file, like a log. DiskStore maintains an in-memory hash table called KeyDir, which keeps each row's location on disk.

The idea is simple yet brilliant:

  • Write the record to the disk
  • Update the internal hash table to point to that byte offset
  • Whenever we get a read request, look up the address in the internal hash table, fetch the data from that offset, and return it

KeyDir does not store values, only their locations.

The above approach solves a lot of problems:

  • Writes are insanely fast since you are just appending to the file
  • Reads are insanely fast since you do only one disk seek. In B-Tree-backed storage, there could be 2-3 disk seeks

However, there are drawbacks too:

  • We need to maintain an in-memory hash table KeyDir. A database with a large number of keys would require more RAM
  • Since we need to build the KeyDir at initialisation, it will affect the startup time too
  • Deleted keys need to be purged from the file to reduce the file size

Read the paper for more details: https://riak.com/assets/bitcask-intro.pdf
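
A minimal sketch of that read path (the type and field names mirror KeyEntry but are illustrative; the package's actual fields are unexported):

package main

import (
	"fmt"
	"os"
)

// keyEntry mirrors the package's KeyEntry: where a record lives on disk.
type keyEntry struct {
	timestamp uint32
	position  uint32 // byte offset of the record in the file
	totalSize uint32 // header + key + value, in bytes
}

// get performs the read path described above: one map lookup, one disk read.
func get(f *os.File, keyDir map[string]keyEntry, key string) ([]byte, error) {
	entry, ok := keyDir[key]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}
	buf := make([]byte, entry.totalSize)
	if _, err := f.ReadAt(buf, int64(entry.position)); err != nil {
		return nil, err
	}
	// The caller decodes the header, key, and value from buf.
	return buf, nil
}

func main() {
	f, err := os.Open("books.db")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()
	// Hypothetical entry: a 34-byte record for "othello" at offset 120.
	keyDir := map[string]keyEntry{"othello": {position: 120, totalSize: 34}}
	record, err := get(f, keyDir, "othello")
	fmt.Println(record, err)
}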

DiskStore provides two simple operations to get and set key-value pairs. Both the key and the value need to be of string type, and all the data is persisted to disk. During startup, DiskStore loads all the existing KV pair metadata and returns an error if the file is invalid or corrupt.

Note that if the database file is large, initialisation will take correspondingly longer. Initialisation is also a blocking operation; the database cannot be used until it completes.

Typical usage example:

	store, _ := NewDiskStore("books.db")
	store.Set("othello", "shakespeare")
	author := store.Get("othello")

func NewDiskStore

func NewDiskStore(fileName string) (*DiskStore, error)

func (*DiskStore) Close

func (d *DiskStore) Close() bool

func (*DiskStore) Get

func (d *DiskStore) Get(key string) string

func (*DiskStore) Set

func (d *DiskStore) Set(key string, value string)

type KeyEntry

type KeyEntry struct {
	// contains filtered or unexported fields
}

KeyEntry keeps the metadata about a KV pair, most importantly the record's byte offset in the file. Whenever we insert/update a key, we create a new KeyEntry object and insert it into keyDir.

func NewKeyEntry

func NewKeyEntry(timestamp uint32, position uint32, totalSize uint32) KeyEntry
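
For illustration, after appending a record the store would index it like this (the offset and size values are hypothetical, and keyDir stands for the store's internal map):

// Suppose a 34-byte record for "othello" was just appended at byte offset 120.
entry := NewKeyEntry(1709337600, 120, 34) // timestamp, position, totalSize
keyDir["othello"] = entry                 // later Gets will read from offset 120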

type MemoryStore

type MemoryStore struct {
	// contains filtered or unexported fields
}

func NewMemoryStore

func NewMemoryStore() *MemoryStore

func (*MemoryStore) Close

func (m *MemoryStore) Close() bool

func (*MemoryStore) Get

func (m *MemoryStore) Get(key string) string

func (*MemoryStore) Set

func (m *MemoryStore) Set(key string, value string)

type Store

type Store interface {
	Get(key string) string
	Set(key string, value string)
	Close() bool
}
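
Since both DiskStore and MemoryStore satisfy Store, code can be written against the interface and exercised with either backend. A small hypothetical helper:

// warmUp loads a batch of pairs into any Store implementation.
func warmUp(s Store, pairs map[string]string) {
	for k, v := range pairs {
		s.Set(k, v)
	}
}

For tests, pass the result of NewMemoryStore(); for persistence, pass a *DiskStore.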
