t4

package module
v0.16.1
Published: Apr 13, 2026 License: MIT Imports: 25 Imported by: 0

README

T4


An embeddable, S3-durable key-value store for Go, with an etcd-compatible standalone server.

  • Embedded-first — t4.Open(cfg) is the entire API. No sidecar, no daemon.
  • S3-durable — WAL segments and periodic checkpoints are uploaded to S3. A node that loses its disk recovers automatically.
  • Multi-node — Leader elected via an S3 lock. Followers stream the WAL in real time and forward writes transparently.
  • etcd v3 compatible — The standalone binary speaks the etcd v3 gRPC protocol, including multi-key transactions.
  • Twelve-factor config — CLI flags can be supplied through T4_* environment variables.
  • Branches — Fork a database at any checkpoint with zero S3 copies. Each branch writes to its own prefix; shared SST files are deduplicated automatically.

Embedded usage

import "github.com/t4db/t4"

node, err := t4.Open(t4.Config{
    DataDir: "/var/lib/myapp/t4",
})
if err != nil {
    return err
}
defer node.Close()

rev, err := node.Put(ctx, "/config/timeout", []byte("30s"), 0)

kv, err := node.Get("/config/timeout")
fmt.Println(string(kv.Value)) // 30s

events, _ := node.Watch(ctx, "/config/", 0)
for e := range events {
    fmt.Printf("%s %s=%s\n", e.Type, e.KV.Key, e.KV.Value)
}

With S3 durability

import (
    "github.com/t4db/t4"
    "github.com/t4db/t4/pkg/object"
)

store, err := object.NewS3StoreFromConfig(ctx, object.S3Config{
    Bucket: "my-bucket",
    Prefix: "t4/",
    Region: "us-east-1",
    // Endpoint: "http://localhost:9000", // MinIO or another S3-compatible store
})
if err != nil {
    return err
}

node, err := t4.Open(t4.Config{
    DataDir:     "/var/lib/myapp/t4",
    ObjectStore: store,
})
if err != nil {
    return err
}
defer node.Close()

Standalone binary

The t4 binary exposes the etcd v3 gRPC protocol. Use etcdctl, the official Go client, or any other etcd v3 compatible tool.

go install github.com/t4db/t4/cmd/t4@latest

# Single node, local only
t4 run --data-dir /var/lib/t4 --listen 0.0.0.0:3379

# Single node with S3
t4 run --data-dir /var/lib/t4 --listen 0.0.0.0:3379 \
       --s3-bucket my-bucket --s3-prefix t4/

# The same configuration can come from environment variables.
T4_DATA_DIR=/var/lib/t4 \
T4_LISTEN=0.0.0.0:3379 \
T4_S3_BUCKET=my-bucket \
T4_S3_PREFIX=t4/ \
T4_S3_REGION=us-east-1 \
t4 run

# Verify
etcdctl --endpoints=localhost:3379 put /hello world
etcdctl --endpoints=localhost:3379 get /hello

Multi-node and production setup: see Operations.
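
Because the server speaks the etcd v3 gRPC protocol, the official Go client can be pointed at it unchanged. A minimal sketch, assuming a t4 server is listening on localhost:3379 as started above (go.etcd.io/etcd/client/v3 is the standard etcd client, not part of this module):

import (
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// Dial the t4 endpoint exactly as you would an etcd endpoint.
cli, err := clientv3.New(clientv3.Config{
    Endpoints:   []string{"localhost:3379"},
    DialTimeout: 5 * time.Second,
})
if err != nil {
    return err
}
defer cli.Close()

// The same Put/Get calls an etcd application would make.
if _, err := cli.Put(ctx, "/hello", "world"); err != nil {
    return err
}
resp, err := cli.Get(ctx, "/hello")
if err != nil {
    return err
}
for _, kv := range resp.Kvs {
    fmt.Printf("%s=%s\n", kv.Key, kv.Value)
}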


Branching

Branches fork a database from an existing S3 checkpoint without copying shared SST files.

# Register the branch against the source prefix.
checkpoint_key=$(t4 branch fork \
  --s3-bucket my-bucket \
  --s3-prefix t4/ \
  --branch-id experiment)

# Start the branch in its own prefix, using the source prefix as its ancestor.
t4 run \
  --data-dir /var/lib/t4-experiment \
  --listen 0.0.0.0:3379 \
  --s3-bucket my-bucket \
  --s3-prefix t4-experiment/ \
  --branch-prefix t4/ \
  --branch-checkpoint "$checkpoint_key"

When the branch is retired, remove its registry entry so future GC can reclaim unneeded source objects:

t4 branch unfork --s3-bucket my-bucket --s3-prefix t4/ --branch-id experiment

Documentation

Full documentation is available at t4db.github.io/t4.

  • Getting Started: Quickstart for standalone server and embedded Go library
  • API Reference: Full Go API — methods, types, errors, branching
  • Configuration: All config fields and CLI flags
  • Operations: Multi-node clusters, S3, TLS, authentication, RBAC, observability
  • Backup and Restore: Checkpoints, point-in-time restore, branching, retention
  • Security: TLS, mTLS, client auth, RBAC setup
  • Recipes: Distributed locks, service discovery, common patterns
  • Kubernetes: Helm chart, StatefulSet deployment
  • Docker Compose: Local, MinIO-backed, and multi-node cluster examples
  • Architecture: Internals — WAL, checkpoints, leader election, replication
  • Benchmarks: T4 vs etcd benchmark results and analysis
  • Migrating from etcd: Compatibility table and migration steps
  • Troubleshooting: Diagnostics, debug logging, and common fixes
  • FAQ: Frequently asked questions

Documentation

Overview

Package t4 provides an embeddable, S3-durable key-value store.

Index

Constants

This section is empty.

Variables

var (
	ErrKeyExists = errors.New("t4: key already exists")
	ErrNotLeader = errors.New("t4: this node is not the leader; writes are rejected")
	ErrClosed    = errors.New("t4: node is closed")
	ErrCompacted = errors.New("t4: required revision has been compacted")
)

Sentinel errors.

Functions

func Fork

func Fork(ctx context.Context, sourceStore object.Store, branchID string) (checkpointKey string, err error)

Fork registers a new branch in sourceStore under branchID, pinning the latest checkpoint so GC will not delete its SST files. It returns the checkpoint key to use as BranchPoint.CheckpointKey when starting the branch node.

The branch is forked from the latest committed checkpoint revision in sourceStore. If you need a branch at an older revision, pass that checkpoint's index key directly to BranchPoint.CheckpointKey and call checkpoint.RegisterBranch yourself.

Call Fork before starting the branch node. When the branch is decommissioned, call Unfork to allow GC to reclaim the protected SSTs.
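
For example, forking and starting a branch node end to end might look like this sketch; the bucket, prefixes, and branch ID are placeholders, ctx is your context, and object.NewS3StoreFromConfig is the constructor shown in the README:

// Register the branch against the source store and pin its latest checkpoint.
src, err := object.NewS3StoreFromConfig(ctx, object.S3Config{
    Bucket: "my-bucket",
    Prefix: "t4/",
    Region: "us-east-1",
})
if err != nil {
    return err
}
ckpt, err := t4.Fork(ctx, src, "experiment")
if err != nil {
    return err
}

// The branch node writes to its own prefix; shared SSTs stay in the source prefix.
dst, err := object.NewS3StoreFromConfig(ctx, object.S3Config{
    Bucket: "my-bucket",
    Prefix: "t4-experiment/",
    Region: "us-east-1",
})
if err != nil {
    return err
}

branch, err := t4.Open(t4.Config{
    DataDir:       "/var/lib/t4-experiment",
    ObjectStore:   dst,
    AncestorStore: src,
    BranchPoint: &t4.BranchPoint{
        SourceStore:   src,
        CheckpointKey: ckpt,
    },
})
if err != nil {
    return err
}
defer branch.Close()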

func Unfork

func Unfork(ctx context.Context, sourceStore object.Store, branchID string) error

Unfork removes the branch registry entry from sourceStore. Call when the branch is decommissioned to allow GC to reclaim SSTs that are no longer referenced by any live checkpoint.

Types

type BranchPoint

type BranchPoint struct {
	// SourceStore is the object store of the source node.
	SourceStore object.Store
	// CheckpointKey is the v2 checkpoint index key in SourceStore
	// (e.g. "checkpoint/0001/0000000000000000100/manifest.json").
	CheckpointKey string
}

BranchPoint describes a source checkpoint from which a new branch node should bootstrap. Unlike RestorePoint, it does not require S3 versioning — SST files are protected by registering the branch in the source store via checkpoint.RegisterBranch before starting the node.

On first boot (local data directory does not exist), the node downloads the source checkpoint's SST files and Pebble metadata, then writes its own checkpoint to Config.ObjectStore. Subsequent restarts use local disk only.

type Config

type Config struct {

	// ReadConsistency controls the consistency guarantee for reads served
	// through the etcd adapter.
	// Default: ReadConsistencyLinearizable (etcd-compatible; free for
	// leaders and single-node deployments since the sync is a no-op).
	ReadConsistency ReadConsistency

	// DataDir is the directory used for local Pebble data and WAL segments.
	// Required.
	DataDir string

	// ObjectStore is used to archive WAL segments and checkpoints and to run
	// leader election. If nil the node runs in single-node mode.
	ObjectStore object.Store

	// RestorePoint, if set, causes the node to bootstrap from a specific
	// point in time on first boot rather than reading the latest checkpoint
	// from ObjectStore. See RestorePoint for details.
	RestorePoint *RestorePoint

	// BranchPoint, if set, causes the node to bootstrap from a specific source
	// checkpoint on first boot. Unlike RestorePoint, it does not require S3
	// versioning. The source store's SST files are shared until the branch node
	// creates its own checkpoints and compacts away the inherited data.
	// Ignored on subsequent restarts (when local data directory already exists).
	BranchPoint *BranchPoint

	// AncestorStore is the object store of the source node, set for branch nodes.
	// When non-nil, checkpoint.Write skips uploading SST files already present in
	// AncestorStore and records them as AncestorSSTFiles instead.
	AncestorStore object.Store

	// SegmentMaxSize is the byte threshold that triggers WAL segment rotation.
	// Default: 50 MB.
	SegmentMaxSize int64

	// SegmentMaxAge is the time threshold that triggers WAL segment rotation
	// and, when WALSyncUpload is false, the maximum interval between async S3
	// uploads. Default: 10 s.
	SegmentMaxAge time.Duration

	// WALSyncUpload controls whether WAL segments are uploaded to S3
	// synchronously before a write is acknowledged in single-node mode.
	//
	// true (default): each write blocks until its WAL segment is durably in
	// S3. Safe even if local disk is ephemeral (e.g. emptyDir in Kubernetes).
	//
	// false: uploads happen asynchronously every SegmentMaxAge. Write latency
	// is much lower, but up to SegmentMaxAge of acknowledged writes can be lost
	// if local storage is destroyed before the upload completes. Use this when
	// local storage is already durable (e.g. a PVC).
	//
	// Has no effect in multi-node mode; quorum ACK provides durability without
	// blocking on S3, so uploads are always async there.
	WALSyncUpload *bool

	// CheckpointInterval controls how often the leader writes a checkpoint.
	// Default: 15 minutes.
	CheckpointInterval time.Duration

	// CheckpointEntries triggers a checkpoint after this many WAL entries
	// regardless of time. 0 means disabled.
	CheckpointEntries int64

	// NodeID is a stable, unique identifier for this node.
	// Defaults to the machine hostname.
	NodeID string

	// PeerListenAddr is the address on which the peer WAL-streaming gRPC
	// server listens (e.g. "0.0.0.0:3380"). Empty → single-node mode.
	PeerListenAddr string

	// AdvertisePeerAddr is the address followers use to reach this node's peer
	// server. Defaults to PeerListenAddr.
	AdvertisePeerAddr string

	// LeaderWatchInterval is how often the leader reads the lock from S3 to
	// detect if it has been superseded. Read-only; no renewals.
	// Default: 5 minutes.
	LeaderWatchInterval time.Duration

	// FollowerMaxRetries is the number of consecutive stream failures a follower
	// tolerates before attempting a TakeOver election.
	// Default: 5.
	FollowerMaxRetries int

	// FollowerWaitMode controls how many follower ACKs the leader waits for
	// before applying a batch to Pebble and acknowledging it to the client.
	// Default: FollowerWaitQuorum.
	FollowerWaitMode FollowerWaitMode

	// PeerBufferSize is the number of WAL entries the leader buffers for
	// follower catch-up. Default: 10 000.
	PeerBufferSize int

	// PeerServerTLS is the transport credentials used by the leader's peer
	// gRPC server. Nil means plaintext (only safe inside a trusted network).
	PeerServerTLS credentials.TransportCredentials

	// PeerClientTLS is the transport credentials used by a follower's peer
	// gRPC client. Must be set when PeerServerTLS is set on the leader.
	PeerClientTLS credentials.TransportCredentials

	// Logger, if set, receives all t4 log output.  When nil the global
	// logrus logger is used (logrus.StandardLogger()).
	//
	// In embedded mode, supply your own Logger to control destination, level,
	// and format independently from t4's internals.  This also silences the
	// noisy Pebble storage-engine messages (WAL recovery, compaction) which
	// are downgraded to DEBUG by the built-in Pebble adapter – so setting
	// your logger's level to INFO or above is enough to hide them.
	//
	// *logrus.Logger already implements Logger.  For other libraries:
	//
	//	// Silence all t4 output:
	//	cfg.Logger = t4.NoopLogger
	//
	//	// Logrus at warn level:
	//	l := logrus.New()
	//	l.SetLevel(logrus.WarnLevel)
	//	cfg.Logger = l
	Logger Logger

	// MetricsRegisterer is the Prometheus registerer used to register all
	// t4 metrics. When nil, prometheus.DefaultRegisterer is used.
	// Pass a *prometheus.Registry to isolate t4 metrics from the global
	// registry (useful when embedding t4 in applications that manage their
	// own Prometheus registries). The first registration also sets the gatherer
	// served by Node.ServeMetrics when the registerer implements
	// prometheus.Gatherer; otherwise /metrics falls back to
	// prometheus.DefaultGatherer.
	MetricsRegisterer prometheus.Registerer
}

Config holds all configuration for a Node.
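
For illustration, a sketch of a Config for the leader of a small multi-node deployment; the node ID, addresses, and thresholds are placeholders, and store is an object.Store built as in the README example:

cfg := t4.Config{
    DataDir:     "/var/lib/t4",
    ObjectStore: store, // WAL archiving, checkpoints, and leader election

    // Multi-node: listen for follower WAL streams and advertise a reachable address.
    NodeID:            "node-a",
    PeerListenAddr:    "0.0.0.0:3380",
    AdvertisePeerAddr: "node-a.internal:3380",

    // WAL rotation and checkpointing cadence.
    SegmentMaxSize:     50 << 20, // 50 MB
    SegmentMaxAge:      10 * time.Second,
    CheckpointInterval: 15 * time.Minute,

    // Wait for a majority of followers before acknowledging writes.
    FollowerWaitMode: t4.FollowerWaitQuorum,

    // Reads served through the etcd adapter honor each request's Serializable flag.
    ReadConsistency: t4.ReadConsistencyLinearizable,
}

node, err := t4.Open(cfg)
if err != nil {
    return err
}
defer node.Close()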

type Event

type Event struct {
	Type   EventType
	KV     *KeyValue
	PrevKV *KeyValue // nil for creates
}

Event is a single watch notification.

type EventType

type EventType int

EventType classifies a watch event.

const (
	EventPut    EventType = iota // create or update
	EventDelete                  // deletion
)

type FollowerWaitMode

type FollowerWaitMode string

FollowerWaitMode controls how many follower ACKs a leader waits for before applying a batch locally and acknowledging the write to the client.

const (
	// FollowerWaitNone skips follower ACK waiting entirely.
	FollowerWaitNone FollowerWaitMode = "none"

	// FollowerWaitQuorum waits for a majority of the cluster. Since the leader
	// already has the entry durably in its WAL, this means waiting for enough
	// followers to reach quorum with the leader included.
	FollowerWaitQuorum FollowerWaitMode = "quorum"

	// FollowerWaitAll waits for every connected follower present when the batch
	// starts waiting.
	FollowerWaitAll FollowerWaitMode = "all"
)

type KeyValue

type KeyValue struct {
	Key            string
	Value          []byte
	Revision       int64
	CreateRevision int64
	PrevRevision   int64
	Lease          int64
}

KeyValue is a versioned key-value pair.

type Logger added in v0.13.1

type Logger interface {
	Debugf(format string, args ...interface{})
	Infof(format string, args ...interface{})
	Warnf(format string, args ...interface{})
	Errorf(format string, args ...interface{})
}

Logger is the logging interface used by t4. Any leveled logger can be adapted to satisfy it with a small wrapper; *logrus.Logger already does.

The interface is intentionally narrow: only formatted, leveled methods are required. Structured fields and writer-level operations are not part of the contract so that callers are not forced into logrus-specific types.

var NoopLogger Logger = noopLogger{}

NoopLogger discards all log output. Useful in tests or when the embedding application manages its own logging at a higher layer.
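
For instance, a small adapter can bridge the standard library's log/slog to this interface; a sketch (slogAdapter is not part of this module):

type slogAdapter struct{ l *slog.Logger }

func (a slogAdapter) Debugf(format string, args ...interface{}) { a.l.Debug(fmt.Sprintf(format, args...)) }
func (a slogAdapter) Infof(format string, args ...interface{})  { a.l.Info(fmt.Sprintf(format, args...)) }
func (a slogAdapter) Warnf(format string, args ...interface{})  { a.l.Warn(fmt.Sprintf(format, args...)) }
func (a slogAdapter) Errorf(format string, args ...interface{}) { a.l.Error(fmt.Sprintf(format, args...)) }

// Usage:
//	cfg.Logger = slogAdapter{l: slog.Default()}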

type Node

type Node struct {
	// contains filtered or unexported fields
}

func Open

func Open(cfg Config) (*Node, error)

Open creates and starts a Node.

func (*Node) Close

func (n *Node) Close() error

Close shuts down the node cleanly.

func (*Node) Compact

func (n *Node) Compact(ctx context.Context, revision int64) error

Compact removes log entries at or below revision.

func (*Node) CompactRevision

func (n *Node) CompactRevision() int64

CompactRevision returns the revision through which log entries have been compacted.

func (*Node) Config

func (n *Node) Config() Config

Config returns the configuration the node was opened with.

func (*Node) Count

func (n *Node) Count(prefix string) (int64, error)

Count returns the number of keys with the given prefix.

func (*Node) Create

func (n *Node) Create(ctx context.Context, key string, value []byte, lease int64) (int64, error)

Create creates key only if it does not already exist.

func (*Node) CurrentRevision

func (n *Node) CurrentRevision() int64

CurrentRevision returns the latest revision applied by this node.

func (*Node) Delete

func (n *Node) Delete(ctx context.Context, key string) (int64, error)

Delete removes key unconditionally.

func (*Node) DeleteIfRevision

func (n *Node) DeleteIfRevision(ctx context.Context, key string, revision int64) (int64, *KeyValue, bool, error)

DeleteIfRevision deletes key only if its current revision matches (CAS).

func (*Node) Get

func (n *Node) Get(key string) (*KeyValue, error)

Get returns the key-value pair for key, served from the local store.

func (*Node) HandleForward

func (n *Node) HandleForward(ctx context.Context, req *peer.ForwardRequest) (*peer.ForwardResponse, error)

HandleForward implements peer.ForwardHandler. Called by the peer gRPC server when a follower forwards a write. Dispatches to the appropriate Node method. Since HandleForward runs on the leader, all write methods execute directly.

func (*Node) IsLeader

func (n *Node) IsLeader() bool

IsLeader reports whether this node is currently the leader.

func (*Node) LinearizableCount

func (n *Node) LinearizableCount(ctx context.Context, prefix string) (int64, error)

LinearizableCount returns the count of keys with the given prefix with linearizability guaranteed.

func (*Node) LinearizableGet

func (n *Node) LinearizableGet(ctx context.Context, key string) (*KeyValue, error)

LinearizableGet returns the value for key with linearizability guaranteed. On a follower it syncs to the leader's revision before serving locally.

func (*Node) LinearizableList

func (n *Node) LinearizableList(ctx context.Context, prefix string) ([]*KeyValue, error)

LinearizableList returns all keys with the given prefix with linearizability guaranteed.

func (*Node) List

func (n *Node) List(prefix string) ([]*KeyValue, error)

List returns all key-value pairs with the given prefix, served from the local store.

func (*Node) Put

func (n *Node) Put(ctx context.Context, key string, value []byte, lease int64) (int64, error)

Put creates or updates key with value. Returns the new revision.

func (*Node) ReadConsistency

func (n *Node) ReadConsistency() ReadConsistency

ReadConsistency returns the configured read consistency mode.

func (*Node) Txn added in v0.15.0

func (n *Node) Txn(ctx context.Context, req TxnRequest) (TxnResponse, error)

Txn executes an atomic multi-key transaction.

All Conditions are evaluated under the write lock in a single atomic step. If every condition is satisfied the Success ops are applied and Succeeded is true; otherwise the Failure ops are applied and Succeeded is false. Either branch may be empty — if no ops need to be written the method returns immediately with the current revision.

All write ops within the selected branch share a single revision and are committed to the WAL in one entry, ensuring crash-safe atomicity.
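
As an example of the request shape, a sketch of a claim-if-absent transaction (the key and value are placeholders):

resp, err := node.Txn(ctx, t4.TxnRequest{
    Conditions: []t4.TxnCondition{{
        Key:     "/jobs/42/owner",
        Target:  t4.TxnCondVersion,
        Result:  t4.TxnCondEqual,
        Version: 0, // Version 0 means the key does not exist yet
    }},
    // Key absent: claim it.
    Success: []t4.TxnOp{{Type: t4.TxnPut, Key: "/jobs/42/owner", Value: []byte("worker-1")}},
    // Key present: do nothing (an empty else branch is allowed).
    Failure: nil,
})
if err != nil {
    return err
}
if resp.Succeeded {
    fmt.Println("claimed at revision", resp.Revision)
}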

func (*Node) Update

func (n *Node) Update(ctx context.Context, key string, value []byte, revision, lease int64) (int64, *KeyValue, bool, error)

Update updates key only if its current revision matches (CAS).
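
A typical optimistic-concurrency loop built from Get and Update; a sketch assuming the key already exists (casUpdate and mutate are illustrative names, not part of the API):

// casUpdate retries until the value is replaced without a concurrent writer
// sneaking in between the read and the write.
func casUpdate(ctx context.Context, node *t4.Node, key string, mutate func([]byte) []byte) error {
    for {
        kv, err := node.Get(key)
        if err != nil {
            return err
        }
        _, _, ok, err := node.Update(ctx, key, mutate(kv.Value), kv.Revision, kv.Lease)
        if err != nil {
            return err
        }
        if ok {
            return nil // revision matched, write applied
        }
        // Someone else updated the key first; re-read and try again.
    }
}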

func (*Node) WaitForRevision

func (n *Node) WaitForRevision(ctx context.Context, rev int64) error

WaitForRevision blocks until the node has applied at least revision rev, or ctx is done.

func (*Node) Watch

func (n *Node) Watch(ctx context.Context, prefix string, startRev int64) (<-chan Event, error)

Watch streams prefix-matching events using etcd revision semantics: startRev=0 means "from now"; startRev=N means replay from revision N (inclusive).
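
For example, a consumer that records the last revision it has processed can resume the stream after a restart; a sketch (handle and lastRev are placeholders for application logic and its persistence):

// Resume from the next unprocessed revision so no events are missed across restarts.
events, err := node.Watch(ctx, "/config/", lastRev)
if err != nil {
    return err
}
for e := range events {
    handle(e)                   // application logic
    lastRev = e.KV.Revision + 1 // next resume point
}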

type PinnedObject

type PinnedObject struct {
	Key       string
	VersionID string
}

PinnedObject identifies a specific version of an object in object storage.

type ReadConsistency

type ReadConsistency string

ReadConsistency controls the consistency guarantee for read operations served by the etcd adapter. It acts as a server-side override on top of the per-request Serializable flag sent by etcd clients.

const (
	// ReadConsistencyLinearizable (default) respects each request's Serializable
	// flag: linearizable requests use the ReadIndex pattern (follower syncs to the
	// leader's revision before serving); serializable requests are served locally
	// without any leader contact.
	ReadConsistencyLinearizable ReadConsistency = "linearizable"

	// ReadConsistencySerializable forces all reads to be served from the local
	// Pebble store, bypassing the ReadIndex sync even when the client requests
	// linearizability. Reads are fast (~450 ns on a single node) and scale
	// horizontally, but a follower may return data that is slightly behind the
	// leader. Choose this when throughput and horizontal read scaling matter more
	// than strict linearizability (e.g., when each API server has a dedicated
	// t4 leader).
	ReadConsistencySerializable ReadConsistency = "serializable"
)

type RestorePoint

type RestorePoint struct {

	// Store is the versioned object store to read pinned objects from.
	// It may use a different prefix than Config.ObjectStore (e.g. to read
	// from the source branch while writing to a new branch prefix).
	Store object.VersionedStore

	// CheckpointArchive is the pinned checkpoint archive object.
	CheckpointArchive PinnedObject

	// WALSegments are the WAL segments to replay after the checkpoint,
	// in ascending sequence order.
	WALSegments []PinnedObject
}

RestorePoint describes a precise point in time from which a node should bootstrap. When set in Config, the node restores the checkpoint and replays the listed WAL segments using their pinned S3 version IDs, rather than reading the latest objects from its own prefix.

This enables point-in-time restore, blue/green deployments, and copy-free forking: the source data is read directly from S3 by version ID — no objects are copied to the new prefix.

The node's own ObjectStore prefix is used for all subsequent writes after startup. RestorePoint is only applied on first boot (when the local data directory does not yet exist); it is ignored on subsequent restarts.

S3 versioning must be enabled on the source bucket.
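
Wiring a restore point into Config looks roughly like this sketch; versionedStore stands in for an object.VersionedStore over the source prefix, and checkpointKey, checkpointVersionID, and walSegments are placeholders taken from your own checkpoint and WAL inventory:

node, err := t4.Open(t4.Config{
    DataDir:     "/var/lib/t4-restore",
    ObjectStore: store, // new prefix; receives all writes after startup
    RestorePoint: &t4.RestorePoint{
        Store:             versionedStore, // versioned view of the source prefix
        CheckpointArchive: t4.PinnedObject{Key: checkpointKey, VersionID: checkpointVersionID},
        WALSegments:       walSegments, // []t4.PinnedObject, ascending sequence order
    },
})
if err != nil {
    return err
}
defer node.Close()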

type TxnCondResult added in v0.15.0

type TxnCondResult uint8

TxnCondResult is the comparison operator for a TxnCondition.

const (
	TxnCondEqual    TxnCondResult = iota
	TxnCondNotEqual               //nolint:deadcode
	TxnCondGreater
	TxnCondLess
)

type TxnCondTarget added in v0.15.0

type TxnCondTarget uint8

TxnCondTarget identifies which field of a key's metadata is compared.

const (
	TxnCondMod     TxnCondTarget = iota // compare ModRevision
	TxnCondVersion                      // compare Version (write count; 0 = does not exist)
	TxnCondCreate                       // compare CreateRevision
	TxnCondValue                        // compare Value bytes
	TxnCondLease                        // compare Lease ID
)

type TxnCondition added in v0.15.0

type TxnCondition struct {
	Key    string
	Target TxnCondTarget
	Result TxnCondResult
	// Exactly one of the following is used, depending on Target:
	ModRevision    int64
	CreateRevision int64
	Version        int64 // number of writes to key; 0 means key does not exist
	Value          []byte
	Lease          int64
}

TxnCondition is one predicate in a transaction's If clause. All conditions in a TxnRequest are evaluated atomically under the write lock; no revision can change between evaluating the first and last condition.

type TxnOp added in v0.15.0

type TxnOp struct {
	Type  TxnOpType
	Key   string
	Value []byte
	Lease int64
}

TxnOp is one write operation in a transaction's Then or Else branch.

type TxnOpType added in v0.15.0

type TxnOpType uint8

TxnOpType identifies the kind of operation within a transaction branch.

const (
	TxnPut    TxnOpType = iota // upsert
	TxnDelete                  // unconditional delete
)

type TxnRequest added in v0.15.0

type TxnRequest struct {
	Conditions []TxnCondition
	Success    []TxnOp
	Failure    []TxnOp
}

TxnRequest is the input to Node.Txn. If all Conditions are met the Success ops are applied atomically; otherwise the Failure ops are applied atomically (may be empty for a read-only else).

type TxnResponse added in v0.15.0

type TxnResponse struct {
	Succeeded   bool                // true if all Conditions were satisfied
	Revision    int64               // revision assigned to the write, or current revision if no-op
	DeletedKeys map[string]struct{} // set of keys actually removed by the txn's write ops
}

TxnResponse is returned by Node.Txn.

Directories

Path Synopsis
bench
cmd/t4bench command
t4bench is a standalone load-generator that speaks the etcd v3 protocol.
cmd
t4 command
Command t4 runs a T4 node and exposes it as an etcd v3 gRPC endpoint.
etcd
Package etcd exposes a t4 Node as an etcd v3 gRPC server.
auth
Package auth implements etcd-compatible authentication and RBAC for T4.
internal
checkpoint
Package checkpoint handles creating, writing, and restoring Pebble snapshots to/from object storage.
election
Package election implements S3-based leader election.
metrics
Package metrics defines Prometheus metrics for a t4 node.
peer
Package peer implements the leader→follower WAL streaming gRPC service, plus write forwarding (follower→leader).
store
Package store implements the Pebble-backed key-value state machine.
wal
Package wal implements the write-ahead log.
pkg
object
Package object provides a small interface for object storage operations used by T4 (WAL archive, checkpoints, manifest).
