Documentation ¶
Overview ¶
Package t4 provides an embeddable, S3-durable key-value store.
Index ¶
- Variables
- func Fork(ctx context.Context, sourceStore object.Store, branchID string) (checkpointKey string, err error)
- func Unfork(ctx context.Context, sourceStore object.Store, branchID string) error
- type BranchPoint
- type Config
- type Event
- type EventType
- type FollowerWaitMode
- type KeyValue
- type Logger
- type Node
- func (n *Node) Close() error
- func (n *Node) Compact(ctx context.Context, revision int64) error
- func (n *Node) CompactRevision() int64
- func (n *Node) Config() Config
- func (n *Node) Count(prefix string) (int64, error)
- func (n *Node) Create(ctx context.Context, key string, value []byte, lease int64) (int64, error)
- func (n *Node) CurrentRevision() int64
- func (n *Node) Delete(ctx context.Context, key string) (int64, error)
- func (n *Node) DeleteIfRevision(ctx context.Context, key string, revision int64) (int64, *KeyValue, bool, error)
- func (n *Node) Get(key string) (*KeyValue, error)
- func (n *Node) HandleForward(ctx context.Context, req *peer.ForwardRequest) (*peer.ForwardResponse, error)
- func (n *Node) IsLeader() bool
- func (n *Node) LinearizableCount(ctx context.Context, prefix string) (int64, error)
- func (n *Node) LinearizableGet(ctx context.Context, key string) (*KeyValue, error)
- func (n *Node) LinearizableList(ctx context.Context, prefix string) ([]*KeyValue, error)
- func (n *Node) List(prefix string) ([]*KeyValue, error)
- func (n *Node) Put(ctx context.Context, key string, value []byte, lease int64) (int64, error)
- func (n *Node) ReadConsistency() ReadConsistency
- func (n *Node) Txn(ctx context.Context, req TxnRequest) (TxnResponse, error)
- func (n *Node) Update(ctx context.Context, key string, value []byte, revision, lease int64) (int64, *KeyValue, bool, error)
- func (n *Node) WaitForRevision(ctx context.Context, rev int64) error
- func (n *Node) Watch(ctx context.Context, prefix string, startRev int64) (<-chan Event, error)
- type PinnedObject
- type ReadConsistency
- type RestorePoint
- type TxnCondResult
- type TxnCondTarget
- type TxnCondition
- type TxnOp
- type TxnOpType
- type TxnRequest
- type TxnResponse
Constants ¶
This section is empty.
Variables ¶
var (
	ErrKeyExists = errors.New("t4: key already exists")
	ErrNotLeader = errors.New("t4: this node is not the leader; writes are rejected")
	ErrClosed    = errors.New("t4: node is closed")
	ErrCompacted = errors.New("t4: required revision has been compacted")
)
Sentinel errors.
Functions ¶
func Fork ¶
func Fork(ctx context.Context, sourceStore object.Store, branchID string) (checkpointKey string, err error)
Fork registers a new branch in sourceStore under branchID, pinning the latest checkpoint so GC will not delete its SST files. It returns the checkpoint key to use as BranchPoint.CheckpointKey when starting the branch node.
The branch is forked from the latest committed checkpoint revision in sourceStore. If you need a branch at an older revision, pass that checkpoint's index key directly to BranchPoint.CheckpointKey and call checkpoint.RegisterBranch yourself.
Call Fork before starting the branch node. When the branch is decommissioned, call Unfork to allow GC to reclaim the protected SSTs.
Types ¶
type BranchPoint ¶
type BranchPoint struct {
// SourceStore is the object store of the source node.
SourceStore object.Store
// CheckpointKey is the v2 checkpoint index key in SourceStore
// (e.g. "checkpoint/0001/0000000000000000100/manifest.json").
CheckpointKey string
}
BranchPoint describes a source checkpoint from which a new branch node should bootstrap. Unlike RestorePoint, it does not require S3 versioning — SST files are protected by registering the branch in the source store via checkpoint.RegisterBranch before starting the node.
On first boot (local data directory does not exist), the node downloads the source checkpoint's SST files and Pebble metadata, then writes its own checkpoint to Config.ObjectStore. Subsequent restarts use local disk only.
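Wiring a branch node together might look like the following sketch. The field names come from this page; `srcStore`, `branchStore`, and `key` are placeholders (`key` would be the value returned by Fork):

```go
cfg := t4.Config{
	DataDir:       "/var/lib/t4-branch", // fresh dir → first boot → BranchPoint applies
	ObjectStore:   branchStore,          // the branch's own prefix
	AncestorStore: srcStore,             // lets checkpoint.Write skip inherited SSTs
	BranchPoint: &t4.BranchPoint{
		SourceStore:   srcStore,
		CheckpointKey: key,
	},
}
```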
type Config ¶
type Config struct {
// ReadConsistency controls the consistency guarantee for reads served
// through the etcd adapter.
// Default: ReadConsistencyLinearizable (etcd-compatible; free for
// leaders and single-node deployments since the sync is a no-op).
ReadConsistency ReadConsistency
// DataDir is the directory used for local Pebble data and WAL segments.
// Required.
DataDir string
// ObjectStore is used to archive WAL segments and checkpoints and to run
// leader election. If nil the node runs in single-node mode.
ObjectStore object.Store
// RestorePoint, if set, causes the node to bootstrap from a specific
// point in time on first boot rather than reading the latest checkpoint
// from ObjectStore. See RestorePoint for details.
RestorePoint *RestorePoint
// BranchPoint, if set, causes the node to bootstrap from a specific source
// checkpoint on first boot. Unlike RestorePoint, it does not require S3
// versioning. The source store's SST files are shared until the branch node
// creates its own checkpoints and compacts away the inherited data.
// Ignored on subsequent restarts (when local data directory already exists).
BranchPoint *BranchPoint
// AncestorStore is the object store of the source node, set for branch nodes.
// When non-nil, checkpoint.Write skips uploading SST files already present in
// AncestorStore and records them as AncestorSSTFiles instead.
AncestorStore object.Store
// SegmentMaxSize is the byte threshold that triggers WAL segment rotation.
// Default: 50 MB.
SegmentMaxSize int64
// SegmentMaxAge is the time threshold that triggers WAL segment rotation
// and, when WALSyncUpload is false, the maximum interval between async S3
// uploads. Default: 10 s.
SegmentMaxAge time.Duration
// WALSyncUpload controls whether WAL segments are uploaded to S3
// synchronously before a write is acknowledged in single-node mode.
//
// true (default): each write blocks until its WAL segment is durably in
// S3. Safe even if local disk is ephemeral (e.g. emptyDir in Kubernetes).
//
// false: uploads happen asynchronously every SegmentMaxAge. Write latency
// is much lower, but up to SegmentMaxAge of acknowledged writes can be lost
// if local storage is destroyed before the upload completes. Use this when
// local storage is already durable (e.g. a PVC).
//
// Has no effect in multi-node mode; quorum ACK provides durability without
// blocking on S3, so uploads are always async there.
WALSyncUpload *bool
// CheckpointInterval controls how often the leader writes a checkpoint.
// Default: 15 minutes.
CheckpointInterval time.Duration
// CheckpointEntries triggers a checkpoint after this many WAL entries
// regardless of time. 0 means disabled.
CheckpointEntries int64
// NodeID is a stable, unique identifier for this node.
// Defaults to the machine hostname.
NodeID string
// PeerListenAddr is the address on which the peer WAL-streaming gRPC
// server listens (e.g. "0.0.0.0:3380"). Empty → single-node mode.
PeerListenAddr string
// AdvertisePeerAddr is the address followers use to reach this node's peer
// server. Defaults to PeerListenAddr.
AdvertisePeerAddr string
// LeaderWatchInterval is how often the leader reads the lock from S3 to
// detect if it has been superseded. Read-only; no renewals.
// Default: 5 minutes.
LeaderWatchInterval time.Duration
// FollowerMaxRetries is the number of consecutive stream failures a follower
// tolerates before attempting a TakeOver election.
// Default: 5.
FollowerMaxRetries int
// FollowerWaitMode controls how many follower ACKs the leader waits for
// before applying a batch to Pebble and acknowledging it to the client.
// Default: FollowerWaitQuorum.
FollowerWaitMode FollowerWaitMode
// PeerBufferSize is the number of WAL entries the leader buffers for
// follower catch-up. Default: 10 000.
PeerBufferSize int
// PeerServerTLS is the transport credentials used by the leader's peer
// gRPC server. Nil means plaintext (only safe inside a trusted network).
PeerServerTLS credentials.TransportCredentials
// PeerClientTLS is the transport credentials used by a follower's peer
// gRPC client. Must be set when PeerServerTLS is set on the leader.
PeerClientTLS credentials.TransportCredentials
// Logger, if set, receives all t4 log output. When nil the global
// logrus logger is used (logrus.StandardLogger()).
//
// In embedded mode, supply your own Logger to control destination, level,
// and format independently from t4's internals. This also silences the
// noisy Pebble storage-engine messages (WAL recovery, compaction) which
// are downgraded to DEBUG by the built-in Pebble adapter – so setting
// your logger's level to INFO or above is enough to hide them.
//
// *logrus.Logger already implements Logger. For other libraries:
//
// // Silence all t4 output:
// cfg.Logger = t4.NoopLogger
//
// // Logrus at warn level:
// l := logrus.New()
// l.SetLevel(logrus.WarnLevel)
// cfg.Logger = l
Logger Logger
// MetricsRegisterer is the Prometheus registerer used to register all
// t4 metrics. When nil, prometheus.DefaultRegisterer is used.
// Pass a *prometheus.Registry to isolate t4 metrics from the global
// registry (useful when embedding t4 in applications that manage their
// own Prometheus registries). The first registration also sets the gatherer
// served by Node.ServeMetrics when the registerer implements
// prometheus.Gatherer; otherwise /metrics falls back to
// prometheus.DefaultGatherer.
MetricsRegisterer prometheus.Registerer
}
Config holds all configuration for a Node.
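As a sketch, a minimal multi-node configuration might look like this; `store` is a placeholder `object.Store`, and every field shown is documented above:

```go
cfg := t4.Config{
	DataDir:         "/var/lib/t4",    // required
	ObjectStore:     store,            // nil → single-node mode
	PeerListenAddr:  "0.0.0.0:3380",   // empty → single-node mode
	NodeID:          "node-a",         // defaults to hostname
	ReadConsistency: t4.ReadConsistencyLinearizable,
}
```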
type FollowerWaitMode ¶
type FollowerWaitMode string
FollowerWaitMode controls how many follower ACKs a leader waits for before applying a batch locally and acknowledging the write to the client.
const (
	// FollowerWaitNone skips follower ACK waiting entirely.
	FollowerWaitNone FollowerWaitMode = "none"
	// FollowerWaitQuorum waits for a majority of the cluster. Since the leader
	// already has the entry durably in its WAL, this means waiting for enough
	// followers to reach quorum with the leader included.
	FollowerWaitQuorum FollowerWaitMode = "quorum"
	// FollowerWaitAll waits for every connected follower present when the batch
	// starts waiting.
	FollowerWaitAll FollowerWaitMode = "all"
)
type KeyValue ¶
type KeyValue struct {
Key string
Value []byte
Revision int64
CreateRevision int64
PrevRevision int64
Lease int64
}
KeyValue is a versioned key-value pair.
type Logger ¶ added in v0.13.1
type Logger interface {
Debugf(format string, args ...interface{})
Infof(format string, args ...interface{})
Warnf(format string, args ...interface{})
Errorf(format string, args ...interface{})
}
Logger is the logging interface used by t4. Any leveled logger can be adapted to satisfy it with a small wrapper; *logrus.Logger already does.
The interface is intentionally narrow: only formatted, leveled methods are required. Structured fields and writer-level operations are not part of the contract so that callers are not forced into logrus-specific types.
var NoopLogger Logger = noopLogger{}
NoopLogger discards all log output. Useful in tests or when the embedding application manages its own logging at a higher layer.
type Node ¶
type Node struct {
// contains filtered or unexported fields
}
func (*Node) CompactRevision ¶
func (n *Node) CompactRevision() int64
func (*Node) CurrentRevision ¶
func (n *Node) CurrentRevision() int64
func (*Node) DeleteIfRevision ¶
func (n *Node) DeleteIfRevision(ctx context.Context, key string, revision int64) (int64, *KeyValue, bool, error)
DeleteIfRevision deletes key only if its current revision matches (CAS).
func (*Node) HandleForward ¶
func (n *Node) HandleForward(ctx context.Context, req *peer.ForwardRequest) (*peer.ForwardResponse, error)
HandleForward implements peer.ForwardHandler. Called by the peer gRPC server when a follower forwards a write. Dispatches to the appropriate Node method. Since HandleForward runs on the leader, all write methods execute directly.
func (*Node) LinearizableCount ¶
func (n *Node) LinearizableCount(ctx context.Context, prefix string) (int64, error)
LinearizableCount returns the count of keys with the given prefix with linearizability guaranteed.
func (*Node) LinearizableGet ¶
func (n *Node) LinearizableGet(ctx context.Context, key string) (*KeyValue, error)
LinearizableGet returns the value for key with linearizability guaranteed. On a follower it syncs to the leader's revision before serving locally.
func (*Node) LinearizableList ¶
func (n *Node) LinearizableList(ctx context.Context, prefix string) ([]*KeyValue, error)
LinearizableList returns all keys with the given prefix with linearizability guaranteed.
func (*Node) ReadConsistency ¶
func (n *Node) ReadConsistency() ReadConsistency
ReadConsistency returns the configured read consistency mode.
func (*Node) Txn ¶ added in v0.15.0
func (n *Node) Txn(ctx context.Context, req TxnRequest) (TxnResponse, error)
Txn executes an atomic multi-key transaction.
All Conditions are evaluated under the write lock in a single atomic step. If every condition is satisfied the Success ops are applied and Succeeded is true; otherwise the Failure ops are applied and Succeeded is false. Either branch may be empty — if no ops need to be written the method returns immediately with the current revision.
All write ops within the selected branch share a single revision and are committed to the WAL in one entry, ensuring crash-safe atomicity.
type PinnedObject ¶
PinnedObject identifies a specific version of an object in object storage.
type ReadConsistency ¶
type ReadConsistency string
ReadConsistency controls the consistency guarantee for read operations served by the etcd adapter. It acts as a server-side override on top of the per-request Serializable flag sent by etcd clients.
const (
	// ReadConsistencyLinearizable (default) respects each request's Serializable
	// flag: linearizable requests use the ReadIndex pattern (follower syncs to the
	// leader's revision before serving); serializable requests are served locally
	// without any leader contact.
	ReadConsistencyLinearizable ReadConsistency = "linearizable"
	// ReadConsistencySerializable forces all reads to be served from the local
	// Pebble store, bypassing the ReadIndex sync even when the client requests
	// linearizability. Reads are fast (~450 ns on a single node) and scale
	// horizontally, but a follower may return data that is slightly behind the
	// leader. Choose this when throughput and horizontal read scaling matter more
	// than strict linearizability (e.g., when each API server has a dedicated
	// t4 leader).
	ReadConsistencySerializable ReadConsistency = "serializable"
)
type RestorePoint ¶
type RestorePoint struct {
// Store is the versioned object store to read pinned objects from.
// It may use a different prefix than Config.ObjectStore (e.g. to read
// from the source branch while writing to a new branch prefix).
Store object.VersionedStore
// CheckpointArchive is the pinned checkpoint archive object.
CheckpointArchive PinnedObject
// WALSegments are the WAL segments to replay after the checkpoint,
// in ascending sequence order.
WALSegments []PinnedObject
}
RestorePoint describes a precise point in time from which a node should bootstrap. When set in Config, the node restores the checkpoint and replays the listed WAL segments using their pinned S3 version IDs, rather than reading the latest objects from its own prefix.
This enables point-in-time restore, blue/green deployments, and copy-free forking: the source data is read directly from S3 by version ID — no objects are copied to the new prefix.
The node's own ObjectStore prefix is used for all subsequent writes after startup. RestorePoint is only applied on first boot (when the local data directory does not yet exist); it is ignored on subsequent restarts.
S3 versioning must be enabled on the source bucket.
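Assembled into a Config, a point-in-time restore might look like this sketch; `versionedSrc`, `ckpt`, and `segs` are placeholders for a VersionedStore and PinnedObjects built elsewhere:

```go
cfg := t4.Config{
	DataDir:     "/var/lib/t4-restored", // must not exist yet (first boot)
	ObjectStore: newStore,               // new prefix for subsequent writes
	RestorePoint: &t4.RestorePoint{
		Store:             versionedSrc, // source bucket, versioning enabled
		CheckpointArchive: ckpt,         // PinnedObject (key + version ID)
		WALSegments:       segs,         // ascending sequence order
	},
}
```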
type TxnCondResult ¶ added in v0.15.0
type TxnCondResult uint8
TxnCondResult is the comparison operator for a TxnCondition.
const (
	TxnCondEqual TxnCondResult = iota
	TxnCondNotEqual //nolint:deadcode
	TxnCondGreater
	TxnCondLess
)
type TxnCondTarget ¶ added in v0.15.0
type TxnCondTarget uint8
TxnCondTarget identifies which field of a key's metadata is compared.
const (
	TxnCondMod     TxnCondTarget = iota // compare ModRevision
	TxnCondVersion                      // compare Version (write count; 0 = does not exist)
	TxnCondCreate                       // compare CreateRevision
	TxnCondValue                        // compare Value bytes
	TxnCondLease                        // compare Lease ID
)
type TxnCondition ¶ added in v0.15.0
type TxnCondition struct {
Key string
Target TxnCondTarget
Result TxnCondResult
// Exactly one of the following is used, depending on Target:
ModRevision int64
CreateRevision int64
Version int64 // number of writes to key; 0 means key does not exist
Value []byte
Lease int64
}
TxnCondition is one predicate in a transaction's If clause. All conditions in a TxnRequest are evaluated atomically under the write lock; no revision can change between evaluating the first and last condition.
type TxnOpType ¶ added in v0.15.0
type TxnOpType uint8
TxnOpType identifies the kind of operation within a transaction branch.
type TxnRequest ¶ added in v0.15.0
type TxnRequest struct {
Conditions []TxnCondition
Success []TxnOp
Failure []TxnOp
}
TxnRequest is the input to Node.Txn. If all Conditions are met the Success ops are applied atomically; otherwise the Failure ops are applied atomically (may be empty for a read-only else).
type TxnResponse ¶ added in v0.15.0
type TxnResponse struct {
Succeeded bool // true if all Conditions were satisfied
Revision int64 // revision assigned to the write, or current revision if no-op
DeletedKeys map[string]struct{} // set of keys actually removed by the txn's write ops
}
TxnResponse is returned by Node.Txn.
Directories ¶

| Path | Synopsis |
|---|---|
| bench | |
| cmd/t4bench (command) | t4bench is a standalone load-generator that speaks the etcd v3 protocol. |
| cmd | |
| t4 (command) | Command t4 runs a T4 node and exposes it as an etcd v3 gRPC endpoint. |
| etcd | Package etcd exposes a t4 Node as an etcd v3 gRPC server. |
| auth | Package auth implements etcd-compatible authentication and RBAC for T4. |
| internal | |
| checkpoint | Package checkpoint handles creating, writing, and restoring Pebble snapshots to/from object storage. |
| election | Package election implements S3-based leader election. |
| metrics | Package metrics defines Prometheus metrics for a t4 node. |
| peer | Package peer implements the leader→follower WAL streaming gRPC service, plus write forwarding (follower→leader). |
| store | Package store implements the Pebble-backed key-value state machine. |
| wal | Package wal implements the write-ahead log. |
| pkg | |
| object | Package object provides a small interface for object storage operations used by T4 (WAL archive, checkpoints, manifest). |