Raftly

A production-quality Raft consensus implementation in Go, written as a failure-simulation laboratory rather than a tutorial.

Most distributed systems guides teach you how things work when they go right. This project teaches what happens when they go wrong: network partitions, WAL corruption, stale leaders, split-brain. Each failure mode is based on a real incident. Each one is reproducible on a laptop in under a minute.


Table of Contents

  1. What This Is
  2. Architecture
  3. What I Deliberately Broke
  4. Failure Scenarios
  5. Benchmarks
  6. Implementation Highlights
  7. Running Locally
  8. Dashboard (UI)
  9. Docker Deployment
  10. References

1. What This Is

Raftly is a from-scratch Raft implementation that prioritizes observability and fault injection over performance. The goal is to understand, at the code level, why distributed consensus is hard.

What it includes:

  • Full Raft consensus: leader election, log replication, pre-vote, fast log backtracking
  • Write-ahead log with CRC32 torn-write detection
  • In-memory transport with a chaos injection plane (network partitions, packet loss, delay)
  • gRPC transport for real multi-node deployment
  • A replicated key-value HTTP API (PUT / GET / DELETE)
  • Prometheus metrics + Grafana dashboards
  • Four reproducible failure scenarios based on real production incidents
  • Benchmark suite measuring WAL throughput, proposal latency, and leader election time

What it is not:

  • A replacement for etcd or Consul
  • A complete snapshot/compaction implementation
  • A membership change (joint consensus) implementation

2. Architecture

Architecture Diagram

Packages:

| Package | Responsibility |
| --- | --- |
| raft/ | Core state machine: election, replication, log, WAL, config |
| transport/ | Transport abstraction: gRPC for real clusters, in-memory for tests |
| chaos/ | Fault injection: crash, partition, packet loss, delay |
| scenarios/ | Reproducible failure scenarios with pass/fail assertions |
| server/ | HTTP API, KVStore state machine, Prometheus metrics |
| tests/ | Integration tests and benchmarks |
| proto/ | Protobuf definitions for RequestVote, PreVote, AppendEntries |

Data flow — proposal to commit:

client → PUT /keys/x
    → KVStore.HandlePut()
        → RaftNode.Propose(data)
        → WAL.SaveEntry() + WAL.Sync()
        → log.Append()
        → maybeSendHeartbeats()   ← triggers immediate AppendEntries, not next tick
        → wait on resultCh
            ← AppendEntries reaches quorum
            ← CommitTo(index) called
            ← notifyProposal(index)
        → return index, term
        → read from CommitCh
        → apply to data map
→ 200 OK

3. What I Deliberately Broke

This section is the point of the project.

Each failure below corresponds to a real incident. For each one, I:

  1. Wrote code that exhibits the failure mode
  2. Observed the failure
  3. Implemented the defense
  4. Added a chaos scenario that reproduces it on demand

Network Partition → Split Brain (AWS, 2011)

What happens: A network partition isolates the leader. The majority partition elects a new leader. For a window of time, two nodes believe they are the leader.

The defense: Raft's quorum requirement. An isolated leader cannot commit new entries because it cannot reach a majority. Its proposals sit in the log, unacknowledged. When the partition heals, the old leader discovers a higher term and immediately steps down. Its uncommitted entries are overwritten.
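
The quorum arithmetic is small enough to show. A sketch (field names follow the Raft paper, not necessarily this repo's raft/ package):

```go
package sketch

// leaderState holds just enough to illustrate the quorum check.
type leaderState struct {
	peers       []string
	matchIndex  map[string]uint64 // highest log index known replicated on each peer
	entryTerms  []uint64          // term of each log entry; position i holds index i+1
	commitIndex uint64
	currentTerm uint64
}

// maybeCommit advances commitIndex to the highest index stored on a majority.
// An isolated leader gets no acks, so acks never exceeds clusterSize/2 and its
// entries stay uncommitted until a newer leader overwrites them.
func (l *leaderState) maybeCommit() {
	clusterSize := len(l.peers) + 1
	for idx := uint64(len(l.entryTerms)); idx > l.commitIndex; idx-- {
		acks := 1 // the leader always has its own entry
		for _, p := range l.peers {
			if l.matchIndex[p] >= idx {
				acks++
			}
		}
		// Raft also requires the entry to be from the current term.
		if acks > clusterSize/2 && l.entryTerms[idx-1] == l.currentTerm {
			l.commitIndex = idx
			return
		}
	}
}
```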

Reproduce it:

make scenario NAME=split-brain-2011

Leader Isolation Without Lease (etcd, 2018)

What happens: A leader is partitioned but its local clock hasn't expired. It still believes it is the leader and accepts read requests, returning stale data to clients.

The observation: Without a lease mechanism, the node continues serving reads from its local state machine. GET /keys/x returns the value from before the partition.
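
The standard defense, which this scenario deliberately leaves out, is a leader lease: serve local reads only while a quorum has recently acknowledged a heartbeat. A hypothetical sketch of that guard:

```go
package sketch

import (
	"errors"
	"time"
)

// Hypothetical lease guard; Raftly omits this on purpose so the scenario can
// demonstrate the stale read.
type node struct {
	store           map[string]string
	lastQuorumAck   time.Time // last time a heartbeat round reached a majority
	electionTimeout time.Duration
}

var errLeaseExpired = errors.New("lease expired: leadership not guaranteed")

// linearizableGet refuses to serve a local read once the lease has lapsed.
// A partitioned leader lands in the error branch instead of returning stale data.
func (n *node) linearizableGet(key string) (string, error) {
	if time.Since(n.lastQuorumAck) > n.electionTimeout {
		return "", errLeaseExpired
	}
	return n.store[key], nil
}
```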

Reproduce it:

make scenario NAME=leader-isolation-write-loss

WAL Torn Write

What happens: A process crashes mid-write(). The OS flushes part of a log record. On restart, the partial record has a valid length but invalid CRC32.

The defense: Every WAL record is suffixed with a CRC32 checksum over the record data. WAL.ReadAll() stops at the first checksum mismatch and truncates the file at that boundary. All records before the torn write are recovered. All records after are discarded.
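
A sketch of that recovery loop against the record format described in section 6 (little-endian and the function names are assumptions of this sketch, not Raftly's actual ReadAll):

```go
package sketch

import (
	"encoding/binary"
	"hash/crc32"
	"io"
	"os"
)

// recoverWAL scans records until EOF or the first CRC mismatch, then truncates
// the file at the last good boundary.
func recoverWAL(path string) ([][]byte, error) {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var records [][]byte
	var good int64 // offset just past the last valid record
	for {
		var hdr [4]byte
		if _, err := io.ReadFull(f, hdr[:]); err != nil {
			break // clean EOF, or a torn length prefix
		}
		n := binary.LittleEndian.Uint32(hdr[:])
		buf := make([]byte, n+4) // data plus the CRC32 suffix
		if _, err := io.ReadFull(f, buf); err != nil {
			break // torn write: record shorter than its length claims
		}
		data, sum := buf[:n], binary.LittleEndian.Uint32(buf[n:])
		if crc32.ChecksumIEEE(data) != sum {
			break // torn write: length intact, checksum wrong
		}
		records = append(records, data)
		good += int64(4 + len(buf))
	}
	// Discard everything after the last valid record.
	return records, f.Truncate(good)
}
```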

Reproduce it:

make scenario NAME=wal-torn-write

Stale Log Elected Leader (etcd, 2018)

What happens: A partitioned follower misses 10 log entries. The partition heals. The partitioned node starts an election because its election timer fires first.

Without the election restriction: A node with a log at index 10 could be elected over a node at index 20, causing committed entries (11–20) to be overwritten. This is data loss.

The defense: Raft's election restriction. A voter rejects any RequestVote from a candidate whose log is less up-to-date than the voter's own. The stale node cannot win a majority and cannot become leader.
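
The check is two comparisons. A sketch, following the paper's definition of "up-to-date" (parameter names are mine):

```go
// candidateUpToDate reports whether a candidate's log is at least as
// up-to-date as the voter's: the higher last term wins; equal terms compare
// log length. A voter rejects RequestVote when this returns false, so the
// stale node in this scenario can never assemble a majority.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```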

Reproduce it:

make scenario NAME=stale-log-elected-leader

4. Failure Scenarios

Run all four scenarios:

make scenarios

Run one by name:

make scenario NAME=split-brain-2011
make scenario NAME=leader-isolation-write-loss
make scenario NAME=wal-torn-write
make scenario NAME=stale-log-elected-leader

Each scenario prints a structured result:

=== TestChaosSplitBrain ===
PASS
Summary:
  old_leader_term:   3
  new_leader_term:   4
  entries_overwritten: 5
  observation: old leader stepped down at t=318ms

| Scenario | Real Incident | What It Tests |
| --- | --- | --- |
| split-brain-2011 | AWS US-East-1 EBS (April 2011) | Quorum prevents committed data loss on partition |
| leader-isolation-write-loss | etcd network partition (2018) | Isolated leader cannot commit without quorum |
| wal-torn-write | Any process crash mid-fsync | CRC32 detects torn writes; recovery stops at boundary |
| stale-log-elected-leader | etcd stale read bug (2018) | Election restriction blocks stale nodes from leadership |

5. Benchmarks

make bench

Results on a MacBook Pro M3 (single-machine in-memory transport):

BenchmarkWALWriteSync-12          2,706,794 ns/op    23.65 MB/s   (2.7ms per fsync)
BenchmarkWALBatchWrite-12         2,381,029 ns/op   268.80 MB/s   (2.4ms per 10-entry batch)
BenchmarkLogAppend-12                 82.81 ns/op                  (in-memory only)
BenchmarkSingleNodePropose-12     2,275,550 ns/op                  (2.3ms, no peers)
BenchmarkThreeNodePropose-12      6,615,942 ns/op                  (6.6ms, 2 AppendEntries RTTs)
BenchmarkConcurrentPropose/c1     6,453,177 ns/op                  (baseline)
BenchmarkConcurrentPropose/c4     4,140,956 ns/op                  (36% faster — group commit)
BenchmarkConcurrentPropose/c16    3,823,897 ns/op                  (41% faster)
BenchmarkConcurrentPropose/c64    3,481,429 ns/op                  (46% faster)
BenchmarkLatencyPercentiles-12    p50=6,730µs  p95=9,903µs  p99=12,411µs
BenchmarkLeaderElectionTime-12    155 ms/election

Key observations:

WAL is the bottleneck. BenchmarkLogAppend (82 ns) vs BenchmarkWALWriteSync (2.7 ms): the in-memory log is roughly 32,000× faster than the durable one. Every Propose() call pays one fsync. This is the correct tradeoff for a system that claims durability.

Concurrent proposals get faster, not slower. The naive prediction was that proposals would serialize on disk I/O and throughput would stay flat. The opposite happened: c64 is 46% faster than c1. The reason is opportunistic group commit. When 64 goroutines call Propose() simultaneously, they pile up waiting for the mutex. The leader grabs them all, sends one AppendEntries to both followers, and does one fsync — then unblocks all 64 callers. This is the same effect that makes Kafka's batching so effective.
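
The shape of that group commit, sketched with a channel for clarity (the real code batches behind the propose mutex as described above; all names here are hypothetical):

```go
package sketch

type proposal struct {
	data []byte
	done chan struct{}
}

type wal interface {
	Append(data []byte)
	Sync() // the expensive fsync
}

// committer drains whatever proposals piled up while the previous fsync was
// in flight, then pays a single Sync for the whole batch. More concurrency
// means bigger batches and fewer fsyncs per proposal, which is why c64 beats c1.
func committer(proposals <-chan proposal, w wal) {
	for p := range proposals {
		batch := []proposal{p}
	drain: // opportunistically absorb everything already waiting
		for {
			select {
			case q := <-proposals:
				batch = append(batch, q)
			default:
				break drain
			}
		}
		for _, b := range batch {
			w.Append(b.data)
		}
		w.Sync() // one fsync amortized across the batch
		for _, b := range batch {
			close(b.done) // unblock every caller at once
		}
	}
}
```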

Tail latency is bounded. p99 (12.4ms) is 1.85× the p50 (6.7ms). That ratio is healthy: in production Raft clusters, a p99/p50 ratio above 10× is a warning sign. The tail here is bounded by an occasional heartbeat delay before the final commit acknowledgment arrives.

Leader election takes one election timeout. 155ms ≈ the 150ms base election timeout. That means the first follower whose timer fired won in a single round — no vote split, no retry. Real clusters add network RTT and occasionally retry (300–500ms is typical).


6. Implementation Highlights

Pre-Vote

Before starting a real election (which increments the term and disrupts the cluster), a candidate first runs a dry-run election: it sends PreVote RPCs with term+1 but does not persist the higher term. Only if a quorum responds positively does the real election begin.

This prevents the common failure mode where a partitioned follower's timer fires repeatedly, incrementing its term to 100 while the cluster is at term 4. When it rejoins, it would force an unnecessary election even though a stable leader exists.
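
In sketch form (the interface and method names are assumptions; the real RPC lives in proto/):

```go
// prevoter stands in for a peer that answers PreVote probes.
type prevoter interface {
	PreVote(term, lastLogIndex, lastLogTerm uint64) bool
}

type candidate struct {
	peers       []prevoter
	currentTerm uint64
	lastIndex   uint64
	lastTerm    uint64
}

// maybeStartElection probes with currentTerm+1 without adopting or persisting
// it. A partitioned node keeps failing the probe, so its term never runs away
// from the cluster's.
func (c *candidate) maybeStartElection(startElection func()) {
	votes := 1 // vote for self
	for _, p := range c.peers {
		// PreVote carries term+1, but c.currentTerm itself is NOT changed.
		if p.PreVote(c.currentTerm+1, c.lastIndex, c.lastTerm) {
			votes++
		}
	}
	if votes > (len(c.peers)+1)/2 {
		startElection() // real election: only now increment and persist the term
	}
}
```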

Fast Log Backtracking

Standard Raft retries AppendEntries one entry at a time on conflict — O(N) round-trips to catch up a lagging follower. This implementation uses the optimization from §5.3 of the Raft paper: the follower includes ConflictTerm and ConflictIndex in its rejection response. The leader skips back an entire term in one RTT.
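
Both halves of the optimization, sketched with assumed names:

```go
type appendReply struct {
	Success       bool
	ConflictTerm  uint64 // term of the conflicting entry; 0 if the log is too short
	ConflictIndex uint64 // first index of ConflictTerm, or lastIndex+1
}

// Follower side: instead of a bare "no", report which term conflicts and
// where that term begins.
func rejectAppend(prevIndex, lastIndex uint64, termAt, firstIndexOfTerm func(uint64) uint64) appendReply {
	if prevIndex > lastIndex {
		return appendReply{ConflictIndex: lastIndex + 1} // follower log too short
	}
	t := termAt(prevIndex)
	return appendReply{ConflictTerm: t, ConflictIndex: firstIndexOfTerm(t)}
}

// Leader side: one rejection lets the leader skip a whole term of entries.
func nextIndexAfterReject(r appendReply, leaderLastIndexOfTerm func(uint64) (uint64, bool)) uint64 {
	if i, ok := leaderLastIndexOfTerm(r.ConflictTerm); ok {
		return i + 1 // leader also has that term: resend from just past its end
	}
	return r.ConflictIndex // leader lacks the term entirely: skip over it
}
```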

Heartbeat-Driven Commit

When Propose() appends an entry, it immediately calls maybeSendHeartbeats() rather than waiting for the next heartbeat tick (50ms). This dropped single-node proposal latency from ~50ms to ~2.3ms — a 22× improvement. Without it, every proposal waited up to one full heartbeat interval before followers even knew about the new entry.
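
As a fragment, reusing the call names from the section 2 trace (the node type and exact signatures are assumptions):

```go
// Sketch of the propose path. Without the final call, the entry would sit in
// the leader's log until the next 50ms heartbeat tick before any follower
// heard about it.
func (n *node) propose(data []byte) {
	n.wal.SaveEntry(data) // durable first
	n.wal.Sync()          // the fsync measured in the benchmarks
	n.log.Append(data)
	n.maybeSendHeartbeats() // immediate AppendEntries: ~2.3ms instead of ~50ms
}
```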

WAL Format
[ 4 bytes: record length ][ N bytes: data ][ 4 bytes: CRC32 of data ]

Two record types: walRecordState (currentTerm + votedFor, written on every term change) and walRecordEntry (one log entry). On startup, ReadAll() replays records in order. A CRC32 mismatch stops replay at that record and truncates the file there. All records before the torn write are restored. The node rejoins the cluster and catches up via AppendEntries.
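
The matching write path is a sketch away (little-endian is an assumption here, as in the recovery sketch in section 3):

```go
package sketch

import (
	"encoding/binary"
	"hash/crc32"
	"io"
)

// writeRecord emits one record in the format above: 4-byte length, data,
// 4-byte CRC32 of the data.
func writeRecord(w io.Writer, data []byte) error {
	var hdr, sum [4]byte
	binary.LittleEndian.PutUint32(hdr[:], uint32(len(data)))
	binary.LittleEndian.PutUint32(sum[:], crc32.ChecksumIEEE(data))
	for _, part := range [][]byte{hdr[:], data, sum[:]} {
		if _, err := w.Write(part); err != nil {
			return err
		}
	}
	return nil
}
```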

Transport Abstraction

Both GRPCTransport (real network) and InMemTransport (in-process) implement the same Transport interface. All chaos injection (partitions, packet loss, delay) is applied through a NetworkProxy that sits between the transport and its callers. Tests run against InMemTransport with a NetworkProxy in front, so chaos scenarios work identically in tests and in production.
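
A sketch of that shape (type and method names are assumptions):

```go
package sketch

import (
	"math/rand"
	"time"
)

// Message and Transport stand in for the real types.
type Message struct {
	From, To string
	Payload  []byte
}

type Transport interface {
	Send(to string, msg Message) error
}

// NetworkProxy wraps any Transport, so the same chaos rules apply to the
// in-memory transport in tests and the gRPC transport in deployment.
type NetworkProxy struct {
	inner    Transport
	blocked  map[string]bool // partition/isolate rules
	dropRate float64         // packet loss, 0.0 to 1.0
	delay    time.Duration   // added latency
}

func (p *NetworkProxy) Send(to string, msg Message) error {
	if p.blocked[to] || rand.Float64() < p.dropRate {
		return nil // drop silently, like a real lossy network
	}
	if p.delay > 0 {
		time.Sleep(p.delay) // simplistic: a real proxy would delay asynchronously
	}
	return p.inner.Send(to, msg)
}
```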


7. Running Locally

Prerequisites: Go 1.22+

git clone https://github.com/ani03sha/raftly
cd raftly

Build:

make build
# Output: bin/raftly-server

Run all tests:

make test

Run short tests only (skip slow chaos tests):

make test-short

Run benchmarks:

make bench
# Optional filter: make bench WAL

Run a single chaos scenario:

make scenario NAME=split-brain-2011

Run all four chaos scenarios:

make scenarios

Start a 3-node cluster manually:

Terminal 1:

./bin/raftly-server -id node1 -grpc-addr :7001 -http-addr :8001 \
  -peers node2=:7002,node3=:7003 \
  -http-peers node1=:8001,node2=:8002,node3=:8003

Terminal 2:

./bin/raftly-server -id node2 -grpc-addr :7002 -http-addr :8002 \
  -peers node1=:7001,node3=:7003 \
  -http-peers node1=:8001,node2=:8002,node3=:8003

Terminal 3:

./bin/raftly-server -id node3 -grpc-addr :7003 -http-addr :8003 \
  -peers node1=:7001,node2=:7002 \
  -http-peers node1=:8001,node2=:8002,node3=:8003

Write and read:

curl -X PUT http://localhost:8001/keys/hello -d '{"value":"world"}'
curl http://localhost:8001/keys/hello
curl http://localhost:8002/keys/hello   # follower proxies to leader

Check node status:

curl http://localhost:8001/status

8. Dashboard (UI)

The dashboard is a React app embedded directly in the server binary: no separate process, no build step at runtime. After starting the cluster, open http://localhost:8001 in a browser.

Layout

The UI is a fixed three-column viewport. Nothing scrolls at the page level; each column manages its own content.

Raftly UI


Left Column: Cluster State

Health strip at the top shows four pills that update every 500ms via SSE:

| Pill | What it shows |
| --- | --- |
| health | HEALTHY (green) · DEGRADED (amber) · PARTITIONED (red) · CRITICAL (red) |
| leader | Current leader node ID, or a placeholder when no leader is elected |
| term | Current Raft term across the cluster |
| nodes | Reachable nodes out of total, e.g. 3/3 |

Topology SVG renders all nodes as circles connected by edges:

| Color | Meaning |
| --- | --- |
| Green circle | Leader |
| Blue circle | Follower |
| Amber circle | Candidate (mid-election) |
| Red circle | Down or partitioned |
| Green edge | Healthy replication link (animated from leader) |
| Dashed red edge + ✕ | Partitioned link |
| Purple edge | Delayed link (label shows Nms) |
| Amber edge | Lossy link (label shows N%) |

Node cards show per-node state, term, and commit index. Cards turn red when a node is down or partitioned.

Replication log table shows the last 10 committed entries across all nodes. Green cells = committed; blue cells = replicated but not yet committed. A progress bar at the top shows commitIndex / lastIndex.


Center Column: Controls

Four tabs, each with an interactive top section and a contextual explainer below.

Scenarios tab

Five pre-built scenarios that automate chaos injection, observation, and healing:

| Scenario | What it does |
| --- | --- |
| Happy path | Writes 5 keys back-to-back; shows clean replication across all nodes |
| Leader crash | Isolates the leader; remaining majority elects a replacement within ~300ms |
| Minority partition | Isolates one follower; majority keeps committing; isolated node can't win election |
| Packet loss | Injects 40% loss on a follower; commits advance via retries, no election fires |
| Slow follower | Adds 300ms latency to a follower; leader uses fast peer for quorum |

To run: click a scenario tab → read the steps → click Run scenario. The button shows a spinner while running. All destructive scenarios auto-heal after the delay configured in Settings.

Chaos tab

Manual fault injection with four modes:

| Mode | What it does |
| --- | --- |
| Isolate | Blackholes all Raft traffic to/from the target node |
| Partition | Drops traffic between two specific nodes only; all other links stay healthy |
| Delay | Adds delay ± jitter milliseconds of latency to a node's Raft messages |
| Loss | Drops a configurable percentage of packets on a node (slider: 5%–95%) |

Click Heal all at the top of the panel to clear every rule on every node simultaneously. Active rules are shown at the bottom of the panel.

The chaos proxy intercepts gRPC (Raft RPC) traffic only. HTTP calls (KV ops, cluster API, status polling) always pass through — this is why Heal always works even when a node is fully isolated at the Raft level.

KV ops tab

Interact directly with the replicated key-value store:

  • PUT — proposes a write through Raft. The entry appears in the replication log as blue (replicated) then green (committed).
  • GET — reads from the leader's state machine, reflecting all prior commits (linearizable except in the leader-isolation case demonstrated in section 3).
  • DELETE — proposes a delete command through the same Raft pipeline as PUT.

If the node you're hitting is not the leader, it transparently proxies the request. The "Routed to leader" label at the top shows who handled it.

Settings tab

Adjust cluster parameters live — no restart required:

| Setting | Range | Effect |
| --- | --- | --- |
| Election timeout | 50–2000ms | How long followers wait before starting an election. Default: 150ms. Increasing this makes the cluster more stable; decreasing it makes elections fire faster under chaos. |
| Heartbeat interval | 10–500ms | How often the leader sends AppendEntries. Default: 50ms. Must stay well below the election timeout (rule of thumb: heartbeat ≤ timeout / 5). |
| Auto-heal delay | 2–30s | How long Scenarios wait before healing after injecting chaos. Increase this to give more time to observe leader elections. Default: 4s. |

Click Apply to all nodes to fan out changes to the entire cluster. The UI validates that heartbeat < election timeout before allowing apply. Click Reset to restore defaults.
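
The invariant behind that validation, as a hypothetical helper (the real check lives in the UI):

```go
import (
	"fmt"
	"time"
)

// validateTimings mirrors the UI's gate: heartbeat must stay below the
// election timeout, ideally by the 5x rule of thumb from the table above.
func validateTimings(heartbeat, electionTimeout time.Duration) error {
	if heartbeat >= electionTimeout {
		return fmt.Errorf("heartbeat %v must be below election timeout %v",
			heartbeat, electionTimeout)
	}
	if heartbeat*5 > electionTimeout {
		fmt.Printf("warning: heartbeat %v exceeds timeout/5; elections may fire under load\n",
			heartbeat)
	}
	return nil
}
```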


Right Column: Event Timeline

A live stream of cluster and local events, newest first. Events are color-coded:

| Color | Event type |
| --- | --- |
| Green | leader_change — new leader elected |
| Amber | term_change — term incremented |
| Red | node_down — node became unreachable |
| Blue | node_up — node recovered |
| Red (dim) | Local chaos injection |
| Purple | Local scenario step |
| Grey | Local KV operation |

Cluster events arrive over SSE from the server. Local events (chaos, KV, scenario steps) are generated client-side and merged into the same timeline.


9. Docker Deployment

Starts a 3-node cluster with the embedded dashboard, Prometheus metrics, and a Grafana dashboard — one command.

make docker-build
make docker-up

| Service | URL | Notes |
| --- | --- | --- |
| Dashboard (UI) | http://localhost:8001 | Open this — embedded React app |
| Node 1 API | http://localhost:8001 | Also accepts raw curl requests |
| Node 2 API | http://localhost:8002 | Followers proxy writes to the leader |
| Node 3 API | http://localhost:8003 | |
| Prometheus | http://localhost:9090 | |
| Grafana | http://localhost:3000 | admin/admin |

Grafana dashboard panels:

  • Current Raft term (all nodes)
  • Node state (follower=0, candidate=1, leader=2)
  • Committed log index (all nodes)
  • WAL fsync duration (histogram)
  • Replication lag per peer

Write to the cluster:

curl -X PUT http://localhost:8001/keys/foo -d '{"value":"bar"}'

Stop:

make docker-down

Tear down metrics:

docker compose -f docker/docker-compose.yml down -v

10. References

Papers:

  • In Search of an Understandable Consensus Algorithm — Diego Ongaro and John Ousterhout (USENIX ATC 2014)

Incident Reports:

  • Summary of the Amazon EC2 and Amazon EBS Service Disruption in the US East Region — AWS (April 2011)

Books:

  • Designing Data-Intensive Applications — Martin Kleppmann (Chapter 9: Consistency and Consensus)
  • Database Internals — Alex Petrov (Chapter 14: Consensus Protocols)


Built as a learning exercise to understand how production consensus systems fail — and how they recover.

