Distributed Systems
Raft: build and break a cluster
Reading the safety proof is not the same as watching a cluster truncate an uncommitted entry, halt a minority partition, and survive a node swap without losing a write. Stand up a small Raft cluster, drive it through the exact failure modes the unit described, and capture the evidence that each guarantee held.
Turn the unit’s mental model into a reproducible lab: replicate a log, force a leader election, prove a committed entry survives while an uncommitted one is discarded, demonstrate the CP halt under partition, and perform a safe membership change — each step backed by logs or metrics, not assertion.
Stand up a 3-node (then 5-node) Raft cluster — either a real one (embedded hashicorp/raft, etcd, or a from-scratch implementation) or a deterministic simulator — and demonstrate, with captured evidence, that it replicates a log, elects leaders correctly, preserves committed entries across crashes, halts the minority side under partition, and changes membership safely.
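If you take the deterministic-simulator route, the follower side of AppendEntries is the piece that makes truncation observable. The following is a minimal sketch under stated assumptions, not the code of hashicorp/raft or etcd: the `Entry` and `Follower` types, their field names, and the 1-based index convention are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Entry:
    term: int
    command: str


@dataclass
class Follower:
    # Hypothetical simulator state; log[0] holds Raft index 1.
    current_term: int = 0
    log: List[Entry] = field(default_factory=list)
    commit_index: int = 0

    def append_entries(self, term, prev_index, prev_term, entries, leader_commit):
        """Apply one AppendEntries request; return True if it was accepted."""
        if term < self.current_term:
            return False  # stale leader
        self.current_term = term
        # Consistency check: we must hold the entry just before the new ones.
        if prev_index > len(self.log):
            return False
        if prev_index > 0 and self.log[prev_index - 1].term != prev_term:
            return False
        for offset, entry in enumerate(entries):
            index = prev_index + 1 + offset
            if index <= len(self.log):
                if self.log[index - 1].term != entry.term:
                    # Conflict: drop this entry and everything after it.
                    # Only uncommitted entries may ever be truncated.
                    assert index > self.commit_index
                    del self.log[index - 1:]
                    self.log.append(entry)
                # Same term at same index: entry already present, skip it.
            else:
                self.log.append(entry)
        last_new = prev_index + len(entries)
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, last_new)
        return True


# Index 1 is committed; index 2 was written by a term-1 leader that crashed
# before replicating it to a majority.
f = Follower(current_term=1, log=[Entry(1, "x=1"), Entry(1, "x=2")], commit_index=1)
ok = f.append_entries(term=2, prev_index=1, prev_term=1,
                      entries=[Entry(2, "x=3")], leader_commit=1)
print(ok, f.log)
```

In the demo the committed entry at index 1 survives while the uncommitted entry at index 2 is replaced by the new leader's entry, which is the pair of behaviours the lab asks you to capture from a real log.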
- A scenario log or recording for each of: replicated write applied identically on all nodes; clean election with a new term; uncommitted entry truncated after leader crash; committed entry surviving the same crash; minority halt + majority progress under partition.
- Evidence — captured logs or metric panels — that the term increases monotonically, that commitIndex only advances once a majority's matchIndex covers an entry, and that no two nodes ever applied different commands at the same index.
- A membership-change transcript showing the cluster growing from 3 to 5 nodes one node at a time, with quorum size and tolerated-failure count updating correctly at each step and no period where two disjoint majorities could exist.
- A one-page write-up mapping each demonstrated behaviour back to the property that guarantees it: quorum overlap, Log Matching, Leader Completeness, the current-term commit rule, and the CP tradeoff.
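The arithmetic behind several of these deliverables can be checked in a few lines before you look at real metrics. This is a hedged sketch: the function names are hypothetical, `match_index` is assumed to include the leader's own last index, and `disjoint_majorities_possible` assumes the new configuration only adds nodes to the old one.

```python
from typing import Dict, List


def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority still remains."""
    return n - quorum(n)


def disjoint_majorities_possible(old_n: int, new_n: int) -> bool:
    """True if a majority of the old config and a majority of the new one
    could share no node. Assumes the new config is the old one plus
    (new_n - old_n) added nodes."""
    return quorum(old_n) + quorum(new_n) <= new_n


def advance_commit_index(commit_index: int, current_term: int,
                         log_terms: List[int],
                         match_index: Dict[str, int]) -> int:
    """Leader-side commit rule. log_terms[i - 1] is the term of entry i."""
    replicated = sorted(match_index.values(), reverse=True)
    # Highest index stored on at least a majority of nodes.
    majority_index = replicated[quorum(len(match_index)) - 1]
    # Current-term rule: only an entry from the leader's own term is
    # committed by counting replicas; earlier entries commit with it.
    if majority_index > commit_index and log_terms[majority_index - 1] == current_term:
        return majority_index
    return commit_index


for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
print(disjoint_majorities_possible(3, 4), disjoint_majorities_possible(3, 5))
```

Under these assumptions, jumping straight from 3 to 5 nodes admits two disjoint majorities (2 old nodes and 3 others), while each single-node step does not, which is the reason for changing membership one node at a time. The commit function also shows why a term-1 entry replicated on a majority stays uncommitted under a term-2 leader until a term-2 entry reaches a majority as well.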
- Inject a slow-disk fault: throttle the leader's fsync until its heartbeats go out later than the followers' election timeout, and capture the resulting election flapping; then show that moving the WAL to fast storage (or raising the election timeout as a stopgap) restores stability.
- Add snapshots: compact the log after N entries, take a follower far enough offline that the leader has compacted past its needs, and show InstallSnapshot bringing it back into sync — including the membership config inside the snapshot.
- Add pre-vote and reproduce the rejoining-node disruption with it on vs off: show a long-partitioned node triggering a spurious election without pre-vote, and having its pre-vote requests rejected, with the leader left undisturbed, when pre-vote is on.
- Add ReadIndex or lease reads and measure linearizable-read latency vs a naive no-op-commit read under a high read:write ratio; for lease reads, demonstrate the clock-skew failure mode by deliberately skewing one node's clock.
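For the pre-vote experiment, it helps to have a toy model of what you expect to see before running the real cluster. The sketch below is a simplified model under stated assumptions, not any library's implementation: `Peer`, `election_attempt`, and `run` are hypothetical names, and the rule that a peer refuses a pre-vote while it has recently heard from a leader is an assumption about how your implementation is configured.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Peer:
    term: int
    last_log_term: int
    last_log_index: int
    heard_from_leader_recently: bool

    def grants_pre_vote(self, proposed_term, cand_last_log_term, cand_last_log_index):
        if proposed_term < self.term:
            return False
        if self.heard_from_leader_recently:
            return False  # assumed rule: a live leader exists, so refuse
        # Candidate's log must be at least as up to date as ours.
        return (cand_last_log_term, cand_last_log_index) >= (
            self.last_log_term, self.last_log_index)


def election_attempt(node_term: int, node_log: Tuple[int, int],
                     peers: List[Peer], cluster_size: int, pre_vote: bool) -> int:
    """Model one election timeout on a node; return its term afterwards."""
    if pre_vote:
        grants = 1 + sum(p.grants_pre_vote(node_term + 1, *node_log) for p in peers)
        if grants < cluster_size // 2 + 1:
            return node_term  # pre-vote failed: term is left untouched
    return node_term + 1  # real election: the term is incremented


def run(pre_vote: bool) -> Tuple[int, bool]:
    term, log = 5, (5, 10)
    for _ in range(4):  # four timeouts while isolated, no peers reachable
        term = election_attempt(term, log, [], 3, pre_vote)
    # Partition heals: both peers are at term 5 with a live leader.
    peers = [Peer(5, 5, 20, True), Peer(5, 5, 20, True)]
    term = election_attempt(term, log, peers, 3, pre_vote)
    # A higher term reaching the cluster forces the leader to step down.
    leader_steps_down = term > max(p.term for p in peers)
    return term, leader_steps_down


print("pre-vote off:", run(False))
print("pre-vote on: ", run(True))
```

In this model the isolated node's term climbs with every timeout when pre-vote is off, so its return forces the healthy leader to step down; with pre-vote on, the term never moves and the cluster is undisturbed. The term graph from your real cluster should show the same two shapes.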
This is the lab that converts the proof into reflexes. Once you have watched a cluster truncate an uncommitted entry, keep a committed one across the same crash, halt a minority partition while the majority commits, and grow membership one node at a time without a split-brain window, each backed by your own captured logs and metrics, the safety argument stops being an abstraction. Map every behaviour back to the property that guarantees it, and you will diagnose the production incident (a slow disk, disabled pre-vote, a bypassed membership-change procedure) from the metrics in minutes instead of hours.