Raft in the real world: partitions, slow disks, and client routing

What Raft guarantees under partition (CP, not AP), how client writes reach the leader, and the three production failure modes — slow disk, network jitter, and clock drift — that cause most real incidents.

DIST Middle ◷ 12 min

A network partition splits your 5-node etcd cluster: 3 nodes in DC A, 2 nodes in DC B. Clients start throwing write errors on one side. Is that a bug? Or is it Raft working exactly as designed?

Partition behavior: the minority halts

When a partition isolates a minority of nodes (fewer than ⌊N/2⌋ + 1 of N), those nodes can neither elect a leader nor commit entries. They cycle through failed elections and return errors or timeouts to any client that connects to them. The majority side continues normally.

This is CP behavior in the CAP sense: Raft picks consistency over availability. The minority side refuses to serve rather than risk two concurrent leaders both committing conflicting writes.

Side	Nodes	Can elect?	Can commit?	Client sees
Majority (DC A)	3 of 5	Yes	Yes	Normal
Minority (DC B)	2 of 5	No	No	Errors / timeouts
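
The table follows from a single majority check. A minimal sketch, assuming every node is a voting member; the function names `quorum` and `side_status` are illustrative and not taken from any Raft library:

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of voters that forms a majority."""
    return cluster_size // 2 + 1


def side_status(reachable: int, cluster_size: int) -> dict:
    """What one side of a partition can do.

    `reachable` counts the voters on this side, including the node itself.
    """
    has_majority = reachable >= quorum(cluster_size)
    return {
        "can_elect": has_majority,
        "can_commit": has_majority,
        "client_sees": "normal" if has_majority else "errors / timeouts",
    }


if __name__ == "__main__":
    # The 3 + 2 split from the table above.
    for name, nodes in [("DC A", 3), ("DC B", 2)]:
        print(name, side_status(nodes, cluster_size=5))
```

Note that the check uses the configured cluster size, not the number of nodes currently alive, which is why a 2-node side of a 5-node cluster can never make progress on its own.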

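On partition heal: a stale leader on the minority side sees a higher term in the first message from the majority side, steps down to follower, updates its term, and catches up its log via AppendEntries. Any uncommitted entries it accepted while cut off are overwritten. Minority nodes that raised their own term through failed elections can instead force one extra election when they rejoin, but a node whose log is behind cannot win it. Either way, no committed data is lost and no split state survives.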
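
The catch-up step can be sketched as the follower side of AppendEntries. This is a simplified model under stated assumptions: logs are 0-indexed Python lists, there are no snapshots and no commit index, and the names `Entry`, `Node`, and `handle_append_entries` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    term: int
    command: str


@dataclass
class Node:
    current_term: int
    role: str = "follower"
    log: list = field(default_factory=list)  # list of Entry, index 0 is the first entry


def handle_append_entries(node, leader_term, prev_index, prev_term, entries):
    """Apply one AppendEntries request; return True if it was accepted.

    prev_index is the index of the entry just before `entries`
    (-1 when `entries` starts at the beginning of the log).
    """
    if leader_term < node.current_term:
        return False  # the sender is the stale one: reject

    # A current or newer leader exists: adopt its term and step down.
    node.current_term = leader_term
    node.role = "follower"

    # Consistency check: our log must contain the entry the leader builds on.
    if prev_index >= 0:
        if prev_index >= len(node.log) or node.log[prev_index].term != prev_term:
            return False  # leader will retry with an earlier prev_index

    # Drop any conflicting suffix, then append what is missing.
    for offset, entry in enumerate(entries):
        idx = prev_index + 1 + offset
        if idx < len(node.log):
            if node.log[idx].term != entry.term:
                del node.log[idx:]
                node.log.append(entry)
        else:
            node.log.append(entry)
    return True


if __name__ == "__main__":
    # Stale leader at term 4 holding one uncommitted entry ("y=2").
    a = Node(current_term=4, role="leader",
             log=[Entry(3, "x=1"), Entry(4, "y=2")])
    ok = handle_append_entries(a, leader_term=5, prev_index=0, prev_term=3,
                               entries=[Entry(5, "z=3")])
    print(ok, a.role, a.current_term, a.log)
```

The uncommitted term-4 entry is replaced by the term-5 entry, which is the "overwritten" case described above.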

Partition trace: leader in the minority

A more subtle scenario: the leader itself gets partitioned to the minority side.

Trace it

Trace a partition where the current leader is isolated on the minority side.

Step 1 of 5. 5-node cluster A, B, C, D, E. A is leader at term 4. Partition: A and B isolated from C, D, E.
Step 2 of 5. On the C, D, E side, what happens?
Step 3 of 5. State during partition: two 'leaders' exist?
Step 4 of 5. Partition heals. A's heartbeat reaches C.
Step 5 of 5. What did clients experience?
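
Whether two 'leaders' can both make progress in this trace comes down to the commit rule: an entry commits only once a majority of the full cluster stores it. A minimal sketch, assuming each leader tracks a match index per node; the log indexes 8, 10, and 11 are made up for illustration, and the sketch leaves out Raft's extra rule that a leader only commits entries from its own term this way.

```python
def commit_index(match_index: dict) -> int:
    """Highest log index stored on a majority of the cluster.

    `match_index` maps every node (leader included) to the highest index
    known to be replicated there. Unreachable nodes keep their last value.
    """
    majority = len(match_index) // 2 + 1
    ranked = sorted(match_index.values(), reverse=True)
    return ranked[majority - 1]


if __name__ == "__main__":
    # A's view at term 4: only A and B acknowledge new entries 9 and 10.
    print(commit_index({"A": 10, "B": 10, "C": 8, "D": 8, "E": 8}))
    # The view of a leader elected among C, D, E: three nodes acknowledge
    # index 11; a new leader starts with match index 0 for nodes it cannot reach.
    print(commit_index({"C": 11, "D": 11, "E": 11, "A": 0, "B": 0}))
```

In the first call the commit index stays where it was before the partition, so A can accept writes but never acknowledge them as committed.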

Client routing

Only the leader can commit writes. Client routing strategies:

Redirect: any follower that receives a write replies with the current leader’s address, and the client retries against that address. TiKV works this way; etcd and Consul servers instead forward the write to the leader on the client’s behalf.
Leader cache: clients cache the last known leader and go there directly; fall back to any node on failure.
Proxy: a load balancer tracks the leader via the cluster’s health API.

A redirect costs one extra round trip, typically 1–5 ms, and only on the rare miss after a leader change. In steady state, writes go directly to the leader.
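
The redirect and leader-cache strategies combine naturally in one client. The sketch below is an in-memory model, not a real client library: `FakeNode`, `NotLeader`, and `Client` are hypothetical names, and the "cluster" is just a shared dict recording who the leader is.

```python
class NotLeader(Exception):
    """Raised by a follower; carries the leader's address if it is known."""

    def __init__(self, leader_hint):
        super().__init__(f"not leader, try {leader_hint}")
        self.leader_hint = leader_hint


class FakeNode:
    """Stand-in for a node's write endpoint (hypothetical, in-memory)."""

    def __init__(self, addr, cluster):
        self.addr = addr
        self.cluster = cluster  # shared dict: {"leader": addr}
        self.writes = []

    def write(self, key, value):
        leader = self.cluster["leader"]
        if leader != self.addr:
            raise NotLeader(leader)  # redirect instead of serving
        self.writes.append((key, value))
        return "ok"


class Client:
    def __init__(self, nodes, max_attempts=3):
        self.nodes = nodes  # addr -> node
        self.cached_leader = None
        self.max_attempts = max_attempts
        self.redirects = 0

    def write(self, key, value):
        # Try the cached leader first; fall back to any node.
        target = self.cached_leader or next(iter(self.nodes))
        for _ in range(self.max_attempts):
            try:
                result = self.nodes[target].write(key, value)
                self.cached_leader = target  # remember who accepted the write
                return result
            except NotLeader as err:
                self.redirects += 1
                if err.leader_hint is None:
                    raise
                target = err.leader_hint
        raise RuntimeError("no leader found")


if __name__ == "__main__":
    cluster = {"leader": "n2"}
    nodes = {a: FakeNode(a, cluster) for a in ("n1", "n2", "n3")}
    client = Client(nodes)
    client.write("k", "v1")   # first write misses and is redirected once
    client.write("k", "v2")   # second write goes straight to the cached leader
    print(client.cached_leader, client.redirects)
```

A real client would also handle timeouts and an unreachable cached leader by trying another node, which this sketch leaves out.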

Read consistency: linearizable reads must go through the leader (via ReadIndex or lease — covered in the next lesson). Eventually-consistent reads can go to any follower. The application layer chooses per query.

Three production failure modes

When you get paged at 2 AM for “etcd cluster unstable,” most of the time it is one of three causes, and knowing which one narrows your fix from an hour of guessing to a five-minute metric lookup.

1. Slow disk fsync. Every committed entry requires at least one fsync on the leader and one on each acknowledging follower. On NVMe with battery-backed cache, fsync is 50–100 µs. On cloud volumes (EBS gp3, GCP balanced PD), it is typically 1–3 ms and can stall far longer under contention. If a stalled fsync blocks the leader long enough that heartbeats arrive later than the followers’ election timeout, the followers time out and start an election. If the new leader sits on the same class of storage, the cycle repeats. Fix: dedicated NVMe for the Raft WAL, never shared cloud volumes.

A cloud volume's typical fsync is roughly 10–60x the NVMe path (1–3 ms against 50–100 µs), and its stalls under contention are what delay heartbeats and trigger spurious elections, which is why the WAL belongs on dedicated NVMe.
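
To see which side of that gap a given volume falls on, a rough probe can time fsync directly. This is a sketch rather than a benchmark: the 4 KiB payload and sample count are arbitrary assumptions, and `fsync_latencies` and `percentile` are illustrative helper names.

```python
import math
import os
import tempfile
import time


def fsync_latencies(directory, samples=50, payload=b"x" * 4096):
    """Append `payload` and fsync `samples` times; return seconds per fsync."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the write to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Point this at the directory that would hold the WAL.
    lat = fsync_latencies(tempfile.gettempdir(), samples=200)
    print(f"p50 = {percentile(lat, 50) * 1e3:.3f} ms")
    print(f"p99 = {percentile(lat, 99) * 1e3:.3f} ms")
```

The p99 matters more than the median here, because a single long stall is enough to delay a round of heartbeats.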

2. Network jitter. A brief congestion or packet-loss event drops heartbeats and triggers an election, even though the cluster is mostly healthy. Writes are unavailable for roughly one election timeout (150–300 ms with the timeouts suggested in the Raft paper) for no lasting reason. Pre-vote (covered in the next lesson) mitigates this by requiring a candidate to win a dry-run vote before it increments its term.

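3. Clock drift on lease reads. A lease read is safe only while the leader’s lease expires before any follower’s election timeout does. If the leader’s clock runs slow relative to the followers’ clocks, it keeps treating its lease as valid after the followers have timed out and possibly elected a new leader, and it returns stale data as fresh. Bounded clock drift is a correctness requirement for lease reads, not just hygiene, so keep every node NTP-synced.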
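
One common way to stay safe is to shorten the lease by an assumed bound on clock drift. The sketch below is illustrative only: `max_drift_ratio` is a hypothetical tuning parameter (how much faster a follower's clock may run than the leader's), and the lease is measured from the moment the quorum-acknowledged heartbeat round was sent, on the leader's monotonic clock.

```python
import time


def lease_deadline(heartbeat_sent_at, election_timeout_s, max_drift_ratio):
    """Latest local time at which the leader may still serve lease reads.

    Dividing by (1 + max_drift_ratio) ends the lease early enough that a
    follower whose clock runs that much faster has not yet timed out.
    """
    return heartbeat_sent_at + election_timeout_s / (1.0 + max_drift_ratio)


def can_serve_lease_read(now, heartbeat_sent_at, election_timeout_s,
                         max_drift_ratio):
    """True while the drift-adjusted lease is still valid."""
    return now < lease_deadline(heartbeat_sent_at, election_timeout_s,
                                max_drift_ratio)


if __name__ == "__main__":
    sent = time.monotonic()  # monotonic clock: immune to wall-clock steps
    # ... heartbeat round acknowledged by a majority ...
    print(can_serve_lease_read(time.monotonic(), sent,
                               election_timeout_s=1.0, max_drift_ratio=0.01))
```

If the real drift exceeds the assumed bound, the lease check passes when it should not, which is exactly the stale-read failure described above.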

Quiz

A Raft cluster's leader disk fsync starts taking 2 seconds (instead of the normal 50 µs). What is the observable symptom, and why?

Quiz

Raft is described as CP, not AP. What does this mean in practice during a network partition?

Raft is CP: only the side that still holds a majority (3 of 5) can elect a leader and commit; the minority (2 of 5) refuses to serve rather than risk a second leader. On heal the minority steps down and catches up.

Recall before you leave

01 A 5-node Raft cluster has 2 nodes in DC A and 3 in DC B. The inter-DC link goes down for 5 minutes. What do clients connected to DC A experience?
02 Why is 'slow disk' on the leader worse than slow disk on a follower?
03 What is the correct fix for a Raft cluster that experiences elections every 30–60 seconds?

Recap

Raft is CP: under partition, the minority side refuses commits rather than risk split brain. The majority side continues normally; on heal, stale nodes catch up via the AppendEntries consistency check. Clients route writes to the leader, using redirect or a cached leader address. The three most common production failures are slow disk fsync on the leader (triggers elections by blocking heartbeats), network jitter (drops heartbeats spuriously), and clock drift (breaks lease-read correctness). Each has a known fix: dedicated NVMe, pre-vote, and NTP sync respectively. Now when you see an unexpected election in your monitoring, check the WAL fsync p99 first — more often than not, the culprit is a shared EBS volume, not a bug in the consensus code.
