
Distributed Systems

Raft in the real world: partitions, slow disks, and client routing

Crux: what Raft guarantees under partition (CP, not AP), how client writes reach the leader, and the three production failure modes (slow disk, network jitter, and clock drift) that cause most real incidents.

A network partition splits your 5-node etcd cluster: 3 nodes in DC A, 2 nodes in DC B. Clients start throwing write errors on one side. Is that a bug? Or is it Raft working exactly as designed?

Partition behavior: the minority halts

When a partition isolates a minority of nodes (fewer than ⌊N/2⌋ + 1 of N), those nodes can neither commit entries nor elect a leader. They cycle through failed elections and return errors to any client that connects to them. The majority side continues normally.

This is CP behavior in the CAP sense: Raft picks consistency over availability. The minority side refuses to serve rather than risk two concurrent leaders both committing conflicting writes.

Side | Nodes | Can elect? | Can commit? | Client sees
Majority (DC A) | 3 of 5 | Yes | Yes | Normal
Minority (DC B) | 2 of 5 | No | No | Errors / timeouts
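
The rows above follow from plain quorum arithmetic. A minimal sketch, assuming hypothetical helper names `majority` and `side_status` that do not come from any Raft library:

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a quorum."""
    return cluster_size // 2 + 1


def side_status(reachable: int, cluster_size: int) -> dict:
    """Describe what one side of a partition is able to do."""
    has_quorum = reachable >= majority(cluster_size)
    return {
        "nodes": f"{reachable} of {cluster_size}",
        "can_elect": has_quorum,
        "can_commit": has_quorum,
        "client_sees": "normal" if has_quorum else "errors / timeouts",
    }


if __name__ == "__main__":
    print(side_status(3, 5))  # DC A in the example above
    print(side_status(2, 5))  # DC B in the example above
```

The same arithmetic shows why a 4-node cluster tolerates only one failure, exactly like a 3-node cluster: its majority is 3.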

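On partition heal: a stale leader and its followers on the minority side see a higher term in the first message from the majority side, step down to follower, update their term, and catch up their logs via AppendEntries. (A leaderless minority that kept cycling through elections may instead come back with a higher term and force one extra election unless pre-vote is enabled; only a node with an up-to-date log can win it.) Any uncommitted entries from a phantom leadership attempt are overwritten. No committed data is lost, and no split state survives.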
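
The catch-up step can be sketched as the follower side of one AppendEntries call. This is a simplified illustration, not any library's implementation: `Entry` and `append_entries` are hypothetical names, indices are 1-based as in the Raft paper, and term checks and commit-index handling are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    term: int
    command: str


def append_entries(log, prev_index, prev_term, entries):
    """Follower-side handling of one AppendEntries call.

    prev_index == 0 means "start of the log".
    Returns (success, new_log); the input log is not modified.
    """
    # Consistency check: the follower must already hold the entry that
    # precedes the new ones, and it must carry the same term.
    if prev_index > len(log):
        return False, log
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False, log

    new_log = list(log)
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(new_log):
            if new_log[index - 1].term == entry.term:
                continue  # entry already present, nothing to do
            # Conflict: drop this entry and everything after it.
            del new_log[index - 1:]
        new_log.append(entry)
    return True, new_log


if __name__ == "__main__":
    # The last entry was appended by a stale leader and never committed.
    stale = [Entry(3, "a"), Entry(4, "b"), Entry(4, "uncommitted")]
    ok, repaired = append_entries(stale, 2, 4, [Entry(5, "c")])
    print(ok, repaired)
```

When the check fails, the leader retries with an earlier `prev_index` until the two logs agree, then overwrites everything after that point.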

Partition trace: leader in the minority

A more subtle scenario: the leader itself gets partitioned to the minority side.

Trace it: a partition where the current leader is isolated on the minority side.

  1. 5-node cluster A, B, C, D, E. A is leader at term 4. Partition: A and B isolated from C, D, E.
  2. On the C, D, E side, what happens?
  3. State during partition: two 'leaders' exist?
  4. Partition heals. A's heartbeat reaches C.
  5. What did clients experience?
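
The heal step of this trace can be sketched with a term comparison. The sketch assumes, for illustration only, that C won an election on the majority side at term 5; `Node` and `heartbeat` are hypothetical names, and log replication is omitted.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    term: int
    role: str  # "leader", "follower", or "candidate"

    def observe_term(self, remote_term: int) -> None:
        """Any message carrying a higher term forces a step-down."""
        if remote_term > self.term:
            self.term = remote_term
            self.role = "follower"


def heartbeat(sender: Node, receiver: Node) -> bool:
    """Deliver a heartbeat; return True if the receiver accepts it."""
    receiver.observe_term(sender.term)
    if sender.term < receiver.term:
        # Stale sender: the reply carries the newer term back to it.
        sender.observe_term(receiver.term)
        return False
    return True


if __name__ == "__main__":
    a = Node("A", term=4, role="leader")  # stale leader from the minority
    c = Node("C", term=5, role="leader")  # assumed winner on the majority side
    print(heartbeat(a, c), a, c)
```

During the partition both A and C consider themselves leader, but only C can reach a majority, so only C can commit.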

Client routing

Only the leader can commit writes. Client routing strategies:

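  1. Redirect: any follower that receives a write replies with the current leader’s address, and the client retries against that address. TiKV works this way (a NotLeader error that carries a leader hint); etcd and Consul servers instead forward the request to the leader internally.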
  2. Leader cache: clients cache the last known leader and go there directly; fall back to any node on failure.
  3. Proxy: a load balancer tracks the leader via the cluster’s health API.

A redirect costs one extra round trip, typically 1–5 ms, and only on the rare miss. In steady state, writes go directly to the leader.
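
The redirect and leader-cache strategies combine naturally in one client. A minimal sketch, assuming a hypothetical `send(addr, command)` transport and a hypothetical `NotLeader` error that may carry a leader hint; neither is the API of etcd, Consul, or TiKV.

```python
class NotLeader(Exception):
    """Raised by a node that is not the leader; may carry a hint."""

    def __init__(self, leader_hint=None):
        super().__init__(f"not leader, hint={leader_hint}")
        self.leader_hint = leader_hint


class Client:
    def __init__(self, nodes, send, max_attempts=5):
        self.nodes = list(nodes)   # all known node addresses
        self.send = send           # hypothetical transport function
        self.leader = None         # cached leader address
        self.max_attempts = max_attempts

    def write(self, command):
        target = self.leader or self.nodes[0]
        for _ in range(self.max_attempts):
            try:
                result = self.send(target, command)
                self.leader = target  # cache the leader on success
                return result
            except NotLeader as err:
                # Follow the redirect, or fall back to the next node.
                self.leader = None
                target = err.leader_hint or self._next(target)
            except ConnectionError:
                self.leader = None
                target = self._next(target)
        raise RuntimeError("no leader reachable")

    def _next(self, current):
        i = self.nodes.index(current) if current in self.nodes else -1
        return self.nodes[(i + 1) % len(self.nodes)]
```

One caveat with this sketch: retrying after a `ConnectionError` can apply a write twice if the first attempt actually committed, so a real client needs idempotent commands or request IDs.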

Read consistency: linearizable reads must go through the leader (via ReadIndex or lease — covered in the next lesson). Eventually-consistent reads can go to any follower. The application layer chooses per query.

Three production failure modes

1. Slow disk fsync. Every committed entry requires at least one fsync on the leader and one on each acknowledging follower. On NVMe with battery-backed cache, fsync takes 50–100 µs. On cloud volumes (EBS gp3, GCP balanced PD), it can take 1–3 ms. If a stalled fsync keeps the leader from sending heartbeats for longer than the election timeout, followers time out before the next AppendEntries arrives and start an election. If the new leader sits on equally slow storage, the cycle repeats. Fix: put the Raft WAL on dedicated NVMe, never on shared cloud volumes.
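
A quick way to check whether a disk is in the danger zone is to time fsync directly. A rough sketch using only the Python standard library; `measure_fsync_ms` and `fsync_is_risky` are hypothetical helper names, and the 100 ms heartbeat interval in the usage lines is an illustrative value, not a recommendation.

```python
import os
import tempfile
import time


def measure_fsync_ms(directory=None, samples=20, payload=b"x" * 4096):
    """Return per-call fsync latencies in milliseconds."""
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            f.write(payload)
            f.flush()  # push Python's buffer to the OS
            start = time.perf_counter()
            os.fsync(f.fileno())  # ask the OS to persist the data
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def fsync_is_risky(latencies_ms, heartbeat_interval_ms):
    """Flag a disk whose worst observed fsync exceeds the heartbeat interval."""
    return max(latencies_ms) > heartbeat_interval_ms


if __name__ == "__main__":
    observed = measure_fsync_ms()
    print(f"worst fsync: {max(observed):.3f} ms")
    print("risky:", fsync_is_risky(observed, heartbeat_interval_ms=100))
```

Pass the directory that holds the Raft WAL as `directory` so the measurement hits the same device. Worst-case latency matters more than the average here, because a single long stall is enough to delay heartbeats.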

2. Network jitter. A brief burst of congestion or packet loss drops heartbeats and triggers an election, even though the cluster is otherwise healthy. The cluster experiences 150–300 ms of unavailability for no lasting reason. Pre-vote (covered in the next lesson) mitigates this by requiring a dry-run vote before a candidate increments its term.
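
The mechanism is just a gap between delivered heartbeats that exceeds the follower's election timeout. A simplified model, with hypothetical function names and an illustrative 150–300 ms timeout range; it looks at each gap independently and does not model what happens after an election starts.

```python
import random


def election_timeout_ms(low=150, high=300, rng=random):
    """Randomized timeout, so followers rarely time out at the same moment."""
    return rng.uniform(low, high)


def spurious_elections(heartbeat_arrivals_ms, timeout_ms):
    """Return the times at which a follower would start an election.

    heartbeat_arrivals_ms: sorted arrival times of the heartbeats that
    were actually delivered (dropped ones are simply absent).
    """
    elections = []
    for prev, nxt in zip(heartbeat_arrivals_ms, heartbeat_arrivals_ms[1:]):
        if nxt - prev > timeout_ms:
            # The timer armed at `prev` fires before `nxt` arrives.
            elections.append(prev + timeout_ms)
    return elections


if __name__ == "__main__":
    # Heartbeats every 50 ms, with a burst of loss between 150 and 500 ms.
    arrivals = [0, 50, 100, 150, 500, 550]
    print(spurious_elections(arrivals, timeout_ms=200))
```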

3. Clock drift on lease reads. If the leader’s clock runs slow relative to the followers’ clocks, it may believe its lease is still valid after the followers have already timed out and elected a new leader, and serve reads after it has lost the lease: stale data returned as fresh. Keeping clock drift bounded (NTP-syncing all nodes) is a correctness requirement for lease reads, not just hygiene.
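
One common defence is to shorten the lease by an assumed worst-case drift. A minimal sketch under stated assumptions: `LeaderLease` is a hypothetical class, `max_drift_ratio` is a bound the operator must choose and guarantee, and the lease is measured from the moment a majority-acknowledged heartbeat was sent, on a monotonic clock.

```python
import time


class LeaderLease:
    def __init__(self, election_timeout_s, max_drift_ratio, clock=time.monotonic):
        # Shorten the lease by the assumed worst-case drift so that it
        # expires on the leader before a follower's election timer can fire.
        self.duration = election_timeout_s * (1.0 - max_drift_ratio)
        self.clock = clock
        self.start = None

    def renew(self, heartbeat_sent_at):
        """Call once a majority has acknowledged the heartbeat sent at this time."""
        self.start = heartbeat_sent_at

    def can_serve_read(self):
        """True only while the drift-adjusted lease is still running."""
        if self.start is None:
            return False
        return self.clock() - self.start < self.duration


if __name__ == "__main__":
    lease = LeaderLease(election_timeout_s=1.0, max_drift_ratio=0.1)
    lease.renew(time.monotonic())
    print(lease.can_serve_read())
```

If the real drift exceeds the assumed bound, the lease can still overlap a new leader's term, which is why the bound is a correctness assumption rather than a tuning knob.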

Quiz

A Raft cluster's leader disk fsync starts taking 2 seconds (instead of the normal 50 µs). What is the observable symptom, and why?

Quiz

Raft is described as CP, not AP. What does this mean in practice during a network partition?

Recall before you leave
  1. A 5-node Raft cluster has 2 nodes in DC A and 3 in DC B. The inter-DC link goes down for 5 minutes. What do clients connected to DC A experience?
  2. Why is 'slow disk' on the leader worse than slow disk on a follower?
  3. What is the correct fix for a Raft cluster that experiences elections every 30–60 seconds?

Recap

Raft is CP: under partition, the minority side refuses commits rather than risk split brain. The majority side continues normally; on heal, stale nodes catch up via the AppendEntries consistency check. Clients route writes to the leader, using redirect or a cached leader address. The three most common production failures are slow disk fsync on the leader (triggers elections by blocking heartbeats), network jitter (drops heartbeats spuriously), and clock drift (breaks lease-read correctness). Each has a known fix: dedicated NVMe, pre-vote, and NTP sync respectively.

Next lesson: Raft extensions: pre-vote, learners, snapshots, and linearizable reads