
Distributed Systems

Raft in production: membership changes, Multi-Raft, and observability

Crux: joint consensus for safe membership change, Multi-Raft sharding patterns, leadership transfer for rolling deploys, the minimum viable Raft dashboard, and three real production postmortems.

You need to replace a dead node in your 3-node etcd cluster. “Just remove it and add the new one” sounds simple — but done wrong, it can leave you with two groups of nodes each believing they are the majority. Joint consensus exists specifically to prevent this.

Membership change via joint consensus

Adding or removing nodes from a running Raft cluster is the most error-prone operation in production. The naive approach, switching every node directly from the old config to the new one, cannot be atomic: nodes apply the change at different times, which opens a window during which two separate majorities can exist simultaneously.

Example of the bug: transitioning from 3 nodes (A, B, C) to 5 nodes (A, B, C, D, E). If different nodes apply the config change at different times, you can briefly have: a majority of the old config (2 of A, B, C) and a majority of the new config (3 of A, B, C, D, E). These two majorities can have zero overlap — two leaders could be elected simultaneously.
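
The overlap problem can be checked by brute force. The sketch below enumerates the minimal majorities of the old and new configurations from the example and lists every pair that shares no node. The helper name `majorities` is an illustration, not part of any Raft library.

```python
from itertools import combinations

OLD = {"A", "B", "C"}
NEW = {"A", "B", "C", "D", "E"}

def majorities(config):
    """Return every minimal majority (quorum) of a configuration."""
    need = len(config) // 2 + 1
    return [set(q) for q in combinations(sorted(config), need)]

# Pairs (old quorum, new quorum) with no node in common. If some nodes still
# vote under OLD while others already vote under NEW, each side of such a
# pair could elect its own leader.
disjoint = [(o, n) for o in majorities(OLD) for n in majorities(NEW) if not (o & n)]

for o, n in disjoint:
    print(sorted(o), "vs", sorted(n))
```

Each of the three old-config majorities has exactly one disjoint counterpart in the new config, for example A, B against C, D, E.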

Joint consensus fixes this: the cluster transitions through an intermediate configuration C_old_new that requires a majority from both the old and the new configuration. Commits and elections during this phase need a majority of C_old AND a majority of C_new, so no two valid quorums in this window can be disjoint. Once the C_old_new entry is committed, the leader appends a C_new entry; when that entry commits, the cluster operates under C_new alone.

HashiCorp Raft takes the simpler alternative to a full joint phase: single-server changes only (add or remove one node at a time). Any majority of the old config and any majority of a config that differs by one node must overlap, which keeps the operational story tractable. etcd's raft library supports both single-server changes and full joint consensus. Either way, never bypass the library's membership-change mechanism in production: the Alibaba Cloud Raft engineering post documents a real 30-minute split brain caused by exactly this shortcut.

Phase      | Config               | Quorum rule
Before     | C_old: A, B, C       | Majority of C_old (2 of 3)
Transition | C_old_new            | Majority of C_old AND majority of C_new
After      | C_new: A, B, C, D, E | Majority of C_new (3 of 5)
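
The transition-phase rule can be written as a small predicate. This is a minimal sketch, assuming acknowledgements are tracked as a set of node names; `has_majority` and `joint_quorum` are hypothetical helpers, not the API of etcd or HashiCorp Raft.

```python
from itertools import combinations

C_OLD = frozenset("ABC")
C_NEW = frozenset("ABCDE")

def has_majority(config, acks):
    """True if more than half of `config` appears in `acks`."""
    return len(config & acks) > len(config) // 2

def joint_quorum(c_old, c_new, acks):
    """Quorum rule while C_old_new is in effect: both majorities required."""
    return has_majority(c_old, acks) and has_majority(c_new, acks)

print(joint_quorum(C_OLD, C_NEW, {"A", "B"}))       # old majority only: False
print(joint_quorum(C_OLD, C_NEW, {"C", "D", "E"}))  # new majority only: False
print(joint_quorum(C_OLD, C_NEW, {"A", "B", "D"}))  # both majorities: True

# Brute-force check: every pair of valid joint quorums shares at least one node.
nodes = sorted(C_OLD | C_NEW)
quorums = [set(s) for r in range(len(nodes) + 1)
           for s in combinations(nodes, r)
           if joint_quorum(C_OLD, C_NEW, set(s))]
all_overlap = all(a & b for a in quorums for b in quorums)
print(all_overlap)
```

The overlap holds because every joint quorum contains at least two of A, B, C, and any two such subsets of a three-node set must intersect.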

Leadership transfer for rolling deploys

Without explicit leadership transfer, draining a leader node before maintenance causes one election (150–300 ms unavailability) per leader migrated. On a multi-Raft system with thousands of groups, this is unacceptable.

TimeoutNow solves it: the current leader first brings a designated target follower's log up to date, then sends it a TimeoutNow message, and the target immediately starts an election without waiting for its election timer. Because no other follower has timed out yet, the target almost always wins, and the old leader steps down when it sees the higher term. User-visible unavailability is typically under 10 ms. CockroachDB and TiKV expose leadership-transfer endpoints for exactly this use case in rolling deploys.
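
The sequence can be sketched as a toy, single-process model. Everything here (the `Node` class, `transfer_leadership`, the simplified vote rule that compares only log length) is a hypothetical illustration under the assumption that all last-entry terms are equal; it is not the implementation used by any of the systems named above.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    term: int
    last_index: int
    role: str = "follower"

def transfer_leadership(leader, target, others):
    """Toy TimeoutNow flow. Returns True if the target became leader."""
    if leader.role != "leader":
        raise ValueError("only the current leader can initiate a transfer")
    cluster = [leader, target, *others]

    # 1. Catch-up: the leader replicates until the target's log matches its own.
    target.last_index = leader.last_index

    # 2. TimeoutNow: the target campaigns at once in the next term.
    target.term = leader.term + 1
    target.role = "candidate"

    # 3. Voting: a node grants its vote if the candidate's term is newer and
    #    its log is at least as long (last-entry terms assumed equal here).
    votes = 1  # the candidate votes for itself
    for node in cluster:
        if node is target:
            continue
        if target.term > node.term and target.last_index >= node.last_index:
            node.term = target.term
            node.role = "follower"  # the old leader steps down here
            votes += 1

    if votes > len(cluster) // 2:
        target.role = "leader"
    return target.role == "leader"

a = Node("A", term=12, last_index=400, role="leader")
b = Node("B", term=12, last_index=400)
c = Node("C", term=12, last_index=398)
ok = transfer_leadership(a, b, [c])
print(ok, a.role, b.role, b.term)
```

The catch-up step matters: a target with a stale log would lose the vote, and the cluster would fall back to a normal timeout-driven election.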

Multi-Raft and sharding

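A single Raft group's throughput is bounded by one leader: every write passes through it and commits only after the fastest majority has fsynced the entry, so the ceiling is set by that majority's fsync latency plus the network round-trip. For high-throughput systems, the architecture is Multi-Raft: many independent Raft groups in parallel, each owning a shard of the keyspace.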

  • CockroachDB: partitions the keyspace into ~512 MB “ranges,” one Raft group per range.
  • TiKV: same with ~100 MB “regions.”
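  • Kafka KRaft: a contrast case rather than Multi-Raft. KRaft runs a single Raft quorum over the one-partition __cluster_metadata topic; data partitions are not Raft groups.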
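
Routing a key to its Raft group reduces to a lookup in a sorted table of range start keys. The sketch below assumes a tiny hand-written range table; `RANGE_STARTS`, `RAFT_GROUPS`, and `group_for_key` are hypothetical names and do not reflect how CockroachDB or TiKV store their range metadata.

```python
import bisect

# Hypothetical range table: sorted start keys, one Raft group per range.
# Range i covers keys in [RANGE_STARTS[i], RANGE_STARTS[i + 1]).
RANGE_STARTS = ["", "g", "n", "t"]
RAFT_GROUPS = ["group-1", "group-2", "group-3", "group-4"]

def group_for_key(key: str) -> str:
    """Return the Raft group that owns `key`."""
    # bisect_right finds the first start key greater than `key`;
    # the owning range is the one just before it.
    idx = bisect.bisect_right(RANGE_STARTS, key) - 1
    return RAFT_GROUPS[idx]

for k in ["apple", "g", "monkey", "zebra"]:
    print(k, "->", group_for_key(k))
```

Because each group replicates independently, writes to "apple" and "zebra" proceed in parallel under different leaders.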

Heartbeat coalescing makes Multi-Raft viable at scale: instead of 50,000 groups × 1 heartbeat/50 ms = 1M heartbeats/s, a 5-node cluster with 50,000 groups sends 10 physical packets per 50 ms (one per node pair), each containing all group heartbeats in a compact bitmap. That is 200 packets/s instead of 1M.
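
The arithmetic above can be reproduced directly. This sketch uses only the figures from the paragraph (50,000 groups, a 50 ms heartbeat interval, 5 nodes) and the same simplification of one coalesced packet per node pair per interval.

```python
from math import comb

groups = 50_000
heartbeat_interval_ms = 50
nodes = 5

ticks_per_s = 1000 // heartbeat_interval_ms          # 20 intervals per second

# Uncoalesced: every group sends its own heartbeat each interval.
naive_heartbeats_per_s = groups * ticks_per_s

# Coalesced: one physical packet per node pair per interval,
# carrying the heartbeats of every group the two nodes share.
node_pairs = comb(nodes, 2)
coalesced_packets_per_s = node_pairs * ticks_per_s

print(naive_heartbeats_per_s)    # 1000000
print(node_pairs)                # 10
print(coalesced_packets_per_s)   # 200
```

The coalesced figure depends only on the number of nodes, not on the number of groups, which is why adding ranges does not add heartbeat traffic.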

Minimum viable Raft dashboard

Metric                          | Alert threshold | Root cause when triggered
leader_changes_total per minute | >1/min          | Flapping leadership: check disk or heartbeat RTT
wal_fsync_duration_seconds p99  | >50 ms          | Disk too slow: move to NVMe
commit_latency_seconds p99      | >100 ms         | Disk, network, or follower lag
follower_lag_entries            | >1000           | Follower falling behind: may need snapshot soon
raft_term growth rate           | >2/min          | Repeated elections: check network + pre-vote

etcd ships etcd_server_leader_changes_seen_total, etcd_disk_wal_fsync_duration_seconds, and etcd_network_peer_round_trip_time_seconds by default.
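
The five thresholds in the table can be expressed as a simple evaluation step. This is a sketch only: the dictionary keys are hypothetical names for already-aggregated values (per-minute rates, p99 latencies in seconds), and a real deployment would normally encode these as rules in its monitoring system rather than in application code.

```python
# Thresholds from the dashboard table; key names are illustrative.
THRESHOLDS = {
    "leader_changes_per_min": 1,
    "wal_fsync_p99_seconds": 0.050,
    "commit_latency_p99_seconds": 0.100,
    "follower_lag_entries": 1000,
    "raft_term_growth_per_min": 2,
}

def firing_alerts(sample):
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if sample.get(name, 0) > limit
    )

sample = {
    "leader_changes_per_min": 0,
    "wal_fsync_p99_seconds": 0.080,      # 80 ms: over the 50 ms limit
    "commit_latency_p99_seconds": 0.120, # 120 ms: over the 100 ms limit
    "follower_lag_entries": 10,
    "raft_term_growth_per_min": 0,
}
print(firing_alerts(sample))
```

In this sample the slow fsync and the high commit latency fire together, which points at the disk rather than the network.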

Three real production postmortems

GitHub October 2018. A 43-second network partition between sites led Orchestrator, GitHub's Raft-backed MySQL failover manager, to fail over database primaries to another region. Both regions ended up holding writes the other had not replicated. Raft was correct; the layer built above it was not. Lesson: Raft guarantees consensus within the cluster; it cannot protect against a misconfigured operator layer writing around the cluster.

Datadog 2023. A TiKV cluster experienced repeated leader elections under high write load. Root cause: pre-vote was disabled in an older version. Under write pressure, network jitter caused spurious elections, each one causing hundreds of milliseconds of unavailability. Enabling pre-vote resolved the issue. Lesson: pre-vote does not change Raft's safety guarantees, but it is essential for availability; treat it as mandatory, not a nicety.
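
How pre-vote suppresses spurious elections can be sketched from the voter's side. The rule below is one common formulation, stated as an assumption: a voter grants a pre-vote only if it has itself lost contact with the leader and the candidate's log is at least as up to date as its own. The `Voter` class, `grant_pre_vote`, and the 150 ms timeout are hypothetical, not TiKV's code.

```python
from dataclasses import dataclass

ELECTION_TIMEOUT_MS = 150  # assumed value for the example

@dataclass
class Voter:
    current_term: int
    last_log_term: int
    last_log_index: int
    ms_since_leader_contact: int

def grant_pre_vote(voter, cand_next_term, cand_last_log_term, cand_last_log_index):
    """Decide a pre-vote request. The voter's own term is never modified."""
    heard_from_leader = voter.ms_since_leader_contact < ELECTION_TIMEOUT_MS
    # Standard Raft up-to-date check: compare last term first, then last index.
    log_ok = (cand_last_log_term, cand_last_log_index) >= (
        voter.last_log_term, voter.last_log_index)
    return (not heard_from_leader) and cand_next_term > voter.current_term and log_ok

healthy = Voter(current_term=12, last_log_term=12, last_log_index=400,
                ms_since_leader_contact=20)
orphaned = Voter(current_term=12, last_log_term=12, last_log_index=400,
                 ms_since_leader_contact=400)

# A jittery node asks for a pre-vote for term 13.
print(grant_pre_vote(healthy, 13, 12, 400))   # leader is alive: denied
print(grant_pre_vote(orphaned, 13, 12, 400))  # leader really gone: granted
print(grant_pre_vote(orphaned, 13, 11, 400))  # candidate log is stale: denied
```

Because a denied pre-vote never bumps anyone's term, a single node with a flaky link cannot force the rest of the cluster into a real election.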

Kubernetes etcd 2020. An admin removed a node from a 3-node cluster without using the membership change API (manually deleted the data dir and reconfigured). The remaining two nodes had disagreeing stored configurations and cycled between leader and candidate. Fix required stopping the cluster, manually editing configs, and restarting. Lesson: never manipulate Raft state outside the protocol’s own membership change path.

Debug this

Diagnose the Raft log — what is wrong with follower D?

log
2026-05-13T15:42:08Z INFO  raft: leader=A term=12 commit=400123
2026-05-13T15:42:08Z INFO  raft: AppendEntries -> B success
2026-05-13T15:42:08Z INFO  raft: AppendEntries -> C success
2026-05-13T15:42:08Z INFO  raft: AppendEntries -> D failure (mismatch at idx=400100 term=12 vs follower idx=400100 term=11)
2026-05-13T15:42:09Z INFO  raft: AppendEntries -> D retry prevIdx=400050 (decremented)
2026-05-13T15:42:09Z INFO  raft: AppendEntries -> D failure (mismatch at idx=400050 term=12 vs follower idx=400050 term=10)
2026-05-13T15:42:10Z INFO  raft: AppendEntries -> D retry prevIdx=399000 (decremented)
2026-05-13T15:42:10Z INFO  raft: AppendEntries -> D failure (mismatch at idx=399000 term=12 vs follower idx=399000 term=8)
2026-05-13T15:42:11Z WARN  raft: D has diverged extensively; consider InstallSnapshot

The leader keeps decrementing nextIndex for D. What is the diagnosis and the correct operational fix?
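
For reference while working the exercise, the probing behaviour visible in the log can be modelled as follows. This is a toy sketch with tiny made-up logs (index mapped to term); `find_match_index` is a hypothetical helper, and the one-entry step is an assumption that differs from the larger jumps in the log excerpt.

```python
def find_match_index(leader_terms, follower_terms, next_index, step=1):
    """Walk prevIdx backwards until both logs hold the same term there.

    Returns (match_index, probes). match_index is 0 if no entry matches.
    """
    probes = 0
    prev = next_index - 1
    while prev > 0:
        probes += 1  # one failed or successful AppendEntries round-trip
        if follower_terms.get(prev) == leader_terms.get(prev):
            return prev, probes
        prev -= step
    return 0, probes

# Leader: entries 1-4 written in term 1, entries 5-10 in term 3.
leader = {i: 1 for i in range(1, 5)} | {i: 3 for i in range(5, 11)}
# Follower: same first four entries, then uncommitted entries from term 2.
follower = {i: 1 for i in range(1, 5)} | {i: 2 for i in range(5, 9)}

match, probes = find_match_index(leader, follower, next_index=11)
print(match, probes)
```

Each probe costs a network round-trip, so the cost of finding the match point grows with how far back the divergence starts.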

Pick the best fit

A startup needs a strongly consistent KV store for service discovery. Pick the consensus implementation that best balances correctness, latency, and operational burden.

Quiz

Why does a naive single-step config swap (adding 3 nodes at once without joint consensus) risk split brain?

Recall before you leave
  1. Explain why joint consensus prevents the split-brain risk in membership change.
  2. A CockroachDB cluster runs 50,000 Raft groups. Why does this not collapse under 1M heartbeats per second?
  3. What is the minimum set of Raft metrics you would alert on, and what does each one diagnose?

Recap

Safe membership change requires that old and new majorities always overlap. Joint consensus guarantees this with a two-quorum transition phase that closes any window where two disjoint majorities could elect concurrent leaders. Leadership transfer via TimeoutNow reduces unavailability during rolling deploys from hundreds of milliseconds to under 10 ms. Multi-Raft scales a single cluster's throughput via parallel groups with coalesced heartbeats. The minimum viable Raft dashboard tracks five metrics: leader changes, WAL fsync latency, commit latency, follower lag, and term growth rate. Most production Raft incidents trace to a small set of root causes: a membership change made outside the protocol, disabled pre-vote, a WAL on a slow cloud volume, or an operator layer writing around the cluster. Guard against these preventively and you cover the large majority of production Raft failures.
