Distributed Systems DIST · 02 · 06

Raft in production: membership changes, Multi-Raft, and observability

Joint consensus for safe membership change, Multi-Raft sharding patterns, leadership transfer for rolling deploys, the minimum viable Raft dashboard, and three real production postmortems.

DIST Senior ◷ 16 min

Level

FoundationsJuniorMiddleSenior

You need to replace a dead node in your 3-node etcd cluster. “Just remove it and add the new one” sounds simple — but done wrong, it can leave you with two groups of nodes each believing they are the majority. Joint consensus exists specifically to prevent this.

Membership change via joint consensus

Adding or removing nodes from a running Raft cluster is the most error-prone operation in production. The naive approach — atomically swap the config on all nodes — creates a discontinuity window during which two separate majorities can exist simultaneously.

Example of the bug: transitioning from 3 nodes (A, B, C) to 5 nodes (A, B, C, D, E). If different nodes apply the config change at different times, you can briefly have: a majority of the old config (2 of A, B, C) and a majority of the new config (3 of A, B, C, D, E). These two majorities can have zero overlap — two leaders could be elected simultaneously.

Joint consensus fixes this: the cluster transitions through an intermediate configuration C_old_new that requires a majority from both the old and the new configuration. Commits during this phase need a majority of C_old AND a majority of C_new. No two valid quorums in this window can be disjoint. Once C_old_new is committed cluster-wide, the cluster transitions to C_new alone.

Hashicorp Raft and etcd implement joint consensus with an additional simplification: single-server changes only (add or remove one node at a time). This limits the joint phase to a simple two-majority requirement and makes the operational story tractable. Never bypass joint consensus for membership changes in production — the Alibaba Cloud Raft engineering post documents a real 30-minute split brain caused by exactly this shortcut.

Phase	Config	Quorum rule
Before	C_old: A, B, C	Majority of C_old (2 of 3)
Transition	C_old_new	Majority of C_old AND majority of C_new
After	C_new: A, B, C, D, E	Majority of C_new (3 of 5)

Leadership transfer for rolling deploys

Without explicit leadership transfer, draining a leader node before maintenance causes one election (150–300 ms unavailability) per leader migrated. On a multi-Raft system with thousands of groups, this is unacceptable.

TimeoutNow solves it: the current leader sends a TimeoutNow message to a designated target follower, which immediately starts an election. Because no other follower has timed out yet, the target almost always wins. The current leader steps down. User-visible unavailability is under 10 ms. CockroachDB and TiKV expose leadership-transfer endpoints for exactly this use case in rolling deploys.

Multi-Raft and sharding

A single Raft group’s throughput is bounded by the slowest replica’s fsync plus the network round-trip. For high-throughput systems, the architecture is Multi-Raft: many independent Raft groups in parallel, each owning a shard of the keyspace.

CockroachDB: partitions the keyspace into ~512 MB “ranges,” one Raft group per range.
TiKV: same with ~100 MB “regions.”
Kafka KRaft: one Raft group per metadata topic partition.

Heartbeat coalescing makes Multi-Raft viable at scale: instead of 50,000 groups × 1 heartbeat/50 ms = 1M heartbeats/s, a 5-node cluster with 50,000 groups sends 10 physical packets per 50 ms (one per node pair), each containing all group heartbeats in a compact bitmap. That is 200 packets/s instead of 1M.

One physical packet per node pair carries every group's heartbeat in a compact bitmap, so traffic tracks node count, not group count — 5,000x fewer packets is what makes Multi-Raft viable at 50,000 groups.

Minimum viable Raft dashboard

Metric	Alert threshold	Root cause when triggered
`leader_changes_total` per minute	>1/min	Flapping leadership — check disk or heartbeat RTT
`wal_fsync_duration_seconds` p99	>50 ms	Disk too slow — move to NVMe
`commit_latency_seconds` p99	>100 ms	Disk, network, or follower lag
`follower_lag_entries`	>1000	Follower falling behind — may need snapshot soon
`raft_term` growth rate	>2/min	Repeated elections — check network + pre-vote

Etcd ships etcd_server_leader_changes_seen_total, etcd_disk_wal_fsync_duration_seconds, and etcd_network_peer_round_trip_time_seconds by default.

Three real production postmortems

These are not hypothetical — all three cost real engineers real outages. Read them not as cautionary tales but as a checklist: if your cluster could reproduce any of these conditions, fix it before you get paged.

GitHub October 2018. A network blip between datacenters caused etcd to trigger a failover. A misconfigured external sync wrote to both regions during a brief window — Raft was correct, the orchestrator above it was not. Lesson: Raft guarantees consensus within the cluster; it cannot protect against a misconfigured operator layer writing around the cluster.

Datadog 2023. A TiKV cluster experienced repeated leader elections under high write load. Root cause: pre-vote was disabled in an older version. Under write pressure, network jitter caused spurious elections, each one causing hundreds of milliseconds of unavailability. Enabling pre-vote resolved the issue. Lesson: pre-vote is a correctness feature, not a nicety.

Kubernetes etcd 2020. An admin removed a node from a 3-node cluster without using the membership change API (manually deleted the data dir and reconfigured). The remaining two nodes had disagreeing stored configurations and cycled between leader and candidate. Fix required stopping the cluster, manually editing configs, and restarting. Lesson: never manipulate Raft state outside the protocol’s own membership change path.

Debug this

Diagnose the Raft log — what is wrong with follower D?

log

2026-05-13T15:42:08Z INFO  raft: leader=A term=12 commit=400123
2026-05-13T15:42:08Z INFO  raft: AppendEntries -> B success
2026-05-13T15:42:08Z INFO  raft: AppendEntries -> C success
2026-05-13T15:42:08Z INFO  raft: AppendEntries -> D failure (mismatch at idx=400100 term=12 vs follower idx=400100 term=11)
2026-05-13T15:42:09Z INFO  raft: AppendEntries -> D retry prevIdx=400050 (decremented)
2026-05-13T15:42:09Z INFO  raft: AppendEntries -> D failure (mismatch at idx=400050 term=12 vs follower idx=400050 term=10)
2026-05-13T15:42:10Z INFO  raft: AppendEntries -> D retry prevIdx=399000 (decremented)
2026-05-13T15:42:10Z INFO  raft: AppendEntries -> D failure (mismatch at idx=399000 term=12 vs follower idx=399000 term=8)
2026-05-13T15:42:11Z WARN  raft: D has diverged extensively; consider InstallSnapshot

The leader keeps decrementing nextIndex for D. What is the diagnosis and the correct operational fix?

Pick the best fit

A startup needs a strongly consistent KV store for service discovery. Pick the consensus implementation that best balances correctness, latency, and operational burden.

Quiz

Why does a naive single-step config swap (adding 3 nodes at once without joint consensus) risk split brain?

During the C_old_new phase a commit needs a majority of the old AND the new configuration at once, so no two valid quorums can be disjoint — closing the window where two leaders could be elected. Only after C_old_new commits cluster-wide does the cluster move to C_new alone.

Recall before you leave

01
Explain why joint consensus prevents the split-brain risk in membership change.
02
A CockroachDB cluster runs 50,000 Raft groups. Why does this not collapse under 1M heartbeats per second?
03
What is the minimum set of Raft metrics you would alert on, and what does each one diagnose?

Recap

Safe membership change requires joint consensus — a two-quorum transition phase that prevents any window where two non-overlapping majorities could elect concurrent leaders. Leadership transfer via TimeoutNow reduces unavailability during rolling deploys from hundreds of milliseconds to under 10 ms. Multi-Raft scales a single cluster’s throughput via parallel groups with coalesced heartbeats. The minimum viable Raft dashboard tracks five metrics: leader changes, WAL fsync latency, commit latency, follower lag, and term growth rate. Every real production incident with Raft traces to one of three root causes: a membership change that bypassed joint consensus, a pre-vote that was disabled, or a WAL on a slow cloud volume. Fix all three preventively and you cover 90% of production Raft failures. Now when you are handed a Raft cluster to run, the first thing to verify is not the Raft version — it is whether membership changes use the API, pre-vote is on, and the WAL lands on dedicated NVMe. Everything else is tuning.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

Raft extensions: pre-vote, learners, snapshots, and linearizable readssenior

appears again in211

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.