Distributed Systems DIST · 04 · 01

Leader election: one writer, terms, and the split-brain that fencing tokens stop

A single leader serializes writes so a cluster avoids conflicts — but a GC pause or partition can leave two nodes both convinced they''''re leader. Terms, leases, and a monotonic fencing token are how production survives it.

DIST Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

Already know this unit? Take a 1-minute quick check →

A batch worker holds the cluster lock and starts writing files to shared storage. Then its JVM hits a 14-second stop-the-world GC pause. From inside the process, nothing happened — one line of code, then the next. But the lease it held expired at second 10; the coordinator declared it dead and elected a new leader, which began its own writes. At second 14 the old worker wakes and finishes its write — to the same file, with stale data. Two leaders, one corrupted file, and a postmortem that starts with “but we had a lock.”

Why elect a leader at all

Most coordination problems get dramatically simpler the moment exactly one node is allowed to make a decision. With a single leader, writes are serialized through one place, so there are no concurrent-update conflicts to merge, no last-write-wins ambiguity, no two nodes assigning the same sequence number. The leader becomes the linearization point: every change goes through it, in an order everyone agrees on. This is why so much infrastructure is leader-based — Kafka partition leaders, a primary database in a replica set, the controller in a cluster, a cron job that must run on exactly one host.

The cost is that you’ve now created a single point of failure on purpose, so the hard part isn’t electing a leader — it’s re-electing one safely when the current one dies, without ever letting two be active at once. That second requirement is where almost all the real difficulty (and almost every production incident) lives.

Terms and randomized timeouts: how Raft elects

Raft, the consensus algorithm behind etcd and many others, divides time into terms — consecutive integers that act as a logical clock. Each term has at most one leader. A follower expects regular heartbeats from the leader; if none arrives within its election timeout, it assumes the leader is gone, increments the term, becomes a candidate, and asks everyone to vote for it. A candidate wins by collecting votes from a majority (a quorum), and each node votes at most once per term, first-come-first-served.

The clever detail is the timeout value. If every follower used the same fixed timeout, they’d all become candidates at the same instant, split the vote, and nobody would win — then they’d retry and split again. Raft fixes this by drawing each node’s election timeout randomly from a range, typically 150–300 ms. One node’s timer almost always fires first; it campaigns and wins before any peer wakes up. Randomization turns a synchronized stampede into a staggered queue.

System	Election mechanism	Key number	Liveness signal
Raft (etcd, Consul)	Term + majority vote	Election timeout 150–300 ms (randomized)	Leader heartbeats
ZooKeeper	Lowest ephemeral sequential znode	Session timeout (often seconds)	Session heartbeats; znode auto-deleted on expiry
etcd	Lease + compare-and-swap on a key	Lease TTL (set per acquire)	Lease keep-alive renewals

Leases, sessions, and the clock you can’t trust

Once elected, a leader must keep proving it’s alive, because the cluster needs to detect a dead leader fast enough to elect a replacement. The common mechanism is a lease (etcd) or a session (ZooKeeper): leadership is granted for a bounded time and must be renewed. In ZooKeeper, the leader holds an ephemeral znode tied to a session; if heartbeats stop for the session timeout, ZooKeeper deletes the znode and the next candidate takes over. In etcd, the leader attaches a lease with a TTL to a key and keeps it alive; if renewals stop, the lease expires and the key vanishes, freeing the role.

Here’s the trap a senior never forgets: a lease is a statement about time, and the leader’s local clock and the coordinator’s clock are not the same clock. A lease “valid for 10 seconds” is only safe if both sides agree on how long 10 seconds is — and they don’t, exactly. Worse, the leader might not even be running for part of that window. A GC pause, a hypervisor freezing a VM for live migration, a swapped-out process, or an NTP jump can all make the leader believe its lease is still valid long after the coordinator has expired it and elected someone else.

▸Why this works

A lease bounds how long a coordinator waits before declaring a node dead — it does not bound how long a paused leader will keep believing it still holds the lease. Those are two different clocks. The leader can be frozen mid-instruction while wall-clock time, and its lease, run out underneath it. That gap is precisely where split-brain is born.

Split-brain and the fencing-token fix

When you see a GC pause in a postmortem, the first question to ask is not “was the lock held?” — it’s “could the holder have woken after the lease expired and still written?” That question leads straight here.

Split-brain is two (or more) nodes both convinced they are the leader at the same time. A network partition can cause it — a minority side that can’t see the quorum may still think it leads — but the sneakier cause is the pause from the Hook: the old leader was never partitioned, it was just stopped, and it wakes up holding a lease the world already gave away. Now two leaders write, and shared state corrupts.

Kleppmann’s fix is the fencing token: every time the lock/lease is granted, the coordinator hands out a number that strictly increases — 33, then 34, then 35. The leader must include its token on every write to the protected resource, and the resource itself must reject any write whose token is lower than the highest it has already seen. So when client 1 (token 33) wakes from its GC pause and tries to write, the storage service has already accepted client 2’s write at token 34 — it rejects token 33 outright. The stale writer is fenced off at the resource boundary. The crucial, often-missed point: the token only works if the downstream resource actively checks it. A lock service that just hands out tokens but writes to a dumb store that ignores them buys you nothing.

Only the fencing token, enforced by the resource, closes the split-brain window; a shorter TTL, a self-check, and a faster election each fail for a different reason.

Pick the best fit

A leader holds a 10s lease, writes to object storage, and you've seen multi-second GC pauses. How do you prevent a stale write after a pause?

Quiz

Why does Raft draw each node's election timeout randomly from 150–300 ms instead of using one fixed value?

Quiz

A fencing token only prevents the stale-write bug if which condition holds?

Order the steps

Order the sequence of a classic split-brain stale write under a GC pause:

1 Leader A acquires the lease (token 33) and starts a write to shared storage
2 Leader A enters a long stop-the-world GC pause and stops executing code
3 A's lease expires; the coordinator declares A dead and elects leader B (token 34)
4 B writes successfully; the storage layer records highest token = 34
5 A wakes, finishes its write with token 33; the storage layer rejects it as fenced

The fencing token is checked by the resource: a stale leader's lower token is rejected regardless of clocks or pause length.

Recall before you leave

01
Explain to a teammate why a lease alone does not prevent two nodes from both writing as leader, and what actually fixes it.
02
Why does Raft use randomized election timeouts, and what failure does that prevent?

Recap

A single leader exists to serialize writes, so a cluster avoids concurrent-update conflicts and ambiguous ordering — but it trades that simplicity for a single point of failure that must be re-elected safely. Raft does this with terms (a logical clock, one leader per term) and majority votes, using election timeouts drawn randomly from roughly 150–300 ms so candidates stagger instead of splitting the vote forever. Leadership is then held as a lease or session that must be renewed; ZooKeeper ties it to an ephemeral znode, etcd to a lease TTL. The deep hazard is that a lease is a claim about time, and clocks plus pauses lie: a GC pause, paused VM, or partition can leave an old leader writing after its lease expired and a new leader was chosen — split-brain, two writers, corrupted state. The cure isn’t a shorter timeout or a clock check, because a frozen process can’t act; it’s a monotonic fencing token that the resource itself enforces, rejecting any write whose token has gone backwards. Now when you encounter a “we had a lock” postmortem, check first whether the resource enforced a fencing token — if it didn’t, the lock was never enough.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.