Distributed Systems
Leader election: multiple-choice review
Six questions that cut across the whole unit. Each mirrors a call you make in a real incident — not a definition to recite, but a safety argument to defend when a leader pauses, a partition heals, or two nodes both claim the crown.
Confirm you can connect why we elect a leader, how Raft elects one safely, why a lease is a claim about untrustworthy clocks, and why only a resource-enforced fencing token closes the split-brain window.
A leader holds a 10s lease and writes to object storage; you have observed multi-second GC pauses. Which mitigation actually closes the stale-write window, not just narrows it?
Why does Raft draw each follower's election timeout randomly from roughly 150–300 ms instead of using one fixed value?
A senior says 'a lease does not bound how long a paused leader keeps believing it leads.' What is the precise reason, and the consequence?
Your team adds fencing tokens: the lock service hands out strictly increasing numbers and the leader stamps every write with its token. Incidents continue. What is the most likely gap?
A 5-node cluster is split 3-2 by a network partition. How does a correct quorum-based election protocol behave, and why?
You need 'run this cron on exactly one host' and reach for a distributed lock. A reviewer warns the lock alone is not enough for correctness. What is the missing piece?
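As a study aid for the timeout and partition questions above, here is a minimal sketch of the two mechanics they rely on: drawing a randomized election timeout and checking whether one side of a partition holds a majority. The function names (`election_timeout_ms`, `majority`, `can_elect_leader`) are hypothetical and only illustrate the arithmetic; this is not a Raft implementation.

```python
import random


def election_timeout_ms(low=150, high=300):
    # Each follower draws its own timeout, so followers rarely time out at
    # the same instant and split the vote. The 150-300 ms range is the one
    # quoted in the questions above.
    return random.uniform(low, high)


def majority(cluster_size):
    # Smallest number of votes that is strictly more than half the cluster.
    return cluster_size // 2 + 1


def can_elect_leader(partition_size, cluster_size):
    # A side of a partition can elect a leader only if it can gather a
    # majority of the FULL cluster, not of the nodes it can currently reach.
    return partition_size >= majority(cluster_size)


print(majority(5))             # 3
print(can_elect_leader(3, 5))  # True: the 3-node side can elect a leader
print(can_elect_leader(2, 5))  # False: the 2-node side cannot
```

Because two disjoint sides cannot both hold a majority of the same cluster, at most one side can elect a leader for a given term.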
The through-line of the unit is one decision tree. Elect a single leader to serialize writes; Raft uses terms plus randomized 150–300 ms timeouts to elect one safely. Hold leadership as a lease that the coordinator can expire, but never trust that a paused leader knows it lost the lease. A partition is handled by quorum (only the majority side can elect a leader); a pause is handled by a monotonic fencing token that the protected resource itself enforces. Elect for liveness, fence for safety, and remember that a bare lock or lease without resource-side token checking only narrows the stale-write window; it never closes it.
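The resource-side check described above can be sketched in a few lines. This is an in-memory toy under stated assumptions: `FencedStore` and its `write` method are hypothetical names, and a real storage service would need to perform the comparison and the write atomically (for example through a conditional write), which this sketch does not model.

```python
class FencedStore:
    """Hypothetical protected resource that enforces fencing tokens itself."""

    def __init__(self):
        self.highest_token = 0  # highest fencing token accepted so far
        self.data = {}

    def write(self, key, value, token):
        # The resource, not the client, makes the decision. A token lower
        # than one already accepted means a newer leader has written, so
        # the caller is stale no matter what it believes about its lease.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()

# Leader A holds token 33 and writes normally.
assert store.write("job", "from leader A", token=33)

# A pauses (say, a long GC). Its lease expires and B is granted token 34.
assert store.write("job", "from leader B", token=34)

# A resumes, still believing it leads. The store rejects the stale write.
assert not store.write("job", "late write from A", token=33)

print(store.data["job"])  # from leader B
```

The point of the design is that safety does not depend on the paused leader noticing anything: stamping writes with a token helps only if the resource compares it against the highest token it has seen and refuses older ones.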