Distributed Systems DIST · 04 · 10

Leader election: build a fenced single-writer

Hands-on project — build a leader-elected job with a real lock service, reproduce a split-brain stale write under an injected pause, then make it impossible with a resource-enforced fencing token.

DIST Senior ◷ 240 min

Reading about split-brain is not the same as watching two of your own processes corrupt the same file. Build a leader-elected writer on a real lock service, deliberately drive it into split-brain with an injected pause, and then close the window with a fencing token the resource enforces — with evidence at every step.

Goal

Turn the unit’s safety argument into a reproducible experiment: elect a leader with a lease, reproduce a stale write across an injected pause, and prove that a monotonic fencing token checked at the resource makes the same attack a no-op.

Project

Objective

Build a small leader-elected job that writes to a shared resource, reproduce a split-brain stale write by injecting a pause longer than the lease, then add a resource-enforced fencing token and prove the same stale write is now rejected — all backed by logs.
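Before wiring up a real lock service, it can help to see the failure in miniature. The sketch below is a single-process simulation, not a lock-service client: `FakeClock`, `LeaseLock`, `UnfencedStore`, and `run_unfenced` are hypothetical names invented for illustration, and the 10-second TTL and 15-second pause are arbitrary values. It shows worker A checking its lease, being paused past the TTL, and then overwriting worker B's newer value because nothing at the store stops it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeClock:
    """Manually advanced clock so the 'pause' is deterministic (assumption)."""
    now: float = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass
class LeaseLock:
    """Hypothetical in-memory stand-in for a lock service lease."""
    clock: FakeClock
    ttl: float
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, worker: str) -> bool:
        # The lease is free if nobody holds it or the holder's TTL has lapsed.
        if self.holder is None or self.clock.now >= self.expires_at:
            self.holder = worker
            self.expires_at = self.clock.now + self.ttl
            return True
        return False


@dataclass
class UnfencedStore:
    """Shared resource that accepts every write, with no token check."""
    value: str = ""
    log: List[str] = field(default_factory=list)

    def write(self, worker: str, value: str) -> None:
        self.value = value
        self.log.append(f"{worker} wrote {value}")


def run_unfenced() -> UnfencedStore:
    clock = FakeClock()
    lock = LeaseLock(clock, ttl=10.0)
    store = UnfencedStore()

    # A is elected at t=0 and confirms it holds the lease.
    a_believes_leader = lock.try_acquire("A")

    # Injected pause (think GC stall or SIGSTOP) that outlives the lease.
    clock.advance(15.0)

    # The lease has expired, so B is elected and writes.
    if lock.try_acquire("B"):
        store.write("B", "v2-from-B")

    # A resumes and acts on its pre-pause check: the stale write lands.
    if a_believes_leader:
        store.write("A", "v1-from-A")
    return store


store = run_unfenced()
print(store.log, "final:", store.value)
```

In the real project the pause would be injected into a live process and the lease would come from the lock service, but the shape of the evidence is the same: two writers in the log and a stale final value.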

Requirements

Acceptance criteria

A before/after log pair: the unfenced run shows two writers and a corrupted or stale final state; the fenced run shows leader A's stale write rejected with its lower token while leader B's write stands.
The fencing check lives at the resource, not only in the lock service: the resource rejects any write whose token is lower than the highest token it has already accepted. Demonstrate this by showing that a write which bypasses the lock service but carries a stale token is still rejected.
A measured failover time (from pause start to the new leader's first write), plus one sentence on why a shorter lease TTL speeds failover but increases false evictions of healthy-but-slow leaders.
A short write-up stating which two clocks the lease spans, why a self-check or keep-alive callback could not have prevented the stale write, and why the resource-side token check could.
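The resource-side check in these criteria can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not a prescribed implementation: `FencedStore` and `StaleTokenError` are hypothetical names, tokens are assumed to be integers that the lock service increases on every new leadership grant, and a write with a token equal to the highest seen is accepted so that the current leader can write more than once.

```python
import threading
from typing import Optional


class StaleTokenError(Exception):
    """Raised when a write carries a token lower than one already accepted."""


class FencedStore:
    """Hypothetical shared resource that enforces fencing tokens itself."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self.highest_token = 0
        self.value: Optional[str] = None
        self.rejected_writes = 0  # candidate source for a fenced-write metric

    def write(self, token: int, value: str) -> None:
        # Check and update under one mutex so two concurrent writers cannot
        # both pass the comparison before either records its token.
        with self._mutex:
            if token < self.highest_token:
                self.rejected_writes += 1
                raise StaleTokenError(
                    f"token {token} < highest accepted {self.highest_token}"
                )
            self.highest_token = token
            self.value = value


store = FencedStore()
store.write(1, "v1-from-A")      # leader A, token 1
store.write(2, "v2-from-B")      # leader B elected later, token 2

# A resumes after its pause and writes directly to the store, bypassing the
# lock service entirely. The store alone rejects it.
try:
    store.write(1, "stale-from-A")
except StaleTokenError as err:
    print("rejected:", err)

print("final:", store.value, "rejected writes:", store.rejected_writes)
```

The stale write never consults the lock service, which is the point: the store rejects it using only the token history it keeps itself. A real resource would need to persist `highest_token` atomically with the data it protects.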

Senior stretch

Reproduce the OTHER split-brain cause: partition the lock cluster (or block one worker's network) and show a correct quorum protocol bars the minority side from electing, versus a naive single-node lock that does not.
Add an on-call runbook: how to detect split-brain in logs (overlapping leader terms, rejected-token spikes), how to confirm the resource enforces fencing, and the safe order of mitigations.
Swap the lock backend (e.g. etcd to ZooKeeper) and show the application's fencing logic is unchanged because token enforcement lives at the resource, not the lock service.
Add a metric and alert: emit a counter of fenced (rejected-token) writes and alert when it is non-zero, since any fenced write means a real split-brain event just occurred and was contained.
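For the partition stretch goal, the difference between a quorum protocol and a naive lock comes down to one comparison. The sketch below is a toy model, not any lock service's actual election logic: the function names are hypothetical, and the "naive" variant assumes a design in which any single reachable lock node may grant leadership.

```python
from typing import List


def can_elect_with_quorum(reachable: int, cluster_size: int) -> bool:
    # A strict majority is required, so two disjoint sides can never both win.
    return reachable > cluster_size // 2


def can_elect_naive(reachable: int) -> bool:
    # Assumed naive design: any one reachable lock node may grant leadership.
    return reachable >= 1


def leaders_after_partition(
    sides: List[int], cluster_size: int, use_quorum: bool
) -> List[bool]:
    """For each side of a partition, report whether it can elect a leader."""
    if sum(sides) != cluster_size:
        raise ValueError("sides must account for every lock node exactly once")
    if use_quorum:
        return [can_elect_with_quorum(n, cluster_size) for n in sides]
    return [can_elect_naive(n) for n in sides]


# A 5-node lock cluster split 2 / 3.
print(leaders_after_partition([2, 3], 5, use_quorum=True))
print(leaders_after_partition([2, 3], 5, use_quorum=False))
```

With a quorum only the 3-node side can elect, while the naive model lets both sides elect, which is the split-brain the stretch goal asks you to demonstrate. An even split such as 2 / 2 in a 4-node cluster leaves neither side with a majority, so the quorum protocol gives up availability rather than safety.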

Recap

This is the loop you will run whenever a coordination incident lands: stand up real leader election, reproduce the failure (a pause that outlives the lease) instead of arguing about it, and prove the fix with logs rather than confidence. The unfenced run shows why a lock or lease alone cannot stop a paused writer; the fenced run shows why a monotonic token enforced at the resource can. Once you have watched your own two processes get fenced apart at the storage boundary, ‘elect for liveness, fence for safety’ stops being a slogan and becomes muscle memory.
