Crux: Read real lock, lease, and fencing-token snippets plus a split-brain log, predict the behaviour, and pick the fix a senior engineer would make first.
Leader-election bugs hide in the gap between ‘I hold the lock’ and ‘my write landed.’ Read the code and the log, then choose the fix that closes the gap — not the one that merely narrows it.
Goal
Practise the loop you run in every coordination incident: read the lock/lease path, find the moment a pause or partition can slip a second writer in, and reach for the resource-enforced fix before tuning a timeout.
Snippet 1 — acquire, then write
    func runJob(lock *LockService, store *ObjectStore) error {
        if err := lock.Acquire("job-leader", 10*time.Second); err != nil {
            return err // someone else leads
        }
        defer lock.Release("job-leader")
        // ... minutes of work, including a possible long GC pause ...
        return store.Put("result.csv", data) // write to shared storage
    }
Quiz
The lock is correct and exclusive. Why can this still produce two concurrent writers to result.csv?
Heads-up defer runs at function return, after Put completes — the ordering is fine. The problem is a pause BEFORE Put, while the lock is held but the lease has expired underneath it.
Heads-up Acquire is exclusive by construction — only one caller holds it. The bug is not contention; it is a paused holder waking after its lease expired.
Heads-up Any finite TTL can be exceeded by a long enough pause. Raising it lowers the false-expiry rate but never removes the possibility of a stale write.
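The pause-past-expiry scenario can be replayed deterministically with a hand-advanced clock. The sketch below is an illustration only: `LeaseLock`, `Advance`, and `simulate` are hypothetical names, not part of the lesson's `LockService`, and the 30-second pause is an arbitrary value chosen to exceed the 10-second TTL.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrHeld is returned while another owner's lease is still live.
var ErrHeld = errors.New("lease held by another owner")

// LeaseLock is a hypothetical in-memory lease driven by a manual clock.
type LeaseLock struct {
	now     time.Time
	owner   string
	expires time.Time
}

// Acquire grants the lease if it is free or the previous lease has expired.
func (l *LeaseLock) Acquire(owner string, ttl time.Duration) error {
	if l.owner != "" && l.now.Before(l.expires) {
		return ErrHeld
	}
	l.owner, l.expires = owner, l.now.Add(ttl)
	return nil
}

// Advance simulates time passing, for example while a holder is frozen in a GC pause.
func (l *LeaseLock) Advance(d time.Duration) { l.now = l.now.Add(d) }

// simulate replays the incident and returns the writers in the order they wrote.
func simulate() []string {
	lock := &LeaseLock{now: time.Unix(0, 0)}
	var writers []string

	_ = lock.Acquire("A", 10*time.Second) // A becomes leader
	aBelievesLeader := true               // A's local view, never refreshed

	lock.Advance(30 * time.Second) // A is paused well past its TTL

	// The lock service behaves correctly: the lease expired, so B may lead.
	if err := lock.Acquire("B", 10*time.Second); err == nil {
		writers = append(writers, "B")
	}

	// A wakes up and carries on from where it stopped, still trusting its old view.
	if aBelievesLeader {
		writers = append(writers, "A")
	}
	return writers
}

func main() {
	fmt.Println(simulate())
}
```

Both A and B end up in the writer list even though `Acquire` never granted the lease to two owners at the same instant. Exclusion held inside the lock service and failed at the store.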
Snippet 2 — adding a fencing token
    func runJob(lock *LockService, store *ObjectStore) error {
        token, err := lock.Acquire("job-leader", 10*time.Second) // returns monotonic token
        if err != nil {
            return err
        }
        defer lock.Release("job-leader")
        // ... work, possible pause ...
        return store.Put("result.csv", data, token) // token passed to the write
    }
Quiz
The token is now plumbed through to store.Put. Under what condition does this actually stop the stale write — and what must store.Put do?
Heads-up Passing the token does nothing unless store.Put compares it against the highest seen and rejects lower ones. A store that just logs the token still accepts the stale write.
Heads-up Tokens are compared as monotonic integers, not times. Timestamps reintroduce the clock-trust problem fencing exists to avoid.
Heads-up Put hits an external object store, outside any lock-service transaction. Enforcement must be a per-resource token-vs-highest check, not a shared transaction.
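One way the resource-side check might look is sketched here. It is a minimal in-memory stand-in, not the lesson's `ObjectStore`: `FencedStore`, `NewFencedStore`, `Get`, and `ErrStaleToken` are hypothetical names, and the token values 33 and 34 are made up for the example. A real object store would need an equivalent conditional-write primitive.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleToken is returned when a write carries a token below the highest seen.
var ErrStaleToken = errors.New("stale fencing token")

// FencedStore is a hypothetical store that tracks the highest token per key.
type FencedStore struct {
	mu      sync.Mutex
	highest map[string]uint64
	objects map[string][]byte
}

func NewFencedStore() *FencedStore {
	return &FencedStore{
		highest: make(map[string]uint64),
		objects: make(map[string][]byte),
	}
}

// Put accepts the write only if token is not below the highest token already
// seen for this key. The comparison and the write happen under one mutex, so
// no stale write can slip in between the check and the update.
func (s *FencedStore) Put(key string, data []byte, token uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highest[key] {
		return fmt.Errorf("%w: got %d, highest seen %d", ErrStaleToken, token, s.highest[key])
	}
	s.highest[key] = token
	s.objects[key] = data
	return nil
}

// Get returns the current contents of key, or nil if it was never written.
func (s *FencedStore) Get(key string) []byte {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.objects[key]
}

func main() {
	store := NewFencedStore()

	// The new leader (token 34) writes first.
	fmt.Println(store.Put("result.csv", []byte("new"), 34))

	// The deposed leader wakes from its pause and retries with token 33.
	err := store.Put("result.csv", []byte("stale"), 33)
	fmt.Println(errors.Is(err, ErrStaleToken), string(store.Get("result.csv")))
}
```

The check lives next to the data it protects and compares integers only, so it needs no clock and no cooperation from the paused writer. Equal tokens are accepted here so that the same leader can write more than once during its term.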
Snippet 3 — the lease keep-alive loop
    func keepLeadership(c *etcd.Client, leaseID etcd.LeaseID, onLost func()) {
        ka, _ := c.KeepAlive(ctx, leaseID) // channel of renew acks
        for {
            select {
            case resp, ok := <-ka:
                if !ok { // channel closed = renew failed
                    onLost() // we are no longer leader
                    return
                }
                _ = resp
            }
        }
    }
Quiz
onLost() fires when keep-alive renewals fail. Why is acting on onLost necessary but NOT sufficient for safety?
Heads-up A paused process executes no goroutines, so onLost cannot fire during the pause. By the time it runs, the stale write may already be in flight. Self-notification cannot be the safety mechanism.
Heads-up Interval tuning changes renewal frequency, not the fact that a stopped process observes nothing. The gap is structural, not a tuning value.
Heads-up A closed channel reliably signals renew failure; spurious acks are not the issue. The issue is that the callback depends on the suspect node staying alive to fire.
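The "necessary" half can be sketched by wiring `onLost` to a context cancel and checking that context before each write. This uses only the standard library: a plain channel stands in for the keep-alive channel, and `watchLease`, `writeIfLeader`, and `demo` are hypothetical helpers, not etcd APIs.

```go
package main

import (
	"context"
	"fmt"
)

// watchLease drains renew acks and calls onLost once the channel closes,
// mirroring the shape of the keep-alive loop.
func watchLease(acks <-chan struct{}, onLost func()) {
	for range acks {
		// Each ack means the lease was renewed; nothing to do.
	}
	onLost()
}

// writeIfLeader refuses to start a write once leadership is known to be lost.
func writeIfLeader(ctx context.Context, write func() error) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("leadership lost before write: %w", err)
	}
	// A pause can still land here, after the check and before the write.
	// Only a token check at the store covers that window.
	return write()
}

// demo returns the write result while the lease is live and after it is lost.
func demo() (before, after error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	noop := func() error { return nil }

	before = writeIfLeader(ctx, noop) // lease still live

	acks := make(chan struct{}, 2)
	acks <- struct{}{} // two successful renewals...
	acks <- struct{}{}
	close(acks)              // ...then the renewals stop
	watchLease(acks, cancel) // onLost cancels the leadership context

	after = writeIfLeader(ctx, noop)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before)
	fmt.Println(after)
}
```

Stepping down this way stops a healthy process from starting new work once it learns it has lost the lease. It does nothing for a process frozen between the `ctx.Err()` check and the call to `write`, which is why this guard complements the store-side check rather than replacing it.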
Reading this log, which statement is the correct senior reading?
Heads-up The lease worked as designed: it expired and triggered re-election. Leases cannot stop a paused process from waking and writing — that is exactly why the token check, which DID catch it, is needed.
Heads-up Incrementing the term on each election is correct Raft behaviour, not a bug. The term bump is how the new leadership is made unambiguous.
Heads-up Waiting on a paused node for confirmation is how you lose availability forever — it may never respond. Quorum-based election deliberately does not wait for the suspected-dead node.
Recap
Every leader-election bug reads the same way: a lock or lease guarantees exclusion only while the holder is running, so a pause between ‘acquire’ and ‘write’ lets a deposed leader resume and write stale data. Plumbing a monotonic fencing token through to the write is necessary, but it is the RESOURCE rejecting any token below its highest-seen that actually enforces safety. Keep-alive callbacks and shorter TTLs cannot help a process that is not executing. Diagnose from the log, find the pause-or-partition gap, and fix it where the write lands — not where the lock is granted.