Distributed Systems DIST · 04 · 09

Leader election: code and log reading

Read real lock, lease, and fencing-token snippets plus a split-brain log, predict the behaviour, and pick the fix a senior engineer would make first.

DIST Senior ◷ 14 min

Leader-election bugs hide in the gap between ‘I hold the lock’ and ‘my write landed.’ Read the code and the log, then choose the fix that closes the gap — not the one that merely narrows it.

Goal

Practise the loop you run in every coordination incident: read the lock/lease path, find the moment a pause or partition can slip a second writer in, and reach for the resource-enforced fix before tuning a timeout.

Snippet 1 — acquire, then write

func runJob(lock *LockService, store *ObjectStore, data []byte) error {
    if err := lock.Acquire("job-leader", 10*time.Second); err != nil {
        return err // someone else leads
    }
    defer lock.Release("job-leader")

    // ... minutes of work, including a possible long GC pause ...
    return store.Put("result.csv", data) // write to shared storage
}

Quiz

The lock is correct and exclusive. Why can this still produce two concurrent writers to result.csv?

Snippet 2 — adding a fencing token

func runJob(lock *LockService, store *ObjectStore, data []byte) error {
    token, err := lock.Acquire("job-leader", 10*time.Second) // returns monotonic token
    if err != nil {
        return err
    }
    defer lock.Release("job-leader")

    // ... work, possible pause ...
    return store.Put("result.csv", data, token) // token passed to the write
}

Quiz

The token is now plumbed through to store.Put. Under what condition does this actually stop the stale write — and what must store.Put do?

Snippet 3 — the lease keep-alive loop

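func keepLeadership(ctx context.Context, c *etcd.Client, leaseID etcd.LeaseID, onLost func()) {
    ka, err := c.KeepAlive(ctx, leaseID) // channel of renew acks
    if err != nil {
        onLost() // renewals never started: do not assume leadership
        return
    }
    for range ka {
        // each ack means the lease was renewed; nothing else to do
    }
    // channel closed = renewals stopped (renew failed or ctx cancelled)
    onLost() // we are no longer leader
}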

Quiz

onLost() fires when keep-alive renewals fail. Why is acting on onLost necessary but NOT sufficient for safety?

Snippet 4 — a split-brain log

12:00:01  leaderA  acquired lease (term=7, token=33), writing batch
12:00:04  leaderA  >>> stop-the-world GC pause begins
12:00:11  coord    leaseA expired (no keep-alive 10s); electing
12:00:11  coord    leaderB won (term=8, token=34)
12:00:11  leaderB  acquired lease, store accepted write token=34
12:00:18  leaderA  <<< GC pause ends (14s)
12:00:18  leaderA  store.Put(result.csv, token=33) -> REJECTED (highest=34)
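
The coordinator's side of this log can be replayed with a manual clock. The following is a minimal sketch, not the lock service behind the lesson: `LeaseTable`, `clock`, and `ttl` are hypothetical names, the calendar date is arbitrary, and the token counter is seeded at 32 only so the grants match the log's 33 and 34.

```go
package main

import (
	"fmt"
	"time"
)

// ttl mirrors the 10s expiry shown in the log (assumed to be the lease TTL).
const ttl = 10 * time.Second

// clock returns 12:00:<sec> on an arbitrary date, standing in for log timestamps.
func clock(sec int) time.Time {
	return time.Date(2024, 1, 1, 12, 0, sec, 0, time.UTC)
}

// LeaseTable is a hypothetical single-lease coordinator.
type LeaseTable struct {
	holder  string
	expires time.Time
	token   uint64
}

// Acquire grants the lease if it is free or has expired at now, and returns
// a token that increases by one with every grant.
func (l *LeaseTable) Acquire(who string, now time.Time) (uint64, bool) {
	if l.holder != "" && now.Before(l.expires) {
		return 0, false // current lease is still valid
	}
	l.holder = who
	l.expires = now.Add(ttl)
	l.token++
	return l.token, true
}

func main() {
	lease := &LeaseTable{token: 32} // seed so the first grant is 33

	tokA, _ := lease.Acquire("leaderA", clock(1))  // 12:00:01
	_, early := lease.Acquire("leaderB", clock(10)) // 12:00:10, lease still valid
	tokB, okB := lease.Acquire("leaderB", clock(11)) // 12:00:11, lease expired

	fmt.Println("A token:", tokA)
	fmt.Println("B at 12:00:10 granted:", early)
	fmt.Println("B at 12:00:11 granted:", okB, "token:", tokB)

	// leaderA was paused from 12:00:04 to 12:00:18 and sent no renewals.
	fmt.Println("pause:", clock(18).Sub(clock(4)), "holder now:", lease.holder)
}
```

Nothing in this replay touches leaderA: the coordinator moves on purely because time passed, which is why leaderA still believes it leads when it wakes at 12:00:18.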

Quiz

Which statement reads this log the way a senior engineer would?

Recap

The leader-election bugs in this lesson all read the same way: a lock or lease guarantees exclusion only until it expires, and a paused or partitioned holder cannot observe that expiry, so a deposed leader can resume in the gap between ‘acquire’ and ‘write’ and write stale data. Plumbing a monotonic fencing token through to the write is necessary, but safety is enforced only when the resource itself rejects any token below the highest it has seen. Keep-alive callbacks and shorter TTLs cannot help a process that is not executing. Diagnose from the log, find the pause-or-partition gap, and fix it where the write lands, not where the lock is granted.
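
The resource-side check can be sketched in a few lines. This is an illustration under stated assumptions, not the lesson's `ObjectStore`: `FencedStore`, `ErrStaleToken`, and `Get` are hypothetical names, the store is in-memory, and a mutex stands in for whatever atomic compare-and-write a real storage backend would need to provide.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleToken marks a write whose token is below the highest seen so far.
var ErrStaleToken = errors.New("stale fencing token")

// FencedStore is a hypothetical in-memory stand-in for the shared store.
type FencedStore struct {
	mu      sync.Mutex
	highest uint64 // highest fencing token accepted so far
	objects map[string][]byte
}

func NewFencedStore() *FencedStore {
	return &FencedStore{objects: make(map[string][]byte)}
}

// Put applies the write only if token is not below the highest token seen.
// The check and the write share one critical section, so no stale write can
// slip in between them (an in-process guarantee only).
func (s *FencedStore) Put(key string, data []byte, token uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highest {
		return fmt.Errorf("%w: got %d, highest seen %d", ErrStaleToken, token, s.highest)
	}
	s.highest = token
	s.objects[key] = data
	return nil
}

// Get returns the stored bytes for key, if any.
func (s *FencedStore) Get(key string) ([]byte, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	data, ok := s.objects[key]
	return data, ok
}

func main() {
	store := NewFencedStore()

	// leaderB writes first with the newer token.
	fmt.Println("token 34:", store.Put("result.csv", []byte("from B"), 34))

	// leaderA wakes from its pause and retries with the older token.
	err := store.Put("result.csv", []byte("from A"), 33)
	fmt.Println("token 33 rejected:", errors.Is(err, ErrStaleToken))
}
```

A write with a token equal to the highest seen is accepted here so that the current leader can write more than once under a single lease; only strictly lower tokens are refused.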
