Crux Read real config and code from across the track — quorum sizing, fencing checks, saga compensation, backoff — predict the failure, and pick the highest-leverage fix.
Composition bugs are caught by reading the config and the code at the seams, not the prose. Read each snippet, predict how it behaves under failure, and choose the fix a senior makes first.
Goal
Practise the loop you run in every distributed incident: read the quorum sizing, the fencing check, the compensation, and the retry policy, then find the one change that makes the seam safe.
Snippet 1 — the quorum config
# replicated order store
replication:
  N: 3   # replicas per key
  W: 1   # acks required to confirm a write
  R: 1   # replicas read on a get
Quiz
With N=3, W=1, R=1, what does a read guarantee about a just-confirmed write, and how do you fix it for read-your-writes?
Heads-up W=1 acknowledges after a single replica accepts the write. The other two may not have it yet, and an R=1 read can land on one of those: acknowledgement does not mean the write has reached a quorum that every read overlaps.
Heads-up W=1 also risks losing the write if that one replica fails before propagating. R + W > N is the correctness property itself, not just a latency cost: it guarantees that every read set overlaps the latest confirmed write set, and raising W to 2 means a confirmed write survives the loss of one replica.
Heads-up N=1 removes fault tolerance entirely — a single node loss loses the data. The goal is overlap with redundancy: keep N=3 and set W=2, R=2.
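The overlap rule can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and it models a quorum as a plain set of replica indexes, ignoring real-world complications such as sloppy quorums or in-flight repair. It compares the R + W > N shortcut against a brute-force search for a write set and a read set that share no replica.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Shortcut rule: every read set meets every write set when R + W > N."""
    return r + w > n


def stale_read_possible(n: int, w: int, r: int) -> bool:
    """Brute force: does some write set of size w miss some read set of size r?"""
    replicas = range(n)
    return any(
        set(write_set).isdisjoint(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for n, w, r in [(3, 1, 1), (3, 2, 2), (3, 3, 1)]:
    print(n, w, r, quorums_overlap(n, w, r), stale_read_possible(n, w, r))
# 3 1 1 False True
# 3 2 2 True False
# 3 3 1 True False
```

With N=3, W=1, R=1 the search finds a read that misses the write; with W=2, R=2 no such pair exists, and one replica can still fail without blocking either operation.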
Snippet 2 — the fencing check
# resource guarding a saga step; called by whichever node believes it is leader
highest_token_seen = 0

def apply_step(step, token):
    global highest_token_seen
    # accept the write, then remember the token
    write(step)
    highest_token_seen = max(highest_token_seen, token)
    return "ok"
Quiz
A paused old leader wakes with token=7 while a new leader has already applied steps at token=12. What does this code do, and what is the fix?
Heads-up max() updates the bookkeeping but only after write(step) has already run. The corrupting write happens regardless; the token must be checked and rejected before the write, not recorded after it.
Heads-up Monotonic integers are the canonical fencing token. The defect is the ordering — writing before validating — not the token type. A UUID would not even be comparable for monotonicity.
Heads-up Relying on a later write to repair corruption is a race, not a guarantee — the stale step may be read or trigger effects before any overwrite. The fence must reject the stale write outright.
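A minimal sketch of the check-then-write ordering follows. It is an illustration under assumptions, not the lesson's reference solution: `FencedResource`, `StaleTokenError`, and the `applied` list (standing in for the real `write(step)`) are hypothetical names, and a single in-process lock stands in for whatever makes the check and the write atomic on the real resource.

```python
import threading


class StaleTokenError(Exception):
    """Raised when a caller presents a token older than one already seen."""


class FencedResource:
    def __init__(self):
        self.highest_token_seen = 0
        self.applied = []  # hypothetical stand-in for the real write(step)
        self._lock = threading.Lock()

    def apply_step(self, step, token):
        # The check and the write happen under one lock so that no stale
        # write can slip in between validating the token and applying the step.
        with self._lock:
            if token < self.highest_token_seen:
                raise StaleTokenError(
                    f"token {token} is older than {self.highest_token_seen}"
                )
            self.highest_token_seen = token
            self.applied.append(step)
            return "ok"


resource = FencedResource()
resource.apply_step("reserve-stock", 12)   # new leader
try:
    resource.apply_step("charge-card", 7)  # paused old leader wakes up
    stale_outcome = "applied"
except StaleTokenError:
    stale_outcome = "rejected"
```

The stale call is rejected before anything is written, so `applied` contains only the new leader's step. Equal tokens are accepted here on the assumption that one leader issues several steps under the same token.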
Snippet 3 — the saga compensation
# compensation for a cancelled order; invoked by the retry layer on timeout
def refund(order_id, amount):
    # no idempotency key threaded in
    charge_id = payment_api.create_refund(order_id=order_id, amount=amount)
    return charge_id
Quiz
The first call's refund succeeds but its response is lost; the retry layer calls refund() again within budget. What happens, and what is the single highest-leverage fix?
Heads-up The saga emits one compensation, but the retry layer re-invokes it on timeout — that is the whole point. One logical compensation becomes two physical calls, and with no key the second one refunds again.
Heads-up Passing order_id is not an idempotency key — the API has no record tying this call to the first attempt, so it creates a fresh refund. A dedicated, recorded idempotency key is what makes the repeat a no-op.
Heads-up Without retries, a genuinely lost refund request never completes and a cancelled order silently keeps the customer's money: a wrong number in the other direction. Retries are needed; they need a key, not removal.
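To make the keyed retry concrete, here is a small sketch. Everything in it is assumed for illustration: `FakePaymentAPI` is a hypothetical in-memory stand-in for a provider that deduplicates on a key, and the `idempotency_key` parameter is not taken from any real payment API. The point it shows is that the key is minted once per logical compensation and reused by every retry of it.

```python
import uuid


class FakePaymentAPI:
    """Hypothetical provider that remembers refunds by idempotency key."""

    def __init__(self):
        self._refund_by_key = {}
        self.refunds_issued = 0

    def create_refund(self, order_id, amount, idempotency_key):
        # A repeated key returns the recorded result instead of paying again.
        if idempotency_key in self._refund_by_key:
            return self._refund_by_key[idempotency_key]
        self.refunds_issued += 1
        refund_id = f"refund-{self.refunds_issued}"
        self._refund_by_key[idempotency_key] = refund_id
        return refund_id


def refund(payment_api, order_id, amount, idempotency_key):
    # The key is passed in by the caller; it is never generated per attempt.
    return payment_api.create_refund(
        order_id=order_id, amount=amount, idempotency_key=idempotency_key
    )


api = FakePaymentAPI()
key = str(uuid.uuid4())  # minted once by the saga, stored with the saga step

first = refund(api, "order-42", 1999, key)   # succeeds, response is lost
second = refund(api, "order-42", 1999, key)  # retry with the same key
```

Both calls return the same refund identifier and only one refund is issued. In a real saga the key would need to be persisted alongside the step so that a retry after a crash reuses it.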
Snippet 4 — the retry policy
def call_with_retry(fn, attempts=6):
    for i in range(attempts):
        try:
            return fn()
        except Timeout:
            if i == attempts - 1:
                raise  # out of attempts: surface the last timeout
            # fixed delay, no jitter, no budget
            sleep(1.0)
Quiz
A downstream service blips for two seconds under load. Thousands of callers run this exact policy. What does the fixed 1.0s sleep produce, and what does a senior change?
Heads-up The delay length is not the issue; the synchronization is. A fixed delay makes every caller retry in lockstep, so the recovering service is hit by aligned waves. Jitter desynchronizes them.
Heads-up More attempts amplify the herd further and add more load while the service is still failing. The fix is the shape of the retries (backoff, jitter, and a budget), not the count.
Heads-up A longer fixed delay still synchronizes every caller; it just moves the wave later. Only randomized jitter breaks the alignment, and only a budget bounds total retry load.
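One possible shape for the repaired policy is sketched below, under stated assumptions: the delay uses exponential backoff with "full jitter" (a uniform draw between zero and the capped exponential), and "budget" is read here as a per-call time budget. A fleet-wide retry budget, such as a cap on the ratio of retries to requests, is another valid reading that this sketch does not cover. The `base`, `cap`, and `budget_s` values, and the local `Timeout` class, are placeholders.

```python
import random
import time


class Timeout(Exception):
    """Stand-in for the Timeout exception used in the snippet."""


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full jitter: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn, attempts=6, budget_s=10.0,
                    sleep=time.sleep, clock=time.monotonic):
    deadline = clock() + budget_s
    for attempt in range(attempts):
        try:
            return fn()
        except Timeout:
            delay = backoff_delay(attempt)
            # Stop when attempts run out or the next sleep would overrun
            # the time budget; the last Timeout propagates to the caller.
            if attempt == attempts - 1 or clock() + delay > deadline:
                raise
            sleep(delay)


# Usage: a call that times out twice, then succeeds. Sleeps are recorded
# instead of performed so the example runs instantly.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise Timeout()
    return "ok"

delays = []
result = call_with_retry(flaky, sleep=delays.append)
```

Because each caller draws its own random delay, thousands of callers no longer retry on the same tick, and the budget bounds how long any one of them keeps adding load.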
Recap
Every distributed incident is read at the seam: a quorum with R + W not greater than N silently serves stale reads; a fence that writes before checking the token lets a stale leader corrupt state; a compensation invoked by the retry layer with no shared key double-refunds; and a fixed-delay retry with no jitter or budget turns a blip into a herd. Read the config and the code, fix the seam — add the overlap, reject before writing, thread the key, add jitter and a budget — then verify under failure injection.