Distributed capstone: code and config reading

Read real config and code from across the track — quorum sizing, fencing checks, saga compensation, backoff — predict the failure, and pick the highest-leverage fix.

Composition bugs are caught by reading the config and the code at the seams, not by reading the prose. Read each snippet, predict how it behaves under failure, and choose the fix a senior engineer would make first.

Goal

Practise the loop you run in every distributed incident: read the quorum sizing, the fencing check, the compensation, and the retry policy, then find the one change that makes the seam safe.

When you read each snippet below, ask yourself: what happens if this runs during a failure, and which single line is the load-bearing fix?

Snippet 1 — the quorum config

# replicated order store
replication:
  N: 3          # replicas per key
  W: 1          # acks required to confirm a write
  R: 1          # replicas read on a get

Quiz

With N=3, W=1, R=1, what does a read guarantee about a just-confirmed write, and how do you fix it for read-your-writes?
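One way to reason about Snippet 1 is to check the overlap condition directly. With W=1 and R=1, the single replica that acknowledged the write and the single replica that serves the read can be different nodes, so a read may return the old value. The sketch below assumes strict quorums and versioned values, so that a reader contacting several replicas can pick the newest; the function name `quorum_overlaps` is hypothetical and not part of the lesson's config.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """Return True when every read set must intersect every write set."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    # Overlap is guaranteed only when the two sets cannot both fit in N
    # replicas without sharing at least one member.
    return r + w > n


# The config from Snippet 1: a read can miss the one replica that acked.
current = quorum_overlaps(n=3, w=1, r=1)

# One possible fix: majority writes and majority reads.
fixed = quorum_overlaps(n=3, w=2, r=2)

print(current, fixed)
```

Setting W=2 and R=2 is only one choice that satisfies R + W > N. W=3 with R=1 also overlaps, at the cost of writes that block whenever any replica is down.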

Snippet 2 — the fencing check

# resource guarding a saga step; called by whichever node believes it is leader
highest_token_seen = 0

def apply_step(step, token):
    global highest_token_seen
    # accept the write, then remember the token
    write(step)
    highest_token_seen = max(highest_token_seen, token)
    return "ok"

Quiz

A paused old leader wakes with token=7 while a new leader has already applied steps at token=12. What does this code do, and what is the fix?
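As written, Snippet 2 performs the write for token=7 and only then compares tokens, so the stale leader's step lands on top of the new leader's work. A minimal sketch of the reordered check follows. The `FencedResource` class, the `applied` list standing in for `write(step)`, and the step names are illustrative assumptions, and a real resource would need the check and the write to be atomic in its own storage layer rather than under an in-process lock.

```python
import threading


class FencedResource:
    """Hypothetical wrapper around the resource from Snippet 2."""

    def __init__(self):
        self._lock = threading.Lock()
        self.highest_token_seen = 0
        self.applied = []  # stand-in for the real write(step)

    def apply_step(self, step, token):
        # Check and write under one lock so no stale write slips in between.
        with self._lock:
            if token < self.highest_token_seen:
                return "rejected"  # stale leader: refuse before touching state
            self.highest_token_seen = token
            self.applied.append(step)
            return "ok"


resource = FencedResource()
new_leader = resource.apply_step("reserve-stock", token=12)
old_leader = resource.apply_step("cancel-order", token=7)
print(new_leader, old_leader, resource.applied)
```

The comparison is `token < highest_token_seen` so that the current leader can apply several steps under the same token; only strictly older tokens are refused.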

Snippet 3 — the saga compensation

# compensation for a cancelled order; invoked by the retry layer on timeout
def refund(order_id, amount):
    # no idempotency key threaded in
    refund_id = payment_api.create_refund(order_id=order_id, amount=amount)
    return refund_id

Quiz

The first call's refund succeeds but its response is lost; the retry layer calls refund() again within budget. What happens, and what is the single highest-leverage fix?
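Because the retry layer cannot tell a lost response from a failed call, the second invocation of Snippet 3 issues a second refund. The sketch below threads one idempotency key through every attempt. `FakePaymentAPI` is a hypothetical in-memory stand-in, and the `idempotency_key` parameter is an assumption about what the real payment API accepts; check the provider's documentation for its actual mechanism.

```python
import uuid


class FakePaymentAPI:
    """Hypothetical in-memory provider that deduplicates on a caller-supplied key."""

    def __init__(self):
        self._refunds_by_key = {}

    def create_refund(self, order_id, amount, idempotency_key):
        # A repeated key returns the original refund instead of issuing another.
        if idempotency_key not in self._refunds_by_key:
            self._refunds_by_key[idempotency_key] = f"re_{uuid.uuid4().hex[:8]}"
        return self._refunds_by_key[idempotency_key]

    def refund_count(self):
        return len(self._refunds_by_key)


payment_api = FakePaymentAPI()


def refund(order_id, amount, idempotency_key):
    # The key is chosen once per logical refund and reused on every retry.
    return payment_api.create_refund(
        order_id=order_id, amount=amount, idempotency_key=idempotency_key
    )


key = "refund:order-42"  # derived from the saga step, not generated per attempt
first = refund("order-42", 1999, key)
retry = refund("order-42", 1999, key)
print(first == retry, payment_api.refund_count())
```

The important design choice is where the key is generated: it has to come from above the retry layer (for example, from the saga step's identity) so that every attempt carries the same value.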

Snippet 4 — the retry policy

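def call_with_retry(fn, attempts=6):
    last_error = None
    for i in range(attempts):
        try:
            return fn()
        except Timeout as exc:
            # fixed delay, no jitter, no budget
            last_error = exc
            sleep(1.0)
    # re-raise the last Timeout; a bare `raise` here has no active exception
    raise last_error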

Quiz

A downstream service blips for two seconds under load. Thousands of callers run this exact policy. What does the fixed 1.0s sleep produce, and what would a senior engineer change?
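With a fixed one-second sleep, callers that timed out together retry together, so the recovering service is hit by synchronized waves one second apart. A minimal sketch of one alternative follows: exponential backoff with full jitter, capped by a per-call time budget. The parameter names (`base`, `cap`, `budget`) and the injectable `sleep`, `rng`, and `clock` are assumptions made for illustration, and "budget" is read here as a per-call deadline; a fleet-wide retry budget (for example, a token bucket limiting the share of retry traffic) is another valid reading.

```python
import random
import time


class Timeout(Exception):
    """Stand-in for the Timeout exception used in Snippet 4."""


def call_with_retry(fn, attempts=6, base=0.1, cap=5.0, budget=10.0,
                    sleep=time.sleep, rng=random.random, clock=time.monotonic):
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    deadline = clock() + budget
    last_error = None
    for i in range(attempts):
        try:
            return fn()
        except Timeout as exc:
            last_error = exc
        if i == attempts - 1:
            break  # no point sleeping after the final attempt
        # Full jitter: a random delay between 0 and the capped exponential step.
        delay = rng() * min(cap, base * (2 ** i))
        if clock() + delay > deadline:
            break  # budget exhausted: fail fast instead of piling on
        sleep(delay)
    raise last_error
```

Randomizing the whole delay, rather than adding a small random offset to a fixed one, is what spreads the retries out; the budget then bounds how long any single caller keeps adding load.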

Recap

Every distributed incident is read at the seam. A quorum with R + W ≤ N can silently serve stale reads. A fence that writes before checking the token lets a stale leader corrupt state. A compensation that the retry layer invokes with no shared idempotency key double-refunds. A fixed-delay retry with no jitter or budget turns a blip into a thundering herd.

Read the config and the code, then fix the seam: add the overlap, reject before writing, thread the key, add jitter and a budget. Then verify the fix under failure injection.

When you open an unfamiliar distributed service during an incident, the four snippets above are the checklist: check quorum overlap, check fencing order, check whether idempotency keys cross every retry boundary, and check whether backoff and a budget cap the retries.
