Distributed capstone: code and config reading

Crux: Read real config and code from across the track (quorum sizing, fencing checks, saga compensation, backoff), predict the failure, and pick the highest-leverage fix.

Composition bugs are caught by reading the config and the code at the seams, not the prose. Read each snippet, predict how it behaves under failure, and choose the fix a senior makes first.

Goal

Practise the loop you run in every distributed incident: read the quorum sizing, the fencing check, the compensation, and the retry policy, then find the one change that makes the seam safe.

Snippet 1 — the quorum config

# replicated order store
replication:
  N: 3          # replicas per key
  W: 1          # acks required to confirm a write
  R: 1          # replicas read on a get
Quiz

With N=3, W=1, R=1, what does a read guarantee about a just-confirmed write, and how do you fix it for read-your-writes?
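
A minimal sketch of the overlap rule behind this quiz, assuming a strict (non-sloppy) quorum in which a read returns the newest version among the replicas it contacts. The helper names `quorums_overlap` and `always_intersect` are illustrative and not part of any store's API. The second function brute-forces every possible write set and read set to confirm the arithmetic.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Arithmetic rule: any R replicas and any W replicas share a member iff R + W > N."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    return r + w > n


def always_intersect(n: int, w: int, r: int) -> bool:
    """Brute-force check of the same property over every write set and read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# The snippet's config first, then two settings that restore overlap.
for w, r in [(1, 1), (2, 2), (3, 1)]:
    print(f"N=3 W={w} R={r}: overlap={quorums_overlap(3, w, r)}")
```

With the snippet's N=3, W=1, R=1, the check is false: a read can land on a replica the confirmed write never reached. W=2, R=2 (or W=3, R=1) restores the overlap, at the cost of slower or less available writes.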

Snippet 2 — the fencing check

# resource guarding a saga step; called by whichever node believes it is leader
highest_token_seen = 0

def apply_step(step, token):
    global highest_token_seen
    # accept the write, then remember the token
    write(step)
    highest_token_seen = max(highest_token_seen, token)
    return "ok"
Quiz

A paused old leader wakes with token=7 while a new leader has already applied steps at token=12. What does this code do, and what is the fix?
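
One possible shape of the fix, sketched under assumptions: the check runs before the write, and the check plus the write form one atomic unit (a lock here; a conditional write in a real store). `StaleToken`, `_lock`, and the `applied_steps` list standing in for `write(step)` are hypothetical names for illustration. Equal tokens are accepted because the same leader applies several steps under one token.

```python
import threading


class StaleToken(Exception):
    """Raised when a caller presents a token older than one already seen."""


_lock = threading.Lock()
highest_token_seen = 0
applied_steps = []  # hypothetical stand-in for the real write(step)


def apply_step(step, token):
    global highest_token_seen
    with _lock:  # the check and the write must not be interleaved
        if token < highest_token_seen:
            # reject BEFORE touching state
            raise StaleToken(f"token {token} < {highest_token_seen}")
        highest_token_seen = token
        applied_steps.append(step)
    return "ok"


apply_step("reserve-stock", token=12)       # new leader
try:
    apply_step("charge-card", token=7)      # paused old leader wakes up
except StaleToken as exc:
    print("rejected:", exc)
```

The original snippet would have written the token=7 step and only then compared tokens, so the stale write lands even though `highest_token_seen` stays at 12.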

Snippet 3 — the saga compensation

# compensation for a cancelled order; invoked by the retry layer on timeout
def refund(order_id, amount):
    # no idempotency key threaded in
    charge_id = payment_api.create_refund(order_id=order_id, amount=amount)
    return charge_id
Quiz

The first call's refund succeeds but its response is lost; the retry layer calls refund() again within budget. What happens, and what is the single highest-leverage fix?
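
A sketch of threading an idempotency key through the compensation, assuming the payment provider deduplicates on a caller-supplied key. `FakePaymentAPI` and the `idempotency_key` parameter are hypothetical stand-ins; a real provider's parameter name and semantics may differ. The key is derived once from the saga step, not generated per attempt, so every retry presents the same key.

```python
class FakePaymentAPI:
    """Hypothetical stand-in for payment_api that dedupes on idempotency_key."""

    def __init__(self):
        self._refund_by_key = {}
        self.refunds_issued = 0

    def create_refund(self, order_id, amount, idempotency_key):
        # A repeated key returns the original result instead of moving money again.
        if idempotency_key in self._refund_by_key:
            return self._refund_by_key[idempotency_key]
        self.refunds_issued += 1
        refund_id = f"refund-{self.refunds_issued}"
        self._refund_by_key[idempotency_key] = refund_id
        return refund_id


payment_api = FakePaymentAPI()


def refund(order_id, amount, idempotency_key):
    return payment_api.create_refund(
        order_id=order_id, amount=amount, idempotency_key=idempotency_key
    )


order_id = "order-42"
key = f"refund:{order_id}"           # stable across retries of this saga step
first = refund(order_id, 1999, key)  # succeeds, but imagine the response is lost
second = refund(order_id, 1999, key) # retry layer calls again
print(first, second, payment_api.refunds_issued)
```

Both calls return the same refund id and only one refund is issued, which is what makes the compensation safe to retry.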

Snippet 4 — the retry policy

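from time import sleep

# Timeout is assumed to be the exception class raised by the client library in use

def call_with_retry(fn, attempts=6):
    for i in range(attempts):
        try:
            return fn()
        except Timeout:
            # fixed delay, no jitter, no budget
            sleep(1.0)
    # all attempts timed out: surface the failure to the caller
    raise Timeout(f"gave up after {attempts} attempts")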
Quiz

A downstream service blips for two seconds under load. Thousands of callers run this exact policy. What does the fixed 1.0s sleep produce, and what does a senior change?
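
A sketch of one alternative policy: capped exponential backoff with full jitter and a total time budget. Several things here are assumptions for illustration: the local `Timeout` class, the `base`, `cap`, and `budget` parameters and their defaults, and the injectable `sleep`, `rng`, and `clock` hooks (present only to make the function testable). "Budget" is read here as a per-call deadline; a retry-ratio budget shared across calls is another common meaning and is not shown.

```python
import random
import time


class Timeout(Exception):
    """Local stand-in for the client library's timeout exception."""


def call_with_retry(fn, attempts=6, base=0.1, cap=5.0, budget=10.0,
                    sleep=time.sleep, rng=random.random, clock=time.monotonic):
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    deadline = clock() + budget
    for attempt in range(attempts):
        try:
            return fn()
        except Timeout:
            if attempt == attempts - 1:
                raise  # out of attempts
            # full jitter: uniform in [0, min(cap, base * 2**attempt))
            delay = rng() * min(cap, base * 2 ** attempt)
            if clock() + delay > deadline:
                raise  # sleeping would exceed the time budget
            sleep(delay)


# Demo: fail twice, then succeed; record delays instead of really sleeping.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise Timeout()
    return "ok"


result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
```

Because each caller draws a different random delay, retries from thousands of clients spread out instead of arriving in synchronized one-second waves, and the budget stops any single caller from retrying indefinitely.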

Recap

Every distributed incident is read at the seam. A quorum with R + W ≤ N can silently serve stale reads; a fence that writes before checking the token lets a stale leader corrupt state; a compensation retried without a shared idempotency key double-refunds; and a fixed-delay retry with no jitter or budget turns a blip into a thundering herd. Read the config and the code, then fix the seam (add the overlap, reject before writing, thread the key, add jitter and a budget) and verify the fix under failure injection.
