Distributed Systems
Retry amplification: multiple-choice review
Six questions that cut across the whole unit. Each one mirrors a call you make mid-incident — not a definition to recite, but a fan-out number to compute and a fix to rank while the backend is on fire.
Confirm you can connect retry fan-out math, the metastable sustaining loop, and the defense ladder — jitter, retry budgets, circuit breakers, idempotency, deadline propagation — the synthesis the unit built toward.
A request crosses 5 service layers and every layer makes up to 3 attempts (the original call plus retries) on failure. In the worst case, how many calls hit the deepest dependency for one failing request, and why does this ambush people?
The 8-second database failover that triggered the incident finished 30 minutes ago, but the fleet is still down under a retry storm. What is actually keeping it down?
A team adds fixed-interval retries (wait 200 ms, retry, wait 200 ms, retry) at every layer to 'be gentle.' During the next dependency blip the outage is worse than before. Why?
You have already shipped exponential backoff with full jitter at every layer, yet a full dependency outage still drives the backend to roughly 80x its normal load. What did jitter not buy you, and what do you add?
A circuit breaker opens once the failure rate of calls to a dependency crosses its threshold. What does this accomplish that a retry budget alone does not?
Reviewing a service's retry config, you find it retries every failure up to 3 times — including 400s, 409 conflicts, and non-idempotent POSTs — with no deadline propagation. Which two rules is it violating, and why do they matter most under load?
The through-line is one decision tree. Retries compose by multiplication: the worst-case fan-out is attempts per layer raised to the call depth, so 3 attempts across 5 layers is 3⁵ = 243, and a tiny error rate becomes a self-inflicted DDoS. The storm then sustains itself as a metastable failure long after the trigger clears. The defenses layer from broadest to finest: jitter de-synchronizes the herd in time, a ~10% retry budget caps the volume, and a circuit breaker grants a zero-retry recovery window. The non-negotiable rules are to retry only idempotent operations and retryable errors, and to propagate the deadline so no layer ever retries a request that is already dead.
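The summary above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a production retry library: the names `worst_case_calls`, `full_jitter_delay`, `RetryBudget`, and `should_retry` are hypothetical, as are the retryable status set, the idempotent method set, and the backoff base and cap. The budget is modeled as a simple lifetime ratio of retries to first attempts; real implementations usually track it over a sliding window.

```python
import random
from dataclasses import dataclass

# Assumed policy sets for illustration; tune them to your own API contract.
RETRYABLE_STATUS = {429, 502, 503, 504}
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}


def worst_case_calls(attempts_per_layer: int, depth: int) -> int:
    """Calls reaching the deepest dependency if every layer exhausts its attempts."""
    return attempts_per_layer ** depth


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


@dataclass
class RetryBudget:
    """Permits a retry only while retries stay within `ratio` of first attempts."""
    ratio: float = 0.10
    requests: int = 0
    retries: int = 0

    def record_request(self) -> None:
        # Call once per first attempt (not per retry).
        self.requests += 1

    def try_spend(self) -> bool:
        # Refuse the retry if it would push retries above the budgeted share.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def should_retry(method: str, status: int, now: float, deadline: float,
                 budget: RetryBudget, breaker_open: bool) -> bool:
    """Apply the cheap refusals first; spend budget only if all of them pass."""
    if breaker_open:                       # open breaker: zero-retry recovery window
        return False
    if method not in IDEMPOTENT_METHODS:   # never replay a non-idempotent call
        return False
    if status not in RETRYABLE_STATUS:     # 400 or 409 will not succeed on retry
        return False
    if now >= deadline:                    # propagated deadline already expired
        return False
    return budget.try_spend()


if __name__ == "__main__":
    print(worst_case_calls(3, 5))          # 3 attempts across 5 layers
    budget = RetryBudget()
    for _ in range(100):
        budget.record_request()
    print(should_retry("GET", 503, now=1.0, deadline=2.0,
                       budget=budget, breaker_open=False))
```

The budget check comes last on purpose: a request rejected for being non-idempotent, non-retryable, or past its deadline should not consume retry capacity that a legitimate retry could have used.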