Distributed Systems DIST · 07 · 07

Retry amplification: multiple-choice review

Multiple-choice synthesis across the retry-amplification unit — fan-out math, metastable loops, jitter, retry budgets, and circuit breakers under load.

Six questions that cut across the whole unit. Each one mirrors a call you make mid-incident — not a definition to recite, but a fan-out number to compute and a fix to rank while the backend is on fire.

Goal

Confirm you can connect retry fan-out math, the metastable sustaining loop, and the defense ladder — jitter, retry budgets, circuit breakers, idempotency, deadline propagation — the synthesis the unit built toward.

Quiz

A request crosses 5 service layers and every layer makes up to 3 attempts on failure. In the worst case, how many calls hit the deepest dependency for one failing request, and why does this ambush people?
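The arithmetic behind this question can be sketched in a few lines. This is a minimal illustration that counts 3 attempts per layer, the reading under which the unit's 3⁵ = 243 figure holds; if each layer instead made 3 retries on top of the original call, the base would be 4. The function name `worst_case_calls` is hypothetical.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Worst-case calls reaching the deepest dependency for one request.

    Each layer multiplies the calls arriving from the layer above by its own
    attempt count, so the total is attempts_per_layer ** layers.
    """
    return attempts_per_layer ** layers


if __name__ == "__main__":
    # Show how the worst case grows as layers are added.
    for depth in range(1, 6):
        print(f"{depth} layer(s): {worst_case_calls(3, depth)} calls")
```

The growth is what ambushes people: each layer's policy looks modest in isolation, and the product only appears when you follow one request all the way down.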

Quiz

The 8-second database failover that triggered the incident finished 30 minutes ago, but the fleet is still down under a retry storm. What is actually keeping it down?

Quiz

A team adds fixed-interval retries (wait 200 ms, retry, wait 200 ms, retry) at every layer to 'be gentle.' During the next dependency blip the outage is worse than before. Why?
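To see the synchronization problem concretely, here is a small sketch that compares the fixed 200 ms policy with exponential backoff plus full jitter. The function names, the 10-second cap, and the millisecond rounding are assumptions made for illustration, not values from the unit.

```python
import random
from typing import Optional


def fixed_delay(attempt: int, interval: float = 0.2) -> float:
    """Fixed-interval policy from the question: always wait `interval` seconds."""
    return interval


def full_jitter_delay(attempt: int, base: float = 0.2, cap: float = 10.0,
                      rng: Optional[random.Random] = None) -> float:
    """Exponential backoff with full jitter.

    The ceiling doubles with each attempt (bounded by `cap`), and the actual
    wait is drawn uniformly from [0, ceiling].
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def retry_instants(delay_fn, clients: int, attempt: int) -> set:
    """Distinct millisecond-rounded instants at which `clients` that all
    failed at t=0 would send their next attempt."""
    return {round(delay_fn(attempt) * 1000) for _ in range(clients)}


if __name__ == "__main__":
    print("fixed:", len(retry_instants(fixed_delay, 1000, 1)), "distinct instant(s)")
    print("jitter:", len(retry_instants(full_jitter_delay, 1000, 1)), "distinct instants")
```

With the fixed policy, every client that failed together retries at the same instant, so the dependency sees the same spike again every 200 ms. Jitter spreads those retries across the whole backoff window.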

Quiz

You have already shipped exponential backoff with full jitter at every layer, yet a full dependency outage still drives the backend to roughly 80x its normal load. What did jitter not buy you, and what do you add?
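One way to picture a retry budget is as a token bucket: each first attempt earns a fraction of a retry, and each retry spends a whole one. The sketch below assumes the ~10% figure from the recap; the class `RetryBudget`, its parameters, and the `simulate_outage` helper are hypothetical names, and integer units are used only to avoid floating-point drift.

```python
from typing import Optional


class RetryBudget:
    """Allow retries to add at most `retry_percent`% on top of first attempts."""

    COST = 100  # units that one retry costs

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 10):
        self.retry_percent = retry_percent
        self.cap = max_banked_retries * self.COST  # limit on saved-up credit
        self.units = 0

    def record_first_attempt(self) -> None:
        # Each first attempt earns retry_percent/100 of a retry.
        self.units = min(self.cap, self.units + self.retry_percent)

    def try_acquire_retry(self) -> bool:
        if self.units >= self.COST:
            self.units -= self.COST
            return True
        return False


def simulate_outage(requests: int, retries_per_request: int,
                    budget: Optional[RetryBudget]) -> int:
    """Total calls sent downstream when every call fails."""
    sent = 0
    for _ in range(requests):
        sent += 1  # the first attempt always goes out
        if budget is not None:
            budget.record_first_attempt()
        for _ in range(retries_per_request):
            if budget is not None and not budget.try_acquire_retry():
                break  # budget exhausted: fail fast instead of retrying
            sent += 1
    return sent


if __name__ == "__main__":
    print("no budget:", simulate_outage(1000, 2, None))
    print("10% budget:", simulate_outage(1000, 2, RetryBudget()))
```

In this toy model jitter would change when the calls arrive but not how many there are; the budget is what bounds the total.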

Quiz

A circuit breaker opens after the failure rate to a dependency crosses its threshold. What does this accomplish that a retry budget alone does not?
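A failure-rate breaker can be sketched as a small state machine. This is a simplified illustration, not a production design: the class name, window size, thresholds, and 30-second open period are all assumptions, and a real breaker would also limit how many probes run at once in the half-open state.

```python
import time
from collections import deque
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_rate_threshold: float = 0.5, window: int = 20,
                 min_calls: int = 10, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = failure_rate_threshold
        self.min_calls = min_calls
        self.open_seconds = open_seconds
        self.clock = clock
        self.outcomes = deque(maxlen=window)  # True means the call failed
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        """Closed: allow. Open: reject. After open_seconds: allow a probe."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.open_seconds

    def record(self, failed: bool) -> None:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                return  # late result from a call sent before opening; ignore
            if failed:
                self.opened_at = self.clock()  # probe failed: stay open longer
            else:
                self.opened_at = None  # probe succeeded: close and start fresh
                self.outcomes.clear()
            return
        self.outcomes.append(failed)
        if len(self.outcomes) >= self.min_calls:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate >= self.threshold:
                self.opened_at = self.clock()
```

While the breaker is open, callers send nothing at all, neither first attempts nor retries, which is the recovery window that a budget on its own does not provide: a budget still lets every first attempt through.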

Quiz

Reviewing a service's retry config, you find it retries every failure up to 3 times — including 400s, 409 conflicts, and non-idempotent POSTs — with no deadline propagation. Which two rules is it violating, and why do they matter most under load?
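The two rules can be expressed as a single guard in front of every retry. The sketch below is illustrative: the set of retryable status codes is an assumption that varies by API, the idempotency-key flag is a hypothetical way to mark a POST as safe to repeat, and the deadline is assumed to be an absolute time on the same clock as `now`.

```python
import time
from typing import Callable

# Assumption: statuses treated as transient. 4xx client errors such as 400
# and 409 are excluded because repeating the same request cannot fix them.
RETRYABLE_STATUS = {429, 502, 503, 504}

# HTTP methods defined as idempotent by the HTTP semantics specification.
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"}


def remaining_budget(deadline: float,
                     now: Callable[[], float] = time.monotonic) -> float:
    """Seconds left before the caller gives up; pass this value downstream."""
    return max(0.0, deadline - now())


def should_retry(method: str, status: int, deadline: float, next_delay: float,
                 has_idempotency_key: bool = False,
                 now: Callable[[], float] = time.monotonic) -> bool:
    # Rule 1: only retry operations that are safe to repeat...
    if method.upper() not in IDEMPOTENT_METHODS and not has_idempotency_key:
        return False
    # ...and only errors that a retry could plausibly fix.
    if status not in RETRYABLE_STATUS:
        return False
    # Rule 2: never retry a request whose deadline will pass before the
    # retry could even be sent.
    return remaining_budget(deadline, now) > next_delay
```

Under load these checks matter most because every retry that cannot succeed, or whose caller has already given up, consumes backend capacity that live requests need.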

Recap

The through-line is one chain of reasoning. Retries compose by multiplication: worst-case calls at the deepest dependency are attempts^depth, so 3 attempts at each of 5 layers gives 3⁵ = 243, and a tiny error rate becomes a self-inflicted DDoS. The storm then sustains itself as a metastable failure long after the trigger clears. The defenses layer from broadest to finest: jitter de-synchronizes the herd in time, a ~10% retry budget caps the volume, and a circuit breaker grants a zero-retry recovery window. Two rules are non-negotiable: retry only idempotent operations and retryable errors, and propagate the deadline so no layer ever retries a request that is already dead. When a post-incident timeline shows the outage running 30 minutes past the trigger, you can name the sustaining loop and reach for the right tool immediately.
