Distributed Systems
Retry amplification: break a metastable storm
Reading about retry storms is not the same as pulling a service out of one. Build a short layered call chain, inject a brief dependency outage, and watch a temporary blip turn into a self-sustaining outage that does not heal when you remove the trigger. Then apply the defense ladder until the storm cannot start — with evidence at every step.
Turn the unit’s mental model into a reproducible loop: reproduce a metastable retry storm, measure the fan-out, then bound it with jitter, a retry budget, and a circuit breaker, and prove with before/after numbers that the system now recovers on its own when the trigger clears.
Build a 3-to-4-layer service chain that collapses into a metastable retry storm under a brief dependency outage, then add jitter, a retry budget, and a circuit breaker so the same outage stays a blip — proving each step with measurements, not assertions.
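As a starting point for the jitter and retry-budget steps, here is a minimal sketch. It assumes a cumulative 10% budget counted with integers and full-jitter exponential backoff. The names `RetryBudget`, `full_jitter_delay`, and `simulate_outage` are hypothetical helpers for illustration, not part of any library, and a production budget would normally be computed over a sliding time window rather than over all time.

```python
import random
from typing import Optional


class RetryBudget:
    """Cumulative retry budget: retries may add at most `percent` % extra calls.

    Assumption: counted over the whole run; real systems usually use a window.
    """

    def __init__(self, percent: int = 10) -> None:
        self.percent = percent
        self.originals = 0  # first attempts seen so far
        self.retries = 0    # retries granted so far

    def record_original(self) -> None:
        """Call once for every original (non-retry) request."""
        self.originals += 1

    def try_retry(self) -> bool:
        """Grant a retry only if it keeps retries within percent% of originals."""
        if (self.retries + 1) * 100 <= self.percent * self.originals:
            self.retries += 1
            return True
        return False


def full_jitter_delay(attempt: int, base: float = 0.05, cap: float = 2.0,
                      rng: Optional[random.Random] = None) -> float:
    """Full jitter: sleep a uniform time in [0, min(cap, base * 2**attempt)]."""
    source = rng or random
    return source.uniform(0.0, min(cap, base * 2 ** attempt))


def simulate_outage(requests: int, max_retries: int,
                    budget: Optional[RetryBudget]) -> int:
    """Count dependency calls when every call fails (a full outage)."""
    calls = 0
    for _ in range(requests):
        if budget is not None:
            budget.record_original()
        calls += 1  # the original attempt, which fails
        for _ in range(max_retries):
            if budget is not None and not budget.try_retry():
                break  # budget exhausted: fail fast instead of retrying
            calls += 1
    return calls


if __name__ == "__main__":
    print("no budget:", simulate_outage(1000, 3, None))
    print("10% budget:", simulate_outage(1000, 3, RetryBudget(percent=10)))
```

In this sketch, 1,000 failing requests with 3 retries each produce 4,000 dependency calls with no budget and 1,100 with the 10% budget, which is the "within ~10% above baseline" property the lab asks you to measure on the real chain. The jitter function only spreads retries in time; it does not reduce their number, which is why the budget is a separate defense.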
- A before/after table: peak dependency call rate (as a multiple of baseline), retry-to-original ratio, time-to-recovery after the trigger clears, and p99 latency — all measured under the identical injected outage, not estimated.
- The baseline run demonstrably shows metastability: the dependency call rate stays elevated and requests keep failing after the 8-second outage ends, until you force load down.
- With all defenses on, peak dependency load during a full outage stays within ~10% above baseline (the retry budget holds), and the breaker is observed opening and then half-open-probing back to closed.
- A one-paragraph write-up explaining which defense bounded which property (jitter -> timing, budget -> volume, breaker -> recovery window, idempotency/deadline -> wasted work) and why backoff alone was not enough.
- Add a one-page on-call runbook: how to recognise a metastable retry storm from the four panels, why hunting the cleared trigger is the trap, and the load-shed / breaker / restart playbook to force recovery.
- Make the dependency call non-idempotent (e.g. it increments a counter), add an idempotency key, and show that retries no longer double-apply the effect under the storm.
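- Sweep retry count (1, 2, 3) and chain depth (2, 3, 4) and plot measured peak dependency load against the predicted retries^depth curve to confirm the geometric law empirically. Take the base as attempts per layer: if the retry count excludes the original call, the prediction is (1 + retries)^depth, since a base of 1 would predict no amplification at all.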
- Compare full jitter vs equal jitter vs no jitter under the same herd and reproduce the AWS result: the jittered variants finish in far fewer total calls and with lower time-to-recovery than backoff with no jitter.
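For the retry-count and depth sweep, the predicted curve can be tabulated before any measurement. The sketch below assumes the worst case (every attempt fails), that each retrying layer makes 1 original attempt plus `retries` retries, and that `depth` counts the retrying layers above the dependency. The function names are hypothetical; `simulate_chain` walks the retry tree explicitly as a cross-check on the closed form.

```python
def predicted_amplification(retries: int, depth: int) -> int:
    """Worst-case dependency calls per top-level request.

    Assumes each of `depth` layers makes (1 + retries) attempts and all fail.
    """
    return (1 + retries) ** depth


def simulate_chain(retries: int, depth: int) -> int:
    """Count dependency calls by walking the retry tree layer by layer."""
    if depth == 0:
        return 1  # one call lands on the failing dependency
    # Each attempt at this layer triggers the full subtree below it.
    return sum(simulate_chain(retries, depth - 1) for _ in range(1 + retries))


if __name__ == "__main__":
    print("retries depth predicted simulated")
    for retries in (1, 2, 3):
        for depth in (2, 3, 4):
            print(retries, depth,
                  predicted_amplification(retries, depth),
                  simulate_chain(retries, depth))
```

Measured peaks on the real chain will usually sit below these numbers, because timeouts, deadlines, and partial successes cut branches of the tree short. The prediction is an upper bound to plot against, not a value the lab is expected to hit exactly.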
This is the loop you run in every real retry incident: reproduce the storm and confirm it is metastable (it outlives the trigger), measure the fan-out against retries^depth, then bound each property with the right tool — jitter for timing, a ~10% retry budget for volume, a circuit breaker for the recovery window, idempotency and deadline propagation for wasted work — and verify with before/after numbers under the identical outage. Doing it once on a toy chain makes the production version muscle memory: you will recognise the signature instantly and reach for load reduction, not the cleared trigger.
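The breaker behaviour named above (open, then half-open probe, then closed) can be sketched as a small state machine. This is an illustrative sketch, not a specific library's breaker: the class name, the consecutive-failure threshold, and the single-probe half-open rule are all assumptions, and the clock is injectable so the transitions can be tested without sleeping.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Minimal breaker: closed -> open -> half_open -> closed (or back to open)."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures to trip
        self.open_seconds = open_seconds            # how long to shed load
        self.clock = clock
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may be sent to the dependency right now."""
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = self.HALF_OPEN  # let exactly one probe through
                return True
            return False
        if self.state == self.HALF_OPEN:
            return False  # a probe is already in flight
        return True

    def record_success(self) -> None:
        """A success (including a successful probe) closes the breaker."""
        self.failures = 0
        self.state = self.CLOSED

    def record_failure(self) -> None:
        """A failed probe reopens at once; otherwise count toward the threshold."""
        if self.state == self.HALF_OPEN:
            self._trip()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = self.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

A caller would check `allow()` before each dependency call and report the outcome with `record_success()` or `record_failure()`. While the breaker is open, the dependency sees no traffic from this caller, which is what gives it the recovery window; logging each state change gives you the "observed opening and then half-open-probing back to closed" evidence.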