
Backend Architecture

At scale: per-instance state, retry storms, and coordinated shedding

Crux: Everything so far assumed one process. Across many instances a breaker becomes a per-instance guess with no shared view, retries multiply into call amplification, half-open probes herd, and the only stable shedding is coordinated. Resilience is a fleet property.

Every lesson so far quietly assumed one process: one breaker, one thread pool, one counter. Production is fifty instances behind a load balancer, and the assumptions break in ways that are not obvious. Each instance has its own breaker with its own count, so instance A can be tripped while instance B still hammers the dying dependency; there is no shared verdict. Worse, you almost never have just a breaker; you have retries under it, and retries stack multiplicatively through a call chain. Google’s SRE book has the canonical horror: three layers of callers, each making up to four attempts (three retries), turn one user action into 4 × 4 × 4 = 64 attempts on the dependency at the bottom, a request amplifier that converts a small downstream wobble into a self-sustaining retry storm. Resilience at scale is not the single-process state machine of the last five lessons; it is a fleet property, and the fleet has failure modes a single breaker never sees.

Per-instance state: fifty breakers, no shared verdict

A circuit breaker’s state lives in the process that owns it. With fifty instances you have fifty independent breakers, each counting only the calls it made. This has direct consequences. Instance A might see enough failures to trip while instance B, having sampled a luckier slice of traffic, stays closed and keeps calling the sick dependency. There is no shared verdict — the dependency does not get “a breaker,” it gets fifty opinions. Right after a deploy or a scale-up, a fresh instance starts with an empty window and a closed breaker, so it must rediscover the outage from scratch by failing real calls, even though forty-nine of its siblings already know.
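To make the "fifty opinions" point concrete, here is a minimal sketch of a breaker whose count lives only in its own process. The class name `LocalBreaker` and the parameters `window`, `failure_threshold`, and `min_calls` are assumptions chosen for illustration, not taken from any particular library, and the sketch leaves out cooldown and half-open handling entirely.

```python
from collections import deque


class LocalBreaker:
    """Count-based breaker whose state lives only in this process."""

    def __init__(self, window=10, failure_threshold=0.5, min_calls=5):
        # Outcomes of the most recent calls made by THIS instance (True = failed).
        self.results = deque(maxlen=window)
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls

    def record(self, failed):
        self.results.append(failed)

    @property
    def is_open(self):
        if len(self.results) < self.min_calls:
            return False  # too little local evidence to judge
        return sum(self.results) / len(self.results) >= self.failure_threshold


# Three instances of the same service, each with its own breaker.
instance_a, instance_b, fresh = LocalBreaker(), LocalBreaker(), LocalBreaker()

for _ in range(10):
    instance_a.record(failed=True)          # unlucky slice: every call failed
for i in range(10):
    instance_b.record(failed=(i % 4 == 0))  # luckier slice: 3 of 10 failed

verdicts = {"a": instance_a.is_open, "b": instance_b.is_open, "fresh": fresh.is_open}
```

Under these assumed thresholds, instance A is open, instance B is still closed at a 30% failure rate, and the fresh instance is closed simply because its window is empty. Nothing in one breaker can see what the others have learned.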

You can centralize breaker state in a shared store (Redis, a coordination service) so the fleet shares one count — but now the breaker check is a network call on the hot path, the shared store is a new dependency that can itself fail, and you have traded local-but-stale for shared-but-coupled. Most production systems keep breaker state local on purpose and accept the fuzziness, because a breaker is a statistical safety device, not a precise switch: fifty instances each sampling the dependency will, in aggregate, converge on tripping during a real outage. Local state is the pragmatic default; shared state is a deliberate, costly choice.

Retry amplification: the multiplier under the breaker

The dangerous interaction is not the breaker; it is the retry layered beneath it. Retries multiply along a call chain. If a client calls service A, A calls B, and B calls C, and each caller makes up to four attempts (three retries), one user action can become 4 × 4 × 4 = 64 requests at C. When C is already struggling, this is the worst possible response: the layers that should reduce load instead multiply it, and every layer’s retries feed the layer below. This is a retry storm (or retry amplification), and it is how a brief downstream blip becomes a sustained, self-inflicted overload that outlives the original fault.
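The multiplication is easy to check directly. The sketch below assumes the worst case, in which every attempt at every layer fails and triggers the full set of attempts beneath it; the function name `attempts_at_bottom` is hypothetical.

```python
import math


def attempts_at_bottom(attempts_per_layer):
    """Worst-case calls reaching the bottom dependency for one inbound request.

    attempts_per_layer lists, top to bottom, how many attempts each calling
    layer makes against the layer below it (1 means no retries).
    """
    return math.prod(attempts_per_layer)


stacked = attempts_at_bottom([4, 4, 4])       # every layer retries
single_layer = attempts_at_bottom([4, 1, 1])  # only the top layer retries
```

With three retrying layers the bottom sees 64 attempts; with retries confined to one layer it sees 4. The amplification is a product over layers, which is why removing retries from even one layer cuts the worst case by a whole factor.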

The senior fixes are layered and specific:

  • Retry at one level, not every level. The SRE guidance is to retry close to the failure or at the top, but not at every layer simultaneously — stacked retries are what produce the 64×. Pick a layer; make the others fail fast.
  • Cap the absolute retry rate, not just the per-call count. Google’s pattern is a per-process budget, for example “60 retries per minute” as a server-wide cap, so retries cannot scale with traffic into a storm; once the budget is spent, the request fails instead of retrying.
  • Exponential backoff with jitter. Fixed-interval retries from many clients re-synchronize into waves; the Amazon Builders’ Library shows that adding randomized jitter to exponential backoff spreads the retries out and is what actually breaks the wave. Backoff alone is not enough: without jitter, every client backs off by the same amount and they all return together.
  • Breakers sit above retries. The breaker is the circuit-level off-switch: once it is open, the retries beneath it never fire at all, because the whole call short-circuits. A breaker is what makes a retry policy safe — it bounds the storm.
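Two of the fixes above, the per-process retry budget and backoff with jitter, can be sketched together. This is a minimal illustration under stated assumptions: `RetryBudget`, `backoff_with_jitter`, and `call_with_retries` are hypothetical names, the rolling-window accounting is one possible reading of "60 retries per minute", and the `clock` and `sleep` parameters exist only so the behaviour can be exercised without real waiting.

```python
import random
import time
from collections import deque


class RetryBudget:
    """Per-process cap on retries: at most max_retries in any rolling window."""

    def __init__(self, max_retries=60, window_seconds=60.0, clock=time.monotonic):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self.clock = clock
        self.spent = deque()  # timestamps of retries already granted

    def try_spend(self):
        now = self.clock()
        while self.spent and now - self.spent[0] >= self.window_seconds:
            self.spent.popleft()  # forget retries that have left the window
        if len(self.spent) >= self.max_retries:
            return False  # budget exhausted: fail the request, do not retry
        self.spent.append(now)
        return True


def backoff_with_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, budget, max_attempts=4, sleep=time.sleep):
    """Try operation up to max_attempts times, spending budget for each retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # out of attempts or out of budget: fail fast
            sleep(backoff_with_jitter(attempt))
```

The budget is shared by every call in the process, so the total retry rate stays flat however much traffic arrives. In a full design, a breaker would wrap `call_with_retries` from the outside, so that an open circuit skips the whole loop.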

Herding: half-open and recovery are fleet events

The half-open state from lesson two has a distributed failure mode. When fifty instances tripped at roughly the same moment, their cooldown timers also expire at roughly the same moment, so all fifty enter half-open and fire their probe calls together — a synchronized thundering herd onto a dependency that has just barely come back. The fix is the same primitive as retry backoff: jitter the probe timing so the fleet’s probes spread across a window instead of landing in one spike. AWS’s token-bucket retry throttling is the related idea on the retry side — each client maintains a token bucket that only permits retries while tokens remain, so a fleet-wide failure cannot translate into a fleet-wide retry burst.
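Both ideas in this section fit in a few lines. The sketch below is illustrative only: `jittered_cooldown` and `RetryTokenBucket` are hypothetical names, and the capacity, cost, and refill numbers are assumptions, not the AWS SDK's actual configuration.

```python
import random


def jittered_cooldown(base_seconds, spread=0.5, rng=random):
    """Cooldown before half-open: base plus a random extra of up to spread * base."""
    return base_seconds + rng.uniform(0.0, spread * base_seconds)


class RetryTokenBucket:
    """Client-side retry throttle: a retry costs tokens, a success earns some back."""

    def __init__(self, capacity=100.0, retry_cost=5.0, success_refill=1.0):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.success_refill = success_refill

    def allow_retry(self):
        if self.tokens < self.retry_cost:
            return False  # bucket drained: fail instead of retrying
        self.tokens -= self.retry_cost
        return True

    def record_success(self):
        self.tokens = min(self.capacity, self.tokens + self.success_refill)


rng = random.Random(7)  # seeded only to make the sketch repeatable
# Fifty breakers that tripped together now probe at fifty different moments.
probe_times = sorted(jittered_cooldown(30.0, rng=rng) for _ in range(50))
```

With an assumed 30-second base and a spread of 0.5, the fifty probes land somewhere between 30 and 45 seconds after the trip instead of all at second 30. The token bucket does the equivalent job for retries: during a sustained failure it drains and stays drained, because only successes refill it.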

Why this works

Why does jitter matter so much that both retries and half-open probes need it — isn’t exponential backoff already spreading the load out over time? Backoff spreads each individual client’s attempts across increasing intervals, but it does nothing about correlation between clients. Picture the failure: a dependency hiccups and, in the same few milliseconds, a thousand clients all get an error and all start the identical backoff schedule — wait 1 s, then 2 s, then 4 s. Because they started together and the schedule is deterministic, they also retry together: a thundering wave at t=1 s, another at t=3 s, another at t=7 s. The dependency, trying to recover, is hit by synchronized spikes that look exactly like the original overload, so it never gets a quiet moment to drain its backlog. Each client is behaving perfectly; the fleet is pathological. Jitter breaks the correlation: instead of waiting exactly 1 s, each client waits a random duration up to 1 s, so the same thousand retries smear across the whole interval as steady pressure the dependency can actually absorb. The same logic applies to half-open probes — fifty breakers that tripped together will, without jitter, un-trip together and herd. The deep lesson is that in a distributed system the enemy is rarely any single client’s behavior; it is synchronization across clients, and the cure is almost always to deliberately inject randomness to desynchronize them. This is why every mature retry and breaker implementation has jitter baked in, not bolted on.
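The thousand-client picture can be simulated in a few lines. This is a rough sketch, assuming all clients fail at the same instant, follow the 1 s, 2 s, 4 s schedule from the text, and (in the jittered case) draw each wait uniformly between zero and the scheduled value; `peak_arrivals` is a hypothetical helper that reports the busiest 100 ms bucket.

```python
import random
from collections import Counter


def peak_arrivals(clients, jitter, rng, bucket_seconds=0.1, schedule=(1.0, 2.0, 4.0)):
    """Largest number of retries landing in any single time bucket."""
    arrivals = Counter()
    for _ in range(clients):
        t = 0.0  # every client saw the error at the same instant
        for wait in schedule:
            # Without jitter the wait is exact; with jitter it is random up to wait.
            t += rng.uniform(0.0, wait) if jitter else wait
            arrivals[int(t / bucket_seconds)] += 1
    return max(arrivals.values())


rng = random.Random(42)  # seeded only so the sketch is repeatable
peak_without_jitter = peak_arrivals(1000, jitter=False, rng=rng)
peak_with_jitter = peak_arrivals(1000, jitter=True, rng=rng)
```

Without jitter every client lands in the same bucket at each step, so the peak equals the full client count. With jitter the same number of retries is spread across the interval and the busiest bucket holds only a fraction of them.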

Coordinated shedding: the floor must agree

Load shedding from the last lesson also changes shape across a fleet. If each instance sheds independently based on its own load, a load balancer routing unevenly can have some instances shedding hard while others sit idle — the fleet sheds the wrong requests. Worse, an instance that sheds a request does not make the work disappear; if the client retries, the shed request lands on another instance, so uncoordinated shedding can just shuffle load around the fleet instead of reducing it. The stable version is shedding that the fleet agrees on: shed by priority (drop low-priority traffic everywhere first, consistently), shed deterministically on a signal every instance can compute the same way, and pair it with the deadline-aware queuing from the last lesson so a request shed at the front door is not retried into the back. The principle that ties the whole lesson together: in a distributed system, every per-instance safety device — breaker, retry, shed — needs a fleet-level story, or the instances’ locally-correct decisions compose into a globally-wrong outcome.
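One way to shed "on a signal every instance can compute the same way" is to derive the decision from the request itself rather than from local load. The sketch below assumes each request carries a stable ID and a priority label, and that a fleet-wide policy mapping priority to shed fraction is distributed to every instance by some means not shown here; `shed_score`, `should_shed`, and the policy values are all hypothetical.

```python
import hashlib


def shed_score(request_id):
    """Stable score in [0, 1) that every instance derives identically from the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 10_000


def should_shed(request_id, priority, shed_fraction_by_priority):
    """Shed if the request's score falls inside the fraction shed for its priority."""
    return shed_score(request_id) < shed_fraction_by_priority.get(priority, 0.0)


# Hypothetical fleet-wide policy: drop all low-priority traffic,
# half of normal traffic, and no critical traffic.
policy = {"low": 1.0, "normal": 0.5, "critical": 0.0}

first_instance = should_shed("req-123", "normal", policy)
second_instance = should_shed("req-123", "normal", policy)  # a retry landing elsewhere
```

Because the verdict depends only on the request ID, its priority, and the shared policy, a shed request that is retried onto another instance gets the same answer there, so shedding reduces load instead of moving it. A cryptographic hash is used rather than Python's built-in `hash()`, which is salted per process for strings and would give each instance a different score.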

| Failure mode | Single process | Across the fleet | The fix |
|---|---|---|---|
| Breaker state | One count, one verdict | 50 counts, no shared view | Accept local fuzziness (or pay for shared state) |
| Retries | Bounded per call | Multiply 4×4×4 = 64 down the chain | Retry one layer; cap rate (60/min); breaker above |
| Recovery probe | One half-open trickle | 50 breakers herd at once | Jitter the probe timing |
| Backoff | Spreads one client | Clients re-sync into waves | Exponential backoff with jitter |
| Load shedding | Shed local overload | Shuffles load between instances | Coordinated, priority-based, deadline-aware |

Quiz

Service A calls B calls C, each retrying up to 4 times on failure. C starts to struggle. Why does this make C's situation dramatically worse rather than better?

Quiz

Fifty instances trip their breakers at nearly the same moment during an outage. Why is jitter on the half-open probe (and on retry backoff) essential and not just nice-to-have?

Order the steps

Order how a small downstream wobble becomes a fleet-wide retry storm without the right guards:

  1. A downstream dependency briefly slows under normal load
  2. Each layer's retries fire, multiplying one request into many (4×4×4 = 64)
  3. The amplified load pins the dependency, turning a wobble into a sustained outage
  4. Clients that failed together retry together in synchronized waves, sustaining the storm

Recall before you leave
  1. Why is per-instance breaker state both a limitation and the pragmatic default, and what is retry amplification?
  2. What are the senior fixes for retry storms and herding, and why is jitter essential?

Recap

The first five lessons built a breaker for one process; this one scales it to a fleet, where the single-process intuitions quietly fail. Breaker state is per-instance — fifty instances hold fifty independent counts with no shared verdict, so one trips while another keeps calling and a new instance rediscovers the outage alone; local state is the pragmatic default because a breaker is statistical, and shared state buys a shared view at the cost of a hot-path network call and a new dependency. The real fleet hazard lives beneath the breaker in the retry layer, which multiplies along a chain into Google’s 4×4×4 = 64 — a retry storm tamed only by retrying at a single layer, capping the absolute rate at something like 60 per minute, using exponential backoff with jitter, and keeping the breaker above the retries so an open circuit short-circuits the stack. Recovery itself herds: breakers that tripped together un-trip together, so half-open probes need the same jitter retries do, because the distributed enemy is synchronization across clients rather than any single client’s behavior. And load shedding only works coordinated, priority-based, and deadline-aware, or instances just shuffle load between themselves. The unit’s arc is complete — a slow dependency, a state machine, thresholds, bulkheads, fallbacks, and now the fleet — and the next unit turns from keeping a service alive under load to bringing it down cleanly: graceful shutdown.

