Retry storms, circuit breakers, and load shedding

Automatic retries can amplify load by up to 2^K× across K microservice layers; circuit breakers stop the cascade with fast 503s; retry budgets and exponential backoff with jitter are the minimum production mitigations.

A backend has a 500 ms GC pause. One hundred clients time out. Each retries. Now 200 requests are queued on a backend that is still paused. The retries from the retries pile on. Within seconds, a 500 ms blip has become a full cluster outage. This is the retry storm, and automatic retries caused it.

The retry-amplification problem

A backend blips (GC pause, thread pool exhaustion, momentary overload). Requests time out. Clients and the LB retry. The retry adds load to an already-impaired backend. The backend falls further behind. More retries accumulate.

Amplification math. With N=100 clients and 1 retry each: 100 original requests + 100 retries = 200 requests on a backend that could not handle 100. With 2 retries each: 300. With K=10 microservice layers each doing 1 retry on error, worst-case amplification approaches 2^10 = 1024× the original load.

One retry per layer is not additive — it compounds. Each layer doubles the worst-case load, so amplification grows as 2^K: by 10 layers a single blip can become 1024x its original load. This is why retries must be budgeted, not left per-layer.
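The compounding can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every layer retries the same number of times and every attempt fails (the worst case); the function name `worst_case_amplification` is illustrative, not taken from any library.

```python
def worst_case_amplification(layers: int, retries_per_layer: int = 1) -> int:
    """Worst-case request multiplier seen by the deepest service.

    Each layer makes (1 + retries_per_layer) attempts for every request it
    receives, so the per-layer multipliers compound across the stack.
    """
    return (1 + retries_per_layer) ** layers


for k in (1, 3, 5, 10):
    print(f"{k:>2} layers, 1 retry each: {worst_case_amplification(k)}x")
```

With 10 layers and 1 retry each this gives 1024, matching the figure above; raising the retry count to 2 per layer changes the base from 2 to 3.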

This is not a theoretical concern. It is the most common cause of sustained cluster-wide outages in microservice architectures.

Why retries make it worse, not better. Retries are correct for transient failures (momentary packet loss, brief DNS blip). They are catastrophic for overload failures: retrying into an overloaded backend adds the exact load that caused the overload. The backend cannot drain its queue because retries keep refilling it.

Circuit breakers

A circuit breaker sits at the LB or service client level and tracks failure rates for each backend. Without one, you have no mechanism to stop the cascade — your only option is to watch the cluster drown. With one, a single configuration parameter turns a 30-second timeout avalanche into a fast 503 that clients can act on in milliseconds. Three states:

Closed (normal): Requests flow through. Failure count is monitored.
Open: After N consecutive failures (or once a failure-rate threshold is exceeded), the breaker opens. All new requests receive 503 Service Unavailable immediately, with no queuing and no waiting.
Half-open: After a cooldown period, one request is allowed through. If it succeeds, the breaker closes. If it fails, it reopens and the cooldown resets.

Why fast 503 helps. When the breaker is open, clients get 503 immediately instead of waiting for a timeout (30 s). They can route to a fallback, retry a different service, or shed the request. The overloaded backend gets no new load — it can drain its existing queue and recover.
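The three states can be sketched as a small state machine. This is an illustrative, single-threaded sketch, not how any particular proxy implements it: the class name `CircuitBreaker`, the `failure_threshold` and `cooldown_s` parameters, and the `CircuitOpenError` exception are all assumptions made for the example. A production breaker would also need locking so that only one probe runs in the half-open state.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the backend while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # N consecutive failures
        self.cooldown_s = cooldown_s
        self.clock = clock            # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                # Fail fast: the caller should treat this like a 503.
                raise CircuitOpenError("breaker open")
            self.state = "half_open"  # cooldown elapsed: let one probe through
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        # Success closes the breaker and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()  # cooldown restarts from now
```

A caller wraps each backend request in `breaker.call(...)` and maps `CircuitOpenError` to a fast 503 or a fallback, so the impaired backend receives no traffic during the cooldown.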

Envoy circuit breaking config:

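max_connections: max concurrent connections Envoy opens to the upstream cluster (e.g., 1 000). Once the limit is reached, no new connections are opened; requests that cannot get a connection wait in the pending queue.
max_pending_requests: max requests queued while waiting for a connection (e.g., 100). Excess: 503 immediately.
max_requests: max concurrent requests outstanding to the upstream cluster, the limit that matters for HTTP/2 (e.g., 1 000). Excess: 503 immediately.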

Retry storm amplification

100 clients × 1 retry: 200 requests on a drowning backend
100 clients × 2 retries: 300 requests
K=10 layers × 1 retry each: up to 1 024× amplification
Safe retry rate (production SLO): <0.1% of request rate
Circuit breaker open → client sees: 503 immediately (no timeout wait)
Jitter range (backoff base 1 s): 0–1 s random delay added

Retry budgets

Instead of unlimited retries, set a global retry budget: cap the total number of retries across all clients to ~10% of the total request rate.

Example: If the service handles 10 000 RPS, allow at most 1 000 RPS of retries. Any retry beyond this returns 503 immediately (fail fast). This prevents the amplification from exceeding a bounded factor regardless of how many clients are retrying simultaneously.
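One way to enforce such a budget is a pair of counters per time window. The sketch below is a simplified illustration under that assumption; the `RetryBudget` class and its method names are hypothetical, and real implementations usually use a sliding window or token bucket rather than a hard reset.

```python
class RetryBudget:
    """Allow retries only while they stay under `ratio` of observed requests.

    Counts are per window: call reset() at each window boundary (for example
    once per second) so the budget tracks the current request rate.
    """

    def __init__(self, ratio: float = 0.10):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        """Return True if a retry may be sent; False means fail fast (503)."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True

    def reset(self) -> None:
        self.requests = 0
        self.retries = 0
```

With a ratio of 0.10 and 10 000 requests recorded in a window, the first 1 000 retry attempts are granted and every later one is refused, which corresponds to the 10 000 RPS example above.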

Exponential backoff with jitter

Clients must not retry immediately — that synchronizes all retries and creates a thundering herd. Two components:

Exponential backoff: Retry delay doubles on each attempt: 1 s → 2 s → 4 s → 8 s → … up to a maximum (e.g., 32 s).
Jitter: Add a random value random(0, base) to each delay. This desynchronizes clients so they retry at staggered times.

delay = min(base × 2^(attempt − 1) + random(0, base), max_delay), with attempt counted from 1

Example with base=1 s, max=32 s:

Attempt 1: 1 s + rand(0, 1 s).
Attempt 2: 2 s + rand(0, 1 s).
Attempt 3: 4 s + rand(0, 1 s).

Without jitter, all 100 clients that timed out at t=30 s retry at t=31 s, creating another burst. With jitter, their first retries are spread across t=31–32 s.
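The delay calculation can be sketched as a small function. This is a minimal sketch that assumes attempts are counted from 1, so the first retry waits about `base` seconds in line with the 1 s → 2 s → 4 s progression, and that jitter is drawn uniformly from 0 to `base`. The function name `backoff_delay` and the injectable `rng` parameter are illustrative choices.

```python
import random


def backoff_delay(attempt, base=1.0, max_delay=32.0, rng=random):
    """Seconds to wait before retry number `attempt` (counted from 1)."""
    exponential = base * 2 ** (attempt - 1)   # 1 s, 2 s, 4 s, 8 s, ...
    jitter = rng.uniform(0, base)             # random(0, base)
    return min(exponential + jitter, max_delay)


for attempt in range(1, 8):
    print(f"attempt {attempt}: wait {backoff_delay(attempt):.2f} s")
```

Because the jitter term is capped at `base`, later attempts are desynchronized by at most one second. Some systems scale the jitter with the delay instead; that is a different trade-off from the formula described here.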

Load shedding

The LB and backend both enforce queue depth limits. When the queue exceeds a threshold, new requests are dropped immediately with 503 rather than accepted and queued.

Without load shedding:

Queue grows unbounded.
Latency of every queued request increases.
Memory bloats.
Eventually, everything times out simultaneously.

With load shedding:

Clients get fast 503 — they know to back off.
The backend queue stays bounded.
Requests that do enter the queue get served in finite time.
System drains and recovers instead of collapsing.
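A bounded queue that rejects on overflow captures the core of the idea. The sketch below uses Python's standard `queue.Queue`; the `SheddingQueue` wrapper and its `offer`/`take` methods are hypothetical names, and `max_depth` must be greater than zero because `queue.Queue` treats a size of 0 as unbounded.

```python
import queue


class SheddingQueue:
    """Bounded request queue that rejects new work instead of growing."""

    def __init__(self, max_depth: int):
        if max_depth <= 0:
            raise ValueError("max_depth must be positive")
        self._q = queue.Queue(maxsize=max_depth)
        self.shed = 0  # number of requests rejected so far

    def offer(self, request) -> bool:
        """Return True if queued; False means respond 503 right away."""
        try:
            self._q.put_nowait(request)
        except queue.Full:
            self.shed += 1
            return False
        return True

    def take(self):
        """Remove and return the oldest queued request."""
        return self._q.get_nowait()
```

The caller answers 503 whenever `offer` returns False, so a request is either served within a bounded queueing delay or rejected in constant time.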

Trace it

Trace a cascading failure: backend overload, retry storm, circuit breaker engagement.

Backend B1 experiences thread pool exhaustion. Requests queue. Active health check (HTTP GET) still succeeds (the endpoint is reachable). What do the LB and clients see?

With 100 clients each retrying once, load on B1 goes from 100 to how many requests? What happens to B1's queue?

B1 is ejected. Its 200 queued requests now redistribute to B2, B3, B4. What is the risk?

The LB has a circuit breaker: max 100 concurrent requests per backend. B2 hits it. What do new requests see?

B1's thread pool recovers after 2 minutes. How should it rejoin the pool?

Debug this

Envoy stats dump during a retry storm

cluster.api_backend.requests: 10000
cluster.api_backend.errors: 245
cluster.api_backend.retries: 189
cluster.api_backend_retry_limit: 1200
upstream_rq_retry.api_backend: 189
upstream_rq_retry_limit_exceeded: 45
upstream_rq_total.api_backend: 10000
upstream_cx: 156
health_checks.failed: 12
health_checks.success: 188
circuit_breaker.default.rq_open: 0
circuit_breaker.default.cx_open: 0
outlier_detection.ejected_count: 0
load_balancer.least_request.unbalanced_requests_delta: 234

Stats show 189 retries out of 10 000 requests (~1.9% retry rate) and 45 requests hit the retry limit. Is the cluster healthy, and what should the operator do?

Edge cases

Per-layer retry budget vs global. In a microservice stack, if each of 10 layers allows 1 retry, the worst-case amplification is 2^10. The correct architecture: allocate the total retry budget at the outermost layer (the API gateway or client) and propagate a retry-remaining header inward. Inner services do not retry at all unless the header grants them budget. This keeps amplification bounded at 2× regardless of stack depth.

The Open state is what stops a retry storm: failing requests get a fast 503 instead of piling onto a drowning backend, so the backend gets no new load and can drain its queue. Half-open tests recovery with a single request before reopening the floodgates.

Recall before you leave

01
Explain the retry amplification problem in a microservice architecture with K layers. Why is a 1% retry rate per layer catastrophic?
02
How does a circuit breaker stop a retry storm, and what are its three states?
03
Why must jitter be added to exponential backoff, and what does it prevent?

Recap

Retry storms are the most common cause of sustained microservice outages. When a backend blips, automatic retries add load to an already-impaired backend — amplified 2^K× across K service layers. Circuit breakers stop the cascade by returning 503 immediately after N failures, giving the backend space to drain and recover. The minimum production mitigations are: retry budgets capping retries at <10% of RPS; exponential backoff with jitter to desynchronize retry timing; and load shedding to drop new requests when queue depth exceeds a threshold. Alert on retry rates above 0.1% — that is the earliest warning sign before a storm escalates. Now when you see a retry rate spike in your Envoy dashboard, you will know: this is not a latency anomaly to ignore — it is the leading edge of an outage, and the circuit breaker is your first line of defense.
