Retry storms, circuit breakers, and load shedding

Crux: Automatic retries amplify load by up to 2^K× across K microservice layers; circuit breakers stop the cascade with fast 503s; retry budgets and exponential backoff with jitter are the minimum production mitigations.

A backend hits a 500 ms GC pause. One hundred clients time out, and each retries. Now 200 requests are queued on a backend that is still paused. Retries of retries pile on. Within seconds, a 500 ms blip has become a full cluster outage. This is the retry storm, and automatic retries caused it.

The retry-amplification problem

A backend blips (GC pause, thread pool exhaustion, momentary overload). Requests time out. Clients and the LB retry. The retry adds load to an already-impaired backend. The backend falls further behind. More retries accumulate.

Amplification math. With N=100 clients and 1 retry each: 100 original requests + 100 retries = 200 requests on a backend that could not handle 100. With 2 retries each: 300. With K=10 microservice layers each doing 1 retry on error, worst-case amplification approaches 2^10 = 1024× the original load.
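The arithmetic above can be reproduced with a short sketch. It assumes the worst case, in which every attempt fails and each layer issues (1 + retries) calls to the layer below for every call it receives; the function names are illustrative only.

```python
def direct_load(clients: int, retries: int) -> int:
    """Requests reaching one backend when every attempt by every client fails."""
    return clients * (1 + retries)


def worst_case_amplification(layers: int, retries_per_layer: int) -> int:
    """Each layer multiplies the calls it receives by (1 + retries_per_layer)."""
    return (1 + retries_per_layer) ** layers


print(direct_load(100, 1))              # 200
print(direct_load(100, 2))              # 300
print(worst_case_amplification(10, 1))  # 1024
```

The exponent is what makes deep call chains dangerous: the per-layer factor looks harmless, but it compounds at every hop.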

This is not a theoretical concern. It is the most common cause of sustained cluster-wide outages in microservice architectures.

Why retries make it worse, not better. Retries are correct for transient failures (momentary packet loss, brief DNS blip). They are catastrophic for overload failures: retrying into an overloaded backend adds the exact load that caused the overload. The backend cannot drain its queue because retries keep refilling it.

Circuit breakers

A circuit breaker sits at the LB or service client level and tracks failure rates for each backend. Three states:

  1. Closed (normal): Requests flow through. Failure count is monitored.
  2. Open: After N consecutive failures (or once a failure-rate threshold is exceeded), the breaker opens. All new requests receive 503 Service Unavailable immediately, with no queuing and no waiting.
  3. Half-open: After a cooldown period, one request is allowed through. If it succeeds, the breaker closes. If it fails, it reopens and the cooldown resets.

Why fast 503 helps. When the breaker is open, clients get 503 immediately instead of waiting for a timeout (e.g., 30 s). They can route to a fallback, retry against a different service, or shed the request. The overloaded backend gets no new load, so it can drain its existing queue and recover.
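The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration, not how any particular load balancer implements it; the class name, the consecutive-failure threshold, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling the backend; a caller would map this to a 503."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0                        # consecutive failures seen
        self.opened_at: Optional[float] = None   # None means the breaker is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], object]) -> object:
        state = self.state
        if state == "open":
            # Fail fast: the backend is never touched while the breaker is open.
            raise CircuitOpenError("circuit open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()    # open (or reopen) and restart cooldown
            raise
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Because the sketch is single-threaded, the half-open state naturally admits one probe at a time; a concurrent version would need a lock or an explicit probe-in-flight flag.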

Envoy circuit breaking config:

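  • max_connections: maximum number of concurrent connections Envoy will open to the upstream cluster (e.g., 1 000). Once the limit is reached, further requests wait in the pending queue instead of opening a new connection.
  • max_pending_requests: maximum number of requests queued while waiting for a connection (e.g., 100). Excess: 503 immediately.
  • max_requests: maximum number of concurrent requests to the upstream cluster (e.g., 1 000). This is the limit that matters most for HTTP/2, where many requests share one connection. Excess: 503 immediately.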

Retry storm amplification

  • 100 clients × 1 retry: 200 requests on a drowning backend
  • 100 clients × 2 retries: 300 requests
  • K=10 layers × 1 retry each: up to 1 024× amplification
  • Safe retry rate (production SLO): <0.1% of request rate
  • Circuit breaker open → client sees: 503 immediately (no timeout wait)
  • Jitter range (backoff base 1 s): 0–1 s random delay added

Retry budgets

Instead of allowing unlimited retries, set a global retry budget: cap the total number of retries across all clients at ~10% of the total request rate.

Example: If the service handles 10 000 RPS, allow at most 1 000 RPS of retries. Any retry beyond this is rejected with 503 immediately (fail fast). This keeps amplification below a bounded factor regardless of how many clients are retrying simultaneously.
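One way to enforce such a ratio is a credit counter: each original request earns a fraction of a retry, and each retry spends a whole one. The sketch below is an in-process illustration under that assumption; a truly global budget would need shared state across clients, and the class and parameter names are hypothetical. Integer credits are used to avoid floating-point drift.

```python
class RetryBudget:
    COST = 100  # credit units spent by one retry

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 100) -> None:
        self.retry_percent = retry_percent            # units earned per original request
        self.cap = max_banked_retries * self.COST     # limit on unused credit
        self.credit = 0

    def record_request(self) -> None:
        """Call once per original (non-retry) request."""
        self.credit = min(self.cap, self.credit + self.retry_percent)

    def try_retry(self) -> bool:
        """Return True if a retry may be sent; False means fail fast with 503."""
        if self.credit >= self.COST:
            self.credit -= self.COST
            return True
        return False


budget = RetryBudget(retry_percent=10, max_banked_retries=1000)
for _ in range(10_000):
    budget.record_request()
allowed = sum(budget.try_retry() for _ in range(5_000))
print(allowed)  # 1000
```

The cap on banked credit matters: without it, a long quiet period would accumulate enough credit to permit a large retry burst exactly when the backend is struggling.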

Exponential backoff with jitter

Clients must not retry immediately — that synchronizes all retries and creates a thundering herd. Two components:

  1. Exponential backoff: Retry delay doubles on each attempt: 1 s → 2 s → 4 s → 8 s → … up to a maximum (e.g., 32 s).
  2. Jitter: Add a random value random(0, base) to each delay. This desynchronizes clients so they retry at staggered times.

delay = min(base × 2^(attempt − 1) + random(0, base), max_delay), with attempt counted from 1

Example with base=1 s, max=32 s:

  • Attempt 1: 1 s + rand(0, 1 s).
  • Attempt 2: 2 s + rand(0, 1 s).
  • Attempt 3: 4 s + rand(0, 1 s).

Without jitter, all 100 clients that timed out at t=30 s retry at t=31 s, creating another burst. With jitter, they retry spread across t=31–32 s.
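A minimal sketch of the delay calculation, following the 1 s → 2 s → 4 s schedule and the random(0, base) jitter described above. Counting attempts from 1 and the function name are assumptions of this example.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 32.0) -> float:
    """Delay in seconds before retry number `attempt` (1 = first retry)."""
    if attempt < 1:
        raise ValueError("attempt is counted from 1")
    exponential = base * 2 ** (attempt - 1)   # 1 s, 2 s, 4 s, 8 s, ...
    jitter = random.uniform(0.0, base)        # desynchronizes clients
    return min(exponential + jitter, max_delay)


for attempt in (1, 2, 3):
    print(attempt, round(backoff_delay(attempt), 2))
```

Note that because min() is applied last, delays that reach max_delay lose their jitter. Whether to jitter the capped value as well is a design choice the formula leaves open.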

Load shedding

The LB and backend both enforce queue depth limits. When the queue exceeds a threshold, new requests are dropped immediately with 503 rather than accepted and queued.

Without load shedding:

  • Queue grows unbounded.
  • Latency of every queued request increases.
  • Memory bloats.
  • Eventually, everything times out simultaneously.

With load shedding:

  • Clients get fast 503 — they know to back off.
  • The backend queue stays bounded.
  • Requests that do enter the queue get served in finite time.
  • System drains and recovers instead of collapsing.
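The queue-depth rule can be sketched as a bounded queue that rejects instead of growing. This is an illustration only: the class name is hypothetical, and the 202 returned for an accepted request is a stand-in for "queued", not something the text specifies.

```python
from collections import deque
from typing import Any, Optional


class SheddingQueue:
    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self.items: deque = deque()
        self.shed = 0  # number of requests dropped so far

    def offer(self, request: Any) -> int:
        """Return 202 if the request was queued, 503 if it was shed."""
        if len(self.items) >= self.max_depth:
            self.shed += 1
            return 503          # fail fast; the queue stays bounded
        self.items.append(request)
        return 202

    def take(self) -> Optional[Any]:
        """Remove and return the oldest queued request, or None if empty."""
        return self.items.popleft() if self.items else None


q = SheddingQueue(max_depth=3)
print([q.offer(i) for i in range(5)])  # [202, 202, 202, 503, 503]
```

Because depth is capped, the worst-case wait for any accepted request is bounded by max_depth times the service time, which is what lets the system drain.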

Trace it

Trace a cascading failure: backend overload, retry storm, circuit breaker engagement.

  1. Backend B1 experiences thread pool exhaustion. Requests queue. Active health check (HTTP GET) still succeeds (the endpoint is reachable). What do the LB and clients see?
  2. With 100 clients each retrying once, load on B1 goes from 100 to how many requests? What happens to B1's queue?
  3. B1 is ejected. Its 200 queued requests now redistribute to B2, B3, B4. What is the risk?
  4. The LB has a circuit breaker: max 100 concurrent requests per backend. B2 hits it. What do new requests see?
  5. B1's thread pool recovers after 2 minutes. How should it rejoin the pool?

Debug this

Envoy stats dump during a retry storm

cluster.api_backend.requests: 10000
cluster.api_backend.errors: 245
cluster.api_backend.retries: 189
cluster.api_backend_retry_limit: 1200
upstream_rq_retry.api_backend: 189
upstream_rq_retry_limit_exceeded: 45
upstream_rq_total.api_backend: 10000
upstream_cx: 156
health_checks.failed: 12
health_checks.success: 188
circuit_breaker.default.rq_open: 0
circuit_breaker.default.cx_open: 0
outlier_detection.ejected_count: 0
load_balancer.least_request.unbalanced_requests_delta: 234

Stats show 189 retries out of 10 000 requests (~1.9% retry rate) and 45 requests hit the retry limit. Is the cluster healthy, and what should the operator do?

Edge cases

Per-layer retry budget vs global. In a microservice stack, if each of 10 layers allows 1 retry, the worst-case amplification is 2^10. The correct architecture: allocate the total retry budget at the outermost layer (the API gateway or client) and propagate a retry-remaining header inward. Inner services do not retry at all unless the header grants them budget. This keeps amplification bounded at 2× regardless of stack depth.
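The difference between the two designs can be simulated with a toy call chain in which the innermost backend always fails. The header name x-retry-remaining and the function names are hypothetical; the sketch assumes the outermost layer keeps the whole budget and forwards zero to every inner layer.

```python
from typing import Dict, List

HEADER = "x-retry-remaining"  # hypothetical header name


def handle(depth: int, headers: Dict[str, str], calls: List[int],
           per_layer_retry: bool) -> bool:
    """Layer `depth` calls layer depth - 1; depth 0 is the failing backend."""
    if depth == 0:
        calls[0] += 1
        return False  # worst case: the backend never succeeds
    # Either every layer retries once, or only layers granted budget do.
    budget = 1 if per_layer_retry else int(headers.get(HEADER, "0"))
    downstream = {HEADER: "0"}  # this layer keeps the budget; inner layers get none
    for _ in range(1 + budget):
        if handle(depth - 1, downstream, calls, per_layer_retry):
            return True
    return False


def backend_calls(layers: int, per_layer_retry: bool) -> int:
    calls = [0]
    handle(layers, {HEADER: "1"}, calls, per_layer_retry)  # gateway grants 1 retry
    return calls[0]


print(backend_calls(10, per_layer_retry=True))   # 1024
print(backend_calls(10, per_layer_retry=False))  # 2
```

One original request produces 1 024 backend calls when every layer retries, but only 2 when the budget lives at the edge.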

Recall before you leave

  1. Explain the retry amplification problem in a microservice architecture with N layers. Why is a 1% retry rate per layer catastrophic?
  2. How does a circuit breaker stop a retry storm, and what are its three states?
  3. Why must jitter be added to exponential backoff, and what does it prevent?

Recap

Retry storms are the most common cause of sustained microservice outages. When a backend blips, automatic retries add load to an already-impaired backend — amplified 2^K× across K service layers. Circuit breakers stop the cascade by returning 503 immediately after N failures, giving the backend space to drain and recover. The minimum production mitigations are: retry budgets capping retries at <10% of RPS; exponential backoff with jitter to desynchronize retry timing; and load shedding to drop new requests when queue depth exceeds a threshold. Alert on retry rates above 0.1% — that is the earliest warning sign before a storm escalates.

Trademarks belong to their respective owners. Editorial reference only.