Timeouts and tail latency: budgets, deadlines, and the fan-out trap

Every hop needs a timeout, and timeouts must compose into a request-wide budget. At scale the tail dominates: fan out to enough services and the slowest one decides almost every user’s latency.

Each service in the chain is healthy: every dependency answers within 10 ms at the 99th percentile. The product page fans out to 100 of them in parallel and waits for all. Yet 63% of page loads wait on at least one call slower than 10 ms. No single service is slow. What is slow is the math of the tail, and most engineers’ intuition about averages hides it completely.

A timeout on every hop, or a hang on every incident

A network call without a timeout is a bug waiting for an incident. When a dependency stops responding (it does not refuse the connection, it hangs), a call with no timeout waits forever, holding a thread or connection. Enough such calls exhaust the pool, and a single slow dependency takes down a healthy service. Every outbound call (DB query, cache, HTTP, RPC) needs an explicit timeout. Many clients default to no timeout at all, which is the worst default in backend engineering.
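To make the hang concrete, here is a minimal sketch of an outbound call with an explicit timeout, using only Python's standard `socket` module. The function name `call_with_timeout` and the raw-socket protocol are illustrative assumptions, not something the lesson prescribes; a real service would set the equivalent option on its HTTP, RPC, or database client.

```python
import socket


def call_with_timeout(host, port, payload, timeout_s):
    """Send payload and wait for a reply, giving up after timeout_s seconds.

    Hypothetical helper for illustration only.
    """
    # The timeout given to create_connection bounds the connect step and is
    # also left set on the returned socket, so it bounds each later
    # send/recv as well. Without it, recv() on a hung peer blocks forever.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(payload)
        # Raises socket.timeout if no bytes arrive within timeout_s.
        return sock.recv(4096)
```

Note that a socket timeout bounds each blocking operation separately, not the call as a whole; bounding the whole request is what a deadline budget is for.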

Timeouts must compose into a budget

Per-hop timeouts set in isolation lie. If a request has a 1 s SLA but calls service A (timeout 1 s) which calls service B (timeout 1 s), then when B is slow, A waits the full second and the client has already given up — A is now doing work nobody is waiting for. The fix is a timeout budget (a deadline): the entry point allocates a total, and each hop passes the remaining time downward. gRPC formalizes this as a deadline propagated in metadata; every service computes its local timeout as min(local default, remaining budget).

Approach	What each hop uses	Failure mode
No timeouts	∞	One hung dependency exhausts pools, cascades
Independent per-hop timeouts	A fixed local value	Inner work outlives the caller’s patience
Propagated deadline (budget)	min(local, remaining)	Bounded; inner hops stop when the budget is spent
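The min(local, remaining) rule in the last row can be sketched in a few lines of Python. The `Deadline` class, its method names, and the `DeadlineExceeded` exception are hypothetical names chosen for this illustration; they are not gRPC's API.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request-wide budget is already spent."""


class Deadline:
    """Request-wide budget, created once at the entry point (illustrative)."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, local_default_s):
        """Timeout for the next hop: min(local default, remaining budget)."""
        remaining = self.remaining()
        if remaining <= 0:
            # Nobody is waiting any more, so do not start the call at all.
            raise DeadlineExceeded("budget spent before the call started")
        return min(local_default_s, remaining)


# Usage: a 1 s budget at the entry point, a 300 ms local default per hop.
deadline = Deadline(budget_s=1.0)
hop_timeout = deadline.timeout_for(0.3)  # 0.3 now, less once the budget runs low
```

Within one process the object can simply be passed along. Across services, this sketch assumes each hop forwards `remaining()` as a duration and the receiver builds a fresh `Deadline` from it, which avoids relying on synchronized clocks.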

Why the tail, not the average, is the SLA

Users do not experience your average. They experience their own request, and the slow ones are what they remember and what trips alerts. So latency is reported as percentiles: p50 (median), p99 (1 in 100 is worse), p99.9. The gap between p50 and p99 is “tail latency,” caused by GC pauses, queueing, cache misses, lock contention, and retries.
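A small sketch shows how an average hides the tail. The latency samples are made up for illustration, and `percentile` is a simple nearest-rank implementation written here rather than a library call.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 98 fast requests at 10 ms and 2 stragglers at 1000 ms.
latencies_ms = [10.0] * 98 + [1000.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 29.8
p50_ms = percentile(latencies_ms, 50)            # 10.0
p99_ms = percentile(latencies_ms, 99)            # 1000.0
print(mean_ms, p50_ms, p99_ms)
```

The mean of about 30 ms describes no request that actually happened: the median user saw 10 ms and the unlucky ones saw a full second, which only the p99 reveals.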

The danger is tail amplification under fan-out. If one request to a service is slow with probability p, then a request that fans out to N services in parallel and waits for all is slow if any one is slow: probability 1 − (1 − p)^N, assuming the calls are slow independently of each other. With a per-service p99 (p = 1%) and N = 100, that is 1 − 0.99^100 ≈ 63%. This is the Hook’s number, straight from Dean and Barroso’s The Tail at Scale. The same paper notes that even if only 1 in 10,000 requests to a single server takes over a second, a service fanning out to 2,000 such servers still leaves almost one in five user requests over a second.

Hold every backend at a 1% slow rate and the page’s slow chance still climbs to 63% at N = 100 and 87% at N = 200, because you wait for all N and are slow if any one is. The tail compounds; it does not average out.
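The fan-out formula is easy to check numerically. This sketch assumes the backends are slow independently of each other, which is what the formula 1 − (1 − p)^N requires; the function name `slow_fraction` is only an illustration. The third row uses a per-server slow rate of 1 in 10,000 with 2,000 servers, the large fan-out case from The Tail at Scale.

```python
def slow_fraction(p, n):
    """Probability that at least one of n independent calls is slow,
    when each call is slow with probability p."""
    return 1.0 - (1.0 - p) ** n


for p, n in [(0.01, 100), (0.01, 200), (0.0001, 2000)]:
    print(f"p={p}, N={n}: {slow_fraction(p, n):.1%} of requests hit a slow call")
```

The three results come out at roughly 63%, 87%, and 18%. Correlated slowness (for example, a shared network blip) breaks the independence assumption, so treat these as a model rather than a measurement.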

Defending the tail: hedging, not just timeouts

A timeout caps the worst case but does not improve the typical tail. The technique from The Tail at Scale is hedged requests: send the request, and if no answer arrives by the p95 latency, send a second copy to another replica and take whichever returns first. Because only the slow ~5% get hedged, the extra load is small (~5%) while the tail collapses — in Google’s measurements, sending a hedge after a 10 ms delay cut p99.9 from 1,800 ms to 74 ms at the cost of ~2% more requests. Tied requests go further: the duplicates tell each other to cancel once one starts executing, trimming wasted work.
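A hedged request can be sketched with `asyncio`. This is a minimal illustration under stated assumptions: `call` is any coroutine function taking a replica identifier, `replicas` holds at least two of them, and `hedge_after_s` would be set near the observed p95. All three names are hypothetical, and this sketch does not implement tied requests, whose cancellation happens on the server side.

```python
import asyncio


async def hedged(call, replicas, hedge_after_s):
    """Send to replicas[0]; if no answer arrives within hedge_after_s,
    also send to replicas[1] and return whichever finishes first."""
    primary = asyncio.create_task(call(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        # The common case: the primary answered before the hedge delay,
        # so no extra request is ever sent.
        return primary.result()

    backup = asyncio.create_task(call(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop waiting on the loser
    # If the first task to finish failed, its exception propagates here;
    # a fuller version might fall back to the other task instead.
    return done.pop().result()
```

Because the backup is only created after the delay expires, the extra load is limited to the fraction of requests slower than `hedge_after_s`.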

Why this works

Why not just retry on timeout instead of hedging? A retry fires only after you have already paid the full timeout — so it improves availability but not latency, and naive retries amplify load exactly when a service is already struggling (the retry storm). Hedging fires speculatively at p95, before the timeout, so it attacks latency directly; and because it is capped to the slow tail it adds bounded load. The two are complementary: hedge to cut the tail, retry with backoff and jitter to survive failures, and a circuit breaker (next unit) to stop both when the dependency is truly down.
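For the retry half of that pairing, here is a minimal sketch of retry with exponential backoff and jitter. The function name, the defaults, and the choice of which exceptions count as retryable are assumptions made for illustration; the delay is drawn uniformly between zero and an exponentially growing ceiling so that clients do not retry in lockstep.

```python
import random
import time


def retry_with_backoff(op, attempts=4, base_s=0.05, cap_s=1.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep, rng=random.random):
    """Call op() up to `attempts` times, sleeping a random, exponentially
    bounded delay between failures. Illustrative defaults only."""
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The ceiling doubles each attempt but never exceeds cap_s.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            # Jitter: pick a delay anywhere in [0, ceiling).
            sleep(rng() * ceiling)
```

The bounded attempt count matters as much as the backoff: unbounded retries are what turn a slow dependency into a retry storm. Only idempotent operations should be passed as `op`.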

Quiz

A page fans out to 100 independent services in parallel and waits for all. Each has a p99 of 10 ms (1% chance a call exceeds it). Roughly what fraction of page loads exceed 10 ms on at least one call?

Quiz

Why do independent per-hop timeouts fail to protect a request with an overall SLA?

Quiz

How does a hedged request reduce tail latency without large extra load?

The entry point sets a total budget; each hop forwards the remaining time and uses min(local timeout, remaining). When the budget is spent, inner hops stop instead of doing work nobody is waiting for.

Recall before you leave

01. Why does every outbound call need an explicit timeout, and why is a propagated deadline better than independent per-hop timeouts?
02. Explain tail amplification under fan-out with the math and the canonical numbers.
03. What are hedged requests and tied requests, and why are they preferred over retries for cutting the tail?

Recap

The last stop turns a request from “it returns” into “it returns in time.” Every outbound call needs an explicit timeout, because the default of waiting forever lets one hung dependency exhaust pools and cascade. But isolated timeouts do not compose, so they must roll up into a propagated deadline where each hop uses min(local, remaining) — the model gRPC standardizes. At scale, the average is a lie: users feel their own request, so you track p99/p99.9, and fan-out amplifies the tail brutally — 1 − (1 − p)^N reaches 63% slow at p=1%, N=100. Timeouts cap the worst case but do not fix the typical tail; hedged and tied requests, fired at p95, collapse it for a few percent extra load. Now when you see a p99 that is fine on each individual service but terrible on the page, draw the fan-out tree and apply the math before adding more caching. This is the bridge to resilience: when a dependency is not just slow but failing, timeouts and hedging are not enough, and the next unit’s circuit breakers and bulkheads take over.
