Stale-while-revalidate and cache stampede

How stale-while-revalidate defeats cache stampedes, when stale-if-error saves you during origin outages, and the four strategies to prevent thundering herds.

At exactly T+3600 seconds, your most popular article’s cache entry expires across all edge POPs simultaneously. One thousand users request that page in the next second. Every one of them gets a cache miss. Every one triggers an origin fetch. Your origin sees 1000× normal traffic and starts timing out — and because some of those requests timed out, the CDN stored the 503 response as the new cached entry. Now every user sees a 503 for the next 3600 seconds.

The cache stampede problem

A cache stampede (also called thundering herd) happens when:

1. A popular cached response expires.
2. Many concurrent requests arrive simultaneously after expiry.
3. All miss the cache and each independently fetches from origin.
4. Origin is overwhelmed; some requests time out.
5. The CDN caches the error responses, making things worse.

Without mitigation, the number of simultaneous in-flight origin requests is roughly (requests/sec at expiry) × (origin response time). A page receiving 500 req/s with a 200 ms origin response time can generate 100 simultaneous in-flight origin requests, about 100× the single refresh request origin would normally see for that page. Steps 4 and 5 are the lethal combination: without step 5, origin would at least recover when the burst passes; with cached errors, the failure locks in for another full TTL.
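The arithmetic above is Little's law (concurrency = arrival rate × time in flight). A minimal sketch, where the function name and the 2-second overloaded latency are illustrative assumptions rather than figures from the lesson:

```python
def in_flight_origin_requests(requests_per_second: float, origin_latency_ms: float) -> float:
    """Average number of origin requests in flight at once (Little's law)."""
    return requests_per_second * origin_latency_ms / 1000


# The example from the text: 500 req/s against a 200 ms origin.
healthy = in_flight_origin_requests(500, 200)

# Hypothetical follow-on: if overload pushes origin latency to 2 s,
# the same arrival rate keeps ten times as many requests open.
overloaded = in_flight_origin_requests(500, 2000)

print(healthy, overloaded)
```

The second figure shows why a stampede feeds itself: as origin slows down, each request stays open longer, so concurrency climbs even though the arrival rate has not changed.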

With no defence, the same expiry event amplifies into hundreds or thousands of simultaneous origin requests; each of the four mitigations below cuts that to roughly one origin request per expiry.

stale-while-revalidate (SWR)

RFC 5861 defines stale-while-revalidate=<seconds>:

Cache-Control: public, max-age=60, stale-while-revalidate=604800

After max-age=60 seconds expire:

1. Serve the stale response immediately to every incoming request.
2. Send one background revalidation request to origin.
3. Cache becomes fresh again after origin responds.

A response can be served for at most max-age + stale-while-revalidate = 60 s + 7 days after it was stored; past that point the cache must fetch from origin before answering.

All 1000 concurrent users at T+60 still get a response in ~20 ms (stale edge hit), while origin sees exactly one revalidation request. The stampede never happens.
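The behaviour described above can be sketched as a toy in-process cache. This is an illustration under stated assumptions, not how any particular CDN implements it: `SwrCache`, `Entry`, and the injected `fetch` and `clock` callables are hypothetical names, and limiting revalidation to one in-flight request per key is a choice made by this sketch (RFC 5861 permits background revalidation but does not require a cache to de-duplicate it).

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class Entry:
    body: str
    stored_at: float


class SwrCache:
    """Toy single-process cache with stale-while-revalidate semantics."""

    def __init__(self, fetch: Callable[[str], str], max_age: float, swr: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._swr = swr
        self._clock = clock
        self._entries: Dict[str, Entry] = {}
        self._refreshing: Set[str] = set()
        self._lock = threading.Lock()

    def get(self, key: str) -> str:
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                age = self._clock() - entry.stored_at
                if age <= self._max_age:
                    return entry.body  # fresh hit
                if age <= self._max_age + self._swr:
                    # Stale but inside the SWR window: answer immediately and
                    # start at most one background revalidation per key.
                    if key not in self._refreshing:
                        self._refreshing.add(key)
                        threading.Thread(target=self._refresh, args=(key,),
                                         daemon=True).start()
                    return entry.body
        # Cold miss or past the SWR window: fetch synchronously.
        # (This path has no coalescing; that is a separate mitigation.)
        return self._store(key, self._fetch(key))

    def _refresh(self, key: str) -> None:
        try:
            self._store(key, self._fetch(key))
        finally:
            with self._lock:
                self._refreshing.discard(key)

    def _store(self, key: str, body: str) -> str:
        with self._lock:
            self._entries[key] = Entry(body, self._clock())
        return body
```

With `max_age=60` and `swr=604800`, a thousand `get()` calls at T+61 all return the stored body at once while `fetch` runs a single time in the background.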

The trade-off: users may see content that is up to max-age + stale-while-revalidate seconds old. For a news article body (max-age=300, stale-while-revalidate=3600) this means content can be served for up to 1 hour after max-age expires, or 65 minutes after it was fetched. For a breaking-news ticker this is unacceptable: use a short SWR window or no SWR.

stale-while-revalidate by content type

News article body (acceptable staleness 1h): max-age=300, stale-while-revalidate=3600
Product listing (acceptable staleness 10 min): max-age=60, stale-while-revalidate=600
Breaking news ticker (freshness critical): max-age=5, stale-while-revalidate=10
Static asset (content-hashed URL): max-age=31536000, immutable — no SWR needed
User-specific data (bank balance): no-store — no caching at all
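One way to keep these policies consistent is to generate the header values from a single table in code. A minimal sketch: the `cache_control` helper and the policy keys are hypothetical names, and adding `public` to every cacheable policy is an assumption carried over from the lesson's earlier header examples.

```python
from typing import Dict, Optional


def cache_control(max_age: int, swr: Optional[int] = None, immutable: bool = False) -> str:
    """Build a Cache-Control value for a publicly cacheable response."""
    parts = ["public", f"max-age={max_age}"]
    if swr is not None:
        parts.append(f"stale-while-revalidate={swr}")
    if immutable:
        parts.append("immutable")
    return ", ".join(parts)


# Values copied from the list above; the keys are illustrative.
POLICIES: Dict[str, str] = {
    "article_body": cache_control(300, swr=3600),
    "product_listing": cache_control(60, swr=600),
    "news_ticker": cache_control(5, swr=10),
    "hashed_static_asset": cache_control(31536000, immutable=True),
    "user_specific": "no-store",  # never cached, so no freshness fields apply
}

for name, header in POLICIES.items():
    print(f"{name}: Cache-Control: {header}")
```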

stale-if-error: graceful degradation on origin failure

RFC 5861 also defines stale-if-error=<seconds>:

Cache-Control: public, max-age=3600, stale-if-error=86400

When origin returns a 5xx (RFC 5861 names 500, 502, 503, and 504) or is unreachable, serve the stale cached response for up to stale-if-error seconds (1 day here) after it became stale, rather than returning an error to users. This is the CDN equivalent of a circuit breaker.

Use cases: marketing pages, documentation, article pages — anything where a 1-day-stale version is better than a 503. Not for checkout, payment, or any operation that must reflect real-time state.
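The stale-if-error decision can be sketched in a few lines. This is an illustrative model, not a CDN implementation: `StaleIfErrorCache`, `OriginError`, and the injected `fetch` and `clock` are hypothetical, with `OriginError` standing in for a 5xx response or an unreachable origin.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


class OriginError(Exception):
    """Hypothetical stand-in for a 5xx response or an unreachable origin."""


@dataclass
class Entry:
    body: str
    stored_at: float


class StaleIfErrorCache:
    def __init__(self, fetch: Callable[[str], str], max_age: float,
                 stale_if_error: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._stale_if_error = stale_if_error
        self._clock = clock
        self._entries: Dict[str, Entry] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry.stored_at <= self._max_age:
            return entry.body  # fresh hit, origin not contacted
        try:
            body = self._fetch(key)
        except OriginError:
            # Origin failed: fall back to the stale copy if it is still inside
            # the stale-if-error window. The error itself is never stored.
            limit = self._max_age + self._stale_if_error
            if entry is not None and now - entry.stored_at <= limit:
                return entry.body
            raise
        self._entries[key] = Entry(body, now)
        return body
```

Because the error is never written into the cache, a failed refresh cannot overwrite the last good copy, which is the failure mode from the opening scenario.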

The four stampede mitigations

Strategy	How it works	Best for
Origin shield	Collapses all edge misses in a region to one origin request	All cache tiers
stale-while-revalidate	Serves stale immediately, one background revalidation	Mutable content, tolerable staleness
Request coalescing (singleflight)	Application-level: first miss starts origin fetch; others wait for the same result	Origin application layer
Probabilistic early expiration (PER / XFetch)	Stochastically refresh slightly before TTL, spreading the load over time	High-traffic caches
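The last row of the table can be sketched as a single decision function. This follows the commonly published XFetch rule (refresh when `now - delta * beta * ln(rand) >= expiry`); the function name, parameter names, and the injectable `rng` are assumptions made for illustration.

```python
import math
import random
from typing import Callable


def should_refresh_early(now: float, expires_at: float, recompute_seconds: float,
                         beta: float = 1.0,
                         rng: Callable[[], float] = random.random) -> bool:
    """Decide whether this request should refresh the entry before it expires.

    recompute_seconds is how long an origin fetch takes; beta > 1 refreshes
    earlier, beta < 1 refreshes later.
    """
    # rng() is in [0, 1), so 1 - rng() is in (0, 1] and log() is always defined.
    jitter = -recompute_seconds * beta * math.log(1.0 - rng())
    return now + jitter >= expires_at


# Each request rolls independently, so refreshes spread out over the last
# moments before expiry instead of all landing on the expiry instant.
expires_at = 3600.0
early = sum(should_refresh_early(3599.9, expires_at, recompute_seconds=0.2)
            for _ in range(10_000))
print(f"{early} of 10000 requests at T-0.1s chose to refresh early")
```

The request that chooses to refresh still has to perform the fetch and store the result; this function only makes the decision. Once `now` reaches `expires_at` it always returns `True`.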

Why this works

Why origin shield is the first line of defence. Without an origin shield, every CDN edge POP has a separate cache. When the same URL expires across 200 POPs in a region, all 200 independently fetch from origin. With an origin shield, all 200 edges route their misses through one shield node. The shield has its own cache (larger than any single edge) and makes at most one origin request per URL per region. SWR adds a second layer: even when the shield's copy has expired, users still see the stale response while one origin request is in flight. Both layers together mean a popular URL expiry generates one origin request per shield region, not one per edge or one per concurrent user.
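Collapsing concurrent misses into one upstream fetch is the same idea as request coalescing (singleflight) from the table, applied at the application layer. A minimal thread-based sketch follows; `SingleFlight` and `_Call` are hypothetical names modelled loosely on Go's singleflight package, and this is not a description of how any shield product works internally.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared between the leader and the callers waiting on it."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[Exception] = None


class SingleFlight:
    """Ensure only one call per key is in flight; later callers share its result."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()  # the only origin fetch for this key
            except Exception as exc:
                call.error = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()  # follower: block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result
```

A cache-miss handler would call something like `flight.do(url, lambda: fetch_from_origin(url))`, where `fetch_from_origin` is a placeholder for the real origin client. Requests arriving after the leader finishes start a new flight, so this is normally paired with a cache that stores the result.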

Trace it

A news site experiences a 10× traffic spike from a viral article. Origin load alarm fires despite CDN being in front. Diagnose.

Step 1: check CDN cache hit rate during the spike. It shows 30% instead of the usual 90%. What does this indicate?

Step 2: dig into the article response headers. You find: Vary: User-Agent. Why is this catastrophic for cache hit rate?

Step 3: what is the immediate mitigation while you deploy a fix?

Step 4: add stale-while-revalidate to the article's Cache-Control. How does this change the behaviour at the next traffic spike?

Quiz

Why is stale-while-revalidate important for cache stampede defence?

Which RFC defines stale-while-revalidate and stale-if-error Cache-Control extensions?

Trace it

Diagnose: users in two regions see different versions of the same page 2 hours after a deployment.

Step 1: confirm both edges saw the deploy. Check origin last-modified via each edge.

Step 2: how long until the cache at edge B (the region still serving the old version) naturally expires?

Step 3: how do you force-refresh now?

Step 4: how do you prevent this in future deploys?

stale-while-revalidate inverts the stampede: all 1000 concurrent users get the stale response immediately, so none of them waits on origin; the edge sends a single background revalidation. Origin sees one request instead of 1000.

Recall before you leave

01 Explain the cache stampede problem and why stale-while-revalidate prevents it.
02 Under what conditions should you NOT use stale-while-revalidate?
03 What does stale-if-error do and how does it differ from stale-while-revalidate?

Recap

The cache stampede problem: a popular cache entry expires; many concurrent users generate simultaneous origin requests; origin is overwhelmed and may start returning errors; those errors get cached. The four mitigations are: (1) origin shield, which collapses all edge misses in a region to one origin request; (2) stale-while-revalidate, which serves the stale response to all users while sending one background revalidation; (3) application-level request coalescing (singleflight), which prevents concurrent origin requests at the application layer; (4) probabilistic early expiration, which spreads revalidations across time. stale-if-error (RFC 5861) adds graceful degradation: on origin failure, serve the last cached version for up to N seconds instead of propagating errors. Match staleness windows to content correctness requirements: a news article body can tolerate an hour of staleness; a checkout price cannot tolerate 10 seconds. When you see a sudden origin traffic spike on a well-cached route, look for expiry-time synchronisation: if all edges cached a response at the same moment, they will all expire at the same moment, and that is your stampede.
