
Networking & Protocols

Stale-while-revalidate and cache stampede

Crux: How stale-while-revalidate defeats cache stampedes, when stale-if-error saves you during origin outages, and the four strategies to prevent thundering herds.

At exactly T+3600 seconds, your most popular article’s cache entry expires across all edge POPs simultaneously. One thousand users request that page in the next second. Every one of them gets a cache miss. Every one triggers an origin fetch. Your origin sees 1000× normal traffic and starts timing out — and because some of those requests timed out, the CDN stored the 503 response as the new cached entry. Now every user sees a 503 for the next 3600 seconds.

The cache stampede problem

A cache stampede (also called thundering herd) happens when:

  1. A popular cached response expires.
  2. Many concurrent requests arrive simultaneously after expiry.
  3. All miss the cache and each independently fetches from origin.
  4. Origin is overwhelmed; some requests time out.
  5. The CDN caches the error responses — making things worse.

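Without mitigation, the number of concurrent origin requests is roughly (requests/sec at expiry) × (origin response time). A page receiving 500 req/s with a 200 ms origin response time can generate 100 simultaneous in-flight origin requests where a warm cache would have sent a single revalidation: 100× normal origin load. The real figure is usually worse, because origin response time grows as the origin saturates.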
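The arithmetic above is Little's law applied to the window in which every request misses. A minimal sketch, assuming a constant arrival rate and a constant origin latency; the function name `in_flight_origin_requests` is illustrative only:

```python
def in_flight_origin_requests(requests_per_sec: int, origin_latency_ms: int) -> float:
    """Average number of concurrent origin fetches while every request misses.

    Little's law: concurrency = arrival rate x time each request stays in flight.
    """
    return requests_per_sec * origin_latency_ms / 1000


# The example from the text: 500 req/s against a 200 ms origin.
print(in_flight_origin_requests(500, 200))  # prints 100.0
```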

stale-while-revalidate (SWR)

RFC 5861 defines stale-while-revalidate=<seconds>:

Cache-Control: public, max-age=60, stale-while-revalidate=604800

After max-age=60 seconds expire:

  • Serve the stale response immediately to every incoming request.
  • Send one background revalidation request to origin.
  • Cache becomes fresh again after origin responds.
  • A stored response can therefore be served for up to max-age + stale-while-revalidate = 60 s + 7 days after it was stored.

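All 1000 concurrent users at T+60 still get a response in ~20 ms (stale edge hit), while origin sees one revalidation request from each cache that holds the entry, assuming the cache collapses concurrent revalidations for the same URL. The stampede never happens.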
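The behaviour described above can be sketched as a toy single-node cache. This is an illustration under stated assumptions, not how any particular CDN implements it: the class `SwrCache`, the injected `fetch` and `now` callables, and the explicit `run_background_revalidations` step (standing in for an asynchronous refresh) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class Entry:
    body: str
    stored_at: float


class SwrCache:
    """Toy cache; origin fetch and clock are injected so behaviour is testable."""

    def __init__(self, fetch: Callable[[str], str], now: Callable[[], float],
                 max_age: float, swr: float) -> None:
        self.fetch, self.now = fetch, now
        self.max_age, self.swr = max_age, swr
        self.entries: Dict[str, Entry] = {}
        self.revalidating: Set[str] = set()

    def get(self, key: str) -> str:
        entry = self.entries.get(key)
        if entry is not None:
            age = self.now() - entry.stored_at
            if age < self.max_age:
                return entry.body  # fresh hit
            if age < self.max_age + self.swr:
                # Stale but inside the SWR window: serve immediately and
                # schedule at most one refresh per key.
                self.revalidating.add(key)
                return entry.body
        return self._store(key)  # cold miss or too stale: blocking fetch

    def run_background_revalidations(self) -> None:
        """Stand-in for the asynchronous refresh a real cache would perform."""
        for key in list(self.revalidating):
            self._store(key)
            self.revalidating.discard(key)

    def _store(self, key: str) -> str:
        body = self.fetch(key)
        self.entries[key] = Entry(body, self.now())
        return body


clock = [0.0]
calls = []


def fetch(key: str) -> str:
    calls.append(key)
    return f"v{len(calls)}"


cache = SwrCache(fetch, lambda: clock[0], max_age=60, swr=604800)
cache.get("/article")                                   # cold miss: first origin fetch
clock[0] = 60                                           # entry has just gone stale
bodies = [cache.get("/article") for _ in range(1000)]   # 1000 stale hits
cache.run_background_revalidations()                    # one refresh
print(len(calls), set(bodies), cache.get("/article"))   # prints: 2 {'v1'} v2
```

The 1000 requests at T+60 all receive the stale body, and the origin is contacted twice in total: once for the cold miss and once for the refresh.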

The trade-off: users may see content that is up to max-age + stale-while-revalidate seconds old. For a news article body (max-age=300, stale-while-revalidate=3600) this means content can still be served for up to 1 hour after max-age expires. For a breaking-news ticker, this is unacceptable; use a short SWR window or none at all.

stale-while-revalidate by content type

  • News article body (acceptable staleness 1h): max-age=300, stale-while-revalidate=3600
  • Product listing (acceptable staleness 10 min): max-age=60, stale-while-revalidate=600
  • Breaking news ticker (freshness critical): max-age=5, stale-while-revalidate=10
  • Static asset (content-hashed URL): max-age=31536000, immutable (no SWR needed)
  • User-specific data (bank balance): no-store (no caching at all)
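To see how a cache would treat these policies at a given age, the sketch below parses a Cache-Control value and classifies a stored response. It is deliberately simplified: it handles only unquoted integer values, ignores directives it does not know, and the function names and returned labels are illustrative, not part of any library.

```python
def parse_cache_control(header: str) -> dict:
    """Split a Cache-Control value into {directive: int or True}."""
    directives = {}
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = int(value) if value.isdigit() else True
    return directives


def classify(header: str, age: int) -> str:
    """Return what a cache may do with a stored response of the given age (seconds)."""
    d = parse_cache_control(header)
    if "no-store" in d:
        return "do-not-cache"
    max_age = d.get("max-age", 0)
    swr = d.get("stale-while-revalidate", 0)
    if age < max_age:
        return "fresh"
    if age < max_age + swr:
        return "serve-stale-and-revalidate"
    return "must-fetch"


ticker = "max-age=5, stale-while-revalidate=10"
print(classify(ticker, 3))    # fresh
print(classify(ticker, 12))   # serve-stale-and-revalidate
print(classify(ticker, 15))   # must-fetch
```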

stale-if-error: graceful degradation on origin failure

RFC 5861 also defines stale-if-error=<seconds>:

Cache-Control: public, max-age=3600, stale-if-error=86400

When origin returns a 5xx or is unreachable, serve the stale cached response for up to stale-if-error seconds (1 day here) rather than returning an error to users. This is the CDN equivalent of a circuit breaker.

Use cases: marketing pages, documentation, article pages — anything where a 1-day-stale version is better than a 503. Not for checkout, payment, or any operation that must reflect real-time state.
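The fallback logic can be sketched in a few lines. This is an assumption-laden illustration: `OriginError` is a hypothetical stand-in for a 5xx response or a connection failure, and `serve` takes the cached body, its age, and an origin `fetch` callable as plain arguments rather than modelling a full cache.

```python
from typing import Callable, Optional, Tuple


class OriginError(Exception):
    """Hypothetical stand-in for a 5xx response or an unreachable origin."""


def serve(cached: Optional[str], age: float, max_age: float,
          stale_if_error: float, fetch: Callable[[], str]) -> Tuple[str, str]:
    """Return (body, source) for one request."""
    if cached is not None and age < max_age:
        return cached, "fresh-hit"
    try:
        return fetch(), "origin"
    except OriginError:
        # Origin failed: fall back to the stale copy if it is still inside
        # the stale-if-error window, otherwise let the error propagate.
        if cached is not None and age < max_age + stale_if_error:
            return cached, "stale-if-error"
        raise


def failing_origin() -> str:
    raise OriginError("503 from origin")


# Entry is 400 s past max-age=3600, origin is down, window is one day.
print(serve("yesterday's page", 4000, 3600, 86400, failing_origin))
```

Once the entry is older than max-age plus the stale-if-error window, the error reaches the user again, which is why the window should be sized to the longest outage you expect to ride out.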

The four stampede mitigations

Strategy | How it works | Best for
Origin shield | Collapses all edge misses in a region to one origin request | All cache tiers
stale-while-revalidate | Serves stale immediately, one background revalidation | Mutable content, tolerable staleness
Request coalescing (singleflight) | Application-level: first miss starts origin fetch; others wait for the same result | Origin application layer
Probabilistic early expiration (PER / XFetch) | Stochastically refresh slightly before TTL, spreading the load over time | High-traffic caches
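Request coalescing is the easiest of the four to show in application code. The sketch below is one possible thread-based version in Python; the `SingleFlight` class and its `do` method are hypothetical names chosen for illustration, and the 0.3 s sleep merely simulates a slow origin.

```python
import threading
import time
from typing import Any, Callable, Dict


class _Call:
    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Any = None


class SingleFlight:
    """Concurrent callers with the same key share one execution of fn."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = self._calls[key] = _Call()
        if leader:
            try:
                call.result = fn()       # only the first caller hits the origin
            except Exception as exc:     # followers see the same failure
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()             # followers wait for the leader's result
        if call.error is not None:
            raise call.error
        return call.result


flight = SingleFlight()
fetch_count = []
results = []


def slow_origin_fetch() -> str:
    fetch_count.append(1)
    time.sleep(0.3)                      # simulated origin latency
    return "article-body"


def worker() -> None:
    results.append(flight.do("/article", slow_origin_fetch))


threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(fetch_count), len(results))
```

All 20 threads receive the same body, while the origin function runs once as long as every caller arrives during the leader's fetch. Callers that arrive after the fetch completes start a new one, so this pattern is normally paired with a cache.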
Why this works

Why origin shield is the first line of defense. Without an origin shield, every CDN edge POP has a separate cache. When the same URL expires across 200 POPs in a region, all 200 independently fetch from origin. With an origin shield, all 200 edges route their misses through one shield node. The shield has its own cache (typically larger than any single edge) and makes at most one origin request per URL per region. SWR adds a second layer: even when the shield entry is stale, users still see the stale response while one origin request is in flight. With both layers in place, a popular URL expiry generates one origin request per shield, not one per edge or one per concurrent user.
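The fan-in effect of a shield tier can be shown with a toy two-tier model. The sketch processes edge misses one after another, so it deliberately ignores concurrent misses arriving at the shield (which a real shield would have to coalesce); `Cache` and `count_origin_fetches` are illustrative names.

```python
from typing import Callable, Dict


class Cache:
    """Toy cache tier: on a miss, ask the upstream once and keep the result."""

    def __init__(self, upstream: Callable[[str], str]) -> None:
        self.upstream = upstream
        self.store: Dict[str, str] = {}

    def get(self, url: str) -> str:
        if url not in self.store:
            self.store[url] = self.upstream(url)
        return self.store[url]


def count_origin_fetches(num_edges: int, use_shield: bool) -> int:
    """Expire one URL on every edge at once and count requests reaching origin."""
    origin_hits = []

    def origin(url: str) -> str:
        origin_hits.append(url)
        return "body"

    # With a shield, every edge's upstream is the shield's cache, not the origin.
    upstream = Cache(origin).get if use_shield else origin
    edges = [Cache(upstream) for _ in range(num_edges)]
    for edge in edges:
        edge.get("/article")
    return len(origin_hits)


print(count_origin_fetches(200, use_shield=False))  # prints 200
print(count_origin_fetches(200, use_shield=True))   # prints 1
```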

Trace it

A news site experiences a 10× traffic spike from a viral article. Origin load alarm fires despite CDN being in front. Diagnose.

  1. Check CDN cache hit rate during the spike. It shows 30% instead of the usual 90%. What does this indicate?
  2. Dig into the article response headers. You find: Vary: User-Agent. Why is this catastrophic for cache hit rate?
  3. What is the immediate mitigation while you deploy a fix?
  4. Add stale-while-revalidate to the article's Cache-Control. How does this change the behaviour at the next traffic spike?

Quiz

  1. Why is stale-while-revalidate important for cache stampede defence?
  2. Which RFC defines the stale-while-revalidate and stale-if-error Cache-Control extensions?

Trace it

Diagnose: users in two regions see different versions of the same page 2 hours after a deployment.

  1. Confirm both edges saw the deploy. Check origin last-modified via each edge.
  2. How long until B's cache naturally expires?
  3. How do you force-refresh now?
  4. How do you prevent this in future deploys?

Recall before you leave

  1. Explain the cache stampede problem and why stale-while-revalidate prevents it.
  2. Under what conditions should you NOT use stale-while-revalidate?
  3. What does stale-if-error do and how does it differ from stale-while-revalidate?

Recap

The cache stampede problem: a popular cache entry expires; many concurrent users generate simultaneous origin requests; origin is overwhelmed and may start returning errors; those errors get cached. The four mitigations are: (1) origin shield, which collapses all edge misses in a region to one origin request; (2) stale-while-revalidate, which serves the stale response to all users while sending one background revalidation; (3) application-level request coalescing (singleflight), which prevents concurrent origin requests at the application layer; (4) probabilistic early expiration, which spreads revalidations across time. stale-if-error (RFC 5861) adds graceful degradation: on origin failure, serve the last cached version for up to N seconds instead of propagating errors. Match staleness windows to content correctness requirements: a news article can tolerate 10 minutes of staleness; a checkout price cannot tolerate 10 seconds.
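Of the four mitigations, probabilistic early expiration is the least self-explanatory, so a sketch may help. The decision rule below follows the commonly cited XFetch form: refresh when now − delta × beta × ln(u) ≥ expiry, where u is uniform in (0, 1], delta is the time the last recomputation took, and beta (default 1) tunes how eagerly to refresh. The function name and the injectable `rng` parameter are assumptions made for illustration.

```python
import math
import random
from typing import Callable


def should_refresh_early(now: float, expiry: float, delta: float,
                         beta: float = 1.0,
                         rng: Callable[[], float] = random.random) -> bool:
    """Return True when this request should recompute the entry before it expires.

    delta: seconds the last recomputation took; beta > 1 refreshes earlier.
    """
    # random.random() is in [0, 1), so 1 - rng() is in (0, 1] and log() is defined.
    u = 1.0 - rng()
    # -log(u) >= 0, so the left side is "now" pushed a random distance forward.
    return now - delta * beta * math.log(u) >= expiry


# Estimate how often a request 1 s before expiry triggers an early refresh
# when recomputation takes 0.5 s (fixed seed so the run is repeatable).
seeded = random.Random(42)
trials = 10_000
early = sum(should_refresh_early(99.0, 100.0, 0.5, rng=seeded.random)
            for _ in range(trials))
print(f"{early / trials:.1%} of requests refreshed early")
```

Because each request rolls independently, refreshes are spread across the seconds before expiry instead of piling up at the expiry instant, and the probability rises as expiry approaches.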


Trademarks belong to their respective owners. Editorial reference only.