awesome-everything

Caching

What is a cache stampede and why it makes things worse

Crux: a single TTL expiry under concurrent traffic forces every waiting request to rebuild the cache at once, turning the cache into a weapon against its own database.
◷ 10 min

A flash sale launches at noon, and the homepage is cached with a 30-second TTL. At 12:00:30 the key expires, and 10,000 customers hit the database at once. The cache was supposed to prevent exactly this.

The shape of the failure

A cache layer works by absorbing repeated reads. When a key is live, every request returns in microseconds without touching the database. The flaw appears at TTL expiry: the instant the key becomes stale, every concurrent request sees a miss simultaneously.

Under low traffic this is fine: one request misses, rebuilds, stores the new value, and the next request hits. Under high traffic the gap between expiry and that first write is no longer empty. All N concurrent requests arrive between the moment the key expires and the moment any of them writes the new value, and each one independently runs the rebuild. The database, which the cache was hiding, now sees N parallel queries.
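The race can be reproduced in a few lines. What follows is a minimal Python sketch of naive cache-aside reads, not a production cache: an in-process dict with expiry timestamps stands in for a shared cache such as Redis, and `TTLCache`, `rebuild_homepage` and `get_homepage` are hypothetical names invented for the illustration. A `threading.Barrier` releases 50 requests at the same instant against an empty cache, which models the moment just after expiry; the 200 ms sleep is an assumed rebuild cost.

```python
import threading
import time


class TTLCache:
    """Toy in-process cache with per-key expiry (stand-in for e.g. Redis)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # guards the dict only, never the rebuild

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return None
            value, expires_at = entry
            if time.monotonic() >= expires_at:
                del self._data[key]  # expired: behave like a miss
                return None
            return value

    def set(self, key, value, ttl):
        with self._lock:
            self._data[key] = (value, time.monotonic() + ttl)


cache = TTLCache()
db_queries = 0
db_lock = threading.Lock()


def rebuild_homepage():
    """Hypothetical expensive DB query; counts how many times it runs."""
    global db_queries
    with db_lock:
        db_queries += 1
    time.sleep(0.2)  # assumed 200 ms rebuild
    return "<html>homepage</html>"


def get_homepage():
    # Naive cache-aside: on a miss, every caller rebuilds and stores.
    value = cache.get("homepage")
    if value is None:
        value = rebuild_homepage()
        cache.set("homepage", value, ttl=60)
    return value


N_REQUESTS = 50
barrier = threading.Barrier(N_REQUESTS)


def request():
    barrier.wait()  # release every request at the same instant
    get_homepage()


threads = [threading.Thread(target=request) for _ in range(N_REQUESTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"requests: {N_REQUESTS}, DB rebuilds: {db_queries}")
```

Because every thread checks the cache before any rebuild has finished, the counter typically shows one rebuild per request rather than one in total.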

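| Phase | Cache state | DB load |
| --- | --- | --- |
| Normal operation (60 s window) | Key live, TTL > 0 | Near zero |
| Expiry instant | Key expired | N concurrent rebuild queries |
| After the first rebuild writes | Key live again | Falls to near zero once in-flight rebuilds finish |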

The total number of database queries stays low, because the cache absorbed 60 s of requests. But the peak concurrency at expiry equals the full, unfiltered traffic rate for as long as the rebuild takes. The database was sized for steady-state load behind a cache, not for a burst at the traffic ceiling.
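The size of the herd can be estimated with a back-of-the-envelope model. Assume, as a simplification, that requests for the hot key arrive at a steady rate λ and that every rebuild takes the same time t_rebuild (both symbols are introduced here, not taken from the lesson). Every request that arrives between expiry and the first write-back misses, so the number of parallel rebuilds is roughly:

```latex
N \approx \lambda \cdot t_{\text{rebuild}}
  = 5000\,\text{req/s} \times 0.4\,\text{s}
  = 2000\ \text{rebuilds}
```

The inputs are the 5,000 RPS and 0.4 s rebuild time used in this lesson's timeline. The TTL does not appear in the formula at all.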

Why longer TTL does not help

The intuitive fix is “set a longer TTL”. This does not fix a stampede; it only shifts when it happens. A 1-hour TTL means the herd arrives once per hour instead of once per minute, and each hourly stampede is no smaller than a per-minute one: its size is set by the traffic rate and the rebuild time, not by how long the key lived. The right fix changes what happens at expiry, not when expiry occurs.
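A toy simulation illustrates that the TTL changes how often the herd arrives, not how big it is. This Python sketch rests on simplifying assumptions of its own: requests arrive evenly (the rate must be a multiple of 1,000 RPS so each millisecond gets a whole number of requests), every rebuild takes exactly `rebuild_ms`, every miss starts its own rebuild, and only the first rebuild to finish resets the TTL. `simulate` is a hypothetical helper, and the model ignores anything that could make a rarely run rebuild slower than a frequently run one.

```python
def simulate(rps: int, rebuild_ms: int, ttl_s: int, horizon_s: int) -> list[int]:
    """Return how many backend rebuilds each expiry triggers within the horizon.

    Toy model of naive cache-aside: requests arrive evenly at `rps` (a multiple
    of 1000), each rebuild takes `rebuild_ms`, every miss starts its own rebuild,
    and the key is rewritten with a fresh TTL when the first rebuild finishes.
    """
    per_ms = rps // 1000            # requests arriving in each millisecond
    ttl_ms = ttl_s * 1000
    horizon_ms = horizon_s * 1000
    expires_at = ttl_ms             # the key was written at t = 0
    bursts = []
    while expires_at < horizon_ms:
        # While the key is live every request hits, so jump straight to expiry.
        now = expires_at
        first_write = now + rebuild_ms  # when the earliest rebuild lands
        rebuilds = 0
        # Step through the miss window one millisecond at a time:
        # every request arriving before the first write misses and rebuilds.
        while now < first_write:
            rebuilds += per_ms
            now += 1
        bursts.append(rebuilds)
        expires_at = first_write + ttl_ms  # the first write resets the TTL
    return bursts


for ttl in (60, 3600):
    bursts = simulate(rps=5000, rebuild_ms=400, ttl_s=ttl, horizon_s=7200)
    print(f"TTL={ttl}s: {len(bursts)} stampede(s) in 2 h, largest = {max(bursts)} rebuilds")
```

With these inputs the 60 s TTL produces 119 stampedes in two hours and the 1-hour TTL produces one, yet both report a largest burst of 2,000 rebuilds.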

The bursty traffic shape is itself the problem

Without a cache, the database sees a steady 5,000 RPS. With a cache and TTL=60s, the database sees near 0 RPS for 59 seconds and then 5,000 RPS in one second. The total work is far lower — but the peak is identical to no-cache. The cache reshapes traffic from steady to bursty, and it is the burst, not the volume, that causes failures.
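The reshaping can be quantified with a similar rough model. Assume requests arrive at a steady rate λ, every miss triggers a rebuild that takes t_rebuild, one cycle lasts about one TTL, T (the short rebuild window is ignored), and background traffic such as health checks is left out; all three symbols are introduced here. Averaged over a cycle the rebuild queries look tiny, while at expiry queries start at the full arrival rate:

```latex
\bar{q} \approx \frac{\lambda \, t_{\text{rebuild}}}{T}
        = \frac{5000 \times 0.4}{60} \approx 33\ \text{queries/s}
\qquad
q_{\text{peak}} \approx \lambda = 5000\ \text{queries/s}
\qquad
\frac{q_{\text{peak}}}{\bar{q}} \approx \frac{T}{t_{\text{rebuild}}} = \frac{60}{0.4} = 150
```

Taking the 0.4 s rebuild time from this lesson's timeline, the cache cuts average database load by more than two orders of magnitude compared with 5,000 queries/s and no cache, while the peak stays the same.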

A concrete timeline

  1. T=0s: homepage:v1 cached with TTL=60s. Traffic: 5,000 RPS.
  2. T=0s–59.9s: cache hits. DB sees ~10 QPS (health checks, etc.).
  3. T=60.0s: key expires.
  4. T=60.0s–60.4s: 2,000 requests arrive (5,000 RPS × 0.4 s rebuild time). Each runs GET homepage:v1, gets nil, and starts a 400 ms rebuild: 2,000 parallel DB queries.
  5. T=60.4s: the first rebuild finishes and writes the new value, so new arrivals hit the cache again. The rebuilds that started later keep finishing (each rewriting the same key) until about T=60.8s, when DB CPU falls.
  6. T≈60.8s onward: cache hits again until the rewritten key expires roughly 60 s after the last write (about T=120.8s), and the cycle repeats.
Quiz

What triggers a cache stampede?

Quiz

Why does increasing the cache TTL from 60 s to 1 hour not fix stampede?

Order the steps

Put the events of a cache stampede in order:

  1. A hot cache key has TTL=60 s and receives 5,000 RPS of read traffic
  2. Second 60 arrives: the cache key expires
  3. 5,000 concurrent requests in the next second all see a cache miss
  4. All 5,000 requests run the same expensive backend rebuild independently
  5. Database CPU saturates at 100%; renders queue, then time out
  6. Half the requests succeed in writing the new value; the other half error to the user
  7. For the next 60 s the cache absorbs traffic normally, until the next expiry
Complete the analogy

Fill in the blank: a cache stampede happens because many requests miss the cache _______, all at the same instant.

Recall before you leave
  1. Why does a cache make a failure mode worse than no cache at all, even though total database load is lower?
  2. A homepage cached with a 60 s TTL receives 2,000 requests per second. The rebuild takes 400 ms. How many parallel DB queries does a stampede produce?
Recap

A cache stampede is not a server crash or a misconfiguration — it is the normal TTL mechanism combined with concurrent traffic. When a hot key expires, every in-flight request sees a miss and runs the rebuild independently. The database, sized for cached steady-state, sees a one-second burst equal to the full unfiltered traffic rate. Increasing the TTL only moves the burst in time; it does not prevent it. The next lesson covers the two simplest mitigations: the distributed lock and in-process single-flight.


Trademarks belong to their respective owners. Editorial reference only.