Caching CACHE · 03 · 06

Metastable failure, fencing tokens, and production postmortems

How an unmitigated stampede pushes a system into a self-sustaining retry storm it cannot escape, the fencing token fix for lock TTL races, four production stampede postmortems, and a three-tier global cache design exercise.

CACHE Senior ◷ 18 min

Level

FoundationsJuniorMiddleSenior

A stampede lasts 10 seconds. Four hours later the DB is still at 100% CPU. The stampede is over — but the system cannot self-recover. This is the metastable failure pattern: once the system falls into the retry storm, it stays there.

The metastable failure sequence

A cache stampede can escalate into a self-sustaining failure:

T=0 — TTL fires, herd hits DB. DB CPU saturates.
T=0–5s — queries queue. Queue depth grows past client request timeout (e.g., 5 s).
T=5s — client timeouts begin. Most client SDKs auto-retry with exponential backoff. Backoff is short (100 ms to 1 s) because the original timeout looked like a transient network failure, not an overloaded server.
T=5–30s — retries multiply the herd. Original 10,000 misses become 30,000 in-flight requests (3 retry attempts per client). DB is now serving retries instead of fresh queries. Cache rebuild cannot complete because DB is too busy to answer rebuild queries.
T=4h — system remains saturated. Every new arrival retries. Retries prevent DB recovery. DB prevents cache rebuild. Cache emptiness generates more retries. Self-sustaining loop.

Together these five steps explain why a 10-second stampede can hold your DB at 100% CPU for hours: the retry amplification alone keeps the DB too busy to complete even a single cache rebuild, and without the cache the arrivals keep retrying. Without an external kill signal, no component in the loop can break it.

The system has two stable states: healthy (cache full, low DB load) and storm (cache empty, DB at 100%). A perturbation moved it from healthy to storm. It cannot move back on its own.

Why the system cannot self-recover

The loop is: no cache → retries → DB overload → no DB capacity → no cache rebuild → no cache. Each component is correctly following its designed behaviour. No component “knows” it is in a storm. The client SDK retries because it sees timeouts. The DB processes queries in order. The cache does not rebuild because the DB never answers fast enough.

Recovery requires breaking the loop from outside:

Load-shedding at the gateway — return 503 Service Unavailable immediately when DB CPU exceeds 90%. Most client SDKs treat 503 as “do not retry immediately” and back off with long jitter. After 30–60 s the queue drains, the DB recovers, and the cache rebuilds.
Circuit-breaker on the rebuild path — if the rebuild query fails or takes too long, serve stale forever until the DB recovers. This breaks the dependency between cache rebuild and DB health.
Manual cache warmup — operators write known-good values directly to the cache, bypassing the rebuild path. Immediately returns the system to the healthy state.

Recovery mechanism	How it breaks the loop	Side effect
503 at gateway	Clients stop retrying	503s visible to users for 30–60 s
Circuit-breaker	Rebuild stops depending on DB	Stale data served indefinitely
Manual warmup	Inserts cache values directly	Requires operator action

The fencing token fix for lock TTL races

A Redis lock with SET lock:key uuid EX 30 NX has a race condition when the rebuild outlasts EX:

Request-1 acquires lock with uuid-A, EX=10 s. Starts a 12-second rebuild.
T=10 s: lock auto-expires. Request-2 acquires lock with uuid-B.
T=12 s: request-1 finishes rebuild. Writes to cache.
T=12 s: request-2 also finishes rebuild. Writes to cache.
Result: duplicate writes. If there is a concurrent invalidation from a DB write, request-1’s stale value may overwrite a newer-correct value.

Fix 1: increase EX above rebuild p99.9. Most duplicates prevented. Not proof against all races.

Fix 2: fencing token check before write.

# Before writing the rebuilt value:
current_lock = GET lock:key
if current_lock != my_uuid:
  ABORT  # we lost the lock; someone else is rebuilding
ELSE:
  SET cache:key new_value EX 60  # write only if still holding lock

Fix 3: monotonic version per key. Include a version in every cache write. The DB or cache layer rejects writes with version ≤ current version. Stale rebuilds write with an old version and are silently rejected. Martin Kleppmann’s 2016 critique of RedLock formalised this: distributed locks alone are insufficient for correctness; fencing is required.

XFetch: why the exponential is optimal

Vattani et al. (VLDB 2015) prove no coordination-free algorithm can do better than exponential at keeping expected refreshes per TTL window at exactly 1. Any tighter rule either:

Requires coordination (distributed lock, lease) to bound variance, or
Accepts higher variance (some windows get 0 refreshes → stampede, some get many → wasted work).

The exponential is uniquely optimal because: the minimum of N independent Exp(λ) draws is Exp(N·λ). As fleet size N grows, the “winning” reader fires sooner — but the expected total remains 1. No other distribution has this scaling property while remaining memoryless (no state required per-reader).

Four production postmortems

Reddit 2017. A feed-cache change shipped without single-flight. Result: 3× DB peaks at every TTL boundary. On-call paged on DB CPU 30 minutes after deploy; rolled back within 1 hour.

Shopify 2020 (Black Friday). A flash-sale homepage had TTL=30 s and no SWR. At 12:00:30 the TTL fired under 10× normal traffic. The entire storefront edge had multi-second outages at every TTL boundary for the first 10 minutes of the sale. Fixed by deploying stale-grace=5 min on the next push.

Twitter 2022. An unrelated cache invalidation bug triggered 30M concurrent rebuild attempts. The DB connection pool saturated for 4 hours. Recovery required load-shedding + manual cache seed.

Cloudflare 2024. A KV cache coalescing bug briefly allowed N requests for the same key to bypass coalescing during a deploy window. The origin saw proportional load during the window. Fixed by deploying a guard flag that disabled coalescing rollout until the bug was patched.

Pattern across all four: stampede protection was either absent, misconfigured, or had a bug that defeated it under load. The cost was proportional to the protection gap.

Full-stack composition design

A production cache stack for a global high-traffic site needs protection at every tier:

Tier	Mechanism	Stampede reduction
CDN edge	SWR + request coalescing	N concurrent misses → 1 origin request
Application cache (Redis)	XFetch + single-flight + lock	N concurrent misses → 1 rebuild per TTL window
DB	Circuit-breaker, read replicas	Origin load shedding during overload

Each tier must independently bound its herd — a failure at any tier cascades to the one below. SWR at the CDN absorbs 99% of TTL-boundary traffic. XFetch prevents the next layer’s stampede by refreshing before expiry. Single-flight collapses per-node herds. The Redis lock collapses cross-node herds for the hottest 1% of keys. TTL jitter desynchronises multi-key boundaries. Together they reduce 10,000 concurrent misses to 1 rebuild anywhere in the stack.

Defense in depth: each tier independently drops the herd by ~10×, so 10,000 simultaneous misses collapse to a single rebuild — no one layer has to be perfect.

▸Why this works

Bouman et al. (SOSP 2024) formalised the metastable failure pattern as a class: a system in a metastable failure is stable in both the healthy and storm states, and a perturbation moves it between them. The transitions are: healthy → storm (stampede + retry amplification) and storm → healthy (external intervention). Their key result: any system with at-least-once retries and no load-shedding can be pushed into metastability by a sufficiently large transient overload. The fix is an explicit “kill the herd” mechanism (503-on-overload, circuit-breaker) that is separate from the caching logic.

Quiz

A system enters metastable failure after a stampede. DB CPU stays at 100% for 4 hours. The cache is empty. What is the correct diagnosis?

Quiz

A Redis lock uses EX=10 s, but rebuilds can take up to 15 s. A rebuild starts at T=0 s. What is the exact failure at T=10 s?

Quiz

A global news site needs stampede protection at every cache tier. Which stack satisfies: p99 latency under 200 ms during 10× viral spikes AND DB never sees more than 5× steady-state QPS?

Each component follows correct local behaviour, yet the retries → DB-saturation → timeouts → more-retries cycle is self-sustaining. Only an external kill (503 gate, circuit-breaker, manual warmup) breaks it.

Recall before you leave

01
Explain the metastable failure shape that follows an unmitigated cache stampede and why the system cannot self-recover.
02
A Redis lock with EX=10 s produces duplicate writes when rebuilds exceed 10 s. Explain the three-component fix.
03
Why does XFetch's 'expected 1 refresh per TTL window' property hold regardless of fleet size, while a distributed lock requires explicit cross-node coordination?

Recap

A cache stampede that overloads the DB can escalate into a metastable failure lasting hours: client retries from timed-out requests sustain the load that prevents DB recovery, which prevents cache rebuild, which generates more retries. Breaking the loop requires an external mechanism — 503-on-overload at the gateway, circuit-breaking the rebuild path, or manual cache seeding. Distributed lock races (EX too short) require three defences: extended EX, fencing token verification before write, and monotonic version on cache keys. The full production stack layers SWR at the CDN edge, XFetch at the application cache, single-flight per process, and a Redis lock for the hottest keys — each tier reducing the herd by an order of magnitude. Four real incidents (Reddit 2017, Shopify 2020, Twitter 2022, Cloudflare 2024) all share one pattern: protection missing, misconfigured, or defeated under load. Now when you see a system that will not recover despite the stampede being over, your first instinct should be to shed load at the gateway — not to restart the cache cold.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

Detecting stampedes and designing TTL for productionmiddle

appears again in228

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.