Detecting stampedes and designing TTL for production

The observability fingerprint of a cache stampede, minimum-viable dashboard metrics, TTL jitter to spread boundaries, negative caching, and pre-warming strategies to survive cache-cold restarts.

CACHE Middle ◷ 13 min

A team deploys single-flight and locks. Three weeks later an on-call alert fires: DB CPU spiking every 5 minutes. The protection is in place — but it is protecting the wrong keys. The spike is coming from a different key with TTL=300 s that nobody instrumented.

The observability fingerprint

Cache stampedes leave a distinctive signature in metrics:

DB query rate: a sawtooth pattern — near zero between TTL boundaries, a sharp spike at each boundary. The spike width equals the rebuild duration.
Periodicity: spikes recur at intervals matching the TTL. TTL=60 s → spikes every 60 s. TTL=300 s → spikes every 5 minutes.
p99 latency: spikes at the same intervals as the DB query rate.
cache_miss_total rate: sharp periodic increases at boundaries instead of a smooth low baseline.

Without these instrumented, a stampede looks like a general “DB slowness” incident with no obvious cause.
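The periodicity check can be automated. The sketch below is illustrative only: it assumes the DB query rate is already available as a list of per-second samples, and the helper names (`spike_onsets`, `spike_period`, `matching_ttls`) and the 5× threshold are hypothetical choices, not part of any monitoring product.

```python
from statistics import median


def spike_onsets(samples, factor=5.0):
    """Return indices where the series first rises above factor x its median."""
    baseline = max(median(samples), 1.0)  # guard against a zero baseline on idle series
    threshold = factor * baseline
    onsets = []
    above = False
    for i, value in enumerate(samples):
        if value > threshold and not above:
            onsets.append(i)  # record only the rising edge of each spike
        above = value > threshold
    return onsets


def spike_period(samples, step_seconds=1, factor=5.0):
    """Median gap between spike onsets in seconds, or None with fewer than two spikes."""
    onsets = spike_onsets(samples, factor)
    if len(onsets) < 2:
        return None
    gaps = [b - a for a, b in zip(onsets, onsets[1:])]
    return median(gaps) * step_seconds


def matching_ttls(period, known_ttls, tolerance=0.05):
    """Return the TTLs whose value is within tolerance of the observed period."""
    return [ttl for ttl in known_ttls if abs(period - ttl) <= tolerance * ttl]


# Synthetic sawtooth: 10 QPS baseline, a 3-second burst of 4,000 QPS every 60 s.
series = [4000 if t % 60 < 3 else 10 for t in range(600)]
period = spike_period(series)                    # 60
suspects = matching_ttls(period, [30, 60, 300])  # [60]
```

If the measured period matches a TTL configured somewhere in the codebase, that key family is the first place to look.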

Minimum-viable dashboard

You can deploy every mitigation from the previous lessons and still get paged at 3 am — because you protected the wrong keys. What tells you which key is stampeding, how bad it is, and whether your fix is working? Six metrics, charted in one place.

Each metric has an alert condition and a specific meaning:

Metric	Alert condition	What it signals
`cache_miss_total` rate	Periodic spikes > 5× steady-state	Stampede in progress
`db_query_rate` p99	Sawtooth pattern	Downstream stampede from cache boundaries
`cache_rebuild_duration_seconds` p99	Long tail at boundaries	Rebuild contention
`cache_lock_wait_seconds` p99	Above rebuild p99	Lock queue building — waiter starvation
`singleflight_subscriber_count` p99	> 1 (coalescing active)	Single-flight firing — normal under load
`cache_swr_stale_serve_total`	Non-zero during boundaries	SWR is absorbing expiry — expected

Alert 1: cache_miss_total rate above 5× steady-state → stampede forming.
Alert 2: db_query_rate p99 above 10× p50 → sawtooth DB load from boundary spikes.
Alert 3: cache_lock_wait_seconds p99 above rebuild duration p99 → lock queue depth growing.
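The three alert conditions are simple comparisons, so they can be prototyped before being written as rules in a real alerting system. This is a sketch under assumptions: the `Snapshot` fields and alert names are hypothetical, and the thresholds are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One evaluation window of dashboard values (field names are illustrative)."""
    miss_rate: float           # current cache_miss_total rate, per second
    miss_rate_baseline: float  # steady-state miss rate, per second
    db_qps_p99: float
    db_qps_p50: float
    lock_wait_p99: float       # seconds
    rebuild_p99: float         # seconds


def firing_alerts(s):
    """Return the names of the alerts whose condition holds for this snapshot."""
    alerts = []
    if s.miss_rate > 5 * s.miss_rate_baseline:
        alerts.append("stampede_forming")
    if s.db_qps_p99 > 10 * s.db_qps_p50:
        alerts.append("sawtooth_db_load")
    if s.lock_wait_p99 > s.rebuild_p99:
        alerts.append("lock_queue_growing")
    return alerts


calm = Snapshot(20, 20, 120, 100, 0.05, 0.40)
boundary = Snapshot(400, 20, 4000, 100, 0.90, 0.40)
```

Here `firing_alerts(calm)` returns an empty list, while `firing_alerts(boundary)` returns all three alert names.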

TTL jitter

Single-flight and locks handle per-key stampedes. But what if 1,000 keys all have TTL=300 s and they were all cached at the same time? They all expire together, producing 1,000 simultaneous per-key stampedes — single-flight correctly handles each one, but the sum of 1,000 concurrent rebuilds is a DB spike.

TTL jitter: instead of a fixed TTL, use a random value in a range:

ttl = base_ttl * (1 + jitter_fraction * (2 * rand() - 1))  # rand() is uniform in [0, 1)
# Example: base=300, jitter=0.25 → TTL in range [225, 375]

A fleet of 1,000 keys with ±25% jitter spreads expiries over 150 s instead of all firing at once. DB load becomes a smooth low curve instead of a spike.
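A minimal Python sketch of the ±25% example, assuming `jitter_fraction` is the half-width of the range (0.25 means ±25%). The function name `jittered_ttl` and the 10-second bucketing are illustrative choices.

```python
import random


def jittered_ttl(base_ttl, jitter_fraction, rng=random):
    """Draw a TTL uniformly from base_ttl * (1 - jitter_fraction, 1 + jitter_fraction)."""
    return base_ttl * (1 + jitter_fraction * (2 * rng.random() - 1))


# 1,000 keys cached at the same instant with base TTL 300 s and +/-25% jitter.
rng = random.Random(42)  # seeded only to make the sketch repeatable
ttls = [jittered_ttl(300, 0.25, rng) for _ in range(1000)]

# Count how many keys expire in each 10-second window.
buckets = {}
for ttl in ttls:
    window = int(ttl // 10) * 10
    buckets[window] = buckets.get(window, 0) + 1

busiest_window = max(buckets.values())
```

Without jitter, all 1,000 expiries fall in the same second. With it, they are distributed across the 225 s to 375 s window, so `busiest_window` is a small fraction of the fleet.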

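Jitter is usually applied in application code rather than by the cache itself: with Redis, compute the jittered value and pass it as the expiry (for example the EX argument of SET); with Caffeine in Java, a custom Expiry can return a different duration per entry. A spread of ±15–25% is sufficient for most workloads.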

Negative caching

The same stampede shape applies when the database answers “no such row.” If the application does not cache null results, every request for a non-existent key hits the DB — and under high concurrency this is a miss-storm that overloads the DB as fully as a positive-key stampede.

Fix: cache the “missing” sentinel with a short TTL.

# On DB miss: store a reserved sentinel under the same key the reader uses
SET key "__missing__" EX 10  # 10 s negative TTL

# On read:
val = GET key
if val == "__missing__":
  return NOT_FOUND  # from cache, no DB hit

A short negative TTL (5–30 s) bounds both the memory spent on sentinels and the time a newly created row can still appear missing. The positive TTL can be much longer (60–300 s). Write-through invalidation must delete the negative entry when a real row is inserted.
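The read path and the invalidation rule can be sketched end to end in Python. Everything here is an assumption made for illustration: `TTLCache` is a tiny in-process stand-in for Redis, `MISSING` is an in-process sentinel object (a Redis-backed version would need a reserved string instead), and `lookup` and `insert_row` are hypothetical helper names.

```python
import time

MISSING = object()  # sentinel meaning "the DB said this row does not exist"


class TTLCache:
    """Minimal in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def delete(self, key):
        self._data.pop(key, None)


def lookup(cache, db, key, positive_ttl=300, negative_ttl=10):
    """Read-through lookup that also caches 'not found' answers."""
    cached = cache.get(key)
    if cached is MISSING:
        return None  # known-missing: answered from cache, no DB hit
    if cached is not None:
        return cached
    row = db(key)
    if row is None:
        cache.set(key, MISSING, negative_ttl)  # short-lived negative entry
    else:
        cache.set(key, row, positive_ttl)
    return row


def insert_row(cache, rows, key, value):
    """Write path: store the row, then drop any negative entry for the key."""
    rows[key] = value
    cache.delete(key)
```

With this in place, repeated lookups of the same non-existent key reach the DB once per negative TTL instead of once per request, and a freshly inserted row becomes visible immediately rather than after the sentinel expires.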

One TTL does not fit every entry: short-lived 'missing' sentinels, a jitter window that desynchronises a fleet of boundaries, and long-lived hot rows are three separate design knobs.

Security note: without negative caching, an attacker can mint random non-existent keys (random UUIDs in a URL path) to amplify DB load by orders of magnitude — a documented pattern that affected CDN-backed sites in 2024.

Pre-warming after restarts

A cache that restarts cold (deploy, eviction, machine failure) starts empty. Every incoming request misses and hits the DB: a full-traffic origin spike. If the restart happens at peak traffic, this spike is at least as large as a full stampede, because every key is cold at once.

Pre-warming procedure:

Before accepting public traffic, replay the top-N most-accessed keys from an audit log or access log.
Warm the cache first, then cut traffic over to it.
For blue-green deploys: warm the green cache instance to steady-state before switching the load balancer.

Cloudflare edges pre-warm from neighbouring POPs. Redis-backed services use a startup script reading “top 1,000 keys” from an audit table. The rule: never restart a cache under live traffic without pre-warming.
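A startup warm-up step might look like the sketch below. It assumes the access log is available as a plain list of requested keys and that the cache behaves like a dict; `top_keys`, `prewarm`, `ready_for_traffic` and the 90% readiness ratio are hypothetical names and values chosen for illustration.

```python
from collections import Counter


def top_keys(access_log, n):
    """Return the n most frequently requested keys, most popular first."""
    return [key for key, _ in Counter(access_log).most_common(n)]


def prewarm(cache, load_from_db, access_log, n=1000):
    """Load the top-n keys into the cache and return how many were warmed."""
    warmed = 0
    for key in top_keys(access_log, n):
        value = load_from_db(key)
        if value is not None:  # skip keys that no longer exist
            cache[key] = value
            warmed += 1
    return warmed


def ready_for_traffic(warmed, target, min_ratio=0.9):
    """Hard gate: only route traffic once enough of the hot set is cached."""
    return target > 0 and warmed / target >= min_ratio


# Example: a tiny access log and a dict standing in for the DB.
log = ["home"] * 5 + ["pricing"] * 3 + ["about"]
database = {"home": "<html>home</html>", "pricing": "<html>pricing</html>"}
cache = {}
warmed = prewarm(cache, database.get, log, n=2)
```

The readiness check is what turns pre-warming from a best-effort script into a gate: the load balancer switch waits until `ready_for_traffic` returns true.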

Why this works

Pre-warming is the most common missed step in cache tier upgrade runbooks. Teams test the lock and SWR logic, but neglect the cold-start window. The cold-start stampede is usually 3–10× worse than a normal TTL-boundary stampede because 100% of keys are cold simultaneously. Runbooks must include a “warm the new cache before routing traffic” step as a hard gate.

Quiz

A service's DB query rate shows sharp spikes every 60 seconds with near-zero load between spikes. What is the most likely cause?

Quiz

A cache stores 5,000 product pages, all cached at the same moment with TTL=300 s. What happens at second 300, even with single-flight protection?

Order the steps

Order the steps to diagnose and fix a 60-second-periodic DB spike:

1 Check DB query rate over time — confirm sawtooth pattern with 60-second periodicity
2 Identify the cache key(s) with TTL=60 s on the hot path
3 Deploy in-process single-flight as the first mitigation
4 Verify spikes drop from 4,000 QPS to 50 QPS (one rebuild per node × 50 nodes)
5 Add TTL jitter ±20% to desynchronise future expiries
6 Add dashboard alert: cache_miss_total rate above 5× steady-state
7 Run a synthetic stampede test in CI: inject 5,000 misses, assert DB query count stays under threshold
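Step 7 can be sketched as a self-contained Python test. This models a single node only, so the expected rebuild count is one (a 50-node fleet with in-process single-flight would see one per node). The `SingleFlight` class and `run_stampede` helper are illustrative, not a specific library's API. The loader re-checks the cache before querying, which closes the race where a late caller becomes a second leader.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = fn()
            except Exception as exc:  # propagate the failure to every waiter
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()
        if "error" in holder:
            raise holder["error"]
        return holder["value"]


def run_stampede(misses=5000, workers=64, rebuild_seconds=0.05):
    """Fire `misses` requests at one cold key; return (db_call_count, results)."""
    cache = {}
    flight = SingleFlight()
    db_calls = []

    def load():
        cached = cache.get("hot")
        if cached is not None:
            return cached  # an earlier leader already filled the cache
        db_calls.append(1)
        time.sleep(rebuild_seconds)  # simulated slow DB rebuild
        cache["hot"] = "page"
        return "page"

    def request(_):
        value = cache.get("hot")
        return value if value is not None else flight.do("hot", load)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(request, range(misses)))
    return len(db_calls), results


db_call_count, results = run_stampede()
assert db_call_count <= 1, "stampede protection regressed"
```

Running the same harness with the `flight.do` call replaced by a direct `load()` is a useful negative control: the DB call count should then rise toward the worker count.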

Quiz

A service caches user profile lookups for 5 minutes. An attacker requests 100,000 random non-existent user IDs per second. What happens without negative caching?

The sawtooth fingerprint: spikes recur at exactly the TTL interval. TTL jitter spreads synchronized boundaries across time, flattening the spikes into a smooth low curve.

Recall before you leave

01 A team has single-flight and Redis locks deployed. What does cache_lock_wait_seconds p99 above the rebuild p99 indicate, and what is the correct fix?
02 Explain why pre-warming is the most important step before traffic cutover in a blue-green cache deployment, and what happens if it is skipped.

Recap

Detecting a cache stampede in production requires instrumented metrics: a sawtooth db_query_rate pattern with periodicity matching the TTL is the canonical fingerprint. Minimum viable observability includes six metrics: miss rate, DB query rate, rebuild duration, lock wait, single-flight subscriber count, and SWR stale serve count. TTL jitter (±15–25%) prevents synchronized multi-key expiry by spreading boundaries across time. Negative caching (short-TTL sentinel for missing rows) prevents miss-storm amplification attacks. Pre-warming the cache before accepting live traffic after a restart prevents cold-start stampede. Together these operational practices close the gap between “mitigations deployed” and “stampedes actually prevented in production.” Now when you see a periodic DB spike in your dashboards, the first thing to check is the spike interval — if it matches a TTL value in your codebase, you have found your stampede.
