
Caching

Detecting stampedes and designing TTL for production

Crux: the observability fingerprint of a cache stampede, minimum-viable dashboard metrics, TTL jitter to spread boundaries, negative caching, and pre-warming strategies to survive cache-cold restarts.

A team deploys single-flight and locks. Three weeks later an on-call alert fires: DB CPU spiking every 5 minutes. The protection is in place — but it is protecting the wrong keys. The spike is coming from a different key with TTL=300 s that nobody instrumented.

The observability fingerprint

Cache stampedes leave a distinctive signature in metrics:

  • DB query rate: a sawtooth pattern — near zero between TTL boundaries, a sharp spike at each boundary. The spike width equals the rebuild duration.
  • Periodicity: spikes recur at intervals matching the TTL. TTL=60 s → spikes every 60 s. TTL=300 s → spikes every 5 minutes.
  • p99 latency: spikes at the same intervals as the DB query rate.
  • cache_miss_total rate: sharp periodic increases at boundaries instead of a smooth low baseline.

Without these instrumented, a stampede looks like a general “DB slowness” incident with no obvious cause.
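The periodicity check can be automated. The following is a minimal sketch, assuming a per-second series of DB query counts is already available; the function name `spike_period` and the 5× threshold are illustrative assumptions, not part of any monitoring product.

```python
from statistics import median

def spike_period(samples, factor=5.0):
    """Return the typical gap (in samples) between spike starts, or None.

    A sample counts as a spike when it exceeds `factor` times the median
    of the series; consecutive spike samples are merged into one spike.
    """
    baseline = max(median(samples), 1.0)  # guard against a zero baseline
    starts = []
    in_spike = False
    for i, value in enumerate(samples):
        is_spike = value > factor * baseline
        if is_spike and not in_spike:
            starts.append(i)  # first sample of a new spike
        in_spike = is_spike
    if len(starts) < 3:
        return None  # too few spikes to call the pattern periodic
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return median(gaps)

# Synthetic series: one sample per second, 10 QPS baseline,
# and a 3-second spike to 4,000 QPS every 60 seconds.
series = [4000 if t % 60 < 3 else 10 for t in range(600)]
print(spike_period(series))
```

If the returned period matches a TTL used on the hot path, that key is a strong candidate for the stampede source.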

Minimum-viable dashboard

Six metrics cover the stampede scenarios described in this lesson:

| Metric | Alert condition | What it signals |
|---|---|---|
| cache_miss_total rate | Periodic spikes > 5× steady-state | Stampede in progress |
| db_query_rate p99 | Sawtooth pattern | Downstream stampede from cache boundaries |
| cache_rebuild_duration_seconds p99 | Long tail at boundaries | Rebuild contention |
| cache_lock_wait_seconds p99 | Above rebuild p99 | Lock queue building (waiter starvation) |
| singleflight_subscriber_count p99 | > 1 (coalescing active) | Single-flight firing (normal under load) |
| cache_swr_stale_serve_total | Non-zero during boundaries | SWR is absorbing expiry (expected) |

  • Alert 1: cache_miss_total rate above 5× steady-state → stampede forming.
  • Alert 2: db_query_rate p99 above 10× p50 → sawtooth DB load → boundary spikes.
  • Alert 3: cache_lock_wait_seconds p99 above rebuild duration → lock queue depth growing.
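The three alert conditions can be written as plain predicates. This is a sketch only: the `Snapshot` type, its field names, and the alert labels are hypothetical, and a real deployment would express the same thresholds in its monitoring system's rule language.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One evaluation window of dashboard values (hypothetical shape)."""
    miss_rate: float           # current cache_miss_total rate
    miss_rate_baseline: float  # steady-state cache_miss_total rate
    db_rate_p99: float         # db_query_rate p99 over the window
    db_rate_p50: float         # db_query_rate p50 over the window
    lock_wait_p99: float       # cache_lock_wait_seconds p99
    rebuild_p99: float         # cache_rebuild_duration_seconds p99

def firing_alerts(s):
    """Return the labels of the alerts whose condition holds."""
    alerts = []
    if s.miss_rate > 5 * s.miss_rate_baseline:
        alerts.append("stampede_forming")
    if s.db_rate_p99 > 10 * s.db_rate_p50:
        alerts.append("sawtooth_db_load")
    if s.lock_wait_p99 > s.rebuild_p99:
        alerts.append("lock_queue_growing")
    return alerts

quiet = Snapshot(20, 20, 150, 100, 0.01, 0.20)
storm = Snapshot(400, 20, 4000, 100, 0.50, 0.20)
print(firing_alerts(quiet))
print(firing_alerts(storm))
```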

TTL jitter

Single-flight and locks handle per-key stampedes. But what if 1,000 keys all have TTL=300 s and they were all cached at the same time? They all expire together, producing 1,000 simultaneous per-key stampedes — single-flight correctly handles each one, but the sum of 1,000 concurrent rebuilds is a DB spike.

TTL jitter: instead of a fixed TTL, use a random value in a range:

ttl = base_ttl * (1 + jitter_fraction * (2 * rand() - 1))  # rand() is uniform in [0, 1)
# Example: base=300, jitter=0.25 → TTL in range [225, 375)

A fleet of 1,000 keys with ±25% jitter spreads expiries over 150 s instead of all firing at once. DB load becomes a smooth low curve instead of a spike.
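A small simulation makes the spread concrete. This is a minimal sketch, assuming a uniform draw scaled to [-1, 1) so that `jitter_fraction` is the half-width of the range; the helper name `jittered_ttl` is illustrative.

```python
import random

def jittered_ttl(base_ttl, jitter_fraction, rng=random):
    """Return a TTL drawn uniformly from base_ttl * (1 ± jitter_fraction)."""
    # 2 * random() - 1 maps [0, 1) onto [-1, 1).
    return base_ttl * (1 + jitter_fraction * (2 * rng.random() - 1))

rng = random.Random(42)  # seeded only to make the demo repeatable
ttls = [jittered_ttl(300, 0.25, rng) for _ in range(1000)]

# All 1,000 keys were cached at t=0, so each TTL is also the expiry time.
print(f"earliest expiry: {min(ttls):.1f} s, latest expiry: {max(ttls):.1f} s")
```

With a fixed TTL, all 1,000 expiries would land on second 300. With jitter, they fall anywhere in a window of up to 150 s.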

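Jitter is usually computed in application code rather than switched on in the cache itself. With Redis, the application calculates the jittered value and passes it as the expiry (for example via EX). In Java, Caffeine's per-entry expiration policy (expireAfter with a custom Expiry) can return a jittered duration. A range of ±15–25% is sufficient for most workloads.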

Negative caching

The same stampede shape applies when the database answers “no such row.” If the application does not cache null results, every request for a non-existent key hits the DB — and under high concurrency this is a miss-storm that overloads the DB as fully as a positive-key stampede.

Fix: cache the “missing” sentinel with a short TTL.

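# On DB miss: store a sentinel under the same key that reads use.
# "__MISSING__" stands for any value that cannot collide with real data.
SET key "__MISSING__" EX 10  # 10 s negative TTL

# On read:
val = GET key
if val == "__MISSING__":
  return NOT_FOUND  # from cache, no DB hit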

Short negative TTL (5–30 s) bounds memory churn. The positive TTL can be much longer (60–300 s). Write-through invalidation must delete the negative entry when a real row is inserted.
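The read path, the two TTLs, and the invalidation rule can be sketched together with an in-memory stand-in for the cache. The class name `NegativeCache`, the `load` callback, and the injectable `clock` are assumptions made for illustration; a Redis-backed version would follow the same branches.

```python
import time

MISSING = object()  # sentinel meaning "the DB said this row does not exist"

class NegativeCache:
    def __init__(self, load, positive_ttl=300.0, negative_ttl=10.0,
                 clock=time.monotonic):
        self._load = load                  # load(key) returns None if no row
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._entries = {}                 # key -> (value or MISSING, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            value = entry[0]
            return None if value is MISSING else value  # served from cache
        value = self._load(key)            # cache miss: one DB lookup
        ttl = self._negative_ttl if value is None else self._positive_ttl
        stored = MISSING if value is None else value
        self._entries[key] = (stored, self._clock() + ttl)
        return value

    def invalidate(self, key):
        """Call on insert or update so a stale negative entry cannot linger."""
        self._entries.pop(key, None)

db = {"u1": "Ada"}
db_calls = []

def load_user(key):
    db_calls.append(key)
    return db.get(key)

now = [0.0]                                # fake clock for the demo
cache = NegativeCache(load_user, clock=lambda: now[0])
cache.get("ghost")
cache.get("ghost")                         # answered by the negative entry
print(len(db_calls))
```

The second lookup for the missing key never reaches the database until the negative TTL expires or the entry is invalidated.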

Security note: without negative caching, an attacker can mint random non-existent keys (random UUIDs in a URL path) to amplify DB load by orders of magnitude — a documented pattern that affected CDN-backed sites in 2024.

Pre-warming after restarts

A cache that restarts cold (deploy, flush, machine failure) starts empty. Every incoming request misses and hits the DB, producing a full-traffic origin spike. If the restart happens at peak traffic, this spike is at least as large as a full TTL-boundary stampede, because every key is cold at once.

Pre-warming procedure:

  1. Before accepting public traffic, replay the top-N most-accessed keys from an audit log or access log.
  2. Warm the cache first, then cut traffic over to it.
  3. For blue-green deploys: warm the green cache instance to steady-state before switching the load balancer.

Cloudflare edges pre-warm from neighbouring POPs. Redis-backed services use a startup script reading “top 1,000 keys” from an audit table. The rule: never restart a cache under live traffic without pre-warming.
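A startup warm-up step might look like the following sketch. It assumes the access log is available as a sequence of requested keys and that `load` fetches a value from the origin; `top_keys` and `prewarm` are hypothetical helper names, and a plain dict stands in for the cache.

```python
from collections import Counter

def top_keys(access_log, n):
    """Return the n most frequently requested keys, hottest first."""
    return [key for key, _ in Counter(access_log).most_common(n)]

def prewarm(cache, load, access_log, n=1000):
    """Fill a dict-like cache with the hottest keys before taking traffic.

    Returns the number of entries written, which a deploy script can
    compare against a minimum before it allows the traffic switch.
    """
    warmed = 0
    for key in top_keys(access_log, n):
        value = load(key)
        if value is not None:      # skip keys that no longer exist
            cache[key] = value
            warmed += 1
    return warmed

origin = {"a": 1, "b": 2, "c": 3}
log = ["a", "b", "a", "c", "a", "b"]
green_cache = {}
print(prewarm(green_cache, origin.get, log, n=2), green_cache)
```

Returning the warmed count gives the runbook a concrete gate: route traffic only once the count reaches the expected level.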

Why this works

Pre-warming is the most common missed step in cache tier upgrade runbooks. Teams test the lock and SWR logic, but neglect the cold-start window. The cold-start stampede is usually 3–10× worse than a normal TTL-boundary stampede because 100% of keys are cold simultaneously. Runbooks must include a “warm the new cache before routing traffic” step as a hard gate.

Quiz

A service's DB query rate shows sharp spikes every 60 seconds with near-zero load between spikes. What is the most likely cause?

Quiz

A cache stores 5,000 product pages, all cached at the same moment with TTL=300 s. What happens at second 300, even with single-flight protection?

Order the steps

Order the steps to diagnose and fix a 60-second-periodic DB spike:

  1. Check DB query rate over time: confirm a sawtooth pattern with 60-second periodicity
  2. Identify the cache key(s) with TTL=60 s on the hot path
  3. Deploy in-process single-flight as the first mitigation
  4. Verify spikes drop from 4,000 QPS to 50 QPS (one rebuild per node × 50 nodes)
  5. Add TTL jitter ±20% to desynchronise future expiries
  6. Add dashboard alert: cache_miss_total rate above 5× steady-state
  7. Run a synthetic stampede test in CI: inject 5,000 misses, assert DB query count stays under threshold
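The synthetic stampede test in the last step can be sketched in-process. This is a scaled-down illustration (100 threads instead of 5,000 requests) with a hand-rolled `SingleFlight` class written for this example; it is not the API of any particular library, and the threshold is an assumption.

```python
import threading
import time

class SingleFlight:
    """Coalesce concurrent loads of the same key into one call (in-process)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, load):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = (threading.Event(), {})
                self._inflight[key] = call
        done, holder = call
        if leader:
            try:
                holder["value"] = load(key)   # the only DB hit for this flight
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()                    # wake every waiting follower
        else:
            done.wait()
        # Simplification: if the leader's load raised, followers get None.
        return holder.get("value")

def run_stampede(n_requests=100, rebuild_seconds=0.1):
    """Fire n_requests concurrent misses for one key; return the DB query count."""
    flight = SingleFlight()
    db_queries = []                           # list.append is atomic in CPython
    start = threading.Barrier(n_requests)

    def load(key):
        db_queries.append(key)
        time.sleep(rebuild_seconds)           # simulated slow rebuild
        return "value-for-" + key

    def request():
        start.wait()                          # release all requests together
        flight.do("hot-key", load)

    threads = [threading.Thread(target=request) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(db_queries)

THRESHOLD = 5  # assumed budget; the ideal result is a single rebuild
print(run_stampede() <= THRESHOLD)
```

A CI job would fail the build when the count exceeds the threshold, which catches a regression where a new code path bypasses single-flight.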
Quiz

A service caches user profile lookups for 5 minutes. An attacker requests 100,000 random non-existent user IDs per second. What happens without negative caching?

Recall before you leave
  1. A team has single-flight and Redis locks deployed. What does cache_lock_wait_seconds p99 above the rebuild p99 indicate, and what is the correct fix?
  2. Explain why pre-warming is the most important step before traffic cutover in a blue-green cache deployment, and what happens if it is skipped.
Recap

Detecting a cache stampede in production requires instrumented metrics: a sawtooth db_query_rate pattern with periodicity matching the TTL is the canonical fingerprint. Minimum viable observability includes six metrics: miss rate, DB query rate, rebuild duration, lock wait, single-flight subscriber count, and SWR stale serve count. TTL jitter (±15–25%) prevents synchronized multi-key expiry by spreading boundaries across time. Negative caching (short-TTL sentinel for missing rows) prevents miss-storm amplification attacks. Pre-warming the cache before accepting live traffic after a restart prevents cold-start stampede. Together these operational practices close the gap between “mitigations deployed” and “stampedes actually prevented in production.”


Trademarks belong to their respective owners. Editorial reference only.