
Caching

Cache stampede: build and tame the herd

Crux: Hands-on project. Build a stampede-prone cache-aside service, reproduce the herd, layer the mitigations, and prove each step with before/after numbers.
Level: senior. Estimated time: 240 min.

Reading about stampedes is not the same as watching your own DB fall over and then bringing it back. Build a small cache-aside service, drive a hot key into a stampede on purpose, and add the unit’s mitigations one layer at a time — measuring the herd before and after each.

Goal

Turn the unit’s mental model into a reproducible loop: reproduce the burst, instrument the fingerprint, then layer single-flight, a distributed lock, SWR, TTL jitter, and negative caching — proving with metrics that each layer reduces the herd it is supposed to.
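A minimal sketch of the first two steps, reproducing the burst and instrumenting it, assuming an in-memory dict in place of Redis and a sleep in place of the slow origin. The names `SlowOrigin`, `NaiveCache`, `burst`, and `reproduce_stampede` are hypothetical and exist only for illustration; the one metric captured is the origin query count.

```python
import threading
import time


class SlowOrigin:
    """Stand-in for the database: slow, and counts every query it serves."""

    def __init__(self, delay_s=0.05):
        self.delay_s = delay_s
        self.calls = 0
        self._lock = threading.Lock()

    def load(self, key):
        with self._lock:
            self.calls += 1
        time.sleep(self.delay_s)  # simulate a slow query
        return f"value-for-{key}"


class NaiveCache:
    """Plain cache-aside with a TTL and no stampede protection."""

    def __init__(self, origin, ttl_s):
        self.origin = origin
        self.ttl_s = ttl_s
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]
        value = self.origin.load(key)  # every concurrent miss lands here
        self._data[key] = (value, time.monotonic() + self.ttl_s)
        return value


def burst(cache, key, n_clients):
    """Fire n_clients concurrent reads of one key, released together."""
    barrier = threading.Barrier(n_clients)

    def worker():
        barrier.wait()
        cache.get(key)

    threads = [threading.Thread(target=worker) for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def reproduce_stampede(n_clients=20):
    """Return (origin queries during the cold burst, during the warm burst)."""
    origin = SlowOrigin()
    cache = NaiveCache(origin, ttl_s=60)
    burst(cache, "hot", n_clients)  # cold key: everyone misses at once
    herd = origin.calls
    burst(cache, "hot", n_clients)  # warm key: served from cache
    return herd, origin.calls - herd
```

Calling `reproduce_stampede()` should show many origin queries for the cold burst and none for the warm one; that gap is the herd each later layer is meant to shrink.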

Project
Objective

Build a cache-aside HTTP service backed by Redis and a deliberately slow origin, reproduce a cache stampede on a hot key, then layer the unit's mitigations until a TTL-boundary burst reaches the origin as a single rebuild — proving every step with before/after measurements, not estimates.
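As a sketch of what "reaches the origin as a single rebuild" can look like inside one process, the following coalesces concurrent misses with a single-flight helper. It assumes an in-memory dict in place of Redis; `SingleFlight`, `_Call`, and `demo` are illustrative names, not part of any library. The rebuild re-checks the cache so that a caller arriving just after the leader finishes does not trigger a second origin query.

```python
import threading
import time


class _Call:
    """Shared state for one in-flight rebuild."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution per process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if not leader:
            call.done.wait()  # follower: wait for the leader's result
            if call.error is not None:
                raise call.error
            return call.value
        try:
            call.value = fn()
            return call.value
        except BaseException as exc:
            call.error = exc
            raise
        finally:
            with self._lock:
                del self._inflight[key]
            call.done.set()


def demo(n_clients=20):
    """Return (origin query count, list of values the clients received)."""
    cache, sf = {}, SingleFlight()
    origin_calls, results = [], []
    barrier = threading.Barrier(n_clients)

    def rebuild():
        if "hot" in cache:  # double-check: an earlier leader may have filled it
            return cache["hot"]
        origin_calls.append(1)
        time.sleep(0.05)  # simulated slow origin
        cache["hot"] = "fresh"
        return cache["hot"]

    def worker():
        barrier.wait()
        value = cache["hot"] if "hot" in cache else sf.do("hot", rebuild)
        results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(origin_calls), results
```

This only bounds the herd within one process. Each additional instance would still run its own rebuild, which is why the project asks for a cross-node layer as well.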

Requirements
Acceptance criteria
  • A before/after table per mitigation: origin queries per TTL boundary, request p99 latency, and cache miss rate — all measured under the identical load test, not estimated.
  • With the full stack enabled, a TTL boundary under sustained hot-key load produces at most a single origin rebuild, and the sawtooth origin-query fingerprint is gone from the metrics.
  • A demonstration that single-flight alone leaves one rebuild per instance, and only adding the cross-node lock (or SWR background refresh) collapses it to one fleet-wide — proving you understand the scope of each layer.
  • A short write-up naming, for each layer, exactly which herd it bounded (per-process, per-fleet, wait-time, multi-key, negative) and why that layer was needed on top of the previous one.
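The scope difference between single-flight and a cross-node lock can be sketched with a small simulation. Each thread below plays the one leader that an instance's single-flight lets through. `FakeRedis` is a hypothetical in-memory stand-in whose `set_nx` mimics Redis `SET key value NX`; it has no expiry, whereas a real lock would also need an EX value and an owner check on release.

```python
import threading
import time


class FakeRedis:
    """In-memory stand-in for the shared store (assumption, not a Redis client)."""

    def __init__(self):
        self._data = {}
        self._mu = threading.Lock()

    def get(self, key):
        with self._mu:
            return self._data.get(key)

    def set(self, key, value):
        with self._mu:
            self._data[key] = value

    def set_nx(self, key, value):
        """Set only if absent; returns True when this caller won."""
        with self._mu:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._mu:
            self._data.pop(key, None)


def instance_leader(store, origin_calls, use_fleet_lock):
    """What one instance's single-flight leader does on a miss."""
    while True:
        value = store.get("cache:hot")
        if value is not None:
            return value
        if not use_fleet_lock or store.set_nx("lock:hot", "owner"):
            try:
                value = store.get("cache:hot")  # double-check after winning
                if value is None:
                    origin_calls.append(1)
                    time.sleep(0.05)  # simulated slow origin
                    value = "fresh"
                    store.set("cache:hot", value)
                return value
            finally:
                if use_fleet_lock:
                    store.delete("lock:hot")
        time.sleep(0.005)  # lost the lock: wait briefly, then re-read the cache


def run_fleet(n_instances, use_fleet_lock):
    """Return the number of origin rebuilds for one simultaneous miss."""
    store, origin_calls = FakeRedis(), []
    barrier = threading.Barrier(n_instances)

    def worker():
        barrier.wait()
        instance_leader(store, origin_calls, use_fleet_lock)

    threads = [threading.Thread(target=worker) for _ in range(n_instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(origin_calls)
```

In this sketch `run_fleet(4, use_fleet_lock=False)` typically yields one rebuild per instance, while `run_fleet(4, use_fleet_lock=True)` yields one for the whole fleet. The waiting losers also make the remaining cost visible: they poll for the duration of the rebuild, which is the wait that SWR is meant to remove.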
Senior stretch
  • Implement XFetch probabilistic early expiration on the hot key and show it refreshes before the boundary with ~1 early rebuild per window and zero misses at expiry; then show it underperforms on a cold key read once per TTL.
  • Add a fencing-token (or monotonic-version) check on the rebuild write and craft a test where the lock's expiry (its EX value) is shorter than a slow rebuild, proving the guard rejects the stale duplicate write.
  • Reproduce a metastable failure: add client retries with short backoff, push the origin to saturation, and show it stays pinned after load stops; then break the loop with a 503-on-overload gate and measure recovery time.
  • Add the minimum-viable dashboard with alerts (miss-rate spike, sawtooth origin rate, lock-wait above rebuild p99) and a one-page on-call runbook: triage from the panels, the mitigation ladder, and a pre-warm-before-cutover gate.
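For the XFetch stretch item, the refresh decision itself is small. The sketch below follows the usual formulation of probabilistic early expiration: refresh when `now - delta * beta * ln(u) >= expires_at`, with `u` drawn uniformly from (0, 1], `delta` the measured rebuild time, and `beta` a tuning knob (1.0 by default). The function names are illustrative, and the closed-form probability is derived here from that rule.

```python
import math
import random


def should_refresh_early(now, expires_at, delta, beta=1.0, rng=random.random):
    """Decide whether this read should rebuild the value before it expires.

    delta is the observed rebuild duration in seconds; a larger beta
    refreshes earlier. rng is injectable so tests can be deterministic.
    """
    u = 1.0 - rng()  # random.random() is in [0, 1), so u is in (0, 1]
    return now - delta * beta * math.log(u) >= expires_at


def early_refresh_probability(gap, delta, beta=1.0):
    """Chance that a single read `gap` seconds before expiry triggers a refresh.

    From the rule above: P(-delta * beta * ln(u) >= gap) = exp(-gap / (delta * beta)).
    """
    return math.exp(-gap / (delta * beta))
```

Under these assumptions, a read arriving one rebuild-time before expiry refreshes with probability of about 0.37, and the chance decays exponentially further out. A hot key gets many such reads near the boundary, so one of them is very likely to refresh early. A cold key read once per TTL almost never does, which is the underperformance the stretch item asks you to show.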
Recap

This is the loop you will run on any real cache tier: reproduce the burst before you trust a fix, instrument the fingerprint, then add mitigations in scope order — single-flight for the per-process herd, a lock for the per-fleet herd, SWR for the wait, jitter for synchronized multi-key expiry, negative caching for the miss storm — and verify each with before/after numbers under identical load. Doing it once on a toy service is what makes the production version, and the 3am incident, muscle memory.
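The last two layers in that list, TTL jitter and negative caching, fit in a few lines. This is a sketch under stated assumptions: a plain dict as the cache, a caller-supplied clock value `now`, and a 10% jitter spread with a 5-second negative TTL chosen only for illustration. `MISSING`, `jittered_ttl`, and `lookup` are hypothetical names.

```python
import random

MISSING = object()  # sentinel cached for keys the origin does not have


def jittered_ttl(base_ttl_s, spread=0.1, rng=random.random):
    """Scale the base TTL by a random factor in [1 - spread, 1 + spread)."""
    return base_ttl_s * (1.0 + spread * (2.0 * rng() - 1.0))


def lookup(cache, key, load, now, ttl_s=60.0, negative_ttl_s=5.0):
    """Cache-aside read that also remembers 'not found' for a short time.

    cache maps key -> (value, expires_at); load(key) returns None when
    the origin has no such key.
    """
    entry = cache.get(key)
    if entry is not None and entry[1] > now:
        value = entry[0]
        return None if value is MISSING else value
    value = load(key)
    if value is None:
        # Short negative TTL: absorbs repeated misses without hiding new data for long.
        cache[key] = (MISSING, now + negative_ttl_s)
        return None
    # Jitter spreads expiries so keys written together do not expire together.
    cache[key] = (value, now + jittered_ttl(ttl_s))
    return value
```

Repeated lookups of a nonexistent key then reach the origin at most once per negative TTL, and keys populated in the same batch expire across a window rather than on the same instant.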


Trademarks belong to their respective owners. Editorial reference only.