Caching
Cache stampede: multiple-choice review
Six questions that cut across the whole unit. None asks you to recite a definition; each mirrors a decision you make mid-incident, when the DB is on fire and you must pick the mitigation whose scope actually matches the herd.
Confirm you can connect the burst shape of a stampede to the right mitigation: which collapses the per-process herd, which collapses the per-fleet herd, which needs no coordination, which eliminates the wait, and which failure mode no caching layer alone can escape.
A homepage cached at TTL=60 s absorbs 5,000 RPS. The same endpoint with no cache also sees 5,000 RPS. Why can the cached version cause an outage the uncached one never would?
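To put a rough number on the burst while you reason about this one, here is a back-of-the-envelope sketch. The 200 ms rebuild latency is an assumption made up for illustration; the question does not give one. The function name `herd_size` is likewise hypothetical.

```python
def herd_size(rps: float, rebuild_seconds: float) -> int:
    """Requests that arrive, and miss, while the first rebuild is still running."""
    return round(rps * rebuild_seconds)

# With a working cache the DB sees about one rebuild per TTL window.
steady_state_db_qps = 1 / 60

# At the expiry instant, every request that lands before the first rebuild
# finishes is a miss. 0.2 s is an assumed rebuild latency, not a given.
burst = herd_size(rps=5_000, rebuild_seconds=0.2)

print(f"steady state: {steady_state_db_qps:.3f} queries/s")
print(f"burst at expiry: {burst} simultaneous rebuild queries")
```

Under that assumption the DB jumps from a small fraction of one query per second to about a thousand identical, simultaneous rebuilds. Compare that with the capacity each version of the system was sized for.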
A 50-node fleet uses in-process single-flight only. At a TTL boundary, 100,000 concurrent misses for the same key arrive, spread evenly across the nodes. How many rebuild queries reach the DB, and why?
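If you want to check your reasoning empirically, the following is a minimal asyncio sketch of in-process single-flight, with one instance per simulated node. The class `SingleFlight`, the function `simulate`, and the key name `"homepage"` are illustrative names, not part of any particular library. The run is scaled down to 40 misses per node to keep it fast; the count of DB queries does not depend on how many misses each node receives.

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent loads of the same key within ONE process."""

    def __init__(self):
        self._inflight = {}  # key -> asyncio.Future shared by leader and followers

    async def do(self, key, loader):
        fut = self._inflight.get(key)
        if fut is not None:
            return await fut  # follower: wait for the leader's result
        # Leader: register the in-flight future before the first await.
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            value = await loader()
            fut.set_result(value)
            return value
        except Exception as exc:
            fut.set_exception(exc)  # followers see the same failure
            raise
        finally:
            del self._inflight[key]
        # Simplification: task cancellation of the leader is not handled here.

async def simulate(nodes, misses_per_node):
    db_queries = 0

    async def rebuild():
        nonlocal db_queries
        db_queries += 1            # one query actually reaches the DB
        await asyncio.sleep(0.01)  # pretend the rebuild takes a while
        return "homepage-html"

    flights = [SingleFlight() for _ in range(nodes)]  # no sharing across nodes
    await asyncio.gather(*(
        flights[n].do("homepage", rebuild)
        for n in range(nodes)
        for _ in range(misses_per_node)
    ))
    return db_queries

db_queries = asyncio.run(simulate(nodes=50, misses_per_node=40))
print(db_queries)
```

Each `SingleFlight` instance knows only about its own process, which is exactly the scope limit the question is probing.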
You must protect two keys: an ultra-hot homepage read thousands of times per second, and a cold per-user report read once every ~30 s with TTL=60 s. Which mitigation fits which key?
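To see how read frequency interacts with probabilistic early refresh, here is a sketch of the commonly cited XFetch rule: refresh when `now - delta * beta * ln(u) >= expiry`, with `u` uniform on (0, 1]. The rebuild time `delta = 0.5` s and the read rate of 1,000 per second are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random

def should_refresh_early(now, expiry, delta, beta=1.0, rng=random.random):
    """XFetch-style check: True means this reader should rebuild now."""
    u = 1.0 - rng()  # rng() is in [0, 1), so u is in (0, 1] and log(u) is defined
    return now - delta * beta * math.log(u) >= expiry

def early_refresh_probability(seconds_to_expiry, delta, beta=1.0):
    """Chance that a single read this far before expiry triggers a refresh."""
    return min(1.0, math.exp(-seconds_to_expiry / (delta * beta)))

expiry, delta = 60.0, 0.5  # delta: assumed rebuild time in seconds

# Hot key: a read every millisecond during the last two seconds before expiry.
rng = random.Random(0).random  # seeded so the run is repeatable
refreshed_at = None
for i in range(2000):
    now = 58.0 + i / 1000
    if should_refresh_early(now, expiry, delta, rng=rng):
        refreshed_at = now
        break

# Cold key: its only read inside the window falls 30 s before expiry.
cold_probability = early_refresh_probability(30.0, delta)

print("hot key refreshed at t =", refreshed_at)
print("cold key early-refresh probability =", cold_probability)
```

The rule only runs when a read arrives. A key with a read landing close to expiry every few milliseconds gets many chances; a key read once every ~30 s gets essentially none.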
A lock-based cache and a stale-while-revalidate cache both reduce DB load at a TTL boundary to one rebuild. What is the key difference a user feels?
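The stale-while-revalidate side of this comparison can be sketched as follows, assuming an asyncio service and an injected clock so the run is deterministic. `SWRCache` and `demo` are illustrative names. One simplification to note: concurrent cold misses are not coalesced here, so in practice you would combine this with single-flight.

```python
import asyncio

class SWRCache:
    """Serve the stale value immediately and refresh it in the background."""

    def __init__(self, loader, ttl, clock):
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._entries = {}     # key -> (value, fresh_until)
        self._refreshing = {}  # key -> background refresh task

    async def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            # Cold miss: nothing stale to serve, so this caller has to wait.
            return await self._refresh(key)
        value, fresh_until = entry
        if self._clock() >= fresh_until and key not in self._refreshing:
            # Stale: start exactly one background rebuild, do not wait for it.
            task = asyncio.create_task(self._refresh(key))
            self._refreshing[key] = task
            task.add_done_callback(lambda _t: self._refreshing.pop(key, None))
        return value  # fresh or stale, returned without waiting

    async def _refresh(self, key):
        value = await self._loader(key)
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

async def demo():
    now = [0.0]  # fake clock, advanced by hand
    calls = []

    async def loader(key):
        calls.append(key)
        await asyncio.sleep(0.01)  # pretend rebuild latency
        return f"v{len(calls)}"

    cache = SWRCache(loader, ttl=60.0, clock=lambda: now[0])
    first = await cache.get("home")  # cold miss, waits for the load
    now[0] = 61.0                    # cross the TTL boundary
    stale = await asyncio.gather(*(cache.get("home") for _ in range(100)))
    await asyncio.sleep(0.05)        # give the background refresh time to land
    fresh = await cache.get("home")
    return first, set(stale), fresh, len(calls)

first, stale_values, fresh, loads = asyncio.run(demo())
print(first, stale_values, fresh, loads)
```

All 100 readers at the boundary return at once with the old value, and only one rebuild runs. With a lock, the same 100 readers would get the new value, but only after waiting for that rebuild to finish.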
A profile service caches lookups for 5 minutes. An attacker sends 100,000 requests/s for random non-existent user IDs. The positive-key protection (lock + single-flight) does nothing. Why, and what is the fix?
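A minimal sketch of negative caching, for checking your answer: a not-found result is stored under a sentinel with its own, shorter TTL, so a repeat request for the same missing ID is answered from the cache. The class name `NegativeCache`, the 30 s negative TTL, and the in-memory `users` table are assumptions for illustration only.

```python
import time

_MISSING = object()  # sentinel meaning "we looked, and it does not exist"

class NegativeCache:
    def __init__(self, lookup, ttl=300.0, negative_ttl=30.0, clock=time.monotonic):
        self._lookup = lookup            # returns None when the ID is not found
        self._ttl = ttl                  # 5 minutes for real profiles
        self._negative_ttl = negative_ttl  # assumed shorter TTL for "not found"
        self._clock = clock
        self._entries = {}               # user_id -> (value or _MISSING, expires_at)

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now < entry[1]:
            return None if entry[0] is _MISSING else entry[0]
        value = self._lookup(user_id)    # this is the call that costs a DB query
        if value is None:
            self._entries[user_id] = (_MISSING, now + self._negative_ttl)
        else:
            self._entries[user_id] = (value, now + self._ttl)
        return value

users = {"u1": {"name": "Ada"}}  # stand-in for the profile table
db_calls = []

def db_lookup(user_id):
    db_calls.append(user_id)
    return users.get(user_id)

cache = NegativeCache(db_lookup)
results = [cache.get("ghost") for _ in range(1000)]  # same missing ID, 1000 times
hit = cache.get("u1")

print(len(db_calls), results[0], hit)
```

Two caveats apply. The dictionary here is unbounded, and an attacker can fill it with negative entries, so a real cache needs a size limit and eviction. Negative entries also only pay off when a missing ID repeats; for IDs that never repeat, an existence check in front of the cache (for example a Bloom filter or ID-format validation) is the usual complement.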
A 10-second stampede ended four hours ago, yet the DB is still pinned at 100% CPU with an empty cache. Why will the system not recover on its own, and what does recovery require?
The through-line of the unit is one matching exercise: the stampede’s burst shape — the full traffic rate concentrated at one expiry instant — determines which mitigation fits. Single-flight collapses the per-process herd for free; a distributed lock collapses the per-fleet herd at a round-trip’s cost; XFetch refreshes hot keys before expiry with no coordination; SWR serves stale and refreshes in the background for zero wait; negative caching stops miss-storm amplification; and once a stampede tips the system into a retry-driven metastable failure, no caching layer alone gets it back — only an external kill signal does.