Caching
Dogpile: build and tame the herd
Reading about dogpiles is not the same as pulling a service out of one. Stand up a small service with one expensive hot key, drive a real herd into the origin at the expiry instant, then implement single-flight, a leased distributed lock, and XFetch — measuring the origin fan-out at every step.
Turn the unit’s mental model into a reproducible loop: reproduce the herd, then add coalescing, a distributed lock with a renewed lease and a fenced write, and probabilistic early expiry — proving with origin-side counters that each one collapses N concurrent recomputes toward one.
Build a small cache-aside service with one expensive hot key, reproduce a measurable dogpile at the expiry instant under concurrent load, then implement and measure three mitigations — single-flight, a leased distributed lock, and XFetch — proving each collapses the origin recompute count with before/after numbers.
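The baseline herd and the first mitigation can be sketched in a single process. This is a minimal sketch, not the lab's reference solution: `Origin`, `NaiveCache`, `SingleFlightCache`, and `herd` are hypothetical names, threads stand in for concurrent requests, a plain dict stands in for the cache, and the "expiry instant" is modeled as a cold cache with every worker released by a barrier at once.

```python
import threading
import time


class Origin:
    """Hypothetical expensive backend that counts how often it is hit."""

    def __init__(self, cost_s=0.05):
        self.cost_s = cost_s
        self.calls = 0
        self._lock = threading.Lock()

    def load(self, key):
        with self._lock:
            self.calls += 1          # origin-side counter: the number we measure
        time.sleep(self.cost_s)      # simulate the expensive recompute
        return f"value-for-{key}"


class NaiveCache:
    """Cache-aside with no herd protection: every concurrent miss hits the origin."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def get(self, key):
        if key in self.store:
            return self.store[key]
        value = self.origin.load(key)
        self.store[key] = value
        return value


class SingleFlightCache(NaiveCache):
    """Coalesces concurrent misses for one key into a single origin call."""

    def __init__(self, origin):
        super().__init__(origin)
        self._mu = threading.Lock()
        self._inflight = {}          # key -> Event set when the leader finishes

    def get(self, key):
        with self._mu:
            if key in self.store:
                return self.store[key]
            done = self._inflight.get(key)
            leader = done is None
            if leader:
                done = self._inflight[key] = threading.Event()
        if not leader:
            done.wait()              # follower: wait for the leader's result
            return self.get(key)     # retry; if the leader failed, one follower leads
        try:
            value = self.origin.load(key)
            with self._mu:
                self.store[key] = value
            return value
        finally:
            with self._mu:
                del self._inflight[key]
            done.set()


def herd(cache, n=50):
    """Release n workers at the same instant against a cold hot key."""
    barrier = threading.Barrier(n)

    def worker():
        barrier.wait()
        cache.get("hot")

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache.origin.calls


if __name__ == "__main__":
    print("naive origin calls:", herd(NaiveCache(Origin())))
    print("single-flight origin calls:", herd(SingleFlightCache(Origin())))
```

The number that matters is `Origin.calls`, counted on the origin side and not inferred from the cache. Single-flight only coalesces within one process, so a fleet of N instances can still produce up to N recomputes per expiry.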
- A before/after table across all four strategies: origin recomputes per expiry, p99 latency, and max staleness — measured under the same concurrent load, not estimated.
- The distributed-lock variant shows ~1 origin recompute per expiry fleet-wide, and the fault-injection tests prove (a) when the holder is killed, its lock is released automatically by TTL expiry with no deadlock, and (b) a holder paused past its lease cannot overwrite the newer value (its fenced write is rejected).
- The XFetch variant shows the hot key recomputing early and alone before its TTL hits zero, keeping origin recomputes near 1 with no lock held.
- A one-paragraph write-up choosing a default strategy for this workload and justifying it against staleness tolerance, instance count, and recompute cost — and naming the failure each guard (lock TTL, lease renewal, fencing token) defends against.
- Add an on-call runbook: how to spot a dogpile in metrics (origin recompute spikes aligned with TTL boundaries), a decision tree for choosing among single-flight, a distributed lock, and XFetch, and the lock-deadlock escape hatch (always set a TTL on the lock).
- Add TTL jitter to a batch of cold keys and show that it staggers their combined expiry, then show that it does nothing for the single hot key, making the scope of jitter concrete.
- Add stale-while-revalidate (soft-TTL / hard-TTL split) so waiters never block: serve the stale value instantly and refresh in the background, and compare its staleness/latency profile against the blocking lock.
- Run the distributed-lock variant against a Redis failover (or Redlock across nodes) and document what happens to the exactly-one guarantee during a partition — connecting back to why fencing tokens matter.
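The lock guards named in the criteria above (lock TTL, lease renewal, conditional release, fencing token) can be exercised without a real Redis. The sketch below is an in-memory stand-in under stated assumptions: `FakeClock` and `LeaseStore` are hypothetical names, the clock is advanced by hand so that a "paused holder" is deterministic, and the fencing token is a counter bumped on every successful acquire. Against a real Redis the acquire would typically be `SET key token NX PX ttl` and the release a compare-and-delete script; neither is shown here.

```python
import threading
import uuid


class FakeClock:
    """Manually advanced clock so lease expiry can be injected in tests."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class LeaseStore:
    """In-memory stand-in for a lock server plus a fenced cache entry."""

    def __init__(self, clock):
        self.clock = clock
        self._mu = threading.Lock()
        self._owner = None       # random token identifying the current holder
        self._expires = 0.0
        self._fence = 0          # monotonically increasing fencing token
        self.value = None
        self.value_fence = 0     # highest fence that has written so far

    def acquire(self, ttl):
        """Return (owner, fence) or None if a live lease exists."""
        with self._mu:
            if self._owner is not None and self.clock.now < self._expires:
                return None
            self._owner = uuid.uuid4().hex
            self._expires = self.clock.now + ttl
            self._fence += 1
            return self._owner, self._fence

    def renew(self, owner, ttl):
        """Extend the lease only if the caller still holds it."""
        with self._mu:
            if self._owner != owner or self.clock.now >= self._expires:
                return False
            self._expires = self.clock.now + ttl
            return True

    def release(self, owner):
        """Conditional release: never delete a lock someone else now holds."""
        with self._mu:
            if self._owner != owner:
                return False
            self._owner = None
            return True

    def fenced_write(self, fence, value):
        """Reject writes carrying a fence older than the newest one seen."""
        with self._mu:
            if fence < self.value_fence:
                return False
            self.value = value
            self.value_fence = fence
            return True


if __name__ == "__main__":
    clock = FakeClock()
    store = LeaseStore(clock)
    owner_a, fence_a = store.acquire(ttl=5.0)   # holder A takes the lock
    clock.advance(6.0)                          # A is paused past its lease
    owner_b, fence_b = store.acquire(ttl=5.0)   # B acquires after TTL expiry
    store.fenced_write(fence_b, "new")
    print("stale write accepted:", store.fenced_write(fence_a, "stale"))
    print("stale release accepted:", store.release(owner_a))
```

Each guard maps to one failure: the TTL frees the lock after a crashed holder, renewal keeps a slow but live holder from losing it, conditional release stops an expired holder from deleting its successor's lock, and the fence stops that same holder from overwriting the newer value.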
This is the loop you will run when a hot key takes the origin down: reproduce the herd at the expiry instant, measure the origin fan-out, then collapse it. Local single-flight caps recomputes at the instance count; a distributed lock collapses the fleet to one — but only with a TTL longer than the worst-case recompute, a renewed lease, a conditional release, and a fenced write to survive a crashed or paused holder. XFetch dissolves the collision without a lock by recomputing early and alone. Doing it once on a toy service, with origin counters and fault injection, makes the production version muscle memory.
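The lock-free variant can be sketched as follows, assuming the usual XFetch rule: recompute when `now - delta * beta * ln(u) >= expiry`, where `delta` is the last observed recompute time, `beta` tunes eagerness, and `u` is uniform on (0, 1]. The names `should_refresh_early` and `xfetch_get`, the dict-based cache, and the entry layout are hypothetical and for illustration only.

```python
import math
import random
import time


def should_refresh_early(now, expiry, delta, beta=1.0, rng=random.random):
    """XFetch test: True when this caller should recompute before expiry.

    The chance of returning True rises as `now` approaches `expiry`, and it
    is always True once the entry has actually expired.
    """
    u = 1.0 - rng()                      # in (0, 1], so log(u) is defined
    return now - delta * beta * math.log(u) >= expiry


def xfetch_get(cache, key, recompute, ttl, beta=1.0,
               now_fn=time.monotonic, rng=random.random):
    """Cache-aside read with probabilistic early expiry and no lock."""
    entry = cache.get(key)
    now = now_fn()
    if entry is None or should_refresh_early(
            now, entry["expiry"], entry["delta"], beta, rng):
        start = now_fn()
        value = recompute()
        delta = now_fn() - start         # measured recompute cost feeds the rule
        cache[key] = {"value": value, "delta": delta,
                      "expiry": now_fn() + ttl}
        return value
    return entry["value"]


if __name__ == "__main__":
    cache = {}
    calls = []

    def expensive():
        calls.append(1)
        time.sleep(0.01)
        return "fresh"

    for _ in range(100):
        xfetch_get(cache, "hot", expensive, ttl=60.0)
    print("origin recomputes:", len(calls))
```

Because each caller rolls independently, a few requests may still recompute at once close to expiry. XFetch makes a full-fleet collision unlikely but does not guarantee exactly one recompute, which is why the origin counter should still be measured.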