
Caching

Dogpile: build and tame the herd

Crux: a hands-on project. Reproduce a dogpile on one hot key, then build and measure single-flight, a leased distributed lock, and XFetch, proving with before/after numbers that each one tames the herd.
◷ 220 min

Reading about dogpiles is not the same as pulling a service out of one. Stand up a small service with one expensive hot key, drive a real herd into the origin at the expiry instant, then implement single-flight, a leased distributed lock, and XFetch — measuring the origin fan-out at every step.
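As a starting point, the herd can be reproduced in-process before any real service exists. The sketch below is an assumption-laden toy, not the project's reference solution: `Origin`, `NaiveCache`, and `run_herd` are hypothetical names, the origin is simulated with a sleep, and threads stand in for concurrent clients. It shows an unprotected cache-aside read letting every concurrent miss through to the origin.

```python
import threading
import time


class Origin:
    """Hypothetical slow origin that counts how often it is recomputed."""

    def __init__(self, cost_s=0.1):
        self.cost_s = cost_s
        self.calls = 0
        self._lock = threading.Lock()

    def compute(self, key):
        with self._lock:
            self.calls += 1
        time.sleep(self.cost_s)  # stand-in for the expensive recompute
        return f"value-for-{key}"


class NaiveCache:
    """Cache-aside with no herd protection: every miss goes to the origin."""

    def __init__(self, origin, ttl_s):
        self.origin = origin
        self.ttl_s = ttl_s
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]
        # Miss or expired: recompute and store, with nothing to coalesce callers.
        value = self.origin.compute(key)
        self._data[key] = (value, time.monotonic() + self.ttl_s)
        return value


def run_herd(cache, key, n_clients):
    """Release n_clients threads at the same instant against one key."""
    barrier = threading.Barrier(n_clients)

    def client():
        barrier.wait()  # line everyone up on the "expiry instant"
        cache.get(key)

    threads = [threading.Thread(target=client) for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


origin = Origin()
cache = NaiveCache(origin, ttl_s=60)
run_herd(cache, "hot", n_clients=20)  # the key is cold, as if it had just expired
herd_recomputes = origin.calls
print("origin recomputes for one expiry:", herd_recomputes)
```

On a typical run the counter lands at or near the client count, which is the "before" number the mitigations are measured against.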

Goal

Turn the unit’s mental model into a reproducible loop: reproduce the herd, then add coalescing, a distributed lock with a renewed lease and a fenced write, and probabilistic early expiry — proving with origin-side counters that each one collapses N concurrent recomputes toward one.
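The coalescing step can be sketched as a small single-flight helper: the first caller for a key becomes the leader and runs the recompute, and everyone who arrives while it is in flight waits for the leader's result. This is a minimal in-process illustration under assumed names (`SingleFlight`, `_Call`, `slow_recompute`), not a library API.

```python
import threading
import time


class _Call:
    """State shared between the leader and its waiters for one key."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None


class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if not leader:
            call.done.wait()  # follower: block until the leader finishes
            if call.error is not None:
                raise call.error
            return call.value
        try:
            call.value = fn()  # leader: the only caller that hits the origin
        except Exception as exc:
            call.error = exc  # followers re-raise the same failure
            raise
        finally:
            with self._lock:
                del self._inflight[key]
            call.done.set()
        return call.value


calls = {"n": 0}
calls_lock = threading.Lock()


def slow_recompute():
    with calls_lock:
        calls["n"] += 1
    time.sleep(0.2)  # simulated expensive origin work
    return "fresh"


sf = SingleFlight()
n_clients = 20
barrier = threading.Barrier(n_clients)
results = []
results_lock = threading.Lock()


def client():
    barrier.wait()
    value = sf.do("hot", slow_recompute)
    with results_lock:
        results.append(value)


threads = [threading.Thread(target=client) for _ in range(n_clients)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("origin recomputes:", calls["n"], "clients served:", len(results))
```

Because the in-flight table lives in one process, this caps recomputes at one per instance, not one per fleet; that gap is what the distributed lock closes.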

Project
Objective

Build a small cache-aside service with one expensive hot key, reproduce a measurable dogpile at the expiry instant under concurrent load, then implement and measure three mitigations — single-flight, a leased distributed lock, and XFetch — proving each collapses the origin recompute count with before/after numbers.

Requirements
Acceptance criteria
  • A before/after table across all four strategies (the unprotected baseline plus the three mitigations): origin recomputes per expiry, p99 latency, and max staleness, all measured under the same concurrent load, not estimated.
  • The distributed-lock variant shows ~1 origin recompute per expiry fleet-wide, and the fault-injection tests prove (a) a killed holder's lock is auto-released by the TTL with no deadlock and (b) a holder paused past its lease cannot overwrite the newer value (its fenced write is rejected).
  • The XFetch variant shows the hot key recomputing early and alone before its TTL hits zero, keeping origin recomputes near 1 with no lock held.
  • A one-paragraph write-up choosing a default strategy for this workload and justifying it against staleness tolerance, instance count, and recompute cost — and naming the failure each guard (lock TTL, lease renewal, fencing token) defends against.
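The two fault-injection criteria for the lock can be rehearsed without any infrastructure. The sketch below is an in-memory simulation, not a Redis client: `FakeClock`, `LeaseLock`, and `FencedStore` are hypothetical stand-ins, and a real deployment would need the acquire, renew, and release checks to be atomic on the lock service. It shows a lease expiring by TTL, a monotonically increasing fencing token, a conditional release, and a store that rejects writes carrying an old token.

```python
import itertools


class FakeClock:
    """Manually advanced clock so the fault scenario is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class LeaseLock:
    """In-memory stand-in for a lock service that hands out fencing tokens."""

    def __init__(self, clock, ttl_s):
        self.clock = clock
        self.ttl_s = ttl_s
        self._holder = None  # (owner, expires_at)
        self._tokens = itertools.count(1)

    def _live(self):
        return self._holder is not None and self._holder[1] > self.clock.now

    def acquire(self, owner):
        """Return a fencing token, or None while another lease is live."""
        if self._live():
            return None
        self._holder = (owner, self.clock.now + self.ttl_s)
        return next(self._tokens)

    def renew(self, owner):
        """Extend the lease only if this owner still holds a live one."""
        if self._live() and self._holder[0] == owner:
            self._holder = (owner, self.clock.now + self.ttl_s)
            return True
        return False

    def release(self, owner):
        """Conditional release: only the current holder may clear the lease."""
        if self._holder is not None and self._holder[0] == owner:
            self._holder = None
            return True
        return False


class FencedStore:
    """Accepts a write only if its token is newer than any seen so far."""

    def __init__(self):
        self.value = None
        self.highest_token = 0

    def write(self, value, token):
        if token <= self.highest_token:
            return False  # stale holder: reject
        self.highest_token = token
        self.value = value
        return True


clock = FakeClock()
lock = LeaseLock(clock, ttl_s=5)
store = FencedStore()

token_a = lock.acquire("A")          # A wins the lock
blocked = lock.acquire("B")          # B is refused while A's lease is live
clock.advance(6)                     # A is paused (or killed) past its lease
renewed_late = lock.renew("A")       # too late: the lease already expired
token_b = lock.acquire("B")          # TTL freed the lock, so no deadlock
wrote_new = store.write("new", token_b)
wrote_stale = store.write("stale", token_a)  # A wakes up and tries to write
released_by_a = lock.release("A")    # A cannot release B's lease
```

Each guard maps to one failure: the TTL covers a dead holder, renewal covers a recompute that outlives the TTL, the conditional release covers a late holder deleting someone else's lock, and the fencing token covers a late holder overwriting newer data.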
Senior stretch
  • Add an on-call runbook: how to spot a dogpile in metrics (origin recompute spikes synced to TTL boundaries), the decision tree for single-flight vs distributed lock vs XFetch, and the lock-deadlock escape hatch (always have a TTL).
  • Add TTL jitter to a BATCH of cold keys and show it staggers their combined expiry — then show it does nothing for the single hot key, making the scope of jitter concrete.
  • Add stale-while-revalidate (soft-TTL / hard-TTL split) so waiters never block: serve the stale value instantly and refresh in the background, and compare its staleness/latency profile against the blocking lock.
  • Run the distributed-lock variant against a Redis failover (or Redlock across nodes) and document what happens to the exactly-one guarantee during a partition — connecting back to why fencing tokens matter.
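For the jitter stretch item, the effect on a batch can be checked with arithmetic alone. The sketch below assumes a hypothetical helper `jittered_ttl`, a base TTL of 300 seconds, and a ±10% spread; all three are illustrative choices. It compares the worst one-second bucket of expiries for 1,000 keys written together, with and without jitter.

```python
import random
from collections import Counter


def jittered_ttl(base_ttl_s, spread, rng):
    """Scale the base TTL by a uniform factor in [1 - spread, 1 + spread]."""
    return base_ttl_s * (1 + rng.uniform(-spread, spread))


def peak_expiries_per_second(ttls):
    """Largest number of keys expiring inside any one-second bucket."""
    buckets = Counter(int(t) for t in ttls)
    return max(buckets.values())


rng = random.Random(42)  # seeded so the comparison is repeatable
base = 300.0
n_keys = 1000

fixed = [base for _ in range(n_keys)]
jittered = [jittered_ttl(base, 0.1, rng) for _ in range(n_keys)]

fixed_peak = peak_expiries_per_second(fixed)
jittered_peak = peak_expiries_per_second(jittered)
print("peak expiries/sec without jitter:", fixed_peak)
print("peak expiries/sec with jitter:   ", jittered_peak)
```

Jitter only changes when a key expires. A single hot key still has exactly one expiry instant, so every request for it misses together regardless of the jitter applied, which is why the hot key needs coalescing, a lock, or early recompute instead.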
Recap

This is the loop you will run when a hot key takes the origin down: reproduce the herd at the expiry instant, measure the origin fan-out, then collapse it. Local single-flight caps recomputes at the instance count; a distributed lock collapses the fleet to one — but only with a TTL longer than the worst-case recompute, a renewed lease, a conditional release, and a fenced write to survive a crashed or paused holder. XFetch dissolves the collision without a lock by recomputing early and alone. Doing it once on a toy service, with origin counters and fault injection, makes the production version muscle memory.
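The "early and alone" behavior of XFetch comes from one probabilistic check on each read. A minimal sketch, assuming a hypothetical function name `should_refresh_early`, with `delta` as the measured recompute duration and `beta` as a tuning knob where values above 1 favor earlier recomputes:

```python
import math
import random


def should_refresh_early(now, expiry, delta, beta=1.0, rng=random):
    """XFetch check: True if this reader should recompute before expiry.

    1 - rng.random() lies in (0, 1], so log() is defined and <= 0, and the
    subtracted term pushes `now` forward by a random, exponentially
    distributed amount scaled by delta * beta.
    """
    return now - delta * beta * math.log(1.0 - rng.random()) >= expiry


rng = random.Random(7)  # seeded for a repeatable estimate
delta = 2.0             # assumed recompute cost in seconds
expiry = 100.0
trials = 10_000

# One recompute-duration before expiry, a sizeable minority of reads volunteer.
early_rate = sum(
    should_refresh_early(expiry - delta, expiry, delta, rng=rng)
    for _ in range(trials)
) / trials

# Far from expiry nobody volunteers; at or past expiry everyone must recompute.
far_from_expiry = any(
    should_refresh_early(expiry - 1000 * delta, expiry, delta, rng=rng)
    for _ in range(trials)
)
at_expiry = all(
    should_refresh_early(expiry, expiry, delta, rng=rng) for _ in range(trials)
)
print("early refresh rate one delta before expiry:", early_rate)
```

Because the chance of volunteering rises smoothly as expiry approaches, under steady traffic some reader tends to refresh the key before the TTL reaches zero, which is what keeps origin recomputes near one without holding a lock.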
