Dogpile: build and tame the herd

Hands-on project: reproduce a dogpile on one hot key, then build and measure single-flight, a leased distributed lock, and XFetch, and prove with before/after numbers that each one tames the herd.

Level: Senior · Duration: 220 min


Reading about dogpiles is not the same as pulling a service out of one. Stand up a small service with one expensive hot key, drive a real herd into the origin at the expiry instant, then implement single-flight, a leased distributed lock, and XFetch — measuring the origin fan-out at every step.
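A minimal sketch of the baseline herd, assuming an in-process cache and a simulated origin. The names `Origin`, `NaiveCache`, and `run_herd` are hypothetical, and the 50 ms recompute cost and 20 clients are arbitrary choices for illustration. The barrier releases every client at the same instant against a cold key, which stands in for the expiry instant.

```python
import threading
import time

class Origin:
    """Hypothetical expensive origin that counts how often it is hit."""
    def __init__(self, cost_s=0.05):
        self.cost_s = cost_s
        self.calls = 0
        self._lock = threading.Lock()

    def load(self, key):
        with self._lock:
            self.calls += 1
        time.sleep(self.cost_s)  # simulate an expensive recompute
        return f"value-for-{key}"

class NaiveCache:
    """Cache-aside with a plain TTL and no herd protection."""
    def __init__(self, origin, ttl_s):
        self.origin = origin
        self.ttl_s = ttl_s
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]
        value = self.origin.load(key)  # every concurrent miss lands here
        self._data[key] = (value, time.monotonic() + self.ttl_s)
        return value

def run_herd(cache, key, n_clients):
    """Release n_clients at the same instant against a missing or expired key."""
    barrier = threading.Barrier(n_clients)

    def client():
        barrier.wait()
        cache.get(key)

    threads = [threading.Thread(target=client) for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

origin = Origin()
cache = NaiveCache(origin, ttl_s=60.0)
run_herd(cache, "hot", n_clients=20)
print("origin recomputes for one expiry:", origin.calls)
```

On most machines the printed count should land at or near the client count, because every client sees the miss before the first recompute finishes. That number is the "before" column that each mitigation is measured against.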

Goal

Turn the unit’s mental model into a reproducible loop: reproduce the herd, then add coalescing, a distributed lock with a renewed lease and a fenced write, and probabilistic early expiry — proving with origin-side counters that each one collapses N concurrent recomputes toward one.
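The coalescing step can be sketched as below, assuming a single process with threads as the concurrent callers. `SingleFlight`, `_Call`, and `slow_origin` are hypothetical names, not the API of any particular library. The first caller for a key becomes the leader and runs the recompute; everyone who arrives while it is in flight waits and shares the result.

```python
import threading
import time

class _Call:
    """One in-flight recompute that followers can wait on."""
    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None

class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.value = fn()
            except Exception as exc:
                call.error = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.value

counter = {"calls": 0}
counter_lock = threading.Lock()

def slow_origin():
    with counter_lock:
        counter["calls"] += 1
    time.sleep(0.2)  # simulated recompute cost
    return "fresh"

sf = SingleFlight()
barrier = threading.Barrier(20)
results = []

def client():
    barrier.wait()
    results.append(sf.do("hot", slow_origin))

threads = [threading.Thread(target=client) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("origin recomputes:", counter["calls"], "responses:", len(results))
```

This only coalesces callers inside one process. With several instances, each one still elects its own leader, which is why the fleet-wide count stays bounded by the instance count rather than dropping to one.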

Project

0 of 7


Objective

Build a small cache-aside service with one expensive hot key, reproduce a measurable dogpile at the expiry instant under concurrent load, then implement and measure three mitigations — single-flight, a leased distributed lock, and XFetch — proving each collapses the origin recompute count with before/after numbers.

Requirements

Acceptance criteria

A before/after table across all four strategies: origin recomputes per expiry, p99 latency, and max staleness — measured under the same concurrent load, not estimated.
The distributed-lock variant shows ~1 origin recompute per expiry fleet-wide, and the fault-injection tests prove two things: (a) when the holder is killed, its lock is released automatically by the TTL and no waiter deadlocks, and (b) a holder paused past its lease cannot overwrite the newer value, because its fenced write is rejected.
The XFetch variant shows the hot key recomputing early and alone before its TTL hits zero, keeping origin recomputes near 1 with no lock held.
A one-paragraph write-up choosing a default strategy for this workload and justifying it against staleness tolerance, instance count, and recompute cost — and naming the failure each guard (lock TTL, lease renewal, fencing token) defends against.
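The two fault-injection cases in the distributed-lock criterion can be modelled without a real lock service. The sketch below is an in-memory stand-in under stated assumptions: `FakeClock`, `LeaseLock`, and `FencedStore` are hypothetical names, the clock is advanced by hand to simulate a pause or a crash, and the fencing token is a simple counter issued on every successful acquire. A real project would back the lock with a shared store; this only illustrates what each guard has to reject.

```python
import itertools

class FakeClock:
    """Manually advanced clock so lease expiry is deterministic."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds

class LeaseLock:
    """In-memory stand-in for a lock service: TTL lease plus fencing token."""
    def __init__(self, clock, ttl_s):
        self.clock = clock
        self.ttl_s = ttl_s
        self._holder = None  # (owner, token, expires_at)
        self._tokens = itertools.count(1)

    def acquire(self, owner):
        h = self._holder
        if h is not None and h[2] > self.clock.now:
            return None  # lease still live: caller must wait
        token = next(self._tokens)  # strictly increasing fencing token
        self._holder = (owner, token, self.clock.now + self.ttl_s)
        return token

    def renew(self, owner, token):
        h = self._holder
        if h is None or h[:2] != (owner, token) or h[2] <= self.clock.now:
            return False  # lost the lease: stop working
        self._holder = (owner, token, self.clock.now + self.ttl_s)
        return True

    def release(self, owner, token):
        # Conditional release: only the current holder may clear the lock.
        if self._holder is not None and self._holder[:2] == (owner, token):
            self._holder = None
            return True
        return False

class FencedStore:
    """Cache write path that rejects tokens older than the newest one seen."""
    def __init__(self):
        self.value = None
        self.highest_token = 0

    def write(self, value, token):
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True

clock = FakeClock()
lock = LeaseLock(clock, ttl_s=5.0)
store = FencedStore()

token_a = lock.acquire("A")               # A becomes the holder
blocked = lock.acquire("B")               # None while A's lease is live
clock.advance(6.0)                        # A is killed or paused past its lease
token_b = lock.acquire("B")               # TTL expired, so B is not deadlocked
wrote_b = store.write("fresh", token_b)
wrote_a = store.write("stale", token_a)   # A wakes up late: fenced write rejected
released_a = lock.release("A", token_a)   # A cannot release B's lock
print(blocked, token_b, wrote_b, wrote_a, released_a, store.value)
```

Each guard maps to one line of the scenario: the TTL lets B acquire after A disappears, the conditional release stops A from freeing a lock it no longer owns, and the token comparison in `write` stops A's late result from replacing B's.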

Senior stretch

Add an on-call runbook: how to spot a dogpile in metrics (origin recompute spikes synced to TTL boundaries), a decision tree for choosing between single-flight, a distributed lock, and XFetch, and the escape hatch for lock deadlocks (every lock carries a TTL).
Add TTL jitter to a batch of cold keys and show that it staggers their combined expiry. Then show that it does nothing for the single hot key, which makes the scope of jitter concrete.
Add stale-while-revalidate (soft-TTL / hard-TTL split) so waiters never block: serve the stale value instantly and refresh in the background, and compare its staleness/latency profile against the blocking lock.
Run the distributed-lock variant against a Redis failover (or Redlock across nodes) and document what happens to the exactly-one guarantee during a partition — connecting back to why fencing tokens matter.
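For the jitter item in the list above, a small sketch can make the scope concrete. It assumes a base TTL of 300 seconds and a spread of plus or minus 10 percent, both arbitrary, and `jittered_ttl` and `distinct_expiry_seconds` are hypothetical helpers. It counts how many distinct seconds a batch of 1000 keys expires in, with and without jitter.

```python
import random

def jittered_ttl(base_ttl_s, spread, rng):
    """Scale base_ttl_s by a random factor in [1 - spread, 1 + spread]."""
    return base_ttl_s * (1.0 + rng.uniform(-spread, spread))

def distinct_expiry_seconds(ttls):
    """Number of distinct whole seconds in which the keys expire."""
    return len({int(t) for t in ttls})

rng = random.Random(7)  # fixed seed so the run is repeatable
base_ttl_s = 300.0

fixed = [base_ttl_s for _ in range(1000)]
jittered = [jittered_ttl(base_ttl_s, 0.10, rng) for _ in range(1000)]

print("fixed TTL, distinct expiry seconds:", distinct_expiry_seconds(fixed))
print("jittered TTL, distinct expiry seconds:", distinct_expiry_seconds(jittered))
```

The batch spreads across roughly a minute instead of one second. A single hot key still has exactly one expiry instant, so jitter only moves that instant; every client waiting on the key still collides there.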

Recap

This is the loop you will run when a hot key takes the origin down: reproduce the herd at the expiry instant, measure the origin fan-out, then collapse it. Local single-flight caps recomputes at one per instance, so the fleet-wide count is bounded by the instance count. A distributed lock collapses the fleet to one, but only with a TTL longer than the worst-case recompute, a renewed lease, a conditional release, and a fenced write to survive a crashed or paused holder. XFetch dissolves the collision without a lock by recomputing early and alone. Doing it once on a toy service, with origin counters and fault injection, makes the production version muscle memory.
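The "early and alone" behaviour of XFetch comes from one probabilistic check on every read. The sketch below uses the commonly stated form of that check, in which a request recomputes when `now - delta * beta * ln(rand())` reaches the expiry time. Here `delta_s` is the measured recompute duration and `beta` is a tuning knob, with 1.0 as the usual default. The function names and the simulation harness are hypothetical.

```python
import math
import random

def should_refresh_early(now, expires_at, delta_s, beta=1.0, rng=random):
    """XFetch check: True means recompute now, before the TTL runs out."""
    u = 1.0 - rng.random()  # in (0, 1], so log() never sees 0
    # -log(u) is >= 0, so the term below pushes "now" forward by a random amount.
    return now - delta_s * beta * math.log(u) >= expires_at

def early_refresh_rate(gap_s, delta_s, trials=20000, seed=1):
    """Fraction of reads that would refresh when gap_s seconds remain."""
    rng = random.Random(seed)
    hits = sum(
        should_refresh_early(0.0, gap_s, delta_s, rng=rng) for _ in range(trials)
    )
    return hits / trials

for gap in (10.0, 3.0, 1.0, 0.0):
    rate = early_refresh_rate(gap, delta_s=1.0)
    print(f"{gap:>5.1f}s before expiry: refresh probability ~ {rate:.4f}")
```

With `beta` at 1.0 the probability of an early refresh is `exp(-gap / delta)`, so it is negligible far from expiry and climbs to 1 at the expiry instant. Under heavy read traffic, one request is likely to win that draw shortly before the TTL runs out, which is why the hot key tends to be refreshed by a single caller without any lock.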
