Cache invalidation: reproduce and close the race

Hands-on project — reproduce the set-after-delete race and a TTL stampede in a small cache-aside service, then close them with double-delete, jitter, and a measured write strategy, proving each fix with before/after numbers.

Level: Senior · Time: 240 min


Reading about the set-after-delete race is not the same as making it fire on demand and then killing it. Build a small cache-aside service, drive a concurrent reader and writer into the race until you can see stale data on every run, then apply the unit’s fixes until the inconsistency rate hits zero — with evidence at every step.

Goal

Turn the unit’s consistency model into a reproducible harness: provoke the set-after-delete race and a synchronized-TTL stampede, then close each with the right lever (double-delete or leases, jitter, a chosen write strategy) and verify with before/after measurements rather than reasoning.
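As a starting point, the set-after-delete race can be modelled without any infrastructure. The sketch below replays the racy schedule step by step, using plain dicts as stand-ins for Postgres and Redis (an assumption for illustration; the project itself uses the real stores). The function names `run_interleaving` and `inconsistency_rate` are hypothetical. The second delete is modelled as firing after the reader's late set, which is the condition double-delete depends on.

```python
"""Deterministic model of the set-after-delete race using in-memory stand-ins."""


def run_interleaving(second_delete: bool) -> bool:
    """Replay the racy schedule once; return True if the next read is stale."""
    db = {"user:1": "v1"}   # stand-in for the origin (Postgres in the project)
    cache = {}              # stand-in for the cache (Redis in the project)

    # Reader: cache miss, so it reads the origin and holds the value in memory.
    assert "user:1" not in cache
    reader_value = db["user:1"]          # reads "v1"

    # Writer: commits the new value, then invalidates (delete-on-write).
    db["user:1"] = "v2"
    cache.pop("user:1", None)

    # Reader: resumes and populates the cache with the value it read earlier.
    cache["user:1"] = reader_value       # the set lands after the delete

    # Double-delete: the writer deletes again after a delay. This model assumes
    # the delay is long enough for the reader's late set to have landed.
    if second_delete:
        cache.pop("user:1", None)

    # A later cache-aside read: serve from cache, else fall back to the origin.
    value = cache.get("user:1")
    if value is None:
        value = db["user:1"]
        cache["user:1"] = value
    return value != db["user:1"]


def inconsistency_rate(runs: int, second_delete: bool) -> float:
    """Fraction of runs in which the later read returned stale data."""
    stale = sum(run_interleaving(second_delete) for _ in range(runs))
    return stale / runs


if __name__ == "__main__":
    print("baseline     :", inconsistency_rate(100, second_delete=False))
    print("double-delete:", inconsistency_rate(100, second_delete=True))
```

Because the schedule is fixed, the baseline is stale on every run and the double-delete variant is never stale in this model. Against real Redis and Postgres the interleaving would have to be forced (for example with a pause between the reader's origin read and its cache set), and the measured rate depends on whether the delayed delete really outlasts the slowest reader.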

Project

Objective

Build a cache-aside service over Postgres + Redis, reproduce the set-after-delete race and a TTL stampede deterministically, then close both and prove the fixes with measured inconsistency-rate and origin-load numbers — before and after.
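To see why a synchronized TTL produces a stampede, it can help to count how many keys expire in the same second. The sketch below is a simulation only, with no cache involved: it assumes every key is warmed at the same instant and approximates peak origin load by the largest number of expiries falling in any one-second bucket. The function name, the 300-second base TTL, and the ±10% jitter are illustrative choices, not values from the lesson.

```python
import random
from collections import Counter


def peak_expiries_per_second(n_keys: int, base_ttl: float,
                             jitter_fraction: float, seed: int = 0) -> int:
    """Largest number of keys expiring within the same one-second bucket.

    All keys are assumed to be set at t=0. Each TTL is base_ttl scaled by a
    uniform random factor in [1 - jitter_fraction, 1 + jitter_fraction].
    """
    rng = random.Random(seed)            # seeded so repeated runs are comparable
    buckets = Counter()
    for _ in range(n_keys):
        ttl = base_ttl * (1 + rng.uniform(-jitter_fraction, jitter_fraction))
        buckets[int(ttl)] += 1           # the second in which this key expires
    return max(buckets.values())


if __name__ == "__main__":
    fixed = peak_expiries_per_second(1000, 300, jitter_fraction=0.0)
    jittered = peak_expiries_per_second(1000, 300, jitter_fraction=0.1)
    print(f"fixed TTL   : {fixed} keys expire in the worst second")
    print(f"10% jitter  : {jittered} keys expire in the worst second")
```

With a fixed TTL every key lands in the same bucket, so the worst second carries all 1000 expiries. With jitter the expiries spread over roughly a minute. In the real harness the number to report is measured origin qps at expiry, which this bucket count only approximates.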

Requirements

Acceptance criteria

A before/after table: set-after-delete inconsistency rate (baseline vs double-delete vs leases), measured over the same harness run — not estimated.
A before/after table for the stampede: peak origin qps at expiry with fixed TTL vs with jitter + single-flight, under the same warm-and-expire scenario.
Evidence that delete-on-write keeps a TTL backstop: simulate a dropped DEL (skip it) and show the TTL bounds the stale window, then show removing the TTL leaves the value stale indefinitely.
A one-paragraph write-up stating, for each failure, which lever closed it and why you chose it over the alternatives (e.g. double-delete vs leases vs write-through) given cost and the read-your-writes requirement.
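The TTL-backstop criterion above can be rehearsed with a fake clock before running it against the real service. The sketch below assumes a tiny in-memory cache (`TtlCache`, a hypothetical stand-in for Redis) and a dict as the origin. It performs a write whose DEL is deliberately skipped, then advances time and records how long a cache-aside reader keeps seeing the old value.

```python
from typing import Optional


class TtlCache:
    """In-memory cache with a manually advanced clock (stand-in for Redis)."""

    def __init__(self) -> None:
        self.now = 0.0
        self._data = {}   # key -> (value, expires_at or None for no expiry)

    def set(self, key: str, value: str, ttl: Optional[float] = None) -> None:
        expires_at = None if ttl is None else self.now + ttl
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self.now >= expires_at:
            del self._data[key]          # expired: behave like a miss
            return None
        return value


def stale_window(ttl: Optional[float], horizon: float = 3600.0,
                 step: float = 1.0) -> Optional[float]:
    """Seconds until a reader sees the new value after a write with a dropped DEL.

    Returns None if the reader is still served stale data at the horizon.
    """
    cache = TtlCache()
    db = {"k": "old"}
    cache.set("k", db["k"], ttl=ttl)     # warm the cache
    db["k"] = "new"                      # write commits; the DEL is skipped
    write_time = cache.now
    while cache.now - write_time < horizon:
        cache.now += step
        value = cache.get("k")
        if value is None:                # cache-aside miss: refill from origin
            value = db["k"]
            cache.set("k", value, ttl=ttl)
        if value == db["k"]:
            return cache.now - write_time
    return None


if __name__ == "__main__":
    print("TTL 60s:", stale_window(60.0))   # bounded by the TTL
    print("no TTL :", stale_window(None))   # still stale at the horizon
```

With a 60-second TTL the stale window is exactly the remaining TTL. With no TTL the reader is still served the old value when the one-hour horizon ends, which is the "stale indefinitely" case the criterion asks you to demonstrate.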

Senior stretch

Add an on-call runbook: how to recognise a set-after-delete report (edit reverts for ~TTL, not reproducible locally), the triage steps, and the fix-priority ladder (jitter and TTL backstop first, then double-delete/leases, then write-through).
Add a write-behind variant and deliberately kill the node before the async flush to demonstrate the durability window losing a committed-looking write — then show a durable queue closing it.
Add cache-key normalisation: send requests with noisy query params (utm_*, varying param order) and show the hit rate collapse, then fix the key to include only response-affecting params and measure the hit-rate recovery.
Add probabilistic early refresh on top of jitter and show it rebuilds hot keys before expiry, flattening the origin ramp further than jitter alone.
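For the cache-key normalisation stretch item, one possible shape for the fix is an allow-list of response-affecting parameters plus a stable sort. The sketch below is a minimal example under that assumption. The allow-list contents (`page`, `sort`, `q`) and the function name `cache_key` are hypothetical and would need to match what your service's responses really depend on.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical allow-list: only these params are assumed to change the response.
RESPONSE_AFFECTING = {"page", "sort", "q"}


def cache_key(url: str) -> str:
    """Build a cache key from the path plus allow-listed params in sorted order."""
    parts = urlsplit(url)
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in RESPONSE_AFFECTING      # drops utm_* and other noise
    ]
    params.sort()                          # makes param order irrelevant
    return f"{parts.path}?{urlencode(params)}"


if __name__ == "__main__":
    urls = [
        "/items?page=2&sort=asc",
        "/items?sort=asc&page=2",
        "/items?utm_source=mail&page=2&sort=asc&utm_campaign=x",
    ]
    print("distinct raw keys       :", len(set(urls)))
    print("distinct normalised keys:", len({cache_key(u) for u in urls}))
```

The three example URLs describe the same response but produce three raw keys and a single normalised key. An allow-list is the conservative choice here: an unknown parameter is ignored rather than fragmenting the cache, at the cost of needing an update whenever a new response-affecting parameter is introduced.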

Recap

This is the loop you will run in every real cache-consistency incident: reproduce the failure deterministically before you trust any fix, choose the lever the situation needs (jitter and a TTL backstop for stampedes and missed purges, double-delete or leases for the set-after-delete race, write-through when read-your-writes is non-negotiable), and verify with before/after numbers under identical load. Doing it once on a toy service makes the production version — where the stale data is a user’s reverted profile — muscle memory.
