Cache stampede: code reading

Read real cache-aside snippets — lock-on-miss, request coalescing, XFetch — predict the behaviour, and pick the highest-leverage fix.

Level: Senior · 14 min

Stampede bugs hide in the cache-aside code itself: a lock with the wrong EX, a coalescer that registers the in-flight promise one line too late, an XFetch rule that fires too early. Read each snippet the way you would in review, then choose the fix a senior engineer makes first.

Goal

Practise reading the actual mitigation code — lock-on-miss, request coalescing, and probabilistic early expiration — and spotting the defect that a load test will eventually expose.

Snippet 1 — lock-on-miss

func get(ctx context.Context, key string) ([]byte, error) {
    if v, ok := cache.Get(key); ok {
        return v, nil
    }
    // miss: try to become the rebuilder
    locked := redis.SetNX(ctx, "lock:"+key, uuid, 30*time.Second).Val()
    if !locked {
        // someone else is rebuilding
        return rebuild(ctx, key) // <-- rebuild anyway
    }
    v, err := rebuild(ctx, key)
    if err == nil {
        cache.Set(key, v, 60*time.Second)
        redis.Del(ctx, "lock:"+key)
    }
    return v, err
}

Quiz

The SetNX lock is acquired correctly, yet a load test still produces N concurrent DB rebuilds. Where is the bug?

Snippet 2 — request coalescing

const inflight = new Map(); // key -> Promise

async function getCoalesced(key) {
  const cached = await cache.get(key);
  if (cached !== null) return cached;

  const fresh = await rebuild(key);   // (A) await the rebuild...
  inflight.set(key, fresh);           // (B) ...then record it
  const value = await fresh;
  cache.set(key, value, 60);
  inflight.delete(key);
  return value;
}

Quiz

This is meant to coalesce concurrent misses for the same key into one rebuild, but it never coalesces. What is wrong?

Snippet 3 — probabilistic early expiration (XFetch)

def should_refresh(delta, beta, ttl_remaining):
    # delta = typical rebuild seconds, ttl_remaining = seconds to expiry
    return (-beta * delta * math.log(random.random())) >= ttl_remaining

Quiz

An operator wants fewer wasted early rebuilds on warm keys, so they raise beta from 1.0 to 4.0. What is the effect on a hot key vs a colder key?
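
The direction of the effect can be checked with a little arithmetic. For a uniform random draw, the rule in should_refresh fires with probability exp(-ttl_remaining / (beta * delta)) on each read, so a larger beta raises the per-read chance of refreshing at any given ttl_remaining. The sketch below works that out; refresh_probability and chance_of_any_refresh are hypothetical helper names, and the values delta=0.5 and ttl_remaining=2.0 are picked only for illustration.

```python
import math
import random


def should_refresh(delta, beta, ttl_remaining):
    # Same rule as the snippet; 1.0 - random.random() lies in (0, 1], which avoids log(0).
    return (-beta * delta * math.log(1.0 - random.random())) >= ttl_remaining


def refresh_probability(delta, beta, ttl_remaining):
    """Chance that a single read triggers a refresh.

    For U uniform on (0, 1]: P(-beta*delta*ln(U) >= t) = exp(-t / (beta*delta)).
    Assumes beta > 0 and delta > 0.
    """
    if ttl_remaining <= 0:
        return 1.0
    return math.exp(-ttl_remaining / (beta * delta))


def chance_of_any_refresh(delta, beta, ttl_remaining, reads):
    """Chance that at least one of `reads` independent reads refreshes."""
    p = refresh_probability(delta, beta, ttl_remaining)
    return 1.0 - (1.0 - p) ** reads


if __name__ == "__main__":
    for beta in (1.0, 4.0):
        p = refresh_probability(delta=0.5, beta=beta, ttl_remaining=2.0)
        print(f"beta={beta}: per-read refresh probability {p:.3f}")
```

With these inputs the per-read probability is about 0.018 at beta 1.0 and about 0.368 at beta 4.0. A hot key rolls the dice on every one of its many reads, so chance_of_any_refresh climbs towards 1 long before expiry; a colder key with a handful of reads sees a much smaller shift.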

Snippet 4 — lock with a fencing-token write

def rebuild_and_write(key, my_token):
    value = rebuild(key)               # may take longer than the lock EX
    if redis.get("lock:" + key) != my_token:
        return                          # we lost the lock — abort the write
    redis.set("cache:" + key, value, ex=60)

Quiz

The fencing check 'GET lock then SET cache' guards against a slow rebuild that outlived its lock. What residual race remains, and what closes it?
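
The gap is between the GET and the SET: the lock can expire and a newer rebuilder can write in that window, after which the slow caller overwrites the newer value. One way to close it, sketched below as an assumption-laden illustration, is to store a monotonically increasing token next to the value and make the comparison and the write a single atomic step. VersionedCache is a hypothetical in-memory class standing in for the cache server; in Redis the equivalent check-and-write would have to run as one server-side operation, such as a Lua script, which is not shown here.

```python
import threading


class VersionedCache:
    """In-memory stand-in for a cache that stores (token, value) per key."""

    def __init__(self):
        self._mu = threading.Lock()
        self._data = {}        # key -> (token, value)
        self._counter = 0

    def next_token(self):
        """Hands out a strictly increasing fencing token, taken with the lock."""
        with self._mu:
            self._counter += 1
            return self._counter

    def write_if_newer(self, key, value, token):
        """Compare and write in one atomic step; older tokens are rejected."""
        with self._mu:
            current = self._data.get(key)
            if current is not None and current[0] >= token:
                return False   # a newer rebuilder already wrote: drop this result
            self._data[key] = (token, value)
            return True

    def get(self, key):
        with self._mu:
            entry = self._data.get(key)
            return None if entry is None else entry[1]


if __name__ == "__main__":
    store = VersionedCache()
    slow = store.next_token()   # first rebuilder, whose lock later expires
    fast = store.next_token()   # second rebuilder, started after the expiry
    store.write_if_newer("user:1", "fresh", fast)
    accepted = store.write_if_newer("user:1", "stale", slow)
    print(accepted, store.get("user:1"))
```

Because the token is checked by the store at write time rather than by the caller beforehand, a rebuild that outlives its lock can still finish, but its late write is rejected instead of clobbering the newer value.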

Recap

Every stampede defence lives in code that is easy to get subtly wrong. A lock only helps if the losers wait and re-check the cache rather than rebuild anyway. A coalescer only helps if the in-flight promise is registered synchronously, before any await, and consulted on entry. XFetch’s beta moves the early-refresh window the opposite way from most people’s intuition: higher beta means earlier, more frequent refreshes. A fencing-token check is only safe when the read-then-write is atomic or backed by a monotonic version. Read the mitigation, trace two concurrent callers through it, and the bug usually shows itself before any load test does.
