
Lock and single-flight: bounding concurrent rebuilds

A Redis SETNX lock serialises rebuilds across the fleet; in-process single-flight collapses the per-node herd to one Promise at zero network cost. Use both layers in sequence.

CACHE Middle ◷ 14 min


At a TTL boundary 100,000 requests hit a 50-node fleet. Each node runs an in-process single-flight. How many DB queries happen? Not 100,000 — but not 1 either. The answer reveals exactly where each mitigation layer does and does not help.

Mitigation 1: distributed locking with SETNX

By the end of this lesson you will know how to drive 100,000 concurrent misses down to a single DB query — and why you need two separate mechanisms to get there.

The simplest cross-node mitigation: before running the rebuild, acquire a lock. The Redis primitive is:

SET lock:key uuid EX 30 NX

NX: set only if the key does not exist (set-if-not-exists).
EX 30: auto-expire after 30 s (the safety net for crashed rebuilders).
uuid: the acquiring process's unique token, checked on release so that only the owner deletes the lock (fencing is covered in the senior lesson).

What happens at expiry:

Request-1 arrives. Runs SET lock:homepage:v1 uuid-A EX 30 NX → success. Starts rebuild.
Requests 2–N arrive. Run the same SET → fail (NX). They see the lock is held.
Option A: each waiter sleeps briefly (50–200 ms) and then re-checks the cache. By the time it re-checks, the rebuild may have finished.
Option B: each waiter returns a fallback (stale value, default page, empty 204) immediately.
Request-1 finishes rebuild. Writes new value. Deletes lock.

EX 30 is not the cache TTL; it is a safety net. If the rebuilder crashes mid-rebuild without deleting the lock, the lock auto-expires after 30 s. Set EX longer than the rebuild p99, but short enough that a crash does not stall traffic for long. A typical starting point is 3× the average rebuild duration, raised if that still falls below p99.

Scenario	Without lock	With SETNX lock
10-node fleet, 2,000 misses/node	20,000 parallel DB queries	1 DB query fleet-wide (in-process single-flight alone would give 10, one per node)
Rebuilder crashes mid-work	Herd repeats every TTL	Lock auto-expires after EX seconds
Lock EX too short	N/A	Second rebuilder races — duplicate writes

Mitigation 2: in-process single-flight

Distributed locks coordinate across the fleet; single-flight coordinates within one process. No Redis, no network round-trip — just an in-process map.

The pattern:

type SingleFlight struct { mu sync.Mutex; inflight map[string]*call }

Request arrives; cache miss.
Check in-process map for key. If a Promise / call already exists → subscribe to it, wait for resolution, return the shared result.
If no entry → create a new entry (Promise), start the rebuild, add to map.
When rebuild completes → resolve the Promise, remove from map. All subscribers get the result simultaneously.
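The four steps above can be filled in around the `SingleFlight` struct as follows. This is a minimal sketch, not the implementation of any particular library: the `call` fields, the `Waiters` helper, and `demoRebuilds` are names introduced here for illustration, and a rebuild function that panics would leave subscribers waiting, which a production version has to handle.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call is one in-flight rebuild; waiters block on done, then read val/err.
type call struct {
	done    chan struct{}
	waiters int // subscribers that joined (guarded by SingleFlight.mu)
	val     string
	err     error
}

// SingleFlight deduplicates concurrent rebuilds of one key in one process.
type SingleFlight struct {
	mu       sync.Mutex
	inflight map[string]*call
}

// Do runs fn once per key at a time. Callers that arrive while it is running
// wait for it and share its result; shared reports whether the caller joined.
func (s *SingleFlight) Do(key string, fn func() (string, error)) (val string, err error, shared bool) {
	s.mu.Lock()
	if s.inflight == nil {
		s.inflight = make(map[string]*call)
	}
	if c, ok := s.inflight[key]; ok {
		c.waiters++
		s.mu.Unlock()
		<-c.done // subscribe: wait for the leader to finish
		return c.val, c.err, true
	}
	c := &call{done: make(chan struct{})}
	s.inflight[key] = c
	s.mu.Unlock()

	c.val, c.err = fn() // only the leader runs the rebuild

	s.mu.Lock()
	delete(s.inflight, key) // later misses start a fresh rebuild
	s.mu.Unlock()
	close(c.done) // wake every subscriber at once
	return c.val, c.err, false
}

// Waiters reports how many callers are subscribed to key's in-flight call.
func (s *SingleFlight) Waiters(key string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if c, ok := s.inflight[key]; ok {
		return c.waiters
	}
	return 0
}

// demoRebuilds fires n concurrent misses at one key and returns how many
// times the rebuild function actually ran.
func demoRebuilds(n int) int32 {
	var sf SingleFlight
	var rebuilds int32
	release := make(chan struct{})
	rebuild := func() (string, error) {
		atomic.AddInt32(&rebuilds, 1)
		<-release // hold the rebuild open until every request has joined
		return "fresh homepage", nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sf.Do("homepage:v1", rebuild)
		}()
	}
	for sf.Waiters("homepage:v1") < n-1 {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt32(&rebuilds)
}

func main() {
	fmt.Println("rebuilds:", demoRebuilds(2000)) // rebuilds: 1
}
```

The demo holds the rebuild open until all other callers have subscribed, so the count of one is deterministic rather than dependent on timing.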

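Go ships this as singleflight.Group.Do in the golang.org/x/sync/singleflight package, which is maintained by the Go team but lives outside the standard library. Node.js equivalents: p-memoize or a manual Map<key, Promise> pattern.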

Cost: O(1) map lookup in process memory. No network. No lock acquire.

Scope: per-process only. A 50-node fleet has 50 independent in-process maps. At a TTL boundary with 100,000 concurrent requests evenly distributed, single-flight alone gives 50 DB queries (1 per node), not 100,000. Add a distributed lock to go from 50 to 1.

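Each layer narrows the coordination scope: single-flight is per-process (100,000 → 50), the distributed lock is per-fleet (50 → 1). Single-flight alone stops at one rebuild per node. The lock alone does reach one rebuild, but all 100,000 misses pay a Redis round-trip to attempt it. Together, only 50 lock attempts reach Redis and one rebuild reaches the DB.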
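The arithmetic can be written out as a tiny model. It is idealised: it assumes misses are spread over every node and all arrive while the first rebuild is still in flight. `dbQueries` and `lockAttempts` are hypothetical helper names. The lock-only row shows one DB query, but at the price of one Redis lock attempt per miss.

```go
package main

import "fmt"

// dbQueries estimates rebuild queries at one TTL boundary.
func dbQueries(misses, nodes int, singleFlight, fleetLock bool) int {
	switch {
	case misses == 0:
		return 0
	case fleetLock:
		return 1 // one lock holder rebuilds for the whole fleet
	case singleFlight:
		if misses < nodes {
			return misses
		}
		return nodes // one leader per process
	default:
		return misses // every miss queries the DB
	}
}

// lockAttempts estimates how many SET ... NX calls reach Redis.
func lockAttempts(misses, nodes int, singleFlight, fleetLock bool) int {
	switch {
	case !fleetLock:
		return 0
	case singleFlight && misses >= nodes:
		return nodes // only each node's leader tries the lock
	default:
		return misses
	}
}

func main() {
	const misses, nodes = 100000, 50
	fmt.Println(dbQueries(misses, nodes, false, false), lockAttempts(misses, nodes, false, false)) // 100000 0
	fmt.Println(dbQueries(misses, nodes, true, false), lockAttempts(misses, nodes, true, false))   // 50 0
	fmt.Println(dbQueries(misses, nodes, false, true), lockAttempts(misses, nodes, false, true))   // 1 100000
	fmt.Println(dbQueries(misses, nodes, true, true), lockAttempts(misses, nodes, true, true))     // 1 50
}
```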

▸Why this works

Facebook’s memcache “leases” (Nishtala et al., NSDI 2013) implement the same idea at the cache layer: on a miss the cache returns a 64-bit lease token. Only the client holding the token may write back. Concurrent miss-clients get a null with no token and are told to wait. The result: peak DB query rate fell from 17K QPS to 1.3K QPS — roughly 13x — on a single hot key cluster.
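The lease idea can be illustrated with a toy in-memory model. This is not memcached's implementation: `leaseCache` and its methods are hypothetical names, and lease expiry and invalidation are left out. It shows only the core rule that the first miss receives a token and only that token may write back.

```go
package main

import (
	"fmt"
	"sync"
)

// leaseCache is a toy model of lease-style miss handling.
type leaseCache struct {
	mu     sync.Mutex
	values map[string]string
	leases map[string]uint64 // outstanding lease token per key
	next   uint64
}

func newLeaseCache() *leaseCache {
	return &leaseCache{values: map[string]string{}, leases: map[string]uint64{}}
}

// Get returns the value on a hit. On a miss, the first caller receives a
// non-zero lease token; later callers get token 0 and should retry shortly.
func (c *leaseCache) Get(key string) (val string, hit bool, token uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.values[key]; ok {
		return v, true, 0
	}
	if _, leased := c.leases[key]; leased {
		return "", false, 0
	}
	c.next++
	c.leases[key] = c.next
	return "", false, c.next
}

// Set stores the value only when the caller presents the current lease token.
func (c *leaseCache) Set(key, val string, token uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.leases[key]; !ok || cur != token {
		return false
	}
	delete(c.leases, key)
	c.values[key] = val
	return true
}

func main() {
	c := newLeaseCache()
	_, _, t1 := c.Get("hot") // first miss: receives the lease
	_, _, t2 := c.Get("hot") // concurrent miss: no token, must wait
	fmt.Println(t1 != 0, t2 == 0)      // true true
	fmt.Println(c.Set("hot", "v", t2)) // false: no lease held
	fmt.Println(c.Set("hot", "v", t1)) // true: lease holder writes back
	v, hit, _ := c.Get("hot")
	fmt.Println(v, hit) // v true
}
```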

Composing both layers

When you deploy single-flight alone, you still have M nodes each firing one rebuild. When you deploy the lock alone, all waiters queue behind a single rebuilder and every one of them waits for the full rebuild duration. Neither is acceptable at scale — but the two layers fit together exactly.

Neither layer is sufficient alone:

Single-flight only: 50-node fleet still sends 50 concurrent rebuilds.
Distributed lock only: one rebuild, but every miss pays a Redis round-trip for the lock attempt, and waiters (all but the one lock holder) receive nothing while the rebuild runs, which adds latency to every request at the boundary.

Combined stack:

Check in-process map → if a Promise is in flight, subscribe and wait.
No in-flight Promise → register a new Promise for the key first, so later requests on this node subscribe instead of contending for the lock.
Inside that Promise → try SET lock:key uuid EX 30 NX.
Lock acquired → rebuild, write value, delete lock (only if it still holds your uuid), resolve Promise.
Lock NOT acquired → retry GET cache after 50 ms; resolve the Promise with the fresh value, or with a stale fallback if it is still missing.

Order the steps

Order the steps a request takes in a single-flight + Redis-lock stack:

1 Cache GET returns nil (miss)
2 Check in-process singleflight map; if a Promise exists, subscribe to it
3 No Promise: register a new Promise for the key
4 Inside the Promise: try SET lock:key uuid EX 30 NX
5 Lock acquired: run the rebuild, write value to cache with TTL, delete the Redis lock, resolve the Promise
6 Lock NOT acquired: wait 50 ms, re-check cache, resolve the Promise with the cached value or a stale fallback if still missing
7 All in-process subscribers receive the resolved value via the shared Promise
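A minimal Go sketch of the combined stack follows. It makes several assumptions: `store` is a hypothetical in-memory stand-in for Redis (lock expiry is omitted), each `node` models one process, and the node id doubles as the lock token. The in-flight call is registered before the lock attempt so that only one goroutine per node contends for the lock, and the lock holder re-checks the cache before rebuilding in case another node finished just ahead of it.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// store is a hypothetical in-memory stand-in for Redis. Lock expiry (EX) is
// omitted to keep the sketch short.
type store struct {
	cache sync.Map // key -> cached value
	locks sync.Map // lock key -> owner token
}

// call is one in-flight fetch that other requests on the same node can join.
type call struct {
	done chan struct{}
	val  string
}

// node models one process with its own single-flight map.
type node struct {
	id       string
	mu       sync.Mutex
	inflight map[string]*call
}

func (n *node) fetch(s *store, key string, rebuild func() string) string {
	if v, ok := s.cache.Load(key); ok {
		return v.(string) // cache hit
	}
	n.mu.Lock()
	if c, ok := n.inflight[key]; ok {
		n.mu.Unlock()
		<-c.done // join the fetch already running on this node
		return c.val
	}
	c := &call{done: make(chan struct{})}
	n.inflight[key] = c // register before touching the lock
	n.mu.Unlock()

	c.val = n.lead(s, key, rebuild)

	n.mu.Lock()
	delete(n.inflight, key)
	n.mu.Unlock()
	close(c.done) // wake every subscriber on this node
	return c.val
}

// lead runs on one goroutine per node: it tries the fleet-wide lock and
// either rebuilds or waits briefly for the lock holder's result.
func (n *node) lead(s *store, key string, rebuild func() string) string {
	lockKey := "lock:" + key
	if _, held := s.locks.LoadOrStore(lockKey, n.id); held { // like SET ... NX
		time.Sleep(50 * time.Millisecond)
		if v, ok := s.cache.Load(key); ok {
			return v.(string)
		}
		return "stale fallback"
	}
	defer s.locks.CompareAndDelete(lockKey, n.id) // release only our own lock
	if v, ok := s.cache.Load(key); ok {
		return v.(string) // another node finished just before we locked
	}
	v := rebuild()
	s.cache.Store(key, v)
	return v
}

// simulate sends perNode concurrent misses to each node and counts rebuilds.
func simulate(nodes, perNode int) int32 {
	s := &store{}
	var rebuilds int32
	rebuild := func() string {
		atomic.AddInt32(&rebuilds, 1)
		time.Sleep(10 * time.Millisecond) // pretend DB query
		return "fresh"
	}
	var wg sync.WaitGroup
	for i := 0; i < nodes; i++ {
		n := &node{id: fmt.Sprintf("node-%d", i), inflight: map[string]*call{}}
		for j := 0; j < perNode; j++ {
			wg.Add(1)
			go func() { defer wg.Done(); n.fetch(s, "homepage:v1", rebuild) }()
		}
	}
	wg.Wait()
	return atomic.LoadInt32(&rebuilds)
}

func main() {
	fmt.Println("rebuilds:", simulate(5, 200)) // rebuilds: 1
}
```

Because the rebuild only runs while holding the lock and after a cache re-check, the count stays at one regardless of goroutine scheduling. With a real expiring lock, the same guarantee depends on EX exceeding the rebuild duration.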

Quiz

A 50-node fleet uses in-process single-flight only. At a TTL boundary 100,000 concurrent misses arrive. How many DB rebuilds happen?

Quiz

What is the role of the EX value in a SETNX-based lock?

Quiz

A cache lock uses EX=10 s. The rebuild takes 12 s. What happens?

Only the lock holder (Request-1) touches the DB. Every other waiter fails the NX acquire and serves a stale fallback — N concurrent misses collapse to one rebuild.

Recall before you leave

01 What is the practical difference between in-process single-flight and a Redis distributed lock, and when should you use each?
02 Facebook memcache leases (NSDI 2013) reduced peak DB QPS from 17K to 1.3K. What mechanism achieves this?
03 Why must the lock EX value be set to more than the rebuild p99, not just the rebuild average?

Recap

Two mitigations bound concurrent rebuilds without changing the cache TTL. In-process single-flight maintains a per-process map of in-flight Promises; every request that arrives while a rebuild is running subscribes to the same Promise instead of starting a new rebuild — zero network cost, zero coordination. A Redis SETNX distributed lock serialises rebuilds across the entire fleet using a SET key uuid EX N NX acquire and an explicit delete on completion. Composing both reduces 100,000 concurrent misses on a 50-node fleet to 1 DB query. The lock EX must exceed the rebuild p99; pair the lock with a stale fallback so waiters never block indefinitely. Now when you see sawtooth DB spikes on a multi-node fleet, reach for single-flight first (free, in-process), then add the Redis lock if per-node isolation is still not enough.


