
Caching

Lock and single-flight: bounding concurrent rebuilds

Crux: A Redis SETNX lock serialises rebuilds across the fleet; in-process single-flight collapses the per-node herd to one Promise at zero network cost. Use both layers in sequence.

At a TTL boundary 100,000 requests hit a 50-node fleet. Each node runs an in-process single-flight. How many DB queries happen? Not 100,000 — but not 1 either. The answer reveals exactly where each mitigation layer does and does not help.

Mitigation 1: distributed locking with SETNX

The simplest cross-node mitigation: before running the rebuild, acquire a lock. The Redis primitive is:

SET lock:key uuid EX 30 NX
  • NX — set only if the key does not exist (set-if-not-exists).
  • EX 30 — auto-expire after 30 s (the safety net for crashed rebuilders).
  • uuid — the acquiring process’s unique token (used for fencing, covered in the senior lesson).

What happens at expiry:

  1. Request-1 arrives. Runs SET lock:homepage:v1 uuid-A EX 30 NX → success. Starts rebuild.
  2. Requests 2–N arrive. Run the same SET → fail (NX). They see the lock is held.
  3. Option A: each waiter re-checks the cache on a short sleep (50–200 ms). By the time they re-check, the rebuild may have finished.
  4. Option B: each waiter returns a fallback (stale value, default page, empty 204) immediately.
  5. Request-1 finishes rebuild. Writes new value. Deletes lock.

The EX=30 is not the cache TTL; it is a safety net. If the rebuilder crashes after step 1 without deleting the lock, the lock auto-expires after 30 s. Set EX longer than the rebuild p99, but short enough that a crash does not stall traffic for too long. A typical starting point is 3× the average rebuild duration, provided that still exceeds the p99.

Scenario | Without lock | With SETNX lock
10-node fleet, 2,000 misses/node | 20,000 parallel DB queries | 1 DB query fleet-wide (in-process single-flight alone would give 10, 1 per node)
Rebuilder crashes mid-work | Herd repeats every TTL | Lock auto-expires after EX seconds
Lock EX too short | N/A | Second rebuilder races, causing duplicate writes

Mitigation 2: in-process single-flight

Distributed locks coordinate across the fleet; single-flight coordinates within one process. No Redis, no network round-trip — just an in-process map.

The pattern:

type SingleFlight struct { mu sync.Mutex; inflight map[string]*call }
  1. Request arrives; cache miss.
  2. Check in-process map for key. If a Promise / call already exists → subscribe to it, wait for resolution, return the shared result.
  3. If no entry → create a new entry (Promise), start the rebuild, add to map.
  4. When rebuild completes → resolve the Promise, remove from map. All subscribers get the result simultaneously.

Go ships this as singleflight.Group.Do in the golang.org/x/sync/singleflight package (a supplementary module, not part of the standard library). Node.js equivalents: p-memoize or a manual Map<key, Promise> pattern.

Cost: O(1) map lookup in process memory. No network. No lock acquire.

Scope: per-process only. A 50-node fleet has 50 independent in-process maps. At a TTL boundary with 100,000 concurrent requests evenly distributed, single-flight alone gives 50 DB queries (1 per node), not 100,000. Add a distributed lock to go from 50 to 1.
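The counting argument can be written down as a small function. It is a worst-case estimate under the same assumptions as the paragraph above (requests spread evenly, all arriving while the first rebuild is still running); `rebuilds` is a hypothetical helper name.

```go
package main

import "fmt"

// rebuilds estimates the worst-case number of DB queries for one hot key
// at a TTL boundary, given which mitigation layers are enabled.
func rebuilds(requests, nodes int, singleFlight, distributedLock bool) int {
	switch {
	case requests == 0:
		return 0
	case distributedLock:
		return 1 // one lock holder across the whole fleet
	case singleFlight:
		if requests < nodes {
			return requests
		}
		return nodes // one rebuild per process
	default:
		return requests // every miss goes to the DB
	}
}

func main() {
	fmt.Println(rebuilds(100000, 50, false, false)) // 100000
	fmt.Println(rebuilds(100000, 50, true, false))  // 50
	fmt.Println(rebuilds(100000, 50, true, true))   // 1
}
```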

Why this works

Facebook’s memcache “leases” (Nishtala et al., NSDI 2013) implement the same idea at the cache layer: on a miss the cache returns a 64-bit lease token bound to the requested key. Only the client holding the token may write back. Concurrent miss-clients receive no token and are told to wait briefly and retry. The result: peak DB query rate fell from 17K QPS to 1.3K QPS, roughly 13×, for a set of keys particularly prone to thundering herds.
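The lease idea can be illustrated with a toy, single-threaded model. It is not memcached's API: `leaseCache`, `get`, and `set` are hypothetical names, the key `user:42` is made up, and details such as lease expiry and rate-limited token issuance are left out.

```go
package main

import "fmt"

// leaseCache is a toy model of cache leases.
type leaseCache struct {
	values map[string]string
	leases map[string]uint64 // outstanding lease token per key
	next   uint64
}

func newLeaseCache() *leaseCache {
	return &leaseCache{values: map[string]string{}, leases: map[string]uint64{}}
}

// get returns the value on a hit. On a miss, the first caller receives a
// non-zero lease token; later callers receive token 0, meaning "wait and retry".
func (c *leaseCache) get(key string) (value string, hit bool, token uint64) {
	if v, ok := c.values[key]; ok {
		return v, true, 0
	}
	if _, outstanding := c.leases[key]; outstanding {
		return "", false, 0
	}
	c.next++
	c.leases[key] = c.next
	return "", false, c.next
}

// set stores value only if token matches the outstanding lease for key.
func (c *leaseCache) set(key, value string, token uint64) bool {
	if t, ok := c.leases[key]; !ok || t != token {
		return false
	}
	delete(c.leases, key)
	c.values[key] = value
	return true
}

func main() {
	c := newLeaseCache()
	_, _, leaseA := c.get("user:42") // first miss: receives the lease
	_, _, leaseB := c.get("user:42") // concurrent miss: no lease, must wait
	fmt.Println(leaseA != 0, leaseB == 0)                // true true
	fmt.Println(c.set("user:42", "other", leaseB))       // false
	fmt.Println(c.set("user:42", "row-from-db", leaseA)) // true
	v, hit, _ := c.get("user:42")
	fmt.Println(v, hit) // row-from-db true
}
```

Only the lease holder goes to the database, so a burst of misses on one key produces one query rather than one per client.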

Composing both layers

Neither layer is sufficient alone:

  • Single-flight only: 50-node fleet still sends 50 concurrent rebuilds.
  • Distributed lock only: every request still makes a Redis round-trip to try the lock, and all waiters (everyone except the single lock holder) receive nothing while the rebuild runs, which adds latency to every request at the boundary.

Combined stack:

  1. Check in-process map → if Promise in flight, subscribe and wait.
  2. No in-flight Promise → try SET lock:key uuid EX 30 NX.
  3. Lock acquired → register Promise, start rebuild, write value, delete lock, resolve Promise.
  4. Lock NOT acquired → retry GET cache after 50 ms. Return stale fallback if still missing.
Order the steps

Order the steps a request takes in a single-flight + Redis-lock stack:

  1. Cache GET returns nil (miss)
  2. Check in-process singleflight map; if a Promise exists, subscribe to it
  3. No Promise: try SET lock:key uuid EX 30 NX
  4. Lock acquired: register a new Promise and start the rebuild
  5. Rebuild completes: write value to cache with TTL, delete the Redis lock, resolve the Promise
  6. All in-process subscribers receive the resolved value via the shared Promise
  7. Lock NOT acquired: wait 50 ms, re-check cache, return stale fallback if still missing
Quiz

A 50-node fleet uses in-process single-flight only. At a TTL boundary 100,000 concurrent misses arrive. How many DB rebuilds happen?

Quiz

What is the role of the EX value in a SETNX-based lock?

Quiz

A cache lock uses EX=10 s. The rebuild takes 12 s. What happens?

Recall before you leave
  1. What is the practical difference between in-process single-flight and a Redis distributed lock, and when should you use each?
  2. Facebook memcache leases (NSDI 2013) reduced peak DB QPS from 17K to 1.3K. What mechanism achieves this?
  3. Why must the lock EX value be set to more than the rebuild p99, not just the rebuild average?
Recap

Two mitigations bound concurrent rebuilds without changing the cache TTL. In-process single-flight maintains a per-process map of in-flight Promises; every request that arrives while a rebuild is running subscribes to the same Promise instead of starting a new rebuild — zero network cost, zero coordination. A Redis SETNX distributed lock serialises rebuilds across the entire fleet using a SET key uuid EX N NX acquire and an explicit delete on completion. Composing both reduces 100,000 concurrent misses on a 50-node fleet to 1 DB query. The lock EX must exceed the rebuild p99; pair the lock with a stale fallback so waiters never block indefinitely.

