Caching
Dogpile: code and lock reading
The dogpile lives in the recompute path: the miss handler, the lock acquire, the TTL, the release. Read each snippet, find where the collision reopens or the lock deadlocks, and choose the fix a senior engineer would make first.
Practise reading coalescing and locking code the way you read it in an incident — trace the concurrent path, spot the missing timeout, the lost lease, or the unfenced write, and fix the mechanism, not the symptom.
Snippet 1 — the local single-flight
var group singleflight.Group
func getFeed(ctx context.Context, key string) ([]byte, error) {
	if v, ok := cache.Get(key); ok {
		return v, nil
	}
	// coalesce concurrent misses for this key onto one recompute
	v, err, _ := group.Do(key, func() (any, error) {
		out, err := recomputeFeed(ctx) // 200ms DB aggregation
		if err != nil {
			return nil, err
		}
		cache.Set(key, out, 5*time.Minute)
		return out, nil
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}
This runs on 20 instances behind a load balancer. How many DB recomputes fire per expiry instant, and what is the highest-leverage change to reach one?
Snippet 2 — the distributed lock with no TTL
func getFeedLocked(ctx context.Context, key string) ([]byte, error) {
	if v, ok := cache.Get(key); ok {
		return v, nil
	}
	lockKey := "lock:" + key
	// acquire: SET lock:key <token> NX (NO expiry argument)
	ok, _ := rdb.SetNX(ctx, lockKey, token, 0).Result()
	if !ok {
		time.Sleep(50 * time.Millisecond) // someone else recomputes
		return getFeedLocked(ctx, key)    // retry the read
	}
	defer rdb.Del(ctx, lockKey) // release on return
	out, err := recomputeFeed(ctx)
	if err != nil {
		return nil, err
	}
	cache.Set(key, out, 5*time.Minute)
	return out, nil
}
The SetNX uses expiry 0 (no TTL); release is via defer Del. What is the production failure, and the fix?
Snippet 3 — the fixed TTL that is too short for the tail
const lockTTL = 5 * time.Second // recompute p50 ~2s, p99 ~25s
func recomputeUnderLock(ctx context.Context, key, token string) error {
	ok, _ := rdb.SetNX(ctx, "lock:"+key, token, lockTTL).Result()
	if !ok {
		return errLockHeld // caller serves stale and retries later
	}
	out, err := recomputeFeed(ctx) // can take up to 25s on a cold shard
	if err != nil {
		return err
	}
	// lock may already have expired here on a slow recompute
	cache.Set(key, out, 5*time.Minute)
	rdb.Del(ctx, "lock:"+key) // unconditional delete
	return nil
}
With a 5s lock TTL and a 25s p99 recompute, name both bugs and the principled fix.
Snippet 4 — the lease-renewing holder
func recomputeWithLease(ctx context.Context, key, token string) error {
	if ok, _ := rdb.SetNX(ctx, "lock:"+key, token, 10*time.Second).Result(); !ok {
		return errLockHeld
	}
	// heartbeat: re-extend the lease every 3s while we work
	stop := make(chan struct{})
	go func() {
		t := time.NewTicker(3 * time.Second)
		defer t.Stop()
		for {
			select {
			case <-stop:
				return
			case <-t.C:
				// PEXPIRE lock:key 10000 if token still ours
				renewIfOwner(ctx, "lock:"+key, token, 10*time.Second)
			}
		}
	}()
	out, err := recomputeFeed(ctx)
	close(stop)
	if err != nil {
		return err
	}
	cache.Set(key, out, 5*time.Minute)
	releaseIfOwner(ctx, "lock:"+key, token) // CAS delete
	return nil
}
The lease renews every 3s with a 10s TTL and releases via CAS-on-token. A long stop-the-world GC pause (≥10s) hits this worker mid-recompute. What can still go wrong, and what guards against it?
Every dogpile fix starts with reading the lock path. Local single-flight caps recomputes at the instance count, not one; a distributed lock is what coordinates the fleet. A lock with no TTL blocks every reader indefinitely if the holder dies before releasing, so always acquire with SET NX PX. A fixed TTL shorter than the worst-case recompute reopens the herd and risks deleting someone else's lock, so renew the lease on a heartbeat and release conditionally on your token. Even a renewed lease can lapse under a long STW pause or a partition, so fence the write with a monotonic token (or a versioned CAS) to stop a resumed holder from overwriting fresh data. Fix the lock lifecycle, then re-test under a slow-recompute scenario and a pause-injection scenario.