Dogpile: code and lock reading

Read real single-flight, distributed-lock, and lease-renewal snippets, predict the dogpile behaviour, and pick the highest-leverage fix a senior would make first.

CACHE Senior ◷ 14 min

The dogpile lives in the recompute path: the miss handler, the lock acquire, the TTL, the release. Read each snippet, find where the collision reopens or the lock deadlocks, and choose the fix a senior engineer would make first.

Goal

Practise reading coalescing and locking code the way you read it in an incident — trace the concurrent path, spot the missing timeout, the lost lease, or the unfenced write, and fix the mechanism, not the symptom.

Snippet 1 — the local single-flight

var group singleflight.Group

func getFeed(ctx context.Context, key string) ([]byte, error) {
    if v, ok := cache.Get(key); ok {
        return v, nil
    }
    // coalesce concurrent misses for this key onto one recompute
    v, err, _ := group.Do(key, func() (any, error) {
        out, err := recomputeFeed(ctx) // 200ms DB aggregation
        if err != nil {
            return nil, err
        }
        cache.Set(key, out, 5*time.Minute)
        return out, nil
    })
    if err != nil {
        return nil, err
    }
    return v.([]byte), nil
}

Quiz

This runs on 20 instances behind a load balancer. How many DB recomputes fire per expiry instant, and what is the highest-leverage change to reach one?
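
One way to check your answer: each instance owns its own group, so coalescing stops at the process boundary. The sketch below is a self-contained simulation rather than the lesson's code. `flightGroup` is a hypothetical, minimal stand-in for `singleflight.Group`, and the `entered` counter and `gate` channel exist only to make the run deterministic. It counts how many recomputes fire when every caller on every instance misses at the same instant.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call is one in-flight recompute that followers wait on.
type call struct {
	done chan struct{}
	val  []byte
}

// flightGroup is a hypothetical, minimal stand-in for singleflight.Group.
type flightGroup struct {
	mu      sync.Mutex
	calls   map[string]*call
	entered *int64 // simulation hook: callers that have started or joined a call
}

// Do runs fn once per key at a time; concurrent callers share the result.
func (g *flightGroup) Do(key string, fn func() []byte) []byte {
	g.mu.Lock()
	if c, ok := g.calls[key]; ok {
		atomic.AddInt64(g.entered, 1)
		g.mu.Unlock()
		<-c.done // follower: wait for the leader's result
		return c.val
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	atomic.AddInt64(g.entered, 1)
	g.mu.Unlock()

	c.val = fn() // leader: the only recompute on this instance
	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	close(c.done)
	return c.val
}

// simulateExpiry models one expiry instant: every caller on every instance
// misses the cache at once. It returns the number of recomputes that ran.
func simulateExpiry(instances, callersPerInstance int) int64 {
	var recomputes, entered int64
	gate := make(chan struct{}) // holds every recompute open until all callers arrive
	var wg sync.WaitGroup
	for i := 0; i < instances; i++ {
		g := &flightGroup{calls: map[string]*call{}, entered: &entered} // one group per instance
		for j := 0; j < callersPerInstance; j++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				g.Do("feed", func() []byte {
					atomic.AddInt64(&recomputes, 1)
					<-gate
					return []byte("feed")
				})
			}()
		}
	}
	for atomic.LoadInt64(&entered) < int64(instances*callersPerInstance) {
		time.Sleep(time.Millisecond)
	}
	close(gate)
	wg.Wait()
	return atomic.LoadInt64(&recomputes)
}

func main() {
	fmt.Println(simulateExpiry(20, 50)) // 20
}
```

Under these assumptions the count equals the instance count no matter how many callers each instance has. Reaching one requires coordination that spans the fleet, such as a distributed lock with a TTL, with other instances waiting or serving stale while the single holder recomputes.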

Snippet 2 — the distributed lock with no TTL

func getFeedLocked(ctx context.Context, key string) ([]byte, error) {
    if v, ok := cache.Get(key); ok {
        return v, nil
    }
    lockKey := "lock:" + key
    // acquire: SET lock:key <token> NX   (NO expiry argument)
    ok, _ := rdb.SetNX(ctx, lockKey, token, 0).Result()
    if !ok {
        time.Sleep(50 * time.Millisecond) // someone else recomputes
        return getFeedLocked(ctx, key)     // retry the read
    }
    defer rdb.Del(ctx, lockKey)            // release on return
    out, err := recomputeFeed(ctx)
    if err != nil {
        return nil, err
    }
    cache.Set(key, out, 5*time.Minute)
    return out, nil
}

Quiz

The SetNX uses expiry 0 (no TTL); release is via defer Del. What is the production failure, and the fix?

Snippet 3 — the fixed TTL that is too short for the tail

const lockTTL = 5 * time.Second // recompute p50 ~2s, p99 ~25s

func recomputeUnderLock(ctx context.Context, key, token string) error {
    ok, _ := rdb.SetNX(ctx, "lock:"+key, token, lockTTL).Result()
    if !ok {
        return errLockHeld // caller serves stale and retries later
    }
    out, err := recomputeFeed(ctx) // can take up to 25s on a cold shard
    if err != nil {
        return err
    }
    // lock may already have expired here on a slow recompute
    cache.Set(key, out, 5*time.Minute)
    rdb.Del(ctx, "lock:"+key) // unconditional delete
    return nil
}

Quiz

With a 5s lock TTL and a 25s p99 recompute, name both bugs and the principled fix.
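
The timeline behind this quiz can be replayed in a few lines. The sketch below uses a hypothetical in-memory `sim` type with a simulated clock, not a real Redis, and compares the snippet's unconditional delete with a release that checks the token first. Worker names and timings are made up for illustration: A's lock lapses at 5s, B acquires at 6s, and A finishes its slow recompute at 8s.

```go
package main

import (
	"fmt"
	"time"
)

type lock struct {
	token   string
	expires time.Duration // offset on the simulated clock
}

// sim is a hypothetical in-memory model of the lock key with a manual clock.
type sim struct {
	now   time.Duration
	locks map[string]lock
}

// acquire mimics SET key token NX PX ttl.
func (s *sim) acquire(key, token string, ttl time.Duration) bool {
	if l, held := s.locks[key]; held && s.now < l.expires {
		return false
	}
	s.locks[key] = lock{token: token, expires: s.now + ttl}
	return true
}

// del is the unconditional delete from the snippet.
func (s *sim) del(key string) { delete(s.locks, key) }

// delIfOwner deletes only when the stored token is still the caller's.
func (s *sim) delIfOwner(key, token string) bool {
	l, held := s.locks[key]
	if !held || s.now >= l.expires || l.token != token {
		return false
	}
	delete(s.locks, key)
	return true
}

// thirdWorkerGetsIn replays the timeline and reports whether worker C can
// acquire the lock while worker B is still recomputing under it.
func thirdWorkerGetsIn(conditional bool) bool {
	s := &sim{locks: map[string]lock{}}
	s.acquire("lock:feed", "A", 5*time.Second) // t=0: A acquires
	s.now = 6 * time.Second                    // A's lock lapsed at t=5
	s.acquire("lock:feed", "B", 5*time.Second) // t=6: B acquires; the herd has reopened
	s.now = 8 * time.Second                    // t=8: A finishes its slow recompute
	if conditional {
		s.delIfOwner("lock:feed", "A") // no-op: the lock now belongs to B
	} else {
		s.del("lock:feed") // removes B's lock
	}
	return s.acquire("lock:feed", "C", 5*time.Second)
}

func main() {
	fmt.Println(thirdWorkerGetsIn(false)) // true: the unconditional delete let C in
	fmt.Println(thirdWorkerGetsIn(true))  // false: B's lock survived
}
```

The conditional release only addresses the second bug. Against a real Redis the compare and the delete must execute atomically, which is commonly done with a short server-side script. The first bug, the lock lapsing mid-recompute, still needs a TTL that covers the tail or a lease that is renewed while the work runs.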

Snippet 4 — the lease-renewing holder

func recomputeWithLease(ctx context.Context, key, token string) error {
    if ok, _ := rdb.SetNX(ctx, "lock:"+key, token, 10*time.Second).Result(); !ok {
        return errLockHeld
    }
    // heartbeat: re-extend the lease every 3s while we work
    stop := make(chan struct{})
    go func() {
        t := time.NewTicker(3 * time.Second)
        defer t.Stop()
        for {
            select {
            case <-stop:
                return
            case <-t.C:
                // PEXPIRE lock:key 10000 if token still ours
                renewIfOwner(ctx, "lock:"+key, token, 10*time.Second)
            }
        }
    }()
    out, err := recomputeFeed(ctx)
    close(stop)
    if err != nil {
        return err
    }
    cache.Set(key, out, 5*time.Minute)
    releaseIfOwner(ctx, "lock:"+key, token) // CAS delete
    return nil
}

Quiz

The lease renews every 3s with a 10s TTL and releases via CAS-on-token. A long stop-the-world GC pause (≥10s) hits this worker mid-recompute. What can still go wrong, and what guards against it?
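
A pause that long also freezes the renewal goroutine, so the lease can lapse and a second worker can acquire the lock and write fresh data. The CAS release then protects the lock, but nothing protects `cache.Set`. One common guard is a fencing token, sketched below with an in-memory store. `fencedCache`, `SetFenced`, and `replayPause` are hypothetical names for this illustration. The sketch assumes every successful lock acquisition is handed a strictly increasing number and that the store rejects writes carrying a lower number than one it has already accepted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errStaleFence = errors.New("stale fencing token")

// fencedCache is a hypothetical store that remembers the highest fencing
// token it has accepted per key and rejects writes carrying a lower one.
type fencedCache struct {
	mu     sync.Mutex
	values map[string]string
	fences map[string]uint64
}

func newFencedCache() *fencedCache {
	return &fencedCache{values: map[string]string{}, fences: map[string]uint64{}}
}

// SetFenced writes val only if fence is not older than the last accepted token.
func (c *fencedCache) SetFenced(key, val string, fence uint64) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if fence < c.fences[key] {
		return errStaleFence
	}
	c.fences[key] = fence
	c.values[key] = val
	return nil
}

func (c *fencedCache) Get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[key]
}

// replayPause replays the quiz scenario and returns the final cached value
// together with the error seen by the resumed worker.
func replayPause() (string, error) {
	cache := newFencedCache()
	var nextFence uint64
	issue := func() uint64 { nextFence++; return nextFence } // one token per acquisition

	fenceA := issue() // worker A acquires the lock, then stalls in a long pause
	fenceB := issue() // A's lease lapses; worker B acquires and gets a higher token
	if err := cache.SetFenced("feed", "fresh from B", fenceB); err != nil {
		return "", err
	}
	err := cache.SetFenced("feed", "stale from A", fenceA) // A resumes and tries to write
	return cache.Get("feed"), err
}

func main() {
	val, err := replayPause()
	fmt.Println(val, err) // fresh from B stale fencing token
}
```

In a real deployment the token check and the write have to happen atomically inside the store itself, and the token source has to be shared by all workers. Both are assumptions this in-process sketch sidesteps with a mutex and a local counter.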

Recap

Every dogpile fix starts with reading the lock path. Local single-flight caps recomputes at the instance count, not at one; a distributed lock is what coordinates the fleet. A lock with no TTL blocks every reader indefinitely if the holder dies, so always acquire with SET NX PX. A fixed TTL shorter than the worst-case recompute reopens the herd and risks deleting someone else’s lock, so renew the lease on a heartbeat and release conditionally on your token. Even a renewed lease can lapse under a long STW pause or a partition, so fence the write with a monotonic token (or a versioned CAS) to stop a resumed holder from overwriting fresh data. Fix the lock lifecycle, then re-test under a slow-recompute scenario and a pause-injection scenario. When you next read unfamiliar coalescing code during an incident, you know which three lines to look for first: the lock’s expiry argument, the release condition, and whether anything fences the write.
