Retry amplification: code reading

Read real retry code (backoff without jitter, a retry budget, nested-retry fan-out, a half-open breaker) and pick the highest-leverage fix a senior engineer makes first.

DIST Senior ◷ 14 min

Retry bugs hide in code that looks correct in a unit test and detonates under a real outage. Read each snippet, predict how it behaves when the dependency is down, and choose the fix a senior makes first.

Goal

Practise the loop you run on every retry config: read the backoff, the budget, and the call graph; predict the fan-out under failure; and reach for the highest-leverage fix before adding more retries.

Snippet 1 — the backoff that does not jitter

func callWithRetry(ctx context.Context, fn func() error) error {
    base := 100 * time.Millisecond
    var err error
    for attempt := 0; attempt < 5; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        // exponential, but no jitter
        sleep := base * time.Duration(1<<attempt) // 100, 200, 400, 800, 1600 ms
        time.Sleep(sleep)
    }
    return err
}

Quiz

10,000 clients all call this against a dependency that just blipped. What goes wrong, and what is the one-line fix?

Snippet 2 — the retry budget

// token-bucket retry budget: retries may consume at most ~10% of request volume
type RetryBudget struct {
    mu     sync.Mutex
    tokens float64
}

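// OnRequest deposits 0.1 token for every request sent.
func (b *RetryBudget) OnRequest() {
    b.mu.Lock()
    b.tokens += 0.1 // +0.1 per request
    b.mu.Unlock()
}

// TryRetry spends one token if one is available and reports whether a retry is allowed.
func (b *RetryBudget) TryRetry() bool {
    b.mu.Lock()
    defer b.mu.Unlock()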
    if b.tokens >= 1 {
        b.tokens -= 1 // each retry costs 1 token
        return true
    }
    return false // budget exhausted: fail fast, do not retry
}

Quiz

The dependency is fully down: every request fails. What is the steady-state retry rate this budget allows, and what does that achieve?
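Each request deposits 0.1 token and each retry spends 1, so during a full outage the budget permits roughly one retry per ten requests. The dependency then sees about 1.1x the offered load instead of a multiple of it. The simulation below is a rough sketch of that arithmetic under simplifying assumptions: it is single-goroutine, retries do not themselves deposit tokens, and the hypothetical `budget` type counts in integer tenths of a token so the result is exact (the `float64` version above accumulates rounding error, since ten additions of 0.1 fall just short of 1).

```go
package main

import "fmt"

// budget is a single-goroutine stand-in for RetryBudget above. It counts in
// integer tenths of a token (deposit 1, spend 10) so the arithmetic is exact;
// that is a deviation from the float64 snippet, made for this illustration.
type budget struct{ tenths int }

func (b *budget) onRequest() { b.tenths++ }

func (b *budget) tryRetry() bool {
	if b.tenths >= 10 {
		b.tenths -= 10
		return true
	}
	return false
}

// simulateOutage sends n requests to a dependency that always fails and
// returns how many retries the budget permitted.
func simulateOutage(n int) (retries int) {
	b := &budget{}
	for i := 0; i < n; i++ {
		b.onRequest() // first attempt: always fails in this simulation
		if b.tryRetry() {
			retries++ // the retry fails too, but it is the only extra call
		}
	}
	return retries
}

func main() {
	n := 1000
	r := simulateOutage(n)
	fmt.Printf("requests=%d retries=%d load=%.2fx\n", n, r, float64(n+r)/float64(n))
	// prints: requests=1000 retries=100 load=1.10x
}
```

One thing the simulation does not cover: the bucket in the snippet has no cap, so tokens saved up during a long healthy period would allow a burst of retries at the start of an outage before the 10% steady state takes over.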

Snippet 3 — nested retries

// data layer: up to 3 attempts against the pool
func (d *DataLayer) Read(ctx context.Context, k string) (V, error) {
    return retry(3, func() (V, error) { return d.pool.Read(ctx, k) })
}

// service layer: up to 3 attempts, each one calling DataLayer.Read
func (s *Service) Get(ctx context.Context, k string) (V, error) {
    return retry(3, func() (V, error) { return s.data.Read(ctx, k) })
}

// gateway: up to 3 attempts, each one calling Service.Get
func (g *Gateway) Handle(ctx context.Context, k string) (V, error) {
    return retry(3, func() (V, error) { return g.svc.Get(ctx, k) })
}

Quiz

For one request that fails at the pool, how many calls hit the connection pool, and what is the correct structural fix?
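The fan-out can be checked by counting. The sketch below assumes `retry(3, ...)` means "up to 3 attempts in total" (the snippet does not show the helper, so this is an assumption) and uses a simplified error-only `retry` plus a hypothetical `poolCalls` counter. With three layers of three attempts, one failing request produces 3 x 3 x 3 = 27 pool calls; retrying at a single layer and propagating the error through the others brings that down to 3.

```go
package main

import (
	"errors"
	"fmt"
)

var errDown = errors.New("pool unavailable")

// retry calls fn up to attempts times and returns the last error.
// Assumed semantics for the retry helper in the snippet above.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// poolCalls counts calls reaching a pool that always fails, for a stack of
// layers where layer i makes up to attempts[i] attempts (outermost first).
func poolCalls(attempts []int) int {
	calls := 0
	var call func(depth int) error
	call = func(depth int) error {
		if depth == len(attempts) {
			calls++ // reached the pool, which is down
			return errDown
		}
		return retry(attempts[depth], func() error { return call(depth + 1) })
	}
	_ = call(0)
	return calls
}

func main() {
	fmt.Println(poolCalls([]int{3, 3, 3})) // gateway, service and data layer all retry: 27
	fmt.Println(poolCalls([]int{1, 1, 3})) // only the data layer retries: 3
}
```

Which single layer keeps the retry is a design choice; the sketch keeps it closest to the pool, but the point is only that the other layers pass the error up unchanged.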

Snippet 4 — the half-open breaker

func (b *Breaker) Call(fn func() error) error {
    switch b.state {
    case Open:
        if time.Since(b.openedAt) < b.cooldown {
            return ErrOpen // fail fast, no network call
        }
        b.state = HalfOpen // cooldown elapsed: allow probes
        fallthrough
    case HalfOpen:
        err := fn()
        if err != nil {
            b.state = Open // probe failed: re-open
            b.openedAt = time.Now()
            return err
        }
        b.state = Closed // probe succeeded: resume normal traffic
        return nil
    default: // Closed
        return b.trackFailures(fn)
    }
}

Quiz

In the HalfOpen state this code lets every concurrent caller through at once. Under a busy service, why is that dangerous and what is the fix?

Recap

Every retry incident can be read in the code. Backoff without jitter re-synchronizes the herd (full jitter is the one-line fix). A token-bucket retry budget converts unbounded amplification into a ceiling of roughly 10% extra load. Nested retries multiply: N layers that each make r attempts send r^N calls to the bottom layer, so retry at one layer and propagate errors everywhere else. A half-open breaker must admit a single probe, not a flood. Predict the fan-out under failure, fix the structure, then re-test under the same synchronized load.
