Distributed Systems
Retry amplification: code reading
Retry bugs hide in code that looks correct in a unit test and detonates under a real outage. Read each snippet, predict how it behaves when the dependency is down, and choose the fix a senior engineer would make first.
Practise the loop you run on every retry config: read the backoff, the budget, and the call graph; predict the fan-out under failure; and reach for the highest-leverage fix before adding more retries.
Snippet 1 — the backoff that does not jitter
func callWithRetry(ctx context.Context, fn func() error) error {
	base := 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		// exponential, but no jitter
		sleep := base * time.Duration(1<<attempt) // 100, 200, 400, 800, 1600 ms
		time.Sleep(sleep)
	}
	return err
}
10,000 clients all call this against a dependency that just blipped. What goes wrong, and what is the one-line fix?
Snippet 2 — the retry budget
// token-bucket retry budget: retries may consume at most ~10% of request volume
type RetryBudget struct {
	mu     sync.Mutex
	tokens float64
}

func (b *RetryBudget) OnRequest() { b.mu.Lock(); b.tokens += 0.1; b.mu.Unlock() } // +0.1 per request

func (b *RetryBudget) TryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= 1 {
		b.tokens -= 1 // each retry costs 1 token
		return true
	}
	return false // budget exhausted: fail fast, do not retry
}
The dependency is fully down: every request fails. What is the steady-state retry rate this budget allows, and what does that achieve?
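The arithmetic can be checked with a small simulation: each request deposits 0.1 token and each retry withdraws 1, so in steady state about one retry is allowed per ten requests. The sketch below is a variant of the budget, not the original: it counts in integer tenths of a token to sidestep floating-point rounding, and it adds a cap on the balance so a long healthy period cannot bank a large burst of retries. Both changes, and the simulateOutage helper, are assumptions made for illustration.

```go
package main

import "sync"

// RetryBudget counts in tenths of a token. The max cap is an assumption of
// this sketch and is not part of the original snippet.
type RetryBudget struct {
	mu     sync.Mutex
	tenths int // current balance, in tenths of a token
	max    int // upper bound on the balance, in tenths of a token
}

// OnRequest deposits 0.1 token for every first-attempt request.
func (b *RetryBudget) OnRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tenths < b.max {
		b.tenths++
	}
}

// TryRetry withdraws one whole token if available.
func (b *RetryBudget) TryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tenths >= 10 {
		b.tenths -= 10
		return true
	}
	return false // budget exhausted: fail fast
}

// simulateOutage sends n first-attempt requests that all fail and returns how
// many retries the budget allowed. Retries do not deposit tokens here.
func simulateOutage(b *RetryBudget, n int) int {
	retries := 0
	for i := 0; i < n; i++ {
		b.OnRequest()
		if b.TryRetry() { // the request failed; ask for a retry
			retries++
		}
	}
	return retries
}
```

Starting from an empty budget, simulateOutage(&RetryBudget{max: 100}, 10000) returns 1000: the failing dependency sees about 1.1x its normal request volume rather than a multiple of it, and the remaining callers fail fast.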
Snippet 3 — nested retries
// data layer
func (d *DataLayer) Read(ctx context.Context, k string) (V, error) {
	return retry(3, func() (V, error) { return d.pool.Read(ctx, k) }) // retries 3x
}

// service layer
func (s *Service) Get(ctx context.Context, k string) (V, error) {
	return retry(3, func() (V, error) { return s.data.Read(ctx, k) }) // retries 3x, calling the above
}

// gateway
func (g *Gateway) Handle(ctx context.Context, k string) (V, error) {
	return retry(3, func() (V, error) { return g.svc.Get(ctx, k) }) // retries 3x, calling the above
}
For one request whose every attempt fails at the pool, and assuming retry(3, ...) makes 3 attempts in total, how many calls hit the connection pool, and what is the correct structural fix?
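The multiplication can be counted directly. The sketch below assumes that retry(3, ...) means three attempts in total (if it meant three retries after the first try, each layer would make four attempts). The retry helper, errDown, and poolCalls are hypothetical stand-ins for the code in the snippet, and the generic helper needs Go 1.18 or later.

```go
package main

import "errors"

var errDown = errors.New("dependency down")

// retry calls fn up to attempts times and returns the first success,
// or the last error if every attempt fails.
func retry[T any](attempts int, fn func() (T, error)) (T, error) {
	var v T
	var err error
	for i := 0; i < attempts; i++ {
		if v, err = fn(); err == nil {
			return v, nil
		}
	}
	return v, err
}

// poolCalls reports how many times the pool is hit for one incoming request
// when the gateway, service, and data layers each make the given number of
// attempts and the pool always fails.
func poolCalls(gateway, service, data int) int {
	calls := 0
	pool := func() (string, error) { calls++; return "", errDown }
	dataRead := func() (string, error) { return retry(data, pool) }
	svcGet := func() (string, error) { return retry(service, dataRead) }
	_, _ = retry(gateway, svcGet)
	return calls
}
```

Under these assumptions poolCalls(3, 3, 3) returns 27, while poolCalls(1, 1, 3) returns 3: retrying at a single layer and propagating the error through the others keeps the fan-out linear in the attempt count.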
Snippet 4 — the half-open breaker
func (b *Breaker) Call(fn func() error) error {
	switch b.state {
	case Open:
		if time.Since(b.openedAt) < b.cooldown {
			return ErrOpen // fail fast, no network call
		}
		b.state = HalfOpen // cooldown elapsed: allow probes
		fallthrough
	case HalfOpen:
		err := fn()
		if err != nil {
			// probe failed: re-open
			b.state = Open
			b.openedAt = time.Now()
			return err
		}
		b.state = Closed // probe succeeded: resume normal traffic
		return nil
	default: // Closed
		return b.trackFailures(fn)
	}
}
In the HalfOpen state this code lets every concurrent caller through at once. Under a busy service, why is that dangerous and what is the fix?
Every retry incident can be read in the code: backoff without jitter re-synchronizes the herd (full jitter is the one-line fix); a token-bucket retry budget caps retries at roughly 10% of request volume instead of letting amplification grow unbounded; nested retries across N layers multiply to attempts^N calls, so 3 attempts at each of 3 layers is 27 (retry at one layer, propagate errors through the others); and a half-open breaker must admit a single probe, not a flood. Predict the fan-out under failure, fix the structure, then re-test under the same synchronized load.