Proxy and load balancing: survive a failing backend

Hands-on project — front a backend pool with a real load balancer, drive it into a retry storm, then apply the unit's resilience ladder until the cluster survives a failing node, proven with before/after metrics.

NET Senior ◷ 240 min

Reading about retry storms is not the same as keeping a cluster alive while one node falls over. Stand up a small backend pool behind a real proxy, inject a failure, watch a 500 ms blip try to become an outage, and apply the unit’s resilience controls until the cluster rides it out — with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible loop: balance with a load-aware algorithm, detect failure with active and passive health checks, drain gracefully, bound the cascade with circuit breakers and retry budgets, externalize session state, and verify each control with before/after metrics under identical load.
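One way to picture the retry-budget control in that list is a counter that grants retries only while they stay under a fixed share of real requests. The sketch below is a simplified illustration, not the mechanism of any particular proxy: the class name `RetryBudget`, the 10% figure, and the cumulative counters (a production budget would use a sliding time window) are all assumptions made for the example.

```python
class RetryBudget:
    """Grant retries only while they stay under a fixed share of requests.

    Hypothetical helper for illustration; real proxies track this over a
    sliding window rather than with cumulative counters.
    """

    def __init__(self, retry_percent: int = 10):
        self.retry_percent = retry_percent  # assumed cap: retries add at most 10% load
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one first-attempt request."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Return True and spend budget if a retry is still affordable."""
        allowed = self.requests * self.retry_percent // 100
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


budget = RetryBudget(retry_percent=10)
for _ in range(100):
    budget.record_request()

# 50 failed calls all ask to retry; only the budgeted share is granted.
granted = sum(budget.try_retry() for _ in range(50))
print(granted)  # 10
```

The point of the design is that retry volume scales with healthy traffic instead of with failures, so a failing node cannot multiply the load on its neighbours.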

Project

Objective

Front a pool of 3–5 backend instances with a real load balancer (Envoy, nginx, or HAProxy), inject a single-node failure under load, and prove the cluster stays available — keeping error rate under 0.1% and p99 within target — by applying the unit's resilience ladder, with measurements at each step.
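To make the objective checkable, each load run can be reduced to the same two numbers. The sketch below assumes you have already collected `(latency_ms, http_status)` pairs from your load generator; the function name `summarize`, the nearest-rank p99 method, the choice to count only 5xx responses as errors, and the 250 ms target in the usage lines are assumptions for illustration.

```python
def summarize(samples, p99_target_ms, max_error_rate=0.001):
    """Reduce one load run to error rate, p99 latency, and a pass/fail flag.

    samples: list of (latency_ms, http_status) tuples (assumed input shape).
    """
    total = len(samples)
    errors = sum(1 for _, status in samples if status >= 500)
    latencies = sorted(latency for latency, _ in samples)
    # Nearest-rank p99 with integer math: ceil(0.99 * total), 1-based rank.
    rank = (99 * total + 99) // 100
    p99 = latencies[rank - 1]
    error_rate = errors / total
    return {
        "error_rate": error_rate,
        "p99_ms": p99,
        "pass": error_rate < max_error_rate and p99 <= p99_target_ms,
    }


# Hypothetical run: 1999 fast successes and one slow 503.
run = [(20, 200)] * 1999 + [(900, 503)]
result = summarize(run, p99_target_ms=250)
print(result["error_rate"], result["p99_ms"], result["pass"])
```

Running the same function over the baseline run and over each post-control run keeps the comparison honest, because the thresholds and the percentile method never change between rows.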

Requirements

Acceptance criteria

A before/after table under identical load covering error rate, p99 latency, retry rate, and time-to-detect for the failed node. Every value is measured, not estimated, and the table shows both error rate and retry rate back under 0.1% after the controls are applied.
Evidence that active and passive health checks each detected a distinct failure mode (a hard crash vs a TCP-accepting node returning 500s).
A rolling restart with connection draining produces zero client-visible connection resets in the access logs, while an abrupt kill (draining disabled) does produce them — both captured.
A short write-up that lists the controls in priority order, identifies which one stopped the cascade, and explains why circuit breakers and retry budgets ranked above simply adding capacity.
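The second criterion, that active and passive checks catch different failure modes, can be seen in a small simulation before you wire up the real proxy. The sketch below makes two assumptions: the active probe is TCP-only (an HTTP probe against a health endpoint could also catch the 500s), and the passive check ejects a node after five consecutive 5xx responses. The `Backend` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    accepts_tcp: bool = True   # simulated: does the port accept connections?
    status: int = 200          # simulated: status returned to live requests
    consecutive_5xx: int = 0
    ejected: bool = False


def active_check(backend: Backend) -> bool:
    """Active, TCP-level probe: healthy if the port accepts a connection."""
    return backend.accepts_tcp


def observe_response(backend: Backend, threshold: int = 5) -> None:
    """Passive check: eject after `threshold` consecutive 5xx on live traffic."""
    if backend.status >= 500:
        backend.consecutive_5xx += 1
    else:
        backend.consecutive_5xx = 0
    if backend.consecutive_5xx >= threshold:
        backend.ejected = True


crashed = Backend("b1", accepts_tcp=False)            # hard crash
zombie = Backend("b2", accepts_tcp=True, status=500)  # accepts TCP, returns 500s

for _ in range(5):
    observe_response(zombie)

print(active_check(crashed), active_check(zombie), zombie.ejected)  # False True True
```

The crashed node fails the probe even with no traffic flowing, while the node that still accepts connections passes the probe and is only caught by watching real responses. That pairing is the evidence the criterion asks you to capture from the real load balancer.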

Senior stretch

Add a second load balancer and put the pair behind a shared VIP (keepalived/VRRP locally, or anycast in a lab) so killing one LB does not drop all traffic; measure failover time.
Split the proxy into an L4 tier in front of an L7 tier and route /api/ vs /static/ to different pools at L7; preserve the real client IP end-to-end with X-Forwarded-For or PROXY protocol and verify the backend logs the true client.
Add slow-start: when a recovered backend rejoins, ramp its traffic 10% → 100% over a few minutes and show its p99 stays bounded versus an immediate full-load rejoin.
Build a one-page on-call runbook: triage from the dashboard, the retry-storm signature, the resilience ladder (algorithm → health → drain → breaker → budget → externalize state), and a verification checklist.
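For the slow-start item above, the ramp can be reasoned about as a weight that grows with time since the backend rejoined. The sketch below assumes a linear ramp from 10% to 100% over a 180-second window; the function name, the window length, and the linear shape are illustrative choices, and real load balancers may shape the curve differently.

```python
def slow_start_weight(elapsed_s: float, window_s: float = 180.0, floor: float = 0.10) -> float:
    """Fraction of normal traffic share for a backend that rejoined elapsed_s ago.

    Assumed linear ramp from `floor` to 1.0 over `window_s` seconds.
    """
    if elapsed_s <= 0:
        return floor
    if elapsed_s >= window_s:
        return 1.0
    return floor + (1.0 - floor) * (elapsed_s / window_s)


for t in (0, 90, 180):
    print(t, round(slow_start_weight(t), 2))
```

Multiplying the backend's configured weight by this factor gives the effective weight at each moment, which is the value to plot next to its p99 when you compare the ramped rejoin with the immediate full-load rejoin.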

Recap

This is the loop you will run on every real LB incident: balance with a load-aware algorithm, detect failure with active AND passive checks, drain in-flight requests before removing a node, bound the cascade with circuit breakers and a retry budget before you reach for capacity, externalize session state so any backend can resume any request, and verify every control with before/after numbers under identical load. Doing it once on a toy pool makes the production version muscle memory.
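As a reminder of what "bound the cascade" means mechanically, here is a minimal circuit-breaker state machine. It is a sketch under stated assumptions, not the breaker of any specific proxy: the class name, the consecutive-failure threshold, and the fixed cool-down are hypothetical, and the half-open state is simplified to let every request through after the cool-down until one result is recorded.

```python
class CircuitBreaker:
    """Closed -> open after N consecutive failures -> trial after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 10.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self, now: float) -> bool:
        """Return True if a request may be sent to the backend at time `now`."""
        if self.opened_at is None:
            return True
        # Half-open (simplified): after the cool-down, allow trial traffic.
        return now - self.opened_at >= self.reset_after_s

    def record(self, success: bool, now: float) -> None:
        """Feed back the outcome of a request sent at time `now`."""
        if success:
            self.failures = 0
            self.opened_at = None          # trial succeeded: close the breaker
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now       # open, or re-open after a failed trial


breaker = CircuitBreaker(failure_threshold=3, reset_after_s=10.0)
for t in (0.0, 1.0, 2.0):
    breaker.record(success=False, now=t)   # third failure opens the breaker at t=2

print(breaker.allow(now=5.0))    # False: still cooling down
print(breaker.allow(now=12.0))   # True: cool-down elapsed, trial allowed
breaker.record(success=True, now=12.0)
print(breaker.allow(now=12.5))   # True: breaker closed again
```

Failing fast while the breaker is open is what keeps callers from piling retries onto a node that is already down.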
