Networking & Protocols
Proxy and load balancing: survive a failing backend
Reading about retry storms is not the same as keeping a cluster alive while one node falls over. Stand up a small backend pool behind a real proxy, inject a failure, watch a 500 ms blip try to become an outage, and apply the unit’s resilience controls until the cluster rides it out — with evidence at every step.
Turn the unit’s mental model into a reproducible loop: balance with a load-aware algorithm, detect failure with active and passive health checks, drain gracefully, bound the cascade with circuit breakers and retry budgets, externalize session state, and verify each control with before/after metrics under identical load.
Front a pool of 3–5 backend instances with a real load balancer (Envoy, nginx, or HAProxy), inject a single-node failure under load, and prove the cluster stays available — keeping error rate under 0.1% and p99 within target — by applying the unit's resilience ladder, with measurements at each step.
- A before/after table under identical load: error rate, p99 latency, retry rate, and time-to-detect the failed node — measured, not estimated — showing error rate back under 0.1% and the retry rate back under 0.1% after the controls are applied.
- Evidence that active and passive health checks each detected a distinct failure mode (a hard crash vs a TCP-accepting node returning 500s).
- A rolling restart with connection draining produces zero client-visible connection resets in the access logs, while an abrupt kill (draining disabled) does produce them — both captured.
- A short write-up naming, in priority order, which control stopped the cascade and why circuit breakers and retry budgets ranked above simply adding capacity.
- Add a second load balancer and put the pair behind a shared VIP (keepalived/VRRP locally, or anycast in a lab) so killing one LB does not drop all traffic; measure failover time.
- Split the proxy into an L4 tier in front of an L7 tier and route /api/ vs /static/ to different pools at L7; preserve the real client IP end-to-end with X-Forwarded-For or PROXY protocol and verify the backend logs the true client.
- Add slow-start: when a recovered backend rejoins, ramp its traffic 10% → 100% over a few minutes and show its p99 stays bounded versus an immediate full-load rejoin.
- Build a one-page on-call runbook: triage from the dashboard, the retry-storm signature, the resilience ladder (algorithm → health → drain → breaker → budget → externalize state), and a verification checklist.
This is the loop you will run on every real LB incident: balance with a load-aware algorithm, detect failure with active AND passive checks, drain in-flight requests before removing a node, bound the cascade with circuit breakers and a retry budget before you reach for capacity, externalize session state so any backend can resume any request, and verify every control with before/after numbers under identical load. Doing it once on a toy pool makes the production version muscle memory.