
Networking & Protocols

Proxy and load balancing: survive a failing backend

Crux: a hands-on project. Front a backend pool with a real load balancer, drive it into a retry storm, then apply the unit's resilience ladder until the cluster survives a failing node, proven with before/after metrics.
◷ 240 min

Reading about retry storms is not the same as keeping a cluster alive while one node falls over. Stand up a small backend pool behind a real proxy, inject a failure, watch a 500 ms blip try to become an outage, and apply the unit’s resilience controls until the cluster rides it out — with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible loop: balance with a load-aware algorithm, detect failure with active and passive health checks, drain gracefully, bound the cascade with circuit breakers and retry budgets, externalize session state, and verify each control with before/after metrics under identical load.
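To make "bound the cascade with retry budgets" concrete, here is a minimal sketch of one possible budget: every first attempt earns a fraction of a retry, and every retry spends a whole one. The class name, the 10% ratio, and the banking cap are illustrative assumptions, not values from the unit or from any particular proxy.

```python
class RetryBudget:
    """Token-style budget: first attempts earn retry credit, retries spend it."""

    RETRY_COST = 100  # integer units, so a 10% credit per request is exactly 10

    def __init__(self, retry_percent: int = 10, initial_retries: int = 1,
                 max_banked_retries: int = 50) -> None:
        # All three defaults are assumptions chosen for illustration.
        self.retry_percent = retry_percent
        self.cap = max_banked_retries * self.RETRY_COST
        self.units = initial_retries * self.RETRY_COST

    def record_request(self) -> None:
        # Each first attempt deposits a fraction of one retry, up to the cap.
        self.units = min(self.cap, self.units + self.retry_percent)

    def try_retry(self) -> bool:
        # A retry is allowed only if a whole retry's worth of credit is banked.
        if self.units >= self.RETRY_COST:
            self.units -= self.RETRY_COST
            return True
        return False  # budget exhausted: fail fast instead of amplifying load


def simulate_outage(requests: int, budget: RetryBudget) -> tuple[int, int]:
    """Worst case: every first attempt fails and wants exactly one retry."""
    sent = denied = 0
    for _ in range(requests):
        budget.record_request()
        if budget.try_retry():
            sent += 1
        else:
            denied += 1
    return sent, denied


if __name__ == "__main__":
    sent, denied = simulate_outage(1000, RetryBudget())
    print(f"retries sent={sent} denied={denied}")
```

With these assumed defaults, 1,000 failing requests trigger 101 retries instead of 1,000, so the extra load on the surviving nodes stays near 10% however bad the failure gets.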

Project
Objective

Front a pool of 3–5 backend instances with a real load balancer (Envoy, nginx, or HAProxy), inject a single-node failure under load, and prove the cluster stays available — keeping error rate under 0.1% and p99 within target — by applying the unit's resilience ladder, with measurements at each step.
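The objective's pass/fail check can be sketched as a small script over per-request samples collected by a load generator. The `Sample` record, the nearest-rank p99 method, and the 250 ms p99 target are assumptions for illustration; the objective fixes only the 0.1% error-rate bound and leaves the p99 target to you.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    latency_ms: float
    status: int
    retried: bool = False  # True if the proxy retried this request


def p99(latencies: list[float]) -> float:
    # Nearest-rank percentile: the smallest value with >= 99% of samples at or below it.
    ordered = sorted(latencies)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples: list[Sample]) -> dict[str, float]:
    total = len(samples)
    return {
        "error_rate": sum(1 for s in samples if s.status >= 500) / total,
        "retry_rate": sum(1 for s in samples if s.retried) / total,
        "p99_ms": p99([s.latency_ms for s in samples]),
    }


def meets_target(summary: dict[str, float],
                 max_error_rate: float = 0.001,      # 0.1%, from the objective
                 p99_target_ms: float = 250.0) -> bool:  # hypothetical target
    return (summary["error_rate"] < max_error_rate
            and summary["p99_ms"] <= p99_target_ms)


if __name__ == "__main__":
    # Synthetic data only; real runs would load samples from the load generator.
    before = [Sample(20.0, 200)] * 9700 + [Sample(900.0, 503, True)] * 300
    after = [Sample(20.0, 200)] * 9995 + [Sample(900.0, 503)] * 5
    for label, run in (("before", before), ("after", after)):
        s = summarize(run)
        print(label, s, "PASS" if meets_target(s) else "FAIL")
```

Running the same summary over both runs keeps the comparison honest only if the load profile is identical; time-to-detect is not derivable from these samples and has to come from proxy logs or health-check events.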

Requirements
Acceptance criteria
  • A before/after table under identical load covering error rate, p99 latency, retry rate, and time-to-detect for the failed node (measured, not estimated), showing both error rate and retry rate back under 0.1% after the controls are applied.
  • Evidence that active and passive health checks each detected a distinct failure mode (a hard crash vs a TCP-accepting node returning 500s).
  • A rolling restart with connection draining produces zero client-visible connection resets in the access logs, while an abrupt kill (draining disabled) does produce them — both captured.
  • A short write-up naming, in priority order, which control stopped the cascade and why circuit breakers and retry budgets ranked above simply adding capacity.
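The health-check criterion above can be illustrated with a toy model of the two failure modes. It is a sketch under loud assumptions: the active check is a plain TCP-connect probe, the passive check ejects on consecutive 5xx responses, and the thresholds (3 and 5) are invented. Real proxies can also count connection errors passively, which this model deliberately ignores to keep the two modes separate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    name: str
    accepts_tcp: bool = True  # False models a hard crash (connection refused)
    status: int = 200         # status returned to live requests when reachable


class ActiveTcpCheck:
    """Out-of-band probe: unhealthy after `threshold` failed connects in a row."""

    def __init__(self, threshold: int = 3) -> None:  # hypothetical threshold
        self.threshold = threshold
        self.failures = 0

    def probe(self, backend: Backend) -> bool:
        self.failures = 0 if backend.accepts_tcp else self.failures + 1
        return self.failures < self.threshold  # True while still healthy


class PassiveOutlierCheck:
    """In-band check: eject after `threshold` consecutive 5xx on live traffic."""

    def __init__(self, threshold: int = 5) -> None:  # hypothetical threshold
        self.threshold = threshold
        self.consecutive_5xx = 0

    def observe(self, status: int) -> bool:
        self.consecutive_5xx = self.consecutive_5xx + 1 if status >= 500 else 0
        return self.consecutive_5xx < self.threshold


def detect(backend: Backend, rounds: int = 10) -> dict[str, Optional[int]]:
    """Return the round at which each check first marks the backend unhealthy."""
    active, passive = ActiveTcpCheck(), PassiveOutlierCheck()
    found: dict[str, Optional[int]] = {"active": None, "passive": None}
    for i in range(1, rounds + 1):
        if found["active"] is None and not active.probe(backend):
            found["active"] = i
        # In this simplified model the passive check only sees HTTP responses,
        # so a crashed node gives it nothing to observe.
        if (backend.accepts_tcp and found["passive"] is None
                and not passive.observe(backend.status)):
            found["passive"] = i
    return found


if __name__ == "__main__":
    print("crash:", detect(Backend("b1", accepts_tcp=False)))
    print("500s: ", detect(Backend("b2", status=500)))
```

In this model the crashed node is caught only by the active probe, and the node that accepts TCP but returns 500s is caught only by the passive check, which is the distinction the evidence should show. An active HTTP check against a health endpoint would behave differently, so state which probe type your proxy uses when you report results.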
Senior stretch
  • Add a second load balancer and put the pair behind a shared VIP (keepalived/VRRP locally, or anycast in a lab) so killing one LB does not drop all traffic; measure failover time.
  • Split the proxy into an L4 tier in front of an L7 tier and route /api/ vs /static/ to different pools at L7; preserve the real client IP end-to-end with X-Forwarded-For or PROXY protocol and verify the backend logs the true client.
  • Add slow-start: when a recovered backend rejoins, ramp its traffic 10% → 100% over a few minutes and show its p99 stays bounded versus an immediate full-load rejoin.
  • Build a one-page on-call runbook: triage from the dashboard, the retry-storm signature, the resilience ladder (algorithm → health → drain → breaker → budget → externalize state), and a verification checklist.
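For the slow-start item in the list above, the ramp can be sketched as a weight that grows from 10% to 100% after a backend rejoins. The linear curve, the 180-second window, and the function names are assumptions; each proxy implements its own ramp, so treat this as a model for reasoning about expected traffic share rather than a description of any product.

```python
def slow_start_weight(seconds_since_rejoin: float,
                      window_s: float = 180.0,  # hypothetical "few minutes"
                      floor: float = 0.10) -> float:
    """Linear ramp from `floor` to full weight over `window_s` seconds."""
    if seconds_since_rejoin <= 0:
        return floor
    if seconds_since_rejoin >= window_s:
        return 1.0
    return floor + (1.0 - floor) * (seconds_since_rejoin / window_s)


def traffic_share(rejoining_weight: float, full_weight_peers: int) -> float:
    """Fraction of requests sent to the rejoining node under weighted balancing."""
    return rejoining_weight / (full_weight_peers + rejoining_weight)


if __name__ == "__main__":
    for t in (0, 45, 90, 135, 180):
        w = slow_start_weight(t)
        print(f"t={t:>3}s weight={w:.3f} share={traffic_share(w, 2):.3f}")
```

Comparing the rejoining node's measured request share against this curve is one way to confirm the ramp is active before comparing its p99 with the immediate full-load rejoin.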
Recap

This is the loop you will run on every real LB incident: balance with a load-aware algorithm, detect failure with active AND passive checks, drain in-flight requests before removing a node, bound the cascade with circuit breakers and a retry budget before you reach for capacity, externalize session state so any backend can resume any request, and verify every control with before/after numbers under identical load. Doing it once on a toy pool makes the production version muscle memory.
