Resilient LB architecture: anycast, zone-aware routing, and observability

A single LB is a SPOF — anycast + BGP ECMP eliminates it; zone-aware routing cuts cross-zone egress cost; TLS terminates at the edge; RED metrics and circuit-breaker state are the minimum observability for safe operation.

Your load balancer is healthy. Then it is not. Every client connection drops at once because the single LB machine crashed. Your load balancer — the component that was supposed to make your backend resilient — is your single point of failure.

The LB single point of failure

If your load balancer is a single machine, its failure takes down all traffic. Even with a hot standby (active-passive failover), the failover takes 10–30 seconds — long enough for users to notice.

That gap of 10x or more in failover time is the whole reason to advertise one VIP from many LBs: a dead LB's BGP route can be withdrawn in under a second, versus 10–30 s for a standby promotion.

Solution: anycast + BGP ECMP

Multiple LB machines all advertise the same anycast VIP (virtual IP) via BGP. The network’s Equal-Cost Multipath (ECMP) routing distributes client traffic across all LBs:

Client connects to IP 203.0.113.10 (the VIP).
BGP sees multiple equal-cost routes to 203.0.113.10 — one via LB_A, one via LB_B, one via LB_C.
ECMP hashes the 5-tuple (src IP, src port, dst IP, dst port, protocol) to pick a path.
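If LB_A crashes, its BGP session goes down and the routers drop its route. ECMP re-hashes the affected flows to LB_B or LB_C. Convergence: <1 second when failure detection is fast (for example BFD); default BGP hold timers alone take far longer.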
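The steps above can be sketched in a few lines of Python. This is a minimal illustration, not how a router implements ECMP: real hardware uses its own hash function, and the SHA-256 plus modulo scheme, the function name ecmp_pick, and the client address are assumptions made for the example.

```python
import hashlib

def ecmp_pick(five_tuple, next_hops):
    """Pick a next hop by hashing the flow's 5-tuple (illustrative modulo scheme)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

VIP = "203.0.113.10"
# src IP, src port, dst IP, dst port, protocol (client address is hypothetical)
flow = ("198.51.100.7", 51515, VIP, 443, "tcp")

lbs = ["LB_A", "LB_B", "LB_C"]
first = ecmp_pick(flow, lbs)

# LB_A's route is gone; the router now hashes over the remaining next hops.
survivors = [lb for lb in lbs if lb != "LB_A"]
after = ecmp_pick(flow, survivors)

print("before withdrawal:", first)
print("after withdrawal: ", after)
```

With a plain modulo over the next-hop list, removing one LB can also move flows that were on LB_B or LB_C, not only the flows that were on LB_A. That side effect is one reason the LBs themselves need to tolerate receiving a flow they have never seen.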

The stateless LB requirement. When ECMP re-routes a flow to a different LB, that LB has no memory of the flow. If the previous LB kept connection state locally (a terminated TCP connection, TLS session, HTTP/2 stream state), the flow breaks and the client must reconnect. A stateless L4 LB avoids this by deriving the forwarding decision from the packet itself. This is why Maglev (Google’s distributed LB) uses consistent hashing of the 5-tuple: every Maglev machine independently maps a given flow to the same backend, so the connection survives even when ECMP moves it to a different Maglev machine, and only a small fraction of flows are remapped as backends come and go.
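A simplified sketch of a Maglev-style lookup table shows why no shared state is needed: each LB builds the table from the backend list alone, so two LBs with the same list agree on every flow. The table size (251), the SHA-256 based hashes, and the backend names are assumptions for illustration; the population loop follows the round-robin permutation fill described for Maglev, without its production optimizations.

```python
import hashlib

def _h(name, salt):
    """Stable 64-bit hash; the salt yields independent hash functions."""
    return int.from_bytes(hashlib.sha256(f"{salt}:{name}".encode()).digest()[:8], "big")

def build_table(backends, size=251):
    """Populate a Maglev-style lookup table. size must be prime."""
    if not backends:
        raise ValueError("need at least one backend")
    backends = sorted(backends)  # same input order on every LB
    offsets = [_h(b, "offset") % size for b in backends]
    skips = [_h(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        for i, b in enumerate(backends):
            # Walk this backend's preference permutation to its next free slot.
            slot = (offsets[i] + next_idx[i] * skips[i]) % size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
            table[slot] = b
            next_idx[i] += 1
            filled += 1
            if filled == size:
                return table

def pick_backend(table, five_tuple):
    key = "|".join(map(str, five_tuple))
    return table[_h(key, "flow") % len(table)]

backends = ["envoy-1", "envoy-2", "envoy-3"]  # hypothetical names
table_on_lb_b = build_table(backends)
table_on_lb_c = build_table(list(reversed(backends)))  # different discovery order

flow = ("198.51.100.7", 51515, "203.0.113.10", 443, "tcp")
print(pick_backend(table_on_lb_b, flow) == pick_backend(table_on_lb_c, flow))  # True

# Remove one backend and count how many surviving slots changed owner.
smaller = build_table(["envoy-1", "envoy-2"])
moved = sum(1 for old, new in zip(table_on_lb_b, smaller)
            if old != "envoy-3" and old != new)
print(f"{moved} of {len(smaller)} slots changed owner among surviving backends")
```

The round-robin fill gives every backend an almost equal number of slots, and the final count shows how many flows on surviving backends would be disturbed by a membership change.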

Zone-aware routing

The problem: A client in zone A (us-east-1a) routing to a backend in zone B (us-east-1b) incurs:

Cross-zone egress cost: $0.01–0.02/GB in most clouds.
Extra latency: 1–5 ms intra-region RTT.

Zone-aware routing: Prefer backends in the same zone as the LB. Fall back to other zones only when all same-zone backends are unhealthy or circuit-breaker limits are hit.
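The preference rule can be sketched as a small selection function. This is a minimal sketch under stated assumptions: the Backend type, its fields, and the function name pick_zone_aware are hypothetical, and a production LB would layer weights and partial spillover on top of this all-or-nothing fallback.

```python
import random
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    zone: str
    healthy: bool = True
    circuit_open: bool = False

def eligible(backend):
    """A backend can take traffic if it is healthy and its breaker is closed."""
    return backend.healthy and not backend.circuit_open

def pick_zone_aware(backends, lb_zone, rng=random):
    """Prefer eligible backends in the LB's own zone; otherwise cross zones."""
    local = [b for b in backends if b.zone == lb_zone and eligible(b)]
    if local:
        return rng.choice(local)
    remote = [b for b in backends if b.zone != lb_zone and eligible(b)]
    if remote:
        return rng.choice(remote)
    raise RuntimeError("no eligible backend in any zone")

pool = [
    Backend("B1", "us-east-1a", healthy=False),  # one dead backend in zone A
    Backend("B2", "us-east-1a"),
    Backend("B3", "us-east-1a"),
    Backend("B4", "us-east-1b"),
]

# Zone A still has eligible backends, so traffic stays in zone A.
print(pick_zone_aware(pool, "us-east-1a").zone)

# Only when every zone-A backend is out does traffic cross zones.
pool[1].healthy = False
pool[2].circuit_open = True
print(pick_zone_aware(pool, "us-east-1a").zone)
```

Losing a single backend in the local zone does not trigger cross-zone traffic; the fallback fires only when the local list is empty.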

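Platform support varies, so verify current defaults before relying on them. AWS: an ALB has cross-zone load balancing enabled by default, so keeping traffic in-zone means turning it off at the target group level; an NLB has cross-zone load balancing disabled by default. Envoy: zone-aware routing prefers the local zone when upstream hosts carry locality information, and locality_weighted_lb_config provides explicit per-locality weights. GCP: check the zonal affinity and locality settings available for your load balancer type.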

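Zone failure isolation. When zone A has a partial failure, zone-aware routing limits the blast radius: healthy zone-A backends keep serving zone-A traffic, and only zone A's share spills over to zones B and C once zone A can no longer serve it. Zones B and C therefore absorb at most zone A's load (about 1.5× their normal load with three equally loaded zones) rather than an unpredictable surge.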

TLS termination at the LB

The LB terminates TLS: decrypts the client’s TLS session, sees plaintext, and (optionally) re-encrypts on the connection to the backend or sends plaintext over the internal network.

Benefits:

Backends do not need to manage certificates — one cert at the LB edge.
TLS handshake cost (~20–50 ms per new connection) borne once at the LB, not by every backend.
The LB can terminate TLS 1.2 from old clients and upgrade to TLS 1.3 on the backend-facing connection.

TLS 1.3 0-RTT resumption at the LB. If the client has a pre-shared key (PSK) from a prior session, it can send its first request as early data in the same flight as the ClientHello, with zero extra round-trips. Because any LB instance behind the VIP may receive the resumption, either the flow must reach the instance that issued the session ticket, or all instances must share the ticket encryption keys (or a distributed session cache). Early data can be replayed by an attacker, so accept it only for idempotent requests.

Cost: ~20–50 ms per new connection, 50–2 000 ms under load spikes. TLS session reuse amortizes this over many requests.
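The amortization is simple arithmetic, sketched below. The function name is hypothetical, and the model assumes the handshake cost is paid once per connection and spread evenly over the requests that reuse it.

```python
def amortized_handshake_ms(handshake_ms, requests_per_connection):
    """Handshake latency attributed to each request on a reused connection."""
    if requests_per_connection < 1:
        raise ValueError("a connection carries at least one request")
    return handshake_ms / requests_per_connection

for n in (1, 10, 100):
    print(n, amortized_handshake_ms(50, n))
# 1 50.0
# 10 5.0
# 100 0.5
```

A 50 ms handshake that dominates a single-request connection shrinks to half a millisecond per request once the connection carries 100 requests.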

Resilient LB architecture numbers

Anycast ECMP failover time: <1 s (BGP withdrawal)
Cross-zone egress cost: $0.01–0.02/GB
TLS termination cost (new connection): 20–50 ms
TLS termination cost under load spike: 50–2 000 ms
DNS TTL for geo-LB: 60–300 s
L4 edge + L7 behind (Google-style pattern): Maglev + Envoy

DNS load balancing vs LB routing

DNS round-robin: Return multiple A records for one hostname. Clients pick one. Simple, but:

DNS TTL is 60–300 seconds — backend changes are not reflected for up to 5 minutes.
Clients cache DNS results and defeat rebalancing.
No health awareness — DNS returns dead backends until TTL expires.

Correct pattern: DNS points to a single anycast VIP (one per region). The LB cluster behind the VIP handles per-request balancing. DNS provides geographic routing (return the nearest regional VIP); the LB provides per-request balancing within the region.
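The division of labour can be sketched as a toy resolver: DNS only decides which regional VIP a client receives, and everything per-request happens behind that VIP. The region names, the country-to-region map, and the extra VIP addresses are hypothetical (documentation address ranges); a real geo-DNS service uses resolver location or latency measurements rather than a static table.

```python
# One anycast VIP per region (addresses are documentation ranges, not real VIPs).
REGIONAL_VIPS = {
    "us-east": "203.0.113.10",
    "eu-west": "198.51.100.10",
    "ap-south": "192.0.2.10",
}

# Hypothetical static mapping from client country to nearest region.
NEAREST_REGION = {"US": "us-east", "CA": "us-east", "DE": "eu-west",
                  "FR": "eu-west", "IN": "ap-south"}
DEFAULT_REGION = "us-east"
TTL_SECONDS = 60  # low end of the 60-300 s range quoted in this lesson

def resolve(client_country):
    """Return a single A record: the VIP of the region nearest the client."""
    region = NEAREST_REGION.get(client_country, DEFAULT_REGION)
    return {"A": REGIONAL_VIPS[region], "ttl": TTL_SECONDS}

print(resolve("DE"))  # {'A': '198.51.100.10', 'ttl': 60}
print(resolve("BR"))  # unknown country falls back to the default region
```

Because the answer is a VIP rather than an individual backend, a backend failure never requires a DNS change, so the TTL only matters when a whole region is taken out of rotation.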

Observability: minimum viable metrics

Alert-worthy metrics for a load balancer cluster:

Request rate per backend (RED method: Rate, Errors, Duration).
p50/p95/p99 latency per backend. p99 shows the tail latency experienced by the slowest 1% of requests.
Error rate per backend — alert if > 0.01%.
Active connection count per backend.
Health-check success/failure rate — alert on flapping.
Circuit-breaker opens/closes — one open per week is fine; 10/hour signals a problem.
Retry rate — alert if > 0.1% of request rate (early storm warning).
Load imbalance — std dev of request counts across backends; high imbalance signals algorithm or affinity issues.
Drain time on shutdowns — long drain time (approaching timeout) signals long-running requests.
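Several of the metrics above reduce to short calculations over per-backend counters. The sketch below is illustrative only: the function names and input shapes are assumptions, the percentile uses the nearest-rank method on raw samples (production systems usually use histograms), and the two thresholds are the ones quoted in this lesson.

```python
import statistics

ERROR_RATE_ALERT = 0.0001  # 0.01 %
RETRY_RATE_ALERT = 0.001   # 0.1 %

def percentile(sorted_values, pct):
    """Nearest-rank percentile; pct is an integer, list is sorted and non-empty."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]

def red_metrics(requests, errors, retries, latencies_ms):
    """Rate-derived ratios plus latency percentiles for one backend."""
    ordered = sorted(latencies_ms)
    return {
        "error_rate": errors / requests,
        "retry_rate": retries / requests,
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
    }

def alerts(metrics):
    """Names of the thresholds this backend currently violates."""
    fired = []
    if metrics["error_rate"] > ERROR_RATE_ALERT:
        fired.append("error_rate")
    if metrics["retry_rate"] > RETRY_RATE_ALERT:
        fired.append("retry_rate")
    return fired

def load_imbalance(request_counts):
    """Standard deviation of request counts across backends."""
    return statistics.pstdev(request_counts)

m = red_metrics(requests=100_000, errors=5, retries=200,
                latencies_ms=list(range(1, 101)))
print(m["p50_ms"], m["p95_ms"], m["p99_ms"])  # 50 95 99
print(alerts(m))                              # ['retry_rate']
print(load_imbalance([100, 100, 100]))        # 0.0
```

In this sample the error rate (0.005%) is under its threshold while the retry rate (0.2%) is over, which is the early-warning pattern the retry-rate alert exists to catch.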

SLOs:

p99 latency < 100 ms for API endpoints.
Error rate < 0.01%.
Circuit-breaker open time < 1 minute/week.
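Checking these SLOs needs one derived quantity that is not a plain counter: total time a circuit breaker spent open during the week. The sketch below assumes breaker transitions arrive as ordered (timestamp_seconds, state) events; that event format and the function names are hypothetical.

```python
SLO = {
    "p99_ms": 100,                  # p99 latency < 100 ms
    "error_rate": 0.0001,           # error rate < 0.01 %
    "breaker_open_s_per_week": 60,  # open time < 1 minute/week
}

def open_seconds(events, window_end):
    """Sum the seconds a breaker spent open, given ordered (ts, state) events."""
    total = 0.0
    opened_at = None
    for ts, state in events:
        if state == "open" and opened_at is None:
            opened_at = ts
        elif state == "closed" and opened_at is not None:
            total += ts - opened_at
            opened_at = None
    if opened_at is not None:
        # Still open at the end of the window: count up to window_end.
        total += window_end - opened_at
    return total

def slo_report(p99_ms, error_rate, breaker_open_s):
    """True per objective when the measured value is inside the SLO."""
    return {
        "p99_latency": p99_ms < SLO["p99_ms"],
        "error_rate": error_rate < SLO["error_rate"],
        "breaker_open_time": breaker_open_s < SLO["breaker_open_s_per_week"],
    }

events = [(100, "open"), (130, "closed"), (500, "open")]
spent_open = open_seconds(events, window_end=520)
print(spent_open)  # 50.0
print(slo_report(p99_ms=80, error_rate=0.00005, breaker_open_s=spent_open))
```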

Trace it

Trace zone-aware LB failover and anycast resilience.

Step 1. You have LBs in zones A, B, C. All advertise the same anycast VIP via BGP ECMP. A client in zone A initiates a request. Which LB handles it?
Step 2. LB_A in zone A has backends in zones A, B, and C. Should it prefer zone-A backends for new requests?
Step 3. LB_A crashes. BGP withdraws its route. ECMP re-hashes in-flight connections to LB_B or LB_C. What is the impact on existing TCP connections?
Step 4. Backend B1 in zone A dies. Zone-A backends are now [B2, B3]. Should LB_A immediately fail over all new traffic to zone B?

Pick the best fit

A platform team is building a multi-region load balancer for a globally distributed SaaS service. Pick the topology.

Why this works

Google’s Maglev and the two-tier LB pattern. Google uses Maglev as a stateless L4 LB at the network edge. Maglev uses a consistent hash of the 5-tuple to route flows to an L7 proxy tier behind it (Envoy in this pattern). This two-tier design separates concerns: Maglev absorbs packet-rate traffic cheaply and provides LB-level fault tolerance via anycast + consistent hashing. The L7 tier behind it does content routing, TLS, gRPC transcoding, and per-request observability. A comparable AWS layout places a Network Load Balancer (L4) in front of an Application Load Balancer (L7, HTTP routing). Note that an NLB exposes static per-zone IP addresses rather than an anycast VIP; anycast entry on AWS comes from Global Accelerator.

One VIP, many stateless LBs. ECMP spreads flows across them by 5-tuple hash; a crashed LB's route is withdrawn by BGP and its flows re-hash to LB_B or LB_C in under a second — the cluster has no single machine whose loss takes down traffic.

Recall before you leave

01. How does anycast + BGP ECMP eliminate the LB single point of failure, and what happens to in-flight connections when one LB crashes?
02. Why does zone-aware routing matter economically, and when should it fail over to another zone?
03. What is the minimum set of metrics needed to detect a retry storm before it causes an outage?

Recap

A single load balancer is a single point of failure. Anycast + BGP ECMP advertises the same VIP from multiple LBs; ECMP hashes flows across them and BGP withdraws a dead LB’s route in <1 second. Zone-aware routing keeps traffic in the same availability zone to avoid $0.01–0.02/GB egress costs and intra-region RTT overhead — only crossing zones when all same-zone backends are unhealthy. TLS terminates at the LB edge: one certificate, 20–50 ms handshake cost borne once rather than on every backend. The minimum observability set — request rate, p99 latency, error rate, retry rate, circuit-breaker opens — catches a retry storm at the 0.1% retry rate threshold before it escalates to cascade failure. Now when you design a new service tier, ask yourself: if this single LB machine disappears right now, what happens — and is your answer “BGP convergence in under a second” or “complete outage”?
