Health checks, connection draining, and slow start

Active probes catch total failures; passive outlier detection catches silent 5xx degradation; connection draining lets in-flight requests finish before a backend is removed; slow start prevents a cold backend from drowning under full load.

A backend’s disk fills up. It keeps accepting TCP connections but every HTTP request returns 500. Your L4 health check (TCP SYN probe) says the backend is healthy. Requests keep routing to it and failing. This is the gap that passive health checking fills.

Active health checks

The LB sends probe requests to each backend on a fixed interval, independent of real client traffic.

Probe types:

HTTP GET /healthz — most common; expects 200 OK. Catches application-layer failures (disk full, database unreachable).
TCP SYN — lightweight; just attempts to establish TCP. Catches total network or process failure but misses application errors.
gRPC health check — standard for gRPC services (grpc.health.v1.Health/Check).
Custom UDP probes — for non-TCP services.

Thresholds:

Mark unhealthy after 2–3 consecutive failures.
Mark healthy after 2–3 consecutive successes.
Interval: 10–30 seconds.
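The threshold rule above can be sketched as a small state machine. This is a minimal illustration, not any particular load balancer's implementation; the class name ProbeTracker and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Tracks one backend's health from a stream of active probe results."""
    fail_threshold: int = 3      # consecutive failures before marking unhealthy
    success_threshold: int = 2   # consecutive successes before marking healthy
    healthy: bool = True
    _streak: int = 0             # run length of results that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the backend's health after it."""
        if probe_ok == self.healthy:
            # Result agrees with the current state: any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.success_threshold if probe_ok else self.fail_threshold
        if self._streak >= threshold:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy
```

With a 10 s interval and a failure threshold of 3, a hard failure takes roughly 20–30 s to be detected, which is the trade-off behind choosing small thresholds.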

The silent-failure gap. A backend that crashes hard (process killed) is caught by TCP SYN. A backend that is degraded (thread pool exhausted, returning 500 or timing out) still accepts TCP but fails HTTP, so a TCP SYN probe misses it entirely. HTTP GET probes catch it, but only if the probe URL exercises the actual application logic rather than a trivial stub that always returns 200.

Passive health checks and outlier detection

Instead of probing, the LB observes real traffic. If a backend returns repeated 5xx errors or repeated timeouts, it is an outlier.

Envoy outlier detection (default config):

Eject a backend after 5 consecutive 5xx responses.
Ejection duration starts at 30 seconds and grows by one base interval on each consecutive ejection: 30 s → 60 s → 90 s → up to a 300 s maximum.
The LB stops routing new requests to the ejected backend for the duration.
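A minimal sketch of passive ejection follows. It is not Envoy's code: the class and field names are hypothetical, and it assumes the ejection window equals the base time multiplied by the number of ejections so far, capped at a maximum. Treat that growth rule as an assumption and check it against your proxy's documentation.

```python
from dataclasses import dataclass


@dataclass
class OutlierDetector:
    """Ejects a backend based on the responses real traffic receives."""
    consecutive_5xx: int = 5        # 5xx responses in a row that trigger ejection
    base_ejection_s: float = 30.0   # first ejection lasts this long
    max_ejection_s: float = 300.0   # cap on any single ejection
    ejected_until: float = 0.0      # timestamp when the current ejection ends
    _streak: int = 0
    _ejections: int = 0

    def is_ejected(self, now: float) -> bool:
        return now < self.ejected_until

    def on_response(self, status: int, now: float) -> None:
        if status < 500:
            self._streak = 0        # any non-5xx response breaks the streak
            return
        self._streak += 1
        if self._streak >= self.consecutive_5xx:
            self._ejections += 1
            # Assumed growth rule: base * ejection count, capped at the maximum.
            duration = min(self.base_ejection_s * self._ejections, self.max_ejection_s)
            self.ejected_until = now + duration
            self._streak = 0
```

The caller is expected to skip the backend while is_ejected(now) is true and to keep feeding on_response once traffic resumes.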

Passive vs active — which catches what:

Active catches: crashed process, network partition, misrouted port.
Passive catches: application-level hangs, database pool exhaustion, GC-induced 500 storms.
Neither alone is sufficient. Use both for defense in depth.

Active is the safety net that works without traffic; passive catches the silent 5xx degradation active probes miss. Run both — each covers the other's blind spot.

Health-check flapping

A network blip causes 2–3 consecutive active check failures → backend ejected. Network recovers → 2–3 successes → backend re-added. Blip again → ejected. This cycling — called flapping — causes rapid traffic churn as the backend oscillates between healthy and unhealthy.

Fix: Increase failure/success thresholds (for example, require 5 consecutive failures before ejection). Add jitter to check timing so checks from multiple LB replicas do not synchronize. Envoy's ejection backoff (30 s → 60 s → 90 s, up to 300 s) also stabilizes a flapping backend: each consecutive ejection lasts longer, until the ejection window exceeds the flap duration.
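Jittered probe timing can be sketched in a few lines. The function name and the jitter_fraction parameter are hypothetical; the idea is only that each LB replica draws its own delay around the nominal interval so probes drift apart instead of arriving in lockstep.

```python
import random


def next_probe_delay(interval_s: float, jitter_fraction: float, rng: random.Random) -> float:
    """Return the delay before the next probe: the interval plus or minus a random share of it."""
    if not 0.0 <= jitter_fraction < 1.0:
        raise ValueError("jitter_fraction must be in [0, 1)")
    spread = interval_s * jitter_fraction
    return interval_s + rng.uniform(-spread, spread)
```

For example, next_probe_delay(10.0, 0.2, random.Random()) yields a delay between 8 s and 12 s on each call.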

Health check and draining numbers

Active health check interval: 10–30 s
Consecutive failures before ejection: 2–3
Consecutive successes before re-add: 2–3
Envoy passive ejection duration (initial): 30 s
Envoy ejection duration maximum: ~300 s
AWS ALB connection drain timeout (default): 300 s
Connection drain for short HTTP: 5–30 s
Connection drain for WebSocket/SSE: 300+ s
Slow-start ramp duration (typical): 1–5 min

Connection draining

When removing a backend (deployment, scale-in, maintenance), you must not tear down active connections abruptly. Connection draining:

Stop routing new requests to the backend.
Allow in-flight requests to complete within a drain timeout.
After the timeout, force-close any remaining connections.

Drain timeouts:

Short HTTP requests: 5–30 s.
Long-running WebSocket / SSE connections: 300+ s.
AWS ALB default: 300 s (configurable 0–3600 s).
GCP: configurable 0–3600 s.
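On the load-balancer side, the three draining steps reduce to two questions per backend: may it receive new requests, and may it be removed yet? The sketch below models that bookkeeping only; BackendSlot and its methods are hypothetical names, not an API of any real load balancer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackendSlot:
    """Load-balancer bookkeeping for one backend during removal."""
    in_flight: int = 0                      # requests currently being served
    drain_deadline: Optional[float] = None  # None means the backend is not draining

    def start_drain(self, now: float, timeout_s: float) -> None:
        # Step 1: from this moment on, no new requests are routed here.
        self.drain_deadline = now + timeout_s

    def accepts_new(self) -> bool:
        return self.drain_deadline is None

    def can_remove(self, now: float) -> bool:
        if self.drain_deadline is None:
            return False
        # Step 2: remove early once in-flight work is done.
        # Step 3: remove anyway at the deadline, force-closing what remains.
        return self.in_flight == 0 or now >= self.drain_deadline
```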

The application side. On SIGTERM, the backend should:

Call close(listening_socket) — stop accepting new connections.
Finish all in-flight requests.
Exit cleanly.
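The SIGTERM sequence above might look like this with Python's standard-library HTTP server. It is a sketch under assumptions: the handler, port 8080, and the helper names are illustrative, and no drain timeout is enforced here (the load balancer or orchestrator is assumed to apply one).

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def make_server(port: int = 0) -> ThreadingHTTPServer:
    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    # Non-daemon request threads are tracked, so server_close() joins them
    # and in-flight requests finish before the process exits.
    server.daemon_threads = False
    return server


def install_sigterm_drain(server: ThreadingHTTPServer) -> None:
    def on_sigterm(signum, frame):
        # shutdown() blocks until serve_forever() returns, so it must not run
        # on the main thread, which is the one inside serve_forever().
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, on_sigterm)


def main() -> None:
    server = make_server(8080)
    install_sigterm_drain(server)
    server.serve_forever()   # stops accepting once SIGTERM triggers shutdown()
    server.server_close()    # closes the listening socket, waits for in-flight requests
    # Returning from main() lets the process exit cleanly.
```

Calling main() runs the server until a SIGTERM arrives; the process then exits only after requests that were already being handled have completed.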

Without draining, the LB removes the backend mid-request: the client receives a connection reset and must retry. With draining, the request completes normally and the backend exits quietly.

Slow start / warm-up

When a backend rejoins the pool after recovery or first deployment, it is cold:

In-process caches are empty.
Database connection pools need priming.
TLS session caches are cold.

Sending 100% of traffic immediately can make the backend fall behind under the surge, potentially triggering a cascading failure.

Solution: Ramp traffic gradually, for example 10% → 50% → 100% over 1–5 minutes. Envoy supports this through its slow start configuration (slow_start_window, with an aggression parameter that shapes the ramp). AWS ALB supports it through the target group attribute slow_start.duration_seconds (30–900 s), which ramps a newly healthy target's share of requests linearly.
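A linear ramp is the simplest version of this idea. The sketch below is not any load balancer's actual formula; the function and parameter names are hypothetical, and real implementations may use a non-linear curve.

```python
def slow_start_weight(full_weight: float, seconds_since_join: float,
                      window_s: float, min_fraction: float = 0.1) -> float:
    """Effective routing weight for a backend that rejoined `seconds_since_join` ago."""
    if window_s <= 0 or seconds_since_join >= window_s:
        return full_weight  # ramp disabled or already finished
    # Linear ramp with a floor, so the backend gets some traffic to warm up on.
    fraction = max(min_fraction, seconds_since_join / window_s)
    return full_weight * fraction
```

With a 120 s window, a backend of weight 100 starts at about 10, reaches 50 after 60 s, and carries its full weight from 120 s onward.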

Trace it

Trace a backend crash, health-check detection, and graceful rejoining.

Step 1 of 4

Backend B2 crashes. The LB's active health check probes B2 via HTTP GET /healthz every 30 s. What does the LB see on the first probe after the crash?

Step 2 of 4

After 2–3 consecutive health check failures, the LB marks B2 unhealthy. What happens to new requests vs in-flight requests?

Step 3 of 4

B2 restarts and passes 3 consecutive health checks. How does traffic ramp back?

Step 4 of 4

B3 is still up but its database pool is exhausted, so it returns 500 on all requests. The active health check on B3 is only a TCP SYN probe, and it still succeeds. What detects the problem?

Edge cases

Health-check endpoint design. A trivial /healthz that always returns 200 is dangerous — it passes the active probe even when the application is broken. A good health-check endpoint tests the minimum critical dependencies: can we reach the database? Can we reach the cache? Return 200 only if the backend can actually serve requests. But do not add all dependencies: if a non-critical downstream service is down, returning 500 from /healthz will eject the backend unnecessarily.
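The split between critical and non-critical dependencies can be sketched as a plain function that a /healthz handler would call. Everything here is hypothetical (the function name, the check names, the report format), and 503 is chosen as the failing status on the assumption that the probe treats any non-200 response as a failure.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def _passes(check: Check) -> bool:
    """Run one dependency check; an exception counts as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def healthz(critical: Dict[str, Check],
            optional: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, per-dependency report) for a health probe."""
    status = 200
    report: Dict[str, str] = {}
    for name, check in critical.items():
        ok = _passes(check)
        report[name] = "ok" if ok else "fail"
        if not ok:
            status = 503  # a critical dependency is down: fail the probe
    for name, check in optional.items():
        # Optional dependencies are reported but never fail the probe.
        report[name] = "ok" if _passes(check) else "degraded"
    return status, report
```

A handler might call healthz({"db": ping_db, "cache": ping_cache}, {"recommendations": ping_recs}), where the ping functions are placeholders for cheap, time-bounded checks.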

Quiz

A backend accepts TCP connections but its thread pool is exhausted and it returns 500 on every HTTP request. Which health check type catches this, and which misses it?

Why is connection draining necessary when removing a backend from the load balancer pool?

Removal is graceful (drain in-flight requests, do not cut them mid-flight) and re-entry is gradual (slow start, so a cold backend with empty caches is not drowned by full load on its first second back).

Recall before you leave

01 Why should you use both active and passive health checks rather than one alone?
02 What is health-check flapping and how does Envoy's ejection backoff mitigate it?
03 What should a backend do on SIGTERM to cooperate with connection draining?

Recap

Two health-check strategies complement each other. Active checks send probes (HTTP GET, TCP SYN, gRPC) every 10–30 seconds and eject a backend after 2–3 consecutive failures; they work without client traffic, but a TCP-only probe or a trivial /healthz stub is blind to application-layer degradation. Passive outlier detection watches real traffic and ejects after repeated 5xx responses, with an ejection window that lengthens on each consecutive ejection (30 s up to 300 s in Envoy) and so dampens flapping. Connection draining covers removal: new requests stop immediately, and in-flight requests get 5–30 s (HTTP) or 300+ s (WebSocket/SSE) to finish. Slow start protects rejoining backends by ramping traffic from 10% to 100% over 1–5 minutes so cold caches and connection pools can warm before they bear full load. When you see a backend repeatedly cycling into an unhealthy state despite recovering quickly, check the ejection backoff duration, and check whether your /healthz endpoint actually tests the application or just returns 200 blindly.
