Health checks, connection draining, and slow start

Crux: Active probes catch total failures; passive outlier detection catches silent 5xx degradation; connection draining lets in-flight requests finish before a backend is removed; slow start prevents a cold backend from drowning under full load.

A backend’s disk fills up. It keeps accepting TCP connections, but every HTTP request returns 500. Your L4 health check (a TCP SYN probe) still reports the backend as healthy, so requests keep routing to it and failing. This is the gap that application-layer probes and passive health checking fill.

Active health checks

The LB sends probe requests to each backend on a fixed interval, independent of real client traffic.

Probe types:

  • HTTP GET /healthz — most common; expects 200 OK. Catches application-layer failures (disk full, database unreachable).
  • TCP SYN — lightweight; just attempts to establish TCP. Catches total network or process failure but misses application errors.
  • gRPC health check — standard for gRPC services (grpc.health.v1.Health/Check).
  • Custom UDP probes — for non-TCP services.

Thresholds:

  • Mark unhealthy after 2–3 consecutive failures.
  • Mark healthy after 2–3 consecutive successes.
  • Interval: 10–30 seconds.
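
The threshold rules above amount to a small state machine per backend. The following is a minimal sketch of that idea, not any particular load balancer's implementation; the class name `ActiveHealthState` and its fields are hypothetical, and the thresholds are picked from the 2–3 range given above.

```python
from dataclasses import dataclass


@dataclass
class ActiveHealthState:
    """Tracks one backend's health from a stream of probe results (illustrative only)."""
    unhealthy_threshold: int = 3  # consecutive failures before marking unhealthy
    healthy_threshold: int = 2    # consecutive successes before marking healthy
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the backend's health afterwards."""
        if probe_ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


state = ActiveHealthState()
probes = [False, False, True, False, False, False, True, True]
history = [state.record(ok) for ok in probes]
print(history)
```

Two failures followed by a success do not eject the backend, because the success resets the failure streak; only the later run of three failures does. That reset is what keeps a single dropped probe from removing a healthy backend.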

The silent-failure gap. A backend that crashes hard (process killed) is caught by TCP SYN. A backend that is up but broken (thread pool exhausted, returning 500 on every request) still accepts TCP connections, so a TCP SYN probe misses the failure entirely. HTTP GET probes catch it, but only if the probe URL exercises real application logic rather than a trivial stub that always returns 200.

Passive health checks and outlier detection

Instead of probing, the LB observes real traffic. If a backend returns repeated 5xx errors or repeated timeouts, it is an outlier.

Envoy outlier detection (default config):

  • Eject a backend after 5 consecutive 5xx responses.
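  • Ejection duration starts at 30 s (base_ejection_time) and grows with each consecutive ejection: the base time is multiplied by the number of ejections, so 30 s → 60 s → 90 s, capped at 300 s (max_ejection_time).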
  • The LB stops routing new requests to the ejected backend for the duration.
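
The ejection rules above can be sketched as a per-backend counter. This is a simplified illustration, not Envoy's code: the class `OutlierDetector` and its field names are hypothetical, time is passed in explicitly as seconds, and the ejection length is assumed to be the base time multiplied by the number of ejections so far, capped at a maximum (which is how Envoy's documentation describes base_ejection_time and max_ejection_time). It ignores details such as the maximum share of hosts that may be ejected at once.

```python
from dataclasses import dataclass


@dataclass
class OutlierDetector:
    """Passive health check for one backend, driven by real response codes."""
    consecutive_5xx: int = 5        # 5xx streak that triggers ejection
    base_ejection_s: float = 30.0   # length of the first ejection
    max_ejection_s: float = 300.0   # cap on ejection length
    ejected_until: float = 0.0
    _streak: int = 0
    _ejections: int = 0

    def is_ejected(self, now: float) -> bool:
        return now < self.ejected_until

    def observe(self, status: int, now: float) -> None:
        """Record the status code of one real response seen at time `now`."""
        self._streak = self._streak + 1 if status >= 500 else 0
        if self._streak >= self.consecutive_5xx:
            self._ejections += 1
            duration = min(self.base_ejection_s * self._ejections, self.max_ejection_s)
            self.ejected_until = now + duration
            self._streak = 0


detector = OutlierDetector()
for _ in range(5):
    detector.observe(500, now=0.0)
first_until = detector.ejected_until    # first ejection: 30 s

for _ in range(5):
    detector.observe(503, now=40.0)
second_until = detector.ejected_until   # second ejection: 60 s, starting at t=40
```

A single 2xx response in the middle of a run resets the streak, so a backend that fails only intermittently is not ejected by this rule alone.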

Passive vs active — which catches what:

  • Active catches: crashed process, network partition, misrouted port.
  • Passive catches: application-level hangs, database pool exhaustion, GC-induced 500 storms.
  • Neither alone is sufficient. Use both for defense in depth.

Health-check flapping

A network blip causes 2–3 consecutive active check failures → backend ejected. Network recovers → 2–3 successes → backend re-added. Blip again → ejected. This cycling — called flapping — causes rapid traffic churn as the backend oscillates between healthy and unhealthy.

Fix: Increase failure/success thresholds (for example, require 5 consecutive failures before ejection). Add jitter to check timing so checks from multiple LB replicas do not synchronize. For passively ejected backends, Envoy’s ejection time grows with each consecutive ejection (30 s → 60 s → 90 s, capped at 300 s), which dampens flapping: the ejection window widens until it outlasts the flap.
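
Jitter on the probe schedule is simple to express. The helper below is a sketch under assumed parameters (`next_probe_delay`, a 10 s interval, and a 20% jitter fraction are all illustrative choices, not values from any specific load balancer).

```python
import random


def next_probe_delay(interval_s: float, jitter_fraction: float, rng: random.Random) -> float:
    """Return the wait before the next probe: the interval plus or minus random jitter."""
    spread = interval_s * jitter_fraction
    return interval_s + rng.uniform(-spread, spread)


rng = random.Random(7)  # seeded only so the example is repeatable
delays = [next_probe_delay(10.0, 0.2, rng) for _ in range(5)]
print([round(d, 2) for d in delays])
```

Each LB replica drawing its own delays this way drifts out of phase with the others, so a backend does not receive all probes in the same instant.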

Health check and draining numbers

  • Active health check interval: 10–30 s
  • Consecutive failures before ejection: 2–3
  • Consecutive successes before re-add: 2–3
  • Envoy passive ejection duration (initial): 30 s
  • Envoy ejection duration maximum: ~300 s
  • AWS ALB connection drain timeout (default): 300 s
  • Connection drain for short HTTP: 5–30 s
  • Connection drain for WebSocket/SSE: 300+ s
  • Slow-start ramp duration (typical): 1–5 min

Connection draining

When removing a backend (deployment, scale-in, maintenance), you must not tear down active connections abruptly. Connection draining:

  1. Stop routing new requests to the backend.
  2. Allow in-flight requests to complete within a drain timeout.
  3. After the timeout, force-close any remaining connections.

Drain timeouts:

  • Short HTTP requests: 5–30 s.
  • Long-running WebSocket / SSE connections: 300+ s.
  • AWS ALB default: 300 s (configurable 0–3600 s).
  • GCP: configurable 0–3600 s.

The application side. On SIGTERM, the backend should:

  1. Call close(listening_socket) — stop accepting new connections.
  2. Finish all in-flight requests.
  3. Exit cleanly.

Without draining, the LB removes the backend mid-request: the client receives a connection reset and must retry. With draining, the request completes normally and the backend exits quietly.
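
On the application side, the core of draining is refusing new work while waiting for in-flight work to finish, with a deadline. The sketch below shows that logic in isolation; `InFlightTracker` and its method names are hypothetical, and a real service would also close its listening socket and would trigger `drain` from its SIGTERM handling rather than calling it inline as done here.

```python
import threading


class InFlightTracker:
    """Counts in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Admit a request unless draining has started."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one admitted request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout_s: float) -> bool:
        """Refuse new work, then wait up to timeout_s for in-flight work.

        Returns True if everything finished, False if the timeout expired
        (the caller would then force-close whatever remains).
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout_s)


tracker = InFlightTracker()
admitted = tracker.try_begin()                 # a request arrives before shutdown
worker = threading.Timer(0.05, tracker.end)    # it finishes about 50 ms later
worker.start()
drained = tracker.drain(timeout_s=2.0)         # shutdown begins; wait for the request
rejected = not tracker.try_begin()             # new work is refused while draining
worker.join()
```

The drain timeout here plays the same role as the LB's drain timeout: it bounds how long shutdown waits before giving up on stragglers.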

Slow start / warm-up

When a backend rejoins the pool after recovery or first deployment, it is cold:

  • In-process caches are empty.
  • Database connection pools need priming.
  • TLS session caches are cold.

Sending 100% of traffic immediately causes the backend to fall behind under the surge, potentially triggering a cascade failure.

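Solution: Ramp traffic gradually, for example 10% → 50% → 100% over 1–5 minutes. Envoy supports this via slow_start_config (slow_start_window, plus aggression and min_weight_percent to shape the ramp) on its round-robin and least-request balancers. AWS ALB supports it via slow start mode, a target group attribute (slow_start.duration_seconds, 30–900 s) that sends a newly registered target a linearly increasing share of traffic.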
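
A linear ramp is the simplest form of slow start. The function below is a sketch of that calculation only; `slow_start_share`, the 120 s window, and the 10% floor are illustrative assumptions, and real load balancers may shape the curve differently.

```python
def slow_start_share(elapsed_s: float, window_s: float, min_share: float = 0.1) -> float:
    """Fraction of a backend's normal weight to apply during warm-up (linear ramp)."""
    if window_s <= 0 or elapsed_s >= window_s:
        return 1.0  # ramp disabled or finished: full weight
    progress = max(elapsed_s, 0.0) / window_s
    return max(min_share, progress)  # never drop below the floor


ramp = [round(slow_start_share(t, window_s=120.0), 2) for t in (0, 30, 60, 90, 120)]
print(ramp)  # [0.1, 0.25, 0.5, 0.75, 1.0]
```

The floor matters: with a share of zero the backend would receive no traffic at all and its caches would never warm.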

Trace it

Trace a backend crash, health-check detection, and graceful rejoining.

  1. Backend B2 crashes. The LB's active health check probes B2 via HTTP GET /healthz every 30 s. What does the LB see on the first probe after the crash?
  2. After 2–3 consecutive health check failures, the LB marks B2 unhealthy. What happens to new requests vs in-flight requests?
  3. B2 restarts and passes 3 consecutive health checks. How does traffic ramp back?
  4. B3 is still up but its database pool is exhausted, so it returns 500 on all requests. The active health check (TCP SYN) still succeeds. What detects the problem?

Edge cases

Health-check endpoint design. A trivial /healthz that always returns 200 is dangerous — it passes the active probe even when the application is broken. A good health-check endpoint tests the minimum critical dependencies: can we reach the database? Can we reach the cache? Return 200 only if the backend can actually serve requests. But do not add all dependencies: if a non-critical downstream service is down, returning 500 from /healthz will eject the backend unnecessarily.
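
The critical versus non-critical split can be sketched as a function that a /healthz handler might call. Everything here is hypothetical (the `healthz` function, the dependency names, and the stand-in checks); it uses 503 as the failing status, though any non-200 response has the same effect on a probe that expects 200.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def _safe(check: Check) -> bool:
    """Run a check, treating any exception as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def healthz(critical: Dict[str, Check], non_critical: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, per-dependency report).

    Only a failing critical dependency turns the status into 503;
    non-critical failures are reported but do not fail the probe.
    """
    report: Dict[str, str] = {}
    status = 200
    for name, check in critical.items():
        ok = _safe(check)
        report[name] = "ok" if ok else "fail"
        if not ok:
            status = 503
    for name, check in non_critical.items():
        report[name] = "ok" if _safe(check) else "degraded"
    return status, report


def broken_recommendations() -> bool:
    # Stand-in for a non-critical downstream service that is currently down.
    raise ConnectionError("recommendations service unreachable")


status, report = healthz(
    critical={"database": lambda: True, "cache": lambda: True},
    non_critical={"recommendations": broken_recommendations},
)
print(status, report)
```

The non-critical outage shows up in the report as "degraded" while the status stays 200, so the backend keeps serving the requests it can still handle.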

Quiz

A backend accepts TCP connections but its thread pool is exhausted and it returns 500 on every HTTP request. Which health check type catches this, and which misses it?

Quiz

Why is connection draining necessary when removing a backend from the load balancer pool?

Recall before you leave

  1. Why should you use both active and passive health checks rather than one alone?
  2. What is health-check flapping and how does Envoy's ejection backoff mitigate it?
  3. What should a backend do on SIGTERM to cooperate with connection draining?

Recap

Two health-check strategies complement each other. Active checks send probes (HTTP GET, TCP SYN, gRPC) every 10–30 seconds and eject a backend after 2–3 consecutive failures; they catch hard failures quickly, but a TCP-only probe or a trivial /healthz stub is blind to application-layer degradation. Passive outlier detection watches real traffic and ejects after repeated 5xx responses, with an ejection time that grows on each consecutive ejection (30 s up to 300 s in Envoy) and dampens flapping. Connection draining covers removal: new requests stop immediately, and in-flight requests get 5–30 s (HTTP) or 300+ s (WebSocket) to finish. Slow start protects rejoining backends by ramping traffic from 10% to 100% over 1–5 minutes so cold caches and connection pools can warm before they bear full load.
