
Backend Architecture

The deregistration race: stop routing before you stop accepting

Crux: the subtle bug is a race. SIGTERM and endpoint removal fire in parallel, but propagation delay means the load balancer keeps routing to a pod that has already started shutting down. The fix is to fail readiness first and wait out the delay before closing the listener.

You did everything right: a real SIGTERM handler, the app is PID 1, the handler stops accepting new connections the instant the signal lands and drains the in-flight ones. Deploy, and you still see a burst of connection-refused errors, but only for a second or two, only at the start of each rollout.

The cause is a race that almost nobody guesses on the first try. When the pod starts terminating, two things happen in parallel: the orchestrator sends SIGTERM to your process, and it tells the rest of the cluster to stop routing traffic to you. The second part is not instant: it propagates through the control plane, kube-proxy on every node, and the load balancer, taking anywhere from under a second on a small cluster to tens of seconds on a big one. Your process, obeying SIGTERM, closes its listener immediately. So for the propagation window there is a load balancer still confidently sending requests to a pod that has already shut its door, and every one of those gets refused. You stopped accepting before the world stopped routing.

Two clocks, started together, ending apart

The mental model that breaks here is “the load balancer knows the moment I get SIGTERM.” It does not. Endpoint removal and signal delivery are independent, concurrent actions kicked off by the same event:

  • The signal path is fast and local: the kubelet on your node sends SIGTERM to your container in milliseconds.
  • The routing path is slow and distributed: the API server marks the endpoint not-ready, that change propagates to every node’s kube-proxy (or to an external load balancer, or to an ingress controller that may be polling on an interval), and only then do the routing rules actually stop forwarding to you.

Because the routing path is eventually consistent, there is a window — measured from “SIGTERM arrives” to “the last router stops sending you traffic” — during which the cluster still believes you are a valid backend. Reported numbers: under a second on small clusters, but 10–30 seconds on large clusters or with polling ingress controllers. If your handler closes the listening socket the instant SIGTERM lands, every request that arrives in that window hits a closed port and the client gets a connection refused — the inverse of the lesson-one problem. There you exited too early and severed in-flight work; here you stop accepting too early and reject work the LB is still sending.
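To make the window concrete, here is a toy model of the arithmetic, not a measurement of any real cluster. It assumes each router keeps sending at a steady rate until its own copy of the routing state updates; the function name, the delays, and the request rate are all illustrative assumptions.

```python
from typing import Sequence


def refused_requests(router_delays_s: Sequence[float],
                     listener_closed_at_s: float,
                     rps_per_router: float) -> float:
    """Estimate requests that hit a closed listener.

    router_delays_s: seconds after termination starts at which each router
        stops sending traffic (illustrative values, not measurements).
    listener_closed_at_s: seconds after termination starts at which the
        process closes its listening socket.
    """
    refused = 0.0
    for delay in router_delays_s:
        # A router is "exposed" from the moment the listener closes
        # until that router finally stops routing to the pod.
        exposed_s = max(0.0, delay - listener_closed_at_s)
        refused += exposed_s * rps_per_router
    return refused


delays = [0.25, 1.5, 12.0]  # hypothetical: two fast routers, one slow poller

# Listener closes the instant SIGTERM lands: every router is exposed.
print(refused_requests(delays, listener_closed_at_s=0.0, rps_per_router=50))   # 687.5

# Listener stays open past the slowest router: nothing is refused.
print(refused_requests(delays, listener_closed_at_s=15.0, rps_per_router=50))  # 0.0
```

In this model the slowest router dominates the total, which is why one polling ingress controller can stretch the window even when every other hop is fast.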

The fix: change the order, don’t just add a handler

The cure is to make routing stop before you stop accepting — to reverse the order of the two clocks. There are two complementary levers:

  • Fail the readiness probe first. Readiness is the signal the orchestrator uses to decide whether to route to you. The moment you begin shutting down, flip your readiness endpoint to failing (or unready). That starts the deregistration clock through the normal mechanism — but it does not finish it, because propagation still takes time.
  • Add a preStop sleep to cover propagation. Because the preStop hook runs before SIGTERM and the orchestrator blocks on it, a preStop that simply sleeps for a few seconds (commonly 5–15s, sized to your cluster’s real propagation delay) holds the process open and accepting while the not-ready status spreads. Only after the sleep does SIGTERM arrive and your handler close the listener — by which point the routers have caught up and are no longer sending you new traffic.
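The two levers above can be sketched as a pod spec fragment. The sketch below builds it as a Python dict and prints it as JSON. The container name, image, port, probe paths, and all the numbers are placeholders chosen for illustration; the `exec` form of preStop also assumes the image ships a `sleep` binary. One thing worth encoding explicitly: the preStop hook runs inside the termination grace period, so the grace period has to cover the sleep plus the time you need to drain.

```python
import json


def pod_spec_fragment(prestop_sleep_s: int,
                      drain_budget_s: int,
                      margin_s: int = 5) -> dict:
    """Build an illustrative pod spec fragment with a preStop sleep.

    The grace period must outlast the preStop sleep AND the drain that
    follows SIGTERM, otherwise the container is killed mid-drain.
    """
    grace_s = prestop_sleep_s + drain_budget_s + margin_s
    return {
        "terminationGracePeriodSeconds": grace_s,
        "containers": [
            {
                "name": "app",                       # hypothetical name
                "image": "example.invalid/app:1.0",  # placeholder image
                "lifecycle": {
                    # Runs before SIGTERM is sent; the process keeps
                    # accepting traffic while this sleeps.
                    "preStop": {
                        "exec": {"command": ["sleep", str(prestop_sleep_s)]}
                    }
                },
                # Separate paths so readiness can fail while liveness passes.
                "readinessProbe": {
                    "httpGet": {"path": "/readyz", "port": 8080},
                    "periodSeconds": 2,
                },
                "livenessProbe": {
                    "httpGet": {"path": "/livez", "port": 8080},
                },
            }
        ],
    }


spec = pod_spec_fragment(prestop_sleep_s=10, drain_budget_s=30)
print(json.dumps(spec, indent=2))
```

With a 10s sleep and a 30s drain budget this yields a 45s grace period. The default grace period of 30s would leave only 20s for draining once the sleep is spent.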

The principle: keep serving until you are confident nothing is still being routed to you, then stop accepting, then drain. A handler alone is not enough; the ordering relative to deregistration is the whole point.
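The same ordering can also be enforced inside the process, which is a variant of the approach described here rather than the preStop hook itself: the SIGTERM handler fails readiness, waits, and only then closes the listener. The sketch below shows just the sequencing. The three callbacks and the wait duration are hypothetical hooks you would wire to your own server, and the sleep is injectable so the order can be checked without actually waiting.

```python
import time
from typing import Callable


def graceful_shutdown(mark_unready: Callable[[], None],
                      stop_accepting: Callable[[], None],
                      drain: Callable[[], None],
                      propagation_wait_s: float,
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Run shutdown steps in order: unready, wait, stop accepting, drain."""
    mark_unready()             # 1. readiness starts failing; liveness untouched
    sleep(propagation_wait_s)  # 2. keep accepting while routers catch up
    stop_accepting()           # 3. only now close the listening socket
    drain()                    # 4. finish in-flight work, release resources


# Demonstration with recording stubs instead of a real server.
events = []
graceful_shutdown(
    mark_unready=lambda: events.append("readiness fails"),
    stop_accepting=lambda: events.append("listener closed"),
    drain=lambda: events.append("drained"),
    propagation_wait_s=10.0,
    sleep=lambda s: events.append(f"held open {s}s"),
)
print(events)
# ['readiness fails', 'held open 10.0s', 'listener closed', 'drained']
```

If a preStop sleep already covers propagation, the in-process wait can be set to zero; doing both simply spends more of the grace period.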

Don’t break readiness the wrong way

A common own-goal: people make the liveness probe and the readiness probe share an endpoint, or they let the SIGTERM handler immediately return errors from the health endpoint. If the liveness probe starts failing during shutdown, the orchestrator may decide the container is broken and kill or restart it, cutting your drain short. Keep liveness passing (the process is alive and draining is healthy behavior) and fail only readiness (do not send me new traffic). The two probes answer different questions: liveness asks “should I restart you?”, readiness asks “should I route to you?” — and during shutdown the honest answers are no and no new traffic, respectively.
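A minimal sketch of keeping the two answers separate, assuming the probes hit two different handlers backed by one shared shutdown flag. The class and method names are illustrative, and the status codes follow the usual convention that a probe treats 200 as success and 503 as failure.

```python
import threading


class HealthState:
    """Shared state behind two separate probe endpoints."""

    def __init__(self) -> None:
        self._shutting_down = threading.Event()

    def begin_shutdown(self) -> None:
        """Call at the start of shutdown, before touching the listener."""
        self._shutting_down.set()

    def liveness_status(self) -> int:
        # "Should I restart you?" A draining process is healthy,
        # so this keeps passing for the whole shutdown.
        return 200

    def readiness_status(self) -> int:
        # "Should I route to you?" Not once shutdown has begun.
        return 503 if self._shutting_down.is_set() else 200


state = HealthState()
print(state.liveness_status(), state.readiness_status())  # 200 200
state.begin_shutdown()
print(state.liveness_status(), state.readiness_status())  # 200 503
```

A real liveness check would usually also detect a genuinely wedged process; the point here is only that it must not depend on the shutdown flag.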

Why this works

Why can’t the platform just make endpoint removal synchronous with the signal, so there is no window to engineer around? Because routing in a distributed cluster is not a single switch you can flip atomically — it is replicated state spread across many independent components, and keeping replicated state perfectly consistent on every change is exactly the expensive coordination that distributed systems avoid for throughput. When a pod goes not-ready, that fact has to reach the API server, get written to the endpoints object, be observed by every node’s kube-proxy (which then rewrites local iptables or IPVS rules), and separately reach any external load balancer or ingress controller, some of which discover changes by polling on their own schedule rather than being pushed. Each of those hops is independently fast but collectively asynchronous, and there is no global clock that says “everyone has updated, now release the signal.” Making it synchronous would mean blocking every termination on the slowest router in the fleet acknowledging the change — coupling pod shutdown to cluster-wide consensus, which would make deploys crawl and would itself fail whenever any router was slow or unreachable. So the platform chooses eventual consistency and hands you the tools to bridge the gap: readiness to start the clock and preStop to wait it out. The deep lesson is the same one the circuit-breaker unit kept hitting — in a distributed system you cannot assume two events triggered together are observed together, and any correctness that depends on their ordering has to be enforced deliberately, not assumed.
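The cost of the synchronous alternative can be shown with a toy calculation. This is a thought experiment with made-up delays, not a description of how any orchestrator behaves: if termination had to wait for every router to acknowledge, it would block on the slowest one and fail outright when any router stayed silent.

```python
from typing import Optional, Sequence


def synchronous_wait_s(ack_delays_s: Sequence[Optional[float]],
                       timeout_s: float) -> float:
    """Time a termination would block waiting for every router's ack.

    ack_delays_s: seconds until each router acknowledges the change;
        None models a router that is unreachable and never answers.
    """
    worst = 0.0
    for delay in ack_delays_s:
        if delay is None or delay > timeout_s:
            raise TimeoutError("a router never acknowledged in time")
        worst = max(worst, delay)  # the slowest router sets the pace
    return worst


print(synchronous_wait_s([0.25, 1.5, 12.0], timeout_s=30.0))  # 12.0

try:
    synchronous_wait_s([0.25, None, 12.0], timeout_s=30.0)
except TimeoutError as exc:
    print("termination stuck:", exc)
```

A preStop sleep is the same wait without the coupling: you pay a fixed, bounded delay sized to typical propagation instead of depending on every router answering.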

| Moment | Your process | The load balancer | Result |
|---|---|---|---|
| No guard, SIGTERM lands | Closes listener instantly | Still routing (not yet propagated) | Connection refused for the window |
| Fail readiness first | Keeps accepting | Starts marking you not-ready | Clock started, not yet finished |
| preStop sleeps 5–15s | Keeps accepting | Propagation completes | Routers stop sending new traffic |
| SIGTERM after sleep | Now closes listener | No longer routing to you | No refused connections |

Quiz

A service with a correct SIGTERM handler that closes its listener immediately still sees a brief burst of connection-refused errors at the start of each deploy. Why?

Quiz

During shutdown, why should you fail the readiness probe but keep the liveness probe passing?

Recall before you leave
  1. What is the deregistration race and why does it cause connection-refused errors?
  2. How do you fix the race, and why fail readiness but not liveness?

Recap

Even a perfect SIGTERM handler refuses connections if it closes the listener too soon, because termination starts two clocks at once that finish apart: signal delivery is a fast local path of milliseconds, while endpoint deregistration is a slow distributed path through the API server, every node’s kube-proxy, and any external or polling load balancer — eventually consistent, taking under a second on small clusters but 10–30s on large ones. During that window the load balancer still routes to a pod whose door is already shut, producing connection-refused errors, the inverse of lesson one’s sever-too-early. The fix is ordering: fail the readiness probe first to start the deregistration clock, and add a preStop sleep of roughly 5–15s, sized to real propagation, that keeps the process accepting until routing has caught up — only then does SIGTERM close the listener. Keep liveness passing so the orchestrator does not mistake a healthy drain for a broken container and restart it mid-shutdown. With routing safely drained and the listener finally closed, the next lesson handles what comes after the door shuts: draining the in-flight requests and closing every resource in the right order.

