
Backend Architecture

Zero-downtime deploys: graceful shutdown as a fleet property

Crux: across a rolling deploy, graceful shutdown becomes a fleet property. Deregister before terminate, bring new pods to ready before old ones drain, size the deregistration delay to real p99, and jitter connection close so a fleet of clients doesn’t reconnect in one thundering spike.

Every lesson so far has been about one process dying well. But you never deploy one process — you roll a fleet. A rolling deploy walks through your replicas, replacing each old pod with a new one, and the user-visible promise is “zero downtime”: no request fails, no client notices, even though every single instance is being torn down and rebuilt underneath them. Here’s the unsettling part: a perfect per-instance graceful shutdown does not give you that promise on its own. You can have a flawless SIGTERM handler, correct readiness, clean draining, idempotent requeues — and still take an outage on deploy, because the coordination between instances is wrong. Tear down an old pod before its replacement is ready and you’ve shrunk capacity mid-deploy. Tear down too many at once and you brown out. Let the load balancer keep a stale route for a few seconds and you refuse connections. Send every client Connection: close at the same instant and they all reconnect in one synchronized thundering spike that knocks over the very pods you just brought up. Zero-downtime deploy is not the sum of good shutdowns; it’s a property of how the shutdowns are orchestrated across the fleet. This is the unit’s capstone: everything you’ve learned, lifted from one process to many.

Capacity must never dip: new ready before old drains

The first fleet-level invariant is that total serving capacity never drops below demand during the roll. A rolling deploy is governed by two knobs (Kubernetes names them maxUnavailable and maxSurge, but the concept is universal): how many old pods may be down at once, and how many extra new pods may be spun up beyond the desired count. The safe pattern is surge up, then drain: bring a new pod fully to ready (passing its readiness probe, warmed, connected to its datastores) before you start draining the old one it replaces. If you reverse that and terminate the old pod before the new one is serving, you open a window where the fleet is one instance short, and under steady load that missing capacity shows up as dropped or queued requests. maxUnavailable: 0 with a positive maxSurge encodes “never go below full capacity”; it’s slower and costs transient extra instances, but it’s the setting that actually delivers zero downtime under load.
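The capacity argument can be checked with a few lines of arithmetic. The sketch below is a toy model, not a Kubernetes client: the function names are hypothetical, the knobs are absolute counts rather than percentages, and every pod is assumed to become ready instantly once started.

```python
from dataclasses import dataclass


@dataclass
class RolloutBounds:
    min_ready: int   # fewest pods the knobs allow to be ready during the roll
    max_total: int   # most pods (old + new) that may exist at once


def rollout_bounds(replicas: int, max_unavailable: int, max_surge: int) -> RolloutBounds:
    """Capacity envelope implied by absolute maxUnavailable / maxSurge values."""
    if max_unavailable == 0 and max_surge == 0:
        raise ValueError("rollout cannot progress: both knobs are zero")
    return RolloutBounds(replicas - max_unavailable, replicas + max_surge)


def lowest_ready_surge_then_drain(replicas: int, max_surge: int) -> int:
    """Walk a surge-up-then-drain roll and return the lowest ready count seen."""
    if max_surge <= 0:
        raise ValueError("surge-then-drain needs max_surge > 0")
    old_ready, new_ready = replicas, 0
    lowest = old_ready
    while old_ready > 0:
        batch = min(max_surge, old_ready)
        new_ready += batch                      # new pods reach ready first
        lowest = min(lowest, old_ready + new_ready)
        old_ready -= batch                      # only then are old pods drained
        lowest = min(lowest, old_ready + new_ready)
    return lowest


def lowest_ready_drain_then_start(replicas: int, max_unavailable: int) -> int:
    """The reversed order: old pods leave before their replacements are ready."""
    if max_unavailable <= 0:
        raise ValueError("drain-then-start needs max_unavailable > 0")
    old_ready, new_ready = replicas, 0
    lowest = old_ready
    while old_ready > 0:
        batch = min(max_unavailable, old_ready)
        old_ready -= batch                      # capacity dips here
        lowest = min(lowest, old_ready + new_ready)
        new_ready += batch
    return lowest
```

With ten replicas, the surge-first walk never reports fewer than ten ready pods, while the reversed order dips by the batch size on every step. Real rollouts add readiness delay on top, which makes the reversed-order dip last longer.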

Deregister before terminate, sized to real propagation

This is the deregistration race from lesson three, now as a fleet discipline. Every old pod must be removed from the load balancer’s rotation before it stops accepting connections, and because deregistration is eventually consistent, you must wait out the propagation before the listener closes. At fleet scale the numbers get concrete and they get big. A cloud load balancer’s deregistration delay (connection-draining timeout) is commonly set to 30–60 seconds, though defaults vary by provider and can be much higher (AWS target groups default to 300 seconds); for services with long-lived connections or large uploads, teams push it to 600 seconds or more. The sizing rule that survives contact with production: set the drain window to roughly 3× your p99 request duration so all but the most pathological in-flight requests complete, then clamp it with the guardian timeout. Get this wrong across a hundred pods and every deploy sheds a thin, correlated layer of 502s: invisible in a single request, glaring in the aggregate error rate.
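The sizing rule is simple enough to write down. This is a minimal sketch under stated assumptions: the function names, the nearest-rank percentile, and the `headroom_s` parameter (time reserved for closing resources before the guardian fires) are illustrative choices, not something the lesson prescribes.

```python
def p99(durations_s: list[float]) -> float:
    """Nearest-rank 99th percentile of observed request durations, in seconds."""
    if not durations_s:
        raise ValueError("need at least one sample")
    ordered = sorted(durations_s)
    rank = (99 * len(ordered) + 99) // 100      # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def drain_window_s(
    durations_s: list[float],
    guardian_timeout_s: float,
    multiplier: float = 3.0,
    headroom_s: float = 5.0,
) -> float:
    """Roughly multiplier x p99, clamped so teardown still fits before the guardian."""
    ceiling = guardian_timeout_s - headroom_s
    if ceiling <= 0:
        raise ValueError("guardian timeout leaves no room to drain")
    return min(multiplier * p99(durations_s), ceiling)
```

If the clamp is what determines the result, that is a signal worth surfacing: either the guardian timeout is too short for the traffic the service carries, or the longest requests need a different handling path than a plain drain.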

The thundering reconnect: jitter the close

The last trap is one only the fleet can show you. When you drain, you send Connection: close so clients reconnect to healthy instances — correct. But if every pod closes its keep-alive connections at the same moment, every client reconnects at the same moment, and a fleet of thousands of clients lands as one synchronized thundering reconnect on the surviving (and newly-started, cold) pods — a self-inflicted thundering herd, the exact failure mode the circuit-breaker and idempotency units kept circling. The fix is the same one those units used: jitter. Spread connection close over a randomized window instead of a single instant, so reconnections smear across time rather than spiking. Combined with surge-up capacity and properly-sized deregistration, jittered close is what turns “every instance shut down correctly” into “the fleet never hiccupped.” Zero downtime is graceful shutdown plus coordination: capacity, ordering, and spread.
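A rough simulation makes the difference visible. The sketch below assumes each client reconnects immediately after its connection is closed, uses a uniform jitter window, and measures load as closes per one-second bucket; the names and the 30-second window are hypothetical.

```python
import random
from collections import Counter


def close_delays(n_connections: int, window_s: float, rng: random.Random) -> list[float]:
    """Pick an independent uniform delay in [0, window_s] for each keep-alive connection."""
    return [rng.uniform(0.0, window_s) for _ in range(n_connections)]


def peak_reconnects_per_second(delays_s: list[float]) -> int:
    """Largest number of closes (hence reconnects) landing in any one-second bucket."""
    buckets = Counter(int(d) for d in delays_s)
    return max(buckets.values())


rng = random.Random(42)                  # seeded only so the sketch is repeatable
unjittered = [0.0] * 10_000              # every connection closed at the same instant
jittered = close_delays(10_000, 30.0, rng)
```

Comparing `peak_reconnects_per_second(unjittered)` with `peak_reconnects_per_second(jittered)` shows the spike collapsing from the full connection count to a small fraction of it. The jitter window has to fit inside the drain window, so the two are sized together.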

Why this works

Why isn’t a fleet of individually-perfect graceful shutdowns automatically a zero-downtime deploy? What does the fleet add that the single process can’t see?

Because every property that matters during a deploy is a relationship between instances, and a single process has no view of its peers. Capacity is the clearest example: one pod draining flawlessly is correct in isolation, but whether that drain causes an outage depends entirely on whether its replacement is already serving, a fact the draining pod cannot know and cannot control. The same is true of the deregistration delay: a pod can wait out propagation perfectly, but the right amount to wait is a function of the fleet’s request-duration distribution and the load balancer’s convergence time, not anything local. And the thundering reconnect is purely emergent: no individual close is wrong; the damage comes from thousands of correct closes landing simultaneously, a pattern that does not exist at the scale of one process and so cannot be prevented there.

This is the same point the distributed-failure-modes lesson hammered for circuit breakers: local correctness does not compose into global correctness for free, because the failure modes live in the interactions. Graceful shutdown gives you a well-behaved instance; zero-downtime deploy is the orchestration layer that arranges those instances in time and space (surge before drain, deregister before terminate, jitter the synchronized actions) so their individually-correct behaviors don’t collide. The deploy is where every theme of this track converges: lifecycle (the shutdown sequence), resource limits (capacity headroom), idempotency (safe requeue under churn), and distributed systems (eventual consistency and herding). Owning shutdown means owning all four at once.

Fleet concern | Wrong way | Right way | Failure if ignored
Capacity | Drain old before new is ready | Surge new to ready, then drain old | Mid-deploy capacity dip, dropped load
Batch size | Replace all/many at once | Bounded maxUnavailable / maxSurge | Brownout under steady traffic
Routing | Close listener at SIGTERM | Deregister + wait propagation first | Connection-refused window per pod
Drain window | Fixed tiny timeout | ~3× p99, clamp with guardian | Severed long requests, 502s
Reconnect | All clients close at once | Jittered Connection: close | Thundering reconnect on cold pods
Quiz

A service has a flawless per-instance graceful shutdown but still drops requests during rolling deploys under load. Which fleet-level setting most directly fixes a mid-deploy capacity dip?

Quiz

During a deploy, every pod sends Connection: close the instant it begins draining, and the surviving cold pods immediately fall over. What happened and what is the fix?

Order the steps

Order one step of a zero-downtime rolling deploy for a single replica:

  1. Start a new pod and wait until it passes readiness (surge before drain)
  2. Deregister the old pod from the load balancer and wait out propagation
  3. Drain the old pod: jittered Connection: close, finish in-flight up to ~3× p99
  4. Close resources in reverse dependency order and exit before the guardian timeout
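The ordering above can be sketched as a small driver. This is an illustration only: every callable is a hypothetical stand-in for whatever your orchestrator, load balancer API, and service actually expose, and the early return reflects the assumption that an old pod should be left untouched when its replacement never becomes ready.

```python
from typing import Callable


def replace_one_replica(
    start_new_and_wait_ready: Callable[[], bool],
    deregister_old: Callable[[], None],
    wait_for_propagation: Callable[[], None],
    drain_old: Callable[[], None],
    close_resources_and_exit: Callable[[], None],
) -> bool:
    """Run the steps in order; leave the old pod serving if the new one is not ready."""
    if not start_new_and_wait_ready():      # step 1: surge before drain
        return False
    deregister_old()                        # step 2: leave the rotation...
    wait_for_propagation()                  # ...and wait for the LB to converge
    drain_old()                             # step 3: jittered close, finish in-flight
    close_resources_and_exit()              # step 4: teardown before the guardian
    return True
```

Keeping the sequence in one place makes the invariant reviewable: no path reaches the drain without a ready replacement and a completed deregistration.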
Recall before you leave
  1. Why doesn't a fleet of perfect per-instance graceful shutdowns guarantee a zero-downtime deploy?
  2. What are the three fleet-level invariants of a zero-downtime rolling deploy and their real numbers?
Recap

You never deploy one process, you roll a fleet, and a perfect per-instance graceful shutdown does not by itself buy zero downtime — the coordination between instances is the missing half. Three fleet invariants close the unit. Capacity must never dip below demand: surge a new pod to ready before draining the old one it replaces, which maxUnavailable: 0 plus a positive maxSurge encodes at the cost of transient instances. Routing must deregister before terminate: leave the load balancer’s rotation and wait out eventually-consistent propagation before the listener closes, with deregistration delays of 30–60s (600s+ for long connections) and a drain window near 3× p99, clamped by the guardian timeout. And synchronized actions must be jittered: un-jittered simultaneous Connection: close produces a thundering reconnect that topples the cold pods you just started, so spread the closes over a randomized window. The deep lesson, shared with the distributed-failure-modes lesson, is that local correctness does not compose into global correctness for free — the failure modes live in the interactions, and the deploy is where lifecycle, resource limits, idempotency, and distributed-systems eventual consistency all converge. That convergence is exactly what the final unit takes up, putting the whole backend track together into one coherent picture of a resilient service under churn.

