Backend Architecture BE · 07 · 06

Zero-downtime deploys: graceful shutdown as a fleet property

Across a rolling deploy, graceful shutdown becomes a fleet property: deregister before terminate, bring new pods to ready before old ones drain, size the deregistration delay to real p99, and jitter connection close so a fleet of clients doesn't reconnect in one thundering spike.

BE Senior ◷ 17 min

Every lesson so far has been about one process dying well. But you never deploy one process — you roll a fleet. A rolling deploy walks through your replicas, replacing each old pod with a new one, and the user-visible promise is “zero downtime”: no request fails, no client notices, even though every single instance is being torn down and rebuilt underneath them. Here’s the unsettling part: a perfect per-instance graceful shutdown does not give you that promise on its own. You can have a flawless SIGTERM handler, correct readiness, clean draining, idempotent requeues — and still take an outage on deploy, because the coordination between instances is wrong. Tear down an old pod before its replacement is ready and you’ve shrunk capacity mid-deploy. Tear down too many at once and you brown out. Let the load balancer keep a stale route for a few seconds and you refuse connections. Send every client Connection: close at the same instant and they all reconnect in one synchronized thundering spike that knocks over the very pods you just brought up. Zero-downtime deploy is not the sum of good shutdowns; it’s a property of how the shutdowns are orchestrated across the fleet. This is the unit’s capstone: everything you’ve learned, lifted from one process to many.

Capacity must never dip: new ready before old drains

The first fleet-level invariant is that total serving capacity never drops below demand during the roll. A rolling deploy is governed by two knobs (Kubernetes names them maxUnavailable and maxSurge, but the concept is universal): how many old pods may be down at once, and how many extra new pods may be spun up beyond the desired count. The safe pattern is surge up, then drain: bring a new pod fully to ready (passing its readiness probe, warmed, connected to its datastores) before you start draining the old one it replaces. If you reverse that and terminate the old pod before the new one is serving, you open a window where the fleet is one instance short, and under steady load that missing capacity shows up as dropped or queued requests. maxUnavailable: 0 with a positive maxSurge encodes “never go below full capacity”; it’s slower and costs transient extra instances, but it’s the setting that actually delivers zero downtime under load.
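The effect of the two knobs can be checked with a toy simulation. The sketch below is a simplified model, not the logic of any real deployment controller: it assumes every new pod needs exactly one tick to become ready, and the function name `min_ready_during_roll` is made up for illustration.

```python
def min_ready_during_roll(desired: int, max_unavailable: int, max_surge: int) -> int:
    """Return the lowest number of ready pods seen during a simulated roll.

    Simplifying assumption: a pod started in one tick is ready in the next.
    """
    if max_unavailable == 0 and max_surge == 0:
        raise ValueError("the roll can never make progress with both knobs at 0")

    old_ready, new_ready, new_pending = desired, 0, 0
    min_ready = desired

    while old_ready > 0 or new_ready < desired:
        # Pods started on the previous tick pass readiness now.
        new_ready += new_pending
        new_pending = 0

        # Surge: start new pods, never exceeding desired + max_surge in total.
        total = old_ready + new_ready
        still_needed = desired - new_ready
        new_pending = max(0, min(desired + max_surge - total, still_needed))

        # Drain: remove old pods only while ready stays >= desired - max_unavailable.
        ready = old_ready + new_ready
        removable = max(0, min(old_ready, ready - (desired - max_unavailable)))
        old_ready -= removable

        min_ready = min(min_ready, old_ready + new_ready)

    return min_ready


print(min_ready_during_roll(4, max_unavailable=0, max_surge=1))  # 4
print(min_ready_during_roll(4, max_unavailable=1, max_surge=0))  # 3
```

With `max_unavailable=0` the model never removes an old pod until its replacement is ready, so ready capacity stays at the desired count; with `max_surge=0` the fleet has to go one pod short before it can start a replacement.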

Deregister before terminate, sized to real propagation

This is the deregistration race from lesson three, but now as a fleet discipline. Every old pod must be removed from the load balancer’s rotation before it stops accepting connections, and because deregistration is eventually consistent, you must wait out the propagation before the listener closes. At fleet scale the numbers get concrete and they get big. A typical cloud load balancer’s deregistration delay (connection-draining timeout) defaults to 30–60 seconds; for services with long-lived connections or large uploads, teams push it to 600 seconds or more. The sizing rule that survives contact with production: set the drain window to roughly 3× your p99 request duration so all but the most pathological in-flight requests complete, then clamp it with the guardian timeout. Get this wrong across a hundred pods and every deploy sheds a thin, correlated layer of 502s, invisible in a single request but glaring in the aggregate error rate.

The drain window is not one small fixed number: cloud LB defaults run 30–60s, but long-lived connections or large uploads push it to 600s+, sized to ~3× p99 and clamped by the guardian timeout.
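The sizing rule reduces to a few lines of arithmetic. The following is a minimal sketch: the helper names `p99_nearest_rank` and `drain_window_s` are hypothetical, the nearest-rank percentile is just one common way to estimate p99, and the 3× multiplier is the lesson's rule of thumb rather than a fixed constant.

```python
def p99_nearest_rank(durations):
    """99th percentile of observed request durations (nearest-rank method)."""
    if not durations:
        raise ValueError("need at least one observed duration")
    ordered = sorted(durations)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def drain_window_s(p99_s, guardian_timeout_s, multiplier=3.0):
    """Roughly 3x p99, clamped so it never exceeds the guardian timeout."""
    return min(multiplier * p99_s, guardian_timeout_s)


# Short API requests: p99 of 2 s gives a 6 s drain window.
print(drain_window_s(2.0, guardian_timeout_s=30.0))     # 6.0
# Large uploads: 3 x 250 s would be 750 s, so the guardian timeout wins.
print(drain_window_s(250.0, guardian_timeout_s=600.0))  # 600.0
```

Feeding the function a p99 measured from production traffic, not a guess, is the point of "sized to real p99": the same formula yields seconds for a short-request API and minutes for an upload service.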

The thundering reconnect: jitter the close

The last trap is one only the fleet can show you. When you drain, you send Connection: close so clients reconnect to healthy instances — correct. But if every pod closes its keep-alive connections at the same moment, every client reconnects at the same moment, and a fleet of thousands of clients lands as one synchronized thundering reconnect on the surviving (and newly-started, cold) pods — a self-inflicted thundering herd, the exact failure mode the circuit-breaker and idempotency units kept circling. The fix is the same one those units used: jitter. Spread connection close over a randomized window instead of a single instant, so reconnections smear across time rather than spiking. Combined with surge-up capacity and properly-sized deregistration, jittered close is what turns “every instance shut down correctly” into “the fleet never hiccupped.” Zero downtime is graceful shutdown plus coordination: capacity, ordering, and spread.
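To see what jitter buys, compare the reconnect peak with and without it. This is an illustrative sketch only: it assumes each client reconnects at the moment its connection is closed, picks delays uniformly at random, and uses invented names (`jittered_close_delays`, `peak_reconnects_per_second`); a real server would schedule the actual close on its own timers.

```python
import random
from collections import Counter


def jittered_close_delays(n_connections, window_s, rng=None):
    """Pick one close delay per connection, uniformly spread over [0, window_s]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, window_s) for _ in range(n_connections)]


def peak_reconnects_per_second(delays):
    """Largest number of closes (and so reconnects) landing in any 1-second bucket."""
    buckets = Counter(int(d) for d in delays)
    return max(buckets.values(), default=0)


no_jitter = [0.0] * 1000                       # every connection closed at once
jittered = jittered_close_delays(1000, 10.0)   # closes spread over 10 seconds

print(peak_reconnects_per_second(no_jitter))   # 1000: one synchronized spike
print(peak_reconnects_per_second(jittered))    # far lower: the spike is smeared out
```

The jitter window has to fit inside the drain window, since a connection that is scheduled to close after the pod's deadline is simply severed.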

Why this works

Why isn’t a fleet of individually-perfect graceful shutdowns automatically a zero-downtime deploy — what does the fleet add that the single process can’t see? Because every property that matters during a deploy is a relationship between instances, and a single process has no view of its peers. Capacity is the clearest example: one pod draining flawlessly is correct in isolation, but whether that drain causes an outage depends entirely on whether its replacement is already serving — a fact the draining pod cannot know and cannot control. The same is true of the deregistration delay: a pod can wait out propagation perfectly, but the right amount to wait is a function of the fleet’s request-duration distribution and the load balancer’s convergence time, not anything local. And the thundering reconnect is purely emergent — no individual close is wrong; the damage comes from thousands of correct closes landing simultaneously, a pattern that literally does not exist at the scale of one process and so cannot be prevented there. This is the same lesson the distributed-failure-modes lesson hammered for circuit breakers: local correctness does not compose into global correctness for free, because the failure modes live in the interactions. Graceful shutdown gives you a well-behaved instance; zero-downtime deploy is the orchestration layer that arranges those instances in time and space — surge before drain, deregister before terminate, jitter the synchronized actions — so their individually-correct behaviors don’t collide. The deploy is where every theme of this track converges: lifecycle (the shutdown sequence), resource limits (capacity headroom), idempotency (safe requeue under churn), and distributed systems (eventual consistency and herding). Owning shutdown means owning all four at once.

Fleet concern	Wrong way	Right way	Failure if ignored
Capacity	Drain old before new is ready	Surge new to ready, then drain old	Mid-deploy capacity dip, dropped load
Batch size	Replace all/many at once	Bounded maxUnavailable / maxSurge	Brownout under steady traffic
Routing	Close listener at SIGTERM	Deregister + wait propagation first	Connection-refused window per pod
Drain window	Fixed tiny timeout	~3× p99, clamp with guardian	Severed long requests, 502s
Reconnect	All clients close at once	Jittered Connection: close	Thundering reconnect on cold pods
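The rows of this table that can be expressed as numbers lend themselves to an automated review. Below is a rough sketch under stated assumptions: `RolloutConfig` and its field names are invented for illustration and do not correspond to any real platform's schema, and the routing row is left out because it describes shutdown behavior, not a setting.

```python
from dataclasses import dataclass


@dataclass
class RolloutConfig:
    # All field names are hypothetical; map them to your platform's settings.
    max_unavailable: int
    max_surge: int
    drain_window_s: float
    p99_request_s: float
    guardian_timeout_s: float
    close_jitter_window_s: float


def review_rollout(cfg):
    """Return a list of findings; an empty list means no check fired."""
    findings = []
    # Capacity and batch size: surge up, never dip below the desired count.
    if cfg.max_unavailable > 0 or cfg.max_surge < 1:
        findings.append("capacity: use max_unavailable=0 with a positive max_surge")
    # Drain window: roughly 3x p99, clamped by the guardian timeout.
    target = min(3 * cfg.p99_request_s, cfg.guardian_timeout_s)
    if cfg.drain_window_s < target:
        findings.append(f"drain window: {cfg.drain_window_s}s is below target {target}s")
    # Reconnect: closes must be spread over a non-zero window.
    if cfg.close_jitter_window_s <= 0:
        findings.append("reconnect: Connection: close is not jittered")
    return findings


risky = RolloutConfig(1, 0, 5.0, 4.0, 60.0, 0.0)
for finding in review_rollout(risky):
    print(finding)
```

A check like this only catches misconfiguration; whether the listener actually stays open until deregistration has propagated still has to be verified in the shutdown code itself.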

Quiz

A service has a flawless per-instance graceful shutdown but still drops requests during rolling deploys under load. Which fleet-level setting most directly fixes a mid-deploy capacity dip?

Quiz

During a deploy, every pod sends Connection: close the instant it begins draining, and the surviving cold pods immediately fall over. What happened and what is the fix?

Order the steps

Order the steps of one iteration of a zero-downtime rolling deploy, as applied to a single replica:

1 Start a new pod and wait until it passes readiness (surge before drain)
2 Deregister the old pod from the load balancer and wait out propagation
3 Drain the old pod: jittered Connection: close, finish in-flight up to ~3× p99
4 Close resources in reverse dependency order and exit before the guardian timeout

Capacity never dips: new pod serves before old pod drains. Jittered close prevents a synchronized thundering reconnect.
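The four steps above can be sketched as a single function whose structure enforces the ordering. Everything here is hypothetical scaffolding: the four callables stand in for whatever your orchestrator and load balancer actually expose, and the one design choice shown is that a replacement which never becomes ready aborts the step before the old pod is touched.

```python
from typing import Callable


def replace_one_replica(
    start_new_and_wait_ready: Callable[[], bool],
    deregister_old_and_wait: Callable[[], None],
    drain_old: Callable[[], None],
    close_and_exit_old: Callable[[], None],
) -> str:
    """Run one replica replacement in the order that keeps capacity intact."""
    # 1. Surge before drain: if the new pod is not ready, leave the old one serving.
    if not start_new_and_wait_ready():
        return "aborted"
    # 2. Deregister from the load balancer and wait out propagation.
    deregister_old_and_wait()
    # 3. Drain: jittered Connection: close, finish in-flight requests.
    drain_old()
    # 4. Close resources in reverse dependency order and exit.
    close_and_exit_old()
    return "replaced"


# Usage with stand-in steps that only record the order they ran in.
events = []
result = replace_one_replica(
    lambda: events.append("new ready") or True,
    lambda: events.append("old deregistered"),
    lambda: events.append("old drained"),
    lambda: events.append("old exited"),
)
print(result, events)
```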

key takeaway

Zero-downtime deploy is not the sum of good per-instance shutdowns — it is a fleet property, the orchestration of those shutdowns across replicas, and a flawless SIGTERM handler can still take an outage if the coordination is wrong. Three invariants carry it. Capacity must never dip: surge a new pod to ready (readiness-passing, warmed, datastore-connected) before draining the old one it replaces — maxUnavailable: 0 with a positive maxSurge encodes “never go below full capacity,” slower and costing transient instances but the setting that actually delivers zero downtime under load. Routing must deregister before terminate: every old pod leaves the load balancer’s rotation and waits out eventually-consistent propagation before its listener closes, with a deregistration delay defaulting to 30–60s (600s+ for long connections or uploads) and a drain window sized to roughly 3× p99 request duration, clamped by the guardian timeout. And synchronized actions must be jittered: if every pod sends Connection: close at the same instant, every client reconnects at once — a self-inflicted thundering reconnect that knocks over the surviving cold pods — so spread the closes over a randomized window. The deploy is where the whole track converges: lifecycle, resource limits, idempotency for safe requeue under churn, and distributed-systems eventual consistency. Local correctness does not compose into global correctness for free; the failure modes live in the interactions.

Recall before you leave

01
Why doesn't a fleet of perfect per-instance graceful shutdowns guarantee a zero-downtime deploy?
02
What are the three fleet-level invariants of a zero-downtime rolling deploy and their real numbers?

Recap

You never deploy one process, you roll a fleet, and a perfect per-instance graceful shutdown does not by itself buy zero downtime — the coordination between instances is the missing half. Three fleet invariants close the unit. Capacity must never dip below demand: surge a new pod to ready before draining the old one it replaces, which maxUnavailable: 0 plus a positive maxSurge encodes at the cost of transient instances. Routing must deregister before terminate: leave the load balancer’s rotation and wait out eventually-consistent propagation before the listener closes, with deregistration delays of 30–60s (600s+ for long connections) and a drain window near 3× p99, clamped by the guardian timeout. And synchronized actions must be jittered: un-jittered simultaneous Connection: close produces a thundering reconnect that topples the cold pods you just started, so spread the closes over a randomized window. The deep lesson, shared with the distributed-failure-modes lesson, is that local correctness does not compose into global correctness for free — the failure modes live in the interactions, and the deploy is where lifecycle, resource limits, idempotency, and distributed-systems eventual consistency all converge. That convergence is exactly what the final unit takes up, putting the whole backend track together into one coherent picture of a resilient service under churn. Now when you review a rolling-deploy configuration, you’ll check three things before anything else: whether new pods reach ready before old ones drain, whether the deregistration delay is sized to real p99 latency, and whether Connection: close is jittered — because all three are coordination problems a per-process shutdown can’t solve alone.
