Backend Architecture
Zero-downtime deploys: graceful shutdown as a fleet property
Every lesson so far has been about one process dying well. But you never deploy one process — you roll a fleet. A rolling deploy walks through your replicas, replacing each old pod with a new one, and the user-visible promise is “zero downtime”: no request fails, no client notices, even though every single instance is being torn down and rebuilt underneath them. Here’s the unsettling part: a perfect per-instance graceful shutdown does not give you that promise on its own. You can have a flawless SIGTERM handler, correct readiness, clean draining, idempotent requeues — and still take an outage on deploy, because the coordination between instances is wrong. Tear down an old pod before its replacement is ready and you’ve shrunk capacity mid-deploy. Tear down too many at once and you brown out. Let the load balancer keep a stale route for a few seconds and you refuse connections. Send every client Connection: close at the same instant and they all reconnect in one synchronized thundering spike that knocks over the very pods you just brought up. Zero-downtime deploy is not the sum of good shutdowns; it’s a property of how the shutdowns are orchestrated across the fleet. This is the unit’s capstone: everything you’ve learned, lifted from one process to many.
Capacity must never dip: new ready before old drains
The first fleet-level invariant is that total serving capacity never drops below demand during the roll. A rolling deploy is governed by two knobs (Kubernetes names them maxUnavailable and maxSurge, but the concept is universal): how many old pods may be down at once, and how many extra new pods may be spun up beyond the desired count. The safe pattern is surge up, then drain: bring a new pod fully to ready (passing its readiness probe, warmed, connected to its datastores) before you start draining the old one it replaces. If you reverse that and terminate the old pod before the new one is serving, the fleet is one instance short for that window, and under steady load the missing capacity shows up as dropped or queued requests. maxUnavailable: 0 with a positive maxSurge encodes “never go below full capacity”; it’s slower and costs transient extra instances, but it’s the setting that actually delivers zero downtime under load.
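The effect of the two knobs can be checked with a toy model. The sketch below is a simplified simulation, not how any real deployment controller is implemented: the name `simulate_roll` and its step model (a new pod becomes ready one tick after it starts, and old pods are only killed while ready capacity stays at or above `desired - max_unavailable`) are assumptions made for illustration.

```python
def simulate_roll(desired, max_unavailable, max_surge):
    """Return the ready-pod count observed at each step of a simulated roll.

    Toy model (assumption): new pods take exactly one tick to become ready.
    """
    if max_unavailable == 0 and max_surge == 0:
        raise ValueError("the roll cannot make progress with both knobs at 0")

    old_ready, new_ready, new_starting = desired, 0, 0
    observed = [old_ready + new_ready]

    while new_ready < desired or old_ready > 0:
        # Start new pods, bounded by the surge ceiling on total pod count.
        total = old_ready + new_ready + new_starting
        still_needed = desired - new_ready - new_starting
        headroom = desired + max_surge - total
        new_starting += max(min(still_needed, headroom), 0)

        # Kill old pods only while ready capacity stays above the floor.
        floor = desired - max_unavailable
        killable = (old_ready + new_ready) - floor
        old_ready -= max(min(old_ready, killable), 0)
        observed.append(old_ready + new_ready)

        # Pods started this tick become ready.
        new_ready += new_starting
        new_starting = 0
        observed.append(old_ready + new_ready)

    return observed


surge_first = simulate_roll(desired=4, max_unavailable=0, max_surge=1)
drain_first = simulate_roll(desired=4, max_unavailable=1, max_surge=0)
print(min(surge_first), min(drain_first))
```

In this model the surge-first roll never has fewer than 4 ready pods, while the drain-first roll spends part of the deploy at 3. The price of the first setting is the transient fifth pod.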
Deregister before terminate, sized to real propagation
This is the deregistration race from lesson three, but now as a fleet discipline. Every old pod must be removed from the load balancer’s rotation before it stops accepting connections, and because deregistration is eventually consistent, you must wait out the propagation before the listener closes. At fleet scale the numbers get concrete and they get big. A cloud load balancer’s deregistration delay (connection-draining timeout) is commonly set to 30–60 seconds for short-request services; defaults vary by provider (AWS target groups, for example, default to 300 seconds), and for services with long-lived connections or large uploads, teams push it to 600 seconds or more. The sizing rule that survives contact with production: set the drain window to roughly 3× your p99 request duration so all but the most pathological in-flight requests complete, then clamp it with the guardian timeout. Get this wrong across a hundred pods and every deploy sheds a thin, correlated layer of 502s, invisible in a single request but glaring in the aggregate error rate.
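The sizing rule reduces to a few lines of arithmetic. The following is a minimal sketch, assuming you already have a sample of recent request durations in seconds; the function name `drain_window_seconds` and its parameters are hypothetical, and the 3× multiplier is the rule of thumb from the text rather than a universal constant.

```python
import statistics


def drain_window_seconds(durations_s, guardian_timeout_s, multiplier=3.0):
    """Return roughly multiplier x p99 of the sample, clamped to the guardian timeout.

    durations_s needs at least two samples for statistics.quantiles.
    """
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(durations_s, n=100)[98]
    return min(multiplier * p99, guardian_timeout_s)


# Short requests: the window is set by 3 x p99.
fast = drain_window_seconds([0.2] * 100, guardian_timeout_s=25.0)

# Long requests: 3 x p99 would exceed the guardian timeout, so the clamp wins.
slow = drain_window_seconds([10.0] * 100, guardian_timeout_s=25.0)
```

When the clamp wins, as in the second call, some in-flight requests will be severed; that is a signal to raise the guardian timeout or shorten the slowest requests, not to ignore the limit.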
The thundering reconnect: jitter the close
The last trap is one only the fleet can show you. When you drain, you send Connection: close so clients reconnect to healthy instances — correct. But if every pod closes its keep-alive connections at the same moment, every client reconnects at the same moment, and a fleet of thousands of clients lands as one synchronized thundering reconnect on the surviving (and newly-started, cold) pods — a self-inflicted thundering herd, the exact failure mode the circuit-breaker and idempotency units kept circling. The fix is the same one those units used: jitter. Spread connection close over a randomized window instead of a single instant, so reconnections smear across time rather than spiking. Combined with surge-up capacity and properly-sized deregistration, jittered close is what turns “every instance shut down correctly” into “the fleet never hiccupped.” Zero downtime is graceful shutdown plus coordination: capacity, ordering, and spread.
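Jittering the close amounts to giving each connection its own randomized delay. Below is a minimal sketch, assuming the server can schedule a per-connection close; the names `jittered_close_delays` and `closes_per_second` are hypothetical, and a uniform distribution is one reasonable choice among several.

```python
import random
from collections import Counter


def jittered_close_delays(connection_ids, window_s, rng=None):
    """Assign each connection a close delay drawn uniformly from [0, window_s]."""
    rng = rng or random.Random()
    return {cid: rng.uniform(0.0, window_s) for cid in connection_ids}


def closes_per_second(delays):
    """Bucket the delays into whole seconds to show how reconnects spread out."""
    return Counter(int(delay) for delay in delays.values())


# 1000 connections closed over a 10-second window instead of at one instant.
delays = jittered_close_delays(range(1000), window_s=10.0, rng=random.Random(42))
buckets = closes_per_second(delays)
print(max(buckets.values()))  # peak closes in any one second, far below 1000
```

Without jitter, all 1000 closes land in the same second. With it, the reconnect load is spread across the window, which gives cold pods time to warm up. The window has to fit inside the drain window, since a connection that has not been closed by the end of the drain is severed anyway.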
Why this works
Why isn’t a fleet of individually-perfect graceful shutdowns automatically a zero-downtime deploy — what does the fleet add that the single process can’t see? Because every property that matters during a deploy is a relationship between instances, and a single process has no view of its peers. Capacity is the clearest example: one pod draining flawlessly is correct in isolation, but whether that drain causes an outage depends entirely on whether its replacement is already serving — a fact the draining pod cannot know and cannot control. The same is true of the deregistration delay: a pod can wait out propagation perfectly, but the right amount to wait is a function of the fleet’s request-duration distribution and the load balancer’s convergence time, not anything local. And the thundering reconnect is purely emergent — no individual close is wrong; the damage comes from thousands of correct closes landing simultaneously, a pattern that literally does not exist at the scale of one process and so cannot be prevented there. This is the same lesson the distributed-failure-modes lesson hammered for circuit breakers: local correctness does not compose into global correctness for free, because the failure modes live in the interactions. Graceful shutdown gives you a well-behaved instance; zero-downtime deploy is the orchestration layer that arranges those instances in time and space — surge before drain, deregister before terminate, jitter the synchronized actions — so their individually-correct behaviors don’t collide. The deploy is where every theme of this track converges: lifecycle (the shutdown sequence), resource limits (capacity headroom), idempotency (safe requeue under churn), and distributed systems (eventual consistency and herding). Owning shutdown means owning all four at once.
| Fleet concern | Wrong way | Right way | Failure if ignored |
|---|---|---|---|
| Capacity | Drain old before new is ready | Surge new to ready, then drain old | Mid-deploy capacity dip, dropped load |
| Batch size | Replace all/many at once | Bounded maxUnavailable / maxSurge | Brownout under steady traffic |
| Routing | Close listener at SIGTERM | Deregister + wait propagation first | Connection-refused window per pod |
| Drain window | Fixed tiny timeout | ~3× p99, clamp with guardian | Severed long requests, 502s |
| Reconnect | All pods close connections at once | Jittered Connection: close | Thundering reconnect on cold pods |
A service has a flawless per-instance graceful shutdown but still drops requests during rolling deploys under load. Which fleet-level setting most directly fixes a mid-deploy capacity dip?
During a deploy, every pod sends Connection: close the instant it begins draining, and the surviving cold pods immediately fall over. What happened and what is the fix?
Order the steps of a zero-downtime rolling deploy for a single replica:
1. Start a new pod and wait until it passes readiness (surge before drain)
2. Deregister the old pod from the load balancer and wait out propagation
3. Drain the old pod: jittered Connection: close, finish in-flight up to ~3× p99
4. Close resources in reverse dependency order and exit before the guardian timeout
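The ordering above can be sketched as a single function that refuses to touch the old pod until its replacement is serving. This is an illustrative sketch only: all six hooks (`start_new`, `wait_ready`, `deregister_old`, `wait_propagation`, `drain_old`, `close_old`) are hypothetical callables that you would wire to your own platform.

```python
def replace_one_replica(start_new, wait_ready, deregister_old,
                        wait_propagation, drain_old, close_old):
    """Replace one replica in the safe order; return True if the swap happened."""
    # Step 1: surge. The old pod keeps serving while the new one comes up.
    new_pod = start_new()
    if not wait_ready(new_pod):
        # The new pod never became ready: leave the old pod untouched so
        # capacity does not dip, and let the caller decide how to roll back.
        return False

    # Step 2: leave the load balancer's rotation and wait out propagation.
    deregister_old()
    wait_propagation()

    # Step 3: drain in-flight work (jittered close, bounded drain window).
    drain_old()

    # Step 4: release resources and exit before the guardian timeout.
    close_old()
    return True


# Usage with recording stubs to show the order of calls.
calls = []
ok = replace_one_replica(
    start_new=lambda: calls.append("start") or "new-pod",
    wait_ready=lambda pod: calls.append("ready") or True,
    deregister_old=lambda: calls.append("deregister"),
    wait_propagation=lambda: calls.append("propagate"),
    drain_old=lambda: calls.append("drain"),
    close_old=lambda: calls.append("close"),
)
```

The early return is the important design choice: a failed readiness check stops the roll with the old pod still in rotation, so a bad build costs a stalled deploy rather than lost capacity.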
- 01 Why doesn't a fleet of perfect per-instance graceful shutdowns guarantee a zero-downtime deploy?
- 02 What are the three fleet-level invariants of a zero-downtime rolling deploy and their real numbers?
You never deploy one process, you roll a fleet, and a perfect per-instance graceful shutdown does not by itself buy zero downtime — the coordination between instances is the missing half. Three fleet invariants close the unit. Capacity must never dip below demand: surge a new pod to ready before draining the old one it replaces, which maxUnavailable: 0 plus a positive maxSurge encodes at the cost of transient instances. Routing must deregister before terminate: leave the load balancer’s rotation and wait out eventually-consistent propagation before the listener closes, with deregistration delays of 30–60s (600s+ for long connections) and a drain window near 3× p99, clamped by the guardian timeout. And synchronized actions must be jittered: un-jittered simultaneous Connection: close produces a thundering reconnect that topples the cold pods you just started, so spread the closes over a randomized window. The deep lesson, shared with the distributed-failure-modes lesson, is that local correctness does not compose into global correctness for free — the failure modes live in the interactions, and the deploy is where lifecycle, resource limits, idempotency, and distributed-systems eventual consistency all converge. That convergence is exactly what the final unit takes up, putting the whole backend track together into one coherent picture of a resilient service under churn.