Backend Architecture
Graceful shutdown: multiple-choice review
Six questions that cut across the whole unit. Each one mirrors a decision you make during a real deploy — not a definition to recite, but an ordering or disposition to get right before the grace period runs out.
Confirm you can connect the signal contract, the deregistration race, drain ordering, requeue safety, and fleet coordination — the synthesis the individual lessons built toward.
A team ships a correct 45-second SIGTERM drain handler, yet requests are still cut off on every deploy. The Dockerfile starts the app with sh -c 'node server.js'. What is the root cause?
A service with a flawless SIGTERM handler that closes its listener the instant the signal lands still emits a brief burst of connection-refused errors at the very start of each rollout. Why, and what is the fix?
A shutdown handler closes the HTTP server and the database pool in the same step. Most deploys are clean, but occasionally a few requests fail mid-drain with pool-closed errors. What principle was violated?
A queue worker is four minutes into a job when SIGTERM lands, and the grace period is 30 seconds. What is the correct disposition, and what makes it safe?
During shutdown a service starts failing both its readiness and liveness probes to 'shut down faster.' Drains begin getting cut short. What went wrong?
Every per-instance shutdown is flawless, but rolling deploys under load still shed a thin layer of 502s, and occasionally the surviving pods fall over. Which pair of fleet-level fixes addresses this?
The unit’s through-line is one ordered discipline. The signal contract comes first — SIGTERM must actually reach a registered handler in PID 1, or SIGKILL ends everything at the grace-period deadline. Then the deregistration race: fail readiness and wait out propagation before you stop accepting, so you never refuse traffic the load balancer is still routing. Then drain in reverse dependency order — HTTP first, datastores last — so no in-flight request loses a resource it still needs, all bounded by a guardian timeout. Work that won’t fit the deadline budget is rejected (long requests) or requeued (jobs), and requeue is safe only when the consumer is idempotent. Finally, lift it to the fleet: surge before drain, deregister before terminate, and jitter synchronized closes — because local correctness does not compose into global correctness for free.
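The per-instance part of that ordering can be sketched in Node as below. This is a minimal illustration under stated assumptions, not a definitive implementation: runShutdown, installShutdown, setReady, propagationMs and guardMs are hypothetical names, the pool is assumed to be any object whose end() returns a promise, and the server is assumed to be a standard Node http.Server. Fleet-level concerns (surge, jitter) and job requeueing are outside the sketch.

```javascript
'use strict';

// Hypothetical helper: resolve after ms milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs [name, fn] steps strictly in order, bounded by one guardian timeout.
// Returns which steps finished and whether the run was 'clean' or hit 'timeout'.
async function runShutdown(steps, guardMs) {
  const completed = [];
  let timer;
  const guard = new Promise((resolve) => {
    timer = setTimeout(() => resolve('timeout'), guardMs);
  });
  const sequence = (async () => {
    for (const [name, fn] of steps) {
      await fn(); // the next step starts only after this one settles
      completed.push(name);
    }
    return 'clean';
  })();
  const outcome = await Promise.race([sequence, guard]);
  clearTimeout(timer); // do not keep the process alive after a clean run
  return { outcome, completed };
}

// Assumed wiring: setReady flips the readiness probe, server is an http.Server,
// pool.end() returns a promise. guardMs should sit below the grace period.
function installShutdown({ setReady, server, pool, propagationMs, guardMs }) {
  process.once('SIGTERM', async () => {
    let exitCode = 1;
    try {
      const { outcome } = await runShutdown([
        ['fail readiness', () => setReady(false)],
        ['wait out deregistration', () => sleep(propagationMs)],
        ['stop accepting and drain HTTP', () => new Promise((done) => server.close(done))],
        ['close datastore pool', () => pool.end()],
      ], guardMs);
      exitCode = outcome === 'clean' ? 0 : 1;
    } finally {
      // Exit ourselves before the platform's SIGKILL does it for us.
      process.exit(exitCode);
    }
  });
}
```

The steps are an ordered list rather than parallel calls so that the datastore cannot close while an in-flight request still needs it. The single guardian timer bounds the whole sequence, not each step, which keeps the total under the grace period.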