Crux Read real shutdown handlers, a Kubernetes pod spec, and a drain sequence, predict the failure, and pick the highest-leverage fix a senior engineer would make first.
Shutdown bugs hide in the handler code and the pod spec, not in the prose. Read each snippet, predict exactly where it drops a request, and choose the fix a senior engineer would reach for first.
Goal
Practise the loop you run in every shutdown review: trace the signal, the ordering, and the deadline through real code and config, then find the one change that stops the request loss.
Under real traffic this handler intermittently hangs until SIGKILL even when no request is actively running. What is the defect and the fix?
Heads-up Closing the pool after the HTTP server drains is the correct order, not a deadlock. The hang comes from idle keep-alive sockets that are never closed, which stops server.close() from firing its callback.
Heads-up process.exit only runs inside the close callback, after the server has drained — that ordering is fine. The problem is the callback never fires because idle keep-alive sockets stay open.
Heads-up SIGKILL can never be caught or handled — there is no SIGKILL handler. SIGTERM is correct; the bug is the keep-alive drain, not the signal choice.
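The fix named in the heads-ups above can be sketched as follows. This is a minimal sketch, not the lesson's own handler: closeHttpServer is the helper name Snippet 2 calls, but its body here is an assumption, and it relies on server.closeIdleConnections(), which http.Server provides from Node.js 18.2 onward.

```javascript
// Assumed body for a closeHttpServer(server) helper: resolves once the
// server has stopped listening and every connection has ended.
function closeHttpServer(server) {
  return new Promise((resolve, reject) => {
    // Stop accepting new connections. The callback fires only after all
    // existing connections have ended, in-flight requests included.
    server.close((err) => (err ? reject(err) : resolve()));
    // Drop keep-alive sockets that have no request in progress. Left open,
    // they keep the close callback from firing.
    server.closeIdleConnections();
  });
}

// Possible use inside a handler (server is whatever http.Server you run):
// process.on("SIGTERM", async () => {
//   await closeHttpServer(server);
//   process.exit(0);
// });
```

Connections that are still serving a request are left alone, so in-flight work finishes before the promise resolves. The guardian timeout shown in Snippet 4 remains the backstop if one of those requests never completes.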
Snippet 2 — the drain order
process.on("SIGTERM", async () => {
  setReadiness(false);            // fail readiness probe
  await db.end();                 // close database pool
  await redis.quit();             // close cache
  await closeHttpServer(server);  // drain + close HTTP last
  process.exit(0);
});
This handler fails readiness correctly, but in-flight requests still error with pool-closed and cache-closed failures during the drain. What is wrong?
Heads-up Failing readiness first is correct — it starts the deregistration clock. The bug is the datastores closing before the HTTP layer that still depends on them during the drain.
Heads-up Running them in parallel makes it worse — it guarantees the pool closes while requests are still draining. The fix is sequential reverse-dependency order, HTTP first.
Heads-up There is no fixed redis-then-db rule; the order is reverse-dependency. Both datastores must close after the HTTP server has drained the requests that use them.
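The reverse-dependency order described in the heads-ups above can be sketched like this. The dependency names mirror Snippet 2; passing them in as one object is an assumption made so the order can be exercised with stand-ins.

```javascript
// Sketch: tear down in reverse dependency order. The HTTP layer depends on
// the datastores, so it drains first and the datastores close last.
async function shutdown({ setReadiness, closeHttpServer, server, db, redis }) {
  setReadiness(false);            // fail readiness so no new traffic is routed here
  await closeHttpServer(server);  // draining requests can still reach db and redis
  await db.end();                 // safe now: nothing is left to issue queries
  await redis.quit();             // no fixed order between the two datastores
}

// Possible wiring (deps is a hypothetical object holding the five members):
// process.on("SIGTERM", () => shutdown(deps).then(() => process.exit(0)));
```

Each step is awaited in sequence on purpose: running the closes in parallel would let the pool shut while requests are still draining.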
The preStop sleep and grace period look fine, but during shutdown pods are occasionally killed and restarted mid-drain. What in this spec causes it?
Heads-up 5s of preStop fits comfortably inside a 30s budget. The restart comes from the shared probe endpoint making liveness fail, not from the grace period being too short.
Heads-up preStop runs before SIGTERM and the orchestrator blocks on it; SIGTERM is still delivered afterward. The defect is the shared liveness/readiness endpoint.
Heads-up Failing readiness during termination is exactly how you stop new traffic — it is required. The bug is liveness sharing that endpoint and tripping a restart.
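On the application side, separating the two probes might look like the sketch below. The paths /livez and /readyz, the state object, and the probeStatus helper are all hypothetical names, not taken from the spec in question. The pod spec would then point livenessProbe at one path and readinessProbe at the other.

```javascript
const http = require("node:http");

// Flag the SIGTERM handler would flip before draining (hypothetical name).
const state = { shuttingDown: false };

// Pure routing decision, so both probes can be checked without a live server.
function probeStatus(path, { shuttingDown }) {
  // Liveness: healthy for as long as the process can answer at all.
  if (path === "/livez") return 200;
  // Readiness: fails during the drain so new traffic stops arriving.
  if (path === "/readyz") return shuttingDown ? 503 : 200;
  return 404;
}

// Not listening yet; call probeServer.listen(port) where the app starts up.
const probeServer = http.createServer((req, res) => {
  res.statusCode = probeStatus(req.url, state);
  res.end();
});
```

With this split, setting state.shuttingDown to true fails only the readiness probe, so the pod is taken out of rotation without the liveness probe triggering a restart mid-drain.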
Snippet 4 — the guardian timeout and requeue
process.on("SIGTERM", async () => {
  worker.stopPulling();                    // stop taking new jobs
  const t = setTimeout(() => {
    log.error("drain timed out, forcing exit");
    process.exit(1);
  }, 25_000);                              // guardian < grace period
  await worker.finishOrRequeue();          // ack done jobs, requeue the rest
  clearTimeout(t);
  process.exit(0);
});
The guardian timeout and requeue logic are structurally correct. What is the one remaining precondition that makes worker.finishOrRequeue() safe, and what breaks without it?
Heads-up The guardian must be shorter than the grace period so that it force-exits on your own terms before SIGKILL. Making the two equal leaves no margin and risks a SIGKILL mid-flush. Either way, the guardian value does not address requeue safety.
Heads-up You must stop pulling new jobs first — otherwise new work keeps arriving on a dying worker and never finishes. The ordering here is correct; the missing piece is idempotency.
Heads-up A non-zero exit on a forced/timed-out drain is fine — it signals an abnormal shutdown for alerting. The real precondition is that requeued jobs are idempotent, regardless of exit code.
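One way to picture the idempotency precondition is a consumer that records which job ids it has already applied. This is a sketch under loud assumptions: handleOnce and the processed set are hypothetical, jobs are assumed to carry a stable id, and the in-memory Set stands in for a durable store that a real consumer would need in order to survive restarts.

```javascript
// In-memory stand-in for a durable record of processed job ids (assumption).
const processed = new Set();

// Hypothetical wrapper: applies the side effect at most once per job id, so a
// job requeued during shutdown can be delivered again without harm.
async function handleOnce(job, applyEffect) {
  if (processed.has(job.id)) return "skipped"; // duplicate delivery after a requeue
  await applyEffect(job);
  processed.add(job.id);                        // record only after the effect succeeded
  return "applied";
}
```

Without a guard of this kind, a job that was partly or fully applied before the requeue runs its side effect a second time on the next worker. In a real system the effect and the record would also need to be written atomically, since a crash between the two lines above would still allow a duplicate.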
Recap
Shutdown bugs are found by reading the handler and the spec. server.close() alone hangs on idle keep-alive sockets, so close them explicitly. Teardown must run in reverse dependency order with datastores last, or in-flight queries hit a dead pool. Liveness and readiness must not share an endpoint, or failing readiness to drain also trips a restart. The guardian timeout buys a clean, self-chosen exit, while requeue stays correct only when the consumer is idempotent. Trace the signal, the order, and the deadline through the code, fix the one line that drops the request, and re-test under load to confirm.