
Draining and shutdown order: reverse the dependency graph

Draining is more than server.close(): keep-alive sockets must be closed deliberately, and resources must close in reverse dependency order — HTTP server first, datastores last — or in-flight requests hit a closed pool. A guardian timeout force-exits if the drain hangs.


Routing has drained, SIGTERM has arrived, and your handler calls the one line every tutorial shows: server.close(). You expect the process to wind down and exit. Instead it hangs until the grace period runs out and gets SIGKILLed anyway — and on a different deploy, a handful of requests fail with database-connection-closed errors during the drain. Two separate truths are biting you. First, server.close() stops the server from accepting new connections but does not close the keep-alive connections already open and idle, so those sockets keep the server “busy” and the close callback never fires. Second, when you did manage to shut things down, you closed the database pool in the same breath as the HTTP server — but a request still in flight reached for the pool a moment later and found it gone. Draining cleanly is not one call; it is closing the right things in the right order, and bounding the whole thing so a single stuck request can’t hold the process hostage.

server.close() is necessary but not sufficient

server.close() stops the server from accepting new connections and fires its callback once all existing connections have ended. The trap is keep-alive. HTTP/1.1 keeps connections open by default (and HTTP/2 multiplexes over a long-lived one) so clients can reuse a socket for many requests. After a request finishes, its connection sits idle but open, waiting for the next request. Before Node 19, server.close() does not touch those idle keep-alive sockets (from its point of view they are still “active connections”), so the callback never fires and the process hangs until SIGKILL. Node 19 and later also close the connections that are idle at the moment close() is called, but the deliberate drain still matters whenever you run on, or must support, an older runtime.

The fix is to drain keep-alive deliberately:

Signal clients to stop reusing the socket. Send Connection: close on responses during shutdown, so each client finishes its current request and then reconnects elsewhere (to a healthy instance) instead of holding the socket open.
Force-close idle sockets. Track open connections and destroy the ones that are idle (not mid-request) immediately, while letting active ones finish. Community helper libraries wrap this bookkeeping; Node 18.2 and later expose server.closeIdleConnections() and server.closeAllConnections() for exactly this.

The goal of draining: every in-flight request gets to finish and respond, and every idle connection is closed promptly so the server can actually reach zero and exit.
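The two moves above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes Node 18.2 or later (for server.closeIdleConnections()), and the names shuttingDown and drainHttp are hypothetical. The periodic sweep is an extra precaution for sockets that go idle after the drain has started.

```javascript
const http = require('node:http');

// Hypothetical flag, flipped when the drain starts.
let shuttingDown = false;

const server = http.createServer((req, res) => {
  // During shutdown, tell the client not to reuse this socket.
  if (shuttingDown) res.setHeader('Connection', 'close');
  res.end('ok');
});

// Resolves once the listener is closed and every connection has ended.
function drainHttp(srv, sweepMs = 100) {
  shuttingDown = true;
  return new Promise((resolve, reject) => {
    // Sockets that become idle after the first sweep are caught by later ones.
    const sweep = setInterval(() => srv.closeIdleConnections(), sweepMs);
    srv.close((err) => {
      clearInterval(sweep);
      if (err) reject(err);
      else resolve();
    });
    // Destroy sockets that are idle right now; active requests keep running.
    srv.closeIdleConnections();
  });
}
```

A SIGTERM handler would call drainHttp(server) first and only move on to workers and datastores once the returned promise resolves.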

Close resources in reverse dependency order

Once the HTTP layer has drained, you tear down the rest — and the order is not arbitrary. The rule is reverse dependency order: shut down in the opposite order you started up, so nothing is pulled out from under something that still needs it. During startup you open the database, then the cache, then start accepting HTTP. During shutdown you reverse it:

1 Stop the HTTP server and drain in-flight requests: nothing new can arrive, existing requests finish.
2 Stop background workers and flush queues: let in-progress jobs complete or checkpoint.
3 Close datastores last: the cache (Redis), then the database pool, the reverse of the order they were opened in.

Together these three steps mean that every layer closes against a quiet layer beneath it: the dependency graph runs backwards, exactly mirroring startup.

The reason is concrete: an in-flight request, mid-drain, may still issue a query. If you close the database pool before that request finishes, the query hits a closed pool and the request fails: you have turned a graceful drain into the very request loss you were preventing. The datastores are the deepest dependency, so they close last, after everything that might use them is done. Closing a pool is itself a drain. In node-postgres, for example, pool.end() closes idle clients and waits for checked-out clients to be released before disconnecting them; it enforces no timeout of its own, so a query that never returns has to be bounded by the guardian timeout.
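One way to make the order hard to get wrong is to register a closer for each resource as it starts and run the list backwards at shutdown. The sketch below assumes that approach; onShutdown and teardown are hypothetical names, and the closers are empty stand-ins for the real calls.

```javascript
// Hypothetical registry: each resource registers its closer when it starts.
const closers = [];

function onShutdown(name, close) {
  closers.push({ name, close });
}

// Run closers one at a time, last registered first.
// Returns the names in the order they were closed.
async function teardown(log = console.log) {
  const order = [];
  for (const { name, close } of [...closers].reverse()) {
    log(`closing ${name}`);
    try {
      await close();
    } catch (err) {
      // Keep going: one failed closer should not strand the layers beneath it.
      log(`error closing ${name}: ${err.message}`);
    }
    order.push(name);
  }
  return order;
}

// Startup order, deepest dependency first (stand-in closers for illustration).
onShutdown('database', async () => { /* e.g. await pool.end() */ });
onShutdown('redis', async () => { /* disconnect the cache client */ });
onShutdown('workers', async () => { /* stop consumers, flush queues */ });
onShutdown('http', async () => { /* drain the HTTP server */ });
```

Because each closer is awaited before the next one starts, the database closer cannot run until the HTTP and worker layers above it have reported that they are done.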

The guardian timeout

A drain can hang. A request might be stuck on a slow downstream, a keep-alive client might never send Connection: close, a worker might be wedged. If you simply await the drain forever, you blow past the grace period and get SIGKILLed mid-cleanup — losing exactly the state you were trying to flush. So every production shutdown wraps the drain in a guardian timeout (also called a force-shutdown or shutdown-watchdog): start a timer for less than the grace period, and if the drain has not completed when it fires, log loudly and force-exit (process.exit(1)) on your own terms. A self-inflicted, logged exit a second before SIGKILL is strictly better than the kernel pulling the plug, because you control the exit code and can flush logs and metrics first.

Order and the guardian timeout are orthogonal: the timeout only turns a hang into a clean self-exit — it cannot save requests a mis-ordered teardown already severed. You need both.
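A guardian timeout can be sketched as a race between the drain and a timer. The helper names withGuardian and shutdown are hypothetical, and the exit function is injectable only so the sketch can be exercised without killing the process; the timeout value would be chosen below whatever grace period your orchestrator grants.

```javascript
// Runs drain() but stops waiting after timeoutMs.
// Resolves to 'drained' or 'timeout'; rejects if drain() itself rejects.
function withGuardian(drain, timeoutMs) {
  let timer;
  const guard = new Promise((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  return Promise.race([drain().then(() => 'drained'), guard])
    .finally(() => clearTimeout(timer));
}

// Hypothetical shutdown entry point: exit 0 on a clean drain, 1 otherwise.
async function shutdown(drain, timeoutMs, exit = process.exit) {
  const outcome = await withGuardian(drain, timeoutMs).catch((err) => {
    console.error('shutdown: drain failed', err);
    return 'failed';
  });
  if (outcome !== 'drained') {
    console.error(`shutdown: ${outcome}, forcing exit`);
  }
  exit(outcome === 'drained' ? 0 : 1);
}
```

Wired up, this might look like process.once('SIGTERM', () => shutdown(drainEverything, 25000)) for a 30-second grace period, where drainEverything is whatever function performs the ordered teardown.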

Why this works

Why does the closing order matter so much when the grace period is only thirty seconds anyway — won’t everything be gone shortly regardless? Because order and deadline solve different problems, and getting the deadline right does nothing for ordering. The deadline bounds how long you wait; the order determines whether the work that does finish, finishes correctly. Picture a request that takes 200ms and is 100ms in when shutdown begins — well within any grace period. If you close the database pool concurrently with the HTTP server, that request’s next query, issued at 150ms, finds a dead pool and fails, even though there were 29-plus seconds of grace period left unused. The request did not run out of time; it ran out of dependency, because you removed a resource it still needed while it was legitimately running. Reverse-dependency teardown encodes a simple invariant: a resource is only closed once everything that could use it has stopped, which means each layer closes against a quiet layer beneath it. This is the exact mirror of startup, where you must open the database before the HTTP server can serve a request that needs it — shutdown just runs the dependency graph backwards. The guardian timeout is the orthogonal guarantee: ordering ensures correctness if the drain completes, and the timeout ensures the process still dies on its own terms if the drain hangs, so you never trade a stuck request for a SIGKILL that loses everything. You need both, because a correct order that never finishes is as fatal as a fast finish in the wrong order.
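The thought experiment above can be reproduced with a toy pool. Everything here is a stand-in for illustration (makePool is not a real driver, and the 50 ms delay plays the role of the request's remaining work), but it shows the same request failing or succeeding purely because of teardown order.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Stand-in for a database pool: queries fail once it has been closed.
function makePool() {
  let closed = false;
  return {
    async query() {
      if (closed) throw new Error('pool is closed');
      return 'row';
    },
    async end() {
      closed = true;
    },
  };
}

// A request that is mid-flight when shutdown begins and still has a query to run.
async function inFlightRequest(pool) {
  await sleep(50); // remaining work before the next query
  return pool.query();
}

// Pool closed in the same breath as the HTTP server: the request loses its dependency.
async function concurrentTeardown() {
  const pool = makePool();
  const request = inFlightRequest(pool);
  await pool.end();
  return request.then(() => 'ok', (err) => err.message);
}

// Pool closed only after the in-flight request has finished.
async function orderedTeardown() {
  const pool = makePool();
  const result = await inFlightRequest(pool).then(() => 'ok', (err) => err.message);
  await pool.end();
  return result;
}
```

In both functions the request has all the time it needs; only the ordered variant leaves the pool open long enough for the late query to run.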

Step	Action	Failure if skipped or misordered
Stop accepting	Close listener; refuse new connections	New work starts on a dying process
Drain keep-alive	`Connection: close`, close idle sockets	`server.close()` hangs until SIGKILL
Finish in-flight	Let active requests complete and respond	Severed requests, 502s
Stop workers	Drain queues, complete/checkpoint jobs	Half-done jobs lost
Close datastores	Redis, then database pool, last of all	In-flight query hits a closed pool
Guardian timeout	Force-exit before grace period ends	SIGKILL mid-cleanup, state lost

Quiz

A shutdown handler calls server.close() and then waits, but the process hangs until SIGKILL even though no requests are actively running. Why?

Quiz

Why must the database pool be closed after the HTTP server has drained, not at the same time?

Order the steps

Order a clean drain-and-teardown after SIGTERM (reverse dependency order):

1 Stop accepting new connections and send Connection: close on responses
2 Close idle keep-alive sockets and let in-flight requests finish
3 Stop background workers and flush their queues
4 Close datastores last: Redis, then the database pool

HTTP server: close listener + drain keep-alive

Background workers: drain queues, checkpoint jobs

Cache (Redis): flush and disconnect

Database pool: pool.end(), which closes idle clients and waits for active ones

Datastores are the deepest dependency and close last; closing them early severs in-flight queries mid-drain.

key takeaway

Draining is not a single call. server.close() stops accepting new connections but does not close idle keep-alive sockets (HTTP/1.1 keeps them open by default), so it counts them as active and its callback never fires, hanging the process until SIGKILL; you must drain keep-alive deliberately by sending Connection: close on shutdown responses and force-closing idle sockets (server.closeIdleConnections/closeAllConnections) while letting active requests finish. Then tear down in reverse dependency order, the opposite of startup: stop the HTTP server and finish in-flight requests, stop background workers and flush queues, and close datastores last, Redis first and then the database pool (in node-postgres, pool.end() closes idle clients and waits for checked-out ones to be released). The reason is concrete: an in-flight request mid-drain may still query the database, so closing the pool early turns a graceful drain into request loss; datastores are the deepest dependency and close last, after everything that uses them has stopped. Finally, wrap the whole drain in a guardian timeout set below the grace period: if the drain hangs, log and force-exit on your own terms rather than letting SIGKILL pull the plug mid-cleanup. Order ensures correctness; the timeout ensures the process still dies cleanly if the drain stalls. You need both.

Recall before you leave

01
Why is server.close() not enough, and how do you drain keep-alive connections?
02
What is reverse dependency order and why is a guardian timeout needed too?

Recap

The handler has the door shut and routing drained; now it must drain connections and tear down resources without re-introducing loss. server.close() only stops new connections: it ignores idle keep-alive sockets, which HTTP/1.1 holds open by default, so it counts them as active and hangs until SIGKILL; drain them by sending Connection: close and force-closing idle sockets while in-flight requests finish. Then tear down in reverse dependency order, the mirror of startup: HTTP server and in-flight requests first, then workers and queues, then datastores last, because an in-flight request mid-drain may still query the database and a pool closed early turns a clean drain back into request loss. Wrap the whole thing in a guardian timeout below the grace period so a single stuck request cannot hold the process hostage to SIGKILL: a logged, self-chosen exit beats the kernel pulling the plug. Order buys correctness, the timeout buys a clean death under stall, and you need both. The mechanics are now complete for requests that fit the window, but some work does not fit: long requests and background jobs that cannot finish in time. The next lesson, the unit’s first senior beat, asks what to do with work the deadline will cut off. Now when a shutdown misbehaves, you know where to look first: if the process hangs until SIGKILL, suspect idle keep-alive sockets that server.close() left open; if requests fail during the drain, suspect a datastore closed out of order underneath a still-draining request.


Connected lessons

builds on: The deregistration race: stop routing before you stop accepting (middle)

unlocks and deepens into: In-flight work: long requests, background jobs, and the deadline (senior)
