Backend Architecture

Draining and shutdown order: reverse the dependency graph

Crux: Draining is more than server.close(). Keep-alive sockets must be closed deliberately, and resources must close in reverse dependency order (HTTP server first, datastores last), or in-flight requests hit a closed pool. A guardian timeout force-exits if the drain hangs.

Routing has drained, SIGTERM has arrived, and your handler calls the one line every tutorial shows: server.close(). You expect the process to wind down and exit. Instead it hangs until the grace period runs out and gets SIGKILLed anyway — and on a different deploy, a handful of requests fail with database-connection-closed errors during the drain. Two separate truths are biting you. First, server.close() stops the server from accepting new connections but does not close the keep-alive connections already open and idle, so those sockets keep the server “busy” and the close callback never fires. Second, when you did manage to shut things down, you closed the database pool in the same breath as the HTTP server — but a request still in flight reached for the pool a moment later and found it gone. Draining cleanly is not one call; it is closing the right things in the right order, and bounding the whole thing so a single stuck request can’t hold the process hostage.

server.close() is necessary but not sufficient

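server.close() does two things: it stops the server from accepting new connections, and it fires its callback once all existing connections have ended. The trap is keep-alive. HTTP/1.1 keeps connections open by default (and HTTP/2 multiplexes over a long-lived one) so clients can reuse a socket for many requests. After a request finishes, its connection sits idle but open, waiting for the next request. On Node versions before 19, server.close() does not touch those idle keep-alive sockets: they still count as open connections, so the callback does not fire until the client closes them or the server's keep-alive timeout expires. With a long keep-alive timeout, or clients that keep reusing the socket, that means hanging until SIGKILL. Since Node 19, server.close() also closes the connections that are idle at the moment it is called, but code that may run on older runtimes cannot rely on that.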

The fix is to drain keep-alive deliberately:

  • Signal clients to stop reusing the socket. Send Connection: close on responses during shutdown, so each client finishes its current request and then reconnects elsewhere (to a healthy instance) instead of holding the socket open.
  • Force-close idle sockets. Track open connections and destroy the ones that are idle (not mid-request) immediately, while letting active ones finish. Node's ecosystem wraps this in helpers, and Node 18.2 and later expose server.closeIdleConnections() and server.closeAllConnections() for exactly this.

The goal of draining: every in-flight request gets to finish and respond, and every idle connection is closed promptly so the server can actually reach zero and exit.
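
The two keep-alive steps above can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: the `shuttingDown` flag, the `drainHttp` helper, and the trivial handler are hypothetical names introduced for illustration, and the feature check assumes the code may run on a Node version older than 18.2, where `closeIdleConnections()` does not exist.

```javascript
const http = require("node:http");

// Flipped to true when shutdown begins; read by the request handler.
let shuttingDown = false;

// Hypothetical application handler. The only shutdown-specific line is the
// Connection header, which asks the client not to reuse this socket.
const server = http.createServer((req, res) => {
  if (shuttingDown) res.setHeader("Connection", "close");
  res.end("ok\n");
});

// Resolves once the listener is closed and every connection has ended.
function drainHttp(srv) {
  shuttingDown = true;
  return new Promise((resolve, reject) => {
    // Stop accepting; the callback fires when all connections are gone.
    srv.close((err) => (err ? reject(err) : resolve()));
    // Node 18.2+: destroy sockets that are not in the middle of a request.
    if (typeof srv.closeIdleConnections === "function") {
      srv.closeIdleConnections();
    }
  });
}
```

On Node 19 and later the explicit `closeIdleConnections()` call repeats what `server.close()` already does, which is harmless. Active requests are left alone in both cases, so they still get to finish and respond.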

Close resources in reverse dependency order

Once the HTTP layer has drained, you tear down the rest — and the order is not arbitrary. The rule is reverse dependency order: shut down in the opposite order you started up, so nothing is pulled out from under something that still needs it. During startup you open the database, then the cache, then start accepting HTTP. During shutdown you reverse it:

  1. Stop the HTTP server / drain in-flight requests — nothing new can arrive, existing requests finish.
  2. Stop background workers and flush queues — let in-progress jobs complete or checkpoint.
  3. Close datastores last — database pool, then Redis, then any other downstream.

The reason is concrete: an in-flight request, mid-drain, may still issue a query. If you closed the database pool before that request finished, the query hits a closed pool and the request fails, and you have turned a graceful drain into the very request loss you were preventing. The datastores are the deepest dependency, so they close last, after everything that might use them is done. Closing a pool is itself a drain. With node-postgres, for example, pool.end() disconnects idle clients and waits for checked-out clients to be released before it resolves; it does not cancel a query that is still running, so a stuck query stalls the close until the guardian timeout steps in.
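
One way to keep the order honest is to record resources as they start and close them by walking that record backwards. The sketch below assumes a hypothetical `Teardown` registry; the empty close functions are placeholders for whatever your pool, cache client, workers, and HTTP drain actually expose. Because the registry reverses startup order mechanically, the cache closes before the database here. The two are independent in this sketch, so their relative order does not matter.

```javascript
// Hypothetical registry: resources are registered in startup order and
// closed in the opposite order. A failing step is recorded, not rethrown,
// so the remaining resources still get closed.
class Teardown {
  constructor() {
    this.steps = [];
  }

  register(name, close) {
    this.steps.push({ name, close });
  }

  async run() {
    const closed = [];
    const errors = [];
    // Copy before reversing so the registration order is left intact.
    for (const { name, close } of [...this.steps].reverse()) {
      try {
        await close(); // each layer finishes before the one beneath it closes
        closed.push(name);
      } catch (err) {
        errors.push({ name, err });
      }
    }
    return { closed, errors };
  }
}

// Startup order: database, cache, workers, HTTP server.
const teardown = new Teardown();
teardown.register("database", async () => {}); // placeholder for the pool close
teardown.register("cache", async () => {});    // placeholder for the cache close
teardown.register("workers", async () => {});  // placeholder for stopping workers
teardown.register("http", async () => {});     // placeholder for the HTTP drain
```

Calling `teardown.run()` closes the HTTP layer first and the database last, and each step is awaited, so a datastore is never closed while a layer above it is still draining.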

The guardian timeout

A drain can hang. A request might be stuck on a slow downstream, a keep-alive client might never send Connection: close, a worker might be wedged. If you simply await the drain forever, you blow past the grace period and get SIGKILLed mid-cleanup — losing exactly the state you were trying to flush. So every production shutdown wraps the drain in a guardian timeout (also called a force-shutdown or shutdown-watchdog): start a timer for less than the grace period, and if the drain has not completed when it fires, log loudly and force-exit (process.exit(1)) on your own terms. A self-inflicted, logged exit a second before SIGKILL is strictly better than the kernel pulling the plug, because you control the exit code and can flush logs and metrics first.
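
A guardian timeout can be as small as a timer raced against the drain. In this sketch `withGuardian` is a hypothetical helper, and the `onTimeout` callback is injected so the behavior can be exercised without killing the process; in production it would log and call process.exit(1).

```javascript
// Runs `drain` and arms a timer that fires `onTimeout` if the drain has not
// settled within `timeoutMs`. The timer is cleared as soon as the drain
// settles, whether it succeeded or failed.
function withGuardian(drain, timeoutMs, onTimeout) {
  const timer = setTimeout(onTimeout, timeoutMs);
  // Do not let the guardian itself keep an otherwise finished process alive.
  timer.unref?.();
  return drain().finally(() => clearTimeout(timer));
}

// Example wiring (assumption: a 30 s grace period, so the guardian fires
// at 25 s and leaves room to flush logs before SIGKILL):
//
// process.once("SIGTERM", () => {
//   withGuardian(() => teardown.run(), 25_000, () => {
//     console.error("drain timed out, forcing exit");
//     process.exit(1);
//   }).then(() => process.exit(0));
// });
```

The timeout value is the one number to tune: it has to sit below the orchestrator's grace period by enough margin for the final log flush.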

Why this works

Why does the closing order matter so much when the grace period is only thirty seconds anyway — won’t everything be gone shortly regardless? Because order and deadline solve different problems, and getting the deadline right does nothing for ordering. The deadline bounds how long you wait; the order determines whether the work that does finish, finishes correctly. Picture a request that takes 200ms and is 100ms in when shutdown begins — well within any grace period. If you close the database pool concurrently with the HTTP server, that request’s next query, issued at 150ms, finds a dead pool and fails, even though there were 29-plus seconds of grace period left unused. The request did not run out of time; it ran out of dependency, because you removed a resource it still needed while it was legitimately running. Reverse-dependency teardown encodes a simple invariant: a resource is only closed once everything that could use it has stopped, which means each layer closes against a quiet layer beneath it. This is the exact mirror of startup, where you must open the database before the HTTP server can serve a request that needs it — shutdown just runs the dependency graph backwards. The guardian timeout is the orthogonal guarantee: ordering ensures correctness if the drain completes, and the timeout ensures the process still dies on its own terms if the drain hangs, so you never trade a stuck request for a SIGKILL that loses everything. You need both, because a correct order that never finishes is as fatal as a fast finish in the wrong order.
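
The 200ms request from the paragraph above can be reproduced with a toy simulation. Everything here is hypothetical: `makePool` is a fake pool standing in for a real database client, and the delay passed to `inFlightRequest` stands for the gap between the start of shutdown and the request's next query.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fake pool: queries succeed while it is open and throw once it is ended.
function makePool() {
  let open = true;
  return {
    query: async () => {
      if (!open) throw new Error("pool closed");
      return "row";
    },
    end: async () => {
      open = false;
    },
  };
}

// A request already running when shutdown begins; it issues one more query
// `queryAtMs` milliseconds later.
async function inFlightRequest(pool, queryAtMs) {
  await sleep(queryAtMs);
  return pool.query();
}

// ordered = true: wait for the request, then close the pool.
// ordered = false: close the pool at once, as if alongside the HTTP server.
async function shutdown(pool, request, ordered) {
  const outcome = () => request.then(() => "ok", () => "failed");
  if (ordered) {
    const result = await outcome();
    await pool.end();
    return result;
  }
  await pool.end();
  return outcome();
}
```

With the same request and the same timing, `shutdown(pool, request, false)` reports a failed request and `shutdown(pool, request, true)` reports a successful one. No deadline is involved in either run; only the order differs.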

| Step | Action | Failure if skipped or misordered |
|---|---|---|
| Stop accepting | Close listener; refuse new connections | New work starts on a dying process |
| Drain keep-alive | Connection: close, close idle sockets | server.close() hangs until SIGKILL |
| Finish in-flight | Let active requests complete and respond | Severed requests, 502s |
| Stop workers | Drain queues, complete/checkpoint jobs | Half-done jobs lost |
| Close datastores | Pool end, then Redis, last of all | In-flight query hits a closed pool |
| Guardian timeout | Force-exit before grace period ends | SIGKILL mid-cleanup, state lost |

Quiz

A shutdown handler calls server.close() and then waits, but the process hangs until SIGKILL even though no requests are actively running. Why?

Quiz

Why must the database pool be closed after the HTTP server has drained, not at the same time?

Order the steps

Order a clean drain-and-teardown after SIGTERM (reverse dependency order):

  1. Stop accepting new connections and send Connection: close on responses
  2. Close idle keep-alive sockets and let in-flight requests finish
  3. Stop background workers and flush their queues
  4. Close datastores last: database pool, then Redis

Recall before you leave

  1. Why is server.close() not enough, and how do you drain keep-alive connections?
  2. What is reverse dependency order and why is a guardian timeout needed too?

Recap

The handler has the door shut and routing drained; now it must drain connections and tear down resources without re-introducing loss. server.close() only stops new connections — it ignores idle keep-alive sockets, which HTTP/1.1 holds open by default, so it counts them as active and hangs until SIGKILL; drain them by sending Connection: close and force-closing idle sockets while in-flight requests finish. Then tear down in reverse dependency order, the mirror of startup: HTTP server and in-flight requests first, then workers and queues, then datastores last, because an in-flight request mid-drain may still query the database and a pool closed early turns a clean drain back into request loss. Wrap the whole thing in a guardian timeout below the grace period so a single stuck request cannot hold the process hostage to SIGKILL — a logged, self-chosen exit beats the kernel pulling the plug. Order buys correctness, the timeout buys a clean death under stall, and you need both. The mechanics are now complete for requests that fit the window — but some work does not fit: long requests and background jobs that cannot finish in time. The next lesson, the unit’s first senior beat, asks what to do with work the deadline will cut off.

