Graceful shutdown: code and config reading

Read real shutdown handlers, a Kubernetes pod spec, and a drain sequence, predict the failure, and pick the highest-leverage fix a senior engineer would make first.

Shutdown bugs hide in the handler code and the pod spec, not in the prose. Read each snippet, predict exactly where it drops a request, and choose the fix a senior engineer would reach for first.

Goal

Practise the loop you run in every shutdown review: trace the signal, the ordering, and the deadline through real code and config, then find the one change that stops the request loss.

Snippet 1 — the SIGTERM handler

const server = app.listen(8080);

process.on("SIGTERM", () => {
  server.close(() => {
    db.end();
    process.exit(0);
  });
});

Quiz

Under real traffic this handler intermittently hangs until SIGKILL even when no request is actively running. What is the defect and the fix?
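
One way to read the hang: server.close() stops accepting new connections, but on Node.js versions before 19 it does not close keep-alive sockets that are sitting idle, so its callback never fires while clients hold them open. The sketch below assumes Node.js 18.2 or newer, where server.closeIdleConnections() and server.closeAllConnections() exist. The helper name closeWithIdleDrain and the forceAfterMs deadline are hypothetical and not part of the lesson's code.

```javascript
// Hypothetical helper: close an http.Server without waiting forever on
// idle keep-alive sockets. Assumes Node.js >= 18.2.
function closeWithIdleDrain(server, forceAfterMs = 10000) {
  return new Promise((resolve) => {
    // Last resort: destroy whatever is still connected after the deadline.
    const force = setTimeout(() => server.closeAllConnections(), forceAfterMs);
    force.unref(); // the timer alone should not keep the process alive

    // Stop accepting new connections; the callback fires once every
    // existing connection has ended.
    server.close(() => {
      clearTimeout(force);
      resolve();
    });

    // Idle keep-alive sockets will not end on their own, so close them now.
    // Sockets with a request in flight are left to finish.
    server.closeIdleConnections();
  });
}
```

In Snippet 1 this would stand in for the bare server.close(...) call, with db.end() and process.exit(0) running only after the returned promise resolves.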

Snippet 2 — the drain order

process.on("SIGTERM", async () => {
  setReadiness(false);              // fail readiness probe
  await db.end();                   // close database pool
  await redis.quit();               // close cache
  await closeHttpServer(server);    // drain + close HTTP last
  process.exit(0);
});

Quiz

This handler fails readiness correctly, but in-flight requests still error with pool-closed and cache-closed failures during the drain. What is wrong?
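
A possible reordering, sketched under assumptions: requests that are still in flight depend on the database pool and the cache, so the HTTP server has to finish draining before either is closed. The function below takes the same collaborators that Snippet 2 uses (setReadiness, closeHttpServer, server, redis, db) as parameters; the wrapper name shutdownInOrder is hypothetical.

```javascript
// Hypothetical wrapper: tear down in reverse dependency order.
// The HTTP server depends on redis and db, so it drains first;
// the datastores close only after no request can still use them.
async function shutdownInOrder({ setReadiness, closeHttpServer, server, redis, db }) {
  setReadiness(false);            // stop receiving new traffic
  await closeHttpServer(server);  // let in-flight requests finish
  await redis.quit();             // cache is now unused
  await db.end();                 // database pool goes last
}
```

The SIGTERM handler would then await shutdownInOrder(...) and call process.exit(0) afterwards.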

Snippet 3 — the pod spec

spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }
    livenessProbe:
      httpGet: { path: /ready, port: 8080 }

Quiz

The preStop sleep and grace period look fine, but during shutdown pods are occasionally killed and restarted mid-drain. What in this spec causes it?
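
One way the spec could be adjusted, as a sketch: both probes hit /ready, so when the handler fails readiness to drain, the liveness probe fails with it and the kubelet restarts the container. Pointing liveness at its own endpoint separates the two signals. The /live path is an assumption; the lesson does not name a liveness endpoint, and the application would need to serve it with a success status for as long as the process is healthy, including during the drain.

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }   # fails on purpose during drain
    livenessProbe:
      httpGet: { path: /live, port: 8080 }    # assumed endpoint; unaffected by drain
```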

Snippet 4 — the guardian timeout and requeue

process.on("SIGTERM", async () => {
  worker.stopPulling();             // stop taking new jobs
  const t = setTimeout(() => {
    log.error("drain timed out, forcing exit");
    process.exit(1);
  }, 25_000);                       // guardian < grace period
  await worker.finishOrRequeue();   // ack done jobs, requeue the rest
  clearTimeout(t);
  process.exit(0);
});

Quiz

The guardian timeout and requeue logic are structurally correct. What is the one remaining precondition that makes worker.finishOrRequeue() safe, and what breaks without it?
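
The precondition the Recap names is an idempotent consumer: a requeued job may already have run partly or fully before the guardian fired, and it will be delivered again. A minimal sketch of one way to get there, assuming each job carries a unique id field and using an in-memory Set purely for illustration. The names makeIdempotentHandler, handle, and seen are hypothetical; a real worker would need a durable store shared across instances, written atomically with the job's side effect.

```javascript
// Hypothetical wrapper: make a job handler safe to run more than once
// for the same job id. The in-memory Set is for illustration only.
function makeIdempotentHandler(handle, seen = new Set()) {
  return async function run(job) {
    if (seen.has(job.id)) {
      return "skipped";           // redelivery of an already completed job
    }
    await handle(job);            // perform the side effect
    seen.add(job.id);             // record only after success, so a failed
                                  // attempt is retried on redelivery
    return "processed";
  };
}
```

Without a guard of this kind, a job that completed but was requeued before its ack would apply its side effect twice.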

Recap

Every shutdown bug can be read in the handler and the spec. server.close() alone can hang on idle keep-alive sockets (Node.js versions before 19 leave them open), so close them explicitly. Teardown must run in reverse dependency order, with datastores last, or in-flight queries hit a dead pool. Liveness and readiness must not share an endpoint, or failing readiness to drain also trips a restart. The guardian timeout buys a clean, self-chosen exit, while requeue stays correct only when the consumer is idempotent. Trace the signal, the order, and the deadline through the code, fix the one line that drops the request, and re-test under load to confirm.
