Backend Architecture BE · 07 · 02

Signals and the grace period: SIGTERM, SIGKILL, and PID 1

Shutdown is a contract with the orchestrator: a preStop hook, then SIGTERM to PID 1, then a grace period (Kubernetes default 30s), then an unstoppable SIGKILL. The grace period is a hard deadline, and the classic bug is a process that never receives the signal at all.

BE Middle ◷ 15 min


“Let the process finish first” is a nice idea, but who tells the process to start finishing, and how long does it have? The answer is a precise contract between the orchestrator and your container, and getting it wrong is one of the most common production surprises. A team adds a clean SIGTERM handler that drains for up to 45 seconds — and discovers their requests are still being cut off mid-flight on every deploy. The handler was correct; the problem was that the signal never reached it. Their container started the app through a shell (sh -c "node server.js"), so PID 1 was the shell, the shell did not forward SIGTERM, and after the grace period the orchestrator sent the one signal nobody can catch — SIGKILL — straight through the polite handler that was waiting for a SIGTERM it would never see. Shutdown begins with understanding exactly which signals fire, in what order, to which process, and against what clock.

The termination sequence

When you write a SIGTERM handler, you are entering a precise protocol. Knowing the exact sequence, who sends what, and what clock you are racing against is the difference between a handler that works and one that silently does nothing. When Kubernetes decides to stop a pod, it runs a fixed sequence, and the same shape holds for most schedulers:

1. The pod is marked Terminating and its removal from the Service’s endpoints begins. That removal propagates asynchronously (more on that race in the next lesson).
2. The preStop hook runs, if you defined one. This is a command or HTTP call that executes before the signal is sent, and the orchestrator blocks on it. It is commonly used to sleep briefly so routing can drain, or to flip a readiness flag.
3. SIGTERM is sent to PID 1 of the container. This is the polite “please stop” signal, and the moment your shutdown handler should fire.
4. The orchestrator waits out the grace period, terminationGracePeriodSeconds (default 30 seconds in Kubernetes). The clock starts when termination begins, so it covers the preStop hook and the post-SIGTERM shutdown combined.
5. SIGKILL is sent if the process is still alive when the clock runs out. SIGKILL cannot be caught, blocked, or handled: the kernel destroys the process immediately.

Together, these five steps define a single hard contract: the platform gives you exactly one catchable signal (step 3) and one clock (step 4), and your entire shutdown must live inside them — if you miss the signal or blow the clock, the kill in step 5 leaves you in the same state as an abrupt exit.

So your entire graceful shutdown — drain, close, exit — must fit inside that grace period, minus whatever the preStop hook already consumed. The grace period is not a suggestion; it is a hard deadline enforced by an uncatchable signal.
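The budget arithmetic can be made explicit. The following is a minimal sketch, not a prescribed formula: the helper name `drainBudgetMs`, its parameter names, and the 2-second safety margin are assumptions chosen for illustration. It subtracts the time the preStop hook consumes and a small margin from the grace period, and returns what is left for draining after SIGTERM.

```javascript
// Hypothetical helper: milliseconds left for draining once SIGTERM arrives.
// All names and the default safety margin are illustrative assumptions.
function drainBudgetMs({
  gracePeriodSeconds = 30, // Kubernetes default for terminationGracePeriodSeconds
  preStopSeconds = 0,      // time the preStop hook is expected to consume
  safetyMarginSeconds = 2, // assumed headroom so we exit before SIGKILL
} = {}) {
  const remaining = gracePeriodSeconds - preStopSeconds - safetyMarginSeconds;
  if (remaining <= 0) {
    // The preStop hook plus margin already uses up the whole grace period.
    throw new RangeError('no time left to drain inside the grace period');
  }
  return remaining * 1000;
}

// Example: a 5-second preStop sleep under the default grace period.
console.log(drainBudgetMs({ preStopSeconds: 5 })); // 23000
```

A handler that assumes the full 30 seconds while a preStop hook sleeps for part of them will overrun the deadline, which is why the subtraction is worth doing once, up front.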

SIGTERM versus SIGKILL

The two signals are not two flavors of the same thing; they are categorically different. SIGTERM (signal 15) is a request: it interrupts the process and runs whatever handler you registered, giving you the chance to drain and clean up. SIGKILL (signal 9) is a command to the kernel, not to your process: it is never delivered to your code, so there is no handler, no cleanup, no final flush. The whole game is to do your cleanup during SIGTERM so that SIGKILL never has to fire — if SIGKILL fires, you have already lost the in-flight work, exactly as in the abrupt-exit case from lesson one.

SIGTERM is your one catchable chance to drain; SIGKILL never reaches your code — so finish all cleanup during SIGTERM and SIGKILL never has to fire.
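As a rough illustration of "do the cleanup during SIGTERM", here is a minimal Node.js sketch. The helper name `installShutdown`, the 25-second default deadline, and the injectable `exit` function are assumptions made for this example, not part of any library. It registers a SIGTERM handler that stops accepting new connections, waits for the server to finish closing, and exits on its own before the orchestrator would send SIGKILL.

```javascript
// Hypothetical helper: on SIGTERM, stop accepting work, wait for in-flight
// requests, then exit before the orchestrator's SIGKILL deadline.
// `server` is anything with a close(callback) method, e.g. an http.Server.
function installShutdown(server, { deadlineMs = 25000, exit = process.exit } = {}) {
  let shuttingDown = false;

  const onSigterm = () => {
    if (shuttingDown) return; // a repeated SIGTERM must not restart the drain
    shuttingDown = true;

    // Fallback: exit with a failure code if draining overruns our own deadline.
    const timer = setTimeout(() => exit(1), deadlineMs);
    timer.unref(); // the timer alone should not keep the process alive

    // close() stops accepting new connections; the callback runs once the
    // server has finished closing.
    server.close(() => {
      clearTimeout(timer);
      exit(0);
    });
  };

  process.on('SIGTERM', onSigterm);
  return onSigterm; // returned so callers can invoke or remove the handler
}
```

With a real server this might be wired as `installShutdown(http.createServer(handler).listen(8080))`. Long-lived connections can delay the close callback, which is what the fallback timer is for: exiting late with a non-zero code is still better than being killed mid-write.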

The PID 1 trap

Inside a container, the first process started becomes PID 1, and PID 1 is special: it is the process the orchestrator signals, and the kernel gives it unusual signal semantics. Two traps follow:

Signals go only to PID 1. If you launch your app as a child of a shell — sh -c "node server.js" or an unsuspecting entrypoint script — then the shell is PID 1, SIGTERM is delivered to the shell, and many shells do not forward it to their children. Your app never sees the signal, drains nothing, and is SIGKILLed at the deadline. The fix is to make your app PID 1 (exec form: CMD ["node", "server.js"], or exec node server.js at the end of a script) or to run a tiny init like tini that forwards signals and reaps zombies.
PID 1 has no default signal handlers. For an ordinary process, a signal without a registered handler triggers its default action (for SIGTERM, terminating the process). The kernel does not apply those default actions to PID 1 of a container. If your app is PID 1 and you forget to register a handler, SIGTERM is simply ignored, so the process keeps running and, again, eats the SIGKILL.
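To make "forwards signals" concrete, here is a small Node.js sketch of the relay that a shell at PID 1 typically does not perform. The helper name `forwardSignals` is hypothetical, and this is an illustration only: it does not reap zombies, so it is not a substitute for the exec form or for a real init such as tini.

```javascript
// Hypothetical helper: relay termination signals from this process to a child.
// `child` is anything with a kill(signal) method, e.g. a ChildProcess.
function forwardSignals(child, signals = ['SIGTERM', 'SIGINT']) {
  const registered = signals.map((signal) => {
    const handler = () => child.kill(signal); // pass the same signal on
    process.on(signal, handler);
    return [signal, handler];
  });

  // Returns a cleanup function that unregisters every relay.
  return () => {
    for (const [signal, handler] of registered) {
      process.removeListener(signal, handler);
    }
  };
}

// Usage sketch (not executed here):
//   const { spawn } = require('node:child_process');
//   const child = spawn('node', ['server.js'], { stdio: 'inherit' });
//   const stop = forwardSignals(child);
//   child.on('exit', () => stop());
```

The point of the sketch is the missing link in the shell case: unless whatever sits at PID 1 does something like this, the SIGTERM stops at PID 1 and the app's handler never runs.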

Why this works

Why does the kernel treat PID 1 so differently that a missing handler means the signal is ignored rather than terminating the process? PID 1 is descended from the role of init on a normal Linux system: the first userspace process, the ancestor of everything, and the one responsible for reaping orphaned children and keeping the system alive. The kernel deliberately protects it: if PID 1 could be killed by a stray default signal action, the whole system (or, in a container, the whole container) would die by accident, so the kernel does not apply the usual default dispositions to PID 1. A signal with no explicitly registered handler is simply discarded.

This is a sensible safety rule for a real init system, but it becomes a footgun in containers, because your application, written assuming it is an ordinary process where SIGTERM defaults to “terminate”, is suddenly wearing the init crown without knowing it. The same code that would shut down fine when launched normally now ignores SIGTERM entirely as PID 1.

That is why the two canonical fixes exist: either explicitly register a SIGTERM handler so PID 1 has something to run, or insert a real init process (tini, dumb-init, or the platform’s --init flag) as PID 1 whose entire job is to forward signals to your app and reap zombies. The deeper point is that “send SIGTERM and wait” only works if SIGTERM actually arrives at code that listens for it, and containerization silently changes both whether it arrives and what happens when it does.
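One way to catch both traps early is a startup self-check. The sketch below is an assumption-laden illustration: the function name `pid1Warning` and its messages are invented for this example. It inspects the process ID and the number of registered SIGTERM listeners and returns a warning string, or `null` when the process is PID 1 with a handler in place.

```javascript
// Hypothetical startup check for the two PID 1 traps described above.
// Defaults read the live process; explicit arguments make it easy to test.
function pid1Warning(
  pid = process.pid,
  sigtermListeners = process.listenerCount('SIGTERM'),
) {
  if (pid !== 1) {
    // Something else is PID 1 (a shell, or an init such as tini).
    return `PID ${pid}: not PID 1; whatever is PID 1 must forward SIGTERM here`;
  }
  if (sigtermListeners === 0) {
    // PID 1 without a handler: the kernel discards SIGTERM.
    return 'PID 1 with no SIGTERM handler: SIGTERM will be ignored until SIGKILL';
  }
  return null; // PID 1 and a handler is registered
}

// Call after registering handlers, for example:
//   const warning = pid1Warning();
//   if (warning) console.warn(warning);
```

The non-PID-1 message is informational rather than an error, since running under a forwarding init is one of the two recommended setups.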

| Step | What fires | Catchable? | Your opportunity |
| --- | --- | --- | --- |
| Terminating | Pod removed from endpoints | n/a | Routing begins to drain |
| preStop hook | Command / HTTP, blocks | n/a | Sleep for propagation, flip readiness |
| SIGTERM | Signal 15 to PID 1 | Yes | Run the shutdown handler: drain + close |
| Grace period | `terminationGracePeriodSeconds` (30s) | n/a | The hard deadline for all of the above |
| SIGKILL | Signal 9 via kernel | No | None: process destroyed, work lost |

Quiz

A team adds a correct 45-second SIGTERM drain handler, but requests are still cut off on every deploy. The container runs the app via sh -c 'node server.js'. What is wrong?

Quiz

Why must all of your drain-and-cleanup work complete before the grace period expires?

Order the steps

Order the Kubernetes pod termination sequence:

1 Pod is marked Terminating and removed from the Service endpoints
2 preStop hook runs to completion (if defined), blocking the next step
3 SIGTERM is delivered to PID 1; the shutdown handler should drain and clean up
4 After terminationGracePeriodSeconds, SIGKILL destroys the process if it is still alive

The entire drain must finish before terminationGracePeriodSeconds expires; SIGKILL cannot be caught or handled.

key takeaway

Shutdown is a precise contract with the orchestrator. The sequence: the pod is marked Terminating and pulled from endpoints, the preStop hook runs and blocks, SIGTERM (signal 15) is delivered to PID 1 as the polite “please stop,” the orchestrator waits terminationGracePeriodSeconds (Kubernetes default 30s, covering preStop plus post-SIGTERM combined), and then SIGKILL (signal 9) fires if the process is still alive. SIGTERM and SIGKILL are categorically different: SIGTERM runs your handler so you can drain and clean up, while SIGKILL is a kernel command never delivered to your code (no handler, no flush), so the entire game is to finish during SIGTERM so SIGKILL never fires. The grace period is therefore a hard deadline. The classic bug is PID 1: signals go only to PID 1, so launching the app under a shell (sh -c "…") makes the shell PID 1 and it swallows SIGTERM; and PID 1 has no default signal handlers, so even as PID 1 a missing handler means SIGTERM is ignored. Fix by making the app PID 1 (exec form) or running a tiny init like tini, and always register the handler.

Recall before you leave

01
What is the termination sequence and how do SIGTERM and SIGKILL differ?
02
What is the PID 1 trap and how do you fix it?

Recap

The window the last lesson promised has an exact shape. Kubernetes marks the pod Terminating and removes it from endpoints, runs the blocking preStop hook, sends SIGTERM to PID 1, waits terminationGracePeriodSeconds (default 30s, covering preStop plus shutdown), and then sends SIGKILL if the process still lives. SIGTERM is a catchable request that runs your handler; SIGKILL is an uncatchable kernel command that destroys the process with no cleanup — so the grace period is a hard deadline and the entire goal is to finish during SIGTERM so SIGKILL never fires. The notorious failure is PID 1: signals reach only PID 1, so an app launched under a shell never sees SIGTERM because the shell swallows it, and even as PID 1 a missing handler means the signal is ignored, since the kernel withholds default dispositions from the protected init slot. Make the app PID 1 with the exec form or insert a tiny init like tini, and register the handler. With the signal arriving and the clock understood, the next lesson tackles the first thing the handler must reckon with — the race between SIGTERM and the load balancer, where traffic keeps arriving after you have been told to stop. Now when you set up a container, you’ll check PID 1 before anything else — a handler that never fires is the same as no handler at all.


