Graceful shutdown: zero request loss on deploy

Hands-on project — build a service that drops requests on deploy, then make it shut down gracefully and prove zero request loss across a rolling deploy with before/after numbers.

Track: BE · Level: Senior · Time: 240 min

Reading about the deregistration race is not the same as watching your own deploy shed 502s and then making them vanish. Build a small service that loses requests on every rollout, drive it under load, and apply the unit’s discipline — signal, deregister, drain, dispose, coordinate — until a rolling deploy is invisible to the client, with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: reproduce deploy-induced request loss under load, then add a correct shutdown path and fleet coordination, and verify zero (or near-zero) loss across a rolling deploy with before/after metrics.

Project

Objective

Take a small HTTP service with one slow endpoint and one queue worker, deploy it on Kubernetes (kind/minikube is fine), and drive its deploy-time request loss to zero — or to a documented, bounded minimum — proving each step with measurements rather than assertions.
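
One way to keep the before/after comparison honest is to compute both columns from raw load-generator samples with the same function. The sketch below is only an illustration under assumptions: the `Sample` record, the outcome labels (`"502"`, `"reset"`, `"refused"`), and the `summarize` helper are hypothetical names, and it assumes your load tool can export one outcome and one latency per request.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    outcome: str        # hypothetical labels: "200", "502", "reset", "refused"
    latency_ms: float   # client-observed latency for this request


def percentile(values, pct):
    """Nearest-rank percentile; returns None when there are no values."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Reduce raw samples from one deploy window to the table's columns."""
    counts = Counter(s.outcome for s in samples)
    # Anything that is not a 2xx status counts as a failed request here.
    failures = sum(n for outcome, n in counts.items()
                   if not outcome.startswith("2"))
    return {
        "total": len(samples),
        "error_rate": failures / len(samples) if samples else 0.0,
        "502": counts["502"],
        "reset": counts["reset"],
        "refused": counts["refused"],
        "p99_ms": percentile([s.latency_ms for s in samples], 99),
    }
```

Running `summarize` once on the samples captured before the fix and once after, under the same load profile, gives two rows that can be compared directly.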

Requirements

Acceptance criteria

A before/after table: deploy-window error rate, 502/reset/refused counts, p99 request latency during the roll, and duplicated/lost job side effects — measured under identical load, not estimated.
Logs proving the SIGTERM handler fires, readiness flips before the listener closes, and resources close in reverse dependency order; the guardian timeout path is exercised at least once and force-exits cleanly.
A kill-mid-job demonstration showing the side effect is applied exactly once after redelivery, proving the consumer is idempotent.
After-fix numbers showing zero (or documented, bounded near-zero) request loss across the rolling deploy, with the connection-refused burst and the capacity dip both gone.
A one-paragraph write-up naming which lever fixed each failure mode (PID 1, readiness+preStop, reverse-order drain, requeue+idempotency, surge+jitter) and why ordering mattered.
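
The kill-mid-job criterion above can be rehearsed in memory before trying it against a real queue. This is a minimal sketch, assuming each job carries a stable ID and that redelivery reuses it; `IdempotentConsumer` and `kill_mid_job_then_redeliver` are hypothetical names, and the in-memory set stands in for a durable dedup store. In a real consumer the side effect and the dedup record would need to commit together (for example in one database transaction), which this sketch does not model.

```python
class IdempotentConsumer:
    """Applies each job's side effect at most once, keyed by job ID."""

    def __init__(self):
        self.applied_ids = set()   # stand-in for a durable dedup store
        self.total = 0             # stand-in for the real side effect

    def handle(self, job_id, amount):
        # A redelivered job is recognised by its ID and skipped.
        if job_id in self.applied_ids:
            return "skipped-duplicate"
        self.total += amount
        self.applied_ids.add(job_id)
        return "applied"


def kill_mid_job_then_redeliver(consumer, job_id, amount):
    """Simulate: effect lands, worker dies before ack, broker redelivers."""
    first = consumer.handle(job_id, amount)    # effect applied, ack never sent
    second = consumer.handle(job_id, amount)   # same job delivered again
    return first, second
```

The evidence for the acceptance criterion is then the pair of outcomes plus the final value of the side effect, which should reflect a single application.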

Senior stretch

Add an on-call runbook: how to read a deploy-window error spike, the five failure modes and their fixes, and a pre-deploy verification checklist.
Measure and tune the real deregistration delay: instrument the time from readiness-fail to last-routed-request, and size the preStop sleep and drain window to ≈3× your observed p99 instead of a guessed constant.
Reproduce and then fix the thundering reconnect explicitly: drive thousands of keep-alive clients, show the un-jittered synchronized close toppling cold pods, then show jitter smoothing the reconnect curve.
Repeat the experiment with long-lived connections (WebSocket/SSE or large uploads): push the deregistration delay to 600s+, and show clean 503 + Retry-After rejection for operations that cannot finish in the budget.
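
Two of the stretch items reduce to small calculations that are easy to get subtly wrong. The sketch below is a hedged illustration: `size_pre_stop_sleep` turns measured readiness-fail-to-last-routed-request delays into a sleep of about 3× the observed p99, and `jittered_close_delays` spreads keep-alive closes uniformly over a window instead of closing them all at once. Both function names, the nearest-rank percentile, and the uniform jitter distribution are assumptions chosen for the example.

```python
import math
import random


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def size_pre_stop_sleep(dereg_delays_s, factor=3.0):
    """Whole seconds to sleep in preStop: factor x observed p99, rounded up."""
    return math.ceil(factor * p99(dereg_delays_s))


def jittered_close_delays(n_connections, window_s, rng):
    """One independent uniform delay per connection, each in [0, window_s]."""
    return [rng.uniform(0.0, window_s) for _ in range(n_connections)]


if __name__ == "__main__":
    # Hypothetical measurements, in seconds, from an instrumented rollout.
    observed = [0.5] * 98 + [1.5, 4.0]
    print("preStop sleep (s):", size_pre_stop_sleep(observed))
    print("first close delays:",
          jittered_close_delays(3, 30.0, random.Random())[:3])
```

Whatever sleep comes out, the pod's termination grace period still has to cover the preStop sleep plus the drain window, or the guardian path becomes the normal path.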

Recap

This is the loop you will run for every real service before you trust it under churn: reproduce the deploy-time loss first, then fix it in order — make SIGTERM reach a handler in PID 1, fail readiness and wait out deregistration before you stop accepting, drain keep-alive and tear down in reverse dependency order under a guardian timeout, reject or idempotently requeue work that won’t fit the budget, and lift it to the fleet with surge-before-drain and jittered closes. Prove each step with before/after numbers under identical load. Doing it once on a toy service makes the production version muscle memory — and turns ‘we lose a few requests on every deploy’ into a deploy nobody notices.
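
The ordering in the recap can be sketched as a small coordinator. This is an illustration under assumptions, not the lesson's reference solution: the `GracefulShutdown` class, its log labels, and the injected `drain` and `force_exit` callables are hypothetical. In a real process `on_sigterm` would typically be installed with `signal.signal(signal.SIGTERM, ...)` and `force_exit` would be something like `os._exit`; here both are injected so the sequence can be exercised without killing the interpreter.

```python
import threading
import time


class GracefulShutdown:
    """Runs the shutdown steps in order, under a guardian timeout."""

    def __init__(self, dereg_wait_s, guardian_timeout_s, drain, force_exit,
                 sleep=time.sleep):
        self.ready = True        # what the readiness probe reports
        self.accepting = True    # whether the listener takes new connections
        self.log = []            # ordered record of the steps taken
        self._closers = []       # (name, close) pairs, dependencies first
        self._dereg_wait_s = dereg_wait_s
        self._guardian_timeout_s = guardian_timeout_s
        self._drain = drain
        self._force_exit = force_exit
        self._sleep = sleep

    def register(self, name, close):
        """Register resources dependencies-first, e.g. db before handlers."""
        self._closers.append((name, close))

    def readiness_status(self):
        return 200 if self.ready else 503

    def on_sigterm(self):
        # Run the sequence on a worker so the guardian can bound its duration.
        worker = threading.Thread(target=self._sequence, daemon=True)
        worker.start()
        worker.join(self._guardian_timeout_s)
        if worker.is_alive():
            self.log.append("guardian-timeout")
            self._force_exit(1)

    def _sequence(self):
        self.ready = False                 # 1. fail readiness first
        self.log.append("readiness-fail")
        self._sleep(self._dereg_wait_s)    # 2. wait out deregistration
        self.accepting = False             # 3. only now stop accepting
        self.log.append("listener-closed")
        self._drain()                      # 4. let in-flight work finish
        self.log.append("drained")
        for name, close in reversed(self._closers):
            close()                        # 5. reverse dependency order
            self.log.append("closed:" + name)
```

The `log` list is the same evidence the acceptance criteria ask for: readiness fails before the listener closes, resources close in reverse registration order, and a stuck drain ends in the guardian path rather than a hang.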
