
Backend Architecture

Graceful shutdown: zero request loss on deploy

Crux: a hands-on project. Build a service that drops requests on deploy, then make it shut down gracefully and prove zero request loss across a rolling deploy with before/after numbers.
Level: senior
◷ 240 min

Reading about the deregistration race is not the same as watching your own deploy shed 502s and then making them vanish. Build a small service that loses requests on every rollout, drive it under load, and apply the unit’s discipline — signal, deregister, drain, dispose, coordinate — until a rolling deploy is invisible to the client, with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: reproduce deploy-induced request loss under load, then add a correct shutdown path and fleet coordination, and verify zero (or near-zero) loss across a rolling deploy with before/after metrics.

Project
Objective

Take a small HTTP service with one slow endpoint and one queue worker, deploy it on Kubernetes (kind/minikube is fine), and drive its deploy-time request loss to zero — or to a documented, bounded minimum — proving each step with measurements rather than assertions.
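To keep the measurements honest, the before/after numbers can be computed from raw load-generator samples instead of being read off a dashboard. The sketch below assumes each sample is recorded as an `(outcome, latency_ms)` pair, where `outcome` is either an integer HTTP status or the string `"reset"` or `"refused"` for connection-level failures. That record format and the `summarize` name are illustrative assumptions, not something a particular load tool produces.

```python
import math
from collections import Counter

def summarize(samples):
    """Summarize one deploy window.

    samples: list of (outcome, latency_ms). outcome is an int HTTP status,
    or "reset" / "refused" for connection-level failures (assumed format).
    """
    if not samples:
        raise ValueError("no samples recorded for this window")
    counts = Counter(outcome for outcome, _ in samples)
    # A request counts as lost if the connection failed or the status is 5xx.
    failed = sum(
        n for outcome, n in counts.items()
        if outcome in ("reset", "refused")
        or (isinstance(outcome, int) and outcome >= 500)
    )
    latencies = sorted(lat for _, lat in samples)
    # Nearest-rank p99: the smallest value with at least 99% of samples at or below it.
    rank = max(1, math.ceil(99 * len(latencies) / 100))
    return {
        "total": len(samples),
        "error_rate": failed / len(samples),
        "502": counts[502],
        "reset": counts["reset"],
        "refused": counts["refused"],
        "p99_ms": latencies[rank - 1],
    }
```

Running `summarize` once on the samples from the broken rollout and once on the samples from the fixed rollout, under the same load profile, yields the two rows of the table.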

Requirements
Acceptance criteria
  • A before/after table: deploy-window error rate, 502/reset/refused counts, p99 request latency during the roll, and duplicated/lost job side effects — measured under identical load, not estimated.
  • Logs proving the SIGTERM handler fires, readiness flips before the listener closes, and resources close in reverse dependency order; the guardian timeout path is exercised at least once and force-exits cleanly.
  • A kill-mid-job demonstration showing the side effect is applied exactly once after redelivery, proving the consumer is idempotent.
  • After numbers show zero (or a documented, bounded near-zero) request loss across the rolling deploy, with the connection-refused burst and the capacity dip both gone.
  • A one-paragraph write-up naming which lever fixed each failure mode (PID 1, readiness+preStop, reverse-order drain, requeue+idempotency, surge+jitter) and why ordering mattered.
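For the kill-mid-job criterion, one common way to get an exactly-once side effect on top of at-least-once delivery is to record the job ID and apply the effect in the same transaction. The following is a minimal sketch using an in-memory SQLite database. The table names, the `handle` function, and the `crash_before_commit` flag that stands in for a real kill are all hypothetical, and a real demonstration would kill the process rather than raise an exception.

```python
import sqlite3

def make_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed (job_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE balance (id INTEGER PRIMARY KEY, amount INTEGER)")
    db.execute("INSERT INTO balance VALUES (1, 0)")
    db.commit()
    return db

def handle(db, job_id, amount, crash_before_commit=False):
    """Apply the side effect and mark the job processed in one transaction."""
    with db:  # commits on normal exit, rolls back if an exception escapes
        cur = db.execute(
            "INSERT OR IGNORE INTO processed VALUES (?)", (job_id,)
        )
        if cur.rowcount == 0:
            return "duplicate"  # already applied by an earlier delivery
        db.execute(
            "UPDATE balance SET amount = amount + ? WHERE id = 1", (amount,)
        )
        if crash_before_commit:
            # Stand-in for the pod being killed mid-job: nothing is committed.
            raise RuntimeError("simulated kill mid-job")
    return "applied"

def current_balance(db):
    return db.execute("SELECT amount FROM balance WHERE id = 1").fetchone()[0]
```

Because the marker row and the side effect commit together, a job that dies before commit leaves no trace and is safely reapplied on redelivery, while a job that committed is recognized and skipped.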
Senior stretch
  • Add an on-call runbook: how to read a deploy-window error spike, the five failure modes and their fixes, and a pre-deploy verification checklist.
  • Measure and tune the real deregistration delay: instrument the time from readiness-fail to last-routed-request, and size the preStop sleep and drain window to ≈3× your observed p99 instead of a guessed constant.
  • Reproduce and then fix the thundering reconnect explicitly: drive thousands of keep-alive clients, show the un-jittered synchronized close toppling cold pods, then show jitter smoothing the reconnect curve.
  • Repeat the experiment with long-lived connections (WebSocket/SSE or large uploads): push the deregistration delay to 600s+, and show clean 503 + Retry-After rejection for operations that cannot finish in the budget.
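The sizing rule in the second stretch item (about 3× the observed p99) is simple arithmetic once the delays are instrumented. A minimal sketch, assuming you have collected one delay per terminated pod, measured in seconds from readiness-fail to last routed request. The function names, the 5-second floor, and the 2-second margin are illustrative assumptions, not recommended constants.

```python
import math

def size_prestop_seconds(delays_s, factor=3.0, floor_s=5):
    """Return a preStop sleep (whole seconds) of about factor x observed p99.

    delays_s: seconds from readiness-fail to last routed request, per pod.
    floor_s:  assumed lower bound so a tiny sample set cannot produce ~0.
    """
    if not delays_s:
        raise ValueError("no deregistration delays observed")
    ordered = sorted(delays_s)
    # Nearest-rank p99 of the observed delays.
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    p99 = ordered[rank - 1]
    return max(floor_s, math.ceil(factor * p99))

def required_grace_seconds(prestop_s, drain_s, margin_s=2):
    """The preStop hook and the drain both run inside the pod's termination
    grace period, so the grace period has to cover their sum plus a margin."""
    return prestop_s + drain_s + margin_s
```

For example, observed delays of 0.8, 1.0, 1.1, 1.2 and 2.5 seconds give a p99 of 2.5 s and a preStop sleep of 8 s. With a 20 s drain window, `terminationGracePeriodSeconds` would then need to be at least 30.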
Recap

This is the loop you will run for every real service before you trust it under churn. Reproduce the deploy-time loss first, then fix it in order: make SIGTERM reach a handler in PID 1; fail readiness and wait out deregistration before you stop accepting; drain keep-alive connections and tear down in reverse dependency order under a guardian timeout; reject or idempotently requeue work that won’t fit the budget; and lift it to the fleet with surge-before-drain and jittered closes. Prove each step with before/after numbers under identical load. Doing it once on a toy service makes the production version muscle memory, and turns ‘we lose a few requests on every deploy’ into a deploy nobody notices.
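The single-process part of that ordering can be sketched as a small coordinator. This is a minimal illustration, not a reference implementation: the `GracefulShutdown` class, its method names, and the injected `stop_accepting` and `drain` callbacks are hypothetical, and the sleep is injectable only so the sequence can be tested without waiting.

```python
import threading
import time

class GracefulShutdown:
    def __init__(self, deregistration_delay_s, guardian_timeout_s, sleep=time.sleep):
        self.ready = True        # what the readiness endpoint would report
        self.log = []            # ordered record of steps, usable as evidence
        self._resources = []     # (name, close_fn) in the order they were opened
        self._delay = deregistration_delay_s
        self._guardian = guardian_timeout_s
        self._sleep = sleep

    def register(self, name, close_fn):
        """Register a resource after opening it; it is closed in reverse order."""
        self._resources.append((name, close_fn))

    def run(self, stop_accepting, drain):
        """Run the shutdown sequence; return the process exit code."""
        self.ready = False                   # 1. fail readiness first
        self.log.append("readiness=false")
        self._sleep(self._delay)             # 2. wait out deregistration
        stop_accepting()                     # 3. only now close the listener
        self.log.append("listener closed")
        worker = threading.Thread(target=self._teardown, args=(drain,), daemon=True)
        worker.start()
        worker.join(self._guardian)          # 4. guardian bounds drain + dispose
        if worker.is_alive():
            self.log.append("guardian timeout: forcing exit")
            return 1
        return 0

    def _teardown(self, drain):
        drain()                              # finish in-flight requests
        self.log.append("drained")
        for name, close_fn in reversed(self._resources):
            close_fn()                       # dependents close before dependencies
            self.log.append(f"closed {name}")
```

In a real service, `run` would be triggered from a handler installed with `signal.signal(signal.SIGTERM, ...)` in the process that is actually PID 1, and its return value would be passed to `sys.exit`. The fleet-level levers (surge before drain, jittered closes) live in the rollout configuration and are outside what this sketch covers.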

