Capstone: a deploy is a chain, and the outage lives in the seam

Each deploy stage is correct alone, yet the release breaks where two compose wrong: a rolling update with no readiness probe, a blue-green flip over an incompatible migration, an L4 LB that can’t drain. The glue is health checks, expand-contract, and rollout metrics.

Every box in the release diagram was green. The image built and pushed. The k8s Deployment applied without error. The rolling update reported success in 40 seconds. Then the pager went off: 30% of requests returning 502 for two minutes. Nobody had shipped a bug. The image was fine, the manifest was fine, the migration was fine — separately. The Deployment had no readinessProbe, so k8s declared each new pod “ready” the instant the container process started, before the app had opened its DB pool. The rolling update dutifully shifted traffic to pods that were alive but not yet serving. The outage did not live in any one stage. It lived in the seam between two correct ones.

The pipeline is one composed object, not seven steps

The whole deployment track has been one stage at a time: build a lean multi-stage image, push it to a registry, declare k8s objects, pick a rollout strategy, front it with a load balancer, inject secrets, codify it all as infrastructure-as-code. The capstone insight is that none of those stages ships software on its own. A release is the composition of all of them, and composition has emergent failure modes that no single stage’s tests can catch.

Think of it as a chain where each link passes an artifact to the next:

Stage	Produces	The seam it can break
Multi-stage build	An immutable image + digest	Build deps leak into runtime → bloat, CVEs
Push to registry	A pullable, tagged artifact	Mutable `:latest` tag → which build is actually live?
k8s objects	Deployment + Service + Ingress + Config/Secret	No probe → “ready” lies to the rollout
Rollout strategy	Old → new traffic shift	Flip over an incompatible schema → old code 500s
Load balancer	Client traffic → healthy backends	L4 can’t drain → in-flight requests killed mid-flight
Secrets at deploy	Config injected at runtime	Secret baked into image → leaked + un-rotatable
IaC	Reproducible cluster state	Drift → the env you tested isn’t the one you shipped

Every entry in the “seam” column is a real outage someone has shipped. Each presupposes that the stage before it did its job, and the release broke anyway, because the contract between stages was never enforced.
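Two of the seams in the table can be closed in the pod template itself. The fragment below is a minimal sketch, not a complete manifest: the registry host, repository, digest placeholder, Secret name, and key are all hypothetical and stand in for whatever your build and cluster actually produce.

```yaml
# Fragment of a Deployment (spec.template.spec); all names are placeholders.
containers:
  - name: app
    # Pin by digest rather than a mutable tag, so "what is live" has one answer.
    image: registry.example.com/shop/api@sha256:<digest-from-your-build>
    env:
      - name: DATABASE_URL
        valueFrom:
          # Injected at runtime from a Secret object, not baked into the image,
          # so rotating it does not require a rebuild.
          secretKeyRef:
            name: app-secrets
            key: database-url
```

The digest ties the rollout to exactly the artifact the build stage produced, and the secretKeyRef keeps the credential out of the image layers.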

Health checks are the glue between “running” and “serving”

The single most load-bearing piece of glue is the readiness probe, because it defines the contract the rollout depends on. A rolling update’s whole job is: bring up a new pod, wait until it’s ready, then take down an old one, at most maxUnavailable at a time. The word “ready” is doing enormous work there. Without a readinessProbe, k8s reports a pod ready the moment its container’s main process starts, which is before the JVM has warmed, before the connection pool is open, before caches are primed. The rolling update sees green, adds the pod to the Service endpoints, and traffic flows into a process that refuses connections or errors out, which the proxy in front surfaces as 502s.

When you configure a probe, ask yourself: does this endpoint go red while the thing the request actually needs is still initializing? If the answer is no, you have decoration, not a gate. The fix is a probe that actually exercises dependencies, plus surge math that never drops below capacity:

# On the container (spec.template.spec.containers[])
readinessProbe:
  httpGet: { path: /healthz/ready, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 5
# On the Deployment (spec)
strategy:
  rollingUpdate:
    maxUnavailable: 0   # never below desired replicas
    maxSurge: 1         # add one new pod before removing an old one

Two probes do two different jobs and conflating them is its own outage. The readiness probe controls traffic — fail it and the pod is pulled from the Service, but kept running. The liveness probe controls restarts — fail it and the kubelet kills and recreates the container. Wire a slow dependency (a flaky DB) into your liveness probe and you get a restart storm: every pod that can’t reach the DB gets killed, which makes the DB problem worse, not better. Liveness should check “am I deadlocked,” not “is my whole world healthy.”
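The split between the two probes might look like the fragment below. This is a sketch under assumptions: /healthz/live is a hypothetical endpoint that only confirms the process can answer at all, and the periods and thresholds are illustrative rather than recommended values.

```yaml
# Fragment of a container spec; the endpoint paths are assumed to exist in the app.
containers:
  - name: app
    readinessProbe:               # gates traffic: failing removes the pod from the Service
      httpGet: { path: /healthz/ready, port: 8080 }   # checks DB pool, cache, downstreams
      periodSeconds: 5
      failureThreshold: 3
    livenessProbe:                # gates restarts: failing makes the kubelet restart the container
      httpGet: { path: /healthz/live, port: 8080 }    # no dependency checks, only "can I respond"
      periodSeconds: 10
      failureThreshold: 3
```

With this split, a DB outage turns the readiness probe red and drains traffic from affected pods, while the liveness probe stays green and nothing is restarted.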

Why this works

A subtle trap: a probe at path: /healthz that returns 200 if the web server is up tells the rollout nothing useful, because the web server is up almost instantly. The probe has to fail while the thing the request actually needs — the DB pool, the cache connection, the downstream client — is still initializing. A health check that can’t go red during startup is decoration, not glue.

The rollout strategy and the migration are one decision

The most expensive seam in the whole chain is between the rollout strategy and the database, because the LB flip is reversible and the schema change usually is not. Picture a blue-green deploy: blue is live, green is the new version, you flip the LB and roll back instantly if green misbehaves. Now suppose green’s release “cleaned up” the schema — it dropped a column blue still reads. The flip succeeds, green serves fine. Then green throws an error and you roll back to blue — and blue immediately 500s on every request, because the column it needs is gone. Your “instant rollback” is now a forward-only emergency.

The discipline that makes rollout-and-migration composable is expand-contract (also “parallel change”): never ship a schema change that is incompatible with the currently-running code. You split one logical change into separate deploys, each maintaining N-1 compatibility (new schema works with old code, and vice versa):

Expand: add the new column/table; keep the old one. Both versions work.
Migrate: backfill data into the new shape; dual-write from the app.
Contract: only after all traffic runs on new code and nothing reads or writes the old column, drop it. A separate, later deploy.

The cost is real: a column rename that “should” be one migration becomes three deploys spread across releases, plus dual-write code you have to remember to delete. The payoff is that every intermediate state is rollback-safe, so your rollout strategy keeps the reversibility it promised.

Pick the best fit

You need to rename a hot column on a 200M-row table while keeping zero-downtime rollouts reversible. Pick the approach.

Draining and observability close the chain

Even with probes and safe migrations, the cutover can kill requests that were already in flight. When a pod is terminated, k8s removes it from the Service endpoints and sends SIGTERM, but those two events race, and the LB may still route a few requests to the dying pod for a beat. An L7 load balancer can drain: it stops sending new requests and waits a configurable timeout (a good default is 1.5–2× your p99 request time) for in-flight ones to finish. A naive L4 setup sees only connections, not requests, so pulling a backend can sever a response mid-stream. The application side has to cooperate: catch SIGTERM, stop accepting new work, finish in-flight requests, then exit, all inside terminationGracePeriodSeconds (default 30s), with a preStop sleep to cover the endpoint-removal race. The preStop sleep counts against that same grace period, so budget for both.
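On the Kubernetes side, the termination settings described above might be wired up roughly as follows. The 5-second sleep is an assumed value, not a recommendation from this lesson, and the exec form assumes the image ships a sleep binary.

```yaml
# Fragment of a Deployment (spec.template.spec); values are illustrative.
terminationGracePeriodSeconds: 30   # the Kubernetes default, stated explicitly
containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          # Runs before SIGTERM is delivered, giving the endpoint removal
          # time to propagate to the LB before the app starts shutting down.
          command: ["sleep", "5"]
```

Whatever the app needs to finish in-flight requests after SIGTERM has to fit in the time left once the preStop hook returns; anything still running when the grace period ends is killed.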

And the only reason you knew the readiness-probe outage in the Hook was 502s and not “success” is observability. A rollout’s exit code tells you the manifest applied; it does not tell you error rate, p99 latency, or saturation on the new pods. The load-bearing signal is comparing the new version’s golden metrics against the old version’s baseline during the rollout — which is exactly what a canary does automatically: shift 5% of traffic, watch error rate and latency for a few minutes, promote or auto-rollback on the metric, not on the deploy command’s return value.
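The lesson does not name a canary tool. As one possible illustration, the shift-watch-promote loop could be expressed with Argo Rollouts roughly as below. Treat this as a sketch of a partial spec: the resource name and the error-rate-vs-baseline AnalysisTemplate are hypothetical, the step durations are assumptions, and the field names should be checked against the Argo Rollouts documentation for your version.

```yaml
# Partial Rollout spec (replicas, selector, and pod template omitted).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api                          # hypothetical name
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5               # shift about 5% of traffic to the new version
        - pause: { duration: 5m }    # let metrics accumulate
        - analysis:
            templates:
              - templateName: error-rate-vs-baseline   # hypothetical AnalysisTemplate
        - setWeight: 50
        - pause: { duration: 5m }
```

The point of the design is that promotion depends on the analysis result rather than on the deploy command returning zero: a failed analysis aborts the rollout and traffic goes back to the stable version.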

Quiz

A rolling update reports success in 40 seconds, but 30% of requests 502 for two minutes afterward. The image, manifest, and migration are all individually fine. What is the most likely cause?

Quiz

Why is a blue-green flip dangerous when paired with a migration that drops a column?

Order the steps

Order a safe column-rename release so every intermediate state is rollback-safe:

1 Deploy 1 — Expand: add the new column, leave the old one in place
2 Deploy 2 — Ship code that dual-writes both columns and backfills existing rows
3 Verify all running pods are on the new code reading the new column
4 Deploy 3: ship code that reads and writes only the new column, removing the dual-write
5 Deploy 4 (Contract): drop the old column now that nothing reads or writes it

Each stage is correct alone and passes an artifact to the next; the outage lives in the seam between two stages, where a contract — a readiness probe, an N-1 compatible schema, an L7 drain — was never enforced.

Recall before you leave

01
Walk a teammate through why a perfectly correct rolling update can still cause an outage, and what single piece of configuration prevents it.
02
Explain expand-contract and why it is what makes a rollout strategy and a database migration safely composable.

Recap

A deploy is not seven independent steps; it is one composed object, and the outages live in the seams where two individually correct stages meet under a contract nobody enforced. The readiness probe is the contract between “the process is running” and “the app can serve,” and without it a flawless rolling update routes traffic to pods that are running but not yet able to serve. The rollout strategy and the database migration are a single decision: a reversible LB flip composed with a destructive, non-reversible schema change produces a state from which you cannot roll back, which is why expand-contract (keeping every schema N-1 compatible across separate expand, migrate, and contract deploys) is the glue that lets them coexist. An L7 load balancer that drains and an app that handles SIGTERM within terminationGracePeriodSeconds keep in-flight requests alive across the cutover, and rollout metrics (error rate and p99 versus baseline, as a canary checks automatically) are what make “success” mean healthy rather than merely applied. Build the image lean, push an immutable digest, declare the objects, inject secrets at runtime, and codify the whole thing as IaC so the environment you tested is the one you shipped; then the chain holds because you engineered the seams, not just the links. Now when you see a post-deploy 502 spike or a failed rollback, your first question is not “which stage broke?” but “which seam between two correct stages was never enforced?”
