Rollout strategies: trading blast radius for resource cost and rollback speed

Recreate, rolling, blue-green, canary: four ways to ship a new version. The choice is a tradeoff between blast radius, resource cost, and rollback speed, and all four are safe only if the prerequisites underneath them hold: health checks, a rollback plan, and schema compatibility.

A rolling update goes out at 2pm. Dashboards stay green — the Deployment reports “available”, every replica is up. Yet 30% of requests start returning 502s. The new pods booted, Kubernetes marked them ready the instant the container started, and traffic flowed in before the app had connected to the database. There was no readiness probe. The rollout “succeeded” straight into a silent partial outage, and it took a frantic 20 minutes to realise the deploy was the cause.

The four strategies, and what each one costs

This lesson covers the four most common shapes a release can take, and picking one is an engineering tradeoff, not a default. When you understand what each shape costs, you stop defaulting to “rolling because Kubernetes does it” and start making the call deliberately.

Recreate kills every old instance, then starts the new ones. Simple, no version overlap — which matters when two versions genuinely cannot coexist (an incompatible in-place schema change, a singleton job). The cost is downtime: a gap where nothing serves. Only acceptable for non-HA services or maintenance windows.

Rolling update is the Kubernetes default. It replaces pods incrementally — bring up a few new ones, drain a few old ones, repeat — so the service never fully drops. Tuned with two knobs: maxSurge (how many extra pods above the desired count may exist mid-rollout) and maxUnavailable (how many may be missing). Both default to 25%.
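To make the two knobs concrete, here is a minimal sketch of the arithmetic. It assumes percentage values for both knobs and relies on the documented rounding: Kubernetes rounds maxSurge up and maxUnavailable down. The function name `rollout_bounds` is made up for this example.

```python
import math


def rollout_bounds(
    replicas: int,
    max_surge_pct: float = 25,
    max_unavailable_pct: float = 25,
) -> tuple[int, int]:
    """Return (max_total_pods, min_available_pods) during a rolling update."""
    # maxSurge percentages are rounded up to a whole pod.
    surge = math.ceil(replicas * max_surge_pct / 100)
    # maxUnavailable percentages are rounded down to a whole pod.
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return replicas + surge, replicas - unavailable


print(rollout_bounds(10))  # (13, 8)
print(rollout_bounds(4))   # (5, 3)
```

With 10 replicas and the defaults, the rollout may run up to 13 pods at once and may dip to 8 available, so the default already accepts a temporary capacity loss.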

Blue-green runs two complete environments. “Blue” serves production while “green” gets the new version fully deployed and warmed; then you flip the router atomically. Rollback is instant — flip back. The cost is roughly double the resources during the cutover, plus the database-compatibility problem (below).
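The flip itself is tiny, which is why rollback is so fast. A toy sketch follows, where the `Router` class is a hypothetical stand-in for whatever actually switches traffic in your setup (a load balancer target, a Service selector, a DNS record):

```python
class Router:
    """Toy stand-in for the component that points traffic at one environment."""

    def __init__(self, live: str, idle: str) -> None:
        self.live = live  # environment currently serving production
        self.idle = idle  # environment holding the other version

    def flip(self) -> str:
        # A single swap: traffic goes entirely to one environment or the other.
        self.live, self.idle = self.idle, self.live
        return self.live


router = Router(live="blue", idle="green")
router.flip()  # cutover: green now serves
router.flip()  # rollback: blue serves again
```

Cutover and rollback are the same operation. That only holds while the idle environment is kept running and is still compatible with everything it shares with the live one.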

Canary routes a small slice of traffic — typically 5%, then 25%, then 50% — to the new version, watches metrics at each step, and ramps up only if error rate and latency stay healthy. Smallest blast radius, but it needs traffic-shaping and real observability. Tools like Argo Rollouts and Flagger automate the steps and the metric gates (progressive delivery).
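The ramp-and-gate loop can be sketched in a few lines. This is an illustration of the idea, not how Argo Rollouts or Flagger implement it: `run_canary`, `read_error_rate`, and the 1% threshold are assumptions for the example, and it gates on error rate only, where a real gate would also check latency.

```python
from typing import Callable, Iterable


def run_canary(
    read_error_rate: Callable[[int], float],
    steps: Iterable[int] = (5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Ramp traffic step by step; return the final canary weight in percent."""
    current = 0
    for weight in steps:
        # A real rollout would apply the traffic split here and let metrics
        # accumulate for a while before judging the step.
        current = weight
        if read_error_rate(current) > max_error_rate:
            return 0  # gate failed: shift all traffic back to the old version
    return current


healthy = run_canary(lambda weight: 0.001)
broken = run_canary(lambda weight: 0.05 if weight >= 25 else 0.001)
```

Here `healthy` ends at 100 and `broken` ends at 0: the bad release is caught at the 25% step, so it never reaches the remaining users.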

Together these four strategies span the range of tradeoffs you will usually face, from “ship fast and accept downtime” (recreate) to “catch bad releases before they reach most users” (canary). The shape you pick commits you to a specific cost, so learn the costs before you need the rollback.

Rolling update: the readiness probe is not optional

The opening outage is the canonical rolling-update failure. Kubernetes shifts traffic to a new pod the moment it is ready, and with no readiness probe configured a pod counts as ready as soon as its containers start. If your app needs three seconds to open DB connections and warm a cache, that is a three-second window per pod where you are routing live traffic into a process that cannot serve it.

A correct readiness probe closes that window: Kubernetes withholds traffic until the probe passes. (minReadySeconds is a separate guard: it makes the rollout wait that long after a pod becomes ready before counting it as available and moving on.) Without a probe, maxUnavailable: 25% does not protect you, because a quarter of capacity can be “ready” and dead at the same time. The safe config for zero-downtime is maxSurge: 1, maxUnavailable: 0 plus a real readiness probe: never drop below full capacity, and never send traffic to a pod that has not proven it can answer.
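On the application side, a "real" readiness probe needs an endpoint that answers honestly. The following is a minimal sketch using only the Python standard library. The `/readyz` path, the `CHECKS` registry, and the function names are assumptions for illustration; the one Kubernetes fact it relies on is that an HTTP probe treats status codes from 200 to 399 as success.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict

# Hypothetical registry: each check returns True once that dependency is usable,
# e.g. CHECKS["database"] = lambda: pool.is_connected()
# Note: with no checks registered, the endpoint reports ready.
CHECKS: Dict[str, Callable[[], bool]] = {}


def readiness_status(checks: Dict[str, Callable[[], bool]]) -> int:
    """Return 200 only when every dependency check passes, otherwise 503."""
    return 200 if all(check() for check in checks.values()) else 503


class ReadinessHandler(BaseHTTPRequestHandler):
    """Serves the readiness endpoint; any other path gets a 404."""

    def do_GET(self) -> None:
        code = readiness_status(CHECKS) if self.path == "/readyz" else 404
        self.send_response(code)
        self.end_headers()
```

Served with something like `HTTPServer(("", 8080), ReadinessHandler).serve_forever()` and targeted by an httpGet readiness probe, this keeps the pod out of rotation until the database check passes, which is exactly the gap the opening outage fell into.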

Why this works

A readiness probe and a liveness probe are not the same thing, and conflating them causes outages. Liveness restarts a pod it thinks is dead — point it at a slow dependency and a backend hiccup triggers a restart storm. Readiness only removes the pod from the load-balancer rotation. During a rollout it is readiness that gates traffic; a liveness probe doing double duty will happily kill pods that were merely busy.

Blue-green: instant rollback, but the database doesn’t flip

Blue-green’s selling point is the atomic flip and the instant rollback that comes with it. The trap is that you switch the application atomically but the database is shared — it does not flip. The day green ships a migration that drops a column or renames it, blue (still your rollback target) is now broken against the live schema. Flip back and it crashes. Your “instant rollback” is gone exactly when you need it.

The fix is the expand-contract (parallel-change) pattern: never make a breaking schema change in one step.

Phase	Schema action	Why both versions survive
Expand	Add the new column/table; keep the old one	Old code ignores the new field; new code can read both — additive change is backward-compatible
Migrate	Backfill data; dual-write old + new	Both shapes stay populated, so a rollback to old code still finds its data
Contract	Drop the old column — in a later deploy	Only after the old version is fully retired and you’ll never roll back to it

This is what “backward-compatible schema changes” means in practice, and it applies to rolling and canary too — any strategy with two app versions live at once needs a schema both versions can read.
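The expand and migrate phases can be tried end to end with an in-memory SQLite database. This is a sketch under stated assumptions: the `users` table, the rename from `fullname` to `display_name`, and the `v1`/`v2` function names are all invented for the example, and the contract step is left as a comment because it belongs to a later deploy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column and keep the old one. Old code never notices.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Migrate, part 1: backfill existing rows into the new column.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def write_user_v2(db: sqlite3.Connection, name: str) -> None:
    # Migrate, part 2: new code dual-writes both columns.
    db.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


def read_names_v1(db: sqlite3.Connection) -> list:
    # Old version (the rollback target) still reads the old column.
    return [row[0] for row in db.execute("SELECT fullname FROM users ORDER BY id")]


def read_names_v2(db: sqlite3.Connection) -> list:
    # New version reads the new column.
    return [row[0] for row in db.execute("SELECT display_name FROM users ORDER BY id")]


write_user_v2(conn, "Grace Hopper")

# Contract (a later deploy, once v1 is retired): drop the fullname column.
```

After the write, both `read_names_v1` and `read_names_v2` return the same two names, so flipping back to the old version still finds its data.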

Choosing: blast radius vs resource cost vs rollback speed

There is no universally best strategy; you weigh three axes. Blast radius — how many users a bad release hits before you catch it: canary is smallest, recreate is total. Resource cost — blue-green pays for two full environments; canary and rolling reuse the same pool; recreate is cheapest. Rollback speed — blue-green and canary are near-instant (flip the router / shift weight to zero); rolling has to roll backward pod by pod; recreate means a second downtime.

The deciding factor is usually how good your observability is. When you reach for canary, ask yourself first: can I actually read the 5% slice? Canary’s tiny blast radius is worthless if you cannot tell from your metrics that the 5% slice is failing — you will ramp it to 100% blind. Without solid dashboards and SLO-based alerts, a well-tuned rolling update with a readiness probe beats a canary you cannot read.
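One way to keep yourself honest is to write the decision down as code. The sketch below is one possible ordering of the questions, not a rule: the `Context` fields and the priority between branches are assumptions made for illustration, and your own constraints may reorder them.

```python
from dataclasses import dataclass


@dataclass
class Context:
    versions_can_coexist: bool        # can old and new run side by side?
    readable_canary_metrics: bool     # dashboards and SLO alerts on the canary slice
    traffic_shaping: bool             # can you route a weighted slice of traffic?
    spare_capacity_for_second_env: bool
    needs_instant_rollback: bool


def choose_strategy(ctx: Context) -> str:
    """Suggest a rollout strategy from the tradeoffs described in this lesson."""
    if not ctx.versions_can_coexist:
        return "recreate"  # no overlap possible, so accept the downtime
    if ctx.readable_canary_metrics and ctx.traffic_shaping:
        return "canary"  # smallest blast radius, and you can actually read it
    if ctx.needs_instant_rollback and ctx.spare_capacity_for_second_env:
        return "blue-green"  # pay for two environments, get the atomic flip
    return "rolling"  # the default, safe only with a real readiness probe


blind_team = Context(
    versions_can_coexist=True,
    readable_canary_metrics=False,
    traffic_shaping=True,
    spare_capacity_for_second_env=False,
    needs_instant_rollback=False,
)
print(choose_strategy(blind_team))  # rolling
```

A team with traffic shaping but no readable metrics lands on rolling, which matches the point above: a canary you cannot read is not a canary.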

Pick the best fit

A payments API on Kubernetes is shipping a risky refactor. You have Prometheus dashboards and SLO alerts, ample cluster capacity, and need the smallest possible user impact if it goes wrong. Pick the rollout strategy.

Quiz

A rolling update reports 'available' but ~30% of requests return 502s right after deploy. What's the most likely root cause?

Quiz

You're using blue-green and want rollback to stay safe. Green's release renames a column. What do you do?

Order the steps

Order a safe canary rollout of a risky change:

1 Ship a backward-compatible (expand) schema change so old and new versions both work
2 Route 5% of traffic to the new version
3 Watch error-rate and latency against the SLO at this step
4 If healthy, ramp 25% → 50% → 100%; if not, shift weight back to 0 (rollback)
5 Once fully on the new version and stable, contract: drop the old schema in a later deploy

Traffic shifts to a new pod only after its readiness probe passes; without that gate, the Service routes to a pod that started but cannot serve yet — the silent partial outage. Repeat per pod until v2 fully replaces v1.

Recall before you leave

01 A rolling update is reporting success but users see intermittent 502s. Explain why, and what one change fixes it.
02 Why does blue-green's 'instant rollback' fail when a release changes the schema, and how does expand-contract restore it?

Recap

Four rollout shapes, one tradeoff. Recreate is simple but takes downtime — fine only for non-HA services. Rolling update is the Kubernetes default, tuned with maxSurge and maxUnavailable (both 25% by default), and it is only safe with a correct readiness probe: without one, traffic flows to pods that started but can’t serve, producing a “successful” deploy straight into a silent partial outage. Blue-green gives an atomic flip and instant rollback at double the resource cost — but the shared database doesn’t flip, so a breaking migration kills your rollback target unless you use expand-contract (add, dual-write, drop later). Canary has the smallest blast radius — 5% → 25% → 50% gated on metrics, automatable with Argo Rollouts or Flagger — but it is only as good as the observability reading the canary slice. Choose by weighing blast radius against resource cost against rollback speed, and remember the load-bearing prerequisite under all four: health checks, a rollback plan, and backward-compatible schema changes. Now when you see a Deployment reporting “available” while users are getting 502s, your first question is: is there a real readiness probe — and if you see a schema rename about to go out with blue-green, you know to reach for expand-contract before the router flips.
