Rollout strategies: trading blast radius for resource cost and rollback speed
A rolling update goes out at 2pm. Dashboards stay green — the Deployment reports “available”, every replica is up. Yet 30% of requests start returning 502s. The new pods booted, Kubernetes marked them ready the instant the container started, and traffic flowed in before the app had connected to the database. There was no readiness probe. The rollout “succeeded” straight into a silent partial outage, and it took a frantic 20 minutes to realise the deploy was the cause.
The four strategies, and what each one costs
Most releases take one of four shapes, and picking one is an engineering tradeoff, not a default.
Recreate kills every old instance, then starts the new ones. Simple, no version overlap — which matters when two versions genuinely cannot coexist (an incompatible in-place schema change, a singleton job). The cost is downtime: a gap where nothing serves. Only acceptable for non-HA services or maintenance windows.
Rolling update is the Kubernetes default. It replaces pods incrementally — bring up a few new ones, drain a few old ones, repeat — so the service never fully drops. Tuned with two knobs: maxSurge (how many extra pods above the desired count may exist mid-rollout) and maxUnavailable (how many may be missing). Both default to 25%.
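To make the two knobs concrete, here is a minimal sketch of the arithmetic, assuming the documented rounding rules (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down). The function name `rollout_bounds` is hypothetical, and raising when both values resolve to zero is a simplification of what Kubernetes actually does.

```python
from typing import Tuple, Union


def rollout_bounds(
    replicas: int,
    max_surge: Union[int, str] = "25%",
    max_unavailable: Union[int, str] = "25%",
) -> Tuple[int, int]:
    """Return (max_total_pods, min_available_pods) during a rolling update."""

    def resolve(value: Union[int, str], round_up: bool) -> int:
        # Percentages are taken of the desired replica count.
        if isinstance(value, str) and value.endswith("%"):
            scaled = int(value[:-1]) * replicas
            # Integer ceiling/floor avoids floating-point surprises.
            return -(-scaled // 100) if round_up else scaled // 100
        return int(value)

    surge = resolve(max_surge, round_up=True)                # rounds up
    unavailable = resolve(max_unavailable, round_up=False)   # rounds down
    if surge == 0 and unavailable == 0:
        # Simplification: with both at zero the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas + surge, replicas - unavailable


print(rollout_bounds(10))        # defaults on a 10-replica Deployment
print(rollout_bounds(3, 1, 0))   # surge by one, never dip below desired count
```

With the defaults, a 10-replica Deployment may briefly run up to 13 pods and as few as 8 available ones. With `maxSurge: 1, maxUnavailable: 0` the available count never drops below the desired count.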
Blue-green runs two complete environments. “Blue” serves production while “green” gets the new version fully deployed and warmed; then you flip the router atomically. Rollback is instant — flip back. The cost is roughly double the resources during the cutover, plus the database-compatibility problem (below).
Canary routes a small slice of traffic — typically 5%, then 25%, then 50% — to the new version, watches metrics at each step, and ramps up only if error rate and latency stay healthy. Smallest blast radius, but it needs traffic-shaping and real observability. Tools like Argo Rollouts and Flagger automate the steps and the metric gates (progressive delivery).
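The ramp-and-gate loop can be sketched in a few lines. This is an illustrative model only, not how Argo Rollouts or Flagger are implemented: `run_canary` and the `is_healthy` callback are hypothetical names, and in a real setup the callback would query your metrics system for error rate and latency at the current weight.

```python
from typing import Callable, List, Sequence, Tuple


def run_canary(
    steps: Sequence[int],
    is_healthy: Callable[[int], bool],
) -> Tuple[int, List[int]]:
    """Walk the traffic weights in order.

    Returns (final_weight, weights_visited). The final weight is 100 if
    every gate passed, or 0 if any gate failed (rollback).
    """
    visited: List[int] = []
    for weight in steps:
        visited.append(weight)       # shift this share of traffic to the new version
        if not is_healthy(weight):   # metric gate: error rate and latency vs. the SLO
            return 0, visited        # unhealthy: shift all traffic back to the old version
    return 100, visited              # every gate passed: promote fully


# Healthy at every step: promoted to 100%.
print(run_canary((5, 25, 50), lambda weight: True))
# Degrades once the slice reaches 25%: rolled back after two steps.
print(run_canary((5, 25, 50), lambda weight: weight < 25))
```

The point of the structure is that promotion is the exception path's opposite: traffic only ever moves forward after a gate has explicitly passed.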
Rolling update: the readiness probe is not optional
The outage in the opening scenario is the canonical rolling-update failure. Kubernetes shifts traffic to a new pod the moment it is ready, and with no readiness probe configured a pod counts as ready as soon as its container starts. If your app needs three seconds to open DB connections and warm a cache, that is a three-second window per pod where you are routing live traffic into a process that cannot serve it.
A correct readiness probe closes that window: Kubernetes withholds traffic until the probe passes. Setting minReadySeconds additionally makes the rollout wait until a new pod has stayed ready for that long before counting it as available and moving on to the next one. Without a probe, maxUnavailable: 25% does not protect you: a quarter of capacity can be “ready” and dead at the same time. The safe configuration for zero downtime is maxSurge: 1, maxUnavailable: 0 plus a real readiness probe: never drop below full capacity, and never send traffic to a pod that has not proven it can answer.
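On the application side, a readiness probe needs an endpoint that answers honestly. The following is a minimal sketch using only the Python standard library; the `/readyz` path, the `AppState` class and its two flags are assumptions chosen for illustration, and a real service would set the flags from its own startup code.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Startup flags, set by the app's own init code (hypothetical names)."""

    def __init__(self) -> None:
        self.db_connected = threading.Event()
        self.cache_warm = threading.Event()

    def ready(self) -> bool:
        # Ready only once every startup dependency has been satisfied.
        return self.db_connected.is_set() and self.cache_warm.is_set()


def make_handler(state: AppState):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/readyz":
                # 503 tells the prober to keep this pod out of rotation.
                code = 200 if state.ready() else 503
            else:
                code = 404
            self.send_response(code)
            self.end_headers()

        def log_message(self, format, *args) -> None:
            pass  # keep probe traffic out of the logs

    return Handler


def make_server(state: AppState, host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Build the probe server; the caller decides when to serve."""
    return HTTPServer((host, port), make_handler(state))
```

A service would call `make_server(state).serve_forever()` in a background thread at process start, then set `state.db_connected` and `state.cache_warm` as initialization completes. Until both are set, the probe fails and the pod receives no traffic.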
Why this works
A readiness probe and a liveness probe are not the same thing, and conflating them causes outages. Liveness restarts a pod it thinks is dead — point it at a slow dependency and a backend hiccup triggers a restart storm. Readiness only removes the pod from the load-balancer rotation. During a rollout it is readiness that gates traffic; a liveness probe doing double duty will happily kill pods that were merely busy.
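The distinction can be stated as two checks with different scopes. This is a simplified model, assuming a hypothetical `Health` snapshot with two fields; the key design choice it illustrates is that the liveness check deliberately ignores external dependencies.

```python
from dataclasses import dataclass


@dataclass
class Health:
    process_responsive: bool   # e.g. the main loop is still making progress
    database_reachable: bool   # an external dependency


def liveness_ok(health: Health) -> bool:
    # Liveness looks only at the process itself, never at dependencies.
    return health.process_responsive


def readiness_ok(health: Health) -> bool:
    # Readiness also requires the dependencies needed to serve a request.
    return health.process_responsive and health.database_reachable


def platform_action(health: Health) -> str:
    """What the platform does in response to the two probe results."""
    if not liveness_ok(health):
        return "restart container"
    if not readiness_ok(health):
        return "remove from rotation"
    return "serve traffic"


# A database hiccup takes the pod out of rotation but does not restart it.
print(platform_action(Health(process_responsive=True, database_reachable=False)))
```

If `liveness_ok` also checked the database, the same hiccup would restart every pod at once, which is the restart storm described above.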
Blue-green: instant rollback, but the database doesn’t flip
Blue-green’s selling point is the atomic flip and the instant rollback that comes with it. The trap is that you switch the application atomically but the database is shared — it does not flip. The day green ships a migration that drops a column or renames it, blue (still your rollback target) is now broken against the live schema. Flip back and it crashes. Your “instant rollback” is gone exactly when you need it.
The fix is the expand-contract (parallel-change) pattern: never make a breaking schema change in one step.
| Phase | Schema action | Why both versions survive |
|---|---|---|
| Expand | Add the new column/table; keep the old one | Old code ignores the new field; new code can read both — additive change is backward-compatible |
| Migrate | Backfill data; dual-write old + new | Both shapes stay populated, so a rollback to old code still finds its data |
| Contract | Drop the old column — in a later deploy | Only after the old version is fully retired and you’ll never roll back to it |
This is what “backward-compatible schema changes” means in practice, and it applies to rolling and canary too — any strategy with two app versions live at once needs a schema both versions can read.
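The three phases can be walked through against an in-memory SQLite database. This is a sketch under stated assumptions: the `users` table and the rename from `fullname` to `display_name` are hypothetical, and the contract step is left as a comment because it belongs to a later deploy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column and keep the old one. Old code is unaffected.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Migrate, part 1: backfill existing rows into the new column.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def new_code_insert(db: sqlite3.Connection, name: str) -> None:
    # Migrate, part 2: the new version dual-writes both columns.
    db.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


def old_code_read(db: sqlite3.Connection) -> list:
    # The old version (the rollback target) still reads the old column.
    return [row[0] for row in db.execute("SELECT fullname FROM users ORDER BY id")]


def new_code_read(db: sqlite3.Connection) -> list:
    return [row[0] for row in db.execute("SELECT display_name FROM users ORDER BY id")]


new_code_insert(conn, "Grace Hopper")

# Both versions see complete data, so rolling back to the old code is safe.
assert old_code_read(conn) == new_code_read(conn)

# Contract (a later deploy, once the old version is retired for good):
#   ALTER TABLE users DROP COLUMN fullname
```

A one-step rename would have broken `old_code_read` the moment the migration ran. Splitting it into phases keeps every intermediate schema readable by both versions.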
Choosing: blast radius vs resource cost vs rollback speed
There is no universally best strategy; you weigh three axes. Blast radius — how many users a bad release hits before you catch it: canary is smallest, recreate is total. Resource cost — blue-green pays for two full environments; canary and rolling reuse the same pool; recreate is cheapest. Rollback speed — blue-green and canary are near-instant (flip the router / shift weight to zero); rolling has to roll backward pod by pod; recreate means a second downtime.
The deciding factor is usually how good your observability is. Canary’s tiny blast radius is worthless if you cannot tell from your metrics that the 5% slice is failing — you will ramp it to 100% blind. Without solid dashboards and SLO-based alerts, a well-tuned rolling update with a readiness probe beats a canary you cannot read.
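One possible way to encode this reasoning is a small decision function. It is a sketch of the tradeoffs described here, not a rule from any tool; the function name, the parameter names and the order of the checks are all assumptions, and a real decision would weigh more context than four booleans.

```python
def choose_strategy(
    *,
    versions_can_coexist: bool,
    metrics_are_trustworthy: bool,
    can_shape_traffic: bool,
    can_afford_second_environment: bool,
) -> str:
    """Pick a rollout strategy from four yes/no facts about the service."""
    if not versions_can_coexist:
        # Two versions cannot overlap, so accept the downtime.
        return "recreate"
    if metrics_are_trustworthy and can_shape_traffic:
        # Smallest blast radius, but only if the canary slice can be read.
        return "canary"
    if can_afford_second_environment:
        # Pay double the resources for an atomic flip and instant rollback.
        return "blue-green"
    # The default, which is safe only with a real readiness probe.
    return "rolling update with readiness probe"


# Good dashboards, traffic shaping and spare capacity: canary wins.
print(choose_strategy(
    versions_can_coexist=True,
    metrics_are_trustworthy=True,
    can_shape_traffic=True,
    can_afford_second_environment=True,
))
```

Note that observability is checked before resource budget: without metrics you can trust, the function never returns canary, however much capacity is available.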
A payments API on Kubernetes is shipping a risky refactor. You have Prometheus dashboards and SLO alerts, ample cluster capacity, and need the smallest possible user impact if it goes wrong. Pick the rollout strategy.
A rolling update reports 'available' but ~30% of requests return 502s right after deploy. What's the most likely root cause?
You're using blue-green and want rollback to stay safe. Green's release renames a column. What do you do?
Order a safe canary rollout of a risky change:
1. Ship a backward-compatible (expand) schema change so old and new versions both work
2. Route 5% of traffic to the new version
3. Watch error-rate and latency against the SLO at this step
4. If healthy, ramp 25% → 50% → 100%; if not, shift weight back to 0 (rollback)
5. Once fully on the new version and stable, contract: drop the old schema in a later deploy
1. A rolling update is reporting success but users see intermittent 502s. Explain why, and what one change fixes it.
2. Why does blue-green's 'instant rollback' fail when a release changes the schema, and how does expand-contract restore it?
Four rollout shapes, one tradeoff. Recreate is simple but takes downtime, so it is fine only for non-HA services or maintenance windows. Rolling update is the Kubernetes default, tuned with maxSurge and maxUnavailable (both 25% by default), and it is only safe with a correct readiness probe: without one, traffic flows to pods that started but can’t serve, producing a “successful” deploy straight into a silent partial outage. Blue-green gives an atomic flip and instant rollback at double the resource cost, but the shared database doesn’t flip, so a breaking migration kills your rollback target unless you use expand-contract (add, dual-write, drop later). Canary has the smallest blast radius (5% → 25% → 50%, gated on metrics and automatable with Argo Rollouts or Flagger), but it is only as good as the observability reading the canary slice. Choose by weighing blast radius against resource cost against rollback speed, and remember the load-bearing prerequisites under all four: health checks, a rollback plan, and backward-compatible schema changes.