Rollout strategies: build a gated canary that rolls itself back

Hands-on project: build a metric-gated canary rollout that automatically rolls back a deliberately bad release, with a readiness probe and an expand-contract migration that prove rollback stays safe.

Level: Senior · Duration: 240 min

Reading about progressive delivery is not the same as watching a canary catch a bad release on 5% of traffic and roll itself back while you do nothing. Build one, ship a deliberately broken version into it, and prove that the gate, the readiness probe, and the expand-contract migration actually hold.

Goal

Turn the unit’s mental model into a working pipeline: a metric-gated canary that limits blast radius, a readiness probe that prevents the 502-on-rollout failure, and an expand-contract migration that keeps rollback safe across a schema change — each proven by an experiment, not asserted.
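To make the readiness-probe part concrete, here is a minimal sketch of a service that reports "not ready" until its startup work has finished. The `/readyz` path, the `ready` flag, and the helper names are assumptions chosen for illustration, not part of the lesson's reference service. A Kubernetes `readinessProbe` with an `httpGet` check could point at such an endpoint, so the pod receives no traffic while the endpoint returns 503.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag: set once startup work (config load, DB pool, cache
# warm-up) has finished. Until then the pod should receive no traffic.
ready = threading.Event()


def readyz_status() -> int:
    """HTTP status for the readiness endpoint: 503 while booting, 200 after."""
    return 200 if ready.is_set() else 503


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "/readyz" is an assumed path; every other path stands in for
        # real traffic and simply answers 200 in this sketch.
        status = readyz_status() if self.path == "/readyz" else 200
        body = b"ok\n" if status == 200 else b"not ready\n"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server(port: int = 0) -> HTTPServer:
    """Serve in a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Removing the readiness check (so the pod counts as ready the moment its container starts) gives the "probe absent" arm of the comparison; calling `ready.set()` only after warm-up gives the "probe present" arm.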

Project

Objective

Deploy a small HTTP service to Kubernetes behind an automated canary rollout, then prove the rollout's safety properties by deliberately shipping a bad release and a schema change and showing the system contains both — limited blast radius, no 502s, and a rollback that survives the migration.
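The "limited blast radius" claim can be checked with simple arithmetic on per-version request counts. The sketch below assumes you can export the number of requests served by the stable and canary versions over the rollout window; the function names and the sample counts are hypothetical and only illustrate the calculation.

```python
def canary_traffic_fraction(stable_requests: int, canary_requests: int) -> float:
    """Share of all requests in the window that were served by the canary."""
    total = stable_requests + canary_requests
    if total == 0:
        raise ValueError("no requests observed in the measurement window")
    return canary_requests / total


def within_blast_radius(stable_requests: int, canary_requests: int,
                        canary_weight: float = 0.05) -> bool:
    """True if the measured canary share stayed at or below the configured step."""
    return canary_traffic_fraction(stable_requests, canary_requests) <= canary_weight


if __name__ == "__main__":
    # Made-up counts for illustration, not measured results.
    fraction = canary_traffic_fraction(stable_requests=9620, canary_requests=380)
    print(f"{fraction:.1%} of traffic hit the canary")
    print("within 5% step:", within_blast_radius(9620, 380))
```

Traffic splitting is usually probabilistic, so a short window can land slightly above or below the configured weight; measuring over the whole rollout gives a steadier number.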

Requirements

Acceptance criteria

A before/after for Experiment 1: error count during the rollout with the probe absent vs present, measured under the same load — not estimated.
Rollout event logs from Experiment 2 showing the automated pause and rollback, plus the measured fraction of total traffic that hit the bad version (should be ≤ the 5% canary step).
Evidence from Experiment 3 that a rollback after the expand step still serves correctly, and a note on why the contract step is deferred to a later deploy.
A one-paragraph write-up choosing canary vs blue-green vs rolling for this service and justifying it against blast radius, resource cost, and rollback speed.
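For Experiment 3, the property to demonstrate is that the old version keeps working against the expanded schema. The following is a small sketch using an in-memory SQLite database; the `users` table, the `full_name` and `display_name` columns, and the v1/v2 helper functions are all hypothetical stand-ins for whatever your service actually stores.

```python
import sqlite3


def old_version_write(conn: sqlite3.Connection, name: str) -> None:
    # v1 code only knows the original column.
    conn.execute("INSERT INTO users (full_name) VALUES (?)", (name,))


def old_version_read(conn: sqlite3.Connection) -> list:
    return [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]


def expand(conn: sqlite3.Connection) -> None:
    # Expand step: additive only. The new column is nullable and the old
    # column is left untouched, so v1 keeps working unchanged.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")


def new_version_write(conn: sqlite3.Connection, name: str) -> None:
    # v2 dual-writes, so rows it creates remain readable by v1 after a rollback.
    conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)", (name, name)
    )


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
    old_version_write(conn, "Ada")    # v1 running
    expand(conn)                      # migration ships with v2
    new_version_write(conn, "Grace")  # v2 running
    old_version_write(conn, "Linus")  # rolled back to v1 on the expanded schema
    return old_version_read(conn)

# The contract step (dropping full_name) is deliberately not run here: once
# the old column is gone, v1 can no longer read or write, and rollback breaks.


if __name__ == "__main__":
    print(demo())
```

The rollback works because the expand step only adds to the schema. Dropping the old column in the same deploy would remove the thing the previous version depends on, which is why the contract step waits for a later release.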

Senior stretch

Add a one-page release runbook: how to read the analysis dashboard, when to abort manually, and the expand-contract checklist for any schema change.
Add a smoke/integration step to the analysis that must pass before the first traffic step, so a release failing basic checks never reaches even 5%.
Compare resource cost: measure peak pod count and memory during the canary vs a simulated blue-green of the same service, and show the roughly-double footprint of blue-green during cutover.
Wire the gate to a second signal beyond error rate (e.g. a business metric like checkout success), and show it catching a release that is technically 200-OK but functionally broken.
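The second-signal stretch goal can be sketched as a gate that rolls back if any configured signal is unhealthy. This is an illustrative decision function, not the API of any particular rollout controller; the signal names and thresholds are assumptions you would replace with your own SLOs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Signal:
    name: str
    value: float            # measured on the canary during the analysis window
    threshold: float        # assumed SLO boundary
    higher_is_better: bool  # True for success ratios, False for error rates

    def healthy(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


def gate(signals: List[Signal]) -> Tuple[str, List[str]]:
    """Return ("promote", []) or ("rollback", [names of failing signals])."""
    failed = [s.name for s in signals if not s.healthy()]
    return ("rollback", failed) if failed else ("promote", [])


if __name__ == "__main__":
    # Made-up numbers: HTTP error rate looks fine, but checkouts are failing.
    decision = gate([
        Signal("http_error_rate", value=0.002, threshold=0.01, higher_is_better=False),
        Signal("checkout_success", value=0.61, threshold=0.95, higher_is_better=True),
    ])
    print(decision)
```

With an error-rate-only gate, the release in this example would be promoted; adding the business metric is what turns a "200-OK but functionally broken" release into a rollback.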

Recap

This is the loop behind every safe release: limit blast radius with a gated canary, gate on real SLO metrics so the rollback is automatic rather than a 3am judgement call, keep the readiness probe load-bearing so traffic never reaches a half-booted pod, and make every schema change expand-contract so rollback survives it. Proving each property with a deliberate failure once on a toy cluster is what makes the production version muscle memory.
