Deployment & Infra
Rollout strategies: build a gated canary that rolls itself back
Reading about progressive delivery is not the same as watching a canary catch a bad release on 5% of traffic and roll itself back while you do nothing. Build one, ship a deliberately broken version into it, and prove that the gate, the readiness probe, and the expand-contract migration all hold.
Turn the unit’s mental model into a working pipeline: a metric-gated canary that limits blast radius, a readiness probe that prevents the 502-on-rollout failure, and an expand-contract migration that keeps rollback safe across a schema change — each proven by an experiment, not asserted.
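As a starting point for the readiness-probe part, here is a minimal sketch of a service that reports "not ready" until its startup work has finished. The `/readyz` and `/healthz` paths, the port, and the `warm_up` placeholder are assumptions for illustration, not names the unit prescribes. The Kubernetes side (a `readinessProbe` with an `httpGet` against the readiness path) is configuration and is not shown here.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once startup work (config load, DB connection, cache warm-up) is done.
ready = threading.Event()


def probe_status(path: str) -> int:
    """Map a probe path to an HTTP status. Path names are hypothetical."""
    if path == "/healthz":  # liveness: the process is running
        return 200
    if path == "/readyz":   # readiness: safe to receive traffic
        return 200 if ready.is_set() else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()

    def log_message(self, *args):
        # Keep frequent probe requests out of the access log.
        pass


def warm_up():
    # Placeholder for real startup work; flips readiness when finished.
    ready.set()


def serve(port: int = 8080):
    """Start listening first, then warm up in the background."""
    server = HTTPServer(("0.0.0.0", port), ProbeHandler)
    threading.Thread(target=warm_up, daemon=True).start()
    server.serve_forever()
```

Calling `serve()` starts the listener. Until `warm_up` completes, the readiness path returns 503, so a probe pointed at it keeps the pod out of the Service endpoints. Removing the probe is the "absent" arm of the before/after comparison.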
Deploy a small HTTP service to Kubernetes behind an automated canary rollout, then prove the rollout's safety properties by deliberately shipping a bad release and a schema change and showing the system contains both — limited blast radius, no 502s, and a rollback that survives the migration.
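The gating logic itself is small enough to sketch before wiring it to a real rollout controller. The example below is an illustration under assumed values: the 1% error-rate threshold, the 100-request minimum, and the step weights are placeholders, and the function names are hypothetical rather than any tool's API.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Canary request counts observed during one analysis interval."""
    requests: int
    errors: int


def gate(window: Window, max_error_rate: float = 0.01,
         min_requests: int = 100) -> str:
    """Return 'wait', 'promote', or 'rollback' for one analysis window."""
    if window.requests < min_requests:
        return "wait"  # too little traffic to judge either way
    rate = window.errors / window.requests
    return "rollback" if rate > max_error_rate else "promote"


def run_rollout(steps, windows, **gate_args):
    """Walk the traffic steps (percent) and stop at the first failed gate.

    Expects one observed window per step.
    Returns (final_state, last_weight_reached).
    """
    for weight, window in zip(steps, windows):
        verdict = gate(window, **gate_args)
        if verdict == "rollback":
            return "rolled_back", weight
        if verdict == "wait":
            return "paused", weight
    return "promoted", steps[-1]


# A bad release is stopped at the first step, so it never exceeds 5%.
bad = run_rollout([5, 25, 100],
                  [Window(1000, 200), Window(1000, 0), Window(1000, 0)])

# A healthy release passes every gate and reaches full traffic.
good = run_rollout([5, 25, 100],
                   [Window(1000, 2), Window(1000, 3), Window(1000, 1)])
```

The `min_requests` guard matters in practice: a gate that judges a canary on a handful of requests will either roll back on noise or promote on no evidence.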
- A before/after for Experiment 1: error count during the rollout with the probe absent vs present, measured under the same load — not estimated.
- Rollout event logs from Experiment 2 showing the automated pause and rollback, plus the measured fraction of total traffic that hit the bad version. It should be no higher than the 5% canary step.
- Evidence from Experiment 3 that a rollback after the expand step still serves correctly, and a note on why the contract step is deferred to a later deploy.
- A one-paragraph write-up choosing canary vs blue-green vs rolling for this service and justifying it against blast radius, resource cost, and rollback speed.
- Add a one-page release runbook: how to read the analysis dashboard, when to abort manually, and the expand-contract checklist for any schema change.
- Add a smoke/integration step to the analysis that must pass before the first traffic step, so a release failing basic checks never reaches even 5%.
- Compare resource cost: measure peak pod count and memory during the canary and during a simulated blue-green deploy of the same service, and show that blue-green roughly doubles the footprint during cutover.
- Wire the gate to a second signal beyond error rate (e.g. a business metric like checkout success), and show it catching a release that returns 200 OK but is functionally broken.
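The expand-contract property that Experiment 3 and the runbook checklist rely on can be rehearsed locally before touching the cluster. The sketch below uses an in-memory SQLite database. The `users` table, the `display_name` column, and the `v1_*`/`v2_*` functions are hypothetical stand-ins for the old and new versions of the service.

```python
import sqlite3


def v1_insert(db, name):
    # Old release: knows only the original column.
    db.execute("INSERT INTO users (name) VALUES (?)", (name,))


def v1_read(db):
    return [r[0] for r in db.execute("SELECT name FROM users ORDER BY id")]


def expand(db):
    # Additive only: a new nullable column plus a backfill.
    # Nothing the old release depends on is changed or removed.
    db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    db.execute(
        "UPDATE users SET display_name = name WHERE display_name IS NULL")


def v2_insert(db, name):
    # New release dual-writes so its rows stay readable after a rollback.
    db.execute("INSERT INTO users (name, display_name) VALUES (?, ?)",
               (name, name))


def v2_read(db):
    # Fall back to the old column for rows v1 wrote after the backfill.
    return [r[0] for r in db.execute(
        "SELECT COALESCE(display_name, name) FROM users ORDER BY id")]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

v1_insert(db, "ada")    # written before the migration
expand(db)              # expand step ships with the new release
v2_insert(db, "grace")  # written by the new release
v1_insert(db, "linus")  # rollback: old code runs on the expanded schema

rows_v1 = v1_read(db)
rows_v2 = v2_read(db)

# The contract step (dropping `name`) is deliberately not run here:
# once it runs, v1's queries fail and rollback is no longer safe.
```

Both `rows_v1` and `rows_v2` contain all three names, which is the evidence the rollback deliverable asks for. That is also why the contract step waits for a later deploy, after the old release is no longer a rollback target.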
This is the loop behind every safe release: limit blast radius with a gated canary, gate on real SLO metrics so the rollback is automatic rather than a 3am judgement call, keep the readiness probe load-bearing so traffic never reaches a half-booted pod, and make every schema change expand-contract so rollback survives it. Proving each property with a deliberate failure once on a toy cluster is what makes the production version muscle memory.