Databases DB · 06 · 05

Expand-contract: zero-downtime for breaking schema changes

Any breaking change — rename, drop, type-change — decomposes into an expand (add) phase and a contract (remove) phase, each independently deployable and reversible.

DB Middle ◷ 14 min

Level

FoundationsJuniorMiddleSenior

A team renames users.username to users.handle in one ALTER TABLE statement at deploy time. The migration commits instantly. But the rolling deploy is still in progress — old pods are still running SQL that references users.username. Every request from an old pod fails with “column does not exist”. A fast migration caused a real outage.

Why rename/drop/type-change cannot be done in one step

Think about what a rolling deploy actually looks like at any scale: for 30–120 seconds, two versions of your application run side by side against the same database. Any schema change that makes one version fail is an outage. A rolling deploy — the only safe deploy strategy at any scale — runs old and new code simultaneously for seconds to minutes. If a migration commits before all old pods drain:

Rename: old code references old column name → column does not exist errors.
Drop: old code references dropped column → same.
Type change: old code expects int, column is now text → type cast errors.

The invariant that must hold at every moment: the current schema is simultaneously compatible with the previous application version and the next one. A one-step rename, drop, or type change breaks this invariant.

The general expand-contract framework

Every breaking change decomposes into six phases:

Phase	Migration action	Code action	Both versions compatible?
1. Expand	Add new element alongside old	—	Yes — old code unaffected
2. Dual-write	—	Write to both old and new	Yes — old reads old, new writes both
3. Backfill	Copy historical data old → new in batches	—	Yes
4. Cut over reads	—	Read from new, still write both	Yes — full rollback still possible
5. Stop old writes	—	Write only to new	Yes — old element still present
6. Contract	Drop the old element	—	Yes — no code references old element

Column rename walkthrough: username → handle

The canonical example. Total span: 3–7 days.

Day 1 — Expand:

-- Migration 1: instant, no default
ALTER TABLE users ADD COLUMN handle TEXT;

Day 1 — Dual-write deploy: App code now writes both username and handle on every INSERT/UPDATE. Old code still reads username. New code reads username (for now).

Day 2 — Backfill:

-- Outside a migration transaction (batched loop):
UPDATE users SET handle = username
WHERE handle IS NULL
  AND id BETWEEN :batch_start AND :batch_end;
-- Repeat until SELECT COUNT(*) FROM users WHERE handle IS NULL = 0

Batches of 1k–10k rows, each a short transaction with pg_sleep(0.1) between.

Day 3 — Cut over reads: Deploy code that reads handle exclusively. Writes still go to both columns. Watch dashboards for 24h.

Day 5 — Stop dual-write: Deploy code that writes only handle. Wait for rolling deploy to complete.

Day 7 — Contract:

-- Migration 4: fast (metadata only)
ALTER TABLE users DROP COLUMN username;

Every step is independently deployable and reversible. Zero-downtime throughout.

The phases are logically simple but calendar-slow: the gaps between deploys (soak, monitoring, rolling-deploy drain) are the real cost — you cannot rush a zero-downtime rename into one step.

Down migrations are not rollback

Down migrations (reverse SQL of an up migration) appear in many tools (down.sql counterpart files). They work in development but are dangerous in production:

A down migration that drops handle also destroys all the data that new code wrote to it.
If the up migration was ADD COLUMN, the down is DROP COLUMN — destructive.

Production rollback is forward: write a new migration that corrects the problem, deploy it, and move forward. Treat down migrations as development scaffolding only.

Expand-contract deployment facts

Typical span for a column rename: 3–7 days
Number of deploys for a column rename: 5+ separate PRs
Backfill batch size: 1k–10k rows
pg_sleep between batches: 0.1 s (breathing room)
Each phase: independently reversible?: Yes
Production rollback direction: Forward (new migration)

▸Why this works

pgroll (by Xata) automates expand-contract via Postgres views. When you declare a migration that “renames username to handle”, pgroll creates a v1 view (exposes username) and a v2 view (exposes handle), both backed by the same underlying column, with triggers translating writes. Old pods connect with search_path = v1; new pods with search_path = v2. When migration completes, the v1 view is dropped. This collapses 5+ deploys into one declarative migration — at the cost of view-layer complexity in debugging and pg_dump output.

Quiz

Why can a direct RENAME COLUMN migration cause an outage during a rolling deploy?

Quiz

At which phase of expand-contract is rollback easiest, and why?

Quiz

A team runs a down migration in production that drops the new column added in the up migration. What data is lost?

Each phase is an independently deployable, reversible step. The schema stays a superset of old and new throughout, so a rolling deploy never hits a column-not-found error.

Recall before you leave

01
State the core invariant of expand-contract and explain why skipping the dual-write phase breaks it.
02
Walk through the six phases of rename: username → handle, naming the migration or code change at each step.
03
Why are down migrations unsafe in production and what is the correct production rollback strategy?

Recap

Expand-contract is the only correct pattern for zero-downtime rename, drop, or type-change migrations. It maintains a schema superset at every moment: both old and new code versions run simultaneously against a schema that supports both. A column rename becomes five independent deploys over 3–7 days: add new column (expand), deploy dual-write code, backfill historical rows in batches, deploy read-swap code, deploy drop-old-write code, then drop the old column (contract). Down migrations appear safe but destroy data already written by the new code; production rollback is always a forward migration that fixes the problem. Now when you see a PR that renames or drops a column in a single migration, you can explain exactly what will break during the rolling deploy — and what the correct six-phase fix looks like.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 7 done

Connected lessons

builds on

Safe DDL patterns: NOT VALID, CONCURRENTLY, and unsafe-op fixesmiddle

unlocks

Advisory locks, migration tools, and deploy coordinationsenior

deepens into

appears again in166

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.