awesome-everything RU
↑ Back to the climb

Databases

Expand-contract: zero-downtime for breaking schema changes

Crux Any breaking change — rename, drop, type-change — decomposes into an expand (add) phase and a contract (remove) phase, each independently deployable and reversible.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

A team renames users.username to users.handle in one ALTER TABLE statement at deploy time. The migration commits instantly. But the rolling deploy is still in progress — old pods are still running SQL that references users.username. Every request from an old pod fails with “column does not exist”. A fast migration caused a real outage.

Why rename/drop/type-change cannot be done in one step

A rolling deploy — the only safe deploy strategy at any scale — runs old and new code simultaneously for seconds to minutes. If a migration commits before all old pods drain:

  • Rename: old code references old column name → column does not exist errors.
  • Drop: old code references dropped column → same.
  • Type change: old code expects int, column is now text → type cast errors.

The invariant that must hold at every moment: the current schema is simultaneously compatible with the previous application version and the next one. A one-step rename, drop, or type change breaks this invariant.

The general expand-contract framework

Every breaking change decomposes into six phases:

PhaseMigration actionCode actionBoth versions compatible?
1. ExpandAdd new element alongside oldYes — old code unaffected
2. Dual-writeWrite to both old and newYes — old reads old, new writes both
3. BackfillCopy historical data old → new in batchesYes
4. Cut over readsRead from new, still write bothYes — full rollback still possible
5. Stop old writesWrite only to newYes — old element still present
6. ContractDrop the old elementYes — no code references old element

Column rename walkthrough: username → handle

The canonical example. Total span: 3–7 days.

Day 1 — Expand:

-- Migration 1: instant, no default
ALTER TABLE users ADD COLUMN handle TEXT;

Day 1 — Dual-write deploy: App code now writes both username and handle on every INSERT/UPDATE. Old code still reads username. New code reads username (for now).

Day 2 — Backfill:

-- Outside a migration transaction (batched loop):
UPDATE users SET handle = username
WHERE handle IS NULL
  AND id BETWEEN :batch_start AND :batch_end;
-- Repeat until SELECT COUNT(*) FROM users WHERE handle IS NULL = 0

Batches of 1k–10k rows, each a short transaction with pg_sleep(0.1) between.

Day 3 — Cut over reads: Deploy code that reads handle exclusively. Writes still go to both columns. Watch dashboards for 24h.

Day 5 — Stop dual-write: Deploy code that writes only handle. Wait for rolling deploy to complete.

Day 7 — Contract:

-- Migration 4: fast (metadata only)
ALTER TABLE users DROP COLUMN username;

Every step is independently deployable and reversible. Zero-downtime throughout.

Down migrations are not rollback

Down migrations (reverse SQL of an up migration) appear in many tools (down.sql counterpart files). They work in development but are dangerous in production:

  • A down migration that drops handle also destroys all the data that new code wrote to it.
  • If the up migration was ADD COLUMN, the down is DROP COLUMN — destructive.

Production rollback is forward: write a new migration that corrects the problem, deploy it, and move forward. Treat down migrations as development scaffolding only.

Expand-contract deployment facts
Typical span for a column rename
3–7 days
Number of deploys for a column rename
5+ separate PRs
Backfill batch size
1k–10k rows
pg_sleep between batches
0.1 s (breathing room)
Each phase: independently reversible?
Yes
Production rollback direction
Forward (new migration)
Why this works

pgroll (by Xata) automates expand-contract via Postgres views. When you declare a migration that “renames username to handle”, pgroll creates a v1 view (exposes username) and a v2 view (exposes handle), both backed by the same underlying column, with triggers translating writes. Old pods connect with search_path = v1; new pods with search_path = v2. When migration completes, the v1 view is dropped. This collapses 5+ deploys into one declarative migration — at the cost of view-layer complexity in debugging and pg_dump output.

Quiz

Why can a direct RENAME COLUMN migration cause an outage during a rolling deploy?

Quiz

At which phase of expand-contract is rollback easiest, and why?

Quiz

A team runs a down migration in production that drops the new column added in the up migration. What data is lost?

Recall before you leave
  1. 01
    State the core invariant of expand-contract and explain why skipping the dual-write phase breaks it.
  2. 02
    Walk through the six phases of rename: username → handle, naming the migration or code change at each step.
  3. 03
    Why are down migrations unsafe in production and what is the correct production rollback strategy?
Recap

Expand-contract is the only correct pattern for zero-downtime rename, drop, or type-change migrations. It maintains a schema superset at every moment: both old and new code versions run simultaneously against a schema that supports both. A column rename becomes five independent deploys over 3–7 days: add new column (expand), deploy dual-write code, backfill historical rows in batches, deploy read-swap code, deploy drop-old-write code, then drop the old column (contract). Down migrations appear safe but destroy data already written by the new code; production rollback is always a forward migration that fixes the problem.

Connected lessons
appears again in140
Continue the climb ↑Advisory locks, migration tools, and deploy coordination
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.