awesome-everything RU
↑ Back to the climb

Databases

Schema migrations: a zero-downtime column rename under load

Crux Hands-on project: rename a column zero-downtime via expand-contract on a loaded Postgres table, proving lock-safety, a batched backfill, and both-version compatibility with evidence at every step.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 240 min

Reading about expand-contract is not the same as renaming a column on a loaded table without dropping a single request. Stand up a Postgres table with tens of millions of rows, drive read/write traffic at it, and migrate users.username to users.handle through every phase — proving lock-safety and zero data loss with measurements, not assertions.

Goal

Turn the unit’s mental model into a reproducible runbook: provision a loaded table under live traffic, decompose a breaking rename into safe expand-contract phases, defend every DDL step with lock_timeout, run a batched backfill that never floods WAL, and verify both code versions stay healthy throughout.

Project
0 of 8
Objective

Rename a column on a 10M+-row Postgres table from username to handle with zero downtime, using the full expand-contract sequence, while a load generator runs both an old (reads/writes username) and a new (reads/writes handle) client against the database the whole time.

Requirements
Acceptance criteria
  • A request-error timeline for both v1 and v2 across all six phases showing zero failed requests attributable to schema incompatibility — measured under live load, not assumed.
  • A before/after lock evidence pair: a pg_locks snapshot during a contended ALTER showing the lock_timeout abort-and-retry path, versus the same ALTER succeeding in a quiet window.
  • Backfill metrics: batch count, rows per batch, and a replication-lag (or WAL-rate) graph that stays bounded for the batched run, next to the single-statement UPDATE's WAL spike on the throwaway copy.
  • A one-paragraph write-up naming, for each phase, the invariant that held (schema compatible with both versions) and the specific failure it prevented (rename outage, lock freeze, WAL flood).
Senior stretch
  • Replace the manual SQL with a migration tool (Atlas or golang-migrate): wire pg_advisory_lock runner serialisation, lock_timeout + retries, and a non-transactional annotation on a CREATE INDEX CONCURRENTLY on handle. Verify indisvalid after the build.
  • Reproduce the lock-queue incident deliberately: hold a 60-second analytics SELECT, fire the ALTER with lock_timeout = 0, and capture the 503s and pool exhaustion — then re-run with lock_timeout = '2s' and show the freeze is gone.
  • Add migration observability: alert on migration retries > 3, any post-deploy indisvalid index, replication lag > 10 s during backfill, and migration duration > 30 s — then trip each alert intentionally.
  • Wire Squawk into CI on the migrations directory and feed it the unsafe one-liner forms (RENAME COLUMN, CREATE INDEX without CONCURRENTLY, ADD COLUMN with a volatile DEFAULT); show each is caught at PR time.
Recap

This is the loop you will run for every breaking schema change in production: decompose into expand-contract phases so the schema is compatible with both code versions at every moment, defend each DDL step with lock_timeout so a contended ALTER aborts and retries instead of freezing the table, backfill in batches so WAL stays bounded and replicas keep up, add constraints with NOT VALID + VALIDATE so DML never blocks, and roll back forward — never with a destructive down migration. Doing it once on a loaded toy table under live traffic turns the production version into muscle memory.

Continue the climb ↑Why sharding exists: the single-Postgres ceiling
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.