Databases DB · 06 · 10

Schema migrations: a zero-downtime column rename under load

Hands-on project: rename a column zero-downtime via expand-contract on a loaded Postgres table, proving lock-safety, a batched backfill, and both-version compatibility with evidence at every step.

DB Senior ◷ 240 min

Level

FoundationsJuniorMiddleSenior

Reading about expand-contract is not the same as renaming a column on a loaded table without dropping a single request. Stand up a Postgres table with tens of millions of rows, drive read/write traffic at it, and migrate users.username to users.handle through every phase — proving lock-safety and zero data loss with measurements, not assertions.

Goal

Turn the unit’s mental model into a reproducible runbook: provision a loaded table under live traffic, decompose a breaking rename into safe expand-contract phases, defend every DDL step with lock_timeout, run a batched backfill that never floods WAL, and verify both code versions stay healthy throughout.

Project

0 of 8

Objective

Rename a column on a 10M+-row Postgres table from username to handle with zero downtime, using the full expand-contract sequence, while a load generator runs both an old (reads/writes username) and a new (reads/writes handle) client against the database the whole time.

Requirements

Acceptance criteria

A request-error timeline for both v1 and v2 across all six phases showing zero failed requests attributable to schema incompatibility — measured under live load, not assumed.
A before/after lock evidence pair: a pg_locks snapshot during a contended ALTER showing the lock_timeout abort-and-retry path, versus the same ALTER succeeding in a quiet window.
Backfill metrics: batch count, rows per batch, and a replication-lag (or WAL-rate) graph that stays bounded for the batched run, next to the single-statement UPDATE's WAL spike on the throwaway copy.
A one-paragraph write-up naming, for each phase, the invariant that held (schema compatible with both versions) and the specific failure it prevented (rename outage, lock freeze, WAL flood).

Senior stretch

Replace the manual SQL with a migration tool (Atlas or golang-migrate): wire pg_advisory_lock runner serialisation, lock_timeout + retries, and a non-transactional annotation on a CREATE INDEX CONCURRENTLY on handle. Verify indisvalid after the build.
Reproduce the lock-queue incident deliberately: hold a 60-second analytics SELECT, fire the ALTER with lock_timeout = 0, and capture the 503s and pool exhaustion — then re-run with lock_timeout = '2s' and show the freeze is gone.
Add migration observability: alert on migration retries > 3, any post-deploy indisvalid index, replication lag > 10 s during backfill, and migration duration > 30 s — then trip each alert intentionally.
Wire Squawk into CI on the migrations directory and feed it the unsafe one-liner forms (RENAME COLUMN, CREATE INDEX without CONCURRENTLY, ADD COLUMN with a volatile DEFAULT); show each is caught at PR time.

Recap

This is the loop you will run for every breaking schema change in production: decompose into expand-contract phases so the schema is compatible with both code versions at every moment, defend each DDL step with lock_timeout so a contended ALTER aborts and retries instead of freezing the table, backfill in batches so WAL stays bounded and replicas keep up, add constraints with NOT VALID + VALIDATE so DML never blocks, and roll back forward — never with a destructive down migration. Doing it once on a loaded toy table under live traffic turns the production version into muscle memory.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.