Migration failure taxonomy and production discipline

Nine named failure modes — lock-queue freeze, INVALID index, WAL flood, schema drift — each with a detection signal and a durable fix.

Six months after introducing a migration pipeline, a team has had no lock-queue incidents. Then a backfill update generates 40 GB of WAL in ten minutes — replication lag climbs to 120 seconds, standby reads go stale, and two read endpoints return rows that contradict each other. The migration ran fine; the database failed around it.

The nine migration failure modes

Senior engineers do not just know that migrations can fail — they can name the failure, read its signal, and apply the correct fix before the incident grows. The table below is the mental model.

Mode	Signal	Durable fix
(a) Lock-queue freeze	Table frozen, pool exhausted, 503s	lock_timeout + retries (lesson 03)
(b) INVALID index	pg_index.indisvalid = false post-deploy	DROP INDEX CONCURRENTLY + retry; alert on indisvalid
(c) Migration deadlock	ERROR: deadlock detected in migration log	Serialise via advisory lock; never parallel migrations on related tables
(d) Rollback destroying data	Data loss discovered after down migration	Never use down migrations in production; use forward fixes
(e) Schema drift on replicas	Standby queries fail; replica lag metric spikes	Gate code deploy on replica lag near-zero; use replica-aware tooling
(f) Backfill WAL flood	WAL generation rate spikes; replica lag climbs; disk fills	Batch UPDATEs in 1k–10k rows; pg_sleep between batches; monitor WAL rate
(g) Volatile default hidden rewrite	Migration took minutes; table was rewritten unexpectedly	Squawk catches volatile defaults such as `DEFAULT random()` in CI; constant default + post-migration update
(h) NOT NULL without backfill	ALTER COLUMN SET NOT NULL fails at apply time	Backfill first; use NOT VALID + VALIDATE pattern (lesson 04)
(i) RENAME during rolling deploy	Old pods: column does not exist errors	Expand-contract instead of single-step rename (lesson 05)
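
For mode (b), the detection signal can be sketched as a catalog query. This is a minimal sketch assuming the standard pg_index, pg_class and pg_namespace catalogs; the index name idx_orders_user_id in the rebuild step is hypothetical. An index that is still being built concurrently also reports indisvalid = false, so run the check after the migration has finished.

```sql
-- List indexes left INVALID, e.g. by a failed CREATE INDEX CONCURRENTLY.
SELECT n.nspname AS schema_name,
       c.relname AS index_name,
       t.relname AS table_name
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indexrelid
JOIN pg_class t     ON t.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;

-- Rebuild (hypothetical index name). Both statements must run
-- outside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_user_id;
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);
```

An empty result from the first query is the healthy state; any row is what the "alert on indisvalid" fix would page on.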

The backfill WAL flood in depth

A naive backfill runs one large UPDATE:

-- NEVER do this on a large table:
UPDATE users SET handle = username WHERE handle IS NULL;

On 100M rows, this writes WAL (Write-Ahead Log) records for every updated row, potentially 20–50 GB of WAL in minutes. Replicas must replay this WAL before the changes become visible to their reads, so replication lag can spike to minutes or longer, and standby read endpoints return stale data for that window. Replay can also conflict with queries already running on the standby: once a query has blocked replay for longer than max_standby_streaming_delay, Postgres cancels it.

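Durable fix: batch in 1k–10k row chunks, commit each batch, and leave breathing room between batches:

DO $$
DECLARE
  batch INT;
BEGIN
  LOOP
    -- Update at most 5000 rows per pass. The username IS NOT NULL guard
    -- keeps the loop from spinning forever on rows it can never fill.
    UPDATE users SET handle = username
    WHERE handle IS NULL
      AND ctid IN (
        SELECT ctid FROM users
        WHERE handle IS NULL AND username IS NOT NULL
        LIMIT 5000
      );
    GET DIAGNOSTICS batch = ROW_COUNT;
    -- Commit so each batch is its own transaction (PostgreSQL 11+;
    -- the DO block must not run inside an outer transaction block).
    COMMIT;
    EXIT WHEN batch = 0;
    PERFORM pg_sleep(0.1);  -- let replicas catch up between batches
  END LOOP;
END $$;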

Monitor pg_stat_replication on the primary during the backfill: the gap between sent_lsn and replay_lsn should stay near zero.
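
One way to read that gap is sketched below. It uses the built-in pg_wal_lsn_diff and pg_size_pretty functions (PostgreSQL 10+ column names); the column alias is illustrative. Roles without pg_monitor membership may see NULLs in these columns.

```sql
-- Run on the primary. One row per connected standby.
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS unreplayed_wal,
       replay_lag
FROM pg_stat_replication;
```

unreplayed_wal is the WAL a standby has been sent but not yet replayed; replay_lag expresses the same delay as a time interval.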

Same task, two blast radii: one big UPDATE floods WAL and lags replicas; batching with pg_sleep bounds WAL per transaction and keeps lag near zero.

Schema drift on replicas

A migration applied on the primary propagates to replicas via streaming replication. Replication lag (normal range: <1 s; under load: 5–30 s) means replicas may see the old schema for seconds after the migration commits. If code is deployed before replication catches up:

Read replicas serve queries against old schema.
New code expecting the new column gets "column does not exist" errors from standby reads, or NULLs where a backfill has not yet replayed.

Durable fix: gate the code rollout on replication lag approaching zero. Monitor replay_lag in pg_stat_replication on the primary, and add a replica-lag query to the pipeline's pre-deploy checks if your migration tooling does not already provide one.
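
A pre-deploy gate can also ask each standby directly whether it has replayed the schema change. The sketch below assumes the new column is users.handle in the public schema (borrowed from the backfill example above); adapt the names to the migration being gated.

```sql
-- Run on each standby before routing the new code's reads to it.
SELECT pg_is_in_recovery() AS is_standby,
       EXISTS (
         SELECT 1
         FROM information_schema.columns
         WHERE table_schema = 'public'
           AND table_name   = 'users'
           AND column_name  = 'handle'
       ) AS has_new_column,
       now() - pg_last_xact_replay_timestamp() AS time_since_last_replay;
```

The deploy proceeds only when has_new_column is true on every standby. time_since_last_replay overstates lag when the primary is idle, so treat it as a hint rather than a hard gate.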

Squawk CI and strategic migration posture

Squawk, an open-source linter for Postgres migrations, parses migration SQL and warns or errors on unsafe patterns:

ADD COLUMN with volatile DEFAULT → error
ALTER COLUMN TYPE without cast coercibility check → error
CREATE INDEX without CONCURRENTLY → error
RENAME COLUMN / TABLE → warn
DROP COLUMN without prior code-deploy confirmation → warn

Run Squawk on every PR touching migrations/**. Cost: under 30 s per migration PR. Benefit: catches the most common failure modes before merge.
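
For two of the patterns above, the flagged statement and a safer rewrite might look like the following sketch. The index name idx_orders_user_id and the users.token column are hypothetical; gen_random_uuid() is built in from PostgreSQL 13 and otherwise comes from the pgcrypto extension.

```sql
-- Flagged: blocks writes to orders for the duration of the build.
-- CREATE INDEX idx_orders_user_id ON orders (user_id);

-- Safer: builds without blocking writes; must run outside a
-- transaction block.
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);

-- Flagged: a volatile default forces a full table rewrite.
-- ALTER TABLE users ADD COLUMN token uuid DEFAULT gen_random_uuid();

-- Safer: add the column with no default (metadata-only change),
-- set the default for future rows, then backfill existing rows
-- in batches.
ALTER TABLE users ADD COLUMN token uuid;
ALTER TABLE users ALTER COLUMN token SET DEFAULT gen_random_uuid();
```

Fixing the migration this way, rather than suppressing the lint rule, keeps the CI check meaningful for the next PR.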

The strategic posture: treat migration code with the same discipline as application code — PR review, CI lint, staging deploy on production-size data, runbook entry, observability on runtime and lock acquisition. Senior teams ship breaking changes routinely; the difference is that every change is planned, linted, observable, and forward-rollback-capable.

Migration observability targets

Alert threshold: migration retries: > 3 → page on-call
Alert threshold: INVALID index (indisvalid = false) post-deploy: any → page
Alert threshold: migration duration: > 30 s → warn (rewrite?)
Alert threshold: replication lag during backfill: > 10 s → slow batches
Squawk CI runtime: < 30 s per migration PR
Production schema changes (mature teams): Daily
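
The "replication lag during backfill: > 10 s → slow batches" target can be built into the backfill loop itself. The sketch below is one possible shape, not a prescribed implementation: the 5-second back-off is an arbitrary illustrative value, and the block assumes PostgreSQL 11+ (COMMIT inside DO), a role that can read pg_stat_replication, and execution outside any outer transaction.

```sql
DO $$
DECLARE
  batch   INT;
  lag_sec NUMERIC;
BEGIN
  LOOP
    UPDATE users SET handle = username
    WHERE handle IS NULL
      AND ctid IN (
        SELECT ctid FROM users
        WHERE handle IS NULL AND username IS NOT NULL
        LIMIT 5000
      );
    GET DIAGNOSTICS batch = ROW_COUNT;
    COMMIT;  -- each batch is its own transaction
    EXIT WHEN batch = 0;

    -- Worst replay lag across standbys, in seconds (0 if none report).
    SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0)
      INTO lag_sec
      FROM pg_stat_replication;

    IF lag_sec > 10 THEN
      PERFORM pg_sleep(5);    -- back off while replicas catch up
    ELSE
      PERFORM pg_sleep(0.1);  -- normal pacing
    END IF;
  END LOOP;
END $$;
```

The loop slows itself down when replicas fall behind, so the alert threshold acts as a control input rather than only a page.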

Why this works

Why does Postgres use WAL for replicas instead of just copying changed rows? WAL is the source of truth for crash recovery and point-in-time restore. Every change is recorded as a WAL record, and that record must reach disk before the modified heap page does. Streaming replication simply ships the WAL and replays it on standbys. This means backfill operations that touch millions of rows generate millions of WAL records; on ordinary (logged) tables there is no way to suppress WAL generation for DML. Batching does not reduce the total WAL written. It caps how much each transaction produces and, with pauses between batches, spreads the volume over time so standbys can replay it as it arrives.
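
To see how much WAL a batch actually produces, the WAL position can be sampled before and after it. This is a rough sketch meant for a staging primary: the wal_mark temp table is a hypothetical helper, and the measurement is server-wide, so concurrent activity inflates it.

```sql
-- Record the WAL position before the statement under test.
CREATE TEMP TABLE wal_mark AS
SELECT pg_current_wal_lsn() AS start_lsn;

-- One 5000-row batch of the backfill.
UPDATE users SET handle = username
WHERE handle IS NULL
  AND ctid IN (
    SELECT ctid FROM users
    WHERE handle IS NULL AND username IS NOT NULL
    LIMIT 5000
  );

-- WAL written since the mark, including other sessions' activity.
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), start_lsn)
       ) AS wal_written
FROM wal_mark;
```

Multiplying the per-batch figure by the expected number of batches gives an estimate of the total WAL the backfill will push to replicas.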

Quiz

A large single UPDATE backfill generates 40 GB of WAL in minutes. What is the first observable symptom on a primary + 2 replica setup?

Quiz

Squawk runs in CI and flags `CREATE INDEX ON orders(user_id)` (without CONCURRENTLY). What is the correct response?

Quiz

A migration applies on the primary. Replication lag is currently 15 seconds. What happens if the code deploy starts immediately?

Top chain: a single large UPDATE floods WAL, replicas fall behind, standby reads go stale. Bottom chain: batching with pg_sleep bounds WAL per transaction so replicas stay caught up.

Recall before you leave

01 Why does a large single-statement UPDATE backfill flood WAL, and what is the batch-size guideline?
02 What is schema drift on replicas and how does gating the code deploy on replication lag prevent it?
03 Name four things Squawk checks for in CI and explain why each is unsafe without the check.

Recap

Senior migration discipline names nine failure modes and builds observability for each. Lock-queue freeze (mode a) is the most common; the fix is lock_timeout plus retries. INVALID index (b) is detected by monitoring pg_index.indisvalid post-deploy; the fix is DROP INDEX CONCURRENTLY plus a retry. Backfill WAL flood (f) spikes replication lag; the fix is 1k–10k row batches with pg_sleep between them. Schema drift on replicas (e) breaks standby reads when code ships before the DDL has replayed; the fix is to gate the code deploy on near-zero replication lag. Squawk CI catches unsafe DDL at PR time: volatile defaults, non-concurrent indexes, renames, and coercibility-unchecked type changes. Mature teams ship schema changes daily because their tooling makes safety the path of least resistance, not a special-occasion discipline. When you see a migration incident on your on-call shift, run through the nine names: which mode is it, what is the signal, and what is the durable fix, not just for tonight, but so it never pages you again.
