awesome-everything RU
↑ Back to the climb

Databases

Migration failure taxonomy and production discipline

Crux Nine named failure modes — lock-queue freeze, INVALID index, WAL flood, schema drift — each with a detection signal and a durable fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min

Six months after introducing a migration pipeline, a team has had no lock-queue incidents. Then a backfill update generates 40 GB of WAL in ten minutes — replication lag climbs to 120 seconds, standby reads go stale, and two read endpoints return rows that contradict each other. The migration ran fine; the database failed around it.

The nine migration failure modes

ModeSignalDurable fix
(a) Lock-queue freezeTable frozen, pool exhausted, 503slock_timeout + retries (lesson 03)
(b) INVALID indexpg_indexes indisvalid = false post-deployDROP INDEX CONCURRENTLY + retry; alert on indisvalid
(c) Migration deadlockERROR: deadlock detected in migration logSerialise via advisory lock; never parallel migrations on related tables
(d) Rollback destroying dataData loss discovered after down migrationNever use down migrations in production; use forward fixes
(e) Schema drift on replicasStandby queries fail; replica lag metric spikesGate code deploy on replica lag near-zero; use replica-aware tooling
(f) Backfill WAL floodWAL generation rate spikes; replica lag climbs; disk fillsBatch UPDATEs in 1k–10k rows; pg_sleep between batches; monitor WAL rate
(g) Volatile default hidden rewriteMigration took minutes; table was rewritten unexpectedlySquawk catches DEFAULT now() in CI; constant default + post-migration update
(h) NOT NULL without backfillALTER COLUMN SET NOT NULL fails at apply timeBackfill first; use NOT VALID + VALIDATE pattern (lesson 04)
(i) RENAME during rolling deployOld pods: column does not exist errorsExpand-contract instead of single-step rename (lesson 05)

The backfill WAL flood in depth

A naive backfill runs one large UPDATE:

-- NEVER do this on a large table:
UPDATE users SET handle = username WHERE handle IS NULL;

On 100M rows, this generates a WAL (Write-Ahead Log) entry for every updated row — potentially 20–50 GB of WAL in minutes. Replicas must consume this WAL before they can serve reads; replication lag spikes to minutes or longer. During that window, standby read endpoints return stale data. If the lag exceeds max_standby_streaming_delay, Postgres cancels conflicting queries on the standby.

Durable fix: batch in 1k–10k row chunks with breathing room:

DO $$
DECLARE
  batch INT;
BEGIN
  LOOP
    UPDATE users SET handle = username
    WHERE handle IS NULL
      AND ctid IN (
        SELECT ctid FROM users WHERE handle IS NULL LIMIT 5000
      );
    GET DIAGNOSTICS batch = ROW_COUNT;
    EXIT WHEN batch = 0;
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;

Monitor SELECT * FROM pg_stat_replication — watch sent_lsn - replay_lsn stay near zero during the backfill.

Schema drift on replicas

A migration applied on the primary propagates to replicas via streaming replication. Replication lag (normal range: <1 s; under load: 5–30 s) means replicas may see the old schema for seconds after the migration commits. If code is deployed before replication catches up:

  • Read replicas serve queries against old schema.
  • New code expecting the new column gets NULLs or errors from standby reads.

Durable fix: gate the code rollout on replication lag approaching zero. Monitor via pg_stat_replication.replay_lag on the primary. Most migration tools’ pre-deploy checks include a replica-lag query.

Squawk CI and strategic migration posture

Squawk (Linux Foundation) parses migration SQL and warns or errors on unsafe patterns:

  • ADD COLUMN with volatile DEFAULT → error
  • ALTER COLUMN TYPE without cast coercibility check → error
  • CREATE INDEX without CONCURRENTLY → error
  • RENAME COLUMN / TABLE → warn
  • DROP COLUMN without prior code-deploy confirmation → warn

Run Squawk on every PR touching migrations/**. Cost: under 30 s per migration PR. Benefit: catches the most common failure modes before merge.

The strategic posture: treat migration code with the same discipline as application code — PR review, CI lint, staging deploy on production-size data, runbook entry, observability on runtime and lock acquisition. Senior teams ship breaking changes routinely; the difference is that every change is planned, linted, observable, and forward-rollback-capable.

Migration observability targets
Alert threshold: migration retries
&gt; 3 → page on-call
Alert threshold: indisvalid index post-deploy
Any = page
Alert threshold: migration duration
&gt; 30 s → warn (rewrite?)
Alert threshold: replication lag during backfill
&gt; 10 s → slow batches
Squawk CI runtime
&lt; 30 s per migration PR
Production schema changes (mature teams)
Daily
Why this works

Why does Postgres use WAL for replicas instead of just copying changed rows? WAL is the source of truth for crash recovery and point-in-time restore. Every change is recorded as a WAL entry before it is applied to the heap. Streaming replication simply tails the WAL and replays it on standbys. This means backfill operations that touch millions of rows generate millions of WAL entries — there is no way to suppress WAL generation for DML. Batching keeps WAL volume manageable by limiting the number of rows changed per transaction.

Quiz

A large single UPDATE backfill generates 40 GB of WAL in minutes. What is the first observable symptom on a primary + 2 replica setup?

Quiz

Squawk runs in CI and flags `CREATE INDEX ON orders(user_id)` (without CONCURRENTLY). What is the correct response?

Quiz

A migration applies on the primary. Replication lag is currently 15 seconds. What happens if the code deploy starts immediately?

Recall before you leave
  1. 01
    Why does a large single-statement UPDATE backfill flood WAL, and what is the batch-size guideline?
  2. 02
    What is schema drift on replicas and how does gating the code deploy on replication lag prevent it?
  3. 03
    Name four things Squawk checks for in CI and explain why each is unsafe without the check.
Recap

Senior migration discipline names nine failure modes and builds observability for each. Lock-queue freeze (mode a) is the most common — fix: lock_timeout + retries. INVALID index (b) is detected by monitoring indisvalid post-deploy — fix: DROP INDEX CONCURRENTLY + retry. Backfill WAL flood (f) spikes replication lag — fix: 1k–10k row batches with pg_sleep. Schema drift on replicas (e) causes stale reads after fast-propagating DDL — fix: gate code deploy on near-zero replication lag. Squawk CI catches unsafe DDL at PR time: volatile defaults, non-concurrent indexes, renames, and coercibility-unchecked type changes. Mature teams ship schema changes daily because their tooling makes safety the path of least resistance, not a special-occasion discipline.

Connected lessons
appears again in258
Continue the climb ↑Schema migrations: multiple-choice review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.