Databases DB · 03 · 07

Production failure modes and plan stability

Eight real failure modes — stale stats, generic-plan trap, correlated columns, SSD mis-tuning, JIT overhead, work_mem spill, random_page_cost drift, and plan flips — plus the operational strategy to detect and prevent plan regressions across deploys.

DB Senior ◷ 18 min

Level

FoundationsJuniorMiddleSenior

A B2B SaaS product’s orders dashboard degrades every Monday morning. By Tuesday it is fine again. The pattern: a batch import runs Sunday night, shifting data distributions; autovacuum has not triggered ANALYZE yet by Monday morning; the planner picks a Nested Loop on a 1000× underestimated join. ANALYZE in the post-batch hook eliminates the incident forever.

Eight real production failure modes

Failure mode	Symptom	Fix
Stale statistics cascade	Sudden slowdown after bulk insert; rows-estimated wildly off actual	ANALYZE in post-batch hook; lower autovacuum_analyze_scale_factor
Generic-plan trap	High stddev_exec_time; bimodal latency on prepared statement	plan_cache_mode = force_custom_plan for the role
Correlated-column blowup	Nested Loop running 1M iterations; rows-estimated ~1000× low	CREATE STATISTICS (dependencies, mcv); ANALYZE
random_page_cost mis-tuned (SSD)	Planner picks Seq Scan where Index Scan should win; queries read whole tables	SET random_page_cost = 1.1 in postgresql.conf
JIT overhead on OLTP	Short queries slower than expected; JIT compilation time exceeds execution savings	Raise jit_above_cost to 500000 or SET jit = off for OLTP role
work_mem spill	Sort Method: external merge; Hash Batches > 1; queries slow on joins/sorts	SET work_mem = ‘32MB’ per session; not globally without memory budget check
effective_cache_size too low	Planner prefers Seq Scan even on SSD with large RAM because it assumes cold data	SET effective_cache_size = ‘24GB’ (~75% of total RAM)
Plan flips with workload phase	Same query gets different plans at different times; autovacuum eventually rebalances	pg_hint_plan for critical paths; otherwise accept variance and alert on it

Plan stability across deploys

The most common operational pain: a deploy with new tables, new data, or different volumes causes plan regressions on unchanged queries.

The five-layer strategy:

Layer 1 — cost-constant tuning (one-time): Set random_page_cost = 1.1, effective_cache_size to 75% of RAM, work_mem = 16-32MB. Verify with EXPLAIN on the top-20 queries. This ensures the planner has the right mental model of your hardware.

Layer 2 — statistics freshness: Lower autovacuum_analyze_scale_factor = 0.02 on hot tables. Add ANALYZE critical_tables to every deploy hook. Create CREATE STATISTICS objects for correlated column groups identified via EXPLAIN.

Layer 3 — plan cache for skewed workloads: Set plan_cache_mode = 'force_custom_plan' on roles that run parameterised queries with skewed parameter distributions (identified via high stddev_exec_time).

Layer 4 — observability: Configure auto_explain (log_min_duration = 500ms, log_analyze = true, log_buffers = true, log_format = json, sample_rate = 0.05). Archive pg_stat_statements weekly (reset + save to file/S3).

Layer 5 — deploy procedure: Before every deploy, snapshot pg_stat_statements + EXPLAIN of top-20 queries to JSON. After deploy: ANALYZE critical tables, wait 5 minutes, re-snapshot. Diff key metrics. Flag any query whose plan structure changed or whose mean_exec_time grew by more than 50%.

All five layers together form a defence-in-depth: Layer 1 ensures the planner has an accurate model of your hardware; Layers 2–3 keep the input data accurate; Layer 4 catches regressions before users report them; Layer 5 ties stability to the deploy cycle rather than relying on luck.

Trace it

1/5

A senior engineer tunes a Postgres 16 OLTP database on NVMe SSD (32GB RAM, 500GB DB, ~5k QPS) that was deployed with default planner settings.

Step 1 of 5

Step 1: baseline.

Locked

Step 2: SSD cost tuning.

Locked

Step 3: statistics refresh and extension.

Locked

Step 4: memory and JIT.

Locked

Step 5: observability.

Pick the best fit

A team needs to stabilise execution plans across deploys for the top-20 OLTP queries. Pick the strategy.

The same four strategies as the matrix above, mapped along adaptive vs hard-pinned: only statistics-first is the default — the further you pin the planner, the narrower the case where it is correct.

Quiz

A batch job inserts 20M rows into the orders table every Sunday night. Every Monday morning, queries that were fast on Friday are slow. By Tuesday they are fast again. What is the root cause and the fix?

Quiz

A query shows `Sort Method: external merge Disk: 450MB` in EXPLAIN ANALYZE. What is the interpretation and fix?

Layer 1 - Cost constants random_page_cost=1.1, effective_cache_size

Layer 2 - Statistics freshness ANALYZE per deploy, CREATE STATISTICS

Layer 3 - Plan cache force_custom_plan for skewed params

Layer 4 - Observability auto_explain, pg_stat_statements

Layer 5 - Deploy procedure snapshot, ANALYZE, diff plans

Each layer prevents a class of regression; together they keep plans stable across deploys. Hints (pg_hint_plan) sit outside the stack - last resort for the 1-3 queries that must never regress.

Recall before you leave

01
Explain why the gap between rows-estimated and rows-actual at every plan node is the single most important diagnostic in Postgres.
02
Design the plan-stability strategy for a B2B SaaS Postgres 16 database on NVMe SSD (32GB RAM, 500GB DB, 5k OLTP QPS). Top-20 queries identified via pg_stat_statements. 3 of 20 are heavy aggregations.

Recap

Eight production failure modes cover most Postgres plan regressions: stale statistics after bulk operations (ANALYZE in post-batch hook), generic-plan trap on prepared statements with skewed parameters (plan_cache_mode = force_custom_plan), correlated-column row-estimate failures (CREATE STATISTICS dependencies + mcv), SSD mis-tuning (random_page_cost = 1.1), JIT overhead on OLTP (raise jit_above_cost), work_mem spill (sort / hash spill to disk — fix per-session), effective_cache_size mismatch, and plan flips with data phase changes. The operational strategy has five layers: SSD-tuned cost constants, aggressive statistics maintenance, selective plan_cache_mode overrides, auto_explain + pg_stat_statements observability, and a deploy procedure that diffs plans and metrics before and after every deploy. Now when a query regresses after a deploy, you have a checklist: did statistics drift (run ANALYZE), did a prepared statement switch to a generic plan (check stddev), did correlated columns tip an estimate (add CREATE STATISTICS)? Work through the eight modes in order and you will find the cause.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

Plan cache, cost-constant tuning, and planner internalssenior

appears again in287

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.