awesome-everything RU
↑ Back to the climb

Databases

Production failure modes and plan stability

Crux Eight real failure modes — stale stats, generic-plan trap, correlated columns, SSD mis-tuning, JIT overhead, work_mem spill, random_page_cost drift, and plan flips — plus the operational strategy to detect and prevent plan regressions across deploys.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 18 min

A B2B SaaS product’s orders dashboard degrades every Monday morning. By Tuesday it is fine again. The pattern: a batch import runs Sunday night, shifting data distributions; autovacuum has not triggered ANALYZE yet by Monday morning; the planner picks a Nested Loop on a 1000× underestimated join. ANALYZE in the post-batch hook eliminates the incident forever.

Eight real production failure modes

Failure modeSymptomFix
Stale statistics cascadeSudden slowdown after bulk insert; rows-estimated wildly off actualANALYZE in post-batch hook; lower autovacuum_analyze_scale_factor
Generic-plan trapHigh stddev_exec_time; bimodal latency on prepared statementplan_cache_mode = force_custom_plan for the role
Correlated-column blowupNested Loop running 1M iterations; rows-estimated ~1000× lowCREATE STATISTICS (dependencies, mcv); ANALYZE
random_page_cost mis-tuned (SSD)Planner picks Seq Scan where Index Scan should win; queries read whole tablesSET random_page_cost = 1.1 in postgresql.conf
JIT overhead on OLTPShort queries slower than expected; JIT compilation time exceeds execution savingsRaise jit_above_cost to 500000 or SET jit = off for OLTP role
work_mem spillSort Method: external merge; Hash Batches > 1; queries slow on joins/sortsSET work_mem = ‘32MB’ per session; not globally without memory budget check
effective_cache_size too lowPlanner prefers Seq Scan even on SSD with large RAM because it assumes cold dataSET effective_cache_size = ‘24GB’ (~75% of total RAM)
Plan flips with workload phaseSame query gets different plans at different times; autovacuum eventually rebalancespg_hint_plan for critical paths; otherwise accept variance and alert on it

Plan stability across deploys

The most common operational pain: a deploy with new tables, new data, or different volumes causes plan regressions on unchanged queries.

The five-layer strategy:

Layer 1 — cost-constant tuning (one-time): Set random_page_cost = 1.1, effective_cache_size to 75% of RAM, work_mem = 16-32MB. Verify with EXPLAIN on the top-20 queries. This ensures the planner has the right mental model of your hardware.

Layer 2 — statistics freshness: Lower autovacuum_analyze_scale_factor = 0.02 on hot tables. Add ANALYZE critical_tables to every deploy hook. Create CREATE STATISTICS objects for correlated column groups identified via EXPLAIN.

Layer 3 — plan cache for skewed workloads: Set plan_cache_mode = 'force_custom_plan' on roles that run parameterised queries with skewed parameter distributions (identified via high stddev_exec_time).

Layer 4 — observability: Configure auto_explain (log_min_duration = 500ms, log_analyze = true, log_buffers = true, log_format = json, sample_rate = 0.05). Archive pg_stat_statements weekly (reset + save to file/S3).

Layer 5 — deploy procedure: Before every deploy, snapshot pg_stat_statements + EXPLAIN of top-20 queries to JSON. After deploy: ANALYZE critical tables, wait 5 minutes, re-snapshot. Diff key metrics. Flag any query whose plan structure changed or whose mean_exec_time grew by more than 50%.

Trace it
1/5

A senior engineer tunes a Postgres 16 OLTP database on NVMe SSD (32GB RAM, 500GB DB, ~5k QPS) that was deployed with default planner settings.

1
Step 1 of 5
Step 1: baseline.
2
Locked
Step 2: SSD cost tuning.
3
Locked
Step 3: statistics refresh and extension.
4
Locked
Step 4: memory and JIT.
5
Locked
Step 5: observability.
Pick the best fit

A team needs to stabilise execution plans across deploys for the top-20 OLTP queries. Pick the strategy.

Quiz

A batch job inserts 20M rows into the orders table every Sunday night. Every Monday morning, queries that were fast on Friday are slow. By Tuesday they are fast again. What is the root cause and the fix?

Quiz

A query shows `Sort Method: external merge Disk: 450MB` in EXPLAIN ANALYZE. What is the interpretation and fix?

Recall before you leave
  1. 01
    Explain why the gap between rows-estimated and rows-actual at every plan node is the single most important diagnostic in Postgres.
  2. 02
    Design the plan-stability strategy for a B2B SaaS Postgres 16 database on NVMe SSD (32GB RAM, 500GB DB, 5k OLTP QPS). Top-20 queries identified via pg_stat_statements. 3 of 20 are heavy aggregations.
Recap

Eight production failure modes cover most Postgres plan regressions: stale statistics after bulk operations (ANALYZE in post-batch hook), generic-plan trap on prepared statements with skewed parameters (plan_cache_mode = force_custom_plan), correlated-column row-estimate failures (CREATE STATISTICS dependencies + mcv), SSD mis-tuning (random_page_cost = 1.1), JIT overhead on OLTP (raise jit_above_cost), work_mem spill (sort / hash spill to disk — fix per-session), effective_cache_size mismatch, and plan flips with data phase changes. The operational strategy has five layers: SSD-tuned cost constants, aggressive statistics maintenance, selective plan_cache_mode overrides, auto_explain + pg_stat_statements observability, and a deploy procedure that diffs plans and metrics before and after every deploy. Use pg_hint_plan only for the 1-3 queries where the plan must be pinned and the correct plan shape is permanently known; let the planner adapt to everything else.

Practice

Do these to turn recognition into skill.

Connected lessons
appears again in258
Continue the climb ↑Execution plans: multiple-choice review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.