Databases
Execution plans: diagnose and stabilise a slow query
Reading about the row-estimate cascade is not the same as watching a Nested Loop run half a million times because the planner believed the wrong number. Build a realistic dataset, drive a query into a planner blowup, and walk the unit’s diagnostic discipline until the estimate tracks reality — with EXPLAIN ANALYZE evidence at every step.
Turn the unit’s mental model into a reproducible loop: instrument with EXPLAIN ANALYZE and pg_stat_statements, diagnose a row-estimate cascade and a generic-plan trap from the plan output, fix each at the cause, and prove plan stability with before/after numbers under identical load.
Take a real Postgres 16 instance, seed a correlated, skewed dataset, and drive a representative query into a planner blowup. Diagnose the cause from EXPLAIN ANALYZE, fix it at the statistics/plan-cache layer (not by forcing a plan), and prove the fix with before/after measurements under the same load.
- A before/after table for the target query: scan type, join algorithm, rows-estimated vs actual on the key node, inner-loop count, p99 latency, and mean_exec_time — all measured under the same load, not estimated.
- The EXPLAIN ANALYZE after the statistics fix shows estimated tracking actual within ~2x on the previously-broken node, and the plan no longer uses the blown-up Nested Loop.
- Evidence of the generic-plan trap (bimodal latency, high stddev_exec_time, the GENERIC_PLAN output) and its resolution after force_custom_plan, with the prepared statement now using the correct composite index for both parameters.
- A one-paragraph write-up naming the cause of each fix (correlated-column independence assumption; generic-plan switch) and explaining why fixing the estimate or the plan-cache mode is correct where forcing a plan or rebuilding an index would not be.
- Add a one-page on-call runbook: triage from EXPLAIN ANALYZE (find the largest estimated-vs-actual gap first), the common causes (stale stats, correlated columns, non-sargable predicate, generic-plan trap), and a verification checklist.
- Build the five-layer plan-stability deploy procedure: snapshot pg_stat_statements + EXPLAIN of the top-20 before deploy, ANALYZE and re-snapshot after, and write a script that diffs plan structure and flags any mean_exec_time regression over 50%.
- Introduce a non-sargable predicate (EXTRACT(year FROM created_at) = 2026 or LOWER(email) = 'x'), show it forces a Seq Scan + Filter, then fix it with a range rewrite or an expression index / expression statistics and prove the scan changes.
- Force a work_mem spill on a large sort or hash join (Sort Method: external merge or Hash Batches > 1), then raise work_mem per session and show the spill disappear — while documenting the max_connections x work_mem budget you would not exceed globally.
This is the loop you will run in every real Postgres slow-query incident: instrument first with EXPLAIN ANALYZE and pg_stat_statements, find the largest rows-estimated vs actual gap and trace it to its predicate, fix the cause — extended statistics for correlated columns, force_custom_plan for skewed prepared statements, SSD-tuned cost constants — and verify with before/after numbers under identical load. Forcing a plan or rebuilding an index treats the symptom; fixing the estimate corrects the whole tree above it. Doing this once on a seeded dataset makes the production version muscle memory.