Databases DB · 03 · 04

pg_statistic, ANALYZE, and production observability

The planner''''s row estimates come from pg_statistic: MCV lists, histograms, correlation, and n_distinct. ANALYZE keeps them fresh. auto_explain and pg_stat_statements together tell you which queries are slow and why.

DB Middle ◷ 16 min

Level

FoundationsJuniorMiddleSenior

A query that ran in 5ms yesterday takes 4 seconds today. Nothing in the application changed. The plan is different. The cause is almost always the same: statistics drifted and the planner made a different choice. Knowing how Postgres builds and uses statistics is how you prevent the next incident.

Inside pg_statistic

When a plan goes wrong, the culprit is almost always wrong data inside pg_statistic — so knowing what lives there tells you exactly what to look for and how to fix it.

When you run ANALYZE orders, Postgres samples the table (by default 300 × default_statistics_target = 30,000 rows with the default target of 100) and computes per-column statistics stored in pg_statistic (readable via pg_stats):

Statistic	What it stores	Used for
most_common_vals (MCV)	Top-N values + their frequencies	Exact selectivity for frequent values
histogram_bounds	Equi-frequency histogram of non-MCV values	Selectivity for range and equality on rare values
n_distinct	Distinct value count (negative = fraction of rows)	GROUP BY cardinality estimates
correlation	Correlation between physical and logical order (−1 to 1)	Index scan random vs sequential cost estimate

How the planner uses these for WHERE x = 42:

Is 42 in the MCV list? Use its exact frequency directly.
Not an MCV? Find the histogram bucket containing 42, assume uniform distribution within the bucket.
Combine with reltuples (row count in pg_class) to get the estimated row count.

For range predicates (x > 100), the histogram is integrated over the range. For multi-column predicates, by default the planner assumes independence and multiplies selectivities — which is wrong for correlated columns (covered in lesson 05).

Read the catalog yourself:

SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'orders';

ANALYZE: when to run it

Autovacuum schedules ANALYZE per table based on:

autovacuum_analyze_threshold (default 50 rows) +
autovacuum_analyze_scale_factor × reltuples (default 0.1 = 10%)

On a 100M-row table, autovacuum analyzes only after 10M row changes — far too infrequent for tables with skewed or fast-changing data distributions. Operational fixes:

-- Lower the scale factor for hot tables:
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.02);

-- Bump statistics target for skewed columns:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;

The default 10% scale factor means a 100M-row table goes 10M changes between ANALYZE runs — far too stale for hot data. Dropping the scale factor to 0.02 cuts that to 2M, keeping selectivity estimates fresh.

Manual ANALYZE runs in seconds even on 100M-row tables (it samples, not full-scans). Always run it:

After bulk inserts or large UPDATEs that shift distributions
After schema changes that add columns
In post-deploy hooks before traffic resumes

EXPLAIN options for diagnostics

Beyond the basic plan, several EXPLAIN options are essential:

FORMAT JSON | XML | YAML | TEXT — JSON for tooling (explain.depesz.com, explain.dalibo.com, pganalyze)
VERBOSE — adds output column lists per node
SETTINGS (PG 12+) — prints non-default planner GUCs; diagnose environment drift between staging and production
WAL (PG 13+) — shows WAL bytes generated by the statement
GENERIC_PLAN (PG 16+) — plans a parameterised query without sample values; essential for prepared-statement diagnostics (covered in lesson 06)
SERIALIZE (PG 17+) — includes the cost of serialising rows for the client; closes the gap between EXPLAIN ANALYZE total time and client-observed latency

auto_explain and pg_stat_statements

Two extensions every production Postgres should have:

pg_stat_statements records every executed query (normalised by parameter), tracking calls, total_exec_time, mean_exec_time, rows, and buffer counts. Query: SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20. These top-20 by total_exec_time are your optimisation targets — a 2ms query called 10M times beats a 500ms query called twice.

auto_explain automatically logs EXPLAIN ANALYZE for any query exceeding a duration threshold:

-- postgresql.conf additions:
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '500ms'
auto_explain.log_analyze = true
auto_explain.log_buffers = true
auto_explain.log_format = 'json'
auto_explain.sample_rate = 0.01   -- 1% of slow queries to limit log volume

Slow queries land in the Postgres logs with full ANALYZE + BUFFERS plans, no need to reproduce them in staging. Together they answer “which queries are slow, and why” without instrumenting the application.

Trace it

1/5

A query that ran in 5ms yesterday now takes 4 seconds. Nothing changed in the application. Diagnose.

Step 1 of 5

Step 1: confirm the regression.

Locked

Step 2: check for planning-time differences.

Locked

Step 3: check for execution-time differences.

Locked

Step 4: identify the prepared-statement trap.

Locked

Step 5: stabilise.

Order the steps

Order the EXPLAIN ANALYZE diagnostic checks from highest to lowest signal:

1 Rows estimated vs rows actual on every node — large gap = stale statistics
2 Plan node taking most actual time — the dominant cost in this query
3 loops count on inner side of Nested Loop — exposes outer-row underestimate
4 Sort Method (in-memory quicksort vs external merge) — work_mem fit
5 BUFFERS shared hit vs read — cache state and I/O picture
6 Index Cond vs Filter — index is narrowing, or only post-filtering
7 Planning Time vs Execution Time — planning dominance is rare but real

Quiz

On a 100M-row table with default autovacuum settings, after how many row changes will autovacuum trigger ANALYZE?

Quiz

Which query correctly identifies the top slow-query targets for optimisation using pg_stat_statements?

Stale pg_statistic means a stale estimate, which propagates into the plan choice. ANALYZE refreshes the sample; auto_explain detects when the estimate drifted from reality.

Recall before you leave

01
Walk through how the Postgres planner uses pg_statistic to estimate the selectivity of WHERE x = 42.
02
What operational discipline keeps statistics accurate in a fast-changing production database?
03
Explain what auto_explain does, how to configure it for production, and what its cost is.

Recap

The planner’s row estimates are computed from pg_statistic: for each column, ANALYZE stores a most-common-values list (MCV) with exact frequencies, an equi-frequency histogram for the rest, n_distinct for GROUP BY cardinality, and correlation for index cost estimation. ANALYZE samples the table (30,000 rows by default) and runs in seconds even on large tables. Autovacuum triggers it after 10% row changes on a 100M-row table — too infrequent for hot data; tune autovacuum_analyze_scale_factor = 0.02 on critical tables and run ANALYZE manually in post-deploy hooks. Now when a query suddenly regresses after a bulk load or a deploy, the first thing you check is whether statistics drifted — run EXPLAIN ANALYZE, look at rows-estimated vs actual, and if they diverge, run ANALYZE before anything else.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

Join algorithms and the row-estimate cascademiddle

unlocks

deepens into

appears again in287

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Query plan visualizerPaste an EXPLAIN (ANALYZE, FORMAT JSON) and render the plan tree with per-node timing and row-estimate error, so a bad join jumps out visually.