Databases DB · 03 · 05

Extended statistics: fixing correlated-column estimate failures

By default the planner assumes columns are independent. CREATE STATISTICS teaches it about functional dependencies, multi-column n_distinct, and MCV combinations — eliminating the largest class of row-estimate failures.

DB Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

A WHERE clause filters on country='US' AND region='CA' AND status='shipped'. The planner predicts 500 matching rows — three independent probabilities multiplied together. There are actually 50,000. A Nested Loop runs 100× too many times. The columns are correlated, not independent. CREATE STATISTICS fixes this in two SQL lines.

Why the independence assumption fails

The planner’s default multi-column selectivity model:

P(country='US' AND region='CA' AND status='shipped')
  = P(country='US') × P(region='CA') × P(status='shipped')
  = 0.50 × 0.05 × 0.20
  = 0.005 (0.5%)

But if country determines region (every CA order is in US — a functional dependency), the actual selectivity is:

P(region='CA') × P(status='shipped') = 0.05 × 0.20 = 1.0%

On 100M rows: the planner estimates 500,000 rows; reality is 1,000,000 — a 2× error that can tip Nested Loop into catastrophic territory. With tighter correlations the error can be 1000×.

The three kinds of extended statistics

Each kind targets a distinct failure mode — when you see a bad multi-column estimate, the kind you choose depends on why the columns are correlated.

Pick the kind by the symptom: WHERE-clause underestimate from a column that implies another → dependencies; a frequent value combo mis-estimated → mcv; GROUP BY expecting too many groups → ndistinct.

CREATE STATISTICS (available since PG 10) supports three complementary kinds:

Kind	What it stores	Fixes	Since
`dependencies`	Functional dependency coefficients between column pairs	One column implies another (zip → city, country → region)	PG 10
`ndistinct`	Distinct combination count for multi-column groups	GROUP BY cardinality over multiple columns	PG 10
`mcv`	Most-common value tuples for the column combination	Exact estimates for frequent combinations	PG 12

dependencies

Stores functional dependency factors between column pairs. A factor close to 1.0 means “column A almost always determines column B”. When the planner sees WHERE country='US' AND region='CA', it checks the dependency between country and region — if country determines region with factor 0.95, the planner knows not to multiply P(country) × P(region); instead it estimates based on the more selective of the two.

CREATE STATISTICS stx_orders_geo (dependencies)
  ON country, region, status FROM orders;
ANALYZE orders;

ndistinct

Without this, the planner estimates the number of distinct (country, region) combinations as n_distinct(country) × n_distinct(region), which wildly overestimates when the columns are correlated. ndistinct stores the actual combination count.

Useful for queries like:

SELECT country, region, COUNT(*) FROM orders GROUP BY country, region;

Without ndistinct, the planner may allocate a Hash Aggregate expecting millions of groups when there are only hundreds.

mcv (multi-column most common values)

Stores the most frequent tuples of the column combination directly — analogous to the per-column MCV list, but for pairs or triples. For WHERE country='US' AND status='shipped', if (US, shipped) is a common combination, the planner reads its exact frequency from the MCV list rather than multiplying marginal probabilities.

Available since PG 12 and often the highest-value kind for OLTP query patterns.

Expression statistics (PG 14+)

PG 14 added statistics on expressions:

CREATE STATISTICS stx_orders_lower_email ON LOWER(email) FROM users;
ANALYZE users;

This lets the planner accurately estimate selectivity for WHERE LOWER(email) = 'x' even without a functional index on LOWER(email).

When and how to apply CREATE STATISTICS

The workflow:

Run EXPLAIN ANALYZE on the slow query. Find nodes where rows (estimate) differs from actual rows by more than 10×.
Identify the WHERE clause columns on that node. Check if they are naturally correlated: zip and city, country and region, product and category.
Create the statistics object:

-- Most common pattern: dependencies + mcv together
CREATE STATISTICS stx_orders_country_region_status
  (dependencies, mcv)
  ON country, region, status FROM orders;

-- For GROUP BY accuracy:
CREATE STATISTICS stx_orders_ndistinct
  (ndistinct)
  ON country, region FROM orders;

-- Run ANALYZE to populate:
ANALYZE orders;

Re-run EXPLAIN ANALYZE. Verify that rows tracks actual rows within 2× on the affected nodes.

These four steps together close the loop: identify the estimate gap, trace it to correlated columns, teach the planner the correlation, confirm the fix. Without step 4, you cannot know if the statistics object actually helped — the planner only uses it when the predicate matches the declared columns exactly.

Storage cost: small — kilobytes per statistics object. Planning improvement on the affected queries: 100–1000×.

▸Why this works

Why does default_statistics_target not solve this? Raising the target (e.g., to 1000) gives finer histograms and longer MCV lists for individual columns — it does not teach the planner about correlations between columns. With 1000-entry MCV lists, the planner still multiplies P(country) × P(region) — it just has better per-column estimates. The independence assumption survives. CREATE STATISTICS replaces the independence assumption for specific column groups; SET STATISTICS refines the data used within it.

Quiz

Which is the right tool to make the planner aware that two columns are correlated (e.g., zip and city)?

Quiz

When is `ndistinct` the right kind of extended statistics to create?

Quiz

After running CREATE STATISTICS on (country, region, status) and ANALYZE, how do you verify the statistics actually improved the estimate?

Because country determines region, multiplying P(country) in is wrong. CREATE STATISTICS (dependencies) teaches the planner to discount the implied column, doubling the estimate from 0.5% to the true 1.0%.

Recall before you leave

01
Explain the independence assumption failure and why CREATE STATISTICS (dependencies) fixes it.
02
What is the difference between 'dependencies', 'mcv', and 'ndistinct' in CREATE STATISTICS?
03
How do you identify which column combinations need extended statistics in a production database?

Recap

By default the planner multiplies single-column selectivities assuming independence — wildly wrong for correlated columns like (country, region) or (zip, city). CREATE STATISTICS adds three kinds of multi-column information: dependencies (functional dependency factors between column pairs, fixing estimate failures for WHERE A=‘x’ AND B=‘y’ when A determines B), ndistinct (actual combination counts for accurate GROUP BY cardinality), and mcv (most-common value tuples for frequent specific combinations, available since PG 12). PG 14+ added expression statistics on computed expressions. Storage cost is kilobytes; planning improvement on affected queries can be 100–1000×. Now when you spot a 1000× estimate miss on a multi-column WHERE clause, before reaching for hints or index rewrites, ask whether those columns have a natural hierarchy — if so, two lines of SQL fix the planner’s model permanently.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

pg_statistic, ANALYZE, and production observabilitymiddle

unlocks

Plan cache, cost-constant tuning, and planner internalssenior

appears again in177

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Query plan visualizerPaste an EXPLAIN (ANALYZE, FORMAT JSON) and render the plan tree with per-node timing and row-estimate error, so a bad join jumps out visually.