awesome-everything RU
↑ Back to the climb

Databases

Extended statistics: fixing correlated-column estimate failures

Crux By default the planner assumes columns are independent. CREATE STATISTICS teaches it about functional dependencies, multi-column n_distinct, and MCV combinations — eliminating the largest class of row-estimate failures.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min

A WHERE clause filters on country='US' AND region='CA' AND status='shipped'. The planner predicts 500 matching rows — three independent probabilities multiplied together. There are actually 50,000. A Nested Loop runs 100× too many times. The columns are correlated, not independent. CREATE STATISTICS fixes this in two SQL lines.

Why the independence assumption fails

The planner’s default multi-column selectivity model:

P(country='US' AND region='CA' AND status='shipped')
  = P(country='US') × P(region='CA') × P(status='shipped')
  = 0.50 × 0.05 × 0.20
  = 0.005 (0.5%)

But if country determines region (every CA order is in US — a functional dependency), the actual selectivity is:

P(region='CA') × P(status='shipped') = 0.05 × 0.20 = 1.0%

On 100M rows: the planner estimates 500,000 rows; reality is 1,000,000 — a 2× error that can tip Nested Loop into catastrophic territory. With tighter correlations the error can be 1000×.

The three kinds of extended statistics

CREATE STATISTICS (available since PG 10) supports three complementary kinds:

KindWhat it storesFixesSince
dependenciesFunctional dependency coefficients between column pairsOne column implies another (zip → city, country → region)PG 10
ndistinctDistinct combination count for multi-column groupsGROUP BY cardinality over multiple columnsPG 10
mcvMost-common value tuples for the column combinationExact estimates for frequent combinationsPG 12

dependencies

Stores functional dependency factors between column pairs. A factor close to 1.0 means “column A almost always determines column B”. When the planner sees WHERE country='US' AND region='CA', it checks the dependency between country and region — if country determines region with factor 0.95, the planner knows not to multiply P(country) × P(region); instead it estimates based on the more selective of the two.

CREATE STATISTICS stx_orders_geo (dependencies)
  ON country, region, status FROM orders;
ANALYZE orders;

ndistinct

Without this, the planner estimates the number of distinct (country, region) combinations as n_distinct(country) × n_distinct(region), which wildly overestimates when the columns are correlated. ndistinct stores the actual combination count.

Useful for queries like:

SELECT country, region, COUNT(*) FROM orders GROUP BY country, region;

Without ndistinct, the planner may allocate a Hash Aggregate expecting millions of groups when there are only hundreds.

mcv (multi-column most common values)

Stores the most frequent tuples of the column combination directly — analogous to the per-column MCV list, but for pairs or triples. For WHERE country='US' AND status='shipped', if (US, shipped) is a common combination, the planner reads its exact frequency from the MCV list rather than multiplying marginal probabilities.

Available since PG 12 and often the highest-value kind for OLTP query patterns.

Expression statistics (PG 14+)

PG 14 added statistics on expressions:

CREATE STATISTICS stx_orders_lower_email ON LOWER(email) FROM users;
ANALYZE users;

This lets the planner accurately estimate selectivity for WHERE LOWER(email) = 'x' even without a functional index on LOWER(email).

When and how to apply CREATE STATISTICS

The workflow:

  1. Run EXPLAIN ANALYZE on the slow query. Find nodes where rows (estimate) differs from actual rows by more than 10×.
  2. Identify the WHERE clause columns on that node. Check if they are naturally correlated: zip and city, country and region, product and category.
  3. Create the statistics object:
-- Most common pattern: dependencies + mcv together
CREATE STATISTICS stx_orders_country_region_status
  (dependencies, mcv)
  ON country, region, status FROM orders;

-- For GROUP BY accuracy:
CREATE STATISTICS stx_orders_ndistinct
  (ndistinct)
  ON country, region FROM orders;

-- Run ANALYZE to populate:
ANALYZE orders;
  1. Re-run EXPLAIN ANALYZE. Verify that rows tracks actual rows within 2× on the affected nodes.

Storage cost: small — kilobytes per statistics object. Planning improvement on the affected queries: 100–1000×.

Why this works

Why does default_statistics_target not solve this? Raising the target (e.g., to 1000) gives finer histograms and longer MCV lists for individual columns — it does not teach the planner about correlations between columns. With 1000-entry MCV lists, the planner still multiplies P(country) × P(region) — it just has better per-column estimates. The independence assumption survives. CREATE STATISTICS replaces the independence assumption for specific column groups; SET STATISTICS refines the data used within it.

Quiz

Which is the right tool to make the planner aware that two columns are correlated (e.g., zip and city)?

Quiz

When is `ndistinct` the right kind of extended statistics to create?

Quiz

After running CREATE STATISTICS on (country, region, status) and ANALYZE, how do you verify the statistics actually improved the estimate?

Recall before you leave
  1. 01
    Explain the independence assumption failure and why CREATE STATISTICS (dependencies) fixes it.
  2. 02
    What is the difference between 'dependencies', 'mcv', and 'ndistinct' in CREATE STATISTICS?
  3. 03
    How do you identify which column combinations need extended statistics in a production database?
Recap

By default the planner multiplies single-column selectivities assuming independence — wildly wrong for correlated columns like (country, region) or (zip, city). CREATE STATISTICS adds three kinds of multi-column information: dependencies (functional dependency factors between column pairs, fixing estimate failures for WHERE A=‘x’ AND B=‘y’ when A determines B), ndistinct (actual combination counts for accurate GROUP BY cardinality), and mcv (most-common value tuples for frequent specific combinations, available since PG 12). PG 14+ added expression statistics on computed expressions. Storage cost is kilobytes; planning improvement on affected queries can be 100–1000×. After creating statistics, always run ANALYZE and verify with EXPLAIN ANALYZE that rows-estimated tracks rows-actual within 2× on the affected nodes.

Practice

Do these to turn recognition into skill.

Connected lessons
appears again in174
Continue the climb ↑Plan cache, cost-constant tuning, and planner internals
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.