Databases
Extended statistics: fixing correlated-column estimate failures
A WHERE clause filters on country='US' AND region='CA' AND status='shipped'. The planner predicts 500 matching rows — three independent probabilities multiplied together. There are actually 50,000. A Nested Loop runs 100× too many times. The columns are correlated, not independent. CREATE STATISTICS fixes this in two SQL lines.
Why the independence assumption fails
The planner’s default multi-column selectivity model:
P(country='US' AND region='CA' AND status='shipped')
= P(country='US') × P(region='CA') × P(status='shipped')
= 0.50 × 0.05 × 0.20
= 0.005 (0.5%)But if country determines region (every CA order is in US — a functional dependency), the actual selectivity is:
P(region='CA') × P(status='shipped') = 0.05 × 0.20 = 1.0%On 100M rows: the planner estimates 500,000 rows; reality is 1,000,000 — a 2× error that can tip Nested Loop into catastrophic territory. With tighter correlations the error can be 1000×.
The three kinds of extended statistics
CREATE STATISTICS (available since PG 10) supports three complementary kinds:
| Kind | What it stores | Fixes | Since |
|---|---|---|---|
dependencies | Functional dependency coefficients between column pairs | One column implies another (zip → city, country → region) | PG 10 |
ndistinct | Distinct combination count for multi-column groups | GROUP BY cardinality over multiple columns | PG 10 |
mcv | Most-common value tuples for the column combination | Exact estimates for frequent combinations | PG 12 |
dependencies
Stores functional dependency factors between column pairs. A factor close to 1.0 means “column A almost always determines column B”. When the planner sees WHERE country='US' AND region='CA', it checks the dependency between country and region — if country determines region with factor 0.95, the planner knows not to multiply P(country) × P(region); instead it estimates based on the more selective of the two.
CREATE STATISTICS stx_orders_geo (dependencies)
ON country, region, status FROM orders;
ANALYZE orders;ndistinct
Without this, the planner estimates the number of distinct (country, region) combinations as n_distinct(country) × n_distinct(region), which wildly overestimates when the columns are correlated. ndistinct stores the actual combination count.
Useful for queries like:
SELECT country, region, COUNT(*) FROM orders GROUP BY country, region;Without ndistinct, the planner may allocate a Hash Aggregate expecting millions of groups when there are only hundreds.
mcv (multi-column most common values)
Stores the most frequent tuples of the column combination directly — analogous to the per-column MCV list, but for pairs or triples. For WHERE country='US' AND status='shipped', if (US, shipped) is a common combination, the planner reads its exact frequency from the MCV list rather than multiplying marginal probabilities.
Available since PG 12 and often the highest-value kind for OLTP query patterns.
Expression statistics (PG 14+)
PG 14 added statistics on expressions:
CREATE STATISTICS stx_orders_lower_email ON LOWER(email) FROM users;
ANALYZE users;This lets the planner accurately estimate selectivity for WHERE LOWER(email) = 'x' even without a functional index on LOWER(email).
When and how to apply CREATE STATISTICS
The workflow:
- Run
EXPLAIN ANALYZEon the slow query. Find nodes whererows(estimate) differs fromactual rowsby more than 10×. - Identify the WHERE clause columns on that node. Check if they are naturally correlated: zip and city, country and region, product and category.
- Create the statistics object:
-- Most common pattern: dependencies + mcv together
CREATE STATISTICS stx_orders_country_region_status
(dependencies, mcv)
ON country, region, status FROM orders;
-- For GROUP BY accuracy:
CREATE STATISTICS stx_orders_ndistinct
(ndistinct)
ON country, region FROM orders;
-- Run ANALYZE to populate:
ANALYZE orders;- Re-run
EXPLAIN ANALYZE. Verify thatrowstracksactual rowswithin 2× on the affected nodes.
Storage cost: small — kilobytes per statistics object. Planning improvement on the affected queries: 100–1000×.
Why this works
Why does default_statistics_target not solve this? Raising the target (e.g., to 1000) gives finer histograms and longer MCV lists for individual columns — it does not teach the planner about correlations between columns. With 1000-entry MCV lists, the planner still multiplies P(country) × P(region) — it just has better per-column estimates. The independence assumption survives. CREATE STATISTICS replaces the independence assumption for specific column groups; SET STATISTICS refines the data used within it.
Which is the right tool to make the planner aware that two columns are correlated (e.g., zip and city)?
When is `ndistinct` the right kind of extended statistics to create?
After running CREATE STATISTICS on (country, region, status) and ANALYZE, how do you verify the statistics actually improved the estimate?
- 01Explain the independence assumption failure and why CREATE STATISTICS (dependencies) fixes it.
- 02What is the difference between 'dependencies', 'mcv', and 'ndistinct' in CREATE STATISTICS?
- 03How do you identify which column combinations need extended statistics in a production database?
By default the planner multiplies single-column selectivities assuming independence — wildly wrong for correlated columns like (country, region) or (zip, city). CREATE STATISTICS adds three kinds of multi-column information: dependencies (functional dependency factors between column pairs, fixing estimate failures for WHERE A=‘x’ AND B=‘y’ when A determines B), ndistinct (actual combination counts for accurate GROUP BY cardinality), and mcv (most-common value tuples for frequent specific combinations, available since PG 12). PG 14+ added expression statistics on computed expressions. Storage cost is kilobytes; planning improvement on affected queries can be 100–1000×. After creating statistics, always run ANALYZE and verify with EXPLAIN ANALYZE that rows-estimated tracks rows-actual within 2× on the affected nodes.
Practice
Do these to turn recognition into skill.
appears again in174
- The journey of a request: seven stops from socket to responsejunior
- Accept and parse: from kernel queue to a typed requestmiddle
- Routing and middleware: choosing what runs, and in what ordermiddle
- Handler and response: from business logic to bytes on the wiremiddle
- Streaming and backpressure: when the client reads slower than you writesenior
- Timeouts and tail latency: budgets, deadlines, and the fan-out trapsenior
- Middleware and DI: the two patterns that shape every backendjunior
- Writing middleware: signatures, next(), and the three framework modelsmiddle
- Inversion of control: how dependencies reach a classmiddle
- DI scopes and lifecycles: singleton, request, transientmiddle
- DI as a testing seam: fakes, mocks, and the boundary that matterssenior
- DI containers in production: resolution graphs, circular deps, and when not tosenior
- Blocking vs non-blocking I/O: two ways to waitjunior
- The event loop: one thread, ordered phasesmiddle
- What blocks the loop: CPU work and sync callsmiddle
- Offloading CPU work: worker threads and the libuv poolmiddle
- Backpressure and bounded concurrencysenior
- Throughput under load: tail latency and saturationsenior
- Why pool: the cost of creating a connectionjunior
- Pool sizing: why bigger is not fastermiddle
- Acquisition and timeouts: the wait queue is the real latency dialmiddle
- Retry strategies: backoff, jitter, and thundering herdmiddle
- Observability, production failures, and global-scale designsenior
- Tasks, microtasks, and scheduler.yield()middle
- Timer accuracy, throttling, and idle workmiddle
- Node.js event loop: phases, nextTick, and loop lagsenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- LCP: four phases, one dominant costmiddle
- INP: input delay, processing, presentationmiddle
- Lab vs field: why the two disagree and how to use eachmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- Bits on the wirejunior
- Latency mathmiddle
- Bufferbloat and congestionsenior
- The physical frontiersenior
- Sequence numbers and connection statemiddle
- Flow control and congestion controlmiddle
- BBR, production observability, and beyond TCPsenior
- CDN: putting content next doorjunior
- Anycast and GeoDNS: routing to the nearest edgemiddle
- Tiered cache and Cache-Controlmiddle
- Vary header and cache keysmiddle
- Stale-while-revalidate and cache stampedesenior
- Edge workers and edge-side compositionsenior
- CDN operations and observabilitysenior
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket vs SSE vs long-polling: choosing the right transportmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Balancing algorithms: round-robin to power-of-two-choicesmiddle
- L4 vs L7 load balancing and client-IP preservationmiddle
- Health checks, connection draining, and slow startmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- QUIC streams and head-of-line blockingjunior
- Integrated handshake and 1-RTTmiddle
- Connection IDs and network migrationmiddle
- Loss detection and congestion controlmiddle
- 0-RTT resumption and packet encryptionsenior
- Deployment tradeoffs and CPU costsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- The twelve layers: one URL, seven actorsjunior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Critical render path and Core Web Vitalsmiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Metrics and cardinality: the cost model of a time-series databasemiddle
- Logs and volume: the cost model of structured loggingmiddle
- Traces and sampling: the cost model of distributed tracingmiddle
- Join keys and exemplars: making the three signals composemiddle
- Observability 2.0: wide events and the cost shiftsenior
- Failure modes and engineering practice: cardinality budgets, PII, and samplingsenior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- Log levels and alert routingmiddle
- Sampling strategies and log costmiddle
- PII redaction and log injectionsenior
- Trace context propagation in logssenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- OTel signals, Semantic Conventions, and the OTLP wire formatmiddle
- Auto-instrumentation and manual spans: the 80/20 of OTelmiddle
- The OTel Collector: receivers, processors, exporters, and deployment patternsmiddle
- Sampling strategies: head, tail, and parent-basedmiddle
- Vendor neutrality, eBPF instrumentation, the Operator, and browser/serverless OTelsenior
- Operating the OTel Collector: reliability, version skew, failure modes, and governancesenior
- RED and USE: two checklists, one triage disciplinejunior
- Instrumenting RED in Prometheus: counters, histograms, and cardinality disciplinemiddle
- USE on Linux: CPU, memory, disk, network, and PSImiddle
- Golden signals, dashboard layout, and service mesh auto-REDmiddle
- Cardinality as a cost driver: labels, PII, exemplars, and samplingmiddle
- Native histograms, SLO tie-in, and production failure patternsmiddle
- Choosing SLIs and SLO targets: ratios, not feelingsmiddle
- Multi-window multi-burn-rate alerting: why AND beats ORmiddle
- Error budget policy, latency SLOs, and composite journeysmiddle
- Iceberg SLIs, composite SLO math, and SLA vs SLOsenior
- Flame graphs: reading the picture that shows where time goesjunior
- Sampling vs instrumentation profiling: why 99 Hz wins in productionmiddle
- Profile types: CPU, memory, off-CPU, mutex — which one to reach formiddle
- Continuous profiling: always-on flame graphs with eBPF and trace-id correlationmiddle
- How flame graphs are built from samples, and the production workflows that use themmiddle
- Linux perf, eBPF internals, PGO, and the limits of samplingsenior
- Profiling in production: security, war stories, OTel profiles, and the infrastructure designsenior
- The debugging funnel: SLO → RED → trace → profilejunior
- OTel architecture: one SDK, four signals, one wire formatmiddle
- Cost discipline: keeping observability under 5% of infra spendmiddle
- Scale, security, and the ROI of observable systemssenior
- Why profile first: measure where time actually goesjunior
- Amdahl''''s law and self-time: the ceiling on every speedup you can shipmiddle
- The measurement loop: microbench, macrobench, prod profile, observer effectmiddle
- Reading flame graphs: shapes, per-language profilers, and the 60-second scanmiddle
- Statistical baselines: why one run is not a measurementmiddle
- Profiler history and microbenchmark pitfalls: Knuth to GWPsenior
- Hardware counters, cold-start profiles, and profile securitysenior
- Continuous profiling at scale: costs, CI gates, trace correlation, and anti-patternssenior
- What makes a hot path: symptom vs causejunior
- Five shapes of hotspot: CPU, alloc, cache, lock, syscallmiddle
- Reading parent and child chains: where to apply the fixmiddle
- JIT deopt, the fix-and-verify loop, and PR-time profilingmiddle
- Hardware counters and Intel TMA: sub-category diagnosissenior
- False sharing and native-bridge hot pathssenior
- Hot paths in production: security, tail latency, and tooling lineagesenior
- Memory hierarchy: why the same O(N) loop can be 17x slowerjunior
- Row-major vs column-major: access order and the 9x gapjunior
- Branch prediction and branchless codemiddle
- Hardware prefetcher, TLB, and memory-level parallelismsenior
- GC basics: what the runtime taxes you forjunior
- GC algorithms: generational, concurrent, and per-runtimemiddle
- GC tradeoffs: pause, throughput, heap — and object poolingmiddle
- GC tuning: pacing, heap shape, and allocation observabilitymiddle
- GC internals: tri-color invariant, write barriers, and per-runtime deep-divessenior
- GC in production: observability, security, edge cases, and fleet governancesenior
- N+1: one logical operation, many round-tripsjunior
- Fix families: JOIN, IN, preload, and DataLoadermiddle
- Detecting N+1: query logs, APM traces, and CI gatesmiddle
- DataLoader: batching across resolver treesmiddle
- Cross-protocol N+1: HTTP fan-out and Redis MGETmiddle
- N+1 at scale: pool exhaustion, plan changes, and denormalisationsenior
- Batching: amortize fixed cost per operationjunior
- The batching window: size and wait timemiddle
- Batching in Kafka and Postgresmiddle
- io_uring and observability of batchingmiddle
- From Nagle to io_uring: evolution of batchingmiddle
- Backpressure, failure isolation, and batch security in productionsenior
- What a bundle actually costs: download, parse, compile, executejunior
- Core Web Vitals: LCP, INP, and CLSmiddle
- Code splitting: route-level, component-level, vendor splittingmiddle
- Tree shaking and compression: removing what you don''''t usemiddle
- Third-party scripts: the silent budget killermiddle
- CI enforcement and RUM: making budgets stickmiddle
- V8 JIT pipeline, HTTP priorities, and bundle securitysenior
- The performance loop: discipline, not a projectjunior
- Classify and fix: matching bottleneck families to remediesmiddle
- Observability stack and CI gates: catching regressions before they shipmiddle
- Incident to enforcement: SLO burn to verified fix in 35 minutesmiddle
- Culture, economics, and org-scale performancesenior