Databases DB · 08 · 02

Acts 1–3 in depth: schema, indexes, and planner statistics

The first three acts compound irreversibly — a wrong type is a multi-day migration at 100M rows, the leading-column rule is the most violated principle in the wild, and stale statistics make the planner choose the wrong plan 1000× slower.

DB Middle ◷ 14 min

Level

FoundationsJuniorMiddleSenior

A team adds a TEXT column for email instead of CITEXT on Day 0. Six months and 100M rows later, a case-sensitivity bug requires a full-table type change. The original schema decision — made in five seconds — costs three days and a maintenance window. Schema is the only act where the cost of being wrong is irreversible.

Act 1 — Schema: the one irreversible decision

When you design a schema, you are making bets that are almost impossible to reverse at scale — a five-second decision on Day 0 can cost three days of maintenance at Year 2.

A wrong type (TEXT vs CITEXT for email) becomes a multi-day migration at 100M rows. A wrong key choice (composite natural key vs surrogate BIGSERIAL) makes every join expensive forever. The relational model rewards strictness up front:

NOT NULL by default — nullable columns invite application-level null checks that should be database constraints.
FK constraints unless you have measured throughput pressure — referential integrity prevents orphaned child rows that produce wrong query results at scale.
CHECK constraints for business invariants — a direct SQL update can create invalid state without them.
Surrogate keys for external references — id BIGSERIAL insulates from email changes; retrofitting surrogate keys at 1B rows requires a full table rewrite.

The same Day-0 mistake; the bill escalates with row count — which is why schema is the one act whose wrong decision you cannot cheaply walk back.

Schema mistakes and their escape hatches:

A composite natural key (user_id, organization_id, email) as the PK means every join carries three columns; every FK replicates three columns. Fix: surrogate id is cheaper, but retrofitting requires a migration that rewrites the table.
A missing FK on a child table means no referential integrity; the data decays over time. Fix: add the FK retroactively — but first fix orphaned child rows.
A missing CHECK constraint on a domain-specific column (status must be one of ACTIVE, PENDING, ARCHIVED) means the application is the only guard. Fix: add the constraint, quarantine invalid existing rows.

Bend the rules only with evidence — disable FK on the highest-write child tables if the FK trigger cost exceeds the write budget, never as a stylistic preference.

Act 2 — Indexes: pay back cost only on the right columns

Every index you add speeds one set of reads and slows every write — so before you add one, ask yourself: which query is slow, and does the column you are indexing match its filter exactly?

A B-tree on email resolves WHERE email = ? in O(log N) page reads. A B-tree on (org_id, created_at DESC) resolves “the 50 latest events for an org” in two page reads. A GIN index on prefs jsonb makes WHERE prefs @> '{"theme":"dark"}' a 2 ms lookup instead of a 4 s seq-scan.

The forgotten cost: every insert and update writes to every index. The leading-column rule is the most-violated principle in the wild.

Index selection failures:

A B-tree index on a low-cardinality column (gender with 2 values, status with 5 values) is almost always a waste — the planner chooses seq-scan and the index eats write throughput.
An index on (a, b) does not help queries that filter only by b. A (user_id, created_at) index built for “events for a user” does not help “recent events by anyone.” A separate (created_at, user_id) index satisfies both and the planner picks it based on the query.
JSONB indexes (GIN or GIST) are powerful but expensive on writes — 100K inserts/second can halve write throughput if the JSONB column is indexed naively.
Partial indexes solve the write-throughput problem for dimension tables: CREATE INDEX ... WHERE status != 'ARCHIVED' covers the 5% of active rows at a fraction of the cost.

The cost of being wrong here is permanent: legacy apps inherit the bad column order, and any attempt to drop the index for remediation risks query regressions in undocumented code paths.

Act 3 — Statistics: the planner picks plans from histograms

You have the right indexes and the right schema — yet queries are still inexplicably slow half the time. The culprit is almost always stale statistics, and the fix is one command.

Postgres samples the table during ANALYZE and builds a histogram per column. The planner uses histograms to estimate row counts; the cost model ranks plans. When statistics are stale, estimates are wrong.

A 1000× off estimate produces a 1000× slower plan: the planner picks a nested loop expecting 10 rows; the real cardinality is 10000 rows; the loop scans 10M index entries instead of one hash join over a 10K-row build side.

Defence:

autovacuum_analyze_scale_factor = 0.05 (or smaller, 0.01) on busy tables — at the default 0.1, a table that grows by 1M rows per day has stale stats within hours.
CREATE STATISTICS (dependencies) ON user_id, created_at FROM events teaches the planner column correlations (new users tend to appear in recent data; the planner assumes independence by default).
pg_stat_statements to find the offending plans by total execution time.
auto_explain.log_min_duration = '500ms' for tail-latency triage.

The trap at scale: a single bad estimate cascades. The planner caches plan IDs (pg_stat_statements.plan_id); a human has to find and force a replan by dropping and re-creating the query’s prepared statement, or by running DISCARD PLANS.

Acts 1–3 cost and fix times

Schema: fix wrong type at 1M rows: hours
Schema: fix wrong type at 100M rows: days + maintenance window
Index lookup in shared_buffers (L3 cache): 5–50 µs
Index lookup from SSD: 5–10 ms
Seq-scan of 1 GB table on SSD: 1–3 s
autovacuum analyze on default 0.1 factor: triggers after 100K new rows

▸Why this works

A work_mem limit (default 4 MB) affects sorts and hash joins. If the planner’s estimates are wrong and the actual data requires more work_mem than available, the query spills to disk and becomes very slow. Re-analyzing solves the estimate problem; increasing work_mem solves the hash-build problem; neither is a long-term substitute for correct statistics.

Quiz

An index exists on (user_id, created_at). A query filters only by created_at. What happens?

Quiz

autovacuum_analyze_scale_factor = 0.1 (the default). A table grows by 1M rows per day. How often does ANALYZE run?

Quiz

You add a surrogate BIGSERIAL id to a table that already uses email as the natural key. What is the primary benefit at year-3 scale?

Schema defines the table; statistics feed the planner's row estimates; the planner picks an index access path. A wrong type, a missing index, or stale stats each breaks this path at a different stage.

Recall before you leave

01
Why is schema described as the 'one irreversible decision' and what does that mean operationally?
02
Explain the leading-column rule and give an example of how violating it causes a production incident.
03
What is the cascade failure when statistics are stale and how do autovacuum_analyze_scale_factor and CREATE STATISTICS fix it?

Recap

Acts 1–3 are the foundation that every later act builds on. Schema is irreversible — wrong types and missing surrogate keys become multi-day migrations at 100M rows; the rule is NOT NULL by default, FK unless you have measured throughput pressure, and surrogate keys for external references. The leading-column rule governs every B-tree index: an index on (a, b) does not help b-alone queries, and getting this wrong permanently slows the queries that need b — the fix is a new index, but the old bad index cannot be safely dropped without an audit. Statistics decay as the table grows faster than autovacuum’s default threshold; the defence is a lower autovacuum_analyze_scale_factor on high-growth tables and CREATE STATISTICS for correlated columns. All three acts share a theme: the cost of the right decision at Day 0 is near zero; the cost of retrofitting the wrong decision at Year 3 is exponential. Now when you see a query that is slow despite having an index, check the statistics before adding another index — the planner might already know about the index and simply distrust the row estimate.

Connected lessons

builds on

deepens into

appears again in287

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.