awesome-everything RU
↑ Back to the climb

Databases

Acts 1–3 in depth: schema, indexes, and planner statistics

Crux The first three acts compound irreversibly — a wrong type is a multi-day migration at 100M rows, the leading-column rule is the most violated principle in the wild, and stale statistics make the planner choose the wrong plan 1000× slower.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

A team adds a TEXT column for email instead of CITEXT on Day 0. Six months and 100M rows later, a case-sensitivity bug requires a full-table type change. The original schema decision — made in five seconds — costs three days and a maintenance window. Schema is the only act where the cost of being wrong is irreversible.

Act 1 — Schema: the one irreversible decision

A wrong type (TEXT vs CITEXT for email) becomes a multi-day migration at 100M rows. A wrong key choice (composite natural key vs surrogate BIGSERIAL) makes every join expensive forever. The relational model rewards strictness up front:

  • NOT NULL by default — nullable columns invite application-level null checks that should be database constraints.
  • FK constraints unless you have measured throughput pressure — referential integrity prevents orphaned child rows that produce wrong query results at scale.
  • CHECK constraints for business invariants — a direct SQL update can create invalid state without them.
  • Surrogate keys for external referencesid BIGSERIAL insulates from email changes; retrofitting surrogate keys at 1B rows requires a full table rewrite.

Schema mistakes and their escape hatches:

  • A composite natural key (user_id, organization_id, email) as the PK means every join carries three columns; every FK replicates three columns. Fix: surrogate id is cheaper, but retrofitting requires a migration that rewrites the table.
  • A missing FK on a child table means no referential integrity; the data decays over time. Fix: add the FK retroactively — but first fix orphaned child rows.
  • A missing CHECK constraint on a domain-specific column (status must be one of ACTIVE, PENDING, ARCHIVED) means the application is the only guard. Fix: add the constraint, quarantine invalid existing rows.

Bend the rules only with evidence — disable FK on the highest-write child tables if the FK trigger cost exceeds the write budget, never as a stylistic preference.

Act 2 — Indexes: pay back cost only on the right columns

A B-tree on email resolves WHERE email = ? in O(log N) page reads. A B-tree on (org_id, created_at DESC) resolves “the 50 latest events for an org” in two page reads. A GIN index on prefs jsonb makes WHERE prefs @> '{"theme":"dark"}' a 2 ms lookup instead of a 4 s seq-scan.

The forgotten cost: every insert and update writes to every index. The leading-column rule is the most-violated principle in the wild.

Index selection failures:

  • A B-tree index on a low-cardinality column (gender with 2 values, status with 5 values) is almost always a waste — the planner chooses seq-scan and the index eats write throughput.
  • An index on (a, b) does not help queries that filter only by b. A (user_id, created_at) index built for “events for a user” does not help “recent events by anyone.” A separate (created_at, user_id) index satisfies both and the planner picks it based on the query.
  • JSONB indexes (GIN or GIST) are powerful but expensive on writes — 100K inserts/second can halve write throughput if the JSONB column is indexed naively.
  • Partial indexes solve the write-throughput problem for dimension tables: CREATE INDEX ... WHERE status != 'ARCHIVED' covers the 5% of active rows at a fraction of the cost.

The cost of being wrong here is permanent: legacy apps inherit the bad column order, and any attempt to drop the index for remediation risks query regressions in undocumented code paths.

Act 3 — Statistics: the planner picks plans from histograms

Postgres samples the table during ANALYZE and builds a histogram per column. The planner uses histograms to estimate row counts; the cost model ranks plans. When statistics are stale, estimates are wrong.

A 1000× off estimate produces a 1000× slower plan: the planner picks a nested loop expecting 10 rows; the real cardinality is 10000 rows; the loop scans 10M index entries instead of one hash join over a 10K-row build side.

Defence:

  • autovacuum_analyze_scale_factor = 0.05 (or smaller, 0.01) on busy tables — at the default 0.1, a table that grows by 1M rows per day has stale stats within hours.
  • CREATE STATISTICS (dependencies) ON user_id, created_at FROM events teaches the planner column correlations (new users tend to appear in recent data; the planner assumes independence by default).
  • pg_stat_statements to find the offending plans by total execution time.
  • auto_explain.log_min_duration = '500ms' for tail-latency triage.

The trap at scale: a single bad estimate cascades. The planner caches plan IDs (pg_stat_statements.plan_id); a human has to find and force a replan by dropping and re-creating the query’s prepared statement, or by running DISCARD PLANS.

Acts 1–3 cost and fix times
Schema: fix wrong type at 1M rows
hours
Schema: fix wrong type at 100M rows
days + maintenance window
Index lookup in shared_buffers (L3 cache)
5–50 µs
Index lookup from SSD
5–10 ms
Seq-scan of 1 GB table on SSD
1–3 s
autovacuum analyze on default 0.1 factor
triggers after 100K new rows
Why this works

A work_mem limit (default 4 MB) affects sorts and hash joins. If the planner’s estimates are wrong and the actual data requires more work_mem than available, the query spills to disk and becomes very slow. Re-analyzing solves the estimate problem; increasing work_mem solves the hash-build problem; neither is a long-term substitute for correct statistics.

Quiz

An index exists on (user_id, created_at). A query filters only by created_at. What happens?

Quiz

autovacuum_analyze_scale_factor = 0.1 (the default). A table grows by 1M rows per day. How often does ANALYZE run?

Quiz

You add a surrogate BIGSERIAL id to a table that already uses email as the natural key. What is the primary benefit at year-3 scale?

Recall before you leave
  1. 01
    Why is schema described as the 'one irreversible decision' and what does that mean operationally?
  2. 02
    Explain the leading-column rule and give an example of how violating it causes a production incident.
  3. 03
    What is the cascade failure when statistics are stale and how do autovacuum_analyze_scale_factor and CREATE STATISTICS fix it?
Recap

Acts 1–3 are the foundation that every later act builds on. Schema is irreversible — wrong types and missing surrogate keys become multi-day migrations at 100M rows; the rule is NOT NULL by default, FK unless you have measured throughput pressure, and surrogate keys for external references. The leading-column rule governs every B-tree index: an index on (a, b) does not help b-alone queries, and getting this wrong permanently slows the queries that need b — the fix is a new index, but the old bad index cannot be safely dropped without an audit. Statistics decay as the table grows faster than autovacuum’s default threshold; the defence is a lower autovacuum_analyze_scale_factor on high-growth tables and CREATE STATISTICS for correlated columns. All three acts share a theme: the cost of the right decision at Day 0 is near zero; the cost of retrofitting the wrong decision at Year 3 is exponential.

Connected lessons
appears again in258
Continue the climb ↑Acts 4–6 in depth: MVCC bloat, connection pooling, and safe migrations
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.