Databases DB · 02 · 10

Indexes: design and audit a real index set

Hands-on project — design, deploy, and audit the index set for a realistic multi-tenant table, proving every choice with EXPLAIN (ANALYZE, BUFFERS) before/after numbers.

DB Senior ◷ 240 min

Level

FoundationsJuniorMiddleSenior

Reading about the leading-column rule is not the same as watching a 4-second dashboard query drop to under a millisecond because you got the composite, the partial predicate, and the INCLUDE list right. Build a realistic multi-tenant schema, load it at scale, and turn every index decision into a measured before/after — with EXPLAIN as the witness at each step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: load a table to production-realistic scale, profile the hot queries, design the deliberate index set (composite + partial + INCLUDE + the right types), deploy without locking, prove each query with EXPLAIN, then run the audit playbook and reclaim what is wasted.

Project

0 of 7

Objective

Take a multi-tenant tasks/comments schema (the starter below or your own), load it to ~50-100M rows with a realistic status skew, and design, deploy, and audit a complete index set so every hot query runs as an Index Only Scan or selective Index Scan — each proven with EXPLAIN (ANALYZE, BUFFERS) before/after, with total index storage held under ~20% of table size.

Requirements

Acceptance criteria

A before/after table per hot query: scan type (Seq Scan to Index Only / Index Scan), Heap Fetches, Buffers, and total execution time — measured at full scale, not estimated.
Every intended index-only scan shows Heap Fetches: 0 after VACUUM, and no query that should be index-served still shows a Seq Scan or a post-scan Sort node.
The parent-delete cascade demonstration shows the multi-second/minutes Seq-Scan cascade collapsing to a fast index lookup once comments' FK column is indexed.
Total index storage is under ~20% of table size, with a one-paragraph budget accounting (each index's size, why it exists, which query it serves) and the leading-column justification for each composite.
An audit summary naming each unused/redundant/bloated index found and the storage reclaimed, plus confirmation that no hot query regressed after the drops.

Senior stretch

Add pg_trgm fuzzy search (ILIKE-anywhere / similarity on title) and pgvector HNSW semantic search on an embedding column; compare index size, write cost, and recall against the GIN tsvector baseline.
Demonstrate the implicit-coercion trap end to end: show WHERE bigint_col = '42' (TEXT literal) seq-scanning, then the typed-parameter version using the index — capture both plans.
Set up a HOT-update experiment: lower fillfactor to 70 on an update-heavy table, drive updates that change no indexed column, and show the drop in index writes (HOT n_tup_hot_upd in pg_stat_user_tables) versus fillfactor 100.
Add a CI gate: run EXPLAIN (ANALYZE) on the five hot queries against a seeded canary on every migration and fail the build if any plan regresses to a Seq Scan, a Sort node, or Heap Fetches above a threshold.

Recap

This is the loop you will run in every real indexing task: load to realistic scale, baseline the hot queries with EXPLAIN (ANALYZE, BUFFERS), design the deliberate set (composite around the always-present filter, partial for the hot subset, INCLUDE for the projection, the right type for the data shape), deploy CONCURRENTLY and verify indisvalid, index every FK column, prove each query is index-served with Heap Fetches: 0, then audit for unused/redundant/bloated indexes and reclaim the waste. Doing it once on a realistic schema makes the production version muscle memory.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.