Databases DB · 02 · 01

What an index is and how it speeds up queries

An index is a sorted copy of column values that lets Postgres skip full table scans. Every index trades write speed and storage for read speed.

DB Junior ◷ 12 min

Level

FoundationsJuniorMiddleSenior

Already know this unit? Take a 1-minute quick check →

A team’s “filter by status” dashboard takes 50ms on 100k rows in staging. In production with 50M rows it takes 8 seconds. The schema is correct. The query is correct. One DDL line — CREATE INDEX CONCURRENTLY ON orders(status) — brings it back to 20ms. That line is the entirety of this lesson.

By the end of this lesson you will know exactly why that one line works, what it costs, and when adding it makes things worse instead of better.

What an index is

An index is a separate data structure — usually a B-tree — that stores a sorted copy of one or more column values plus pointers to the actual rows. When a query filters by an indexed column, Postgres navigates the tree in O(log n) steps instead of scanning every row.

The book-index metaphor holds precisely: a book’s back-matter index is a sorted list of terms with page numbers. To find “MVCC”, you open the index (one O(log n) lookup), read the page number, and jump there. Without the index you scan every page. A database index is identical — a copy of column values, organized for fast lookup by one specific question.

Without index	With index
Sequential scan — every row touched	B-tree walk — O(log n) then fetch matching rows
1M rows: ~10ms; 100M rows: ~2–10s	1M or 100M rows: ~1–5ms
No extra storage, no write overhead	~10–30% table size per index; every write maintains the index

The cost: indexes tax every write

Why not index every column and call it a day? Because every index you add is a permanent tax on every future write. The question is not “could an index help this query?” but “is the read speedup worth the write slowdown on this table?”

Every index is a separate on-disk structure that must stay synchronized with the table. Every INSERT, UPDATE, or DELETE that touches an indexed column must also update the index — costing CPU and I/O. A table with ten unused indexes on a write-heavy workload spends 10× more on writes than necessary.

Typical per-index write overhead: 5–50 µs per write. At 10,000 inserts/second with ten indexes, that is an extra 0.5–5ms of overhead per second of clock time — measurable, and often the bottleneck when teams over-index.

Index storage on disk: roughly bytes_per_key × row_count × 1.3. For a BIGINT key (8 bytes), a 100M-row table produces a ~1 GB index. Each index beyond the first multiplies this.

A concrete scenario

Consider: SELECT * FROM orders WHERE user_id = 42 on a table with 10M rows. Without an index: sequential scan, ~2 seconds. After CREATE INDEX ON orders(user_id): index lookup plus heap fetch, ~3 milliseconds. Same query, 600× speedup, one DDL line.

The index buys a 600x read speedup, but you pay for it on every write and on disk — that is the whole tradeoff this lesson is about.

Postgres ships six index types. B-tree is the default and handles 95% of production use — equality, range, ORDER BY, LIKE 'abc%'. The other five (GIN, GiST, BRIN, Hash, Bloom) serve specific data shapes and are covered in lesson 04.

Production rule: always use CONCURRENTLY

CREATE INDEX takes an ACCESS EXCLUSIVE lock — all reads and writes are blocked for the duration. On a table with active traffic, even a 5-minute lock is an incident.

CREATE INDEX CONCURRENTLY uses a multi-phase build that does not take the exclusive lock. It scans the table twice and waits for in-flight transactions between phases. It is 2–3× slower and cannot be wrapped in a transaction, but it produces no visible impact on application traffic.

If a CONCURRENTLY build fails (e.g., a unique-index encounters a duplicate, or the process is killed), the index is left in an INVALID state (indisvalid = false in pg_indexes). The planner treats it as non-existent. Fix: DROP INDEX CONCURRENTLY and retry.

▸Why this works

Why does Postgres not just index every column by default? Because every index on a write-heavy table is a write tax. A table with 20 indexes and 50k inserts/second can spend more time maintaining indexes than writing rows. The discipline is intentional: measure the hot queries, add indexes for those, audit and drop the unused. Over-indexing and under-indexing are equally damaging production patterns.

Order the steps

Order the steps to add an index correctly in production:

1 Run EXPLAIN ANALYZE on the slow query to confirm a seq scan is the bottleneck
2 Identify the columns being filtered or sorted
3 Pick the index type (B-tree default; GIN for JSONB; GiST for geo)
4 Use CREATE INDEX CONCURRENTLY to avoid locking the table
5 Wait for the build to complete (can take minutes to hours on large tables)
6 Re-run EXPLAIN ANALYZE to confirm the planner now uses the index
7 Monitor write latency after deploy — every index adds a small write cost

Quiz

An index is created on the column users(email). Which query benefits from it?

Quiz

What is the cost of adding an index?

Complete the analogy

Fill in the blank: a database index is to a SQL query what a book _______ is to finding a topic — a copy of part of the content, organised for fast lookup.

Instead of touching every row (sequential scan), the index walks the tree top-down to the matching leaf, then follows the pointer to fetch only the rows that match.

Recall before you leave

01
In two sentences, why are indexes not free and why do production teams not just index every column?
02
What happens if CREATE INDEX CONCURRENTLY fails mid-way, and how do you recover?
03
A query does a sequential scan on a 50M-row table. You add an index. What does EXPLAIN ANALYZE show before and after?

Recap

An index is a separate sorted data structure — typically a B-tree — that stores copies of one or more column values with pointers to the matching heap rows. It turns full-table sequential scans (O(n)) into tree walks (O(log n)), reducing a 2-second query on 10M rows to 3ms. The tradeoff: every index costs 10–30% of table size in storage and 5–50 µs per write per index. Always add indexes with CREATE INDEX CONCURRENTLY to avoid table-level locks in production. If the build fails, check pg_indexes for indisvalid = false and retry after dropping the invalid entry.

Now when you see a slow query on a large table, your first instinct should be EXPLAIN ANALYZE — and if it shows a Seq Scan, you know exactly what to reach for and what it will cost.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

What a relation is: tables, rows, keys, and constraintsjunior

unlocks

deepens into

appears again in177

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Grounded RAG ServiceA RAG demo that answers from a corpus is easy; a RAG service you'd trust in front of users is not. The hard part isn't retrieval, it's grounding: making the model say only what the retrieved text supports, attaching citations the reader can check, and proving with an eval set that the answers don't drift into confident fiction. You'll build the whole loop — chunk, embed, store, retrieve top-k, ground, cite, score — and feel exactly where it leaks.URL shortener at scaleBuild a URL shortener that survives real traffic — then run it: deploy it, watch it, and work the incident when one hot link melts your cache.