
Index design exercise: full-text search strategy

A synthesis exercise: choose the right index strategy for a full-text search feature, evaluate tradeoffs between GIN tsvector, pg_trgm, Elasticsearch, and pgvector HNSW, and design indexes for a complete ticketing system.

DB Senior ◷ 20 min

The team ships a search box. Users type in it. The query is WHERE title ILIKE '%invoice%'. In staging with 10k tasks this is fine. In production with 50M tasks it takes 12 seconds. ILIKE with a leading wildcard cannot use a B-tree index, so Postgres reads every row. The fix is not “add an index on title.” It is choosing the right index type for the question being asked. That choice depends on whether users want keyword match, fuzzy match, or semantic match, and this lesson works through all three.

Why ILIKE cannot scale

WHERE title ILIKE '%term%' has a leading wildcard. B-tree indexes require a known prefix; they cannot answer “does this string contain the term anywhere.” With only B-tree indexes available, every row must be evaluated: a sequential scan, O(n) in table size.
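The difference between an anchored prefix and a leading wildcard can be checked with EXPLAIN. This is a minimal sketch: the index name idx_tasks_title_btree is hypothetical and exists only for the comparison, and on a small table the planner may choose a sequential scan for both queries regardless.

```sql
-- Hypothetical B-tree index on title, created only for this comparison.
-- text_pattern_ops lets LIKE prefix matching use the index under non-C collations.
CREATE INDEX idx_tasks_title_btree ON tasks (title text_pattern_ops);

-- Anchored prefix: the planner can rewrite this as a range scan on the B-tree.
EXPLAIN SELECT id FROM tasks WHERE title LIKE 'invoice%';

-- Leading wildcard: no known prefix, so expect a sequential scan with a filter.
EXPLAIN SELECT id FROM tasks WHERE title ILIKE '%invoice%';
```

Comparing the two plans on production-sized data makes the cost visible before any index type is chosen.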

The three scalable alternatives:

GIN on tsvector — word-level inverted index; answers “which documents contain this word or phrase”; understands stemming and language rules.
pg_trgm GIN — trigram decomposition; answers “which strings contain this substring or are similar to this string”; handles typos and partial input.
pgvector HNSW — graph-based approximate nearest-neighbour on embeddings; answers “which documents are semantically similar to this query”; requires a model inference pipeline.

Each option solves a different question: GIN tsvector for “does this document contain the word?”, pg_trgm for “does this string look like the input?”, HNSW for “does this document mean the same thing as the query?”. Picking the wrong one is not just slow — it is fundamentally unable to answer what the user is asking.

GIN on tsvector: the Postgres-native choice

-- Add a generated tsvector column (Postgres 12+)
ALTER TABLE tasks
  ADD COLUMN tsv_search TSVECTOR
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

-- GIN index on the generated column
CREATE INDEX CONCURRENTLY idx_tasks_search
  ON tasks USING GIN (tsv_search);

-- Query
SELECT id, title, ts_rank(tsv_search, query) AS rank
FROM tasks, to_tsquery('english', 'invoice & payment') AS query
WHERE tsv_search @@ query
ORDER BY rank DESC
LIMIT 20;

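Trade: the index can be large relative to the indexed text, and the ratio depends on vocabulary and document length, so measure it on production-shaped data rather than assuming a fixed multiple. Each insert updates the GIN posting list of every distinct lexeme in the document: a 200-word document touches up to ~200 GIN entries (fewer after stop-word removal and deduplication). fastupdate=on (the default) defers these updates into a pending list. Writes are fast, but the pending list must eventually be flushed (when it exceeds gin_pending_list_limit, or during vacuum), causing periodic latency spikes. Use fastupdate=off for write-heavy tables where consistent latency matters more than peak throughput.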
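The fastupdate trade can be expressed as index storage parameters. The sketch below assumes the idx_tasks_search index from above; the 1024 kB limit is an illustrative value, not a recommendation, and the two ALTER INDEX statements are alternatives rather than steps to run together.

```sql
-- Option 1: disable the pending list so each write updates posting lists directly.
ALTER INDEX idx_tasks_search SET (fastupdate = off);

-- Turning fastupdate off does not flush entries already pending; flush them explicitly.
SELECT gin_clean_pending_list('idx_tasks_search'::regclass);

-- Option 2: keep fastupdate on but cap the pending list for this index.
-- The value is in kilobytes; 1024 is a hypothetical setting to tune against real traffic.
ALTER INDEX idx_tasks_search SET (fastupdate = on, gin_pending_list_limit = 1024);
```

A smaller pending list means more frequent but shorter flushes, which trades peak write throughput for steadier latency.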

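Does not handle: typos (invoce), substrings in the middle of a word (voic), semantic similarity. Prefixes are the exception: to_tsquery('english', 'inv:*') matches lexemes that start with inv.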
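These limits are easy to see without a table, by matching a literal tsvector against a few queries. A small sketch using only built-in functions and the english configuration; the sample text is made up for illustration.

```sql
-- Stemming: 'Invoices' and 'invoice' reduce to the same lexeme, so this matches.
SELECT to_tsvector('english', 'Invoices overdue')
       @@ to_tsquery('english', 'invoice') AS stem_match;

-- Typo: 'invoce' reduces to a different lexeme, so this does not match.
SELECT to_tsvector('english', 'Invoices overdue')
       @@ to_tsquery('english', 'invoce') AS typo_match;

-- Prefix: ':*' matches any lexeme that starts with 'inv', so this matches.
SELECT to_tsvector('english', 'Invoices overdue')
       @@ to_tsquery('english', 'inv:*') AS prefix_match;

-- Mid-word fragment: no lexeme starts with 'voic', so this does not match.
SELECT to_tsvector('english', 'Invoices overdue')
       @@ to_tsquery('english', 'voic:*') AS infix_match;
```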

pg_trgm: fuzzy and substring search

The pg_trgm extension breaks strings into trigrams (3-character windows) and builds a GIN or GiST index over them. This enables:

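LIKE '%term%' and ILIKE '%term%' with index support (a pattern shorter than three characters yields no trigrams and degenerates to a full index scan)
Similarity search (% operator) for typo tolerance
<-> similarity distance ordering (index-assisted only with a GiST index; GIN does not support ordering by distance)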

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_tasks_title_trgm
  ON tasks USING GIN (title gin_trgm_ops);

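-- Now these can use the index:
SELECT * FROM tasks WHERE title ILIKE '%invoice%';
-- % compares the whole title with the input, so it suits short strings
SELECT * FROM tasks WHERE title % 'invoce';
-- <% compares the input with the closest run of words inside a longer title
SELECT * FROM tasks WHERE 'invoce' <% title;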

Trade: trigram indexes are large (comparable to a tsvector GIN index, and larger relative to the data on short strings), and similarity queries are slower than lexeme lookups in a tsvector GIN index. Best for short strings (usernames, SKUs, titles) where fuzzy matching is the dominant use case.
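To see what the index actually stores and how the match threshold behaves, the pg_trgm helper functions can be called directly. A sketch with made-up sample strings; the 0.4 threshold is a hypothetical value chosen only to show the setting.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- The trigrams extracted from a string (padded at word boundaries).
SELECT show_trgm('invoice');

-- Whole-string similarity versus the best match against a run of words in a longer string.
SELECT similarity('invoice', 'invoce')                          AS whole_string,
       word_similarity('invoce', 'Overdue invoice for March')   AS best_word_run;

-- The % operator returns true above this threshold (default 0.3).
SHOW pg_trgm.similarity_threshold;

-- Raising it trades recall for precision; 0.4 here is illustrative only.
SET pg_trgm.similarity_threshold = 0.4;
```

Checking real user inputs against real titles this way shows whether % or <% is the better operator before the index is built.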

pgvector HNSW: semantic search

Embeddings represent meaning numerically. Two semantically similar documents have embeddings that are close in vector space. HNSW (Hierarchical Navigable Small World) is a graph-based index for approximate nearest-neighbour search on high-dimensional vectors.

CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE tasks ADD COLUMN embedding VECTOR(1536);  -- e.g., OpenAI text-embedding-3-small

CREATE INDEX CONCURRENTLY idx_tasks_embedding_hnsw
  ON tasks USING HNSW (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Semantic search: find tasks semantically similar to a query embedding
SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
FROM tasks
ORDER BY embedding <=> $1::vector
LIMIT 20;

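Trade: HNSW indexes are large. The raw vectors alone are dimension × 4 bytes × row count: a 1,536-dimension embedding on 10M rows is ~60 GB, and the graph links add to that. Writes are slow because each insert searches the graph and links the new node to its neighbours. The index is approximate: recall@10 is typically 95-99%, depending on m, ef_construction, and the query-time hnsw.ef_search setting. An embedding pipeline (model inference per document and query) is required.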
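The size estimate and the recall knob can both be checked from a psql session. A sketch assuming the pgvector extension is installed; the ef_search and maintenance_work_mem values are hypothetical starting points, not tuned recommendations.

```sql
-- Raw vector payload: pgvector stores 4 bytes per dimension plus an 8-byte header per value.
SELECT pg_size_pretty(10000000::bigint * (1536 * 4 + 8)) AS raw_vector_bytes;

-- Query-time recall knob: a larger candidate list raises recall and latency.
-- pgvector's default is 40; 100 is an illustrative value.
SET hnsw.ef_search = 100;

-- Index builds are much faster when the graph fits in maintenance_work_mem.
-- '8GB' is a hypothetical setting; size it to the host.
SET maintenance_work_mem = '8GB';
```

Measuring recall against an exact (sequential) nearest-neighbour query on a sample is the only reliable way to pick ef_search for a given dataset.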

In 2026 production AI/ML systems, pgvector + HNSW is the canonical choice for semantic search when Postgres is the primary store. For scale or advanced feature requirements beyond pgvector’s reach (hundreds of millions of vectors, real-time filtering), dedicated vector databases (Pinecone, Weaviate, Milvus) are alternatives.

Hybrid: combining approaches

For most product search requirements, a hybrid approach combines exact-match and semantic results:

-- GIN tsvector for keyword ranking, HNSW for semantic similarity ranking.
-- Each branch is parenthesised so it keeps its own ORDER BY and LIMIT.
-- $1 must be valid tsquery syntax; websearch_to_tsquery accepts raw user input.
(SELECT id, title, ts_rank(tsv_search, query) AS kw_rank, NULL::float8 AS sem_rank
 FROM tasks, to_tsquery('english', $1) AS query
 WHERE tsv_search @@ query
 ORDER BY kw_rank DESC
 LIMIT 50)
UNION ALL
(SELECT id, title, NULL::real AS kw_rank, 1 - (embedding <=> $2::vector) AS sem_rank
 FROM tasks
 ORDER BY embedding <=> $2::vector
 LIMIT 50);
-- Application layer: merge by id, put both scores on a common scale
-- (or fuse by rank position), re-sort, take top 20

This is more complex to implement but provides both exact-match recall and semantic relevance.
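One way to do the merge inside Postgres is reciprocal rank fusion, which combines rank positions instead of raw scores and so avoids mixing ts_rank with cosine similarity. This is a swapped-in technique, not the lesson's own merge step; the constant 60 and the 50-row candidate limits are conventional assumptions, and $1 / $2 stand for the query text and query embedding.

```sql
WITH kw AS (
  -- Top keyword candidates with their rank position.
  SELECT id,
         row_number() OVER (ORDER BY ts_rank(tsv_search, q) DESC) AS pos
  FROM tasks, websearch_to_tsquery('english', $1) AS q
  WHERE tsv_search @@ q
  ORDER BY ts_rank(tsv_search, q) DESC
  LIMIT 50
),
sem AS (
  -- Top semantic candidates with their rank position.
  SELECT id,
         row_number() OVER (ORDER BY embedding <=> $2::vector) AS pos
  FROM tasks
  ORDER BY embedding <=> $2::vector
  LIMIT 50
)
SELECT id,
       -- A document missing from one list contributes 0 for that list.
       coalesce(1.0 / (60 + kw.pos), 0) + coalesce(1.0 / (60 + sem.pos), 0) AS rrf_score
FROM kw
FULL OUTER JOIN sem USING (id)
ORDER BY rrf_score DESC
LIMIT 20;
```

Documents that appear in both candidate lists rise to the top, while a strong hit in only one list still survives the merge.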

Pick the best fit

A new full-text search feature on a 'documents' table (50M rows) needs an index. Pick the right strategy.

Which RFC?

Which Postgres version introduced the INCLUDE clause in CREATE INDEX, enabling covering indexes without the included columns affecting the sort key?

Quiz

A team adds an FK with ON DELETE CASCADE from comments(post_id) to posts(id) but does NOT index comments(post_id). What happens on DELETE FROM posts WHERE id = 42 if comments has 100M rows?

Design challenge

Design the complete index set for a ticketing system. Table: tasks (id BIGSERIAL PK, workspace_id BIGINT, project_id BIGINT, assignee_user_id BIGINT, status TEXT, priority SMALLINT, title TEXT, body TEXT, ticket_id TEXT, created_at TIMESTAMPTZ). Scale: 100M tasks; 80% done (cold), 15% open (hot), 5% in_progress (hot). Budget: total index storage under 20% of table size. Five hot queries listed below.

Query A: list open/in_progress tasks in a project, ordered by priority then created_at.
Query B: list open/in_progress tasks assigned to a specific user across all projects in a workspace.
Query C: find task by ticket_id (unique per workspace).
Query D: full-text search across task titles and bodies.
Query E: find tasks created in the last 24 hours in a workspace.

Reference answer

Index 1: idx_tasks_ws_project_open. btree (workspace_id, project_id, priority, created_at DESC) INCLUDE (id, title, assignee_user_id) WHERE status IN ('open','in_progress'). Serves Query A. Leading workspace_id scopes to tenant. project_id narrows to project. priority + created_at DESC serve the ORDER BY. The partial WHERE cuts the index to 20% of the table. INCLUDE makes it covering for the typical list projection.
Index 2: idx_tasks_ws_assignee_open. btree (workspace_id, assignee_user_id, created_at DESC) INCLUDE (id, project_id, title, priority) WHERE status IN ('open','in_progress'). Serves Query B. workspace_id as tenant scope, assignee_user_id for the user filter, created_at for sort. Same partial predicate as Index 1.
Index 3: idx_tasks_ws_ticket_unique. btree UNIQUE (workspace_id, ticket_id). Serves Query C. Point lookup in microseconds. Also enforces per-tenant uniqueness.
Index 4: idx_tasks_search. GIN on stored generated column tsv_search = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')). Serves Query D. Full-text on title + body.
Index 5: idx_tasks_ws_recent. btree (workspace_id, created_at DESC) INCLUDE (id, title, status, assignee_user_id). Serves Query E. No status filter (shows all statuses in the 24h feed).

Cost: Index 1 and 2 are partial (20% of 100M = 20M entries each), ~600 MB each. Index 3 covers the full 100M rows (ticket_id is short text), ~2 GB. Index 4 GIN ~3-5 GB depending on document length. Index 5 covers the full 100M rows, ~1.5 GB. Total ~7.7-9.7 GB on a ~40 GB table. The budget is 20% of 40 GB = 8 GB, so the set fits only if the GIN index lands near the low end of its range. Measure after building, and trim INCLUDE columns first if the total overshoots.

Write overhead: each insert touches all 5 indexes. An open→done status update inserts nothing into partial indexes 1 and 2, because the new row version fails their predicate; the stale entries are removed later by vacuum. Drop any pre-existing single-column index on workspace_id: the non-partial Index 3 and Index 5 both cover that prefix (the partial Index 1 and 2 cannot serve queries outside their predicate).
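The reference answer translates into the DDL below. A sketch under two assumptions: the tsv_search generated column has already been added as shown earlier in the lesson (adding a STORED generated column rewrites the table, so plan that separately at 100M rows), and the index names are the ones used in the answer. Note that in CREATE INDEX the INCLUDE clause comes before WHERE.

```sql
-- Query A: open/in_progress tasks in a project, ordered by priority then created_at.
CREATE INDEX CONCURRENTLY idx_tasks_ws_project_open
  ON tasks (workspace_id, project_id, priority, created_at DESC)
  INCLUDE (id, title, assignee_user_id)
  WHERE status IN ('open', 'in_progress');

-- Query B: open/in_progress tasks for one assignee across a workspace.
CREATE INDEX CONCURRENTLY idx_tasks_ws_assignee_open
  ON tasks (workspace_id, assignee_user_id, created_at DESC)
  INCLUDE (id, project_id, title, priority)
  WHERE status IN ('open', 'in_progress');

-- Query C: lookup by ticket_id, which also enforces per-workspace uniqueness.
CREATE UNIQUE INDEX CONCURRENTLY idx_tasks_ws_ticket_unique
  ON tasks (workspace_id, ticket_id);

-- Query D: full-text search over the stored generated tsvector column.
CREATE INDEX CONCURRENTLY idx_tasks_search
  ON tasks USING gin (tsv_search);

-- Query E: tasks created recently in a workspace, all statuses.
CREATE INDEX CONCURRENTLY idx_tasks_ws_recent
  ON tasks (workspace_id, created_at DESC)
  INCLUDE (id, title, status, assignee_user_id);
```

For the partial indexes to be chosen, the query's WHERE clause must imply the index predicate, for example status IN ('open', 'in_progress') or status = 'open'.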

Should cover

Partial WHERE status IN ('open','in_progress') cuts index size by 80% and makes writes to done tasks cheaper.
INCLUDE columns cover the typical projection without bloating the sort key.
A UNIQUE index on (workspace_id, ticket_id) doubles as both the constraint enforcement and the lookup index.
GIN on a STORED GENERATED tsvector column is the Postgres-native full-text answer — no extra infrastructure.
Budget accounting: sum index sizes; verify they are under 20% of table size; verify write overhead is acceptable.
After design, drop redundant prefix indexes that composites now cover.

The partial composites (WHERE status IN ('open','in_progress')) are the cheapest entries because they index only the hot 20% of rows; GIN tsvector dominates. At the midpoint estimate the five sum to ~8.6 GB, about 21.5% of the ~40 GB table, so the 20% budget (8 GB) is met only if the GIN index comes in near 3 GB.
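Because the estimate sits so close to the budget, it is worth checking real sizes once the indexes exist. A sketch using Postgres's built-in size functions; it assumes the table is named tasks and is non-empty, and that "table size" means heap plus TOAST (what pg_table_size reports).

```sql
-- Per-index size and share of the table, largest first.
SELECT i.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
       round(100.0 * pg_relation_size(i.indexrelid) / pg_table_size(i.relid), 1) AS pct_of_table
FROM pg_stat_user_indexes AS i
WHERE i.relname = 'tasks'
ORDER BY pg_relation_size(i.indexrelid) DESC;

-- Budget check: all indexes on the table (including the primary key) against 20% of table size.
SELECT pg_size_pretty(pg_indexes_size('tasks')) AS all_indexes,
       pg_size_pretty(pg_table_size('tasks'))   AS table_size,
       pg_indexes_size('tasks') <= 0.20 * pg_table_size('tasks') AS within_budget;
```

The idx_scan column of the same pg_stat_user_indexes view is a natural next stop for spotting indexes that cost storage without serving queries.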

Why this works

Why does Postgres need six index types when most other databases default to one? Because the fundamental data structures are incompatible. B-tree requires a total order and answers equality and range predicates. Containment inside a JSONB document is not a range question. Geometric shapes require spatial predicates. Text search requires word-level inverted lists. Embedding similarity requires high-dimensional graph navigation. A single universal index structure would either be astronomically expensive or unable to express the right operations. The design trade (multiple specialized types, each fit for its data shape) keeps query performance practical at production scale.

Without a suitable index, ILIKE '%term%' is O(n). The question being asked decides the type: word-level keyword search uses GIN tsvector; typo and substring search uses pg_trgm trigrams; meaning-based search uses pgvector HNSW with an embedding pipeline. Hybrid search merges keyword and semantic results.

Recall before you leave

01. Explain why GIN tsvector is the default full-text search choice in Postgres, and what its limitations are.
02. Design an index for a query that: filters by workspace_id and status, sorts by created_at DESC, and projects only id and title. Explain every choice.
03. When would you choose pgvector HNSW over GIN tsvector for search, and what are the operational costs?

Recap

WHERE title ILIKE '%term%' is O(n) when only B-tree indexes exist: leading wildcards defeat B-tree. Scalable alternatives: GIN on a tsvector STORED GENERATED column for keyword full-text (stemming, ts_rank, no extra infrastructure); pg_trgm GIN for fuzzy and substring matching (typo tolerance, ILIKE-anywhere); pgvector HNSW for semantic embedding search (natural language, requires model inference pipeline).

For a ticketing system at 100M rows, the deliberate index set: two partial composites (WHERE status IN ('open','in_progress')) for the hot 20% of data, covering the open-tasks dashboard queries; a UNIQUE composite for per-tenant ticket_id; a GIN tsvector for full-text; a full-table composite for the recent-24h feed. The estimated total sits at roughly 20% of table size, so measure the GIN index before committing to the budget.

The INCLUDE clause (Postgres 11+) adds projection columns to index leaves without affecting the sort key, enabling index-only scans for typical list projections. Partial indexes cut size proportionally to the selectivity of the WHERE clause — the most underused performance lever in production Postgres schemas.

Now when someone asks “why is our search slow?”, you know the first question is not “do we have an index?” but “which question are users asking — keyword, fuzzy, or semantic?” The answer determines the index type, the infrastructure cost, and the operational complexity. Pick deliberately.
