awesome-everything RU
↑ Back to the climb

Databases

Index design exercise: full-text search strategy

Crux A synthesis exercise: choose the right index strategy for a full-text search feature, evaluate tradeoffs between GIN tsvector, pg_trgm, Elasticsearch, and pgvector HNSW, and design indexes for a complete ticketing system.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 20 min

The team ships a search box. Users type in it. The query is WHERE title ILIKE '%invoice%'. In staging with 10k tasks this is fine. In production with 50M tasks it takes 12 seconds. ILIKE with a leading wildcard cannot use a B-tree. The fix is not “add an index on title.” It is choosing the right index type for the question being asked. That choice depends on whether users want keyword match, fuzzy match, or semantic match — and this lesson works through all three.

Why ILIKE cannot scale

WHERE title ILIKE '%term%' has a leading wildcard. B-tree indexes require a known prefix; they cannot answer “does this string contain the term anywhere.” Every row must be evaluated — this is always O(n).

The three scalable alternatives:

  1. GIN on tsvector — word-level inverted index; answers “which documents contain this word or phrase”; understands stemming and language rules.
  2. pg_trgm GIN — trigram decomposition; answers “which strings contain this substring or are similar to this string”; handles typos and partial input.
  3. pgvector HNSW — graph-based approximate nearest-neighbour on embeddings; answers “which documents are semantically similar to this query”; requires a model inference pipeline.

GIN on tsvector: the Postgres-native choice

-- Add a generated tsvector column (Postgres 12+)
ALTER TABLE tasks
  ADD COLUMN tsv_search TSVECTOR
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

-- GIN index on the generated column
CREATE INDEX CONCURRENTLY idx_tasks_search
  ON tasks USING GIN (tsv_search);

-- Query
SELECT id, title, ts_rank(tsv_search, query) AS rank
FROM tasks, to_tsquery('english', 'invoice & payment') AS query
WHERE tsv_search @@ query
ORDER BY rank DESC
LIMIT 20;

Trade: GIN indexes are 2-5x the size of the column data. Each insert updates GIN posting lists for every lexeme in the document. For documents with 200 words, each insert touches ~200 GIN entries. fastupdate=on (the default) defers these updates — writes are fast, but the deferred buffer must eventually flush, causing periodic latency spikes. Use fastupdate=off for write-heavy tables where consistent latency matters more than peak throughput.

Does not handle: typos (invoce), substring matches inside words (inv), semantic similarity.

The pg_trgm extension breaks strings into trigrams (3-character windows) and builds a GIN or GiST index over them. This enables:

  • LIKE '%term%' and ILIKE '%term%' with index support
  • Similarity search (% operator) for typo tolerance
  • <-> similarity distance ordering
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_tasks_title_trgm
  ON tasks USING GIN (title gin_trgm_ops);

-- Now these use the index:
SELECT * FROM tasks WHERE title ILIKE '%invoice%';
SELECT * FROM tasks WHERE title % 'invoce';  -- similarity, handles typo

Trade: trigram indexes are large (similar to GIN on tsvector or larger for short strings), and similarity queries are slower than exact-match GIN searches. Best for short strings (usernames, SKUs, titles) where fuzzy matching is the dominant use case.

Embeddings represent meaning numerically. Two semantically similar documents have embeddings that are close in vector space. HNSW (Hierarchical Navigable Small World) is a graph-based index for approximate nearest-neighbour search on high-dimensional vectors.

CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE tasks ADD COLUMN embedding VECTOR(1536);  -- e.g., OpenAI text-embedding-3-small

CREATE INDEX CONCURRENTLY idx_tasks_embedding_hnsw
  ON tasks USING HNSW (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Semantic search: find tasks semantically similar to a query embedding
SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
FROM tasks
ORDER BY embedding <=> $1::vector
LIMIT 20;

Trade: HNSW indexes are large (vector dimension × row count; a 1,536-dimension embedding on 10M rows = ~60 GB). Writes are slow — graph rebalancing per insert. The index is approximate: recall@10 is typically 95-99%. An embedding pipeline (model inference per document and query) is required.

In 2026 production AI/ML systems, pgvector + HNSW is the canonical choice for semantic search when Postgres is the primary store. For scale or advanced feature requirements beyond pgvector’s reach (hundreds of millions of vectors, real-time filtering), dedicated vector databases (Pinecone, Weaviate, Milvus) are alternatives.

Hybrid: combining approaches

For most product search requirements, a hybrid approach combines exact-match and semantic results:

-- GIN tsvector for exact keyword ranking
-- HNSW for semantic similarity ranking
-- Application layer: merge and re-rank results
SELECT id, title, ts_rank(tsv_search, query) AS kw_rank, NULL AS sem_rank
FROM tasks, to_tsquery('english', $1) AS query
WHERE tsv_search @@ query
UNION ALL
SELECT id, title, NULL AS kw_rank, 1 - (embedding <=> $2::vector) AS sem_rank
FROM tasks
ORDER BY embedding <=> $2::vector
LIMIT 50;
-- Merge by id, sum ranks, re-sort, take top 20

This is more complex to implement but provides both exact-match recall and semantic relevance.

Pick the best fit

A new full-text search feature on a 'documents' table (50M rows) needs an index. Pick the right strategy.

Which RFC?

Which Postgres version introduced the INCLUDE clause in CREATE INDEX, enabling covering indexes without the included columns affecting the sort key?

Quiz

A team adds an FK with ON DELETE CASCADE from comments(post_id) to posts(id) but does NOT index comments(post_id). What happens on DELETE FROM posts WHERE id = 42 if comments has 100M rows?

Design challenge

Design the complete index set for a ticketing system. Table: tasks (id BIGSERIAL PK, workspace_id BIGINT, project_id BIGINT, assignee_user_id BIGINT, status TEXT, priority SMALLINT, title TEXT, body TEXT, ticket_id TEXT, created_at TIMESTAMPTZ). Scale: 100M tasks; 80% done (cold), 15% open (hot), 5% in_progress (hot). Budget: total index storage under 20% of table size. Five hot queries listed below.

  • Query A: list open/in_progress tasks in a project, ordered by priority then created_at.
  • Query B: list open/in_progress tasks assigned to a specific user across all projects in a workspace.
  • Query C: find task by ticket_id (unique per workspace).
  • Query D: full-text search across task titles and bodies.
  • Query E: find tasks created in the last 24 hours in a workspace.
Why this works

Why does Postgres need six index types when most other databases default to one? Because the fundamental data structures are incompatible. B-tree requires a total order. JSONB documents have no total order. Geometric shapes require spatial predicates. Text search requires word-level inverted lists. Embedding similarity requires high-dimensional graph navigation. A single universal index structure would either be astronomically expensive or unable to express the right operations. The design trade — multiple specialized types, each fit for its data shape — keeps query performance practical at production scale.

Recall before you leave
  1. 01
    Explain why GIN tsvector is the default full-text search choice in Postgres, and what its limitations are.
  2. 02
    Design an index for a query that: filters by workspace_id and status, sorts by created_at DESC, and projects only id and title. Explain every choice.
  3. 03
    When would you choose pgvector HNSW over GIN tsvector for search, and what are the operational costs?
Recap

WHERE title ILIKE '%term%' is always O(n) — leading wildcards defeat B-tree. Scalable alternatives: GIN on a tsvector STORED GENERATED column for keyword full-text (stemming, ts_rank, no extra infrastructure); pg_trgm GIN for fuzzy and substring matching (typo tolerance, ILIKE-anywhere); pgvector HNSW for semantic embedding search (natural language, requires model inference pipeline).

For a ticketing system at 100M rows, the deliberate index set: two partial composites (WHERE status IN (‘open’,‘in_progress’)) for the hot 20% of data, covering the open-tasks dashboard queries; a UNIQUE composite for per-tenant ticket_id; a GIN tsvector for full-text; a full-table composite for the recent-24h feed — total under 20% of table size, all within write-overhead budget.

The INCLUDE clause (Postgres 11+) adds projection columns to index leaves without affecting the sort key, enabling index-only scans for typical list projections. Partial indexes cut size proportionally to the selectivity of the WHERE clause — the most underused performance lever in production Postgres schemas.

Connected lessons
appears again in174
Continue the climb ↑Indexes: multiple-choice review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.