Data Engineering DATA · 06 · 01

Full-text search: inverted indexes, analysis, and why ranking beats matching

LIKE ''''%term%'''' scans every row and ignores relevance. Full-text search flips the problem: an inverted index finds candidate docs in milliseconds, an analysis pipeline normalizes text, and BM25 ranks by usefulness, not just presence.

DATA Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

A product search ships with WHERE title LIKE '%running shoes%'. It works in the demo. At 50k rows it’s a 400ms full table scan because the leading % kills any index. Then a customer searches “running shoe” — singular — and gets zero results, while “Running Shoes” returns nothing because the match is case-sensitive. The query is doing string matching when the job was search. Two different problems wearing the same = operator.

In the next fifteen minutes you will see exactly why LIKE breaks at scale, how an inverted index makes queries fast regardless of corpus size, and which one mistake silently returns zero results — even when everything else looks correct.

Why `LIKE '%term%'` can never be search

LIKE '%running%' has two fatal properties. First, the leading wildcard means no B-tree index can help — the database scans every row and runs a substring match on each, so cost grows linearly with table size. At a few thousand rows nobody notices; at a few million it’s a timeout. Second, it matches substrings of bytes, not words of language. “running” won’t match “run”, “Run”, “ran”, or “runs”. It happily matches “scunthorpe” inside a longer word and misses “shoe” when the text says “shoes”. And critically, LIKE has no notion of relevance: a row where your term appears once in a footer ranks identically to one where it’s the title repeated three times. Matching is binary; search is ordered.

Full-text search replaces both halves. The inverted index solves finding — turning a linear scan into a hash lookup. Relevance scoring solves ordering — deciding which of the 10,000 matches a human should see first. You need both. A system that finds everything but ranks badly is as useless as one that ranks perfectly but is too slow to run.

The inverted index: term → posting list

A normal (forward) index maps a document to its words. An inverted index maps the other way: each term points to a posting list of the document ids that contain it, usually with positions and term frequencies attached. Searching for “running shoes” becomes: look up the posting list for running, look up the posting list for shoes, intersect them. That’s two dictionary lookups and a merge of two sorted id lists — milliseconds regardless of corpus size, because you never touch documents that don’t contain the terms.

Documents:                  Inverted index:
  doc1: "running shoes"       running -> [doc1, doc3]
  doc2: "blue dress"          shoes   -> [doc1]
  doc3: "trail running gear"  blue    -> [doc2]
                              dress   -> [doc2]
                              trail   -> [doc3]
                              gear    -> [doc3]
Query "running shoes" -> intersect [doc1,doc3] & [doc1] = [doc1]

This is the data structure inside Postgres tsvector/GIN and inside Lucene (the engine under Elasticsearch and OpenSearch). The hard part isn’t the lookup — it’s deciding what counts as a “term”, which is the analysis pipeline.

Analysis: the pipeline that decides what matches

Raw text never goes into the index directly. It runs through an analyzer: tokenize → lowercase → remove stopwords → stem or lemmatize. “The Runners Were Running” becomes roughly [run, run] after splitting on whitespace, lowercasing, dropping the stopwords “the”/“were”, and stemming “runners”/“running” down to the root “run”. Now a query for “run”, “running”, or “runners” all hit the same indexed token. This is why full-text search matches singular and plural where LIKE can’t — the matching happens on normalized tokens, not raw strings.

The senior trap is hiding in plain sight: the same analyzer must run at index time and at query time. If you stem documents to “run” when indexing but forget to stem the query, a search for “running” produces the token “running”, which isn’t in the index, and you get zero results — silently, with no error. This is the most common “search is broken and I don’t know why” incident. The fix and the diagnostic are the same tool: Elasticsearch’s _analyze API (or Postgres to_tsvector) shows you the exact tokens both sides produce, so you can see the mismatch.

▸Why this works

Stemming and lemmatization aren’t the same trade. Stemming (Porter, Snowball) chops suffixes with crude rules — fast, but “university” and “universe” both stem to “univers”, and “ran” doesn’t reduce to “run”. Lemmatization uses a dictionary to find the true base form (handling “ran” → “run”) — more accurate, more expensive, language-specific. Most production search uses stemming because it’s good enough and cheap; you reach for lemmatization when precision matters more than CPU.

Relevance: TF-IDF, then BM25

Once you have candidates, you score them. The intuition is TF-IDF: a term matters more in a document the more often it appears there (term frequency), but less if it’s common across the whole corpus (inverse document frequency) — “the” appearing 50 times tells you nothing. Modern engines use BM25, a refinement that fixes TF-IDF’s biggest flaw: raw term frequency grows without bound, so a spammy document repeating “shoes” 200 times would dominate. BM25 saturates term frequency — the 10th occurrence adds far less than the 2nd — and normalizes for document length so a long article isn’t unfairly boosted just for being long.

BM25 has two knobs every senior should recognize. k1 (default 1.2 in Elasticsearch and Lucene) controls how fast term frequency saturates — lower saturates sooner. b (default 0.75) controls how strongly document length is normalized: 0 ignores length entirely, 1 normalizes fully. The defaults are good for prose; you tune them when your documents are unusually short (product titles) or long (manuals), and you tune by measuring relevance against judged queries — never by vibes, because a tuning change that helps one query class silently regresses another.

Postgres vs a dedicated engine: the real decision

Postgres tsvector + GIN is genuinely good search, and the right default until it isn’t. The decision a senior actually weighs:

Concern	Postgres tsvector + GIN	Elasticsearch / OpenSearch
Setup & ops	Already running; one index, no extra system to operate	A separate cluster to deploy, secure, monitor, and keep in sync
Relevance	`ts_rank` is basic; BM25 only via extensions	BM25 by default, tunable `k1`/`b` per field
Fuzzy / typo tolerance	Bolt-on (`pg_trgm`); not built into FTS	First-class fuzziness + edit distance
Faceting / aggregations	Hand-rolled `GROUP BY`	Native aggregations, built for facets
Scale-out	Vertical; one node	Sharding + replicas across nodes

Inside Postgres there’s a second choice: GIN vs GiST. GIN looks up roughly 3x faster but is ~3x slower to build, 2–3x larger on disk, and slower to update — and with fastupdate=on it batches writes into a pending list that bloats the index and can stall a query when it flushes. GiST is smaller and faster to update but lossy, so it rechecks the table on matches. Rule of thumb: GIN for read-heavy, mostly-static text; GiST when writes are constant or index size hurts. The failure mode at scale is GIN bloat under high write — your “fast” search index quietly degrades until a REINDEX or autovacuum tuning rescues it.

Near-real-time and the reindex-the-world tax

Dedicated engines aren’t instantly consistent — they’re near-real-time. Elasticsearch’s refresh_interval defaults to 1s, meaning a freshly indexed document isn’t searchable for up to a second. That’s a feature: refreshing is expensive, so under heavy bulk loads seniors raise it to 30s or set it to -1 during a big import and refresh once at the end — trading visibility latency for throughput. The trap is assuming write-then-read consistency you don’t have.

The biggest operational tax is the reindex-the-world migration. Lucene won’t let you change the type or analyzer of an existing field — the tokens are already committed. Want to add a synonym filter or fix a stemming bug across 200M documents? You build a new index with the new mapping, reindex everything into it (hours), then atomically swap a read alias from old to new so clients never see downtime. Forgetting to design behind an alias on day one is the mistake that turns a one-line analyzer fix into a multi-day, all-hands migration.

Pick the best fit

A SaaS app already on Postgres needs search over ~2M support tickets: keyword match, basic relevance, low ops budget, writes are modest. Pick the approach.

Quiz

Documents are indexed with a stemming analyzer, but the query is sent without analysis. A user searches 'running' and gets zero results. Why?

Quiz

Why did search engines move from raw TF-IDF to BM25 for term-frequency scoring?

Order the steps

Order the stages a piece of text passes through, from raw input to a ranked result:

1 Analyze: tokenize, lowercase, drop stopwords, stem/lemmatize into normalized terms
2 Index: write each term to its posting list in the inverted index
3 Query: run the SAME analyzer over the search string to produce query terms
4 Match: look up and intersect the posting lists for the query terms
5 Rank: score the candidates with BM25 and return them ordered

Index build (top) and query (bottom) run the SAME analyzer, so tokens match. The inverted index maps each term to a postings list; the query intersects those lists, then BM25/TF-IDF ranks the survivors.

Recall before you leave

01
A teammate says full-text search 'just returns nothing' for some queries and there are no errors in the logs. Walk through how you'd diagnose it.
02
When is Postgres tsvector/GIN the right choice, and what specifically pushes you to a dedicated engine like Elasticsearch?

Recap

Full-text search exists because LIKE '%term%' is two failures at once: a full scan that can’t use an index, and byte-substring matching with no concept of relevance. The inverted index fixes finding — each term points to a posting list of doc ids, so a query is a dictionary lookup and a merge, fast regardless of corpus size. The analysis pipeline (tokenize, lowercase, stopword, stem/lemmatize) decides what counts as a term, and its iron rule is that the same analyzer must run at index and query time, or you get silent zero results. Ranking is the other half: BM25, with its k1 and b knobs, saturates term frequency and normalizes for length so the most useful documents surface, not just any that contain the word. Postgres tsvector with a GIN index is the right default — fast lookups, no extra system — and you weigh GIN’s bloat-under-write against GiST’s lossiness inside it. You graduate to Elasticsearch or OpenSearch when you need faceting, fuzzy matching, BM25 tuning, or scale, accepting near-real-time refresh and designing behind a read alias so the inevitable reindex-the-world migration stays a zero-downtime alias swap. Now when you see “search is broken” in an incident channel, your first move is to run the analyzer on both sides and compare tokens — not to reach for a scoring knob.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.