Full-text search: multiple-choice review

A multiple-choice synthesis of the search unit: inverted index, analyzer parity, BM25 tuning, GIN vs GiST, near-real-time refresh, and zero-downtime reindex.

Six questions that cut across the whole unit. Each one mirrors a call you make when designing or debugging real search: not a definition to recite, but a tradeoff to weigh against the corpus, the write rate, and the SLO.

Goal

Confirm you can connect the inverted index, the analysis pipeline, BM25 relevance, the Postgres-vs-engine decision, and the operational traps (refresh latency, reindex migrations) into one model of how search actually behaves.

Quiz

A query for 'running' returns zero results with no error in the logs. Documents were indexed with a stemming analyzer; the query path does not analyze the term the same way. What is the most likely root cause?

Quiz

Why did modern engines move from raw TF-IDF to BM25 for term-frequency scoring?

Quiz

A SaaS app on Postgres needs search over ~2M support tickets: keyword match, basic relevance, modest writes, low ops budget. What is the right default?

Quiz

You picked GIN for a tsvector index. Months later, on a now write-heavy table, search latency degrades intermittently with no query change. What is the most likely mechanism?

Quiz

An Elasticsearch service writes a document and immediately reads it back by search, intermittently getting a miss. The team assumed write-then-read consistency. What is happening?

Quiz

You need to fix a stemming bug by changing an existing field's analyzer across 200M documents in Elasticsearch. What is the correct migration, and why?

Recap

The through-line: matching works on tokens, not strings, so the same analyzer must run at index time and at query time, or you get silent zero results. BM25 ranks the candidates the inverted index produces by saturating term frequency and normalizing for document length. Postgres with a GIN index is the right default until you need facets, fuzziness, BM25 tuning, or scale; inside Postgres, the choice is GIN’s write bloat versus GiST’s lossiness. The operational traps, near-real-time refresh latency and the reindex forced by immutable indexed tokens, are why you design behind a read alias from day one. Now, when you see an analyzer-mismatch incident or a GIN-bloat alert, you know which lever to reach for before touching anything else.
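The first two points of the recap can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any engine's actual implementation: the suffix-stripping stemmer, the three sample documents, and every function name are invented for the example. The scoring follows the standard BM25 formula with the commonly cited defaults k1 = 1.2 and b = 0.75 and a Lucene-style IDF.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # commonly cited BM25 defaults; real corpora may need tuning


def stem(token):
    """Hypothetical toy stemmer: strip 'ing' or 's', then undouble a final consonant."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            token = token[: -len(suffix)]
            break
    if len(token) > 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]  # "runn" -> "run"
    return token


def stemming_analyzer(text):
    return [stem(t) for t in text.lower().split()]


def raw_analyzer(text):
    # Lowercases and splits only: no stemming, so it disagrees with the index.
    return text.lower().split()


# Invented sample corpus, for illustration only.
DOCS = {
    1: "running shoes for trail running",
    2: "she runs every morning",
    3: "database index tuning guide",
}


def build_index(docs, analyzer):
    """Inverted index: term -> {doc_id: term frequency}, plus document lengths."""
    postings = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = analyzer(text)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings, lengths


def tf_component(tf, doc_len, avg_len, k1=K1, b=B):
    """BM25 frequency term: grows with tf but is bounded above by k1 + 1."""
    norm = 1 - b + b * doc_len / avg_len  # length normalization
    return tf * (k1 + 1) / (tf + k1 * norm)


def search(query, postings, lengths, analyzer):
    """Return matching doc ids, best BM25 score first."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in analyzer(query):
        matches = postings.get(term, {})
        idf = math.log(1 + (n_docs - len(matches) + 0.5) / (len(matches) + 0.5))
        for doc_id, tf in matches.items():
            scores[doc_id] += idf * tf_component(tf, lengths[doc_id], avg_len)
    return sorted(scores, key=scores.get, reverse=True)


postings, lengths = build_index(DOCS, stemming_analyzer)
mismatched = search("running", postings, lengths, raw_analyzer)
matched = search("running", postings, lengths, stemming_analyzer)
print(mismatched, matched)
```

With the raw analyzer on the query side, "running" never matches the stored token "run", so the search returns an empty list without raising anything. With the same analyzer on both sides, documents 1 and 2 match, and document 1 ranks first because it contains the term twice. However often a term repeats, its frequency contribution stays below k1 + 1, which is the saturation the recap refers to.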
