Data Engineering
Full-text search: multiple-choice review
Six questions that cut across the whole unit. Each one mirrors a call you make when designing or debugging real search: not a definition to recite, but a tradeoff to weigh against the corpus, the write rate, and the SLO.
Confirm you can connect the inverted index, the analysis pipeline, BM25 relevance, the Postgres-vs-engine decision, and the operational traps (refresh latency, reindex migrations) into one model of how search actually behaves.
A query for 'running' returns zero results with no error in the logs. Documents were indexed with a stemming analyzer; the query path does not analyze the term the same way. What is the most likely root cause?
Why did modern engines move from raw TF-IDF to BM25 for term-frequency scoring?
A SaaS app on Postgres needs search over ~2M support tickets: keyword match, basic relevance, modest writes, low ops budget. What is the right default?
You picked GIN for a tsvector index. Months later, on a now write-heavy table, search latency degrades intermittently with no query change. What is the most likely mechanism?
An Elasticsearch service writes a document and immediately reads it back by search, intermittently getting a miss. The team assumed write-then-read consistency. What is happening?
You need to fix a stemming bug by changing an existing field's analyzer across 200M documents in Elasticsearch. What is the correct migration, and why?
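To make the last question concrete, here is a minimal in-memory sketch of a reindex behind an alias. It is a toy model, not the Elasticsearch API: `ToyIndex`, `old_analyze`, `new_analyze`, and the `tickets`, `tickets_v1`, and `tickets_v2` names are all hypothetical. It assumes tokens are produced once at write time and stored, so a corrected analyzer only takes effect for documents that are indexed again.

```python
def old_analyze(text):
    # Hypothetical buggy analyzer: lowercases and splits, but never stems.
    return text.lower().split()

def new_analyze(text):
    # Hypothetical fixed analyzer: crude suffix stripping as a stand-in for stemming.
    tokens = []
    for word in text.lower().split():
        for suffix in ("ning", "ing", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens

class ToyIndex:
    """Stores tokens at write time; the analyzer is fixed when the index is created."""
    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.postings = {}   # token -> set of doc ids
        self.source = {}     # doc id -> original text, kept so we can reindex

    def add(self, doc_id, text):
        self.source[doc_id] = text
        for token in self.analyzer(text):
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, query):
        hits = set()
        for token in self.analyzer(query):
            hits |= self.postings.get(token, set())
        return hits

indices = {"tickets_v1": ToyIndex(old_analyze)}
aliases = {"tickets": "tickets_v1"}   # readers only ever use the alias

def search(alias, query):
    return indices[aliases[alias]].search(query)

indices["tickets_v1"].add(1, "Running late on refunds")
indices["tickets_v1"].add(2, "refund runs twice")
before = search("tickets", "run")     # v1 stored "running" and "runs", never "run"

# Migration: build a new index with the fixed analyzer from the stored source,
# then repoint the alias in one step so readers never see a half-built index.
v2 = ToyIndex(new_analyze)
for doc_id, text in indices["tickets_v1"].source.items():
    v2.add(doc_id, text)
indices["tickets_v2"] = v2
aliases["tickets"] = "tickets_v2"
after = search("tickets", "run")
```

In this toy, `before` is empty and `after` contains both documents, and the calling code never changes because it only knows the alias name.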
The through-line: matching is on tokens, not strings, so the same analyzer must run at index time and at query time or you get silent zero results. BM25 ranks the candidates the inverted index produces, saturating term frequency and normalizing for document length. Postgres full-text search with a GIN index is the right default until you need facets, fuzziness, BM25 tuning, or scale; inside Postgres, the index choice trades GIN's write cost against GiST's lossiness. The operational traps, near-real-time refresh latency and the reindex forced by immutable indexed tokens, are why you design behind a read alias from day one.
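The first two points of the through-line can be sketched in a few lines of Python. This is a toy under stated assumptions, not any engine's implementation: `analyze` is a crude suffix stripper standing in for a real stemmer, `ToySearch` and `tf_weight` are hypothetical names, and `K1 = 1.2`, `B = 0.75` are commonly used BM25 defaults rather than values taken from the unit. The IDF uses the always-positive variant `ln(1 + (N - n + 0.5) / (n + 0.5))`.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # assumed BM25 parameters (common defaults)

def analyze(text):
    # Crude stand-in for a stemming analyzer: lowercase, split, strip a suffix.
    tokens = []
    for word in text.lower().split():
        for suffix in ("ning", "ing", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens

def tf_weight(tf, doc_len, avg_len):
    # BM25 term-frequency component: saturates in tf, penalizes long documents.
    norm = 1 - B + B * doc_len / avg_len
    return tf * (K1 + 1) / (tf + K1 * norm)

class ToySearch:
    def __init__(self, docs):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.lengths = {}
        for doc_id, text in docs.items():
            tokens = analyze(text)         # index-time analysis
            self.lengths[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.lengths.values()) / len(self.lengths)

    def search(self, query, analyze_query=True):
        # analyze_query=False simulates a query path that skips the analyzer.
        terms = analyze(query) if analyze_query else query.lower().split()
        n_docs = len(self.lengths)
        scores = defaultdict(float)
        for term in terms:
            matches = self.postings.get(term, {})
            if not matches:
                continue                   # no error raised, just no candidates
            idf = math.log(1 + (n_docs - len(matches) + 0.5) / (len(matches) + 0.5))
            for doc_id, tf in matches.items():
                scores[doc_id] += idf * tf_weight(tf, self.lengths[doc_id], self.avg_len)
        return sorted(scores, key=scores.get, reverse=True)

engine = ToySearch({
    1: "running shoes for trail running",
    2: "she runs a shop that also sells running gear and many other things",
    3: "trail maps",
})
matched = engine.search("running")                         # query analyzed to "run"
missed = engine.search("running", analyze_query=False)     # raw "running" is not in the index
gain_low = tf_weight(2, 10, 10) - tf_weight(1, 10, 10)     # going from 1 to 2 occurrences
gain_high = tf_weight(11, 10, 10) - tf_weight(10, 10, 10)  # going from 10 to 11 occurrences
```

Here `missed` comes back empty without any exception, which is the silent-zero-results failure. Documents 1 and 2 both contain the stemmed term twice, yet the shorter document 1 ranks first because of length normalization, and `gain_low` exceeds `gain_high` because each extra occurrence adds less than the one before.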