
Data Engineering

Full-text search: multiple-choice review

Crux: Multiple-choice synthesis across the search unit, covering the inverted index, analyzer parity, BM25 tuning, GIN vs GiST, near-real-time refresh, and zero-downtime reindex.

Six questions that cut across the whole unit. Each one mirrors a call you make designing or debugging real search — not a definition to recite, but a tradeoff to weigh against the corpus, the write rate, and the SLO.

Goal

Confirm you can connect the inverted index, the analysis pipeline, BM25 relevance, the Postgres-vs-engine decision, and the operational traps (refresh latency, reindex migrations) into one model of how search actually behaves.

Quiz

A query for 'running' returns zero results with no error in the logs. Documents were indexed with a stemming analyzer; the query path does not analyze the term the same way. What is the most likely root cause?
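The scenario in this question can be reproduced with a toy inverted index. The sketch below is illustrative only: `analyze`, `build_index`, and `search` are hypothetical names, and the suffix-stripping rules are a stand-in for a real stemmer, not an actual stemming algorithm. It shows that a lookup succeeds only when the query term passes through the same analysis as the indexed text.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, strip a few suffixes.

    The suffix rules are a crude stand-in for a stemmer (assumption).
    """
    tokens = []
    for word in text.lower().split():
        for suffix in ("ning", "ing", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


def build_index(docs):
    """Map each analyzed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, term, analyze_query):
    """Look up a term, optionally running it through the index-time analyzer."""
    tokens = analyze(term) if analyze_query else [term]
    hits = set()
    for token in tokens:
        hits |= index.get(token, set())
    return hits


docs = {1: "Running shoes on sale", 2: "She runs every morning"}
index = build_index(docs)

# The index stores the stem "run", so the raw string "running" matches nothing.
mismatched = search(index, "running", analyze_query=False)
# With the same analyzer on the query path, both documents match.
matched = search(index, "running", analyze_query=True)
print(mismatched, matched)
```

Note that the mismatched lookup raises no error: the term is simply absent from the index, which is why this failure is silent in the logs.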

Quiz

Why did modern engines move from raw TF-IDF to BM25 for term-frequency scoring?

Quiz

A SaaS app on Postgres needs search over ~2M support tickets: keyword match, basic relevance, modest writes, low ops budget. What is the right default?

Quiz

You picked GIN for a tsvector index. Months later, on a now write-heavy table, search latency degrades intermittently with no query change. What is the most likely mechanism?

Quiz

An Elasticsearch service writes a document and immediately reads it back by search, intermittently getting a miss. The team assumed write-then-read consistency. What is happening?
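Elasticsearch search is near-real-time: a newly indexed document becomes visible to search only after a refresh, while a get by ID reads the latest write. The toy model below sketches that behavior under simplifying assumptions. `ToyNearRealTimeIndex` and its methods are hypothetical names and not part of any Elasticsearch client.

```python
class ToyNearRealTimeIndex:
    """Toy model: writes land in a buffer; search sees only refreshed docs."""

    def __init__(self):
        self._buffer = {}      # written but not yet searchable
        self._searchable = {}  # visible to search after a refresh

    def index(self, doc_id, text):
        """Accept a write; it is not searchable until refresh() runs."""
        self._buffer[doc_id] = text

    def refresh(self):
        """Make all buffered writes visible to search."""
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def get(self, doc_id):
        """Fetch by id, seeing buffered writes as well as refreshed ones."""
        if doc_id in self._buffer:
            return self._buffer[doc_id]
        return self._searchable.get(doc_id)

    def search(self, word):
        """Return ids of refreshed documents containing the word."""
        return sorted(
            doc_id
            for doc_id, text in self._searchable.items()
            if word in text.lower().split()
        )


idx = ToyNearRealTimeIndex()
idx.index(1, "refund request")

before_refresh = idx.search("refund")  # miss: the write is still buffered
by_id = idx.get(1)                     # hit: get-by-id sees the write
idx.refresh()
after_refresh = idx.search("refund")   # hit: the refresh exposed the document
print(before_refresh, by_id, after_refresh)
```

In a real cluster the refresh runs on an interval (one second by default), so the miss is intermittent rather than guaranteed. When a caller truly needs to search its own write, the index request accepts `refresh=wait_for`, at the cost of added write latency.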

Quiz

You need to fix a stemming bug by changing an existing field's analyzer across 200M documents in Elasticsearch. What is the correct migration, and why?
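Tokens already written to an index were produced by the old analyzer, so the documents have to be re-analyzed into a new index. One common shape for that migration is sketched below as a list of HTTP requests. The index names `tickets_v1` and `tickets_v2`, the alias `tickets`, the field `body`, and the helper functions are hypothetical. The request bodies follow the Elasticsearch `_reindex` and `_aliases` APIs, but treat this as a sketch, not a production runbook.

```python
def reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy documents so they are re-analyzed."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: both actions are applied atomically."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def migration_plan(alias, old_index, new_index, new_index_definition):
    """Ordered (method, path, body) steps; nothing is sent over the network."""
    return [
        ("PUT", f"/{new_index}", new_index_definition),
        ("POST", "/_reindex", reindex_body(old_index, new_index)),
        ("POST", "/_aliases", alias_swap_body(alias, old_index, new_index)),
    ]


# Hypothetical mapping: the corrected analyzer is declared on the new index.
new_definition = {
    "mappings": {"properties": {"body": {"type": "text", "analyzer": "english"}}}
}

plan = migration_plan("tickets", "tickets_v1", "tickets_v2", new_definition)
for method, path, body in plan:
    print(method, path, body)
```

Because readers query the alias rather than a concrete index name, the swap in the last step moves traffic to the corrected index without an application change. Writes that arrive during the copy still need handling, for example by dual-writing or replaying them, which this sketch leaves out.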

Recap

The through-line: matching is on tokens, not strings, so the same analyzer must run at index time and at query time or you get silent zero results. BM25 ranks the candidates the inverted index produces, saturating term frequency and normalizing for document length. Postgres full-text search with a GIN index is the right default until you need facets, fuzziness, BM25 tuning, or scale; inside Postgres, the index choice trades GIN's write cost and bloat against GiST's lossiness. The operational traps, near-real-time refresh latency and the reindex forced by immutable tokens, are why you design behind a read alias from day one.
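The saturation and length normalization mentioned in the recap can be seen in the term-frequency part of the standard BM25 formula. The sketch below is a minimal illustration, assuming the commonly used defaults `k1 = 1.2` and `b = 0.75`; the function name `bm25_tf` is hypothetical, and the IDF factor is omitted.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of BM25 (IDF factor omitted).

    k1 controls how quickly repeated terms stop adding score;
    b controls how strongly long documents are penalized.
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: on an average-length document, each extra occurrence adds less,
# and the component never exceeds k1 + 1.
for tf in (1, 2, 10, 100):
    print(tf, round(bm25_tf(tf, doc_len=100, avg_doc_len=100), 3))

# Length normalization: the same single match counts for less
# in a document twice the average length.
short_doc = bm25_tf(1, doc_len=100, avg_doc_len=100)
long_doc = bm25_tf(1, doc_len=200, avg_doc_len=100)
print(short_doc > long_doc)
```

Raw TF-IDF grows without bound as a term repeats, so a document that repeats a keyword many times can dominate the ranking. Under BM25 the contribution approaches the ceiling of `k1 + 1` instead.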

