Data Engineering
Full-text search: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what turns the unit’s ideas into something you can apply under pressure.
Reconstruct the unit’s core mechanisms — the inverted index, the analysis pipeline and its parity rule, BM25 and its knobs, the Postgres-vs-engine decision, and the operational traps — without looking back at the lesson.
- 01 Why can LIKE '%term%' never be search, and what two distinct problems does full-text search solve in its place?
- 02 Describe the inverted index and what makes a query fast on it regardless of corpus size.
- 03 What does an analyzer do, and why must the same analyzer run at index time and query time?
- 04 Explain why search moved from TF-IDF to BM25, and what the k1 and b knobs control.
- 05 When is Postgres tsvector/GIN the right default, what pushes you to a dedicated engine, and how do you choose GIN vs GiST inside Postgres?
- 06 What does 'near-real-time' mean for a dedicated engine, and why must you design behind a read alias from day one?
If you can reconstruct each answer from memory, you hold the unit’s spine:
- LIKE fails at both finding and ranking.
- The inverted index makes finding a dictionary lookup.
- The analysis pipeline decides what a term is, and its parity rule is non-negotiable.
- BM25 saturates term frequency and normalizes for document length so the useful docs surface.
- Postgres GIN is the right default until facets, fuzziness, or scale push you to a dedicated engine.
- Near-real-time refresh, plus the full reindex that immutable tokens force, is why you build behind a read alias from the start.
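To check a recalled answer against something concrete, the three core mechanisms (analysis, the inverted index, and BM25) can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Postgres or any dedicated engine implements them: `analyze`, `STOPWORDS`, and `TinyIndex` are hypothetical names, the plural-stripping rule stands in for a real stemmer, the IDF term is one commonly used smoothed variant, and `k1=1.2`, `b=0.75` are the usual default values.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "is"}  # illustrative list only


def analyze(text):
    """Toy analyzer: lowercase, tokenize, drop stopwords, strip a trailing 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in tokens if t not in STOPWORDS]


class TinyIndex:
    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of indexed terms

    def add(self, doc_id, text):
        terms = analyze(text)              # index-time analysis
        self.doc_len[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        if not self.doc_len:
            return []
        n_docs = len(self.doc_len)
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in analyze(query):        # query-time analysis: same function (parity)
            docs = self.postings.get(term, {})  # dictionary lookup, no corpus scan
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                # b controls how strongly long documents are penalized.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                # k1 controls how quickly repeated occurrences stop adding score.
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = TinyIndex()
index.add(1, "Search terms")
index.add(2, "A long note about cooking pasta that mentions search once")
index.add(3, "Cooking pasta")

results = index.search("SEARCH")
ranked_ids = [doc_id for doc_id, _ in results]
```

In this sketch, documents 1 and 2 each contain "search" once, so the only thing separating their scores is length normalization, and document 3 never appears because the lookup touches only the posting list for the query term. The parity rule is visible in `postings`: the stored key is `term`, not `terms`, so a query that skipped `analyze` and looked up the raw word `terms` would find nothing.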