
Data Engineering

Full-text search: build and migrate a search index

Crux: Hands-on project. Build real full-text search over a document corpus, prove relevance against judged queries, and migrate an analyzer change behind a read alias with zero downtime.
◷ 240 min

Reading about analyzer parity and reindex migrations is not the same as shipping a search feature that survives a stemming bug fix at scale. Build search over a real corpus, prove its relevance with numbers, then perform the migration every search system eventually needs — without anyone noticing.

Goal

Turn the unit’s model into a working system: index a corpus into an inverted index, enforce one analyzer across index and query, tune BM25 against judged queries, and rehearse the zero-downtime reindex-behind-an-alias that an immutable-token engine forces on you.
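To make "one analyzer across index and query" concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the toy `stem` function stands in for a real stemmer, and `analyze`, `raw_analyze`, and `InvertedIndex` are hypothetical names, not part of Postgres or Elasticsearch. It shows that a query analyzed differently from the indexed text can silently match nothing.

```python
import re


def stem(token):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """The shared analyzer: lowercase, split on non-alphanumerics, stem."""
    return [stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def raw_analyze(text):
    """A deliberately mismatched analyzer: same tokenizer, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.postings = {}  # term -> set of doc ids

    def add(self, doc_id, text):
        for term in self.analyzer(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query, analyzer=None):
        """AND-match all query terms; `analyzer` overrides only to show a mismatch."""
        terms = (analyzer or self.analyzer)(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], ()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex(analyze)
index.add(1, "Reindexing search engines")
index.add(2, "Tuning a search engine")

print(index.search("search engines"))                        # same analyzer on both sides
print(index.search("search engines", analyzer=raw_analyze))  # mismatch: nothing matches
```

A parity test in this setting can be as small as asserting that the query path and the index path call the same function, plus one assertion that a known inflected query still finds a known document.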

Project
Objective

Build full-text search over a real document corpus (≥50k docs) — first on Postgres tsvector/GIN, then optionally on Elasticsearch/OpenSearch — that returns relevant ranked results in milliseconds, and prove it with measured relevance and a zero-downtime analyzer migration.

Requirements
Acceptance criteria
  • A relevance table: precision@10 (or nDCG@10) before and after tuning on the same judged-query set, measured, not estimated, with no query class regressed below baseline.
  • Evidence of the analyzer-parity guard: the deliberate-mismatch run returning zero results, and the parity test that now prevents it, both captured.
  • An EXPLAIN (ANALYZE) output proving the search query uses the GIN index (bitmap/index scan) rather than a sequential scan on the corpus.
  • A migration log showing the new index built and reindexed, the read alias swapped atomically, and queries served correctly throughout — with a one-paragraph write-up of why the immutable-token constraint forces a rebuild-and-swap rather than an in-place edit.
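The relevance table in the first criterion can be produced with a few lines of code. The sketch below is one possible shape, under stated assumptions: judgments are a dict of doc id to graded relevance, nDCG uses linear gain with a log2 discount (other gain functions exist), and the helper names, including `regressed_classes`, are hypothetical.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k results that are judged relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, judgments, k=10):
    """nDCG@k; `judgments` maps doc id -> graded relevance (unjudged = 0)."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(judgments.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def regressed_classes(before, after):
    """Query classes whose metric dropped below the baseline after tuning."""
    return sorted(c for c in before if after.get(c, 0.0) < before[c])


# Hypothetical per-class scores, only to show the regression check.
baseline = {"phrase": 0.6, "typo": 0.4}
tuned = {"phrase": 0.7, "typo": 0.3}
print(regressed_classes(baseline, tuned))
```

Running the same functions over the same judged-query set before and after each tuning change is what keeps the comparison honest; averaging per query class rather than only overall is what surfaces a class that regressed while the mean improved.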
Senior stretch
  • Add a second backend (Elasticsearch or OpenSearch) over the same corpus and compare relevance, latency, and operational cost against the Postgres version on the identical judged-query set.
  • Add fuzzy/typo tolerance (pg_trgm in Postgres, or ES fuzziness) and measure how much it improves recall on a misspelled-query subset without wrecking precision.
  • Add faceting/aggregations (counts by category alongside results) and measure the latency cost of computing facets per query.
  • Add near-real-time write handling: ingest a stream of new docs, raise refresh_interval (or batch tsvector updates) under a bulk load, and show write throughput improves while documenting the visibility-latency tradeoff you accepted.
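For the typo-tolerance stretch, it helps to see what trigram matching actually computes before measuring it. The following is a rough pure-Python approximation of the idea behind pg_trgm (pad each word, collect three-character windows, compare the sets); it is a sketch for intuition and is not guaranteed to reproduce pg_trgm's scores exactly. The function names and the 0.3 threshold are illustrative assumptions.

```python
import re


def trigrams(text):
    """Set of 3-character windows per word, padded with two leading and one trailing space."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by the size of the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_match(query, vocabulary, threshold=0.3):
    """Vocabulary terms at or above the threshold, best match first."""
    scored = sorted(((similarity(query, term), term) for term in vocabulary), reverse=True)
    return [term for score, term in scored if score >= threshold]


print(fuzzy_match("searhc", ["search", "index", "analyzer"]))
```

Lowering the threshold raises recall on the misspelled-query subset and lowers precision everywhere else, which is exactly the tradeoff the stretch goal asks you to measure rather than guess.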
Recap

This is the loop you run when search is your responsibility: index into an inverted structure, enforce one analyzer on both sides and guard it with a parity test, tune ranking against judged queries with real numbers, and rehearse the reindex-behind-an-alias migration that an immutable-token engine will eventually demand. Doing it once on a real corpus is what makes analyzer mismatches and reindex migrations obvious instead of terrifying when they show up in production.
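The rebuild-and-swap shape of the migration can be rehearsed in memory before touching a real engine. This is an analogy under stated assumptions, not any engine's API: `SearchCluster`, `Index`, the index names, and the two analyzers are all hypothetical, and the "fix" in `analyze_v2` is a toy plural stripper. The point it illustrates is that tokens already written by the old analyzer are never edited; a second index is built from the source documents and the alias is repointed in one step.

```python
import re


def analyze_v1(text):
    """Old analyzer: lowercase and tokenize only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def analyze_v2(text):
    """New analyzer (toy fix, assumption): also strip a trailing plural 's'."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in analyze_v1(text)]


class Index:
    """Each index keeps its own analyzer, so query-side parity follows the alias."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.postings = {}

    def add(self, doc_id, text):
        for term in self.analyzer(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        terms = self.analyzer(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


class SearchCluster:
    def __init__(self):
        self.indexes = {}
        self.aliases = {}

    def create_index(self, name, analyzer, docs):
        index = Index(analyzer)
        for doc_id, text in docs.items():
            index.add(doc_id, text)
        self.indexes[name] = index

    def point_alias(self, alias, name):
        if name not in self.indexes:
            raise KeyError(name)
        self.aliases[alias] = name  # the swap is a single assignment

    def search(self, alias, query):
        return self.indexes[self.aliases[alias]].search(query)


docs = {1: "search engines", 2: "search engine tuning"}
cluster = SearchCluster()
cluster.create_index("docs_v1", analyze_v1, docs)
cluster.point_alias("docs", "docs_v1")
before = cluster.search("docs", "engine")

cluster.create_index("docs_v2", analyze_v2, docs)  # rebuild from source, readers still on v1
during = cluster.search("docs", "engine")
cluster.point_alias("docs", "docs_v2")             # swap
after = cluster.search("docs", "engine")
print(before, during, after)
```

Readers only ever address the alias, so they see the old results until the swap and the new results afterwards, with no window in which a half-built index serves queries. Keeping `docs_v1` around until the new index has proven itself also gives you a one-line rollback.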

