Full-text search: build and migrate a search index

Hands-on project — build real full-text search over a document corpus, prove relevance against judged queries, and migrate an analyzer change behind a read alias with zero downtime.

DATA · Senior · 240 min

Reading about analyzer parity and reindex migrations is not the same as shipping a search feature that survives a stemming bug fix at scale. Build search over a real corpus, prove its relevance with numbers, then perform the migration every search system eventually needs — without anyone noticing.

Goal

Turn the unit’s model into a working system: index a corpus into an inverted index, enforce one analyzer across index and query, tune BM25 against judged queries, and rehearse the zero-downtime reindex-behind-an-alias that an immutable-token engine forces on you.
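The first two steps of the Goal, an inverted index and a single analyzer shared by the index side and the query side, can be sketched in a few lines. This is a toy model under stated assumptions, not how Postgres or Elasticsearch implement it: `analyze`, `build_index`, and `search` are hypothetical names, and the "stemmer" is a deliberately crude plural stripper. It also shows why a mismatch returns nothing: the query token never equals the stored token.

```python
import re
from collections import defaultdict

def analyze(text, stem=True):
    """Toy analyzer: lowercase, split on non-alphanumerics, crude plural stripping."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if stem:
        # Hypothetical stemming rule for illustration only.
        tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens

def build_index(docs, analyzer):
    """Map each analyzed token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index

def search(index, query, analyzer):
    """AND semantics: return ids of documents containing every query token."""
    postings = [index.get(token, set()) for token in analyzer(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "Reindexing search indexes", 2: "Alias swaps without downtime"}
index = build_index(docs, analyze)

# Same analyzer on both sides: the query token matches the stored token.
matched = search(index, "indexes", analyze)

# Deliberate mismatch: the query skips stemming, so its token is not in the index.
mismatched = search(index, "indexes", lambda q: analyze(q, stem=False))
```

A parity test in this model is simply an assertion that the analyzer passed to `search` is the one that built the index, or that both produce identical tokens for a fixed set of sample strings.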

Project

Objective

Build full-text search over a real document corpus (≥50k docs) — first on Postgres tsvector/GIN, then optionally on Elasticsearch/OpenSearch — that returns relevant ranked results in milliseconds, and prove it with measured relevance and a zero-downtime analyzer migration.
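For the "relevant ranked results" part, the Goal names BM25 as the ranking function to tune. The sketch below is one common formulation of BM25 (the variant with a `log(1 + ...)` inverse document frequency), written from scratch so the effect of `k1` and `b` is visible. The function name, the sample corpus, and the default parameter values are assumptions for illustration; a real engine computes this over its own index statistics.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each document against the query; docs_tokens maps id -> token list."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(tokens) for tokens in docs_tokens.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs_tokens.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs_tokens.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: b controls how strongly long docs are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores

# Hypothetical pre-analyzed corpus.
corpus = {
    "a": ["alias", "swap", "reindex"],
    "b": ["alias", "alias", "alias", "alias", "pointer", "lookup"],
    "c": ["stemming", "bug", "fix"],
}
ranked = sorted(
    bm25_scores(["alias", "reindex"], corpus).items(),
    key=lambda item: -item[1],
)
```

Raising `k1` lets repeated terms keep adding to the score for longer before saturating; lowering `b` toward 0 weakens the length penalty. Those are the two knobs to vary when tuning against the judged queries.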

Requirements

Acceptance criteria

A relevance table: precision@10 (or nDCG@10) before and after tuning on the same judged-query set, measured rather than estimated, with no query class regressing below its baseline.
Evidence of the analyzer-parity guard: the deliberate-mismatch run returning zero results, and the parity test that now prevents it, both captured.
An EXPLAIN (ANALYZE) output proving the search query uses the GIN index (a Bitmap Index Scan on the GIN index feeding a Bitmap Heap Scan) rather than a sequential scan of the corpus.
A migration log showing the new index built and reindexed, the read alias swapped atomically, and queries served correctly throughout — with a one-paragraph write-up of why the immutable-token constraint forces a rebuild-and-swap rather than an in-place edit.
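The relevance table in the first criterion needs two metrics and a regression check. A minimal sketch, assuming graded judgments stored as a dict of document id to gain and a linear-gain DCG; the function names and the sample judgments are hypothetical, and the numbers are made up purely to exercise the code.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top k results that are judged relevant.

    Divides by k even when fewer than k results come back, so short
    result lists are penalized rather than flattered.
    """
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

def ndcg_at_k(ranked_ids, gains, k=10):
    """Normalized DCG with linear gain; gains maps doc id -> graded judgment."""
    def dcg(values):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

def regressed_classes(before, after):
    """Query classes whose metric dropped below baseline after tuning."""
    return sorted(c for c in before if after.get(c, 0.0) < before[c])

# Hypothetical judgments for one query, and two rankings of the same corpus.
judgments = {"d1": 3, "d2": 2, "d3": 1}
baseline = ["d9", "d1", "d8", "d2"]
tuned = ["d1", "d2", "d9", "d3"]

p_before = precision_at_k(baseline, set(judgments))
p_after = precision_at_k(tuned, set(judgments))
n_before = ndcg_at_k(baseline, judgments)
n_after = ndcg_at_k(tuned, judgments)
```

Averaging these per query class (for example navigational versus long-tail queries) and passing the two resulting dicts to `regressed_classes` gives the "no class regressed" check a concrete form.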

Senior stretch

Add a second backend (Elasticsearch or OpenSearch) over the same corpus and compare relevance, latency, and operational cost against the Postgres version on the identical judged-query set.
Add fuzzy/typo tolerance (pg_trgm in Postgres, or ES fuzziness) and measure how much it improves recall on a misspelled-query subset without wrecking precision.
Add faceting/aggregations (counts by category alongside results) and measure the latency cost of computing facets per query.
Add near-real-time write handling: ingest a stream of new docs, raise refresh_interval (or batch tsvector updates) under a bulk load, and show write throughput improves while documenting the visibility-latency tradeoff you accepted.
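For the typo-tolerance stretch item, the idea behind trigram matching can be shown without a database. The sketch below is loosely modeled on how pg_trgm is documented to work (pad the word, collect three-character windows, compare the sets); the padding convention and the ratio used here are assumptions of this sketch, so measure against the real extension rather than this code.

```python
def trigrams(word):
    """Set of three-character windows over the padded, lowercased word."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams (a Jaccard ratio)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def fuzzy_candidates(query, vocabulary, threshold):
    """Vocabulary terms at or above the similarity threshold, best first."""
    scored = [(similarity(query, term), term) for term in vocabulary]
    return [term for score, term in sorted(scored, reverse=True) if score >= threshold]

# Hypothetical index vocabulary and a misspelled query term.
vocabulary = ["search", "alias", "reindex"]
candidates = fuzzy_candidates("serach", vocabulary, threshold=0.2)
```

The threshold is the precision and recall dial: lowering it recovers more misspellings but admits more unrelated terms, which is exactly the tradeoff the stretch item asks you to measure on the misspelled-query subset.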

Recap

This is the loop you run when search is your responsibility: index into an inverted structure, enforce one analyzer on both sides and guard it with a parity test, tune ranking against judged queries with real numbers, and rehearse the reindex-behind-an-alias migration that an immutable-token engine will eventually demand. Doing it once on a real corpus is what makes analyzer mismatches and reindex migrations obvious instead of terrifying when they show up in production.
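The reindex-behind-an-alias step can be rehearsed as a toy model before touching a real engine. In this sketch, tokens stored in an index are never edited in place: an analyzer change means building a second index and repointing the alias in one assignment. `SearchCluster`, the index names, and both analyzers are hypothetical, and none of this is the Elasticsearch or Postgres API.

```python
class SearchCluster:
    """In-memory stand-in for an engine with named indexes and read aliases."""

    def __init__(self):
        self.indexes = {}  # index name -> (token -> doc ids, analyzer)
        self.aliases = {}  # alias name -> index name

    def create_index(self, name, docs, analyzer):
        index = {}
        for doc_id, text in docs.items():
            for token in analyzer(text):
                index.setdefault(token, set()).add(doc_id)
        # Storing the analyzer with the index keeps query-side parity.
        self.indexes[name] = (index, analyzer)

    def swap_alias(self, alias, new_index):
        if new_index not in self.indexes:
            raise KeyError(new_index)
        # One reference update: readers see the old index or the new one, never a mix.
        self.aliases[alias] = new_index

    def search(self, alias, query):
        index, analyzer = self.indexes[self.aliases[alias]]
        postings = [index.get(token, set()) for token in analyzer(query)]
        return set.intersection(*postings) if postings else set()

def analyze_v1(text):
    return text.lower().split()

def analyze_v2(text):
    # Hypothetical stemming fix: strip a trailing "es".
    return [t[:-2] if t.endswith("es") else t for t in text.lower().split()]

docs = {1: "Indexes and aliases"}
cluster = SearchCluster()
cluster.create_index("docs_v1", docs, analyze_v1)
cluster.swap_alias("docs", "docs_v1")
before = cluster.search("docs", "index")

cluster.create_index("docs_v2", docs, analyze_v2)  # build fully before swapping
cluster.swap_alias("docs", "docs_v2")
after = cluster.search("docs", "index")
```

Because `docs_v1` still exists after the swap, rolling back is the same one-line alias update in the other direction.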
