Data Engineering
Full-text search: build and migrate a search index
Reading about analyzer parity and reindex migrations is not the same as shipping a search feature that survives a stemming bug fix at scale. Build search over a real corpus, prove its relevance with numbers, then perform the migration every search system eventually needs — without anyone noticing.
Turn the unit’s model into a working system: index a corpus into an inverted index, enforce one analyzer across index and query, tune BM25 against judged queries, and rehearse the zero-downtime reindex-behind-an-alias that an immutable-token engine forces on you.
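To make the "one analyzer across index and query" requirement concrete, here is a minimal in-memory sketch, not a production design. The names `raw_analyze`, `analyze`, `build_index`, `search`, and `parity_mismatches` are hypothetical, and the suffix stripper is a deliberately crude stand-in for a real stemmer. It shows that a query analyzed differently from the indexed text can return zero results, and how a parity check would catch that.

```python
import re
from collections import defaultdict


def raw_analyze(text):
    """Lowercase and split on non-alphanumerics; no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())


def analyze(text):
    """raw_analyze plus a crude, illustrative suffix stripper (not a real stemmer)."""
    stemmed = []
    for token in raw_analyze(text):
        for suffix in ("ing", "es", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def build_index(docs, analyzer):
    """Inverted index: token -> set of doc ids containing that token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index


def search(index, query, analyzer):
    """AND semantics: return docs containing every analyzed query token."""
    tokens = analyzer(query)
    if not tokens:
        return set()
    return set.intersection(*[index.get(t, set()) for t in tokens])


def parity_mismatches(index_analyzer, query_analyzer, samples):
    """Return the sample strings on which the two analyzers disagree."""
    return [s for s in samples if index_analyzer(s) != query_analyzer(s)]


docs = {
    1: "Indexing a corpus",
    2: "Reindex migrations behind an alias",
    3: "Searching indexes quickly",
}
index = build_index(docs, analyze)

matched = search(index, "indexes", analyze)         # same analyzer on both sides
mismatched = search(index, "indexing", raw_analyze)  # deliberate mismatch
```

In this sketch `matched` contains documents 1 and 3, while `mismatched` is empty because the unstemmed query token never appears in the stemmed index. A parity test could assert that `parity_mismatches(index_analyzer, query_analyzer, samples)` is empty over a representative sample of text before any query path ships.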
Build full-text search over a real document corpus (≥50k docs) — first on Postgres tsvector/GIN, then optionally on Elasticsearch/OpenSearch — that returns relevant ranked results in milliseconds, and prove it with measured relevance and a zero-downtime analyzer migration.
- A relevance table: precision@10 (or nDCG@10) before and after tuning on the same judged-query set, measured, not estimated, with no query class regressing below its baseline.
- Evidence of the analyzer-parity guard: captured output of the deliberate-mismatch run returning zero results, and of the parity test that now prevents it.
- An EXPLAIN (ANALYZE) output proving the search query uses the GIN index (a Bitmap Index Scan feeding a Bitmap Heap Scan, since GIN supports only bitmap scans) rather than a sequential scan on the corpus.
- A migration log showing the new index built and reindexed, the read alias swapped atomically, and queries served correctly throughout — with a one-paragraph write-up of why the immutable-token constraint forces a rebuild-and-swap rather than an in-place edit.
- Add a second backend (Elasticsearch or OpenSearch) over the same corpus and compare relevance, latency, and operational cost against the Postgres version on the identical judged-query set.
- Add fuzzy/typo tolerance (pg_trgm in Postgres, or ES fuzziness) and measure how much it improves recall on a misspelled-query subset without wrecking precision.
- Add faceting/aggregations (counts by category alongside results) and measure the latency cost of computing facets per query.
- Add near-real-time write handling: ingest a stream of new docs, raise refresh_interval in Elasticsearch/OpenSearch (or batch tsvector updates in Postgres) under a bulk load, and show that write throughput improves while documenting the visibility-latency tradeoff you accepted.
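The relevance table and the recall comparisons above depend on computing the metrics the same way before and after tuning. The following is one possible sketch of precision@k and nDCG@k, plus a per-class regression check. The function names and the sample scores are hypothetical, and the convention of dividing precision by k even when fewer than k results come back is an assumption you may want to change.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k slots filled by relevant docs (divides by k, by assumption)."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, graded, k=10):
    """nDCG@k given graded judgments as {doc_id: gain}; unjudged docs count as 0."""
    gains = [graded.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(graded.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def regressed_classes(before, after):
    """Query classes whose score after tuning fell below the baseline score."""
    return sorted(c for c in before if after.get(c, 0.0) < before[c])


# Illustrative numbers only, not measured results.
before = {"navigational": 0.80, "phrase": 0.55, "misspelled": 0.30}
after = {"navigational": 0.78, "phrase": 0.70, "misspelled": 0.45}
regressions = regressed_classes(before, after)
```

With these made-up scores, `regressions` flags the `navigational` class: the aggregate went up, but one class fell below its baseline, which is exactly the case the deliverable asks you to rule out.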
This is the loop you run when search is your responsibility: index into an inverted structure, enforce one analyzer on both sides and guard it with a parity test, tune ranking against judged queries with real numbers, and rehearse the reindex-behind-an-alias migration that an immutable-token engine will eventually demand. Doing it once on a real corpus is what makes analyzer mismatches and reindex migrations obvious instead of terrifying when they show up in production.
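The reindex-behind-an-alias rehearsal can be modeled in a few lines before you try it on a real engine. This is a toy in-memory sketch under stated assumptions: `SearchCluster`, `analyze_v1`, `analyze_v2`, and the index names are all hypothetical, and the plural-stripping "fix" is a stand-in for a real analyzer change. In Elasticsearch or OpenSearch the equivalent swap would go through the aliases API, with the remove and add actions in a single request.

```python
import re
from collections import defaultdict


def analyze_v1(text):
    """Old analyzer: lowercase tokens, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())


def analyze_v2(text):
    """New analyzer: v1 plus a crude plural strip (illustrative only)."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in analyze_v1(text)]


class SearchCluster:
    def __init__(self):
        self.indexes = {}  # index name -> (analyzer, postings)
        self.aliases = {}  # alias name -> index name

    def build_index(self, name, analyzer, docs):
        """Tokens are written once at index time; changing the analyzer means a new index."""
        postings = defaultdict(set)
        for doc_id, text in docs.items():
            for token in analyzer(text):
                postings[token].add(doc_id)
        self.indexes[name] = (analyzer, postings)

    def swap_alias(self, alias, new_index):
        """Repoint the alias in one assignment; return the previous target."""
        if new_index not in self.indexes:
            raise KeyError(new_index)
        old_index = self.aliases.get(alias)
        self.aliases[alias] = new_index
        return old_index

    def search(self, alias, query):
        """Resolve the alias, then analyze the query with that index's own analyzer."""
        analyzer, postings = self.indexes[self.aliases[alias]]
        tokens = analyzer(query)
        if not tokens:
            return set()
        return set.intersection(*(postings.get(t, set()) for t in tokens))


docs = {1: "trail shoes", 2: "running shoe"}
cluster = SearchCluster()
cluster.build_index("docs_v1", analyze_v1, docs)
cluster.swap_alias("docs", "docs_v1")
before_swap = cluster.search("docs", "shoe")

cluster.build_index("docs_v2", analyze_v2, docs)  # rebuild alongside the live index
previous = cluster.swap_alias("docs", "docs_v2")  # readers never see a partial index
after_swap = cluster.search("docs", "shoe")
```

Here `before_swap` finds only document 2, and `after_swap` finds both documents once the alias points at the rebuilt index. Storing the analyzer alongside each index is a design choice in this sketch: it keeps query analysis in step with whichever index the alias currently resolves to, so the swap cannot reintroduce an analyzer mismatch.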