Data Engineering
Data platform: design ingest to serving
Reading about seven stores that agree with each other is not the same as making them agree. Build a small end-to-end platform for one product domain — ingest to warehouse to serving — choose the right store, format, and index for each workload, then deliberately break a seam and prove your contracts catch it.
Turn the whole track into one coherent system: route each workload to the store and layout that fits it, connect them with a delivery contract that survives a crash, and demonstrate that consistency, freshness, and lineage are properties you designed at the seams — not assumptions.
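One way to picture a delivery contract that survives a crash is the transactional outbox paired with an idempotent consumer. The sketch below is a minimal illustration, not a prescribed implementation: it uses an in-memory SQLite database as a stand-in for the OLTP store, and the table names (`products`, `outbox`), the `SearchIndexConsumer` class, and the `crash_before_ack` flag are all hypothetical names chosen for this example.

```python
import json
import sqlite3


def open_oltp() -> sqlite3.Connection:
    """Create a throwaway OLTP database with a business table and an outbox."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (
            id INTEGER PRIMARY KEY, name TEXT NOT NULL, version INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            event_id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn


def update_product(conn: sqlite3.Connection, product_id: int, name: str) -> None:
    """Write the row change and its event in ONE transaction (no dual write)."""
    row = conn.execute(
        "SELECT version FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    version = 1 if row is None else row[0] + 1
    with conn:  # commits both inserts together, or rolls both back
        conn.execute(
            "INSERT OR REPLACE INTO products (id, name, version) VALUES (?, ?, ?)",
            (product_id, name, version),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"id": product_id, "name": name, "version": version}),),
        )


def relay(conn: sqlite3.Connection, publish, crash_before_ack: bool = False) -> None:
    """Publish unpublished events in order; delivery is at-least-once."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in rows:
        publish(event_id, json.loads(payload))
        if crash_before_ack:
            return  # simulated crash: delivered but never marked, so it is re-sent
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,)
            )


class SearchIndexConsumer:
    """Idempotent consumer: remembers event ids so redelivery is a no-op."""

    def __init__(self) -> None:
        self.docs: dict[int, str] = {}
        self.seen: set[int] = set()
        self.applied = 0

    def handle(self, event_id: int, event: dict) -> None:
        if event_id in self.seen:
            return  # duplicate from an at-least-once relay
        self.seen.add(event_id)
        self.docs[event["id"]] = event["name"]
        self.applied += 1


conn = open_oltp()
consumer = SearchIndexConsumer()
update_product(conn, 1, "kettle")
relay(conn, consumer.handle, crash_before_ack=True)  # delivered, not acknowledged
relay(conn, consumer.handle)                         # same event delivered again
```

Because the relay can crash between publishing and marking an event, the guarantee is at-least-once, which is why the consumer has to deduplicate. In this run the event is delivered twice but applied once.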
Design and build a small data platform for one product domain (e.g. an e-commerce catalog of products + orders) spanning ingest, warehouse, and serving. Pick the right store, file format, and index for each workload, wire a reliable integration contract between them, and prove the system stays correct when a seam fails.
- A one-page architecture diagram labelling each store with the workload it serves and why that layout (row vs columnar vs inverted vs vector) was chosen, so that no store is doing a job it is unsuited for.
- A data-contract table per seam: canonical schema + owner, delivery guarantee (and therefore consumer idempotency), freshness SLA, and the reconciliation that repairs drift.
- A demonstrated end-to-end flow: a product update in OLTP becomes visible in the warehouse, the dashboard materialized view (MV), search, and the vector index, with the propagation path and the lag at each hop shown.
- Evidence the chaos test worked: a captured drift (deleted product still searchable) and a captured repair (reconciliation diff + fix), plus a freshness check failing on a stalled feed.
- A lineage walk write-up: given a wrong dashboard number, trace the backward path from gold to silver to bronze to the CDC offset to OLTP, using a point-in-time/time-travel query to prove which layer stalled.
- Add hybrid search: combine BM25 lexical scores and vector similarity into one ranked result set, and measure the relevance lift over either alone on a small labelled query set.
- Add an event-sourced audit stream for one entity (append-only log with event id + version) and rebuild a read model by replaying it, proving current state is a fold over the log.
- Add a CI gate that runs the freshness checks and a sample reconciliation on a canary dataset, failing the build on undeclared staleness or unrepaired drift.
- Tune the vector index (ef_search / HNSW M) and chart the recall-vs-latency curve, then pick an operating point against a stated recall SLO instead of a default.
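For the hybrid-search item above, one common way to merge a lexical and a vector result list is reciprocal rank fusion (RRF). Note that RRF combines ranks rather than the raw BM25 and similarity scores the item mentions, which sidesteps having to normalize two incomparable score scales; a weighted score blend is an equally valid choice. The sketch below assumes each retriever has already returned an ordered list of document ids. The ids, the labelled relevant set, and the constant `k=60` (a conventional default) are illustrative toy inputs, not measurements.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labelled relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Toy inputs for one query (hypothetical ids and labels).
bm25_ranking = ["p1", "p2", "p3"]     # lexical retriever output
vector_ranking = ["p3", "p4", "p1"]   # ANN retriever output
relevant = {"p1", "p3"}               # human-labelled relevant docs

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])

for name, ranking in [("bm25", bm25_ranking),
                      ("vector", vector_ranking),
                      ("hybrid", fused)]:
    print(name, recall_at_k(ranking, relevant, k=2))
```

Averaging the same metric over the whole labelled query set, rather than one query, gives the relevance-lift comparison the stretch goal asks for.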
This is the system you’ll actually be asked to design: ingest from a row-oriented OLTP source of truth, stream it reliably with an outbox instead of a dual-write, land it columnar for cheap prunable scans, transform it through replayable medallion layers over retained raw, and serve each workload from the store built for it — MV for dashboards, inverted index for keyword search, vector ANN for semantics. Each store is individually correct; the platform is correct only because you designed the contract, delivery guarantee, freshness SLA, and reconciliation at every seam — and proved it by breaking one on purpose.
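Breaking a seam on purpose and catching it can be quite small in code. The sketch below is an illustration under stated assumptions: plain dicts stand in for the OLTP table and the search index, and the function names (`reconcile`, `repair`, `is_fresh`), the 5-minute SLA, and the timestamps are hypothetical. It reproduces the drift described earlier (a deleted product that is still searchable), repairs it from a key-set diff, and shows a freshness check failing on a stalled feed.

```python
from datetime import datetime, timedelta, timezone


def reconcile(source_ids: set[int], index_ids: set[int]) -> dict[str, list[int]]:
    """Diff the key sets on both sides of a seam."""
    return {
        "missing": sorted(source_ids - index_ids),   # in source, not served
        "orphaned": sorted(index_ids - source_ids),  # served, but gone from source
    }


def repair(index: dict[int, str], source: dict[int, str],
           diff: dict[str, list[int]]) -> None:
    """Make the index agree with the source of truth for the drifted keys."""
    for product_id in diff["orphaned"]:
        del index[product_id]
    for product_id in diff["missing"]:
        index[product_id] = source[product_id]


def is_fresh(last_event_at: datetime, now: datetime, sla: timedelta) -> bool:
    """True when the newest event seen at this seam is within the freshness SLA."""
    return now - last_event_at <= sla


# Chaos test: delete in the source while the delete event is "lost" in transit.
source = {1: "kettle", 2: "teapot"}
search_index = dict(source)
del source[2]  # product 2 is deleted but still searchable

drift = reconcile(set(source), set(search_index))
print("captured drift:", drift)

repair(search_index, source, drift)
after = reconcile(set(source), set(search_index))
print("after repair:", after)

# Stalled feed: the last event arrived 20 minutes ago against a 5-minute SLA.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_event_at = now - timedelta(minutes=20)
feed_ok = is_fresh(last_event_at, now, sla=timedelta(minutes=5))
print("feed fresh:", feed_ok)
```

A key-set diff only catches missing and orphaned rows; comparing a per-row version or checksum would be needed to catch rows that exist on both sides but disagree.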