Data platform: design from ingest to serving

Capstone project — design and build a small data platform from ingest to serving, choosing the right store, format, and index per workload, and prove the seams hold.

DATA Senior ◷ 240 min

Reading about seven stores that agree with each other is not the same as making them agree. Build a small end-to-end platform for one product domain — ingest to warehouse to serving — choose the right store, format, and index for each workload, then deliberately break a seam and prove your contracts catch it.

Goal

Turn the whole track into one coherent system: route each workload to the store and layout that fits it, connect them with a delivery contract that survives a crash, and demonstrate that consistency, freshness, and lineage are properties you designed at the seams — not assumptions.
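One way to make a delivery contract survive a crash is to accept at-least-once delivery and make the consumer idempotent. The sketch below is a minimal illustration under that assumption, not a prescribed design: `ChangeEvent`, `SearchProjection`, and their fields are hypothetical names, and the in-memory dictionaries stand in for a real downstream store.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event; field names are illustrative only."""
    event_id: str         # unique per change, used for deduplication
    product_id: str
    version: int          # increases by one per change to a product
    name: Optional[str]   # None marks a delete (tombstone)


@dataclass
class SearchProjection:
    """Toy stand-in for a downstream store such as a search index."""
    docs: dict = field(default_factory=dict)      # product_id -> name
    versions: dict = field(default_factory=dict)  # product_id -> last applied version
    seen: set = field(default_factory=set)        # event ids already processed

    def apply(self, ev: ChangeEvent) -> bool:
        """Apply an event; return True only if the projection changed."""
        if ev.event_id in self.seen:
            return False  # redelivery after a crash or retry
        self.seen.add(ev.event_id)
        if ev.version <= self.versions.get(ev.product_id, 0):
            return False  # older than what is already applied (out of order)
        self.versions[ev.product_id] = ev.version
        if ev.name is None:
            self.docs.pop(ev.product_id, None)  # delete keeps the version as a tombstone
        else:
            self.docs[ev.product_id] = ev.name
        return True
```

Replaying the same batch twice leaves the projection unchanged, which is the property a consumer needs when the producer can only promise at-least-once. Keeping the version after a delete prevents a late, older update from resurrecting a removed product.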

Project

Objective

Design and build a small data platform for one product domain (e.g. an e-commerce catalog of products + orders) spanning ingest, warehouse, and serving. Pick the right store, file format, and index for each workload, wire a reliable integration contract between them, and prove the system stays correct when a seam fails.
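An integration contract is easier to enforce when it is written down as data rather than prose. The following is a rough sketch of one seam's contract plus the freshness check it implies; `SeamContract`, its fields, and `is_stale` are hypothetical names chosen for illustration, and a real platform would likely keep this in a catalog or config file.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SeamContract:
    """One row of a data-contract table (illustrative fields)."""
    seam: str                 # e.g. "oltp -> search"
    schema_owner: str         # team that owns the canonical schema
    delivery: str             # "at-least-once" or "exactly-once"
    freshness_sla: timedelta  # maximum tolerated lag at the consumer
    reconciliation: str       # how drift is detected and repaired

    @property
    def consumer_must_be_idempotent(self) -> bool:
        # At-least-once delivery can redeliver, so the consumer must tolerate duplicates.
        return self.delivery == "at-least-once"


def is_stale(contract: SeamContract, last_event_at: datetime, now: datetime) -> bool:
    """True when the newest event seen at the consumer is older than the SLA allows."""
    return now - last_event_at > contract.freshness_sla


contract = SeamContract(
    seam="oltp -> search",
    schema_owner="catalog-team",
    delivery="at-least-once",
    freshness_sla=timedelta(minutes=5),
    reconciliation="scheduled id + version diff against OLTP",
)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stale(contract, now - timedelta(minutes=6), now))  # True
```

Passing `now` explicitly keeps the check deterministic in tests; a scheduled job would pass the current UTC time.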

Requirements

Acceptance criteria

A one-page architecture diagram labelling each store with the workload it serves and why that layout (row vs columnar vs inverted vs vector) was chosen, so that no store is doing a job it is poorly suited to.
A data-contract table per seam: canonical schema + owner, delivery guarantee (and therefore consumer idempotency), freshness SLA, and the reconciliation that repairs drift.
A demonstrated end-to-end flow: a product update in OLTP becomes visible in the warehouse, the dashboard MV, search, and the vector index — with the propagation path and lag shown.
Evidence the chaos test worked: a captured drift (deleted product still searchable) and a captured repair (reconciliation diff + fix), plus a freshness check failing on a stalled feed.
A lineage walk write-up: given a wrong dashboard number, trace the path backward (gold, then silver, then bronze, then the CDC offset, then OLTP) and use a point-in-time (time-travel) query to prove which layer stalled.
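The captured drift and captured repair can be as simple as a diff of ids and versions between the source of truth and a derived store. The sketch below assumes both sides can be exported as a mapping from product id to version; `Drift`, `diff_versions`, and `repair` are hypothetical helpers, and the plain dictionaries stand in for an OLTP table and a search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    orphaned: frozenset  # in the index but gone from the source (deleted yet still searchable)
    missing: frozenset   # live in the source but absent from the index
    stale: frozenset     # present on both sides with different versions

    @property
    def is_clean(self) -> bool:
        return not (self.orphaned or self.missing or self.stale)


def diff_versions(source: dict, index: dict) -> Drift:
    """Compare id -> version maps from the source of truth and a derived store."""
    shared = source.keys() & index.keys()
    return Drift(
        orphaned=frozenset(index.keys() - source.keys()),
        missing=frozenset(source.keys() - index.keys()),
        stale=frozenset(k for k in shared if source[k] != index[k]),
    )


def repair(source: dict, index: dict, drift: Drift) -> None:
    """Bring the index back in line with the source, in place."""
    for key in drift.orphaned:
        del index[key]
    for key in drift.missing | drift.stale:
        index[key] = source[key]


oltp = {"p1": 3, "p3": 1, "p4": 2}
search = {"p1": 3, "p2": 5, "p4": 1}  # p2 was deleted upstream but is still indexed
drift = diff_versions(oltp, search)
repair(oltp, search, drift)
print(diff_versions(oltp, search).is_clean)  # True
```

Saving the `Drift` value before the repair and the clean diff after it gives the two pieces of evidence the criterion asks for.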

Senior stretch

Add hybrid search: combine BM25 lexical scores and vector similarity into one ranked result set, and measure the relevance lift over either alone on a small labelled query set.
Add an event-sourced audit stream for one entity (append-only log with event id + version) and rebuild a read model by replaying it, proving current state is a fold over the log.
Add a CI gate that runs the freshness checks and a sample reconciliation on a canary dataset, failing the build on undeclared staleness or unrepaired drift.
Tune the vector index (ef_search / HNSW M) and chart the recall-vs-latency curve, then pick an operating point against a stated recall SLO instead of a default.
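For the event-sourced audit stream, "current state is a fold over the log" can be shown directly with a reducer. This is a minimal sketch under assumed event kinds (`created`, `price_changed`, `deleted`) and an assumed payload shape; none of these names come from the lesson.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    event_id: str
    version: int   # 1, 2, 3, ... per entity, with no gaps
    kind: str      # hypothetical kinds: "created", "price_changed", "deleted"
    payload: dict


EMPTY = {"version": 0, "exists": False}


def apply_event(state: dict, ev: Event) -> dict:
    """Pure transition function: old state + event -> new state."""
    if ev.version != state["version"] + 1:
        raise ValueError(f"gap or reorder at version {ev.version}")
    if ev.kind == "created":
        return {**ev.payload, "version": ev.version, "exists": True}
    if ev.kind == "price_changed":
        return {**state, "price": ev.payload["price"], "version": ev.version}
    if ev.kind == "deleted":
        return {"version": ev.version, "exists": False}
    raise ValueError(f"unknown event kind: {ev.kind}")


def replay(log: list) -> dict:
    """Rebuild the read model by folding the transition function over the log."""
    return reduce(apply_event, log, EMPTY)


log = [
    Event("e1", 1, "created", {"name": "Mug", "price": 900}),
    Event("e2", 2, "price_changed", {"price": 750}),
]
print(replay(log))      # state after both events
print(replay(log[:1]))  # state as of version 1
```

Because `apply_event` never mutates its input, replaying any prefix of the log yields the state as of that version, which also gives a point-in-time view for audits.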

Recap

This is the system you’ll actually be asked to design: ingest from a row-oriented OLTP source of truth, stream it reliably with an outbox instead of a dual-write, land it columnar for cheap prunable scans, transform it through replayable medallion layers over retained raw, and serve each workload from the store built for it — MV for dashboards, inverted index for keyword search, vector ANN for semantics. Each store is individually correct; the platform is correct only because you designed the contract, delivery guarantee, freshness SLA, and reconciliation at every seam — and proved it by breaking one on purpose.
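The difference between an outbox and a dual-write is easiest to see in code. The sketch below uses SQLite from the Python standard library as a stand-in for the OLTP database; the table names, `update_product`, and `relay` are hypothetical, and `publish` represents whatever broker client the platform uses.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT NOT NULL, version INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db


def update_product(db: sqlite3.Connection, product_id: str, name: str) -> int:
    """Write the state change and its event in ONE transaction (no dual-write)."""
    with db:  # commits both inserts together, or rolls both back on error
        row = db.execute("SELECT version FROM products WHERE id = ?", (product_id,)).fetchone()
        version = (row[0] if row else 0) + 1
        db.execute("INSERT OR REPLACE INTO products (id, name, version) VALUES (?, ?, ?)",
                   (product_id, name, version))
        event = {"product_id": product_id, "name": name, "version": version}
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return version


def relay(db: sqlite3.Connection, publish) -> int:
    """Publish unpublished outbox rows in order; return how many were sent."""
    rows = db.execute("SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq").fetchall()
    sent = 0
    for seq, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays unpublished and is retried
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

A crash between `publish` and the `UPDATE` causes the same event to be sent again on the next run, so this relay gives at-least-once delivery and relies on idempotent consumers downstream.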
