
Data Engineering

Data platform: design ingest to serving

Crux: Capstone project. Design and build a small data platform from ingest to serving, choosing the right store, format, and index per workload, and prove the seams hold.
◷ 240 min

Reading about seven stores that agree with each other is not the same as making them agree. Build a small end-to-end platform for one product domain — ingest to warehouse to serving — choose the right store, format, and index for each workload, then deliberately break a seam and prove your contracts catch it.

Goal

Turn the whole track into one coherent system: route each workload to the store and layout that fits it, connect them with a delivery contract that survives a crash, and demonstrate that consistency, freshness, and lineage are properties you designed at the seams — not assumptions.
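
A delivery contract that survives a crash usually means at-least-once delivery, which in turn pushes deduplication onto the consumer. The sketch below is one minimal way to show that, assuming every event carries a unique id; the `Event` fields and the `IdempotentConsumer` class are hypothetical names chosen for illustration, not part of the brief.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str      # assumed unique per change; used as the dedup key
    product_id: str
    price_cents: int


@dataclass
class IdempotentConsumer:
    seen: set = field(default_factory=set)      # ids already applied
    prices: dict = field(default_factory=dict)  # the read model being maintained

    def handle(self, event: Event) -> bool:
        """Apply the event once; return False when it is a redelivery."""
        if event.event_id in self.seen:
            return False
        self.prices[event.product_id] = event.price_cents
        self.seen.add(event.event_id)
        return True


def deliver(consumer: IdempotentConsumer, events: list) -> int:
    """Feed a batch that may contain replays; return how many were applied."""
    return sum(1 for e in events if consumer.handle(e))
```

In a real consumer the `seen` set and the read model would need to be persisted in the same transaction, otherwise a crash between the two writes reopens the gap this pattern is meant to close.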

Project
Objective

Design and build a small data platform for one product domain (e.g. an e-commerce catalog of products + orders) spanning ingest, warehouse, and serving. Pick the right store, file format, and index for each workload, wire a reliable integration contract between them, and prove the system stays correct when a seam fails.
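
One way to keep the store-per-workload decision honest is to write it down as data and check it. The sketch below assumes a simple mapping from workload type to layout (row, columnar, inverted, vector); the workload labels and store names are hypothetical placeholders, and a real platform would likely have more nuance than a one-to-one table.

```python
# Assumed pairing of workload type to storage layout, for illustration only.
EXPECTED_LAYOUT = {
    "transactional": "row",
    "analytical_scan": "columnar",
    "keyword_search": "inverted",
    "semantic_search": "vector",
}


def misrouted(architecture: dict) -> list:
    """Return store names whose layout does not fit their declared workload.

    `architecture` maps store name -> (workload, layout).
    """
    return sorted(
        store
        for store, (workload, layout) in architecture.items()
        if EXPECTED_LAYOUT.get(workload) != layout
    )


architecture = {
    "oltp_db": ("transactional", "row"),
    "lake": ("analytical_scan", "columnar"),
    "search_index": ("keyword_search", "inverted"),
    "embeddings": ("semantic_search", "row"),  # deliberately wrong
}
print(misrouted(architecture))
```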

Requirements
Acceptance criteria
  • A one-page architecture diagram labelling each store with the workload it serves and why that layout (row vs columnar vs inverted vs vector) was chosen — no store is doing a job it's wrong for.
  • A data-contract table per seam: canonical schema + owner, delivery guarantee (and therefore consumer idempotency), freshness SLA, and the reconciliation that repairs drift.
  • A demonstrated end-to-end flow: a product update in OLTP becomes visible in the warehouse, the dashboard MV, search, and the vector index — with the propagation path and lag shown.
  • Evidence the chaos test worked: a captured drift (deleted product still searchable) and a captured repair (reconciliation diff + fix), plus a freshness check failing on a stalled feed.
  • A lineage walk write-up: given a wrong dashboard number, trace the path backward from gold to silver to bronze to CDC offset to OLTP, using a point-in-time/time-travel query to prove which layer stalled.
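
The contract table, the freshness check, and the reconciliation diff from the criteria above can be sketched in a few lines. This is a minimal illustration, assuming each seam is described by a small record and that both sides can enumerate their ids; `SeamContract`, `is_fresh`, and `reconcile` are hypothetical names, and the owner and SLA values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SeamContract:
    name: str                 # e.g. "oltp -> search"
    owner: str                # who owns the canonical schema
    delivery: str             # e.g. "at-least-once", so consumers must be idempotent
    freshness_sla: timedelta  # maximum tolerated lag at this seam


def is_fresh(contract: SeamContract, last_event_at: datetime, now: datetime) -> bool:
    """True while the newest event seen downstream is within the SLA."""
    return now - last_event_at <= contract.freshness_sla


def reconcile(source_ids: set, index_ids: set) -> dict:
    """Diff the source of truth against a derived store."""
    return {
        "missing_in_index": sorted(source_ids - index_ids),
        "orphaned_in_index": sorted(index_ids - source_ids),
    }


contract = SeamContract("oltp -> search", "catalog-team", "at-least-once",
                        timedelta(minutes=5))
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stalled = not is_fresh(contract, now - timedelta(minutes=30), now)
drift = reconcile({"p1", "p2"}, {"p1", "p2", "p3"})  # p3 was deleted in OLTP
```

Here `drift["orphaned_in_index"]` is the captured drift (a deleted product still searchable), and deleting those ids from the index is the repair.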
Senior stretch
  • Add hybrid search: combine BM25 lexical scores and vector similarity into one ranked result set, and measure the relevance lift over either alone on a small labelled query set.
  • Add an event-sourced audit stream for one entity (append-only log with event id + version) and rebuild a read model by replaying it, proving current state is a fold over the log.
  • Add a CI gate that runs the freshness checks and a sample reconciliation on a canary dataset, failing the build on undeclared staleness or unrepaired drift.
  • Tune the vector index (ef_search / HNSW M) and chart the recall-vs-latency curve, then pick an operating point against a stated recall SLO instead of a default.
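
For the hybrid search stretch goal, one common way to merge a lexical ranking and a vector ranking is reciprocal rank fusion (RRF). The brief does not prescribe a fusion method, so treat this as a swapped-in technique: it fuses ranks rather than raw BM25 or similarity scores, which sidesteps the problem that the two score scales are not comparable. The constant `k=60` is a conventional default, and the product ids are invented.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several best-first rankings of document ids into one.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["p1", "p2", "p3"]   # hypothetical BM25 order
semantic = ["p3", "p1", "p4"]  # hypothetical vector-similarity order
fused = reciprocal_rank_fusion([lexical, semantic])
```

To measure the relevance lift the stretch goal asks for, the fused list would be scored against the labelled query set with the same metric used for each ranking on its own.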
Recap

This is the system you’ll actually be asked to design: ingest from a row-oriented OLTP source of truth, stream it reliably with an outbox instead of a dual-write, land it columnar for cheap prunable scans, transform it through replayable medallion layers over retained raw, and serve each workload from the store built for it — MV for dashboards, inverted index for keyword search, vector ANN for semantics. Each store is individually correct; the platform is correct only because you designed the contract, delivery guarantee, freshness SLA, and reconciliation at every seam — and proved it by breaking one on purpose.
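
The outbox-instead-of-dual-write idea can be shown with nothing but SQLite: the business row and the outgoing event commit in one local transaction, and a separate relay publishes afterwards. This is a sketch under stated assumptions; the table names, the `published` flag, and the `publish` callback are hypothetical, and a production relay would more likely tail the log via CDC than poll.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (id TEXT PRIMARY KEY, price_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn


def update_price(conn: sqlite3.Connection, product_id: str, price_cents: int) -> None:
    """Write the state change and its event atomically (no dual-write)."""
    with conn:  # commits both inserts together, or rolls both back
        conn.execute(
            "INSERT OR REPLACE INTO products (id, price_cents) VALUES (?, ?)",
            (product_id, price_cents),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"product_id": product_id, "price_cents": price_cents}),),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending events in order; mark each only after publish returns.

    If publish raises, the row stays pending and is retried on the next call,
    which is why delivery is at-least-once rather than exactly-once.
    """
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

A crash between `publish` and the `UPDATE` causes a redelivery, not a loss, so the downstream consumer still has to be idempotent.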

