Data Engineering
Data platform: multiple-choice review
Six questions that cut across the whole track. Each one is a design call you make when one fact has to live in several stores at once: not a definition, but a choice of store, format, or contract under a real workload.
Confirm you can route a workload to the right store and layout, and reason about the seams between them. This is the synthesis that the OLTP/OLAP, ELT, Parquet, materialized-view (MV), event-sourcing, search, and vector units all built toward.
A single product fact must serve point lookups in the checkout path AND a full-table revenue scan for analytics. What architecture would a senior engineer choose?
Your team is choosing between ETL (transform in a separate engine before load) and ELT (load raw, transform in-warehouse with dbt). The data is messy and the business keeps changing the definition of 'active user'. Which fits, and why?
A nightly dashboard query filters event_date = '2026-05-01' AND country = 'US' over a 2 TB Parquet/Iceberg table and still scans most of the data. What is the highest-leverage fix?
A gold materialized view serving a dashboard refreshes every 6 hours. A finance lead complains the number 'is wrong' versus a live SQL count. Both are internally correct. What did the design get wrong?
A service writes an order to Postgres, then publishes 'OrderPlaced' to Kafka so search and analytics react. Occasionally the search index never learns about an order. What is the root cause and the fix?
Catalog search must match misspelled product names AND the RAG assistant must answer 'a laptop good for video editing'. One team proposes using the vector index for both. What is the correct split?
The track’s through-line is one habit: route each workload to the store and layout that fits it, then design the contract at every seam. Row-store OLTP for point writes, columnar Parquet for scans (with footer-stat pruning), ELT over retained raw for replayable definitions, MVs for read latency with a declared freshness SLA, an outbox to kill the dual-write, inverted indexes for lexical search, and vector ANN for semantic retrieval. Each store is correct for its job; the system stays correct only when you own the schema, delivery guarantee, freshness SLA, and reconciliation between them.
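As one illustration of "an outbox to kill the dual-write", here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for Postgres. The table names (`orders`, `outbox`), the helper names (`init_db`, `place_order`, `relay_once`), and the `publish` callback are all hypothetical and chosen for this example only; a real relay would publish to Kafka, and many teams use change data capture instead of polling. The point it shows is that the order row and its event row commit in one transaction, and a separate relay delivers the event at least once.

```python
import json
import sqlite3


def init_db(conn):
    # Hypothetical schema: the business table plus an outbox table in the SAME database.
    conn.executescript(
        """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer TEXT NOT NULL,
            total_cents INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(conn, order_id, customer, total_cents):
    # One local transaction: the order and its event commit together or not at all.
    # `with conn` commits on success and rolls back if either INSERT raises.
    event = {"order_id": order_id, "customer": customer, "total_cents": total_cents}
    with conn:
        conn.execute(
            "INSERT INTO orders (id, customer, total_cents) VALUES (?, ?, ?)",
            (order_id, customer, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps(event)),
        )


def relay_once(conn, publish):
    # Poll unpublished events in insertion order and hand each to `publish`
    # (a stand-in for a broker client). If `publish` raises, the row stays
    # unpublished and is retried on the next pass, so delivery is
    # at-least-once and consumers should be idempotent.
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    place_order(conn, 1, "ada", 129900)
    relay_once(conn, lambda event_type, payload: print(event_type, payload))
```

The design choice worth noticing is that the service never talks to the broker inside the request path. A broker outage delays events but cannot lose them, which is the contract the search index and analytics consumers depend on.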