Data Engineering
ELT vs ETL: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer; the effort of recall is what makes the pipeline reasoning stick.
Reconstruct the unit’s spine without looking back: why the industry flipped, what replayability buys, the medallion contract, where cost moved, and the two properties — incremental and idempotent — that keep a warehouse pipeline cheap and correct.
- 01. Explain why the industry flipped from ETL to ELT, and what you gave up in the trade.
- 02. What is replayability, and why is it the deepest reason ELT became the default, deeper even than cost?
- 03. Describe the medallion architecture and the one contract that keeps it sound.
- 04. Where did the cost go when the industry moved to ELT, and what is the single most expensive mistake?
- 05. What does it mean for a load to be idempotent, why must warehouse loads be idempotent, and how do you achieve it in dbt?
- 06. How do you run a 90-day backfill safely after fixing a transform bug, and why are microbatch models the right tool?
If you can reconstruct each answer from memory, you hold the unit’s spine: decoupled storage and compute enabled the flip; replayability, not cost, is its deepest payoff; the medallion contract (immutable bronze, transform forward) is what makes replay sound; cost moved onto the metered warehouse, so you go incremental by default; every load must be idempotent (merge on a unique_key) because loaders retry; and microbatch turns a long backfill into restartable, independent, idempotent units.
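To make the last two points concrete, here is a minimal sketch in plain Python. It is an analogue of the ideas, not dbt itself: the in-memory dict stands in for a warehouse table, and `bronze_rows_for`, `append_batch`, `merge_batch`, `backfill`, and the `order_id` key are hypothetical names chosen for illustration. It shows why a merge on a unique key survives a loader retry while an append does not, and how splitting a backfill into one-day batches lets an interrupted run resume without redoing or duplicating work.

```python
from datetime import date, timedelta


def bronze_rows_for(day: date) -> list[dict]:
    """Hypothetical immutable bronze rows for one day (two orders per day)."""
    return [
        {"order_id": f"{day.isoformat()}-{i}", "day": day.isoformat(), "amount": 10 * (i + 1)}
        for i in range(2)
    ]


def append_batch(target: list[dict], rows: list[dict]) -> None:
    """Non-idempotent load: a retry duplicates every row."""
    target.extend(rows)


def merge_batch(target: dict[str, dict], rows: list[dict], unique_key: str = "order_id") -> None:
    """Idempotent load: upsert on unique_key, so a retry rewrites the same rows."""
    for row in rows:
        target[row[unique_key]] = row


def backfill(target: dict[str, dict], source_for_day, start: date, days: int, done: set[date]) -> None:
    """Process one day per batch and skip days already recorded in `done`."""
    for offset in range(days):
        day = start + timedelta(days=offset)
        if day in done:
            continue  # finished in an earlier run, nothing to redo
        merge_batch(target, source_for_day(day))
        done.add(day)  # recorded only after the batch succeeds


# Simulate a loader retry: the same batch arrives twice.
batch = bronze_rows_for(date(2024, 1, 1))
appended: list[dict] = []
merged: dict[str, dict] = {}
for _ in range(2):
    append_batch(appended, batch)
    merge_batch(merged, batch)
print(len(appended), len(merged))  # 4 rows appended vs 2 rows merged
```

If `backfill` fails partway through a 90-day range, calling it again with the same `done` set resumes at the first unfinished day. Because each day is merged on its key, even re-running a day that had partly landed leaves the table unchanged.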