Data Engineering
ELT vs ETL: multiple-choice review
Six questions that cut across the whole unit. Each one is a decision you actually make when designing a pipeline: not a definition to recite, but a tradeoff to weigh against cost, replayability, and compliance.
Confirm you can connect where the Transform runs to its downstream consequences: replayability, the warehouse bill, schema discipline, and the idempotency that keeps a retry from doubling your data.
What single architectural change in cloud warehouses (Snowflake, BigQuery) is the real reason the industry flipped from ETL to ELT?
You discover a timezone bug in a transform that has shipped wrong numbers for six months. Under ELT with a medallion architecture, what is the fast, correct fix?
A dbt model was configured to do a full refresh on every run and scheduled hourly, so it rebuilds a 2 TB fact table from scratch each hour, and the Snowflake bill jumped 40%. The output is correct. Where is the bug and what is the fix?
A regulated fintech ingests payment events containing card PANs, and compliance forbids raw cardholder data from ever sitting in the analytics warehouse. Which pattern fits, and why is the modern ELT default wrong here?
An EL tool retried a partially-succeeded load and your revenue fact table now shows inflated totals. What property was missing, and what is the durable design fix?
Someone calls schema-on-read 'pure freedom — no schema to fight at load time.' What does the unit's framing say they are missing?
The through-line: where the Transform runs decides everything downstream. Decoupled storage and compute made landing raw data cheap, which buys replayability through the medallion contract (immutable bronze, cleaned silver, business-ready gold). But the T is now metered on the warehouse bill, so you go incremental by default. And because loaders retry, every load must be idempotent (merge on a unique_key), or a retry doubles your data. ELT is the default; ETL still wins when a hard rule says raw PII must never touch the warehouse.
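
The idempotency point can be made concrete with a minimal sketch. It uses an in-memory SQLite database as a stand-in for a warehouse, and the table names (fact_append, fact_merge), the event_id key, and the sample amounts are all hypothetical, chosen only for illustration. The same batch is loaded twice to simulate a loader retry: the append-only table double-counts, while the table keyed on a unique identifier ends up with the same total as a single load.

```python
import sqlite3


def load_append(conn, rows):
    """Naive load: a blind INSERT, so a retried batch duplicates every row."""
    conn.executemany(
        "INSERT INTO fact_append (event_id, amount_cents) VALUES (?, ?)", rows
    )
    conn.commit()


def load_merge(conn, rows):
    """Idempotent load: event_id is the primary key, so replaying a batch
    overwrites the existing rows instead of adding new ones."""
    conn.executemany(
        "INSERT OR REPLACE INTO fact_merge (event_id, amount_cents) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def total(conn, table):
    """Sum of amount_cents in the given (hard-coded, trusted) table name."""
    return conn.execute(
        f"SELECT COALESCE(SUM(amount_cents), 0) FROM {table}"
    ).fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact_append (event_id TEXT, amount_cents INTEGER)")
    conn.execute(
        "CREATE TABLE fact_merge (event_id TEXT PRIMARY KEY, amount_cents INTEGER)"
    )
    batch = [("evt-1", 100), ("evt-2", 250)]  # hypothetical payment events

    # The second pass simulates the loader retrying the same batch.
    for _ in range(2):
        load_append(conn, batch)
        load_merge(conn, batch)

    return total(conn, "fact_append"), total(conn, "fact_merge")


if __name__ == "__main__":
    appended, merged = demo()
    print(f"append-only total: {appended}")  # 700: the retry doubled revenue
    print(f"merge-on-key total: {merged}")   # 350: the retry was a no-op
```

In a real warehouse the equivalent would typically be a MERGE statement or an incremental model keyed on a unique column, not SQLite's INSERT OR REPLACE; the sketch only illustrates why a unique key makes a retried load safe.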