Data Engineering
ELT vs ETL: where the Transform runs, and why the industry flipped
A finance analyst pings you on a Monday: “Why did our Snowflake bill jump 40% last week?” Nothing shipped — except a pipeline change. Someone set a dbt model to full-refresh by default and scheduled it hourly. Every run rescans the entire 2 TB fact table and rebuilds it from scratch. The transform is correct. The bill is the bug. This is the ELT tax nobody warned them about: when the warehouse does the T, the warehouse meters every byte you touch.
The pipeline split: where does T live
Both patterns move data from sources into a warehouse. The only real difference is the order of the middle letter — and that one swap changes everything downstream.
ETL (extract, transform, load) was born when storage and compute were the same expensive box. You could not afford to dump everything into the warehouse, so you transformed first — in a separate engine (Informatica, a Spark/Hadoop cluster, a Python job) — cleaned, joined, filtered, and aggregated, then loaded only the polished result. The warehouse only ever saw the finished table.
ELT (extract, load, transform) flips the last two steps. You load the raw source data into the warehouse first, untouched, then transform it in place with SQL. The transformation is no longer a separate system — it is queries running on the same engine that stores the data. dbt is the dominant tool for the “T”: version-controlled, tested, documented SQL models instead of a black-box pipeline.
The reason the industry flipped is architectural, not fashion. Snowflake and BigQuery decouple storage from compute. Storage is cheap object storage; compute is a separate, elastic resource you spin up only when a query runs. That breaks the old constraint completely — you can afford to land raw, because storing it costs almost nothing, and you only pay for compute when you actually transform.
Why ELT won: raw is replayable
The deepest reason ELT became the default is not cost — it is replayability. In ETL, the raw source data is thrown away after transformation; the warehouse holds only the transformed output. So when (not if) you discover a bug in the transform — a wrong join key, a timezone error, a currency that should have been converted — you cannot fix history. The raw input is gone. You have to re-extract from the source, and the source may have changed, rate-limited you, or no longer hold the old rows.
In ELT, the raw data lives permanently in the warehouse. A transform bug is a one-line SQL fix plus a re-run against data you already have. No re-extract. This is the heart of the medallion architecture: a bronze layer of raw, append-only ingested data; a silver layer of cleaned, conformed, deduplicated tables; a gold layer of business-ready aggregates and marts. Each layer is rebuildable from the one below it, and bronze is rebuildable from the source only if you must. The contract is: never mutate bronze, always transform forward.
| Dimension | ETL (transform before load) | ELT (transform in-warehouse) |
|---|---|---|
| Raw fidelity | Lost — only transformed output is kept | Preserved — bronze layer is replayable |
| Fix a transform bug | Re-extract from source (may be gone) | Edit SQL, re-run on existing raw |
| Where compute is billed | Separate engine (your own cluster) | The warehouse — every transform is on the bill |
| PII / cost control before load | Strong — strip/mask before it lands | Weak — raw (incl. PII) lands first |
| Schema discipline | Schema-on-write (enforced at load) | Schema-on-read (you enforce in silver) |
The cost moves onto the warehouse bill
ELT does not make transformation free — it moves the cost from a CapEx cluster you owned to a metered OpEx line on the warehouse bill, and that line is brutally sensitive to how you write SQL. The Hook is the canonical failure: a model materialized as a full table rebuild, scheduled often, scanning everything every run. On Snowflake you pay per warehouse-second; on BigQuery you pay per byte scanned. A SELECT * over an unpartitioned 2 TB table, run hourly, is a four-figure monthly line item by itself.
The fix is incremental models. Instead of rebuilding the whole table, you process only the new or changed rows since the last run. In dbt this is the is_incremental() macro wrapping a filter:
{% if is_incremental() %}
where event_time >= (select max(event_time) from {{ this }})
{% endif %}On the first run the table is built fully; on every run after, only rows newer than the current max are touched. A nightly job that scanned 2 TB now scans the day’s delta — often a few GB — cutting both runtime and the metered bytes by one to two orders of magnitude. The senior discipline: incremental by default, full refresh only when logic changes (dbt run --full-refresh), and a per-team warehouse so a heavy transform never starves the BI dashboards sharing compute. Teams that skip auto-suspend on idle warehouses bleed credits for compute that is doing nothing.
Why this works
“Schema-on-read” sounds like freedom but it is a deferred bill. ETL’s schema-on-write rejects a malformed row at load time — you find out immediately. ELT happily lands anything into bronze; the contract violation surfaces later, in silver, often as a silent NULL or a wrong join. ELT does not remove the schema work — it moves it downstream and makes it your job to enforce in tests, not the loader’s.
Idempotency: the retry that doubles your data
The failure that wakes people up is duplicate data. EL tools like Fivetran and Airbyte, and your own loaders, retry on transient failure — that is correct behavior. But if a load is not idempotent, a retry that re-runs a partially-succeeded batch inserts the same rows twice. Now your revenue total is inflated and nobody trusts the dashboard.
The cure is to make loads and incremental transforms idempotent: running them twice produces the same result as running them once. In dbt incremental models this is the merge strategy with a unique_key:
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='event_id') }}On each row, merge updates if the unique_key already exists and inserts if it does not — an upsert. A re-run of the same batch updates rows in place instead of duplicating them. The trap: an incremental model with no unique_key and no is_incremental() filter silently appends its full output on every run, so it both duplicates and re-scans everything — the worst of both worlds. Microbatch models go further, treating each time-window batch as an atomic, independently-replaceable unit, so a failed batch reruns cleanly without touching its neighbors.
A regulated fintech ingests payment events containing card PANs and PII. Compliance forbids raw cardholder data from ever sitting in the analytics warehouse. Pick the pattern.
You find a timezone bug in a transform that has been running for six months. Under ELT with a medallion architecture, what's the fast fix?
An incremental dbt model is configured with no unique_key and no is_incremental() filter. What happens on every scheduled run?
Order the stages of a modern ELT pipeline from source to dashboard:
- 1 Extract + Load: EL tool (Fivetran/Airbyte) copies raw source data into the warehouse
- 2 Bronze: raw, append-only landed data, never mutated
- 3 Silver: cleaned, deduplicated, conformed tables (dbt models, schema enforced here)
- 4 Gold: business-ready aggregates and marts
- 5 BI / dashboards read from gold
- 01Explain why the industry flipped from ETL to ELT, and what you gave up in the trade.
- 02A retry duplicated rows in a fact table and inflated revenue. What design would have prevented it, and why?
ETL and ELT differ only in where the Transform runs, and that one swap decides cost, replayability, and data fidelity. ETL transforms in a separate engine before load and keeps only the polished output — strong for stripping PII and controlling cost before data lands, but raw is discarded, so a transform bug means re-extracting from a source that may be gone. ELT loads raw into the warehouse first and transforms in SQL (dbt), which the cloud’s separation of storage and compute made cheap to do. Its decisive advantage is replayability via the medallion pattern: immutable bronze raw, cleaned silver, business-ready gold, each rebuildable from below. The price is that transform compute now meters on the warehouse bill — so you go incremental (process only the delta, not a full rebuild) and idempotent (merge on a unique_key) or you get exploding spend and duplicate rows on retry. Choose ELT by default; reach for ETL when a hard rule says raw PII must never touch the warehouse.