Data Engineering DATA · 02 · 01

ELT vs ETL: where the Transform runs, and why the industry flipped

ETL transforms before load in a separate engine; ELT loads raw into the warehouse and transforms in SQL. Cheap columnar compute and split storage/compute flipped the default — and moved the cost onto the warehouse bill.

DATA Junior ◷ 16 min

Level

FoundationsJuniorMiddleSenior

A finance analyst pings you on a Monday: “Why did our Snowflake bill jump 40% last week?” Nothing shipped — except a pipeline change. Someone set a dbt model to full-refresh by default and scheduled it hourly. Every run rescans the entire 2 TB fact table and rebuilds it from scratch. The transform is correct. The bill is the bug. This is the ELT tax nobody warned them about: when the warehouse does the T, the warehouse meters every byte you touch.

By the end of this lesson you’ll understand why the industry flipped from ETL to ELT, what replayability actually buys you, and exactly how to avoid paying for compute you don’t need.

The pipeline split: where does T live

When you’re designing a data pipeline, the first question isn’t which tool to use — it’s where the transform runs. That one decision determines whether a bug costs you a one-line SQL fix or a week of re-extraction from a source that may no longer hold the data you need.

Both patterns move data from sources into a warehouse. The only real difference is the order of the middle letter — and that one swap changes everything downstream.

ETL (extract, transform, load) was born when storage and compute were the same expensive box. You could not afford to dump everything into the warehouse, so you transformed first — in a separate engine (Informatica, a Spark/Hadoop cluster, a Python job) — cleaned, joined, filtered, and aggregated, then loaded only the polished result. The warehouse only ever saw the finished table.

ELT (extract, load, transform) flips the last two steps. You load the raw source data into the warehouse first, untouched, then transform it in place with SQL. The transformation is no longer a separate system — it is queries running on the same engine that stores the data. dbt is the dominant tool for the “T”: version-controlled, tested, documented SQL models instead of a black-box pipeline.

The reason the industry flipped is architectural, not fashion. Snowflake and BigQuery decouple storage from compute. Storage is cheap object storage; compute is a separate, elastic resource you spin up only when a query runs. That breaks the old constraint completely — you can afford to land raw, because storing it costs almost nothing, and you only pay for compute when you actually transform.

Why ELT won: raw is replayable

The deepest reason ELT became the default is not cost — it is replayability. In ETL, the raw source data is thrown away after transformation; the warehouse holds only the transformed output. So when (not if) you discover a bug in the transform — a wrong join key, a timezone error, a currency that should have been converted — you cannot fix history. The raw input is gone. You have to re-extract from the source, and the source may have changed, rate-limited you, or no longer hold the old rows.

In ELT, the raw data lives permanently in the warehouse. A transform bug is a one-line SQL fix plus a re-run against data you already have. No re-extract. This is the heart of the medallion architecture: a bronze layer of raw, append-only ingested data; a silver layer of cleaned, conformed, deduplicated tables; a gold layer of business-ready aggregates and marts. Each layer is rebuildable from the one below it, and bronze is rebuildable from the source only if you must. The contract is: never mutate bronze, always transform forward.

Dimension	ETL (transform before load)	ELT (transform in-warehouse)
Raw fidelity	Lost — only transformed output is kept	Preserved — bronze layer is replayable
Fix a transform bug	Re-extract from source (may be gone)	Edit SQL, re-run on existing raw
Where compute is billed	Separate engine (your own cluster)	The warehouse — every transform is on the bill
PII / cost control before load	Strong — strip/mask before it lands	Weak — raw (incl. PII) lands first
Schema discipline	Schema-on-write (enforced at load)	Schema-on-read (you enforce in silver)

ETL (top) transforms before the load, so the warehouse only ever sees the finished table. ELT (bottom) loads raw first, then transforms in place inside the warehouse — the Transform box moves to the right of Load.

The cost moves onto the warehouse bill

ELT does not make transformation free — it moves the cost from a CapEx cluster you owned to a metered OpEx line on the warehouse bill, and that line is brutally sensitive to how you write SQL. The Hook is the canonical failure: a model materialized as a full table rebuild, scheduled often, scanning everything every run. On Snowflake you pay per warehouse-second; on BigQuery you pay per byte scanned. A SELECT * over an unpartitioned 2 TB table, run hourly, is a four-figure monthly line item by itself.

The fix is incremental models. Instead of rebuilding the whole table, you process only the new or changed rows since the last run. In dbt this is the is_incremental() macro wrapping a filter:

{% if is_incremental() %}
  where event_time >= (select max(event_time) from {{ this }})
{% endif %}

On the first run the table is built fully; on every run after, only rows newer than the current max are touched. A nightly job that scanned 2 TB now scans the day’s delta — often a few GB — cutting both runtime and the metered bytes by one to two orders of magnitude. When you’re setting up a new dbt model, the default posture is: incremental by default, full refresh only when logic changes (dbt run --full-refresh), and a per-team warehouse so a heavy transform never starves the BI dashboards sharing compute. Teams that skip auto-suspend on idle warehouses bleed credits for compute that is doing nothing.

▸Why this works

“Schema-on-read” sounds like freedom but it is a deferred bill. ETL’s schema-on-write rejects a malformed row at load time — you find out immediately. ELT happily lands anything into bronze; the contract violation surfaces later, in silver, often as a silent NULL or a wrong join. ELT does not remove the schema work — it moves it downstream and makes it your job to enforce in tests, not the loader’s.

Idempotency: the retry that doubles your data

The failure that wakes people up is duplicate data. EL tools like Fivetran and Airbyte, and your own loaders, retry on transient failure — that is correct behavior. But if a load is not idempotent, a retry that re-runs a partially-succeeded batch inserts the same rows twice. Now your revenue total is inflated and nobody trusts the dashboard.

The cure is to make loads and incremental transforms idempotent: running them twice produces the same result as running them once. In dbt incremental models this is the merge strategy with a unique_key:

{{ config(materialized='incremental', incremental_strategy='merge', unique_key='event_id') }}

On each row, merge updates if the unique_key already exists and inserts if it does not — an upsert. A re-run of the same batch updates rows in place instead of duplicating them. The trap: an incremental model with no unique_key and no is_incremental() filter silently appends its full output on every run, so it both duplicates and re-scans everything — the worst of both worlds. Microbatch models go further, treating each time-window batch as an atomic, independently-replaceable unit, so a failed batch reruns cleanly without touching its neighbors.

Pick the best fit

A regulated fintech ingests payment events containing card PANs and PII. Compliance forbids raw cardholder data from ever sitting in the analytics warehouse. Pick the pattern.

Quiz

You find a timezone bug in a transform that has been running for six months. Under ELT with a medallion architecture, what's the fast fix?

Quiz

An incremental dbt model is configured with no unique_key and no is_incremental() filter. What happens on every scheduled run?

Order the steps

Order the stages of a modern ELT pipeline from source to dashboard:

1 Extract + Load: EL tool (Fivetran/Airbyte) copies raw source data into the warehouse
2 Bronze: raw, append-only landed data, never mutated
3 Silver: cleaned, deduplicated, conformed tables (dbt models, schema enforced here)
4 Gold: business-ready aggregates and marts
5 BI / dashboards read from gold

Together, these five stages form the medallion contract: raw is preserved in bronze so every layer above it is rebuildable. Without that order — specifically without landing raw before you transform — you lose the ability to replay history when logic changes, which is the most costly thing to lose.

Recall before you leave

01
Explain why the industry flipped from ETL to ELT, and what you gave up in the trade.
02
A retry duplicated rows in a fact table and inflated revenue. What design would have prevented it, and why?

Recap

ETL and ELT differ only in where the Transform runs, and that one swap decides cost, replayability, and data fidelity. ETL transforms in a separate engine before load and keeps only the polished output — strong for stripping PII and controlling cost before data lands, but raw is discarded, so a transform bug means re-extracting from a source that may be gone. ELT loads raw into the warehouse first and transforms in SQL (dbt), which the cloud’s separation of storage and compute made cheap to do. Its decisive advantage is replayability via the medallion pattern: immutable bronze raw, cleaned silver, business-ready gold, each rebuildable from below. The price is that transform compute now meters on the warehouse bill — so you go incremental (process only the delta, not a full rebuild) and idempotent (merge on a unique_key) or you get exploding spend and duplicate rows on retry. Choose ELT by default; reach for ETL when a hard rule says raw PII must never touch the warehouse. Now when you see a Snowflake bill spike after a pipeline change, your first question is: is this model a full rebuild running on a schedule — and if so, what does the incremental filter look like?

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Idempotent ETL PipelinePipelines don't fail gracefully — they fail at 3 a.m., halfway through a load, and someone re-runs them. This project teaches the one property that separates a hobby script from production data engineering: a run you can repeat any number of times and still land exactly one copy of each row. You'll build batch ingestion, an idempotent load, a watermark for incremental pulls, and the data-quality gates that stop bad data before it poisons everything downstream.