
ELT vs ETL: build a replayable, idempotent pipeline

Hands-on project — build a small ELT pipeline with a medallion layout, make it incremental and idempotent, prove a retry does not duplicate, and run a restartable backfill with before/after cost numbers.

Level: Senior · Estimated time: 240 min

Reading about idempotency and incremental models is not the same as watching a retry double-count your revenue and then fixing it for good. Build a small ELT pipeline, drive it into the two canonical failures (the exploding full-refresh bill and the duplicate-on-retry), and engineer them out with evidence at every step.

Goal

Turn the unit’s mental model into a working pipeline: land raw into a medallion layout, transform forward in SQL, make the load incremental and idempotent, prove a re-run does not duplicate, and run a restartable backfill — measuring cost before and after.

Project


Objective

Build a small ELT pipeline (dbt + a warehouse such as Snowflake, BigQuery, or DuckDB locally) that ingests an event or order stream into a medallion layout, then prove three properties with measurements: incremental loads scan only the delta, the load is idempotent under retry, and a 90-day backfill is restartable per batch.

Requirements

Acceptance criteria

A before/after table for the incremental conversion: bytes scanned (or rows processed) and run time per run, measured on the same dataset, not estimated. The delta run should scan one to two orders of magnitude less than the full rebuild.
A reproducible idempotency proof: the same load run twice yields identical row counts and identical aggregate totals (e.g. SUM(amount)), with the merge config shown.
A documented contrast of the broken config (no unique_key / no is_incremental()) vs the correct one, with the duplicate-row evidence from the broken run.
A backfill log showing per-batch runs, a deliberately failed batch, and a clean resume from that batch only — proving the backfill is restartable and bounded, not all-or-nothing.
A one-paragraph write-up: where the Transform runs in your pipeline, why bronze stays immutable, and which property (incremental vs idempotent) defends against which failure (cost blow-up vs duplicate data).
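The idempotency proof and the broken-versus-correct contrast from the criteria above can be rehearsed in a few lines. This sketch again assumes sqlite3 as a stand-in warehouse; the tables orders_append and orders_merge, the sample BATCH, and the helper names are hypothetical. A plain INSERT plays the role of the broken config with no unique key, and INSERT OR REPLACE on a primary key approximates what a merge on unique_key does in a warehouse.

```python
import sqlite3

# Hypothetical sample batch: (order_id, amount).
BATCH = [(1, 10.0), (2, 25.5), (3, 4.5)]


def make_tables(conn):
    # Broken target: no key, so every load appends.
    conn.execute("CREATE TABLE orders_append (order_id INTEGER, amount REAL)")
    # Correct target: order_id is the unique key the load upserts on.
    conn.execute(
        "CREATE TABLE orders_merge (order_id INTEGER PRIMARY KEY, amount REAL)"
    )


def load_append(conn, batch):
    conn.executemany("INSERT INTO orders_append VALUES (?, ?)", batch)


def load_merge(conn, batch):
    # Upsert: a row with an existing order_id replaces the old row.
    conn.executemany("INSERT OR REPLACE INTO orders_merge VALUES (?, ?)", batch)


def snapshot(conn, table):
    # Row count plus an aggregate total, the two numbers the proof compares.
    return conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}"
    ).fetchone()


def run_proof():
    conn = sqlite3.connect(":memory:")
    make_tables(conn)
    results = {}
    for table, loader in (("orders_append", load_append), ("orders_merge", load_merge)):
        loader(conn, BATCH)
        first = snapshot(conn, table)
        loader(conn, BATCH)  # simulated retry of the exact same batch
        second = snapshot(conn, table)
        results[table] = (first, second)
    return results
```

run_proof() returns the (row count, total) pair after the first load and after the retry for each table. For orders_merge the two pairs are identical; for orders_append the retry doubles both, which is the duplicate-row evidence the criteria ask for. In your actual submission the same comparison should come from queries against the dbt-built tables.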

Senior stretch

Add a schema-on-read failure: land a malformed/extra-field row into bronze, show it passes load silently, and have a dbt test in silver catch it — making the deferred schema bill concrete.
Add a PII column to the source and implement both controls: an ETL-style pre-load mask (raw never lands) and an ELT-style in-warehouse mask, then write up which one satisfies a 'raw PII must never sit in the warehouse' rule and why.
Wire a cost panel: track warehouse credits / bytes scanned per model over a week and alert if any model's scan grows more than 2x run-over-run — the canary that catches a full-refresh regression before the bill does.
Add per-team warehouse isolation and auto-suspend on idle, then show that a heavy backfill on one warehouse does not slow queries on the BI warehouse and that idle compute stops billing.
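For the cost-panel stretch item, the alert rule itself is small. The sketch below assumes you have already exported per-model scan figures into a dictionary of lists, oldest run first; that input shape, the function name scan_regressions, and the model names in the usage note are hypothetical, and how you obtain the numbers depends on your warehouse's usage views.

```python
def scan_regressions(history, threshold=2.0):
    """Return (model, previous, current) for every run that scanned more than
    `threshold` times what the run before it scanned.

    history: dict mapping model name -> list of bytes scanned per run,
    ordered oldest to newest (a hypothetical export format).
    """
    alerts = []
    for model, scans in history.items():
        # Compare each run with the one immediately before it.
        for previous, current in zip(scans, scans[1:]):
            if previous > 0 and current > threshold * previous:
                alerts.append((model, previous, current))
    return alerts
```

For example, scan_regressions({"silver_orders": [100, 110, 120], "gold_revenue": [100, 105, 900]}) flags only gold_revenue, whose last run jumped from 105 to 900. A jump like that is the typical signature of an incremental model silently falling back to a full rebuild.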

Recap

This is the loop you run building any real ELT pipeline: land raw into immutable bronze, transform forward to silver and gold, go incremental so each run scans only the delta (not a full rebuild on the metered warehouse), make the load idempotent with merge on a unique_key so retries upsert instead of duplicate, and structure backfills as independent restartable batches. Doing it once on a small dataset — and watching the broken configs fail — makes the production version muscle memory.
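The restartable-backfill idea can be sketched independently of any orchestrator: split the range into daily batches, record each batch as done only after it succeeds, and skip recorded batches on resume. Everything named here (daily_batches, run_backfill, the in-memory done set, the start date, the simulated failure) is hypothetical; a real run would persist the checkpoint in a table or file and would call your dbt or warehouse job inside process.

```python
from datetime import date, timedelta


def daily_batches(start, days):
    """One batch per calendar day, starting at `start`."""
    return [start + timedelta(days=i) for i in range(days)]


def run_backfill(batches, process, done, log):
    """Run every batch not yet in `done`; stop at the first failure.

    Returns True when all batches are complete. `done` is the checkpoint
    (in memory here; persist it in a real pipeline) and `log` collects
    one (batch, status) entry per attempted batch.
    """
    for batch in batches:
        if batch in done:
            continue  # finished in an earlier run, do not redo it
        try:
            process(batch)
        except Exception as exc:
            log.append((batch, f"failed: {exc}"))
            return False  # the checkpoint still holds everything before this batch
        done.add(batch)
        log.append((batch, "ok"))
    return True


def demo():
    batches = daily_batches(date(2024, 1, 1), 90)  # hypothetical 90-day range
    processed = []
    fail_once = {batches[40]}  # deliberately fail day 41 on the first attempt

    def process(batch):
        if batch in fail_once:
            fail_once.discard(batch)
            raise RuntimeError("simulated warehouse timeout")
        processed.append(batch)

    done, log = set(), []
    first_ok = run_backfill(batches, process, done, log)
    done_after_failure = len(done)
    second_ok = run_backfill(batches, process, done, log)
    return first_ok, done_after_failure, second_ok, len(processed), len(log)
```

In demo() the first run stops at the failed batch with 40 batches checkpointed, and the second run resumes from that batch without touching the earlier ones. This only holds up if process is itself idempotent for a single batch, because a batch that fails halfway will be run again from the start.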
