Data Engineering
ELT vs ETL: build a replayable, idempotent pipeline
Reading about idempotency and incremental models is not the same as watching a retry double your revenue and then fixing it for good. Build a small ELT pipeline, drive it into the two canonical failures — the exploding full-refresh bill and the duplicate-on-retry — and engineer them out with evidence at every step.
Turn the unit’s mental model into a working pipeline: land raw into a medallion layout, transform forward in SQL, make the load incremental and idempotent, prove a re-run does not duplicate, and run a restartable backfill — measuring cost before and after.
Build a small ELT pipeline (dbt + a warehouse such as Snowflake, BigQuery, or DuckDB locally) that ingests an event or order stream into a medallion layout, then prove three properties with measurements: incremental loads scan only the delta, the load is idempotent under retry, and a 90-day backfill is restartable per batch.
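The idempotency property is the easiest of the three to see in miniature. The sketch below is not dbt; it uses Python's built-in sqlite3 as a stand-in warehouse, and the table names, the order_id key, and the three sample rows are all hypothetical. It assumes an SQLite build recent enough to support upsert (3.24 or later). The same batch is loaded twice into an append-only table and into a table that merges on a unique key, and the row count and SUM(amount) are compared after each load.

```python
import sqlite3

# Hypothetical batch of orders; order_id plays the role of dbt's unique_key.
BATCH = [(1, 10.0), (2, 25.5), (3, 7.25)]


def load(conn, table, rows, merge):
    """Load rows into `table`: upsert on order_id when merge=True, else append."""
    if merge:
        sql = (
            f"INSERT INTO {table} (order_id, amount) VALUES (?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount"
        )
    else:
        sql = f"INSERT INTO {table} (order_id, amount) VALUES (?, ?)"
    conn.executemany(sql, rows)
    conn.commit()


def totals(conn, table):
    """Return (row count, SUM(amount)) for the table."""
    return conn.execute(f"SELECT COUNT(*), SUM(amount) FROM {table}").fetchone()


conn = sqlite3.connect(":memory:")
# Broken target: no key, so a retry appends the same rows again.
conn.execute("CREATE TABLE orders_append (order_id INTEGER, amount REAL)")
# Correct target: the primary key lets a retry update in place.
conn.execute("CREATE TABLE orders_merge (order_id INTEGER PRIMARY KEY, amount REAL)")

results = {}
for table, merge in [("orders_append", False), ("orders_merge", True)]:
    load(conn, table, BATCH, merge)
    first = totals(conn, table)
    load(conn, table, BATCH, merge)  # simulated retry of the same batch
    second = totals(conn, table)
    results[table] = (first, second)
    print(table, "first:", first, "after retry:", second)
```

In the append table the retry doubles both the row count and the total, which is the duplicate-on-retry failure; in the merge table both figures are unchanged. The dbt equivalent would be an incremental model with a unique_key, and the same two queries serve as the proof.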
- A before/after table for the incremental conversion: bytes scanned (or rows processed) and run time per run, measured on the same data rather than estimated. The delta run should scan one to two orders of magnitude less than the full rebuild; for example, a one-day delta against 90 days of evenly distributed history is roughly a 90x reduction.
- A reproducible idempotency proof: the same load run twice yields identical row counts and identical aggregate totals (e.g. SUM(amount)), with the merge config shown.
- A documented contrast of the broken config (no unique_key / no is_incremental()) vs the correct one, with the duplicate-row evidence from the broken run.
- A backfill log showing per-batch runs, a deliberately failed batch, and a clean resume from that batch only — proving the backfill is restartable and bounded, not all-or-nothing.
- A one-paragraph write-up: where the Transform runs in your pipeline, why bronze stays immutable, and which property (incremental vs idempotent) defends against which failure (cost blow-up vs duplicate data).
- Add a schema-on-read failure: land a malformed/extra-field row into bronze, show it passes load silently, and have a dbt test in silver catch it — making the deferred schema bill concrete.
- Add a PII column to the source and implement both controls: an ETL-style pre-load mask (raw never lands) and an ELT-style in-warehouse mask, then write up which one satisfies a 'raw PII must never sit in the warehouse' rule and why.
- Wire a cost panel: track warehouse credits / bytes scanned per model over a week and alert if any model's scan grows more than 2x run-over-run — the canary that catches a full-refresh regression before the bill does.
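- Add per-team warehouse isolation and auto-suspend on idle (this needs a platform with separately sized compute, such as Snowflake virtual warehouses, rather than local DuckDB), then show that a heavy backfill on one warehouse does not slow queries on the BI warehouse and that idle compute stops billing.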
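The 2x run-over-run canary from the cost-panel item above reduces to a small comparison once per-model scan figures are collected. The sketch below assumes those figures have already been pulled from the warehouse's query history into a plain dictionary; the function name, the model names, and the byte counts are all hypothetical, and how you fetch the numbers depends on your warehouse.

```python
from typing import Dict, List


def scan_regressions(history: Dict[str, List[int]], factor: float = 2.0) -> List[str]:
    """Return models whose latest bytes scanned exceed `factor` times the previous run."""
    flagged = []
    for model, scans in history.items():
        if len(scans) < 2:
            continue  # need at least two runs to compare
        prev, latest = scans[-2], scans[-1]
        if prev > 0 and latest > factor * prev:
            flagged.append(model)
    return sorted(flagged)


# Hypothetical bytes scanned per run, oldest first.
history = {
    "silver_orders": [120_000_000, 118_000_000, 121_000_000],
    # The last run looks like an accidental full refresh.
    "gold_revenue": [40_000_000, 41_000_000, 3_900_000_000],
}

alerts = scan_regressions(history)
print(alerts)
```

Comparing only the last two runs keeps the check simple but makes it sensitive to one-off spikes; comparing against a trailing median is a reasonable alternative if the alert proves noisy.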
This is the loop you run when building any real ELT pipeline: land raw into immutable bronze; transform forward to silver and gold; go incremental so each run scans only the delta rather than fully rebuilding on the metered warehouse; make the load idempotent with merge on a unique_key so retries upsert instead of duplicating; and structure backfills as independent, restartable batches. Doing it once on a small dataset, and watching the broken configs fail, makes the production version muscle memory.
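The restartable backfill can be sketched the same way. The code below is a minimal illustration, not a dbt feature: run_backfill, daily_batches, the in-memory done set, and the dates are all hypothetical. It splits 90 days into daily batches, fails one batch on purpose, and then resumes; completed batches are skipped and only the failed batch onward is run.

```python
from datetime import date, timedelta


def daily_batches(start, days):
    """One batch per calendar day, starting at `start`."""
    return [start + timedelta(days=i) for i in range(days)]


def run_backfill(batches, process, done):
    """Run each batch not already in `done`; stop at the first failure.

    Returns the failed batch, or None if every remaining batch succeeded.
    """
    for batch in batches:
        if batch in done:
            continue  # already completed in an earlier run
        try:
            process(batch)
        except Exception as exc:
            print(f"FAILED {batch}: {exc}")
            return batch
        done.add(batch)  # checkpoint only after success
    return None


batches = daily_batches(date(2024, 1, 1), 90)
done = set()        # in a real pipeline this would be a persisted checkpoint table
processed = []      # log of successful batch runs, in order
fail_once = {date(2024, 2, 15)}  # deliberately failed batch


def process(batch):
    if batch in fail_once:
        fail_once.discard(batch)
        raise RuntimeError("simulated warehouse timeout")
    processed.append(batch)


failed = run_backfill(batches, process, done)   # stops at 2024-02-15
resumed = run_backfill(batches, process, done)  # resumes from that batch only
print("failed:", failed, "resumed cleanly:", resumed is None, "runs:", len(processed))
```

Because a batch is checkpointed only after it succeeds, a batch that fails halfway will be run again in full, so the per-batch load still has to be idempotent for the resume to be safe.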