Data Engineering DATA · 02 · 09

ELT vs ETL: model and config reading

Read real dbt model configs and a Snowflake bill signature, predict the pipeline behaviour — idempotency, incremental load, backfill — and pick the highest-leverage fix.

DATA Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Pipeline bugs live in the model config and the SQL, not the diagram. Read each dbt model and trace, predict what it does on the second and third scheduled run, then pick the fix a senior data engineer reaches for first.

Goal

Practise the loop you run on every pipeline review: read the materialization config and the incremental filter, predict whether a re-run duplicates, rescans, or upserts cleanly, and reach for the highest-leverage fix.

Snippet 1 — the silent append

-- models/marts/fct_events.sql
{{ config(materialized='incremental') }}

select
  event_id,
  user_id,
  event_time,
  amount
from {{ ref('stg_events') }}

Quiz

This model is materialized 'incremental' but has no unique_key and no is_incremental() block. What happens on every scheduled run after the first, and what is the fix?

Snippet 2 — the idempotent merge

-- models/marts/fct_orders.sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

select order_id, customer_id, status, updated_at, total
from {{ ref('stg_orders') }}
{% if is_incremental() %}
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Quiz

An order's status changes from 'pending' to 'shipped' (same order_id, newer updated_at). The EL tool then crashes mid-run and the orchestrator retries the whole batch. What is the final state of fct_orders?

Snippet 3 — the full-refresh bill

# Snowflake query history (one model, scheduled hourly)
fct_pageviews   MEDIUM_WH   bytes_scanned=2.1 TB   elapsed=11m   24 runs/day

{{ config(materialized='table') }}   -- not incremental
select * from {{ ref('stg_pageviews') }}   -- 2 TB, unpartitioned, hourly

Quiz

This single model scans 2.1 TB twenty-four times a day. What is the root cost driver, and what is the highest-leverage fix — not the most tempting one?

Snippet 4 — the backfill

-- A 90-day backfill is needed after fixing a transform bug.
-- The model uses microbatch:
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_time',
    batch_size='day',
    begin='2024-01-01'
) }}

dbt run --select fct_events --event-time-start "2024-01-01" --event-time-end "2024-04-01"

Quiz

Why is a microbatch model the right tool for this 90-day backfill, compared with a single full-refresh over the same range?

Recap

Every pipeline review reads the config and the filter: an incremental model with no unique_key and no is_incremental() silently appends and rescans everything; merge on a unique_key makes a retry idempotent (upsert, not duplicate); a full-table SELECT * scanned hourly is a cost bug fixed by going incremental, not by a bigger warehouse; and microbatch turns a long backfill into independent, restartable, idempotent daily units. Read the materialization first, then the filter, then decide.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.