Crux Read real dbt model configs and a Snowflake bill signature, predict the pipeline behaviour — idempotency, incremental load, backfill — and pick the highest-leverage fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
Pipeline bugs live in the model config and the SQL, not the diagram. Read each dbt model and trace, predict what it does on the second and third scheduled run, then pick the fix a senior data engineer reaches for first.
Goal
Practise the loop you run on every pipeline review: read the materialization config and the incremental filter, predict whether a re-run duplicates, rescans, or upserts cleanly, and reach for the highest-leverage fix.
This model is materialized 'incremental' but has no unique_key and no is_incremental() block. What happens on every scheduled run after the first, and what is the fix?
Heads-up Upsert needs unique_key with the merge strategy. With neither set, dbt has no key to match on and no filter to limit the scan, so it just appends the full result.
Heads-up Incremental materialization alone does nothing — the row-limiting filter must be inside an is_incremental() block. Without it there is no WHERE clause, so dbt re-selects everything.
Heads-up dbt runs happily — that is the danger. It silently appends, so you discover the bug as duplicate revenue and a rising bill, not as an error.
Snippet 2 — the idempotent merge
-- models/marts/fct_orders.sql{{ config( materialized='incremental', incremental_strategy='merge', unique_key='order_id') }}select order_id, customer_id, status, updated_at, totalfrom {{ ref('stg_orders') }}{% if is_incremental() %} where updated_at > (select max(updated_at) from {{ this }}){% endif %}
Quiz
Completed
An order's status changes from 'pending' to 'shipped' (same order_id, newer updated_at). The EL tool then crashes mid-run and the orchestrator retries the whole batch. What is the final state of fct_orders?
Heads-up That is what an append would do. Merge matches on unique_key='order_id', so it updates the existing row rather than inserting a second one. The retry is safe.
Heads-up No duplicate row is created, so a downstream SUM sees the order once. The merge plus unique_key is precisely what makes the load idempotent and the totals correct.
Heads-up merge updates when the key exists and inserts when it does not — an upsert. The newer updated_at row matches order_id and overwrites the status to 'shipped'.
{{ config(materialized='table') }} -- not incrementalselect * from {{ ref('stg_pageviews') }} -- 2 TB, unpartitioned, hourly
Quiz
Completed
This single model scans 2.1 TB twenty-four times a day. What is the root cost driver, and what is the highest-leverage fix — not the most tempting one?
Heads-up A bigger warehouse finishes the same 2 TB scan faster but bills more credits per second — total cost barely moves. The leverage is scanning less data, not scanning the same data faster.
Heads-up That trims frequency but each run still rescans 2 TB, and you sacrifice freshness. Incremental keeps hourly freshness while scanning only the delta — strictly better than throttling the schedule.
Heads-up Partitioning helps only if the query filters on the partition key. A SELECT * with no WHERE reads every partition regardless. You first need an incremental filter to exploit partitioning.
Snippet 4 — the backfill
-- A 90-day backfill is needed after fixing a transform bug.-- The model uses microbatch:{{ config( materialized='incremental', incremental_strategy='microbatch', event_time='event_time', batch_size='day', begin='2024-01-01') }}
dbt run --select fct_events --event-time-start "2024-01-01" --event-time-end "2024-04-01"
Quiz
Completed
Why is a microbatch model the right tool for this 90-day backfill, compared with a single full-refresh over the same range?
Heads-up They differ fundamentally: full-refresh rebuilds the whole table as one transaction, so a failure at day 80 loses all prior work. Microbatch makes each day an independent, replaceable unit you can retry in isolation.
Heads-up Microbatch still runs SQL on the metered warehouse — it does not bypass compute cost. Its value is restartability and bounded blast radius per batch, not free compute.
Heads-up One giant transaction over 90 days is the opposite of safe at scale: it is long-running, can hit timeouts, and any failure discards all progress. Per-day atomic batches localise failure and let you resume from the failed day.
Recap
Every pipeline review reads the config and the filter: an incremental model with no unique_key and no is_incremental() silently appends and rescans everything; merge on a unique_key makes a retry idempotent (upsert, not duplicate); a full-table SELECT * scanned hourly is a cost bug fixed by going incremental, not by a bigger warehouse; and microbatch turns a long backfill into independent, restartable, idempotent daily units. Read the materialization first, then the filter, then decide.