Crux Read real SQL, a dbt model, a Parquet predicate, and a vector-retrieval snippet across the track, predict the behaviour, and pick the highest-leverage fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
Data-platform bugs live in queries, transforms, and integration code — not in slides. Read each snippet across the track, predict what it does to correctness or cost, then pick the fix a senior engineer makes first.
Goal
Practise the loop you run on a real platform: read the query or transform, decide whether it fits the store it runs against, and reach for the highest-leverage fix — the right layout, the right delivery contract — before adding hardware.
Snippet 1 — the analytics query on OLTP
-- runs hourly against the live Postgres orders tableSELECT country, date_trunc('day', created_at) AS d, sum(amount)FROM ordersWHERE created_at >= now() - interval '90 days'GROUP BY 1, 2;
Quiz
Completed
This 90-day aggregate runs hourly on the live OLTP database. What is wrong with the placement, and the fix?
Heads-up An index helps select a date range but the query still aggregates millions of rows in a row-oriented layout, and it still loads the transactional store. The access pattern is OLAP; the store should be columnar, not just better-indexed.
Heads-up A 90-day group-by over a live OLTP table is a heavy scan that contends with the transactional path — the classic reason the workload split exists. Cheap aggregation requires a columnar layout that reads only the columns touched.
Heads-up A replica isolates load but keeps the row-oriented layout, so the scan is still slow — it reads every column of every row. Isolation is half the answer; the other half is a columnar store built for scans.
Snippet 2 — the incremental dbt model
{{ config(materialized='incremental', unique_key='order_id') }}SELECT order_id, customer_id, amount, status, updated_atFROM {{ source('raw', 'orders') }}{% if is_incremental() %}WHERE updated_at > (SELECT max(updated_at) FROM {{ this }}){% endif %}
Quiz
Completed
An order is hard-deleted in the source. After the next incremental run, what does this gold model show, and what is the fix?
Heads-up Incremental models append/merge only changed rows from the source filter; they never re-scan the full source, so they cannot notice an absence. A delete is invisible to a high-water-mark filter unless it arrives as an explicit change event.
Heads-up unique_key controls how a matched incoming row is merged (upsert) — it does nothing when no row arrives at all. A hard delete produces no incoming row, so there's nothing to match and remove.
Heads-up Full-refresh does reconcile deletes but throwing away incrementality on every run defeats the cost savings. The senior fix is to propagate deletes as CDC tombstones and merge them, keeping incremental for the common case.
Snippet 3 — the Parquet predicate
# Iceberg table partitioned by event_date, with country as a regular columndf = spark.read.table("events")result = (df .filter("upper(country) = 'US'") # function wraps the column .filter("event_date = '2026-05-01'") .groupBy("country").count())
Quiz
Completed
The event_date filter prunes well but the country filter scans far more than expected. Why, and what's the fix?
Heads-up Parquet prunes non-partition columns too, using per-row-group min/max footer statistics. The problem here is that wrapping the column in upper() hides it from those stats, not that pruning is limited to partitions.
Heads-up Filter order doesn't change pruning — the optimiser pushes both predicates down. The defeat is the function wrapper on country, which makes its predicate non-sargable against the footer stats.
Heads-up Over-partitioning by a high-cardinality column creates the small-files problem and won't help while upper() still wraps the predicate. First make the predicate prunable; only then consider partitioning country if cardinality is sane.
Snippet 4 — the vector retrieval
def recommend(query: str, k: int = 10): qvec = embed(query) hits = vector_index.search(qvec, ef_search=20, top_k=k) return [h.product for h in hits] # embeddings rebuilt nightly
Quiz
Completed
Two issues hide here: results sometimes recommend a product deleted hours ago, and recall feels low. What are the senior fixes?
Heads-up Re-embedding the full catalog every minute is wildly expensive and still leaves a window. The cheap, correct fix is a liveness filter against OLTP at query time — let the system of record answer 'does this still exist?'
Heads-up ANN trades recall for speed via parameters like ef_search; a too-small search list returns the wrong ten, not an unavoidable limit. Widening ef_search (or HNSW M/ef_construction) recovers recall at a latency cost you choose.
Heads-up Returning more rows neither removes deleted products (still no liveness check) nor improves the quality of candidates if ef_search stays at 20 — you just get more of the same approximate set. Fix recall with ef_search and staleness with a live filter.
Recap
Every platform bug is read in a query, a transform, or an integration snippet: an OLAP aggregate stranded on an OLTP store, an incremental model that can’t see deletes, a Parquet predicate wrapped in a function that defeats pruning, and a vector search that’s both stale and recall-starved. The fix is almost never more hardware — it’s the right store for the access pattern, deletes propagated as CDC tombstones, sargable predicates that let footer stats prune, and a live OLTP filter plus tuned ef_search for retrieval. Read the code, place the workload, then verify.