Data Engineering DATA · 08 · 09

Data platform: code and query reading

Read real SQL, a dbt model, a Parquet predicate, and a vector-retrieval snippet across the track, predict the behaviour, and pick the highest-leverage fix.

DATA Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Data-platform bugs live in queries, transforms, and integration code — not in slides. Read each snippet across the track, predict what it does to correctness or cost, then pick the fix a senior engineer makes first.

Goal

Practise the loop you run on a real platform: read the query or transform, decide whether it fits the store it runs against, and reach for the highest-leverage fix — the right layout, the right delivery contract — before adding hardware.

Snippet 1 — the analytics query on OLTP

-- runs hourly against the live Postgres orders table
SELECT country, date_trunc('day', created_at) AS d, sum(amount)
FROM orders
WHERE created_at >= now() - interval '90 days'
GROUP BY 1, 2;

Quiz

This 90-day aggregate runs hourly on the live OLTP database. What is wrong with the placement, and the fix?

Snippet 2 — the incremental dbt model

{{ config(materialized='incremental', unique_key='order_id') }}
SELECT order_id, customer_id, amount, status, updated_at
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT max(updated_at) FROM {{ this }})
{% endif %}

Quiz

An order is hard-deleted in the source. After the next incremental run, what does this gold model show, and what is the fix?

Snippet 3 — the Parquet predicate

# Iceberg table partitioned by event_date, with country as a regular column
df = spark.read.table("events")
result = (df
    .filter("upper(country) = 'US'")          # function wraps the column
    .filter("event_date = '2026-05-01'")
    .groupBy("country").count())

Quiz

The event_date filter prunes well but the country filter scans far more than expected. Why, and what's the fix?

Snippet 4 — the vector retrieval

def recommend(query: str, k: int = 10):
    qvec = embed(query)
    hits = vector_index.search(qvec, ef_search=20, top_k=k)
    return [h.product for h in hits]   # embeddings rebuilt nightly

Quiz

Two issues hide here: results sometimes recommend a product deleted hours ago, and recall feels low. What are the senior fixes?

Recap

Every platform bug is read in a query, a transform, or an integration snippet: an OLAP aggregate stranded on an OLTP store, an incremental model that can’t see deletes, a Parquet predicate wrapped in a function that defeats pruning, and a vector search that’s both stale and recall-starved. The fix is almost never more hardware — it’s the right store for the access pattern, deletes propagated as CDC tombstones, sargable predicates that let footer stats prune, and a live OLTP filter plus tuned ef_search for retrieval. Read the code, place the workload, then verify.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.