Data Engineering DATA · 08 · 01

Putting it together: the system breaks at the seams, not the stores

Trace one product through OLTP, warehouse, dbt, MVs, an event log, search, and a vector index. Each store is correct; the system drifts where two correct stores meet — so consistency, freshness, and lineage are end-to-end properties you design.

DATA Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

Already know this unit? Take a 1-minute quick check →

Support escalates a ticket: a customer bought a laptop that the catalog search still shows “in stock,” the AI assistant still recommends it, and the analytics dashboard counts it in this week’s revenue twice. Four teams open four investigations. Postgres is correct — the order is there, the product is soft-deleted. The warehouse is correct — for the snapshot it last refreshed. The search index is correct — for the last CDC event it received. The vector index is correct — for the embeddings it last rebuilt. Every store passes its own unit tests. The bug lives in the gap between them, owned by no one.

One fact, seven copies

In ten minutes you’ll understand exactly where drift hides and how to close each gap — because the failure pattern repeats at every seam in every mature data platform.

By the time a product exists in a mature system, the same row lives in seven places. Postgres is the system of record (OLTP): the transaction that took the money is the truth. CDC streams that truth into a columnar warehouse on Parquet/Iceberg. dbt transforms it through bronze (raw landed), silver (cleaned, conformed), and gold (business aggregates). Materialized views and roll-ups serve dashboards without re-scanning billions of rows. An event-sourced audit stream records every state change for compliance. A full-text index (Elasticsearch/OpenSearch) powers catalog search. A vector index powers semantic search and RAG. Seven stores, one fact: “is this product available, and at what price?”

Each unit you studied earlier in this track optimised its store for one job — and that optimisation is exactly what makes the copies diverge. The warehouse is built for throughput, so it batches; batching means lag. The search index is built for query latency, so it denormalises; denormalisation means it can’t see a foreign-key change until something tells it. The vector index is expensive to rebuild, so embeddings are recomputed on a schedule, not per-write. None of these are bugs. They are the correct design of each store. The system-level bug is that nobody designed the contract between them.

The seam is the dual-write problem in disguise

Look at the most common origin of drift: your service writes the order to Postgres, then publishes “OrderPlaced” to Kafka so the search index and analytics can react. Two writes, two systems, no shared transaction. If the process dies after the Postgres commit but before the publish, Postgres has the order and the rest of the system never hears about it. This is the dual-write problem, and it is the canonical seam failure — Confluent’s writeup is blunt that “there’s no way to have a single atomic transaction” across a database and a message broker.

The senior fix is the transactional outbox: in the same DB transaction that writes the order, insert a row into an outbox table. The commit is atomic — order and event live or die together. A separate relay (or a CDC connector like Debezium reading the WAL) ships outbox rows to Kafka with at-least-once delivery. You traded the dual-write for a single write plus an idempotent consumer, and idempotency is non-negotiable: at-least-once means the same event will arrive twice, so every downstream store must dedupe on an event id or it will double-count revenue exactly like the Hook.

Seam (store → store)	How it drifts	The contract that fixes it
OLTP → Kafka → search index	A CDC event is lost; deleted product still searchable	Transactional outbox + idempotent consumer + periodic full reconcile
OLTP → warehouse → MV → dashboard	Refresh lag: MV serves a number the source already changed	A declared freshness SLA + freshness checks that fail the build
OLTP → vector index (RAG)	Embeddings rebuilt nightly; returns an hour-old deleted product	Filter retrieval against live OLTP ids; tombstone on delete
event stream → gold aggregate	Out-of-order / replayed events double-count	Event id + version; dedupe and order by sequence, not arrival

Freshness and lineage are properties of the path, not the store

When you’re debugging a number that disagrees with a live count, the first question to ask is not “which one is wrong?” but “as of when is each one right?” Ask “is the dashboard correct?” and the only honest answer is “as of when?” The gold MV might be perfect for the snapshot it refreshed at 02:00, but if a refund posted at 09:00, the number is correct and wrong — correct for its snapshot, wrong for the question the executive is actually asking. dbt formalises this with source freshness: you declare a warn_after and error_after window on each source, and the freshness job fails the pipeline when data is older than your SLA. dbt’s own rule of thumb is to run the freshness check at least twice as often as your tightest SLA, so you detect a stalled feed before the dashboard lies for an hour.

Lineage is the other end-to-end property. When the number is wrong, you don’t debug the dashboard — you walk the path backwards: gold → silver → bronze → CDC offset → OLTP. This is why the medallion split (bronze/silver/gold) and an immutable table format matter. Iceberg snapshots make every table modification a complete versioned point-in-time, so you can run a time-travel query at the exact snapshot the dashboard read and prove whether the warehouse was stale or the transform was wrong. A delay in bronze cascades to silver and gold, so a single stalled batch shows up as “yesterday’s revenue” three layers downstream — and only lineage lets you find which layer stalled.

▸Why this works

The reconciliation job is the unglamorous part nobody budgets for and everybody needs. CDC will eventually lose or duplicate an event — a connector restart, a WAL slot that filled and got dropped, a poison message skipped. The fix is not “make CDC perfect”; it’s a periodic full re-sync that recomputes a checksum (row count + a hash of key columns) per partition in OLTP and in the index, then repairs the diff. Treat live CDC as the fast path and reconciliation as the safety net — both, always.

You design the contract, not just the stores

The shift from mid-level to senior here is whose problem the seam is. A mid-level engineer owns a store and makes it correct. A senior owns the data contract: the schema, the event id and version, the delivery guarantee, the freshness SLA, and the reconciliation that closes the loop. The contract is what lets seven independently-correct stores compose into one system that agrees with itself.

Concretely, the contract answers four questions for every seam: What is the canonical schema and who can change it (a breaking column rename in OLTP silently nulls a downstream column)? What is the delivery guarantee, and therefore must consumers be idempotent (at-least-once → yes)? What is the freshness SLA, and what fails when it’s missed? And what reconciles drift, how often, and who gets paged? Skip any one and you get the Hook: four correct stores, one incorrect system, no owner.

Pick the best fit

Your service must update Postgres and then make the change visible in the search index and the warehouse. Pick the integration design.

Quiz

A product is soft-deleted in Postgres at 14:00. At 14:30 the RAG assistant still recommends it. The vector index rebuilds embeddings nightly. What's the senior fix?

Quiz

The revenue dashboard disagrees with a manual SQL count against Postgres. Both are 'correct.' What is the most likely cause?

Order the steps

A dashboard number is wrong. Order the lineage walk a senior does to localise the drift:

1 Ask 'as of when?' — establish the snapshot/refresh time the dashboard read
2 Check the gold MV / aggregate refresh history against the freshness SLA
3 Walk up the medallion: is silver stale because bronze landed late?
4 Check the CDC offset / outbox lag between OLTP and bronze
5 Compare a checksum (row count + key hash) of OLTP vs the warehouse partition to confirm where they diverge

The same fact flows left to right through ingestion, the raw lake, the ELT transform, and the warehouse — then fans out to the serving stores. Each arrow is a seam where two correct stores can disagree.

Recall before you leave

01
Walk a teammate through why the search index shows a product that Postgres deleted, even though both stores pass their own tests, and how you'd close the seam.
02
Explain why 'is the dashboard correct?' is the wrong question, and what freshness and lineage have to do with the answer.

Recap

By the time a product is live, one fact lives in seven stores: OLTP as system of record, a columnar warehouse fed by CDC, dbt’s bronze/silver/gold layers, materialized views for dashboards, an event-sourced audit stream, a full-text search index, and a vector index for RAG. Each store is optimised correctly for its own job — and that optimisation is exactly what makes the copies drift. The system breaks not inside any store but at the seam between two correct ones: a lost CDC event leaves a deleted product searchable, refresh lag makes an MV disagree with its OLTP source, a nightly-rebuilt vector index recommends an hour-old deletion. The senior move is to stop owning stores and start owning contracts: a transactional outbox to kill the dual-write, idempotent consumers for at-least-once delivery, a declared freshness SLA that fails the build, lineage through versioned medallion layers so you can walk drift backwards, and a periodic reconciliation that checksums OLTP against each copy and repairs the diff. Consistency, freshness, and lineage are end-to-end properties — you design them at the seams, or the system disagrees with itself while every part passes its tests. Now when you see a support ticket where every team’s store passes its own tests, you know the bug lives at a seam — and you know which four questions the contract must answer to close it.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Idempotent ETL PipelinePipelines don't fail gracefully — they fail at 3 a.m., halfway through a load, and someone re-runs them. This project teaches the one property that separates a hobby script from production data engineering: a run you can repeat any number of times and still land exactly one copy of each row. You'll build batch ingestion, an idempotent load, a watermark for incremental pulls, and the data-quality gates that stop bad data before it poisons everything downstream.