Data Engineering
Putting it together: the system breaks at the seams, not the stores
Support escalates a ticket: a customer bought a laptop that the catalog search still shows “in stock,” the AI assistant still recommends it, and the analytics dashboard counts it in this week’s revenue twice. Four teams open four investigations. Postgres is correct — the order is there, the product is soft-deleted. The warehouse is correct — for the snapshot it last refreshed. The search index is correct — for the last CDC event it received. The vector index is correct — for the embeddings it last rebuilt. Every store passes its own unit tests. The bug lives in the gap between them, owned by no one.
One fact, seven copies
By the time a product exists in a mature system, the same row lives in seven places. Postgres is the system of record (OLTP): the transaction that took the money is the truth. CDC streams that truth into a columnar warehouse on Parquet/Iceberg. dbt transforms it through bronze (raw landed), silver (cleaned, conformed), and gold (business aggregates). Materialized views and roll-ups serve dashboards without re-scanning billions of rows. An event-sourced audit stream records every state change for compliance. A full-text index (Elasticsearch/OpenSearch) powers catalog search. A vector index powers semantic search and RAG. Seven stores, one fact: “is this product available, and at what price?”
Each unit you studied earlier in this track optimised its store for one job — and that optimisation is exactly what makes the copies diverge. The warehouse is built for throughput, so it batches; batching means lag. The search index is built for query latency, so it denormalises; denormalisation means it can’t see a foreign-key change until something tells it. The vector index is expensive to rebuild, so embeddings are recomputed on a schedule, not per-write. None of these are bugs. They are the correct design of each store. The system-level bug is that nobody designed the contract between them.
The seam is the dual-write problem in disguise
Look at the most common origin of drift: your service writes the order to Postgres, then publishes “OrderPlaced” to Kafka so the search index and analytics can react. Two writes, two systems, no shared transaction. If the process dies after the Postgres commit but before the publish, Postgres has the order and the rest of the system never hears about it. This is the dual-write problem, and it is the canonical seam failure — Confluent’s writeup is blunt that “there’s no way to have a single atomic transaction” across a database and a message broker.
The senior fix is the transactional outbox: in the same DB transaction that writes the order, insert a row into an outbox table. The commit is atomic — order and event live or die together. A separate relay (or a CDC connector like Debezium reading the WAL) ships outbox rows to Kafka with at-least-once delivery. You traded the dual-write for a single write plus an idempotent consumer, and idempotency is non-negotiable: at-least-once means the same event will arrive twice, so every downstream store must dedupe on an event id or it will double-count revenue exactly like the Hook.
| Seam (store → store) | How it drifts | The contract that fixes it |
|---|---|---|
| OLTP → Kafka → search index | A CDC event is lost; deleted product still searchable | Transactional outbox + idempotent consumer + periodic full reconcile |
| OLTP → warehouse → MV → dashboard | Refresh lag: MV serves a number the source already changed | A declared freshness SLA + freshness checks that fail the build |
| OLTP → vector index (RAG) | Embeddings rebuilt nightly; returns an hour-old deleted product | Filter retrieval against live OLTP ids; tombstone on delete |
| event stream → gold aggregate | Out-of-order / replayed events double-count | Event id + version; dedupe and order by sequence, not arrival |
Freshness and lineage are properties of the path, not the store
Ask “is the dashboard correct?” and the only honest answer is “as of when?” The gold MV might be perfect for the snapshot it refreshed at 02:00, but if a refund posted at 09:00, the number is correct and wrong — correct for its snapshot, wrong for the question the executive is actually asking. dbt formalises this with source freshness: you declare a warn_after and error_after window on each source, and the freshness job fails the pipeline when data is older than your SLA. dbt’s own rule of thumb is to run the freshness check at least twice as often as your tightest SLA, so you detect a stalled feed before the dashboard lies for an hour.
Lineage is the other end-to-end property. When the number is wrong, you don’t debug the dashboard — you walk the path backwards: gold → silver → bronze → CDC offset → OLTP. This is why the medallion split (bronze/silver/gold) and an immutable table format matter. Iceberg snapshots make every table modification a complete versioned point-in-time, so you can run a time-travel query at the exact snapshot the dashboard read and prove whether the warehouse was stale or the transform was wrong. A delay in bronze cascades to silver and gold, so a single stalled batch shows up as “yesterday’s revenue” three layers downstream — and only lineage lets you find which layer stalled.
Why this works
The reconciliation job is the unglamorous part nobody budgets for and everybody needs. CDC will eventually lose or duplicate an event — a connector restart, a WAL slot that filled and got dropped, a poison message skipped. The fix is not “make CDC perfect”; it’s a periodic full re-sync that recomputes a checksum (row count + a hash of key columns) per partition in OLTP and in the index, then repairs the diff. Treat live CDC as the fast path and reconciliation as the safety net — both, always.
You design the contract, not just the stores
The shift from mid-level to senior here is whose problem the seam is. A mid-level engineer owns a store and makes it correct. A senior owns the data contract: the schema, the event id and version, the delivery guarantee, the freshness SLA, and the reconciliation that closes the loop. The contract is what lets seven independently-correct stores compose into one system that agrees with itself.
Concretely, the contract answers four questions for every seam: What is the canonical schema and who can change it (a breaking column rename in OLTP silently nulls a downstream column)? What is the delivery guarantee, and therefore must consumers be idempotent (at-least-once → yes)? What is the freshness SLA, and what fails when it’s missed? And what reconciles drift, how often, and who gets paged? Skip any one and you get the Hook: four correct stores, one incorrect system, no owner.
Your service must update Postgres and then make the change visible in the search index and the warehouse. Pick the integration design.
A product is soft-deleted in Postgres at 14:00. At 14:30 the RAG assistant still recommends it. The vector index rebuilds embeddings nightly. What's the senior fix?
The revenue dashboard disagrees with a manual SQL count against Postgres. Both are 'correct.' What is the most likely cause?
A dashboard number is wrong. Order the lineage walk a senior does to localise the drift:
- 1 Ask 'as of when?' — establish the snapshot/refresh time the dashboard read
- 2 Check the gold MV / aggregate refresh history against the freshness SLA
- 3 Walk up the medallion: is silver stale because bronze landed late?
- 4 Check the CDC offset / outbox lag between OLTP and bronze
- 5 Compare a checksum (row count + key hash) of OLTP vs the warehouse partition to confirm where they diverge
- 01Walk a teammate through why the search index shows a product that Postgres deleted, even though both stores pass their own tests, and how you'd close the seam.
- 02Explain why 'is the dashboard correct?' is the wrong question, and what freshness and lineage have to do with the answer.
By the time a product is live, one fact lives in seven stores: OLTP as system of record, a columnar warehouse fed by CDC, dbt’s bronze/silver/gold layers, materialized views for dashboards, an event-sourced audit stream, a full-text search index, and a vector index for RAG. Each store is optimised correctly for its own job — and that optimisation is exactly what makes the copies drift. The system breaks not inside any store but at the seam between two correct ones: a lost CDC event leaves a deleted product searchable, refresh lag makes an MV disagree with its OLTP source, a nightly-rebuilt vector index recommends an hour-old deletion. The senior move is to stop owning stores and start owning contracts: a transactional outbox to kill the dual-write, idempotent consumers for at-least-once delivery, a declared freshness SLA that fails the build, lineage through versioned medallion layers so you can walk drift backwards, and a periodic reconciliation that checksums OLTP against each copy and repairs the diff. Consistency, freshness, and lineage are end-to-end properties — you design them at the seams, or the system disagrees with itself while every part passes its tests.