Data Engineering DATA · 01 · 10

OLTP vs OLAP: split the workload

Hands-on project — reproduce the analytics-on-OLTP outage, then split the workload into a columnar store fed by CDC/ETL and prove the fix with before/after numbers.

DATA Senior ◷ 240 min

Level

FoundationsJuniorMiddleSenior

Reading about the replica-analytics outage is not the same as causing it and then fixing it. Stand up a row-store OLTP database, drive an analytical scan into it until checkout degrades, then split the workload into a columnar store fed by CDC/ETL — and prove every step with measurements.

Goal

Turn the unit’s mental model into a reproducible engineering loop: reproduce the contention, measure it, stand up the right OLAP layout, wire a feed, and verify both that analytics got faster and that the transactional path stopped suffering.

Project

0 of 7

Objective

Reproduce the analytics-on-OLTP failure on a row store, then move analytics to a separate columnar store fed by CDC or ETL — proving the analytical query got dramatically faster AND that the transactional workload is no longer harmed, with before/after numbers under identical load.

Requirements

Acceptance criteria

A before/after table: analytical query time, bytes/pages scanned, OLTP p99, buffer-cache hit ratio, and replication lag — all measured under the same load, not estimated.
The columnar store's plan shows it reads only the referenced columns (pruning), and the analytical query is at least 10× faster than the row-store scan.
With analytics moved off the OLTP instance, transactional p99 and cache hit ratio return to baseline while the analytical query runs — the contention is gone.
A one-paragraph write-up: why the index did not help (selectivity vs scan volume), which column-store property gave most of the speedup, and the data-latency cost you accepted for isolation.

Senior stretch

Add a compression comparison: report on-disk size of the table in the row store vs the columnar store, and break down which columns compressed most and why (low cardinality, same type).
Make the CDC feed near-real-time and measure end-to-end lag from an OLTP commit to its visibility in the columnar store; chart lag under a write spike.
Add a guardrail that prevents analytical queries from ever hitting the OLTP instance (a separate role/endpoint, statement timeout, or a query router) and show it blocks the destructive scan.
Stand up an HTAP option (e.g. a column replica or a HTAP engine) for the same workload and compare it against the two-store split on analytical latency, OLTP p99 under load, data freshness, and operational cost.

Recap

This is the loop you will run in every real data-architecture decision: reproduce the contention so you understand it, prove an index cannot fix a full-table aggregate, stand up the right columnar layout (denormalized, low-cardinality, pruned), feed it with CDC or ETL, and verify with before/after numbers that analytics got faster and the transactional path stopped suffering — accepting a bounded data-latency window for isolated storage, caches, and failure domains. Doing it once on a toy makes the production version muscle memory.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.