Data Engineering
OLTP vs OLAP: split the workload
Reading about the replica-analytics outage is not the same as causing it and then fixing it. Stand up a row-store OLTP database, drive an analytical scan into it until checkout degrades, then split the workload into a columnar store fed by CDC/ETL — and prove every step with measurements.
Turn the unit’s mental model into a reproducible engineering loop: reproduce the contention, measure it, stand up the right OLAP layout, wire a feed, and verify both that analytics got faster and that the transactional path stopped suffering.
Reproduce the analytics-on-OLTP failure on a row store, then move analytics to a separate columnar store fed by CDC or ETL — proving the analytical query got dramatically faster AND that the transactional workload is no longer harmed, with before/after numbers under identical load.
- A before/after table: analytical query time, bytes/pages scanned, OLTP p99, buffer-cache hit ratio, and replication lag — all measured under the same load, not estimated.
- The columnar store's plan shows it reads only the referenced columns (pruning), and the analytical query is at least 10× faster than the row-store scan.
- With analytics moved off the OLTP instance, transactional p99 and cache hit ratio return to baseline while the analytical query runs — the contention is gone.
- A one-paragraph write-up: why the index did not help (selectivity vs scan volume), which column-store property gave most of the speedup, and the data-latency cost you accepted for isolation.
- Add a compression comparison: report on-disk size of the table in the row store vs the columnar store, and break down which columns compressed most and why (low cardinality, same type).
- Make the CDC feed near-real-time and measure end-to-end lag from an OLTP commit to its visibility in the columnar store; chart lag under a write spike.
- Add a guardrail that prevents analytical queries from ever hitting the OLTP instance (a separate role/endpoint, statement timeout, or a query router) and show it blocks the destructive scan.
- Stand up an HTAP option (e.g. a column replica or a HTAP engine) for the same workload and compare it against the two-store split on analytical latency, OLTP p99 under load, data freshness, and operational cost.
This is the loop you will run in every real data-architecture decision: reproduce the contention so you understand it, prove an index cannot fix a full-table aggregate, stand up the right columnar layout (denormalized, low-cardinality, pruned), feed it with CDC or ETL, and verify with before/after numbers that analytics got faster and the transactional path stopped suffering — accepting a bounded data-latency window for isolated storage, caches, and failure domains. Doing it once on a toy makes the production version muscle memory.