Crux Read real query plans, DDL, and pipeline config across the OLTP vs OLAP unit, predict scan cost on row vs column storage, and pick the highest-leverage fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
The query plan and the DDL are where the workload split shows itself. Read each snippet, predict how it behaves on row vs column storage, and choose the fix a senior makes first.
Goal
Practise the loop you run in every data-architecture review: read the query, predict the bytes scanned, and reach for the layout or pipeline change — not a band-aid index — before the analytical workload poisons the transactional one.
Snippet 1 — the EXPLAIN on the row store
EXPLAIN (ANALYZE, BUFFERS)SELECT country, SUM(total)FROM orders -- 40M rows, 50 columns, row-store PostgresWHERE created_at >= now() - interval '90 days'GROUP BY country;Seq Scan on orders (cost=0.00..2.1M rows=38_000_000) Buffers: shared read=4_900_000 -- ~38 GB pulled through the cachePlanning Time: 0.2 msExecution Time: 31_482 ms
Quiz
Completed
The planner chose a Seq Scan and read ~38 GB even though the query only needs country and total. What does this tell you, and what is the right fix?
Heads-up A 90-day window over all countries still matches most rows, so the planner stays on a Seq Scan; a covering index would also store most of the table and add write cost. Indexes fix selectivity, not scan volume.
Heads-up Buffers: shared read counts 8KB pages actually read from the heap. ~4.9M pages is ~38 GB of real I/O dragged through shared_buffers — exactly the cache-evicting scan to avoid on a prod box.
Heads-up work_mem helps the aggregation step, not the scan: the engine still reads every byte of every matched row to reach two columns. The scan volume is set by the row layout, not the hash memory.
Snippet 2 — the schema split
-- OLTP: heavily normalized for cheap, consistent single-row writesCREATE TABLE orders ( id bigint PRIMARY KEY, user_id bigint REFERENCES users(id), country_id smallint REFERENCES countries(id), total_cents bigint, created_at timestamptz);-- OLAP target: denormalized wide fact, country folded in as a low-cardinality columnCREATE TABLE orders_fact ( order_id UInt64, user_id UInt64, country LowCardinality(String), -- dictionary-encoded total_cents Int64, created_at DateTime) ENGINE = MergeTree ORDER BY (created_at, country);
Quiz
Completed
Why is the OLTP table normalized with a country_id FK while the OLAP fact denormalizes country into a LowCardinality(String) column?
Heads-up On the analytical side, joining a dimension on every multi-million-row scan is the cost you denormalize away; integrity is enforced upstream in OLTP. A repeated low-cardinality string dictionary-encodes to near-free, so the FK buys nothing here.
Heads-up Denormalizing OLTP multiplies write cost and re-introduces update anomalies — the opposite of what a write-heavy transactional table needs. Each layout matches its workload; you do not make OLTP look like OLAP.
Heads-up LowCardinality dictionary-encodes the column so repeated values become small ints — that is a large compression and scan win on a column repeated millions of times, not cosmetic.
Snippet 3 — the pipeline config
# Debezium connector: stream Postgres WAL changes into Kafka, land in the warehouseconnector.class: io.debezium.connector.postgresql.PostgresConnectorplugin.name: pgoutputtable.include.list: public.orders,public.users,public.countriessnapshot.mode: initial # backfill once, then stream changes# downstream sink batches into orders_fact every 30s
Quiz
Completed
This CDC pipeline streams the WAL into a warehouse with a 30s sink batch. What does it buy, and what is the tradeoff you must accept?
Heads-up WAL capture, transport, and a 30s sink batch each add latency; the warehouse trails the primary by a small window. The point is isolation with bounded lag, not zero lag.
Heads-up CDC only moves data; it does nothing to scan speed. The speed comes from the columnar target (orders_fact / MergeTree) that prunes and compresses. CDC is the feed, the layout is the win.
Heads-up The whole point is that analytics runs on a separate columnar store, not on any Postgres replica. Running it on a replica is the failure mode CDC exists to avoid.
Snippet 4 — the write-amplification trap
-- An engineer 'fixes' slow dashboards by writing orders straight into ClickHouse,-- one INSERT per order, from the checkout hot path:INSERT INTO orders_fact (order_id, user_id, country, total_cents, created_at)VALUES (918273, 4412, 'DE', 4999, now()); -- called ~5000×/s, one row each
Quiz
Completed
Why does pushing single-row INSERTs from the checkout hot path straight into the column store backfire?
Heads-up Column stores are slow at single-row writes: each touches every column file and spawns small parts to merge. Putting that on the checkout hot path is exactly the wrong layout for a write.
Heads-up Dedup is a separate concern. The core issue is write shape: single-row inserts are pathological for a column store regardless of keys; batching is the fix.
Heads-up Indexes do not make single-row column-store writes cheap — the write still touches every column file. The fix is to batch writes (or keep them in OLTP and CDC them over), not to index.
Recap
Read the artifacts and the layout tells you the cost: a Seq Scan reading tens of GB on a row store means a full-table aggregate that no index can fix — the fix is a column store that prunes. Normalized OLTP and denormalized low-cardinality OLAP facts are opposite-by-design, not a bug. CDC buys isolation at the price of bounded lag, and CDC alone does not make scans fast — the columnar target does. And single-row writes into a column store are pathological: keep writes in the row store, batch them into the column store. Diagnose from the plan and the layout, fix the layout, then verify the bytes scanned dropped.