Change data capture: ship CDC and survive a stalled slot

Hands-on project — stand up Debezium CDC on Postgres, keep a downstream view fresh, then deliberately stall the slot and prove your monitoring and safety cap save the primary.

Level: Senior · 240 min

Reading about a slot filling a disk is not the same as watching it happen and stopping it. Stand up real CDC, keep a downstream view fresh from it, then break the consumer on purpose and prove your alerts and safety cap catch it before the primary goes read-only.

Goal

Turn the unit’s mental model into a reproducible CDC pipeline plus an incident drill: capture every change with Debezium, build an idempotent consumer, monitor slot lag, and demonstrate that a stalled consumer is contained by your cap instead of taking down Postgres.
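The slot-lag monitoring named in the goal can be sketched as follows. This is a minimal sketch, not a finished exporter: the SQL string uses the standard `pg_replication_slots` view and `pg_wal_lsn_diff` (the `wal_status` column assumes PostgreSQL 13 or later), while `SlotSample`, `classify`, and the `warn_bytes` threshold are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Query to run on the primary; shown as a string so the logic below stays testable offline.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       wal_status,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the slot forces the primary to keep."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


@dataclass
class SlotSample:  # hypothetical container for one row of SLOT_LAG_SQL
    slot_name: str
    active: bool
    wal_status: str  # 'reserved', 'extended', 'unreserved' or 'lost'
    retained_bytes: Optional[int]  # None once the slot has no restart_lsn


def classify(sample: SlotSample, warn_bytes: int) -> str:
    """Map a sample to an alert level; thresholds are assumptions to tune."""
    if sample.wal_status == "lost":
        return "invalidated"  # the cap fired; the slot needs a re-snapshot
    if sample.wal_status == "unreserved":
        return "page"  # the slot is about to lose the WAL it needs
    if sample.retained_bytes is not None and sample.retained_bytes >= warn_bytes:
        # An inactive slot above the threshold has nobody draining it.
        return "warn" if sample.active else "page"
    return "ok"
```

Setting `warn_bytes` well below `max_slot_wal_keep_size` (for example, half of it) is one way to make the alert fire before the cap does, which is the ordering the drill asks you to demonstrate.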

Project

Objective

Stand up log-based CDC from a Postgres orders table to a downstream consumer that maintains a derived view, then run a stalled-slot incident drill and prove that your monitoring and the max_slot_wal_keep_size cap protect the primary, with evidence at every step.
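One possible starting point for the setup is sketched below: the SQL to prepare the primary and a Debezium Postgres connector definition that could be posted to Kafka Connect's REST API. The property names are standard Debezium settings (`topic.prefix` assumes Debezium 2.x), but every value here is an assumption for a toy environment: the `shop` database, the `debezium` user, the slot and publication names, and the `2GB` cap.

```python
import json

# Statements to run on the primary. wal_level needs a restart;
# max_slot_wal_keep_size takes effect after a configuration reload.
PRIMARY_SETUP_SQL = [
    "ALTER SYSTEM SET wal_level = 'logical';",
    "ALTER SYSTEM SET max_slot_wal_keep_size = '2GB';",  # assumed cap for the drill
    "ALTER TABLE public.orders REPLICA IDENTITY FULL;",  # full pre-delete rows
]


def orders_connector(host: str, password: str,
                     slot_name: str = "orders_cdc_slot") -> dict:
    """Build a connector definition; all names and values are illustrative."""
    return {
        "name": "orders-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": password,
            "database.dbname": "shop",
            "topic.prefix": "shop",
            "table.include.list": "public.orders",
            "slot.name": slot_name,
            "publication.name": "orders_pub",
            "snapshot.mode": "initial",  # snapshot first, then stream
            "heartbeat.interval.ms": "10000",
        },
    }


if __name__ == "__main__":
    # Print the JSON body you would POST to the Kafka Connect /connectors endpoint.
    print(json.dumps(orders_connector("localhost", "change-me"), indent=2))
```

Keeping the slot name explicit makes it easy to find the same slot later in `pg_replication_slots` when you stall the consumer.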

Requirements

Acceptance criteria

A before/after demo: a row inserted, updated, and deleted on the primary appears correctly in the downstream view within seconds, and the delete event carries the full pre-delete row (which requires REPLICA IDENTITY FULL on the table).
Replaying the same change event twice leaves the downstream view identical — proving the consumer is idempotent, not just lucky.
A monitoring panel or log showing retained-WAL bytes climbing during the stall and the alert firing before disk pressure — measured, not assumed.
Evidence that max_slot_wal_keep_size invalidated the stalled slot before the disk filled (for example, wal_status = 'lost' in pg_replication_slots), plus a short write-up of how you recovered (re-snapshot) and why the cap was worth the re-snapshot cost.
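The replay criterion above can be checked with a consumer along these lines. This is a sketch under stated assumptions: events follow the Debezium envelope (`op`, `before`, `after`, `source.lsn`), the table has an `id` primary key, and LSNs increase with every change to a given key. `OrdersView` is a hypothetical in-memory stand-in for your real downstream store.

```python
from typing import Any, Dict


class OrdersView:
    """In-memory derived view keyed by order id (illustrative only)."""

    def __init__(self) -> None:
        self.rows: Dict[int, Dict[str, Any]] = {}
        # Last LSN applied per key; kept after deletes so that a replayed
        # older event cannot resurrect a deleted row.
        self.applied_lsn: Dict[int, int] = {}

    def apply(self, event: Dict[str, Any]) -> bool:
        """Apply one change event; return False if it was a duplicate."""
        op = event["op"]  # 'c' create, 'u' update, 'd' delete, 'r' snapshot read
        lsn = event["source"]["lsn"]
        row = event["before"] if op == "d" else event["after"]
        key = row["id"]

        if lsn <= self.applied_lsn.get(key, -1):
            return False  # already applied: at-least-once redelivery

        if op == "d":
            self.rows.pop(key, None)
        else:
            self.rows[key] = dict(row)  # upsert the latest row image
        self.applied_lsn[key] = lsn
        return True


def make_event(op: str, lsn: int, before=None, after=None) -> Dict[str, Any]:
    """Build a minimal Debezium-shaped event for the demo."""
    return {"op": op, "before": before, "after": after, "source": {"lsn": lsn}}
```

A replay check then applies the same stream twice and compares the view: the second pass should report every event as a duplicate and leave `rows` unchanged. In a real store, the row write and the LSN update should share one transaction so a crash between them cannot break the guard.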

Senior stretch

Add an on-call runbook: triage steps from the slot-lag panel, how to tell a lagging consumer from a long-running transaction (pg_stat_activity), the drop-vs-wait decision, and the recovery checklist.
Route changes through an outbox table instead of capturing the domain table directly, and show the downstream event contract stays stable when you change the orders table schema.
Demonstrate per-key ordering vs no global order: capture two related tables and show a cross-table consumer that does NOT assume commit order across them.
Swap the source to MySQL (binlog) and compare incremental snapshot vs blocking snapshot on a large table, recording the write-freeze window for each.
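For the outbox stretch goal, the stable event contract could look like the sketch below. The outbox column names (`id`, `aggregatetype`, `aggregateid`, `type`, `payload`) follow the defaults expected by Debezium's outbox event router; the order columns (`id`, `status`, `total_cents`) and the `schema_version` field are assumptions made for illustration.

```python
import json
import uuid
from typing import Any, Dict


def order_outbox_row(order: Dict[str, Any], event_type: str) -> Dict[str, str]:
    """Build the outbox row to insert in the same transaction as the order write.

    Only explicitly listed fields enter the payload, so adding or renaming
    columns on the orders table does not leak into the published contract.
    """
    payload = {
        "schema_version": 1,  # bump deliberately when the contract changes
        "order_id": order["id"],
        "status": order["status"],
        "total_cents": order["total_cents"],
    }
    return {
        "id": str(uuid.uuid4()),  # unique event id, usable for deduplication
        "aggregatetype": "order",
        "aggregateid": str(order["id"]),  # per-key ordering follows this value
        "type": event_type,
        "payload": json.dumps(payload, sort_keys=True),
    }
```

To demonstrate stability, build one row from an order with the original columns and one from an order after a schema change (say, an added `coupon_code` column) and confirm the two payloads are byte-for-byte identical.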

Recap

This is the discipline you will run in every real CDC rollout: capture with a slot, bootstrap with snapshot-then-stream, make the consumer idempotent because delivery is at-least-once, and set the right REPLICA IDENTITY for deletes. Most important, treat the slot as a loaded gun: monitor its lag and cap it with max_slot_wal_keep_size. Driving a slot into a stall on a toy system and watching your cap save the primary turns your response to the 3am page into muscle memory.
