
Queues, Streams, Eventing

Change data capture: ship CDC and survive a stalled slot

Crux: Hands-on project. Stand up Debezium CDC on Postgres, keep a downstream view fresh, then deliberately stall the slot and prove your monitoring and safety cap save the primary.
◷ 240 min

Reading about a slot filling a disk is not the same as watching it happen and stopping it. Stand up real CDC, keep a downstream view fresh from it, then break the consumer on purpose and prove your alerts and safety cap catch it before the primary goes read-only.

Goal

Turn the unit’s mental model into a reproducible CDC pipeline plus an incident drill: capture every change with Debezium, build an idempotent consumer, monitor slot lag, and demonstrate that a stalled consumer is contained by your cap instead of taking down Postgres.

Project
Objective

Stand up log-based CDC from a Postgres orders table to a downstream consumer that maintains a derived view. Then run a stalled-slot incident drill and prove that your monitoring and the max_slot_wal_keep_size cap protect the primary, with evidence at every step.
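One way to start is to write the connector registration down as code so the drill is reproducible. The sketch below builds a payload for the Kafka Connect REST API (POST /connectors) using Debezium Postgres connector properties. The connector name, host, credentials, database, slot, and publication names are all assumptions for illustration; `topic.prefix` is the Debezium 2.x property name, and older releases use `database.server.name` instead.

```python
import json

# Server-side prerequisites, assumed to be applied separately:
#   postgresql.conf: wal_level = logical
#   postgresql.conf: max_slot_wal_keep_size = '2GB'   (the safety cap; size is an assumption)
#   SQL:             ALTER TABLE public.orders REPLICA IDENTITY FULL;


def build_connector_payload(slot_name="orders_cdc_slot"):
    """Build the JSON body for registering a Debezium Postgres source connector."""
    return {
        "name": "orders-cdc",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",        # Postgres built-in logical decoding plugin
            "database.hostname": "postgres",  # assumed host
            "database.port": "5432",
            "database.user": "cdc_user",      # assumed role with replication rights
            "database.password": "change-me",
            "database.dbname": "shop",        # assumed database name
            "topic.prefix": "shop",           # Debezium 2.x naming
            "table.include.list": "public.orders",
            "slot.name": slot_name,           # the slot you will stall later
            "publication.name": "orders_pub",
            "snapshot.mode": "initial",       # snapshot first, then stream
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_connector_payload(), indent=2))
```

Keeping the slot name as a parameter makes it easy to register a throwaway slot for the drill and a separate one for the recovery run.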

Requirements
Acceptance criteria
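  • A before/after demo: a row inserted, updated, and deleted on the primary appears correctly in the downstream view within seconds, including the full pre-delete row (which requires REPLICA IDENTITY FULL on the table; the default identity carries only the primary-key columns in the delete event).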
  • Replaying the same change event twice leaves the downstream view identical — proving the consumer is idempotent, not just lucky.
  • A monitoring panel or log showing retained-WAL bytes climbing during the stall and the alert firing before disk pressure — measured, not assumed.
  • Evidence that max_slot_wal_keep_size invalidated the stalled slot before disk-full, plus a short write-up of how you recovered (re-snapshot) and why the cap was worth the re-snapshot cost.
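The idempotency criterion above can be sketched as a keyed upsert guarded by the source LSN. This is a minimal in-memory version, assuming events shaped like the Debezium envelope (`op`, `before`, `after`, `source.lsn`), a primary key column named `id`, and that each change to a row carries a distinct, increasing LSN. A real consumer would persist both the view and the last-applied LSN in the same transaction.

```python
def apply_event(view, applied_lsn, event):
    """Apply one change event to the view; return False if it was skipped.

    view:        dict mapping primary key -> row dict (the derived view)
    applied_lsn: dict mapping primary key -> LSN of the last applied change
    """
    op = event["op"]  # "c" create, "u" update, "d" delete, "r" snapshot read
    lsn = event["source"]["lsn"]
    # Deletes carry the row in "before"; every other op carries it in "after".
    row = event["before"] if op == "d" else event["after"]
    key = row["id"]  # assumed primary key column

    # At-least-once delivery: drop anything at or behind what we already applied.
    if lsn <= applied_lsn.get(key, -1):
        return False

    if op == "d":
        view.pop(key, None)
    else:
        view[key] = dict(row)
    applied_lsn[key] = lsn
    return True


if __name__ == "__main__":
    view, applied = {}, {}
    insert = {"op": "c", "before": None,
              "after": {"id": 1, "status": "new"}, "source": {"lsn": 100}}
    apply_event(view, applied, insert)
    apply_event(view, applied, insert)  # replay is a no-op
    print(view)
```

Keeping the last-applied LSN even after a delete is a deliberate choice: it stops a replayed insert from resurrecting a row that was already removed.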
Senior stretch
  • Add an on-call runbook: triage steps from the slot-lag panel, how to tell a lagging consumer from a long-running transaction (pg_stat_activity), the drop-vs-wait decision, and the recovery checklist.
  • Route changes through an outbox table instead of capturing the domain table directly, and show the downstream event contract stays stable when you change the orders table schema.
  • Demonstrate per-key ordering vs no global order: capture two related tables and show a cross-table consumer that does NOT assume commit order across them.
  • Swap the source to MySQL (binlog) and compare incremental snapshot vs blocking snapshot on a large table, recording the write-freeze window for each.
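For the runbook item above, the triage step can be drafted as a small decision function before you turn it into prose. This is a heuristic sketch, not a definitive rule: it assumes you feed it the slot's `active` flag, the byte gap to `confirmed_flush_lsn`, the byte gap to `restart_lsn` (both via `pg_wal_lsn_diff` against `pg_current_wal_lsn()`), and the age of the oldest open transaction from `pg_stat_activity.xact_start`. The thresholds and the returned labels are hypothetical.

```python
def triage_slot(active, flush_lag_bytes, retained_bytes, oldest_xact_age_s,
                lag_limit_bytes=512 * 1024 * 1024, xact_limit_s=300):
    """Suggest a likely cause for WAL retention on one logical slot."""
    if not active:
        # Nobody is connected to the slot: the connector is down or stuck.
        return "consumer_down"
    if flush_lag_bytes >= lag_limit_bytes:
        # The consumer is connected but is not confirming changes fast enough.
        return "consumer_lagging"
    if retained_bytes >= lag_limit_bytes:
        # The consumer keeps up, yet restart_lsn is held back. An old open
        # transaction is a common reason, so check pg_stat_activity next.
        if oldest_xact_age_s >= xact_limit_s:
            return "long_transaction"
        return "investigate"
    return "healthy"


if __name__ == "__main__":
    print(triage_slot(active=True, flush_lag_bytes=0,
                      retained_bytes=2 * 1024**3, oldest_xact_age_s=3600))
```

The drop-vs-wait decision then hangs off the label: a down consumer with the cap approaching argues for dropping the slot and re-snapshotting, while a long transaction argues for ending that transaction first.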
Recap

This is the discipline you will run in every real CDC rollout: capture with a slot, bootstrap with snapshot-then-stream, make the consumer idempotent because delivery is at-least-once, set the right REPLICA IDENTITY for deletes, and — most important — treat the slot as a loaded gun by monitoring its lag and capping it with max_slot_wal_keep_size. Driving a slot into a stall on a toy system and watching your cap save the primary turns the 3am page into muscle memory.
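The monitoring half of that discipline can be sketched as one query plus a classifier. The query reads `pg_replication_slots`; the `wal_status` and `safe_wal_size` columns are available in PostgreSQL 13 and later. The severity labels and the warning threshold are assumptions, and running the query (for example through a Postgres driver of your choice) is left out so the sketch stays self-contained.

```python
# Query for the slot-lag panel; retained_bytes is WAL held back by each slot.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       wal_status,
       safe_wal_size,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


def classify_slot(row, warn_bytes):
    """Map one result row (as a dict) to a hypothetical alert severity."""
    status = row["wal_status"]
    if status == "lost":
        # The cap fired: the slot is invalidated and needs a re-snapshot.
        return "invalidated"
    if status == "unreserved":
        # Required WAL may be removed at the next checkpoint.
        return "critical"
    retained = row["retained_bytes"]  # can be NULL/None for some slots
    if retained is not None and retained >= warn_bytes:
        return "warn"
    return "ok"


if __name__ == "__main__":
    sample = {"slot_name": "orders_cdc_slot", "active": False,
              "wal_status": "extended", "safe_wal_size": 100_000_000,
              "retained_bytes": 1_500_000_000}
    print(classify_slot(sample, warn_bytes=1_000_000_000))
```

Set the warning threshold well below max_slot_wal_keep_size so the alert fires while there is still time to fix the consumer instead of paying for a re-snapshot.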


Trademarks belong to their respective owners. Editorial reference only.