Queues, Streams, Eventing
Change data capture: ship CDC and survive a stalled slot
Reading about a slot filling a disk is not the same as watching it happen and stopping it. Stand up real CDC, keep a downstream view fresh from it, then break the consumer on purpose and prove your alerts and safety cap catch it before the primary goes read-only.
Turn the unit’s mental model into a reproducible CDC pipeline plus an incident drill: capture every change with Debezium, build an idempotent consumer, monitor slot lag, and demonstrate that a stalled consumer is contained by your cap instead of taking down Postgres.
Stand up log-based CDC from a Postgres orders table to a downstream consumer that maintains a derived view, then run a stalled-slot incident drill and prove that your monitoring and your max_slot_wal_keep_size cap protect the primary, with evidence at every step.
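As a starting point, the capture side could be registered with Kafka Connect using a payload like the one built below. This is a minimal sketch, not a required layout: the connector name, host, credentials, database name, slot name, and publication name are all placeholder assumptions, and `topic.prefix` assumes Debezium 2.x (older releases used `database.server.name`).

```python
import json


def build_connector_config(slot_name: str = "orders_cdc_slot") -> dict:
    """Build a Kafka Connect registration payload for a Debezium Postgres source.

    Every concrete value here (names, host, credentials) is a placeholder
    assumption for a local lab setup.
    """
    return {
        "name": "orders-cdc",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",            # Postgres built-in logical decoding plugin
            "database.hostname": "localhost",     # assumption: local lab database
            "database.port": "5432",
            "database.user": "cdc_user",          # assumption: role with REPLICATION privilege
            "database.password": "change-me",
            "database.dbname": "shop",            # hypothetical database name
            "topic.prefix": "shop",               # Debezium 2.x property name
            "table.include.list": "public.orders",
            "slot.name": slot_name,               # the replication slot the drill will stall
            "publication.name": "orders_pub",     # hypothetical publication name
            "snapshot.mode": "initial",           # snapshot first, then stream
        },
    }


if __name__ == "__main__":
    # Print the JSON body you would POST to the Kafka Connect REST API.
    print(json.dumps(build_connector_config(), indent=2))
```

Keeping the slot name as a parameter makes it easy to point the monitoring queries and the drill at the same slot the connector created.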
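- A before/after demo: a row inserted, updated, and deleted on the primary appears correctly in the downstream view within seconds, including the full pre-delete row (which requires REPLICA IDENTITY FULL on the table; the default identity carries only the key columns).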
- Replaying the same change event twice leaves the downstream view identical — proving the consumer is idempotent, not just lucky.
- A monitoring panel or log showing retained-WAL bytes climbing during the stall and the alert firing before disk pressure — measured, not assumed.
- Evidence that max_slot_wal_keep_size invalidated the stalled slot before disk-full, plus a short write-up of how you recovered (re-snapshot) and why the cap was worth the re-snapshot cost.
- Add an on-call runbook: triage steps from the slot-lag panel, how to tell a lagging consumer from a long-running transaction (pg_stat_activity), the drop-vs-wait decision, and the recovery checklist.
- Route changes through an outbox table instead of capturing the domain table directly, and show the downstream event contract stays stable when you change the orders table schema.
- Demonstrate per-key ordering vs no global order: capture two related tables and show a cross-table consumer that does NOT assume commit order across them.
- Swap the source to MySQL (binlog) and compare incremental snapshot vs blocking snapshot on a large table, recording the write-freeze window for each.
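The idempotency requirement in the list above can be sketched as a consumer that upserts by primary key and skips any event whose log position it has already applied for that key. This is one possible approach, under stated assumptions: events are already unwrapped to a Debezium-style envelope with `op`, `before`, `after`, and `source.lsn`; the primary key column is named `id`; and `lsn` increases with every change to a given key. The class name `OrdersView` and the sample events are hypothetical.

```python
from typing import Any, Dict


class OrdersView:
    """In-memory derived view that tolerates at-least-once delivery."""

    def __init__(self) -> None:
        self.rows: Dict[int, Dict[str, Any]] = {}
        # Highest LSN applied per key. Kept even after a delete so that a
        # replayed older event cannot resurrect the row.
        self.applied_lsn: Dict[int, int] = {}

    def apply(self, event: Dict[str, Any]) -> bool:
        """Apply one change event. Returns False if it was a duplicate."""
        op = event["op"]  # "c" create, "u" update, "d" delete, "r" snapshot read
        lsn = event["source"]["lsn"]
        # Deletes carry the row in "before"; every other op carries it in "after".
        image = event["before"] if op == "d" else event["after"]
        key = image["id"]
        if lsn <= self.applied_lsn.get(key, -1):
            return False  # already applied: replay is a no-op
        if op == "d":
            self.rows.pop(key, None)
        else:
            self.rows[key] = dict(image)  # upsert the full row image
        self.applied_lsn[key] = lsn
        return True


# Hypothetical sample events for a single order.
EVENTS = [
    {"op": "c", "before": None,
     "after": {"id": 1, "status": "new"}, "source": {"lsn": 100}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}, "source": {"lsn": 110}},
    {"op": "d", "before": {"id": 1, "status": "paid"},
     "after": None, "source": {"lsn": 120}},
]

if __name__ == "__main__":
    view = OrdersView()
    for e in EVENTS + EVENTS:  # deliver everything twice
        view.apply(e)
    print(view.rows)
```

In a real consumer the row write and the applied-LSN write would need to land in the same transaction of the downstream store; otherwise a crash between the two can reintroduce the duplicate the guard is meant to stop.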
This is the discipline you will run in every real CDC rollout: capture with a slot, bootstrap with snapshot-then-stream, make the consumer idempotent because delivery is at-least-once, set the right REPLICA IDENTITY for deletes, and — most important — treat the slot as a loaded gun by monitoring its lag and capping it with max_slot_wal_keep_size. Driving a slot into a stall on a toy system and watching your cap save the primary turns the 3am page into muscle memory.
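The monitoring half of that discipline can be sketched as a query plus a small classifier. The query uses the `pg_replication_slots` view; its `wal_status` column and the `max_slot_wal_keep_size` setting both require PostgreSQL 13 or later. The function name `classify_slot`, the state labels it returns, and the idea of alerting at a byte threshold below the cap are assumptions made for illustration, not a prescribed design.

```python
from typing import Optional

# Run this with any Postgres driver. retained_bytes is how much WAL the
# slot is holding back relative to the current write position.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       wal_status,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


def classify_slot(
    retained_bytes: Optional[int],
    wal_status: Optional[str],
    alert_bytes: int,
    cap_bytes: int,
) -> str:
    """Map one row of SLOT_LAG_SQL to a hypothetical alert state.

    cap_bytes mirrors max_slot_wal_keep_size; alert_bytes must sit below it
    so the page fires while there is still time to act.
    """
    if alert_bytes >= cap_bytes:
        raise ValueError("alert threshold must be below the cap")
    if wal_status == "lost":
        return "invalidated"  # required WAL is gone: re-snapshot needed
    if wal_status == "unreserved":
        return "critical"     # slot may be invalidated at the next checkpoint
    if retained_bytes is not None and retained_bytes >= alert_bytes:
        return "alert"        # consumer is stalled or lagging: page on-call
    return "ok"


if __name__ == "__main__":
    gib = 1024 ** 3
    # Example: alert at 5 GiB retained with a 10 GiB cap.
    print(classify_slot(6 * gib, "extended", 5 * gib, 10 * gib))
```

During the drill, logging the classifier's output alongside a timestamp gives the evidence trail: the state should move from ok to alert well before it reaches invalidated.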