Crux Read a Debezium config, replication-slot SQL, and a slot-lag log line, predict the CDC behaviour, and pick the highest-leverage fix.
CDC problems are diagnosed in the connector config and the database’s own catalog views. Read the config, the SQL, and the monitoring line, then choose the fix a senior engineer would make first.
Goal
Practise the loop you run in every CDC incident: read the connector setup and the slot’s state in the catalog, predict where WAL retention or delivery breaks, and reach for the highest-leverage fix.
This connector captures a low-traffic orders table. Which setting is the latent disk-filler, and why?
Heads-up snapshot.mode=initial snapshots once on first start, then streams; it does not re-snapshot on every restart. It is the right default for bootstrapping.
Heads-up Tombstones are tiny null-value records that let compacted topics drop deleted keys — they help compaction, they do not fill the source disk. The disk risk here is on Postgres, from a slot that never advances.
Heads-up pgoutput is the built-in logical-decoding plugin shipped since Postgres 10 and the recommended choice; it does not change WAL retention behaviour.
Snippet 2 — inspecting the slot in the catalog
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'orders_slot';
-- slot_name   | active | retained_wal
-- -------------+--------+--------------
--  orders_slot | t      | 47 GB
Quiz
The slot is active = t but retaining 47 GB of WAL. What does this tell you, and what is the first move?
Heads-up active only means a consumer is connected; it says nothing about whether restart_lsn is advancing. 47 GB of retained WAL is a disk-full risk regardless of the active flag.
Heads-up Dropping the slot discards the connector's confirmed position and forces a full re-snapshot. Diagnose why restart_lsn is stuck first — often a long transaction — before touching the slot.
Heads-up Taking pg_wal_lsn_diff between the current WAL position and restart_lsn is the standard way to measure a slot's retained WAL. This is the monitoring query you should be alerting on.
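The long-transaction check mentioned in the heads-up above can be sketched as a query against pg_stat_activity. This is a minimal sketch rather than part of the lesson's own snippets: the column selection, the 80-character truncation, and the LIMIT are illustrative choices.

```sql
-- Oldest open transactions first: a long-running one can hold a slot's
-- restart_lsn back even while the consumer is connected.
SELECT pid,
       state,
       xact_start,
       now() - xact_start AS xact_age,
       left(query, 80)    AS query_head   -- truncation length is arbitrary
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid <> pg_backend_pid()             -- skip the session running this query
ORDER BY xact_start
LIMIT 5;
```

If the oldest xact_start is hours old, resolving that transaction is usually a cheaper first move than touching the slot.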
This alert fires at 88% disk. Reading the fields, what is the correct interpretation and response order?
Heads-up Adding disk buys time but does not fix a stalled slot — it will refill. The LSN gap tells you the slot is not advancing; you must restore consumption or let the cap invalidate the slot.
Heads-up confirmed_flush_lsn trailing current_wal_lsn is normal — that gap is just unconsumed WAL. The alarm is the size of the gap, not its existence. Restarting Postgres does nothing for a stuck slot.
Heads-up active just means a consumer is connected. A connected consumer can still lag badly or be blocked by an open transaction; the 47 GB gap is real and the alert is correct.
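The cap referred to above is max_slot_wal_keep_size, which exists in PostgreSQL 13 and later. The following is a hedged sketch of setting it and then checking how close a slot is to invalidation; the 50GB value is purely illustrative and is not a recommendation from this lesson.

```sql
-- Assumption: PostgreSQL 13+. '50GB' is an illustrative value only.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();   -- the setting is picked up on a config reload

-- wal_status is one of: reserved, extended, unreserved, lost.
-- safe_wal_size is the number of bytes that can still be written before
-- the slot is at risk of being invalidated (NULL when no cap is set).
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(safe_wal_size) AS safe_wal_remaining
FROM pg_replication_slots
WHERE slot_name = 'orders_slot';
```

A slot whose wal_status reaches lost has been invalidated by the cap, so the connector would need a fresh snapshot. That is the trade the cap makes to protect the disk.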
Snippet 4 — preparing a table for full delete capture
-- delete events currently carry only the primary key in `before`
ALTER TABLE public.orders REPLICA IDENTITY FULL;
Quiz
You run this so DELETE events carry the full pre-delete row. What is the side effect a senior engineer flags before merging?
Heads-up REPLICA IDENTITY FULL is a catalog change, not a table rewrite; it does not rebuild rows. The real cost is ongoing: more WAL per update going forward.
Heads-up REPLICA IDENTITY affects what the WAL logs for changes, not query planning or index usage. Reads are unaffected.
Heads-up Inserts always log the full new row regardless of REPLICA IDENTITY. The setting only governs the before image of updates and deletes.
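Before and after a change like Snippet 4, the table's current replica identity can be read from the catalog. A minimal sketch, assuming the table is public.orders as in the snippet:

```sql
-- relreplident: d = default (primary key), n = nothing,
--               f = full row, i = a specific index
SELECT relname, relreplident
FROM pg_class
WHERE oid = 'public.orders'::regclass;
```

Once the ALTER TABLE in Snippet 4 has been applied, relreplident reads f for this table.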
Recap
Every CDC incident is read in config and catalog state. A zero heartbeat interval starves a low-traffic slot until WAL fills the disk. An active slot can still retain tens of GB if a long transaction freezes restart_lsn, so check pg_stat_activity before dropping anything. The LSN gap in pg_replication_slots is the number you alert on, with max_slot_wal_keep_size as the backstop. REPLICA IDENTITY FULL buys full delete images at the cost of fatter updates. Read the slot’s state first, fix the cause of non-advancement, and treat the disk as the clock.