Databases
Connection pooling: tame an exhausted pool
Reading about a 3 AM pool-exhaustion incident is not the same as pulling a service out of one. Stand up a real Postgres + PgBouncer stack, size it correctly, deliberately break it with an external call inside a transaction, diagnose the drain from SHOW POOLS and pg_stat_activity, and apply the unit’s fix ladder until the numbers come back — with evidence at every step.
Turn the unit’s mental model into a reproducible engineering loop: size both pool layers from first principles, migrate to transaction mode safely, reproduce exhaustion from idle-in-transaction, diagnose it from pooler and database signals, fix the hold, and defend the bound with safety GUCs — verified by before/after metrics under identical load.
Front a small allocation of a Postgres-backed HTTP service with PgBouncer in transaction mode, drive it into pool exhaustion via an idle-in-transaction pattern, and bring it back under SLO — proving each step with measurements, not estimates.
- A before/after table measured under the same load: cl_waiting (peak), count of idle-in-transaction backends, p99 request latency, and error/503 rate — measured, not estimated.
- Evidence that the pool now holds: at sustained load, cl_waiting stays at 0 and no backend lingers in idle-in-transaction past a few hundred ms once the external call is outside the transaction.
- A demonstration that raising default_pool_size alone did not resolve the original drain (a captured run showing the queue persisting), justifying why the hold-time fix ranked above resizing.
- A one-paragraph write-up of the sizing math you used (both layers, with the input numbers) and why pool_size is tied to the Postgres core count rather than to client count.
- Add a one-page on-call runbook: triage from SHOW POOLS and pg_stat_activity, the four root causes of exhaustion (slow query, idle-in-transaction, traffic burst, external API in transaction), and a verification checklist for each fix.
- Enable driver protocol-level prepared statements with PgBouncer 1.21+ and max_prepared_statements = 1000; load-test for 30 minutes and grep PgBouncer logs to confirm zero 'prepared statement does not exist' errors while transaction-mode multiplexing stays active.
- Simulate a NAT/firewall dropping idle TCP (kill the backend socket out-of-band) and show how tcp_keepalives_idle plus client_connection_check_interval detect the dead connection instead of failing the next query with an RST.
- Reproduce the serverless connection-storm: spawn 200 short-lived clients simultaneously against PgBouncer vs direct-to-Postgres, and show the pooler bounds Postgres backend count while direct-connect hits 'too many clients already'.
This is the loop you will run in every real pooling incident: size both layers from the formulas (client pool from concurrency, backend pool from (cores × 2) + spindles), migrate to transaction mode behind an audit, then when the pool empties, read the signals before touching config — cl_waiting at the pooler, state and xact age at Postgres — to prove it is hold-time, not size. Fix the hold (move external calls out of transactions, commit before slow work), then install idle_in_transaction_session_timeout and query_wait_timeout as the permanent safety net. Doing it once on a toy stack makes the production version muscle memory.