Databases DB · 05 · 10

Connection pooling: tame an exhausted pool

Hands-on project — front a Postgres service with PgBouncer in transaction mode, induce and diagnose pool exhaustion from idle-in-transaction, then prove the fix with before/after numbers.

DB Senior ◷ 240 min

Level

FoundationsJuniorMiddleSenior

Reading about a 3 AM pool-exhaustion incident is not the same as pulling a service out of one. Stand up a real Postgres + PgBouncer stack, size it correctly, deliberately break it with an external call inside a transaction, diagnose the drain from SHOW POOLS and pg_stat_activity, and apply the unit’s fix ladder until the numbers come back — with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: size both pool layers from first principles, migrate to transaction mode safely, reproduce exhaustion from idle-in-transaction, diagnose it from pooler and database signals, fix the hold, and defend the bound with safety GUCs — verified by before/after metrics under identical load.

Project

0 of 8

Objective

Front a small allocation of a Postgres-backed HTTP service with PgBouncer in transaction mode, drive it into pool exhaustion via an idle-in-transaction pattern, and bring it back under SLO — proving each step with measurements, not estimates.

Requirements

Acceptance criteria

A before/after table measured under the same load: cl_waiting (peak), count of idle-in-transaction backends, p99 request latency, and error/503 rate — measured, not estimated.
Evidence that the pool now holds: at sustained load, cl_waiting stays at 0 and no backend lingers in idle-in-transaction past a few hundred ms once the external call is outside the transaction.
A demonstration that raising default_pool_size alone did not resolve the original drain (a captured run showing the queue persisting), justifying why the hold-time fix ranked above resizing.
A one-paragraph write-up of the sizing math you used (both layers, with the input numbers) and why pool_size is tied to the Postgres core count rather than to client count.

Senior stretch

Add a one-page on-call runbook: triage from SHOW POOLS and pg_stat_activity, the four root causes of exhaustion (slow query, idle-in-transaction, traffic burst, external API in transaction), and a verification checklist for each fix.
Enable driver protocol-level prepared statements with PgBouncer 1.21+ and max_prepared_statements = 1000; load-test for 30 minutes and grep PgBouncer logs to confirm zero 'prepared statement does not exist' errors while transaction-mode multiplexing stays active.
Simulate a NAT/firewall dropping idle TCP (kill the backend socket out-of-band) and show how tcp_keepalives_idle plus client_connection_check_interval detect the dead connection instead of failing the next query with an RST.
Reproduce the serverless connection-storm: spawn 200 short-lived clients simultaneously against PgBouncer vs direct-to-Postgres, and show the pooler bounds Postgres backend count while direct-connect hits 'too many clients already'.

Recap

This is the loop you will run in every real pooling incident: size both layers from the formulas (client pool from concurrency, backend pool from (cores × 2) + spindles), migrate to transaction mode behind an audit, then when the pool empties, read the signals before touching config — cl_waiting at the pooler, state and xact age at Postgres — to prove it is hold-time, not size. Fix the hold (move external calls out of transactions, commit before slow work), then install idle_in_transaction_session_timeout and query_wait_timeout as the permanent safety net. Doing it once on a toy stack makes the production version muscle memory.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.