awesome-everything RU
↑ Back to the climb

Databases

Connection pooling: tame an exhausted pool

Crux Hands-on project — front a Postgres service with PgBouncer in transaction mode, induce and diagnose pool exhaustion from idle-in-transaction, then prove the fix with before/after numbers.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 240 min

Reading about a 3 AM pool-exhaustion incident is not the same as pulling a service out of one. Stand up a real Postgres + PgBouncer stack, size it correctly, deliberately break it with an external call inside a transaction, diagnose the drain from SHOW POOLS and pg_stat_activity, and apply the unit’s fix ladder until the numbers come back — with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: size both pool layers from first principles, migrate to transaction mode safely, reproduce exhaustion from idle-in-transaction, diagnose it from pooler and database signals, fix the hold, and defend the bound with safety GUCs — verified by before/after metrics under identical load.

Project
0 of 8
Objective

Front a small allocation of a Postgres-backed HTTP service with PgBouncer in transaction mode, drive it into pool exhaustion via an idle-in-transaction pattern, and bring it back under SLO — proving each step with measurements, not estimates.

Requirements
Acceptance criteria
  • A before/after table measured under the same load: cl_waiting (peak), count of idle-in-transaction backends, p99 request latency, and error/503 rate — measured, not estimated.
  • Evidence that the pool now holds: at sustained load, cl_waiting stays at 0 and no backend lingers in idle-in-transaction past a few hundred ms once the external call is outside the transaction.
  • A demonstration that raising default_pool_size alone did not resolve the original drain (a captured run showing the queue persisting), justifying why the hold-time fix ranked above resizing.
  • A one-paragraph write-up of the sizing math you used (both layers, with the input numbers) and why pool_size is tied to the Postgres core count rather than to client count.
Senior stretch
  • Add a one-page on-call runbook: triage from SHOW POOLS and pg_stat_activity, the four root causes of exhaustion (slow query, idle-in-transaction, traffic burst, external API in transaction), and a verification checklist for each fix.
  • Enable driver protocol-level prepared statements with PgBouncer 1.21+ and max_prepared_statements = 1000; load-test for 30 minutes and grep PgBouncer logs to confirm zero 'prepared statement does not exist' errors while transaction-mode multiplexing stays active.
  • Simulate a NAT/firewall dropping idle TCP (kill the backend socket out-of-band) and show how tcp_keepalives_idle plus client_connection_check_interval detect the dead connection instead of failing the next query with an RST.
  • Reproduce the serverless connection-storm: spawn 200 short-lived clients simultaneously against PgBouncer vs direct-to-Postgres, and show the pooler bounds Postgres backend count while direct-connect hits 'too many clients already'.
Recap

This is the loop you will run in every real pooling incident: size both layers from the formulas (client pool from concurrency, backend pool from (cores × 2) + spindles), migrate to transaction mode behind an audit, then when the pool empties, read the signals before touching config — cl_waiting at the pooler, state and xact age at Postgres — to prove it is hold-time, not size. Fix the hold (move external calls out of transactions, commit before slow work), then install idle_in_transaction_session_timeout and query_wait_timeout as the permanent safety net. Doing it once on a toy stack makes the production version muscle memory.

Continue the climb ↑What a schema migration is and why it replaces ad-hoc DDL
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.