Databases DB · 05 · 04

Pool exhaustion and idle-in-transaction: the 3 AM failure mode

Pool exhaustion is almost always a hold-time problem, not a sizing problem; idle-in-transaction is the silent killer that drains a pool in seconds and also blocks VACUUM.

DB Middle ◷ 14 min

Level

FoundationsJuniorMiddleSenior

At 02:14 the PgBouncer alert fires: pool = 24, cl_waiting = 180. The app is returning 503s. pg_stat_activity shows 19 backends in state “idle in transaction” with max age 8 minutes. The pool was sized correctly this morning.

What pool exhaustion actually means

Pool exhaustion happens when every backend in the pool is checked out and no new connection can be borrowed. At the application layer: pool.acquire() times out or is rejected. At the Postgres layer: the backends behind those checked-out connections are in one of these states:

active — running a slow query or waiting on a lock
idle in transaction — between statements inside a BEGIN that has not COMMITted or ROLLBACKed
idle — genuinely idle (then why is the pool full? answer: wrong pool_size setting or connection leak)

The pool is exhausted in the sense that no slot is free. The database is not exhausted — it is waiting on the application.

Four root causes

Long-running queries — a missing index makes a query take 5 s; at 1,000 QPS that is 5,000 connections in flight. Fix: EXPLAIN ANALYZE, add the index, paginate large scans.
Idle-in-transaction — the most common. An application opened a transaction (BEGIN) and did not COMMIT or ROLLBACK: an unhandled exception, an external API call inside the transaction, a missing try/finally. The backend sits “idle in transaction” holding locks, an MVCC snapshot, and a pool slot.
Traffic burst — demand exceeds steady-state pool sizing. Fix: reserve_pool_size in PgBouncer, autoscaling at the application layer, or queue-and-degrade (return 503 with Retry-After before pool exhaustion).
External dependency inside a transaction — a Stripe charge call, an S3 upload, a webhook delivery all inside BEGIN…COMMIT. The transaction holds open for the duration of the external call. If that service slows down, every in-flight transaction holds a pool slot for seconds to minutes.

All four root causes share the same shape: something holds a backend longer than the operation itself warrants. Without root-cause diagnosis you cannot distinguish a traffic spike (cause 3) from an idle-in-transaction bug (cause 2) — both look like “pool full” in PgBouncer’s cl_waiting counter, but the fixes are completely different.

Root cause	pg_stat_activity signal	Fix
Missing index / slow query	Many rows in state=active with old query_start	EXPLAIN ANALYZE + add index
Idle-in-transaction bug	Many rows in state=idle in transaction with old xact_start	Fix code path + idle_in_transaction_session_timeout
External API inside transaction	state=idle in transaction, last query = BEGIN	Move call outside transaction; use outbox pattern
Lock contention	SELECT * FROM pg_locks WHERE NOT granted	Find and terminate blocking backend; review DDL

Idle-in-transaction is a double failure

An idle in transaction backend holds:

A pool slot — visible immediately; causes pool exhaustion
Row locks — any locked rows cannot be written by other transactions
An MVCC snapshot — VACUUM cannot clean dead tuples older than this backend’s xmin; tables bloat

Together, these three effects compound: the pool exhaustion surfaces within seconds and triggers an incident, while the MVCC bloat grows silently over hours and causes query slowdowns long after the original bug is fixed. Without the snapshot effect you might notice pool exhaustion immediately; without the lock effect you might not notice until a write starts queuing — but in practice all three arrive together.

One bug that opens a transaction and forgets to commit on an error path can drain a 20-connection pool in seconds and cause table bloat that develops over hours.

Diagnosing pool exhaustion in five steps

Trace it

1/5

Pool exhaustion alarm fires at 02:14. Walk the diagnosis.

Step 1 of 5

Step 1: what does PgBouncer SHOW POOLS tell you?

Locked

Step 2: what does pg_stat_activity show?

Locked

Step 3: identify the offending queries.

Locked

Step 4: check lock contention.

Locked

Step 5: durable mitigations.

The safety net: idle_in_transaction_session_timeout

Setting this Postgres GUC kills any backend that stays in idle in transaction past the threshold:

ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
SELECT pg_reload_conf();

With this set, the Stripe outage scenario from the hook becomes: Stripe timeouts cause individual transaction aborts (logged, alertable) rather than pool exhaustion. The pool survives; the application logs the 504s and retries.

This is a free safety net. Every production Postgres deployment should have it. Set it to a value above your slowest legitimate transaction (typically 30–120 s for OLTP).

Nested guards, inner trips first: lock_timeout (5 s) before statement_timeout (30 s) before idle_in_transaction_session_timeout (60 s) — so a failure aborts and logs instead of draining the pool.

▸Why this works

Why does idle-in-transaction also block VACUUM? MVCC (04-mvcc-isolation) showed that each transaction holds an xmin snapshot — the horizon before which VACUUM cannot clean dead tuples. A backend stuck idle-in-transaction holds that snapshot indefinitely. One forgotten COMMIT in a rarely-executed code path can prevent VACUUM from cleaning a hot table for hours, causing unbounded bloat. The pool exhaustion is the visible symptom; the table bloat is the silent one that compounds over time.

Idle-in-transaction safety numbers

idle_in_transaction_session_timeout recommended: 30–60 s
statement_timeout recommended: 10–30 s
lock_timeout recommended: 5 s
PgBouncer query_wait_timeout: 10–30 s
Alert: idle-in-tx age threshold: > 30 s
Alert: cl_waiting sustained: > 0 for 60 s

Quiz

What is the most common root cause of PgBouncer pool exhaustion in production?

Quiz

An application calls a third-party payment API inside a BEGIN...COMMIT block. The payment API starts timing out at 30 s. What happens to the connection pool?

A forgotten COMMIT (often an external API call inside the transaction) pins all three at once: the pool drains in seconds while table bloat builds over hours.

Recall before you leave

01
Walk through what pool exhaustion means at each layer — application, pooler, Postgres — and why raising pool_size usually fails.
02
Why does an idle-in-transaction backend harm more than just the pool?
03
What Postgres and PgBouncer settings act as safety nets against idle-in-transaction exhaustion?

Recap

Pool exhaustion fires when every backend is checked out and no new slot is available. The dominant root cause is hold-time: a transaction held open across a slow query, lock wait, or external API call keeps the backend occupied. Raising pool_size does not fix hold-time — it buys more in-flight transactions at the same hold, adding Postgres contention. The correct sequence is: identify the hold via pg_stat_activity and SHOW POOLS; fix the cause (index, move API call outside transaction, add COMMIT/ROLLBACK on error paths); add idle_in_transaction_session_timeout = 60s as a permanent safety net; alert on cl_waiting > 0 and idle-in-transaction age > 30s. A backend stuck idle-in-transaction also pins an MVCC snapshot, silently blocking VACUUM and growing table bloat — pool exhaustion and MVCC bloat share the same root cause. Now when you see cl_waiting climb at 3 AM, your first pg_stat_activity query targets idle-in-transaction age, not pool size — that query tells you the hold-time story in seconds.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

Pool sizing: the (cores × 2) + spindles formula and the two-layer stackmiddle

unlocks

Migrating to transaction mode: rollout playbook and PgBouncer 1.21 prepared statementsmiddle

deepens into

appears again in287

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.