awesome-everything RU
↑ Back to the climb

Databases

Pool exhaustion and idle-in-transaction: the 3 AM failure mode

Crux Pool exhaustion is almost always a hold-time problem, not a sizing problem; idle-in-transaction is the silent killer that drains a pool in seconds and also blocks VACUUM.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

At 02:14 the PgBouncer alert fires: pool = 24, cl_waiting = 180. The app is returning 503s. pg_stat_activity shows 19 backends in state “idle in transaction” with max age 8 minutes. The pool was sized correctly this morning.

What pool exhaustion actually means

Pool exhaustion happens when every backend in the pool is checked out and no new connection can be borrowed. At the application layer: pool.acquire() times out or is rejected. At the Postgres layer: the backends behind those checked-out connections are in one of these states:

  • active — running a slow query or waiting on a lock
  • idle in transaction — between statements inside a BEGIN that has not COMMITted or ROLLBACKed
  • idle — genuinely idle (then why is the pool full? answer: wrong pool_size setting or connection leak)

The pool is exhausted in the sense that no slot is free. The database is not exhausted — it is waiting on the application.

Four root causes

  1. Long-running queries — a missing index makes a query take 5 s; at 1,000 QPS that is 5,000 connections in flight. Fix: EXPLAIN ANALYZE, add the index, paginate large scans.

  2. Idle-in-transaction — the most common. An application opened a transaction (BEGIN) and did not COMMIT or ROLLBACK: an unhandled exception, an external API call inside the transaction, a missing try/finally. The backend sits “idle in transaction” holding locks, an MVCC snapshot, and a pool slot.

  3. Traffic burst — demand exceeds steady-state pool sizing. Fix: reserve_pool_size in PgBouncer, autoscaling at the application layer, or queue-and-degrade (return 503 with Retry-After before pool exhaustion).

  4. External dependency inside a transaction — a Stripe charge call, an S3 upload, a webhook delivery all inside BEGIN…COMMIT. The transaction holds open for the duration of the external call. If that service slows down, every in-flight transaction holds a pool slot for seconds to minutes.

Root causepg_stat_activity signalFix
Missing index / slow queryMany rows in state=active with old query_startEXPLAIN ANALYZE + add index
Idle-in-transaction bugMany rows in state=idle in transaction with old xact_startFix code path + idle_in_transaction_session_timeout
External API inside transactionstate=idle in transaction, last query = BEGINMove call outside transaction; use outbox pattern
Lock contentionSELECT * FROM pg_locks WHERE NOT grantedFind and terminate blocking backend; review DDL

Idle-in-transaction is a double failure

An idle in transaction backend holds:

  1. A pool slot — visible immediately; causes pool exhaustion
  2. Row locks — any locked rows cannot be written by other transactions
  3. An MVCC snapshot — VACUUM cannot clean dead tuples older than this backend’s xmin; tables bloat

One bug that opens a transaction and forgets to commit on an error path can drain a 20-connection pool in seconds and cause table bloat that develops over hours.

Diagnosing pool exhaustion in five steps

Trace it
1/5

Pool exhaustion alarm fires at 02:14. Walk the diagnosis.

1
Step 1 of 5
Step 1: what does PgBouncer SHOW POOLS tell you?
2
Locked
Step 2: what does pg_stat_activity show?
3
Locked
Step 3: identify the offending queries.
4
Locked
Step 4: check lock contention.
5
Locked
Step 5: durable mitigations.

The safety net: idle_in_transaction_session_timeout

Setting this Postgres GUC kills any backend that stays in idle in transaction past the threshold:

ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
SELECT pg_reload_conf();

With this set, the Stripe outage scenario from the hook becomes: Stripe timeouts cause individual transaction aborts (logged, alertable) rather than pool exhaustion. The pool survives; the application logs the 504s and retries.

This is a free safety net. Every production Postgres deployment should have it. Set it to a value above your slowest legitimate transaction (typically 30–120 s for OLTP).

Why this works

Why does idle-in-transaction also block VACUUM? MVCC (04-mvcc-isolation) showed that each transaction holds an xmin snapshot — the horizon before which VACUUM cannot clean dead tuples. A backend stuck idle-in-transaction holds that snapshot indefinitely. One forgotten COMMIT in a rarely-executed code path can prevent VACUUM from cleaning a hot table for hours, causing unbounded bloat. The pool exhaustion is the visible symptom; the table bloat is the silent one that compounds over time.

Idle-in-transaction safety numbers
idle_in_transaction_session_timeout recommended
30–60 s
statement_timeout recommended
10–30 s
lock_timeout recommended
5 s
PgBouncer query_wait_timeout
10–30 s
Alert: idle-in-tx age threshold
> 30 s
Alert: cl_waiting sustained
> 0 for 60 s
Quiz

What is the most common root cause of PgBouncer pool exhaustion in production?

Quiz

An application calls a third-party payment API inside a BEGIN...COMMIT block. The payment API starts timing out at 30 s. What happens to the connection pool?

Recall before you leave
  1. 01
    Walk through what pool exhaustion means at each layer — application, pooler, Postgres — and why raising pool_size usually fails.
  2. 02
    Why does an idle-in-transaction backend harm more than just the pool?
  3. 03
    What Postgres and PgBouncer settings act as safety nets against idle-in-transaction exhaustion?
Recap

Pool exhaustion fires when every backend is checked out and no new slot is available. The dominant root cause is hold-time: a transaction held open across a slow query, lock wait, or external API call keeps the backend occupied. Raising pool_size does not fix hold-time — it buys more in-flight transactions at the same hold, adding Postgres contention. The correct sequence is: identify the hold via pg_stat_activity and SHOW POOLS; fix the cause (index, move API call outside transaction, add COMMIT/ROLLBACK on error paths); add idle_in_transaction_session_timeout = 60s as a permanent safety net; alert on cl_waiting > 0 and idle-in-transaction age > 30s. A backend stuck idle-in-transaction also pins an MVCC snapshot, silently blocking VACUUM and growing table bloat — pool exhaustion and MVCC bloat share the same root cause.

Connected lessons
appears again in258
Continue the climb ↑Migrating to transaction mode: rollout playbook and PgBouncer 1.21 prepared statements
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.