Backend Architecture BE · 04 · 10

Pooling: diagnose and tame an exhausted pool

Hands-on project — instrument a pool, drive it into starvation, exhaust it with a planted leak, then size it, bound the wait, plug the leak, and prove each fix with before/after pool metrics.

BE Senior ◷ 240 min


Reading about starvation and leaks is not the same as pulling a service out of one. Build a small service against a real database, drive it into pool starvation and then into a leak, and apply the unit’s levers (size, wait, lifecycle, return) until the four pool gauges return to healthy values, with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: instrument the four pool gauges, reproduce starvation and a leak, size the pool from hardware, bound the acquisition wait, guarantee the return, and verify each fix with before/after numbers under identical load.
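As a starting point for the instrumentation step, here is a minimal sketch of a fixed-size pool that exposes the four gauges and bounds the acquisition wait. It is a toy, not a real driver pool: `GaugedPool`, `PoolTimeout`, `make_conn`, and `acquire_timeout_s` are hypothetical names chosen for illustration, and a production pool (HikariCP, pgx, psycopg's pool, and so on) exposes the same gauges through its own metrics API.

```python
import queue
import threading
from contextlib import contextmanager


class PoolTimeout(Exception):
    """Raised when no connection frees up within the acquisition timeout."""


class GaugedPool:
    """Toy fixed-size pool exposing active/idle/total/waiting gauges."""

    def __init__(self, make_conn, size, acquire_timeout_s):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(make_conn())
        self._total = size
        self._timeout = acquire_timeout_s
        self._lock = threading.Lock()
        self._active = 0
        self._waiting = 0

    def acquire(self):
        with self._lock:
            self._waiting += 1
        try:
            # Bounded wait: a starved pool fails fast instead of queueing forever.
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolTimeout(f"no connection within {self._timeout}s") from None
        finally:
            with self._lock:
                self._waiting -= 1
        with self._lock:
            self._active += 1
        return conn

    def release(self, conn):
        with self._lock:
            self._active -= 1
        self._idle.put(conn)

    @contextmanager
    def connection(self):
        """Acquire a connection and guarantee its return, even on error."""
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)

    def gauges(self):
        # Approximate snapshot: idle is derived from total and active.
        with self._lock:
            return {
                "active": self._active,
                "idle": self._total - self._active,
                "total": self._total,
                "waiting": self._waiting,
            }
```

Scraping `gauges()` on a timer while the load test runs gives the before/after series the project asks for. A `waiting` value that stays above zero while `active` equals `total` is the signature to look for when the pool is starved.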

Project


Objective

Take a small HTTP service backed by a real Postgres or MySQL pool, deliberately drive it into pool starvation and connection-leak exhaustion, then bring it to a healthy steady state — right-sized pool, fail-fast acquisition, no leaks, fresh connections — proving each step with the active/idle/total/waiting gauges.
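The planted leak and its fix can be sketched in a few lines. This is an illustrative stand-in, not the project's service: `CountingPool`, `run_query`, and the two handlers are hypothetical, and the pool tracks only how many connections are checked out so the effect on the active gauge is easy to see.

```python
class CountingPool:
    """Minimal stand-in for a pool: tracks only how many connections are out."""

    def __init__(self, size):
        self.size = size
        self.active = 0

    def acquire(self):
        if self.active >= self.size:
            raise RuntimeError("pool exhausted")
        self.active += 1
        return object()  # placeholder for a real connection

    def release(self, conn):
        self.active -= 1


def run_query(conn, fail):
    """Pretend query; fail=True simulates the error path."""
    if fail:
        raise ValueError("query failed")
    return "ok"


def leaky_handler(pool, fail):
    conn = pool.acquire()
    result = run_query(conn, fail)  # an exception here skips the release below
    pool.release(conn)
    return result


def fixed_handler(pool, fail):
    conn = pool.acquire()
    try:
        return run_query(conn, fail)
    finally:
        pool.release(conn)  # runs on success and on error


def drive_error_traffic(handler, pool, requests):
    """Send error-path requests; return how many found the pool exhausted."""
    exhausted = 0
    for _ in range(requests):
        try:
            handler(pool, fail=True)
        except ValueError:
            pass  # the planted query error
        except RuntimeError:
            exhausted += 1  # no connection left to hand out
    return exhausted
```

With the leaky handler every failed request strands one connection, so `active` climbs to the pool size and never comes back. With the fixed handler `active` returns to zero once the traffic stops, which is the evidence the leak scenario needs.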

Requirements

Acceptance criteria

A before/after table of the four gauges (active, idle, total, waiting), acquisition wait p99, and request p99 — measured under the same load, not estimated, for both the starvation and the leak scenarios.
The leak is provably gone: with the try/finally fix, active returns to baseline after the error-path traffic stops, and leak detection emits no warnings under sustained load.
The pool is sized to (cores × 2) + spindles (or to the measured saturation point), and a short load test shows it reaches peak throughput at lower latency than a pool twice as large on the same hardware.
A one-paragraph write-up naming which lever fixed each scenario (size, acquisition timeout, try/finally, maxLifetime) and why enlarging the pool was the wrong instinct for the leak.
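For the before/after table, one possible way to turn raw samples into the p99 columns is a nearest-rank percentile plus a small formatter. Both helpers are hypothetical sketches, and the metric keys (`acquire_wait_p99_ms`, `request_p99_ms`) are names assumed here for illustration. Your load tool or metrics backend may already compute percentiles with a different interpolation method, in which case use the same method for both runs.

```python
import math

METRICS = ["active", "idle", "total", "waiting",
           "acquire_wait_p99_ms", "request_p99_ms"]


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def before_after_table(before, after):
    """Render one row per metric from two dicts keyed by METRICS."""
    lines = [f"{'metric':<22}{'before':>10}{'after':>10}"]
    for name in METRICS:
        lines.append(f"{name:<22}{before[name]:>10}{after[name]:>10}")
    return "\n".join(lines)
```

Feed `percentile` the acquisition-wait and request-latency samples recorded during each run, then pass the measured gauge values and p99s for the two runs to `before_after_table`. Build one table per scenario so the starvation and leak results stay separate.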

Senior stretch

Add a one-page on-call runbook: triage from the four gauges (leak vs true overload vs hoarding), the sizing formula, the acquisition-timeout rule, and a verification checklist.
Simulate fan-in: run several instances of the service against one database so that (instances × pool size) exceeds max_connections, reproduce 'FATAL: sorry, too many clients already', then put PgBouncer in transaction-pooling mode in front and show the demanded client connections collapse onto a few dozen backend connections.
Demonstrate the transaction-pooling tradeoff: prove a query path that relies on a server-side prepared statement or a SET session variable breaks under PgBouncer transaction mode, then rewrite it to be session-state-free.
Add the async-boundary trap and its fix: hold a connection across an unrelated external HTTP call, show the pool exhausting under load with nothing leaking, then move the acquire after the call and show concurrency recover.
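The sizing formula and the fan-in arithmetic behind the stretch goals can be checked with a short calculation. This is a sketch under stated assumptions: the function names and the `reserved` parameter are hypothetical, and `max_connections` should be read from your own database rather than assumed (stock Postgres configurations typically ship with 100).

```python
def pool_size(cores, spindles):
    """Starting point from the sizing formula: (cores x 2) + spindles."""
    return cores * 2 + spindles


def fan_in(instances, per_instance_pool, max_connections, reserved=0):
    """Return (connections demanded, whether demand exceeds the server budget).

    reserved is a hypothetical allowance for connections kept back for
    admin or maintenance sessions.
    """
    demanded = instances * per_instance_pool
    budget = max_connections - reserved
    return demanded, demanded > budget


if __name__ == "__main__":
    size = pool_size(cores=4, spindles=1)
    demanded, over = fan_in(instances=12, per_instance_pool=size,
                            max_connections=100)
    print(f"pool size per instance: {size}")
    print(f"demanded: {demanded}, exceeds max_connections: {over}")
```

A 4-core database host with one spindle gives a pool of 9. Twelve instances at that size demand 108 connections, which is past a limit of 100 even though each pool is individually right-sized. That gap is what a transaction-mode pooler in front of the database is meant to absorb.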

Recap

This is the loop you will run in every real pooling incident. Instrument the four gauges first, then reproduce the failure and tell starvation from a leak from hoarding. Fix at the right lever: size the pool from cores, bound the acquisition wait so a starved pool fails fast, guarantee the return with try/finally, and rotate connections under the DB’s own timeout. Then verify with before/after numbers under identical load. Doing it once on a toy service, including the fan-in case that forces PgBouncer, makes the production version muscle memory: the connection is a hard, shared, expensive resource, and you bound it deliberately rather than enlarging a buffer that only delays the outage.
