Backend Architecture BE · 04 · 04

Connection lifecycle: stale connections and how to age them out

A pooled connection is reused for hours, which is the whole point — but the database, firewalls, and load balancers all reserve the right to kill it underneath you. Without max-lifetime, idle eviction, and validation, the pool happily hands out dead connections.

BE Middle ◷ 15 min


A service runs fine all day. Then, after an overnight traffic lull, the first requests of the morning fail with “connection reset by peer.” Nothing was deployed and the database is healthy. During the quiet hours the database’s wait_timeout, a firewall idle rule, or a load balancer silently dropped the long-idle pooled connections, and the pool did not notice. It kept those dead sockets and confidently handed one to the first morning request, which wrote a query into a closed pipe. The pool’s greatest strength, keeping connections open for reuse, is also a liability: a connection it holds may have died without telling it.

A held connection can die without the pool knowing

A pool keeps connections open precisely so requests skip the handshake — but “open on our side” is not “alive end to end.” A TCP connection has two ends and several middleboxes, any of which can tear it down while the pool’s side still looks ESTABLISHED:

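The database itself. Both Postgres and MySQL can close sessions server-side. MySQL’s wait_timeout closes idle connections and defaults to 8 hours (28,800 s); Postgres offers idle_session_timeout and idle_in_transaction_session_timeout, both disabled by default. When the server closes a backend, the client application typically does not find out until it next reads from or writes to the socket.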
Firewalls and NAT gateways. Stateful firewalls drop idle flows after a few minutes to reclaim table space; a common cloud NAT idle timeout is around 350 seconds. After that the connection is a black hole — packets vanish.
Load balancers and proxies. An LB in front of the database (or a PgBouncer) has its own idle and lifetime limits and recycles backends on deploys.

The result: a connection sitting idle in the pool can be quietly dead, and the pool only discovers this when a request tries to use it and gets a reset or a hang.
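The gap between “open on our side” and “alive end to end” can be reproduced on a loopback socket. The following is a minimal sketch, assuming a server that closes its end in an orderly way, as a database does when an idle timeout fires; the class and method names are hypothetical. A firewall that silently drops a flow is worse, because no close ever arrives and the client hangs instead of seeing end-of-stream.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class HalfDeadSocketDemo {
    /**
     * Returns two flags: [0] the client socket still looks open according to
     * its own bookkeeping, [1] the first read reports end-of-stream.
     */
    static boolean[] probe() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {

            // The "server" accepts and immediately closes its end,
            // standing in for a backend that hit an idle timeout.
            server.accept().close();

            // Local state is unchanged: nothing on the client side has used the socket yet.
            boolean looksOpen = client.isConnected() && !client.isClosed();

            // Only an actual read (or write) reveals that the peer is gone.
            client.setSoTimeout(2_000);
            int firstByte = client.getInputStream().read(); // -1 means end-of-stream

            return new boolean[] { looksOpen, firstByte == -1 };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `HalfDeadSocketDemo.probe()` should report that the socket looked open right up until the read, which is exactly the state a pool is in when it lends out a connection it has not touched for a while.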

Three controls keep the pool fresh

A good pool defends against staleness with three cooperating settings:

Max lifetime. Retire and replace every connection after a fixed age (HikariCP maxLifetime, default 30 minutes) whether or not it looks healthy. This proactively rotates connections out before middleboxes or the server kill them. The rule that matters: maxLifetime must be shorter than the database’s own connection timeout (e.g. a few seconds under MySQL’s wait_timeout), so the pool always closes a connection before the server does.
Idle timeout. Shrink the pool back toward a minimum during quiet periods by evicting connections idle beyond idleTimeout (default 10 minutes). This only has an effect if minimumIdle is set below maximumPoolSize; otherwise the pool keeps every connection open until maxLifetime retires it.
Keepalive / validation. Periodically probe idle connections (keepaliveTime) and/or test a connection at checkout before lending it. A modern pool validates with a lightweight protocol-level ping; if it fails, the connection is quietly discarded and a fresh one is created — so the request never sees the dead socket.

All three work together: maxLifetime prevents the in-use death, idleTimeout keeps the pool lean during quiet windows, and validation catches whatever slips through. Without maxLifetime in particular you are relying on an external system — a firewall, a load balancer, the database — to decide when your connections die, and it will always pick the worst possible moment.
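The ordering between the pool's settings and the external limits can be written down as a small check. This is a sketch, not part of any pool library: the class name, the method, and the rule that idleTimeout should stay below maxLifetime are assumptions made for illustration. The other rules restate the constraints described above.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class PoolTimeoutCheck {
    /** Returns human-readable problems; an empty list means every ordering holds. */
    static List<String> violations(Duration keepalive, Duration idleTimeout, Duration maxLifetime,
                                   int minimumIdle, int maximumPoolSize,
                                   Duration middleboxIdleLimit, Duration dbTimeout) {
        List<String> problems = new ArrayList<>();

        // The pool must retire a connection before the database closes it.
        if (maxLifetime.compareTo(dbTimeout) >= 0) {
            problems.add("maxLifetime must be shorter than the database timeout");
        }
        // Keepalive traffic must arrive before a firewall or NAT forgets the flow.
        if (keepalive.isZero() || keepalive.compareTo(middleboxIdleLimit) >= 0) {
            problems.add("keepalive must be enabled and shorter than the middlebox idle limit");
        }
        // A probe interval longer than the connection's life would never fire.
        if (keepalive.compareTo(maxLifetime) >= 0) {
            problems.add("keepalive must be shorter than maxLifetime");
        }
        // Assumption of this sketch: idle eviction is pointless if age retirement always wins.
        if (idleTimeout.compareTo(maxLifetime) >= 0) {
            problems.add("idleTimeout should be shorter than maxLifetime");
        }
        // Idle eviction only shrinks the pool when there is room below the maximum.
        if (minimumIdle >= maximumPoolSize) {
            problems.add("idleTimeout has no effect unless minimumIdle < maximumPoolSize");
        }
        return problems;
    }
}
```

With a hypothetical 2-minute keepalive, the 10-minute idleTimeout and 30-minute maxLifetime defaults, minimumIdle 2 and maximumPoolSize 10, a 350 s NAT limit, and an 8-hour database timeout, the list comes back empty. In HikariCP the corresponding values are set through `setKeepaliveTime`, `setIdleTimeout`, `setMaxLifetime`, `setMinimumIdle`, and `setMaximumPoolSize` on `HikariConfig`, all in milliseconds where they take a time.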

The pool's own timeouts (keepalive, idleTimeout, maxLifetime) must each trip before an external killer — the NAT at ~350 s or the database at 8 h — does it at the worst moment.

Validate at the right moment, cheaply

Validation has a cost: a check on every single checkout adds a round trip to every query, which can erase the latency win pooling exists for. The modern compromise is to validate on borrow but skip the check if the connection was used very recently. HikariCP’s aliveBypassWindow, 500 ms by default, means a connection returned and re-borrowed within half a second is trusted without a probe, on the reasoning that it is very unlikely to have gone stale that fast. This keeps hot connections fast while still catching the ones that have been sitting idle long enough to be at risk.
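The borrow-time decision can be sketched in a few lines. This is a toy, single-threaded model and not HikariCP's implementation: `Conn`, `ping()`, and the class itself are hypothetical stand-ins, and the clock is passed in explicitly so the behaviour is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

final class BorrowValidationSketch {
    /** Stand-in for a real connection; ping() models a protocol-level liveness check. */
    interface Conn {
        boolean ping();
    }

    private static final class Entry {
        final Conn conn;
        final long lastUsedMs;

        Entry(Conn conn, long lastUsedMs) {
            this.conn = conn;
            this.lastUsedMs = lastUsedMs;
        }
    }

    private final Deque<Entry> idle = new ArrayDeque<>();
    private final Supplier<Conn> factory;
    private final long bypassWindowMs;
    private int probes;

    BorrowValidationSketch(Supplier<Conn> factory, long bypassWindowMs) {
        this.factory = factory;
        this.bypassWindowMs = bypassWindowMs;
    }

    /** Hands out an idle connection, probing it only if it has sat idle past the bypass window. */
    Conn borrow(long nowMs) {
        Entry e;
        while ((e = idle.pollFirst()) != null) {
            if (nowMs - e.lastUsedMs <= bypassWindowMs) {
                return e.conn; // used very recently: trusted without a round trip
            }
            probes++;
            if (e.conn.ping()) {
                return e.conn; // probe succeeded
            }
            // Probe failed: discard the dead connection and try the next idle one.
        }
        return factory.get(); // nothing usable in the pool: create a fresh connection
    }

    /** Returns a connection to the pool, recording when it was last used. */
    void giveBack(Conn conn, long nowMs) {
        idle.addFirst(new Entry(conn, nowMs));
    }

    int probeCount() {
        return probes;
    }
}
```

The trade-off is visible in the first branch: a connection that dies inside the bypass window is handed out unprobed. The window is kept short because that case is rare, while the saved round trip applies to every hot borrow.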

Why this works

Why retire connections by maxLifetime proactively instead of just validating them when borrowed? Because validation only catches a connection that is already dead at the moment of checkout — it does nothing for a connection that dies after you hand it out, mid-query, when a firewall finally drops the long-lived flow or the database recycles the backend. Proactive lifetime rotation attacks the root cause: it ensures no connection ever lives long enough to hit those external limits in the first place, so the dangerous in-use death becomes vanishingly rare. The two controls are complementary, not redundant — validation handles the connection that went stale while idle in the pool, and maxLifetime handles the connection that would have gone stale while held by a request. There is also a stability benefit: rotating connections steadily spreads reconnection cost evenly over time, rather than letting the whole pool age together and then reconnect in a thundering herd when the database finally closes them all at once. Bounding a connection’s age is the same discipline as bounding the wait queue — you cap a resource deliberately rather than letting an external system cap it for you at the worst moment.
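Rotation only spreads reconnection cost if the connections do not all reach their age limit at the same instant, which they would if the pool opened them together at startup. One way to stagger them is to subtract a small random amount from each connection's lifetime. The sketch below assumes a jitter of up to 2.5% of maxLifetime; that figure, the class, and the method names are illustrative choices, not something the lesson prescribes.

```java
import java.util.Random;

final class StaggeredLifetime {
    /** Returns maxLifetimeMs reduced by a random jitter of less than 1/40 (2.5%) of it. */
    static long jittered(long maxLifetimeMs, Random rnd) {
        long maxJitterMs = maxLifetimeMs / 40;
        long jitterMs = (long) (rnd.nextDouble() * maxJitterMs); // in [0, maxJitterMs)
        return maxLifetimeMs - jitterMs;
    }

    /** Computes a lifetime for each connection in a pool created at the same moment. */
    static long[] lifetimesFor(int connections, long maxLifetimeMs, long seed) {
        Random rnd = new Random(seed); // seeded only to make the sketch reproducible
        long[] lifetimes = new long[connections];
        for (int i = 0; i < connections; i++) {
            lifetimes[i] = jittered(maxLifetimeMs, rnd);
        }
        return lifetimes;
    }
}
```

For a 30-minute maxLifetime the jitter is under 45 seconds, so ten connections opened together retire at scattered points inside a 45-second window rather than in the same millisecond, and no connection ever outlives the configured limit.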

Control	What it prevents	Key constraint
maxLifetime (~30 min)	Connection killed by DB/firewall while held	Must be < DB wait_timeout
idleTimeout (~10 min)	Holding excess idle connections off-peak	Only acts if minimumIdle < maximumPoolSize
keepalive probe	Idle flow dropped by NAT/firewall	Interval < middlebox idle limit
validate on borrow	Handing a dead socket to a request	Skip within aliveBypassWindow (500 ms)

Quiz

After an overnight lull, the first morning requests fail with 'connection reset' even though nothing deployed and the database is healthy. What happened?

Why must maxLifetime be set shorter than the database's own connection timeout (e.g. MySQL wait_timeout)?

Why do modern pools validate on borrow but skip the check within a short window like aliveBypassWindow (500 ms)?

Connections cycle between idle and in-use (happy path). Three controls intercept the unhappy path: maxLifetime retires connections by age; validation on borrow discards dead sockets; and the pool replaces both with a fresh connection so the request never sees a stale one.

key takeaway

Keeping connections open for reuse is pooling’s whole value and also its liability: the database (MySQL wait_timeout defaults to 8 hours), stateful firewalls and NAT (often ~350 s idle), and load balancers can all kill a pooled connection while the pool’s side still looks ESTABLISHED, so the pool happily lends a dead socket and the request fails with a reset. Three cooperating controls keep it fresh: maxLifetime (~30 min) proactively retires connections and must be shorter than the DB’s own timeout so the pool closes first; idleTimeout (~10 min) shrinks the pool off-peak but only acts when minimumIdle is below max; and validation on borrow plus keepalive probes catch already-dead sockets — skipping the check within aliveBypassWindow (500 ms) to keep hot connections fast. Proactive lifetime rotation and validation are complementary: one handles connections that die while idle, the other connections that would die while held.

Recall before you leave

01
Why can a pooled connection be dead even though the pool thinks it is open?
02
What three controls keep a pool's connections fresh, and what is the key constraint on each?
03
Why retire connections by maxLifetime proactively instead of relying on validation alone?

Recap

The reuse that makes pooling fast also makes it fragile: a connection the pool holds can be killed silently by the database (MySQL wait_timeout defaults to 8 hours), by stateful firewalls and NAT dropping idle flows after minutes, or by load balancers recycling backends, all while the pool’s socket still looks ESTABLISHED. The pool then lends a dead connection and the request fails with a reset, classically the first request after a quiet night. Three controls keep the pool fresh: maxLifetime (~30 min) proactively rotates connections and must be shorter than the database’s own timeout so the pool always closes first; idleTimeout (~10 min) shrinks the pool off-peak but only when minimumIdle is below maximumPoolSize; and validation on borrow plus keepalive probes catch already-dead sockets, skipping the check within a 500 ms aliveBypassWindow so hot connections stay fast. Proactive rotation and validation are complementary: validation catches connections that died while idle, and rotation retires those that would otherwise die in use. Steady rotation also avoids a thundering-herd reconnect. Now when you see “connection reset by peer” errors that appear after a traffic lull and disappear on reconnect, you can name the cause immediately: the database or a middlebox killed your idle connections, and no keepalive, validation, or maxLifetime setting caught them first. Fresh, bounded, validated connections keep the happy path healthy; the next lesson confronts what happens when they run out for the wrong reason: leaks, exhaustion, and the metrics that catch them before users do.


Connected lessons

builds on

Acquisition and timeouts: the wait queue is the real latency dial (middle)

unlocks

Pool exhaustion: leaks, and why a bigger pool won't save you (senior)

deepens into

Pool exhaustion: leaks, and why a bigger pool won't save you (senior)
