Backend Architecture
Connection lifecycle: stale connections and how to age them out
A service runs fine all day, goes quiet overnight, and then the first requests of the morning fail with “connection reset by peer.” Nothing was deployed and the database is healthy. During the quiet hours, something on the path (the database’s wait_timeout, a firewall idle rule, or a load balancer) silently dropped the long-idle pooled connections, and the pool did not notice. It kept those dead sockets and confidently handed one to the first morning request, which wrote a query into a closed pipe. The pool’s greatest strength, keeping connections open for reuse, is also a liability: a connection it holds may have died without telling it.
A held connection can die without the pool knowing
A pool keeps connections open precisely so requests skip the handshake — but “open on our side” is not “alive end to end.” A TCP connection has two ends and several middleboxes, any of which can tear it down while the pool’s side still looks ESTABLISHED:
- The database itself. Postgres and MySQL can enforce idle timeouts server-side (Postgres’s idle_session_timeout and idle_in_transaction_session_timeout, MySQL’s wait_timeout defaulting to 8 hours). When the server closes a backend, the application on the client side does not notice until it next reads from or writes to the socket.
- Firewalls and NAT gateways. Stateful firewalls drop idle flows after a few minutes to reclaim table space; a common cloud NAT idle timeout is around 350 seconds. After that the connection is a black hole: packets vanish.
- Load balancers and proxies. An LB in front of the database (or a PgBouncer) has its own idle and lifetime limits and recycles backends on deploys.
The result: a connection sitting idle in the pool can be quietly dead, and the pool only discovers this when a request tries to use it and gets a reset or a hang.
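The gap between "looks open" and "is alive" can be seen with a minimal local sketch. It uses a connected pair of local sockets as a stand-in for the pool-to-database link; the names pool_side and db_side are illustrative only. Real TCP through a firewall or NAT can behave worse than this (a hang, or a failure only on a later write), so treat it as an analogy, not a reproduction.

```python
import socket

# A connected pair of local sockets stands in for pool <-> database.
pool_side, db_side = socket.socketpair()

# The "database" hangs up. Nothing is raised on the pool's side.
db_side.close()

# The pool's handle still looks usable: it has a valid file descriptor.
looks_open = pool_side.fileno() != -1

# Only actual I/O reveals that the peer is gone.
try:
    pool_side.sendall(b"SELECT 1")
    write_failed = False
except OSError:  # BrokenPipeError and ConnectionResetError are subclasses
    write_failed = True

pool_side.close()
print("looks open:", looks_open, "| write failed:", write_failed)
```

On Linux and macOS the write is expected to fail immediately here because the peer is local. The point is only that no error surfaces between the close and the first I/O attempt, which is exactly the window in which a pool lends out a dead connection.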
Three controls keep the pool fresh
A good pool defends against staleness with three cooperating settings:
- Max lifetime. Retire and replace every connection after a fixed age (HikariCP maxLifetime, default 30 minutes) whether or not it looks healthy. This proactively rotates connections out before middleboxes or the server kill them. The rule that matters: maxLifetime must be shorter than the database’s own connection timeout (e.g. a few seconds under MySQL’s wait_timeout), so the pool always closes a connection before the server does.
- Idle timeout. Shrink the pool back toward a minimum during quiet periods by evicting connections idle beyond idleTimeout (default 10 minutes). This only does anything if minimumIdle is set below the maximum pool size; otherwise the pool keeps every connection open until maxLifetime retires it.
- Keepalive / validation. Periodically probe idle connections (keepaliveTime) and/or test a connection at checkout before lending it. A modern pool validates with a lightweight protocol-level ping; if it fails, the connection is quietly discarded and a fresh one is created, so the request never sees the dead socket.
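The constraints attached to these three controls can be checked mechanically. The following is a minimal sketch, not any pool's real API: PoolSettings, check_settings, and their field names are hypothetical, and the limits passed in (database timeout, middlebox idle limit) are values you would have to look up for your own environment.

```python
from dataclasses import dataclass


@dataclass
class PoolSettings:
    """Hypothetical container mirroring the settings discussed above."""
    max_lifetime_s: float
    keepalive_s: float          # 0 means keepalive probing is disabled
    minimum_idle: int
    maximum_pool_size: int


def check_settings(pool, db_timeout_s, middlebox_idle_s):
    """Return a list of human-readable problems; empty means no violations."""
    problems = []
    if pool.max_lifetime_s >= db_timeout_s:
        problems.append("max_lifetime must be shorter than the database timeout")
    if pool.minimum_idle >= pool.maximum_pool_size:
        problems.append("idle_timeout has no effect unless minimum_idle < maximum_pool_size")
    if pool.keepalive_s <= 0 or pool.keepalive_s >= middlebox_idle_s:
        problems.append("keepalive interval must be shorter than the middlebox idle limit")
    return problems


# Example: 30-minute lifetime, no keepalive, fixed-size pool,
# checked against an 8-hour wait_timeout and a 350-second NAT idle limit.
settings = PoolSettings(max_lifetime_s=1800, keepalive_s=0,
                        minimum_idle=10, maximum_pool_size=10)
for problem in check_settings(settings, db_timeout_s=8 * 3600, middlebox_idle_s=350):
    print("WARN:", problem)
```

In this example the lifetime passes (30 minutes is well under 8 hours), but the fixed-size pool makes the idle timeout inert and the missing keepalive leaves idle flows exposed to the NAT limit.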
Validate at the right moment, cheaply
Validation has a cost: a check on every single checkout adds a round trip to every query, which can erase the latency win pooling exists for. The modern compromise is validate on borrow, but skip the check if the connection was used very recently — HikariCP’s aliveBypassWindow of 500 ms means a connection returned and re-borrowed within half a second is trusted without a probe, on the reasoning that it cannot have gone stale that fast. This keeps hot connections fast while still catching the ones that have been sitting idle long enough to be at risk.
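The borrow-time logic described above might look roughly like the sketch below. TinyPool, PooledConn, and the connect/ping callables are hypothetical names invented for illustration; this is not how HikariCP is implemented, only a small model of "probe on borrow unless used within the bypass window." The clock is injectable so the behavior can be exercised without sleeping.

```python
import time
from collections import deque


class PooledConn:
    """Wraps a raw connection with the timestamp of its last use."""
    def __init__(self, raw, now):
        self.raw = raw
        self.last_used = now


class TinyPool:
    def __init__(self, connect, ping, bypass_window_s=0.5, clock=time.monotonic):
        self._connect = connect          # callable returning a new raw connection
        self._ping = ping                # callable(raw) -> True if still alive
        self._bypass = bypass_window_s   # trust connections used this recently
        self._clock = clock
        self._idle = deque()

    def borrow(self):
        while self._idle:
            conn = self._idle.pop()      # most recently returned first
            recently_used = self._clock() - conn.last_used < self._bypass
            if recently_used or self._ping(conn.raw):
                return conn
            # Probe failed: discard this connection and try the next one.
            # (A real pool would also close the raw handle here.)
        return PooledConn(self._connect(), self._clock())

    def give_back(self, conn):
        conn.last_used = self._clock()
        self._idle.append(conn)
```

A connection returned and re-borrowed within the window costs no round trip, while one that has sat idle is probed first and silently replaced if the probe fails.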
Why this works
Why retire connections by maxLifetime proactively instead of just validating them when borrowed? Because validation only catches a connection that is already dead at the moment of checkout — it does nothing for a connection that dies after you hand it out, mid-query, when a firewall finally drops the long-lived flow or the database recycles the backend. Proactive lifetime rotation attacks the root cause: it ensures no connection ever lives long enough to hit those external limits in the first place, so the dangerous in-use death becomes vanishingly rare. The two controls are complementary, not redundant — validation handles the connection that went stale while idle in the pool, and maxLifetime handles the connection that would have gone stale while held by a request. There is also a stability benefit: rotating connections steadily spreads reconnection cost evenly over time, rather than letting the whole pool age together and then reconnect in a thundering herd when the database finally closes them all at once. Bounding a connection’s age is the same discipline as bounding the wait queue — you cap a resource deliberately rather than letting an external system cap it for you at the worst moment.
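One way to picture age-based retirement and the "spread reconnection cost over time" point is the sketch below. The function names and the 2.5% jitter fraction are assumptions of this sketch, not something the text above specifies: each connection gets a lifetime slightly under the configured maximum, so connections created together do not all expire in the same instant.

```python
import random


def assign_lifetime(max_lifetime_s, rng=random, jitter_fraction=0.025):
    """Return a per-connection lifetime slightly under max_lifetime_s.

    The random reduction (an assumption of this sketch) staggers expiry
    so a pool filled at startup does not reconnect all at once.
    """
    return max_lifetime_s - rng.uniform(0, max_lifetime_s * jitter_fraction)


def is_expired(created_at, lifetime_s, now):
    """True once the connection has reached its assigned lifetime."""
    return now - created_at >= lifetime_s


# Ten connections created at the same moment with a 30-minute maximum.
rng = random.Random(42)
lifetimes = sorted(assign_lifetime(1800, rng) for _ in range(10))
print("earliest retirement after %.0f s, latest after %.0f s"
      % (lifetimes[0], lifetimes[-1]))
```

The retirement check runs regardless of whether the connection still looks healthy, which is what distinguishes it from validation: it bounds age rather than testing liveness.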
| Control | What it prevents | Key constraint |
|---|---|---|
| maxLifetime (~30 min) | Connection killed by DB/firewall while held | Must be < DB wait_timeout |
| idleTimeout (~10 min) | Holding excess idle connections off-peak | Only acts if minimumIdle < max |
| keepalive probe | Idle flow dropped by NAT/firewall | Interval < middlebox idle limit |
| validate on borrow | Handing a dead socket to a request | Skip within aliveBypassWindow (500 ms) |
After an overnight lull, the first morning requests fail with 'connection reset' even though nothing deployed and the database is healthy. What happened?
Why must maxLifetime be set shorter than the database's own connection timeout (e.g. MySQL wait_timeout)?
Why do modern pools validate on borrow but skip the check within a short window like aliveBypassWindow (500 ms)?
1. Why can a pooled connection be dead even though the pool thinks it is open?
2. What three controls keep a pool's connections fresh, and what is the key constraint on each?
3. Why retire connections by maxLifetime proactively instead of relying on validation alone?
The reuse that makes pooling fast also makes it fragile: a connection the pool holds can be killed silently by the database (MySQL wait_timeout defaults to 8 hours), by stateful firewalls and NAT dropping idle flows after minutes, or by load balancers recycling backends, all while the pool’s socket still looks ESTABLISHED — so it lends a dead connection and the request fails with a reset, classically the first request after a quiet night. Three controls keep the pool fresh: maxLifetime (~30 min) proactively rotates connections and must be shorter than the database’s own timeout so the pool always closes first; idleTimeout (~10 min) shrinks the pool off-peak but only when minimumIdle is below max; and validation on borrow plus keepalive probes catch already-dead sockets, skipping the check within a 500 ms aliveBypassWindow so hot connections stay fast. Proactive rotation and validation are complementary — one catches connections that die idle, the other those that would die in use — and steady rotation also avoids a thundering-herd reconnect. Fresh, bounded, validated connections keep the happy path healthy; the next lesson confronts what happens when they run out for the wrong reason: leaks, exhaustion, and the metrics that catch them before users do.