Backend Architecture

Acquisition and timeouts: the wait queue is the real latency dial

Crux: When every connection is busy, a request does not fail; it waits in a queue. The acquisition timeout decides how long it waits before giving up, and that one number is the difference between failing fast and a pile-up that takes the whole service down.

A downstream query that normally takes 5 ms slows to 500 ms during an incident. Within seconds, every connection in the pool is occupied by one of these slow queries. The next request asks for a connection and there are none free — so it waits. The pool has a default acquisition timeout of 30 seconds, so it waits up to 30 seconds before failing. Now requests are piling up behind an empty pool, each holding a web-server thread hostage for 30 seconds, and the thread pool fills too. The database was merely slow; the acquisition timeout turned slow into down, because nobody decided how long a request should be willing to wait.

Checkout is not always instant

The happy-path story is “check out a connection, it is free, run your query.” But a fixed-size pool has a second state: all connections are busy. When that happens, checkout does not error and it does not magically create a new connection. It blocks, putting the caller into a wait queue until a connection is returned or a timeout fires. This waiting is invisible in normal times because the pool is rarely empty, but it is the single most important behaviour to understand, because most pool-related outages start here.
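
The blocking behaviour can be sketched with a toy pool built on Python's standard `queue.Queue`. This is an illustration under stated assumptions, not how any particular pool library is implemented: `FixedPool`, `PoolTimeout`, and the `conn-N` strings are hypothetical stand-ins, and `queue.Queue` does not promise strict first-come-first-served ordering among waiters.

```python
import queue


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the acquisition timeout."""


class FixedPool:
    """A toy fixed-size pool: checkout blocks instead of growing or erroring."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")  # stand-ins for real connections

    def acquire(self, timeout_s):
        try:
            # Blocks the calling thread until a connection is returned
            # or timeout_s elapses.
            return self._free.get(timeout=timeout_s)
        except queue.Empty:
            raise PoolTimeout(f"no free connection after {timeout_s}s") from None

    def release(self, conn):
        self._free.put(conn)


pool = FixedPool(size=1)
held = pool.acquire(timeout_s=0.05)      # a connection is free: returns at once
try:
    pool.acquire(timeout_s=0.05)         # pool is empty: waits, then gives up
    outcome = "acquired"
except PoolTimeout:
    outcome = "timed out"
pool.release(held)
```

The second `acquire` neither fails immediately nor creates a new connection: the caller is parked for the full timeout, and only then does the pool raise.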

So a request’s total time is no longer just query time on the database; it is now wait-for-connection time + query time. Under load, the wait-for-connection part can dwarf the query itself, and it is completely hidden unless you measure it separately.
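
To see how the wait component grows, here is a small deterministic model, assuming every query holds its connection for exactly the same time and waiters are served in arrival order. The function name `checkout_waits` and the pool size of 2 are illustrative choices, not values from any real system.

```python
import heapq


def checkout_waits(pool_size, query_ms, arrivals_ms):
    """Return the wait-for-connection time (ms) of each request, in arrival order.

    Every query is assumed to hold its connection for exactly query_ms.
    """
    free_at = [0] * pool_size          # when each connection next becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival in sorted(arrivals_ms):
        earliest = heapq.heappop(free_at)      # soonest-available connection
        start = max(arrival, earliest)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + query_ms)
    return waits


# Six requests arrive together at a 2-connection pool.
healthy = checkout_waits(pool_size=2, query_ms=5, arrivals_ms=[0] * 6)
slow = checkout_waits(pool_size=2, query_ms=500, arrivals_ms=[0] * 6)
```

With 5 ms queries the last pair of requests waits 10 ms. With 500 ms queries the same pair waits 1000 ms before its own 500 ms query even starts, so two thirds of its total time is spent waiting for a connection.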

The acquisition timeout is a latency dial

The acquisition timeout (HikariCP’s connectionTimeout, default 30 seconds) is how long a caller will sit in the wait queue before the pool gives up and throws. This number is not a safety detail to leave at default — it is a deliberate latency budget for the worst case. Setting it well means choosing what should happen when the pool is starved:

  • Too long (e.g. 30 s default). Requests wait a small eternity. Each waiter holds an upstream resource — a web-server worker thread, an HTTP connection — for the whole time. The pool empties, then the thread pool fills with waiters, then the service stops accepting requests at all. One slow dependency cascades into a full outage.
  • Too short (e.g. 50 ms). Requests fail the instant the pool is briefly full, including normal micro-bursts that would have cleared in 60 ms. You convert transient pressure into a flood of errors.
  • Right (often 1–3× a normal query’s time, e.g. a few hundred ms to ~2 s). Long enough to ride out a normal burst, short enough that real starvation fails fast and frees the upstream thread to do something useful: return a 503, shed load, or trip a breaker.
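
The three settings above can be compared with a rough simulation. It assumes a hypothetical 2-connection pool, fixed query durations, arrival-order service, and that a request which times out never occupies a connection; the burst sizes are invented for illustration.

```python
import heapq


def simulate(pool_size, query_ms, arrivals_ms, timeout_ms):
    """Return (served, timed_out, worst_wait_ms) for arrival-order checkout.

    A request that cannot get a connection within timeout_ms gives up at
    arrival + timeout_ms and never occupies a connection.
    """
    free_at = [0] * pool_size
    heapq.heapify(free_at)
    served = timed_out = worst_wait = 0
    for arrival in sorted(arrivals_ms):
        wait = max(0, free_at[0] - arrival)    # free_at[0] is the soonest-free connection
        if wait > timeout_ms:
            timed_out += 1
            worst_wait = max(worst_wait, timeout_ms)
            continue
        served += 1
        worst_wait = max(worst_wait, wait)
        heapq.heapreplace(free_at, arrival + wait + query_ms)
    return served, timed_out, worst_wait


burst = [0] * 8          # 8 requests at once, 20 ms queries: a benign micro-burst
starved = [0] * 100      # 100 requests at once, 500 ms queries: real starvation

results = {
    timeout_ms: (simulate(2, 20, burst, timeout_ms),
                 simulate(2, 500, starved, timeout_ms))
    for timeout_ms in (50, 1_000, 30_000)
}
```

In this model the 50 ms timeout rejects two of the eight burst requests even though the burst would have cleared in 60 ms. The 30 s timeout serves all 100 starved requests, but the last one pins its thread for 24.5 s first. The 1 s timeout serves the whole burst and caps every starved waiter at one second.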
Why this works

Why is failing fast better than waiting a long time when the pool is starved? Because a waiting request is not free — it pins resources all the way up the stack. While it sits in the acquisition queue it still holds a web-server thread, a socket, request memory, and often an upstream caller blocked on it. A 30-second wait is 30 seconds of holding all of that for a request that will probably fail anyway. Multiply by hundreds of concurrent requests and the upstream thread pool fills with waiters, so the service can no longer even accept new connections — the classic thread starvation spiral, where a slow database takes down a healthy web tier. A short timeout converts that slow-motion collapse into immediate, cheap failures: the request errors in a few hundred milliseconds, the thread is freed, and the system can apply its real overload strategy (retry elsewhere, shed load, return a degraded response) instead of locking up. Fast failure preserves capacity; slow failure consumes it. This is the same head-of-line-blocking lesson from the throughput unit — one stuck stage poisons everything queued behind it — so you cap the wait deliberately.
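
A back-of-the-envelope check makes the cost of waiting concrete. By Little's law, the average number of threads in use equals the arrival rate times how long each request holds its thread. The 200 requests per second below is a hypothetical figure, and the sketch assumes every request hits a starved pool and waits out the full timeout.

```python
def pinned_threads(arrivals_per_s, hold_s):
    """Little's law: average threads in use = arrival rate x time each is held."""
    return arrivals_per_s * hold_s


rate = 200                                  # hypothetical requests per second
slow_failure = pinned_threads(rate, 30)     # each doomed waiter holds a thread for 30 s
fast_failure = pinned_threads(rate, 0.5)    # each gives up after 500 ms
```

Under these assumptions a 30-second wait needs 6000 threads just to hold doomed waiters, far more than a typical worker pool has, while a 500 ms wait needs about 100.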

The pile-up is a feedback loop

The dangerous part of an empty pool is that it is self-reinforcing. Slow queries hold connections longer → the pool drains → new requests queue → those requests hold upstream threads while queued → the upstream tier saturates → retries pile on more requests → the database, now under even more pressure, gets slower still. Each step makes the next worse. This is why a small latency blip on a dependency can become a total outage minutes later: the pool’s wait behaviour amplifies it. The defences are all about bounding the wait: a sane acquisition timeout, a separately-monitored “threads waiting” metric, and fast failure so upstream capacity is never consumed by doomed waiters.
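
One way to make the “threads waiting” signal observable is to count callers while they are blocked in checkout. The sketch below is a toy, with hypothetical names (`InstrumentedPool`, `waiting`, `peak_waiting`, `timeouts`); HikariCP reports a comparable figure through `HikariPoolMXBean.getThreadsAwaitingConnection()`.

```python
import queue
import threading


class PoolTimeout(Exception):
    """Raised when the acquisition timeout expires."""


class InstrumentedPool:
    """Fixed-size pool that counts callers currently blocked in acquire()."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")
        self._lock = threading.Lock()
        self.waiting = 0        # gauge: callers blocked right now
        self.peak_waiting = 0   # high-water mark, handy for alerting
        self.timeouts = 0       # counter: acquisitions that gave up

    def acquire(self, timeout_s):
        try:
            return self._free.get_nowait()     # fast path: no waiting at all
        except queue.Empty:
            pass
        with self._lock:
            self.waiting += 1
            self.peak_waiting = max(self.peak_waiting, self.waiting)
        try:
            return self._free.get(timeout=timeout_s)
        except queue.Empty:
            with self._lock:
                self.timeouts += 1
            raise PoolTimeout("pool starved") from None
        finally:
            with self._lock:
                self.waiting -= 1

    def release(self, conn):
        self._free.put(conn)


pool = InstrumentedPool(size=1)
conn = pool.acquire(timeout_s=0.05)
try:
    pool.acquire(timeout_s=0.05)    # starved: counted as a waiter, then as a timeout
except PoolTimeout:
    pass
pool.release(conn)
```

A `waiting` gauge that stays above zero, or a rising `timeouts` counter, is the early sign that the pool is starved, well before the upstream tier saturates.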

| Acquisition timeout | Behaviour on a starved pool | Risk |
| --- | --- | --- |
| 30 s (default) | Every waiter holds a thread for 30 s | Thread starvation, full outage |
| 50 ms (too short) | Normal micro-bursts fail | Error flood under benign load |
| ~250 ms – 2 s (tuned) | Rides bursts, fails fast on real starvation | Frees upstream to shed load |
| None / infinite | Waiters block forever | Permanent deadlock under pressure |

Quiz

A downstream slowdown fills the pool, and within seconds the whole web tier stops accepting requests even though the database is still up. What is the mechanism?

Quiz

Why is a short, deliberate acquisition timeout usually safer than the 30-second default?

Order the steps

Order the cascade when a slow dependency starves a pool with a long acquisition timeout:

  1. A dependency slows, so queries hold pooled connections far longer than usual
  2. The pool drains until no connection is free
  3. New requests enter the wait queue, each pinning a web-server thread
  4. The upstream thread pool fills with waiters and the service stops accepting requests

Recall before you leave
  1. What happens when a request needs a connection but every connection in the pool is busy?
  2. What is the acquisition timeout and how should you set it?
  3. Why is failing fast better than waiting, and how does an empty pool become a feedback loop?

Recap

A fixed pool has a quiet failure mode that lives entirely in its empty state: when every connection is busy, checkout blocks in a wait queue instead of erroring or growing, so a request’s latency becomes wait-for-connection plus query time — and the acquisition timeout decides the worst case. HikariCP’s 30-second default is a trap, because each waiter pins a web-server thread for the full wait, and under a slow dependency the thread pool fills with doomed waiters until a healthy web tier stops accepting requests; too short a timeout instead fails benign micro-bursts. Tuned to roughly one to three times a normal query, it rides bursts yet fails fast on real starvation, freeing upstream capacity to shed load. The empty-pool pile-up is a self-reinforcing loop — slow queries drain the pool, queued requests pin upstream threads, retries add load, the database slows further — so the defence is to bound the wait deliberately and monitor threads-waiting as a first-class metric. Bounding the wait assumes the connections you do hand out are healthy — and the next lesson shows they are not free forever: connections go stale, get killed by the database, and must be aged out and validated before they silently break a request.


Trademarks belong to their respective owners. Editorial reference only.