Bulkheads: isolating failure domains

A breaker reacts after failures accumulate, but a shared thread pool can be drained by one slow dependency before it trips. A bulkhead caps concurrency per dependency so one sick downstream can't take the whole budget. Thread-pool versus semaphore isolation is a real tradeoff.

Your service calls three downstreams — payments, recommendations, and search — from one shared worker pool of 50 threads. Recommendations, the least important of the three, slows to 5 s. Within seconds all 50 threads are parked on recommendations calls, and now payments and search fail too, because there are no threads left to run them. The breaker on recommendations will trip — but only after enough failures accumulate, and by then the damage is done: a non-critical dependency took down a critical one because they drew from the same bucket. A breaker limits how long you call a sick dependency; a bulkhead limits how much of your capacity any one dependency can ever hold.

The shared-budget problem

A breaker is reactive — it trips after a failure rate accumulates over a window. But the cascade can happen before it reacts, because while the breaker is still counting, every in-flight call to the slow dependency is holding a thread from a pool shared with everyone else. If that pool is shared across all dependencies, a single slow downstream can occupy all of it, and now calls to perfectly healthy dependencies fail for lack of a thread to run on. The breaker eventually trips, but the blast radius already covered the whole service.
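The starvation described above can be reproduced in miniature. This is a sketch under stated assumptions, not a model of any real service: the pool is scaled down to 4 workers and the slow call to 0.3 s so it finishes quickly, and the function names are hypothetical stand-ins for real downstream clients.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_recommendations(delay):
    # Hypothetical stand-in for a slow downstream: it just holds a worker.
    time.sleep(delay)
    return "recs"


def call_payments():
    # Hypothetical stand-in for a healthy downstream: it answers at once.
    return "paid"


def payments_latency_on_shared_pool(pool_size=4, slow_delay=0.3):
    """Return (result, seconds waited) for one payments call on a shared pool."""
    with ThreadPoolExecutor(max_workers=pool_size) as shared:
        # Slow recommendations calls occupy every worker in the shared pool.
        for _ in range(pool_size):
            shared.submit(call_recommendations, slow_delay)
        start = time.monotonic()
        # Payments is healthy, but it has to queue behind the slow calls.
        result = shared.submit(call_payments).result()
        return result, time.monotonic() - start
```

Calling `payments_latency_on_shared_pool()` returns the payments result only after roughly `slow_delay` seconds, even though the payments call itself does no waiting. The delay comes entirely from the shared pool having no free worker.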

The name comes from shipbuilding: a hull is divided into watertight bulkhead compartments so that a breach floods one compartment, not the entire ship. The software pattern is identical — partition your resources so a failure is contained to one compartment instead of sinking everything.

Partition the budget per dependency

A bulkhead caps concurrency per dependency. Instead of one pool of 50 threads shared across payments, recommendations, and search, you give each its own bounded allowance — say 20 / 10 / 20. Now if recommendations slows, it can pin at most its own 10; the other 40 stay available for payments and search. Recommendations calls beyond 10 are rejected immediately (or queued briefly), which itself feeds the breaker’s failure signal — but critically, the failure is confined to recommendations. One dependency can no longer spend the whole service’s capacity.
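A minimal sketch of that partitioning, assuming a plain counter per dependency and immediate rejection rather than brief queueing. The `Bulkhead` and `BulkheadFull` names are hypothetical and not taken from any library; only the 20 / 10 / 20 split comes from the text.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a dependency has used up its concurrency allowance."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._permits = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject at once instead of queueing the call.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            # Give the permit back whether the call succeeded or raised.
            self._permits.release()


# One bounded allowance per dependency: the 20 / 10 / 20 split from the text.
bulkheads = {
    "payments": Bulkhead("payments", 20),
    "recommendations": Bulkhead("recommendations", 10),
    "search": Bulkhead("search", 20),
}
```

A caller would wrap each downstream call in `with bulkheads["recommendations"].slot(): ...`. Once ten recommendations calls are in flight, the eleventh raises `BulkheadFull`, while the payments and search allowances are untouched.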

This is the same bounded-concurrency idea as pool sizing, applied for isolation rather than throughput: the limit’s job is not to maximize work but to cap the blast radius of any single dependency.

Two ways to isolate: threads vs semaphores

When you reach for a bulkhead, you face an immediate question: how does the isolation mechanism actually work, and does it stop a hung call from pinning a thread, or only stop new calls from starting? The answer determines which one to use.

There are two implementations, and the choice is a genuine tradeoff:

Thread-pool isolation. Each dependency gets its own dedicated thread pool (Hystrix’s default model, coreSize = 10). The caller hands the work to that pool and waits with its own timeout. The big advantage: because the call runs on a separate thread, the caller can walk away from a hung call — if the dependency blocks forever, the caller’s timeout still fires and frees the caller. The cost is overhead: every call pays a thread hand-off and context switch, and you maintain many pools.
Semaphore isolation. A simple counter caps how many calls run concurrently (resilience4j’s SemaphoreBulkhead, maxConcurrentCalls = 25, maxWaitDuration = 0). The call runs on the caller’s own thread — no hand-off, almost no overhead. The catch: a semaphore can only count; it cannot interrupt a call that is already blocking. If the dependency hangs, the calling thread hangs with it, and the semaphore just stops new calls — it cannot rescue the ones already stuck.

The rule of thumb: use semaphore isolation for fast, in-process, or non-blocking calls where overhead matters and hangs are impossible; use thread-pool isolation for network calls that can hang, because only a separate thread lets you actually abandon a stuck call.
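Thread-pool isolation can be sketched with the standard library, under some assumptions: `ThreadPoolBulkhead`, `BulkheadFull`, and `outcome` are hypothetical names, and the extra semaphore is there because `ThreadPoolExecutor` queues without bound, so the sketch rejects calls itself when every pool thread is busy. This is not how Hystrix is implemented, only an illustration of the same idea.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class BulkheadFull(Exception):
    """Raised when every pool thread for a dependency is already in use."""


class ThreadPoolBulkhead:
    def __init__(self, name, size):
        self._pool = ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
        self._free = threading.BoundedSemaphore(size)

    def call(self, fn, *args, timeout):
        # Reject immediately when all pool threads are occupied.
        if not self._free.acquire(blocking=False):
            raise BulkheadFull("no free pool thread")
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            self._free.release()
            raise
        # The slot is returned when the pool thread finishes,
        # not when the caller gives up waiting.
        future.add_done_callback(lambda _done: self._free.release())
        # The caller waits with its own deadline and is freed when it expires.
        return future.result(timeout=timeout)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def outcome(bulkhead, fn, timeout):
    """Classify one call as 'ok', 'timed_out' or 'rejected' (illustration only)."""
    try:
        bulkhead.call(fn, timeout=timeout)
        return "ok"
    except FutureTimeout:
        return "timed_out"
    except BulkheadFull:
        return "rejected"
```

With a pool of size 1 and a function that blocks, the first call is classified as `timed_out` and the caller moves on, while the next call is `rejected` because the pool thread is still stuck. The caller's thread is protected, but the dependency's own compartment stays full until the stuck call finally returns.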

Both cap concurrency per dependency — but only a separate thread lets the caller walk away from a hung call, which is why thread pools guard network I/O and semaphores guard fast non-blocking calls.

▸Why this works

Why can’t a semaphore protect you from a hanging dependency, when it clearly limits concurrency? Because a semaphore is only a counter — it grants a permit before the call and releases it after. It has no thread of its own and no way to reach into a call that has already started and stop it. If a dependency accepts your request and then never responds, the thread that took the permit sits blocked in the network read indefinitely, still holding the permit. The semaphore faithfully prevents a new call from starting once all permits are taken, so it does cap how many threads can be stuck at once — but it cannot unstick the ones already there. A thread-pool bulkhead can, because the call runs on a pool thread while the caller waits separately with a timeout; when the timeout fires, the caller stops waiting and reclaims its own thread even though the pool thread is still stuck on the dead dependency. The cost of that power is real — a thread hand-off and context switch on every single call, plus the memory and scheduling overhead of many pools — so you do not pay it everywhere. You pay it exactly where calls can hang, which is network I/O, and you use the cheap semaphore everywhere a hang is impossible. The distinction is the same one from the timeouts lesson: only an independent waiter can enforce a deadline on a call it does not control.
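The limits of a semaphore can be seen in a small experiment. This is a sketch, assuming a hang can be simulated with a `threading.Event` that stays unset. The function name and the two-permit size are hypothetical choices for illustration.

```python
import threading


def hung_calls_behind_a_semaphore(permits=2):
    """Return (new_call_admitted, callers_still_stuck, permits_all_returned)."""
    sem = threading.BoundedSemaphore(permits)
    dependency_responds = threading.Event()  # stays unset while "hung"
    entered = threading.Barrier(permits + 1)

    def caller():
        with sem:  # permit is taken on the caller's own thread
            entered.wait()  # report that this caller is inside the call
            dependency_responds.wait()  # stands in for a blocked network read

    callers = [threading.Thread(target=caller) for _ in range(permits)]
    for t in callers:
        t.start()
    entered.wait()  # every caller now holds a permit and is blocked

    # The counter does its job: no further call may start.
    new_call_admitted = sem.acquire(blocking=False)
    if new_call_admitted:
        sem.release()

    # Waiting does not help: nothing can pull the callers out of the call.
    for t in callers:
        t.join(timeout=0.05)
    still_stuck = sum(t.is_alive() for t in callers)

    # Only the dependency answering frees the callers and their permits.
    dependency_responds.set()
    for t in callers:
        t.join()
    taken = [sem.acquire(blocking=False) for _ in range(permits)]
    for ok in taken:
        if ok:
            sem.release()
    return new_call_admitted, still_stuck, all(taken)
```

The function returns `(False, 2, True)` with the default arguments: the semaphore refuses a third call, both callers remain blocked past the short wait, and the permits come back only once the simulated dependency responds.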

Aspect	Thread-pool isolation	Semaphore isolation
Mechanism	Dedicated pool per dependency	Concurrency counter
Runs on	A pool thread (hand-off)	The caller’s own thread
Overhead	Higher (context switch, many pools)	Near-zero
Can abandon a hung call?	Yes: caller times out independently	No: caller blocks with the call
Best for	Network calls that can hang	Fast, in-process, non-blocking calls
Example default	Hystrix coreSize 10	resilience4j maxConcurrentCalls 25

Quiz

Three downstreams share one 50-thread pool. The least important one slows to 5 s and the whole service fails, even though its breaker eventually trips. What does a bulkhead add that the breaker alone doesn't?

Quiz

Why does thread-pool isolation protect against a hanging network call when semaphore isolation does not?

Payments pool 20 threads — critical; isolated from recommendations

Search pool 20 threads — critical; isolated from recommendations

Recommendations pool 10 threads — if slow, only this pool fills (blast radius capped)

Without a bulkhead all 50 threads are shared — slow recommendations starve payments and search before the breaker trips. With per-dependency pools each failure domain is walled off; full load on one cannot drain the others.

key takeaway

A breaker is reactive — it trips only after a failure rate accumulates — so a shared thread pool can be fully drained by one slow dependency before the breaker even reacts, failing calls to healthy dependencies for lack of a thread. A bulkhead fixes this by capping concurrency per dependency: each gets its own bounded allowance, so a slow downstream can pin at most its own share and the rest stay available — failure contained to one compartment, like a ship’s hull. It is bounded concurrency applied for isolation, not throughput. Two implementations trade off: thread-pool isolation (Hystrix coreSize 10) runs each call on a dedicated pool thread so the caller can abandon a hung call via its own timeout, at the cost of a hand-off and context switch per call; semaphore isolation (resilience4j maxConcurrentCalls 25) is a near-zero-overhead counter that runs on the caller’s thread but cannot interrupt a call already blocking. Use semaphores for fast non-blocking calls and thread pools for network calls that can hang.

Recall before you leave

01. Why isn't a circuit breaker enough on its own, and what does a bulkhead add?
02. What is the tradeoff between thread-pool and semaphore isolation?

Recap

A circuit breaker limits how long you keep calling a sick dependency, but it is reactive: it trips only after failures accumulate, and in the meantime a shared thread pool can be fully drained by one slow downstream, failing calls to healthy dependencies for want of a thread. A bulkhead closes that gap by capping concurrency per dependency, the way a ship’s watertight compartments confine a breach: give payments, recommendations, and search their own bounded allowances and a slow one can pin at most its own share, containing the failure to a single compartment. It is bounded concurrency applied for isolation rather than throughput.

The implementation is a real tradeoff: thread-pool isolation runs each call on a dedicated pool thread so the caller can abandon a hung call through its own timeout, paying a hand-off and context switch per call; semaphore isolation is a near-zero-overhead counter on the caller’s own thread that caps concurrency but cannot interrupt a call already blocking. Use semaphores for fast non-blocking work and thread pools for network calls that can hang.

Now when you design a service that calls multiple downstreams, you have two questions to answer per dependency: how long is the breaker allowed to keep calling before it trips, and how much of the shared thread budget can that dependency ever hold? Both questions have answers here. Breaker and bulkhead together limit how long and how much, but neither answers what the caller should return when a call is rejected. The next lesson covers timeouts as the trigger and fallbacks as the answer.

Connected lessons

builds on

What trips it: failure rate, windows, and a volume floor (middle)

unlocks

Timeouts and fallbacks: what to return when it's open (senior)

deepens into

Timeouts and fallbacks: what to return when it's open (senior)
