Backend Architecture
The state machine: closed, open, half-open
“Fail fast when the dependency is sick” sounds simple until you ask the two hard questions: when exactly does it start rejecting calls, and how does it ever start trusting the dependency again? Reject too eagerly and one blip locks out a healthy service; trust too eagerly and you slam a still-broken service the instant the cooldown ends, knocking it back over. A breaker answers both with a small, precise state machine — three states and one timer — and almost every production breaker, from Netflix Hystrix to resilience4j, is a variation on it. Get the states and the transitions right and the rest is tuning.
Three states, four transitions
A circuit breaker is a finite state machine wrapping each call to a dependency:
- Closed — the normal state. Calls pass straight through, and the breaker counts failures. When failures cross the trip condition, it transitions to open and starts a cooldown timer.
- Open — the tripped state. Every call is rejected instantly (an exception or a fallback), without touching the dependency at all. This is the fast-fail from the last lesson. When the cooldown timer expires, it transitions to half-open.
- Half-open — the probing state. A limited number of trial calls are allowed through to test whether the dependency has recovered. If they succeed, the breaker transitions back to closed and resets its counters. If any fails, it goes straight back to open and restarts the cooldown.
That is the whole machine: closed → open on too many failures, open → half-open on the timer, half-open → closed on success, half-open → open on failure. The transitions matter as much as the states, because each one is a decision about how much load to send a dependency in an uncertain state.
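The four transitions can be sketched in a few dozen lines. This is a minimal single-threaded illustration, not how any particular library implements it. The class and parameter names (`CircuitBreaker`, `failure_threshold`, `cooldown_s`, `trial_calls`) are hypothetical, and the trip condition is assumed to be a simple count of consecutive failures.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the dependency."""


class CircuitBreaker:
    # All names and defaults here are illustrative assumptions.
    def __init__(self, failure_threshold=3, cooldown_s=5.0, trial_calls=2,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures to trip
        self.cooldown_s = cooldown_s                # time spent open before probing
        self.trial_calls = trial_calls              # successes needed in half-open
        self.clock = clock                          # injectable for testing
        self.state = State.CLOSED
        self.failures = 0
        self.trial_successes = 0
        self.opened_at = None

    def _trip(self):
        # closed -> open, or half-open -> open: (re)start the cooldown.
        self.state = State.OPEN
        self.opened_at = self.clock()

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("rejected: breaker is open")
            # open -> half-open: cooldown expired, this call becomes a trial.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = fn()
        except Exception:
            if self.state is State.HALF_OPEN:
                self._trip()                        # any trial failure reopens
            else:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self._trip()
            raise
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                # half-open -> closed: trials passed, reset the counters.
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0                       # a success ends the failure streak
        return result
```

A caller wraps each dependency call as `breaker.call(lambda: client.get(url))` (with `client` standing in for whatever makes the request) and treats `CircuitOpenError` as the signal to use a fallback. A concurrent version would also need a lock and a cap on in-flight trial calls, which this sketch leaves out.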
The cooldown timer is the recovery dial
The single most consequential setting is how long the breaker stays open before probing — Hystrix calls it sleepWindowInMilliseconds (default 5 s), resilience4j calls it waitDurationInOpenState (default 60 s). It is a direct trade-off:
- Too short. The breaker probes again almost immediately, before the dependency has had time to recover, so the trial fails and it reopens. Worse, if it flips open → half-open too fast it can oscillate (flap) between states, sending bursts of doomed calls.
- Too long. The dependency recovered seconds ago but the breaker keeps rejecting everyone, turning a short downstream blip into a long self-inflicted outage.
There is no universal right answer; it tracks how long the dependency typically takes to recover. A breaker in front of a service that restarts in ~10 s wants a cooldown near that, not 60 s and not 1 s.
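A back-of-the-envelope model makes the trade-off concrete. It is a sketch under simplifying assumptions: the breaker trips at t = 0, the dependency is healthy again from `recovery_s` onward, traffic is steady so a probe fires the moment each cooldown ends, and a single trial call decides the outcome. The function name `probe_outcome` is hypothetical.

```python
import math


def probe_outcome(cooldown_s, recovery_s):
    """Return (failed_probes, seconds_rejecting_after_recovery).

    Assumes the breaker trips at t=0 and probes at cooldown_s,
    2 * cooldown_s, ...; the first probe at or after recovery_s succeeds.
    """
    probes_needed = max(1, math.ceil(recovery_s / cooldown_s))
    failed_probes = probes_needed - 1
    closed_at = probes_needed * cooldown_s
    return failed_probes, closed_at - recovery_s


# A dependency that takes 10 s to recover, under three cooldown settings.
for cooldown in (1, 10, 60):
    print(cooldown, probe_outcome(cooldown, 10))
```

Under these assumptions a 1 s cooldown spends 9 doomed probes on a service that is still down, a 10 s cooldown closes on the first probe, and a 60 s cooldown keeps rejecting calls for 50 s after the service is already healthy.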
Why half-open exists
The half-open state is the clever part. Without it you would have only two options when the timer fires: stay open for another cooldown on a guess, or close the breaker fully and send all traffic at once. The second is dangerous: a service that just came back is fragile, and a sudden flood of the full backlog can time it out and knock it straight back down. This is the thundering herd on a recovering service.
Half-open solves it by sending only a trickle (resilience4j’s permittedNumberOfCallsInHalfOpenState defaults to 10) and gating the decision on those calls. The recovering service proves itself on a handful of calls before the breaker closes and full traffic resumes. One subtlety: by default resilience4j does not move open → half-open on a timer alone (automaticTransitionFromOpenToHalfOpenEnabled = false); it waits for the next call to arrive after the cooldown, so an idle breaker does not probe a dependency nobody is using.
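Both ideas, the trial budget and the lazy transition, can be modelled in a few lines. This is an illustrative sketch rather than resilience4j's implementation: `HalfOpenGate` and `state_without_traffic` are hypothetical names, and the gate follows the simple rule used in this lesson (all trials succeed to close, any failure reopens). Libraries may decide differently, for example by comparing the trials' failure rate against a threshold.

```python
class HalfOpenGate:
    """Admits a fixed budget of trial calls and decides on their outcomes."""

    def __init__(self, permitted_calls=10):
        self.permitted_calls = permitted_calls
        self.admitted = 0
        self.successes = 0

    def try_acquire(self):
        """Admit a trial call only while the trial budget lasts."""
        if self.admitted >= self.permitted_calls:
            return False  # over budget: reject without touching the dependency
        self.admitted += 1
        return True

    def record(self, success):
        """Record one trial result; return 'closed', 'open', or None if undecided."""
        if not success:
            return "open"      # any failed trial sends the breaker back to open
        self.successes += 1
        if self.successes == self.permitted_calls:
            return "closed"    # every permitted trial succeeded
        return None


def state_without_traffic(elapsed_s, cooldown_s, automatic_transition=False):
    """State of a tripped breaker that has received no calls for elapsed_s."""
    if automatic_transition and elapsed_s >= cooldown_s:
        return "half-open"     # a timer moved it on its own
    return "open"              # lazy: waits for the next call after the cooldown
```

With a budget of 3, the fourth and fifth callers during half-open are rejected just as they would be in the open state, so the recovering service sees at most three requests until the decision is made.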
Why this works
Why a separate half-open state instead of just going closed and watching the failure counter again? Because “go closed” means “send all traffic,” and the moment of recovery is exactly when the dependency can least handle all traffic. A service that just restarted has cold caches, empty connection pools, and possibly a backlog of queued work; full production load on it in the first second is how a recovery becomes a re-failure. Half-open is a controlled, low-stakes experiment: send a handful of calls, and let their outcome — not a guess, and not the full firehose — decide whether the dependency is really healthy. It also makes the decision cheap to reverse: if the trial fails, you have spent only a few calls discovering the dependency is still sick, versus discovering it by overloading it again. The pattern is the same bounded-probe idea you see in TCP slow-start and in cache warming: when you are unsure a resource can take load, you ramp into it with a small test rather than committing everything at once, because the cost of being wrong is asymmetric — a failed probe is cheap, a re-collapse is not.
| State | Calls to dependency | Tracks | Next state | Trigger |
|---|---|---|---|---|
| Closed | All pass through | Failures vs threshold | Open | Failures cross threshold |
| Open | None — instant reject | Cooldown timer | Half-open | Timer expires (or next call after it) |
| Half-open | A few trial calls only | Trial outcomes | Closed / Open | All succeed / any fails |
In which state does a circuit breaker reject every call instantly without touching the dependency at all?
Why does the breaker use a half-open state with only a few trial calls instead of fully reopening when the cooldown ends?
Order the lifecycle of a breaker through a downstream incident and recovery:
- 1 Closed: calls pass, failures climb past the threshold
- 2 Open: every call rejected instantly while the cooldown timer runs
- 3 Half-open: a few trial calls test whether the dependency recovered
- 4 Closed again: trials succeeded, counters reset, full traffic resumes
- 01 What are the three states of a circuit breaker and the transitions between them?
- 02 Why is the open-state cooldown the most consequential setting, and why does half-open exist?
A circuit breaker is a small finite state machine with three states and one timer. Closed is normal: calls pass and failures are counted, and crossing the trip condition moves it to open. Open is tripped: every call is rejected instantly without touching the dependency (the fast-fail from the previous lesson) until the cooldown expires, when it moves to half-open. Half-open probes: a limited number of trial calls test recovery; if all succeed the breaker returns to closed with counters reset, and if any fails it snaps back to open and restarts the cooldown. The cooldown is the recovery dial: too short re-probes a broken service and can flap, too long extends a blip into a self-inflicted outage, and the right value matches the dependency’s real recovery time. Half-open exists to prevent a thundering herd on a fragile, just-recovered service by ramping in with a trickle rather than the full firehose, so a failed probe is cheap and a re-collapse is avoided. The states are settled; the next lesson asks the harder question of what counts as enough failure to trip: failure rate over a sliding window, a minimum-volume floor, and slow calls counted as failures.