
Backend Architecture

The state machine: closed, open, half-open

Crux: a circuit breaker is a three-state machine. Closed passes calls and counts failures, open rejects every call instantly for a cooldown, and half-open lets a few trial calls test recovery. The cooldown timer is the recovery dial; half-open stops a recovering service from being flooded.

“Fail fast when the dependency is sick” sounds simple until you ask the two hard questions: when exactly does the breaker start rejecting calls, and how does it ever start trusting the dependency again? Reject too eagerly and one blip locks out a healthy service; trust too eagerly and you slam a still-broken service the instant the cooldown ends, knocking it back over. A breaker answers both with a small, precise state machine (three states and one timer), and almost every production breaker, from Netflix Hystrix to resilience4j, is a variation on it. Get the states and the transitions right and the rest is tuning.

Three states, four transitions

A circuit breaker is a finite state machine wrapping each call to a dependency:

  • Closed — the normal state. Calls pass straight through, and the breaker counts failures. When failures cross the trip condition, it transitions to open and starts a cooldown timer.
  • Open — the tripped state. Every call is rejected instantly (an exception or a fallback), without touching the dependency at all. This is the fast-fail from the last lesson. When the cooldown timer expires, it transitions to half-open.
  • Half-open — the probing state. A limited number of trial calls are allowed through to test whether the dependency has recovered. If they succeed, the breaker transitions back to closed and resets its counters. If any fails, it goes straight back to open and restarts the cooldown.

That is the whole machine: closed → open on too many failures, open → half-open on the timer, half-open → closed on success, half-open → open on failure. The transitions matter as much as the states, because each one is a decision about how much load to send a dependency in an uncertain state.
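The four transitions can be written down as a lookup table. The sketch below is illustrative only: the `State` and `Event` names, `next_state`, and `replay` are hypothetical and do not come from Hystrix or resilience4j. It shows that any (state, event) pair outside the four listed transitions leaves the breaker where it is.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class Event(Enum):
    # Hypothetical event names, chosen to mirror the four transitions.
    FAILURES_CROSS_THRESHOLD = "failures cross threshold"
    COOLDOWN_EXPIRED = "cooldown expired"
    TRIALS_SUCCEEDED = "trial calls succeeded"
    TRIAL_FAILED = "trial call failed"


# The four legal transitions; every other (state, event) pair is a no-op.
TRANSITIONS = {
    (State.CLOSED, Event.FAILURES_CROSS_THRESHOLD): State.OPEN,
    (State.OPEN, Event.COOLDOWN_EXPIRED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TRIALS_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TRIAL_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state after `event`, staying put if the pair is not a transition."""
    return TRANSITIONS.get((state, event), state)


def replay(events, start=State.CLOSED):
    """Fold a sequence of events into the list of states visited."""
    states = [start]
    for event in events:
        states.append(next_state(states[-1], event))
    return states


# An incident where the first probe fails and the second succeeds.
incident = [
    Event.FAILURES_CROSS_THRESHOLD,
    Event.COOLDOWN_EXPIRED,
    Event.TRIAL_FAILED,
    Event.COOLDOWN_EXPIRED,
    Event.TRIALS_SUCCEEDED,
]
print([s.value for s in replay(incident)])
# ['closed', 'open', 'half-open', 'open', 'half-open', 'closed']
```

Keeping the transitions in one table makes it easy to see that there is no direct open → closed edge: the only way back to full traffic runs through half-open.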

The cooldown timer is the recovery dial

The single most consequential setting is how long the breaker stays open before probing — Hystrix calls it sleepWindowInMilliseconds (default 5 s), resilience4j calls it waitDurationInOpenState (default 60 s). It is a direct trade-off:

  • Too short. The breaker probes again almost immediately, before the dependency has had time to recover, so the trial fails and it reopens. Worse, if it flips open → half-open too fast it can oscillate (flap) between states, sending bursts of doomed calls.
  • Too long. The dependency recovered seconds ago but the breaker keeps rejecting everyone, turning a short downstream blip into a long self-inflicted outage.

There is no universal right answer; it tracks how long the dependency typically takes to recover. A breaker in front of a service that restarts in ~10 s wants a cooldown near that, not 60 s and not 1 s.
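The trade-off can be made concrete with a little arithmetic. The function below is a deliberately idealized model, not a measurement: it assumes the dependency becomes healthy exactly `recovery_s` seconds after the trip, that probes are instantaneous, that traffic is continuous so a probe fires the moment each cooldown ends, and that a failed probe restarts the full cooldown. The name `probe_outcome` is hypothetical.

```python
import math


def probe_outcome(recovery_s, cooldown_s):
    """Return (failed_probes, seconds_rejected_after_recovery) for one incident.

    Idealized assumptions: recovery happens exactly `recovery_s` seconds
    after the trip, probes take no time, a probe fires as soon as each
    cooldown ends, and a failed probe restarts the full cooldown.
    """
    # Probes land at cooldown_s, 2 * cooldown_s, ...; the first one at or
    # after recovery_s succeeds.
    probes_needed = max(1, math.ceil(recovery_s / cooldown_s))
    failed_probes = probes_needed - 1
    first_success_at = probes_needed * cooldown_s
    return failed_probes, first_success_at - recovery_s


# A dependency that takes about 10 s to restart, under three cooldowns.
for cooldown in (1, 10, 60):
    failed, extra = probe_outcome(10, cooldown)
    print(f"cooldown={cooldown}s failed_probes={failed} extra_outage={extra}s")
# cooldown=1s failed_probes=9 extra_outage=0s
# cooldown=10s failed_probes=0 extra_outage=50s is NOT printed; see below
```

With these inputs the three results are (9, 0) for a 1 s cooldown, (0, 0) for 10 s, and (0, 50) for 60 s: the short cooldown wastes nine doomed probe rounds, the long one keeps rejecting for 50 s after the dependency is healthy, and the matched one does neither.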

Why half-open exists

The half-open state is the clever part. Without it you would have only two options when the timer fires: stay open and keep guessing, or close the breaker and send all traffic at once. The second is dangerous: a service that just came back is fragile, and a sudden flood of the full backlog can time it out and knock it straight back down. This is the thundering herd on a recovering service.

Half-open solves it by sending only a trickle (resilience4j’s permittedNumberOfCallsInHalfOpenState defaults to 10) and gating the decision on those. The recovering service proves itself on a handful of calls before the breaker closes and full traffic resumes. In resilience4j the trial calls are judged by their failure rate against the configured threshold, so a single failure among them does not necessarily send the breaker back to open. One subtlety: by default resilience4j does not move open → half-open on a timer alone (automaticTransitionFromOpenToHalfOpenEnabled = false); it waits for the next call to arrive after the cooldown, so an idle breaker does not probe a dependency nobody is using.
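A minimal sketch of the trickle and the lazy transition, under stated assumptions: the class and its parameter names are hypothetical and are not the resilience4j API, the trip rule (consecutive failures) is a placeholder, the half-open rule is the simple "all trials succeed, any failure reopens" policy, and the code is single-threaded with no locking. The clock is injected so the example runs without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """The breaker rejected the call without touching the dependency."""


class CircuitBreaker:
    """Single-threaded sketch; a concurrent version would need a lock."""

    def __init__(self, failure_threshold=5, cooldown_s=10.0,
                 trial_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # placeholder trip rule
        self.cooldown_s = cooldown_s
        self.trial_calls = trial_calls
        self.clock = clock  # injectable so tests need not sleep
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.trials_started = 0
        self.trials_succeeded = 0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("open: cooling down")
            # Lazy transition: nothing happens when the timer expires;
            # the first call to arrive afterwards moves us to half-open.
            self.state = "half-open"
            self.trials_started = 0
            self.trials_succeeded = 0
        if self.state == "half-open":
            # Only reachable when calls overlap (for example, threads):
            # once the trial slots are taken, everyone else is rejected.
            if self.trials_started >= self.trial_calls:
                raise CircuitOpenError("half-open: trial slots taken")
            self.trials_started += 1
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()  # (re)start the cooldown

    def _on_failure(self):
        if self.state == "half-open":
            self._trip()  # any failed trial reopens immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _on_success(self):
        if self.state == "half-open":
            self.trials_succeeded += 1
            if self.trials_succeeded >= self.trial_calls:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # consecutive-failure counting


def broken():
    raise RuntimeError("dependency down")


now = [0.0]  # fake clock, advanced by hand
breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10.0,
                         trial_calls=2, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(broken)
    except RuntimeError:
        pass
print(breaker.state)        # open
now[0] = 15.0
print(breaker.state)        # open (cooldown is over, but nobody has called)
breaker.call(lambda: "ok")
print(breaker.state)        # half-open
breaker.call(lambda: "ok")
print(breaker.state)        # closed
```

The second `print` is the lazy transition in action: the cooldown has expired, yet the state only changes when a caller actually shows up.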

Why this works

Why a separate half-open state instead of just going closed and watching the failure counter again? Because “go closed” means “send all traffic,” and the moment of recovery is exactly when the dependency can least handle all traffic. A service that just restarted has cold caches, empty connection pools, and possibly a backlog of queued work; full production load on it in the first second is how a recovery becomes a re-failure. Half-open is a controlled, low-stakes experiment: send a handful of calls, and let their outcome — not a guess, and not the full firehose — decide whether the dependency is really healthy. It also makes the decision cheap to reverse: if the trial fails, you have spent only a few calls discovering the dependency is still sick, versus discovering it by overloading it again. The pattern is the same bounded-probe idea you see in TCP slow-start and in cache warming: when you are unsure a resource can take load, you ramp into it with a small test rather than committing everything at once, because the cost of being wrong is asymmetric — a failed probe is cheap, a re-collapse is not.

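State     | Calls to dependency    | Tracks                | Exits to      | Trigger
----------|------------------------|-----------------------|---------------|---------------------------------------
Closed    | All pass through       | Failures vs threshold | Open          | Failures cross threshold
Open      | None (instant reject)  | Cooldown timer        | Half-open     | Timer expires (or next call after it)
Half-open | A few trial calls only | Trial outcomes        | Closed / Open | All succeed / any fails
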
Quiz

In which state does a circuit breaker reject every call instantly without touching the dependency at all?

Quiz

Why does the breaker use a half-open state with only a few trial calls instead of fully reopening when the cooldown ends?

Order the steps

Order the lifecycle of a breaker through a downstream incident and recovery:

  1. Closed: calls pass, failures climb past the threshold
  2. Open: every call rejected instantly while the cooldown timer runs
  3. Half-open: a few trial calls test whether the dependency recovered
  4. Closed again: trials succeeded, counters reset, full traffic resumes
Recall before you leave
  1. What are the three states of a circuit breaker and the transitions between them?
  2. Why is the open-state cooldown the most consequential setting, and why does half-open exist?
Recap

A circuit breaker is a small finite state machine with three states and one timer. Closed is normal: calls pass and failures are counted, and crossing the trip condition moves it to open. Open is tripped: every call is rejected instantly without touching the dependency (the fast-fail from the previous lesson) until a cooldown expires, when it moves to half-open. Half-open probes: a limited number of trial calls test recovery; if all succeed the breaker returns to closed with counters reset, and if any fails it snaps back to open and restarts the cooldown. The cooldown is the recovery dial: too short re-probes a broken service and can flap, too long extends a blip into a self-inflicted outage, and the right value matches the dependency’s real recovery time. Half-open exists to prevent a thundering herd on a fragile, just-recovered service by ramping in with a trickle rather than the full firehose, so a failed probe is cheap and a re-collapse is avoided. The states are settled; the next lesson asks the harder question of what counts as enough failure to trip: failure rate over a sliding window, a minimum-volume floor, and slow calls counted as failures.


Trademarks belong to their respective owners. Editorial reference only.