The state machine: closed, open, half-open

A circuit breaker is a three-state machine: closed passes calls and counts failures, open rejects every call instantly for a cooldown, half-open lets a few trial calls test recovery. The cooldown timer is the recovery dial; half-open stops a recovering service being flooded.

“Fail fast when the dependency is sick” sounds simple until you ask the two hard questions: when exactly does it start rejecting calls, and how does it ever start trusting the dependency again? Reject too eagerly and one blip locks out a healthy service; trust too eagerly and you slam a still-broken service the instant the cooldown ends, knocking it back over. A breaker answers both with a small, precise state machine — three states and one timer — and almost every production breaker, from Netflix Hystrix to resilience4j, is a variation on it. Get the states and the transitions right and the rest is tuning.

Three states, four transitions

A circuit breaker is a finite state machine wrapping each call to a dependency:

Closed — the normal state. Calls pass straight through, and the breaker counts failures. When failures cross the trip condition, it transitions to open and starts a cooldown timer.
Open — the tripped state. Every call is rejected instantly (an exception or a fallback), without touching the dependency at all. This is the fast-fail from the last lesson. When the cooldown timer expires, it transitions to half-open.
Half-open — the probing state. A limited number of trial calls are allowed through to test whether the dependency has recovered. If they succeed, the breaker transitions back to closed and resets its counters. If any fails, it goes straight back to open and restarts the cooldown.

Together, these three states encode a single discipline: before committing full traffic to a dependency, make sure it can handle it, and if you’re not sure, err on the side of protecting the caller. Drop the open state and callers keep paying the full cost of every call to a sick dependency; drop the half-open probe and the first moment of recovery receives the full flood.

That is the whole machine: closed → open on too many failures, open → half-open on the timer, half-open → closed on success, half-open → open on failure. The transitions matter as much as the states, because each one is a decision about how much load to send a dependency in an uncertain state.
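
The four transitions can be written down as a lookup table. What follows is a minimal Python sketch, not any library's implementation: the State and Event names and the next_state and run helpers are hypothetical, chosen only to mirror the wording above, and the trip condition and the timer are reduced to named events.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class Event(Enum):
    FAILURES_CROSS_THRESHOLD = "failures cross threshold"
    COOLDOWN_EXPIRED = "cooldown expired"
    TRIALS_SUCCEEDED = "trial calls succeeded"
    TRIAL_FAILED = "a trial call failed"


# The four transitions from the text. Any other (state, event) pair is
# treated as a no-op in this sketch (an assumption, not a library rule).
TRANSITIONS = {
    (State.CLOSED, Event.FAILURES_CROSS_THRESHOLD): State.OPEN,
    (State.OPEN, Event.COOLDOWN_EXPIRED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TRIALS_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TRIAL_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state after `event`, staying put if no transition applies."""
    return TRANSITIONS.get((state, event), state)


def run(events: list[Event], start: State = State.CLOSED) -> list[State]:
    """Replay a sequence of events and return every state visited, start included."""
    states = [start]
    for event in events:
        states.append(next_state(states[-1], event))
    return states
```

Replaying a threshold trip, a cooldown expiry, a failed trial, a second cooldown expiry, and successful trials walks closed → open → half-open → open → half-open → closed. Ignoring unexpected events is one possible design choice; a stricter version could raise on them instead.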

The cooldown timer is the recovery dial

The single most consequential setting is how long the breaker stays open before probing — Hystrix calls it sleepWindowInMilliseconds (default 5 s), resilience4j calls it waitDurationInOpenState (default 60 s). It is a direct trade-off:

Too short. The breaker probes again almost immediately, before the dependency has had time to recover, so the trial fails and it reopens. Worse, if it flips open → half-open too fast it can oscillate (flap) between states, sending bursts of doomed calls.
Too long. The dependency recovered seconds ago but the breaker keeps rejecting everyone, turning a short downstream blip into a long self-inflicted outage.

There is no universal right answer; it tracks how long the dependency typically takes to recover. A breaker in front of a service that restarts in ~10 s wants a cooldown near that, not 60 s and not 1 s.

The cooldown is the recovery dial, and both extremes are wrong: too short re-probes a still-broken service and can flap, too long turns a brief blip into a long self-inflicted outage. Tune it to the dependency's real recovery time.
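
The trade-off can be put into rough numbers. The sketch below assumes a deliberately simple model that is not taken from any library: the breaker opens at t = 0, the dependency becomes healthy at exactly recovery_s, traffic is steady enough that one trial call fires each time the cooldown expires, and a failed trial restarts the same cooldown. The function name probe_outcome and its parameters are hypothetical.

```python
import math


def probe_outcome(recovery_s: float, cooldown_s: float) -> tuple[int, float]:
    """Return (failed_probes, extra_outage_s) for one incident.

    Assumed model: the breaker opens at t=0, the dependency is healthy again
    at t=recovery_s, and one trial call fires at t = cooldown_s,
    2*cooldown_s, ... until one of them succeeds.
    """
    # The first probe that lands at or after the recovery time succeeds.
    probes_needed = max(1, math.ceil(recovery_s / cooldown_s))
    failed_probes = probes_needed - 1
    # Time spent rejecting calls after the dependency was already healthy.
    extra_outage_s = probes_needed * cooldown_s - recovery_s
    return failed_probes, extra_outage_s


# The ~10 s restart from the text, tried against three cooldown settings.
for cooldown in (1, 10, 60):
    failed, extra = probe_outcome(recovery_s=10, cooldown_s=cooldown)
    print(f"cooldown={cooldown}s failed_probes={failed} extra_outage={extra}s")
```

Under these assumptions a 1 s cooldown burns nine doomed probes before the tenth succeeds, a 10 s cooldown recovers on the first probe with no added outage, and a 60 s cooldown also succeeds first time but keeps rejecting callers for 50 s after the dependency is healthy.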

Why half-open exists

The half-open state is the clever part. Without it you would have only two options when the timer fires: stay open and keep rejecting on a guess, or close fully and send all traffic at once. The second is dangerous: a service that just came back is fragile, and a sudden flood of the full backlog can time it out and knock it straight back down. This is the thundering herd on a recovering service.

Half-open solves it by sending only a trickle (resilience4j’s permittedNumberOfCallsInHalfOpenState defaults to 10) and gating the decision on those calls. The recovering service proves itself on a handful of calls before the breaker closes and full traffic resumes. Two subtleties in resilience4j: it judges the trial calls by their failure rate against the configured threshold rather than reopening on the first error, and by default it does not move open → half-open on a timer alone (automaticTransitionFromOpenToHalfOpenEnabled = false); it waits for the next call to arrive after the cooldown, so an idle breaker does not probe a dependency nobody is using.
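
A hand-rolled sketch can make the trickle and the lazy transition concrete. It is an illustration under stated assumptions, not resilience4j's or Hystrix's code: the Breaker class, CircuitOpenError, and the failure_threshold, cooldown_s, trial_calls and clock parameters are all hypothetical names, the trip condition is a plain count of consecutive failures, any failed trial reopens the breaker, and the code is single-threaded, so it does not cap concurrent trial calls the way a production breaker would.

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the dependency."""


class Breaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 10.0,
                 trial_calls: int = 2,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.trial_calls = trial_calls
        self.clock = clock  # injectable so the cooldown can be tested
        self.state = "closed"
        self.failures = 0          # consecutive failures while closed
        self.trial_successes = 0   # successes seen while half-open
        self.opened_at = 0.0

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()  # start or restart the cooldown

    def call(self, fn: Callable[[], object]) -> object:
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("cooldown still running")
            # Lazy transition: nothing happens when the timer expires; the
            # first call to arrive afterwards becomes the first trial call.
            self.state = "half-open"
            self.trial_successes = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Any failed trial reopens; while closed, wait for the threshold.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self._open()
            raise
        if self.state == "half-open":
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = "closed"  # trials passed: resume full traffic
                self.failures = 0
        else:
            self.failures = 0  # a success while closed resets the streak
        return result
```

Passing a fake clock (for example `clock=lambda: now[0]`) lets a test advance time by hand: before the cooldown has elapsed every call raises CircuitOpenError, and the first call after it is let through as a trial.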

Why this works

Why a separate half-open state instead of just going closed and watching the failure counter again? Because “go closed” means “send all traffic,” and the moment of recovery is exactly when the dependency can least handle all traffic. A service that just restarted has cold caches, empty connection pools, and possibly a backlog of queued work; full production load on it in the first second is how a recovery becomes a re-failure. Half-open is a controlled, low-stakes experiment: send a handful of calls, and let their outcome — not a guess, and not the full firehose — decide whether the dependency is really healthy. It also makes the decision cheap to reverse: if the trial fails, you have spent only a few calls discovering the dependency is still sick, versus discovering it by overloading it again. The pattern is the same bounded-probe idea you see in TCP slow-start and in cache warming: when you are unsure a resource can take load, you ramp into it with a small test rather than committing everything at once, because the cost of being wrong is asymmetric — a failed probe is cheap, a re-collapse is not.

State	Calls to dependency	What it tracks	Exits to	Trigger
Closed	All pass through	Failures vs threshold	Open	Failures cross threshold
Open	None — instant reject	Cooldown timer	Half-open	Timer expires (or next call after it)
Half-open	A few trial calls only	Trial outcomes	Closed / Open	All succeed / any fails

Quiz

In which state does a circuit breaker reject every call instantly without touching the dependency at all?

Quiz

Why does the breaker use a half-open state with only a few trial calls instead of fully reopening when the cooldown ends?

Order the steps

Order the lifecycle of a breaker through a downstream incident and recovery:

1 Closed: calls pass, failures climb past the threshold
2 Open: every call rejected instantly while the cooldown timer runs
3 Half-open: a few trial calls test whether the dependency recovered
4 Closed again: trials succeeded, counters reset, full traffic resumes

Four transitions: threshold trips closed→open, cooldown timer moves open→half-open, trial success returns half-open→closed, any trial failure snaps half-open→open and restarts the cooldown.

Key takeaway

A circuit breaker is a three-state machine wrapping each call. Closed is normal — calls pass and failures are counted; crossing the trip condition moves it to open. Open is tripped — every call is rejected instantly without touching the dependency, until a cooldown timer expires. Half-open is probing — a limited number of trial calls test recovery; all succeed and it returns to closed with counters reset, any fails and it snaps back to open and restarts the cooldown. The cooldown (Hystrix sleepWindow ~5 s, resilience4j waitDurationInOpenState ~60 s) is the recovery dial: too short re-probes a still-broken service and can flap, too long extends a downstream blip into a self-inflicted outage, and the right value tracks the dependency’s real recovery time. Half-open exists to stop a thundering herd on a fragile, just-recovered service: it ramps in with a trickle (resilience4j permits 10 trial calls) so a failed probe is cheap while a full-traffic re-collapse is not.

Recall before you leave

01
What are the three states of a circuit breaker and the transitions between them?
02
Why is the open-state cooldown the most consequential setting, and why does half-open exist?

Recap

A circuit breaker is a small finite state machine with three states and one timer. Closed is normal: calls pass and failures are counted, and crossing the trip condition moves it to open. Open is tripped: every call is rejected instantly without touching the dependency — the fast-fail from the previous lesson — until a cooldown expires, when it moves to half-open. Half-open probes: a limited number of trial calls test recovery, all succeeding returns it to closed with counters reset and any failing snaps it back to open and restarts the cooldown. The cooldown is the recovery dial — too short re-probes a broken service and can flap, too long extends a blip into a self-inflicted outage, and the right value matches the dependency’s real recovery time. Half-open exists to prevent a thundering herd on a fragile, just-recovered service by ramping in with a trickle rather than the full firehose, so a failed probe is cheap and a re-collapse is avoided. Now when you encounter a library like resilience4j or Hystrix, you have a mental model for every configuration parameter you will see — they are all dials on one of these three states or one of four transitions. The states are settled; the next lesson asks the harder question of what counts as enough failure to trip — failure rate over a sliding window, a minimum-volume floor, and slow calls counted as failures.
