
Backend Architecture

What trips it: failure rate, windows, and a volume floor

Crux: A breaker should neither trip on a single failure nor stay closed through a meltdown. What trips it is a failure rate over a sliding window, gated by a minimum-volume floor so it ignores noise, with slow calls counted as failures, because slow is the state that hurts.

A naive breaker trips on the first failed call. A single timeout at 3 a.m., a one-off network blip, opens the breaker, and every user is then fast-failed for the full cooldown over one transient error. Overcorrect the other way and a breaker that needs “50 failures” within a short window never trips on a low-traffic endpoint that sees only 10 requests a minute, so it sits closed through a total meltdown. The trip condition is not a single number; it is a rate over a window with a volume floor, and getting those two right is the difference between a breaker that protects you and one that either cries wolf or sleeps through the fire.

A rate, not a count

The trip condition is a failure rate, not an absolute number of failures: resilience4j’s failureRateThreshold defaults to 50%, Hystrix’s errorThresholdPercentage to 50. Once the rate of failing calls in the current window crosses the threshold, the breaker opens. Using a rate rather than a raw count is what makes the breaker scale-invariant — 50% failures is meaningful whether the endpoint sees 10 calls a second or 10,000, while “50 failures” means very different things at those two volumes.
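To make the scale-invariance point concrete, here is a minimal Python sketch that compares a raw-count rule with a rate rule on two hypothetical endpoints. The function names and the traffic figures are illustrative assumptions, not any library's API; only the 50% threshold comes from the defaults quoted above.

```python
def failure_rate(failures: int, total: int) -> float:
    """Share of failed calls as a percentage; 0.0 when no calls were seen."""
    if total == 0:
        return 0.0
    return 100.0 * failures / total


def trips_on_count(failures: int, count_threshold: int = 50) -> bool:
    """Hypothetical raw-count rule: trip once N failures have been seen."""
    return failures >= count_threshold


def trips_on_rate(failures: int, total: int, rate_threshold: float = 50.0) -> bool:
    """Rate rule: trip once the failure percentage reaches the threshold."""
    return failure_rate(failures, total) >= rate_threshold


# Illustrative numbers: a quiet endpoint in total meltdown, and a busy,
# basically healthy endpoint with a small background error rate.
quiet_failures, quiet_total = 10, 10
busy_failures, busy_total = 60, 10_000

count_verdicts = (trips_on_count(quiet_failures), trips_on_count(busy_failures))
rate_verdicts = (
    trips_on_rate(quiet_failures, quiet_total),
    trips_on_rate(busy_failures, busy_total),
)
```

In this sketch the count rule gets both cases backwards: it ignores the quiet endpoint that is failing every call and trips on the busy one that is failing 0.6% of them. The rate rule gives the opposite, sensible verdicts.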

The window: count-based vs time-based

The rate is measured over a sliding window of recent calls, and there are two ways to define “recent”:

  • Count-based — the last N calls. resilience4j’s default is slidingWindowType = COUNT_BASED with slidingWindowSize = 100, so it judges the last 100 calls. Simple and predictable, but on a quiet endpoint those 100 calls might span a long time, so the window reacts slowly.
  • Time-based — all calls in the last T seconds, usually split into buckets. Hystrix uses a 10 s rolling window divided into 10 one-second buckets, rolling forward each second. This reacts in bounded wall-clock time regardless of traffic, which is what you usually want for a latency-sensitive breaker.
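A count-based window can be sketched with a bounded queue: once it is full, every new outcome pushes the oldest one out. This is a simplified illustration in Python, not resilience4j's implementation; the class and method names are hypothetical.

```python
from collections import deque


class CountBasedWindow:
    """Keeps only the outcomes of the last `size` calls."""

    def __init__(self, size: int = 100) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._outcomes = deque(maxlen=size)

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    def total(self) -> int:
        return len(self._outcomes)

    def failure_rate(self) -> float:
        """Failure percentage over the calls currently in the window."""
        if not self._outcomes:
            return 0.0
        return 100.0 * sum(self._outcomes) / len(self._outcomes)


window = CountBasedWindow(size=4)
for failed in (True, True, False, False):
    window.record(failed)
rate_when_full = window.failure_rate()

# Two more successes push the two old failures out of the window.
window.record(False)
window.record(False)
rate_after_slide = window.failure_rate()
```

Note that nothing here depends on the clock: on a quiet endpoint the two failures would stay in the window until enough new calls arrive to displace them, which is the slow reaction described in the count-based bullet.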

Either way the window slides: old calls age out, so the breaker reflects the dependency’s recent health, not its lifetime average. A dependency that failed an hour ago and has been fine since should not be holding the breaker open.
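The ageing-out behaviour of a time-based window can be sketched with one-second buckets that are discarded once they fall outside the window. This is a simplified Python illustration, not Hystrix's implementation; the class name is hypothetical, and the clock is passed in explicitly as `now` (seconds) so the behaviour is easy to follow.

```python
class TimeBasedWindow:
    """Aggregates calls into one-second buckets covering the last N seconds."""

    def __init__(self, window_seconds: int = 10) -> None:
        self._window = window_seconds
        self._buckets = {}  # whole second -> [total calls, failed calls]

    def _evict(self, current_second: int) -> None:
        # Keep only buckets inside [current_second - window + 1, current_second].
        oldest_allowed = current_second - self._window + 1
        for second in [s for s in self._buckets if s < oldest_allowed]:
            del self._buckets[second]

    def record(self, failed: bool, now: float) -> None:
        second = int(now)
        bucket = self._buckets.setdefault(second, [0, 0])
        bucket[0] += 1
        if failed:
            bucket[1] += 1
        self._evict(second)

    def failure_rate(self, now: float) -> float:
        self._evict(int(now))
        total = sum(b[0] for b in self._buckets.values())
        failures = sum(b[1] for b in self._buckets.values())
        return 100.0 * failures / total if total else 0.0


window = TimeBasedWindow(window_seconds=10)
for _ in range(3):
    window.record(failed=True, now=0.0)
window.record(failed=False, now=5.0)

rate_at_9s = window.failure_rate(now=9.0)    # the failures at t=0 still count
rate_at_10s = window.failure_rate(now=10.0)  # the t=0 bucket has aged out
```

The three early failures dominate the rate for as long as their bucket is inside the window, then vanish in a single step when it rolls off, regardless of how much traffic arrived in between.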

The volume floor stops false trips

A rate alone is dangerous at low traffic: 1 failure out of 1 call is 100%, which would trip instantly on a single blip. The fix is a minimum-volume floor — a breaker must see at least M calls in the window before it is even allowed to trip. Hystrix’s requestVolumeThreshold defaults to 20; resilience4j’s minimumNumberOfCalls to 100. Below the floor the breaker stays closed no matter the rate, because a handful of failures is not a statistically meaningful signal. This is the single most important guard against a breaker that flaps on noise.
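The floor is a single extra check in front of the rate comparison. The sketch below is an assumption-laden illustration in Python: `should_trip` is a hypothetical function, and the default floor of 20 simply mirrors the Hystrix figure quoted above.

```python
def should_trip(
    failures: int,
    total: int,
    rate_threshold: float = 50.0,
    minimum_calls: int = 20,
) -> bool:
    """Trip only when the window holds enough calls AND the rate is too high."""
    if total == 0 or total < minimum_calls:
        # Below the floor the rate is not trusted, whatever its value.
        return False
    return 100.0 * failures / total >= rate_threshold


single_blip = should_trip(failures=1, total=1)          # 100% of one call
just_below_floor = should_trip(failures=19, total=19)   # 100%, but 19 < 20
at_floor_and_bad = should_trip(failures=10, total=20)   # 50% of 20 calls
at_floor_and_ok = should_trip(failures=9, total=20)     # 45% of 20 calls
```

The first two cases show the floor overriding a 100% failure rate; only once the window holds 20 calls does the rate threshold get a say.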

Slow is a failure too

The subtlest rule: a call that succeeds slowly should often count as a failure. From the first lesson, slow is the dangerous state — a dependency answering in 5 s is doing as much damage as one that errors. So mature breakers track a slow-call rate separately: resilience4j’s slowCallDurationThreshold defines what “slow” means and slowCallRateThreshold (default 100%) is the rate of slow calls that trips the breaker independently of outright errors. Without this, a dependency that never errors but crawls would keep the breaker closed while it starves your threads — exactly the failure mode the breaker was bought to prevent.
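A slow-call rate can be tracked next to the failure rate and checked independently. The Python sketch below is illustrative only: `CallOutcome` and `evaluate_window` are hypothetical names, the defaults mirror the resilience4j figures cited in this lesson (100% and 60 s), the tuned values (50% and 2 s) are arbitrary assumptions, and the minimum-volume check is left out to keep the focus on slowness.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CallOutcome:
    duration_seconds: float
    failed: bool


def evaluate_window(
    calls: List[CallOutcome],
    failure_rate_threshold: float = 50.0,
    slow_call_rate_threshold: float = 100.0,
    slow_call_duration_seconds: float = 60.0,
) -> Tuple[float, float, bool]:
    """Return (failure rate %, slow-call rate %, whether the breaker trips)."""
    total = len(calls)
    if total == 0:
        return 0.0, 0.0, False
    failures = sum(1 for c in calls if c.failed)
    # A call is slow if it ran longer than the threshold, even if it succeeded.
    slow = sum(1 for c in calls if c.duration_seconds > slow_call_duration_seconds)
    failure_rate = 100.0 * failures / total
    slow_rate = 100.0 * slow / total
    trips = (
        failure_rate >= failure_rate_threshold
        or slow_rate >= slow_call_rate_threshold
    )
    return failure_rate, slow_rate, trips


# A dependency that never errors but answers every call in 5 seconds.
crawling = [CallOutcome(duration_seconds=5.0, failed=False) for _ in range(20)]

with_defaults = evaluate_window(crawling)
tuned = evaluate_window(
    crawling, slow_call_rate_threshold=50.0, slow_call_duration_seconds=2.0
)
```

With the default-like thresholds the crawling dependency registers neither failures nor slow calls, so the breaker stays closed; with a 2 s definition of slow, every call counts and the slow-call rate alone trips it. The thresholds have to be set against the latency the caller can actually tolerate.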

Why this works

Why gate the breaker on a minimum volume at all? Isn’t a 100% failure rate always a real problem?

Because at low volume a “rate” is statistical noise, not a measurement. One failed call out of one is 100%, but it tells you almost nothing: it could be a transient packet drop, a single slow GC pause downstream, or a one-off deploy hiccup. Tripping on it punishes every subsequent user for a cooldown over a sample size of one. The minimum-volume floor is a confidence requirement: only act on the failure rate once you have seen enough calls that the rate is meaningful. It is the same reason you do not conclude a coin is biased after one flip.

The cost structure is asymmetric, and that informs the numbers. A false trip on a healthy dependency is a self-inflicted outage, while waiting for a few more calls before tripping costs only those few calls. A dependency that is truly broken will keep failing and, at any real traffic level, cross the floor in moments anyway. So the floor buys real protection against flapping at almost no cost in reaction time, which is why both Hystrix and resilience4j ship with one. On low-traffic endpoints, where the default floor may never be reached inside the window, tuning it matters more than the rate threshold itself.
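The coin-flip intuition can be put into numbers with a binomial calculation. The sketch below assumes, purely for illustration, a healthy dependency whose calls fail independently 2% of the time, and asks how likely a window of n calls is to show a failure rate of 50% or more by chance. The function name and the 2% figure are hypothetical.

```python
from math import ceil, comb


def false_trip_probability(
    calls: int, error_probability: float, rate_threshold: float = 0.5
) -> float:
    """P(observed failure rate >= threshold) for independent calls."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    # Smallest number of failures that reaches the threshold rate.
    needed = ceil(rate_threshold * calls)
    p = error_probability
    return sum(
        comb(calls, k) * p**k * (1 - p) ** (calls - k)
        for k in range(needed, calls + 1)
    )


chance_with_one_call = false_trip_probability(calls=1, error_probability=0.02)
chance_with_twenty_calls = false_trip_probability(calls=20, error_probability=0.02)
```

Under these assumptions a window of one call shows a "50% or worse" rate 2% of the time, while a window of 20 calls does so with a probability on the order of 10^-12. Real failures are rarely independent, so treat the exact figures as an illustration of the trend rather than a prediction.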

| Setting | Hystrix default | resilience4j default | What it controls |
| --- | --- | --- | --- |
| Failure-rate threshold | 50% | 50% | Rate of failures that trips the breaker |
| Window type | Time-based (10 s) | Count-based (last 100) | How “recent” is defined |
| Window size | 10 s / 10 buckets | 100 calls | The span the rate is measured over |
| Minimum volume | 20 calls | 100 calls | Floor before tripping is allowed |
| Slow-call handling | Timeout → failure (1 s) | slowCallRate 100% / duration 60 s | Counts slow as failure |

Quiz

  1. A low-traffic endpoint sees one timeout at 3 a.m. and the breaker opens for the full cooldown, fast-failing everyone over a single transient error. Which setting prevents this?
  2. Why should a breaker count a slow-but-successful call as a failure?

Recall before you leave

  1. Why does a breaker trip on a failure rate over a sliding window rather than a raw failure count, and what are the two window types?
  2. What does the minimum-volume floor do, and why must a breaker count slow calls as failures?

Recap

The trip condition is the heart of breaker tuning, and it is never a single number. A breaker trips on a failure rate — resilience4j and Hystrix both default to 50% — because a rate is scale-invariant where a raw count is not. That rate is measured over a sliding window of recent calls, either count-based (the last N, resilience4j’s default of 100) or time-based (the last T seconds in buckets, Hystrix’s 10 s split into ten one-second buckets), and the window slides so only recent health counts. The minimum-volume floor — Hystrix requestVolumeThreshold 20, resilience4j minimumNumberOfCalls 100 — keeps the breaker closed until it has seen enough calls for the rate to be meaningful, the single most important guard against flapping on noise. And because slow is the dangerous dependency state, a slow-but-successful call should count as a failure on its own slow-call rate, or a crawling dependency keeps the breaker closed while starving your threads. The breaker now trips correctly — but it still shares one thread pool across every dependency, so a single sick downstream can drain the budget before the breaker even reacts. The next lesson adds bulkheads to isolate that.

