What trips it: failure rate, windows, and a volume floor

A breaker should not trip on a single failure or stay closed through a meltdown. What trips it is a failure rate over a sliding window, gated by a minimum-volume floor so it ignores noise — and a slow call counted as a failure, because slow is the state that hurts.

A naive breaker trips on the first failed call. So a single timeout at 3 a.m., a one-off network blip, opens the breaker, and now every user is fast-failed for the full cooldown over one transient error. Swing too far the other way and a breaker that needs “50 failures” never trips on a low-traffic endpoint that only sees 10 requests a minute, so it sits closed through a total meltdown. The trip condition is not a single number; it is a rate over a window with a volume floor, and getting them right is the difference between a breaker that protects you and one that either cries wolf or sleeps through the fire.

A rate, not a count

Why does this matter enough to get right? Because a misconfigured trip condition is the most common way a breaker silently fails you: it trips on noise and causes self-inflicted outages, or it stays closed through a meltdown because the failure count it is waiting for cannot be reached at that endpoint's traffic level.

The trip condition is a failure rate, not an absolute number of failures: resilience4j’s failureRateThreshold defaults to 50%, Hystrix’s errorThresholdPercentage to 50. Once the rate of failing calls in the current window reaches or exceeds the threshold, the breaker opens. Using a rate rather than a raw count is what makes the breaker scale-invariant: 50% failures is meaningful whether the endpoint sees 10 calls a second or 10,000, while “50 failures” means very different things at those two volumes.
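To make the scale-invariance concrete, here is a minimal sketch that compares a raw-count rule with a rate rule at two traffic levels. The function names, the 50-failure count, and the sample numbers are illustrative assumptions, not taken from resilience4j or Hystrix.

```python
def trips_on_count(failures: int, max_failures: int = 50) -> bool:
    """Hypothetical raw-count rule: trip once failures reach a fixed number."""
    return failures >= max_failures


def trips_on_rate(failures: int, total: int, threshold_pct: float = 50.0) -> bool:
    """Rate rule: trip once the failure percentage reaches the threshold."""
    if total == 0:
        return False
    return 100.0 * failures / total >= threshold_pct


# (failures, total) for two made-up endpoints.
samples = {
    "quiet, fully down": (10, 10),         # 100% failing, but only 10 calls
    "busy, basically fine": (60, 10_000),  # 0.6% failing, but 60 failures
}

for name, (failures, total) in samples.items():
    print(name, trips_on_count(failures), trips_on_rate(failures, total))
```

The count rule gets both cases wrong: it ignores the quiet endpoint that is fully down and trips on the busy one that is healthy. The rate rule gets both right.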

The window: count-based vs time-based

The rate is measured over a sliding window of recent calls, and there are two ways to define “recent”:

Count-based — the last N calls. resilience4j’s default is slidingWindowType = COUNT_BASED with slidingWindowSize = 100, so it judges the last 100 calls. Simple and predictable, but on a quiet endpoint those 100 calls might span a long time, so the window reacts slowly.
Time-based — all calls in the last T seconds, usually split into buckets. Hystrix uses a 10 s rolling window divided into 10 one-second buckets, rolling forward each second. This reacts in bounded wall-clock time regardless of traffic, which is what you usually want for a latency-sensitive breaker.

Either way the window slides: old calls age out, so the breaker reflects the dependency’s recent health, not its lifetime average. A dependency that failed an hour ago and has been fine since should not be holding the breaker open.
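A count-based window can be sketched with a bounded deque, as below. This is an illustrative model of the idea only; the class and method names are assumptions and it is not how resilience4j implements its window internally.

```python
from collections import deque


class CountBasedWindow:
    """Tracks the outcomes of the last `size` calls (illustrative sketch)."""

    def __init__(self, size: int = 100) -> None:
        # maxlen makes the deque drop the oldest outcome automatically.
        self._outcomes = deque(maxlen=size)  # True means the call failed

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    def total(self) -> int:
        return len(self._outcomes)

    def failure_rate(self) -> float:
        """Failure percentage over the calls currently in the window."""
        if not self._outcomes:
            return 0.0
        return 100.0 * sum(self._outcomes) / len(self._outcomes)


window = CountBasedWindow(size=5)
for _ in range(5):
    window.record(failed=True)
print(window.failure_rate())  # 100.0

for _ in range(5):
    window.record(failed=False)  # five healthy calls push the failures out
print(window.failure_rate())  # 0.0
```

Note that the old failures only age out when new calls arrive, which is exactly why a count-based window reacts slowly on a quiet endpoint.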

Both windows slide so old calls age out; they differ in how 'recent' is bounded — count-based by call count, time-based by wall-clock seconds. Time-based reacts in bounded time at any traffic level, which is why it suits latency-sensitive breakers.
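A time-based window can be sketched as a set of one-second buckets that are evicted once they fall outside the window. This is a simplified model, not Hystrix's implementation; the class name is hypothetical, and the clock is passed in explicitly as `now` (an assumption made so the example is deterministic; real code would typically read a monotonic clock).

```python
class TimeBasedWindow:
    """Rolling window of one-second buckets (illustrative sketch)."""

    def __init__(self, window_seconds: int = 10) -> None:
        self.window_seconds = window_seconds
        # Maps a whole-second timestamp to [total_calls, failed_calls].
        self._buckets = {}

    def _evict(self, now: float) -> None:
        """Drop buckets older than the window, measured from `now`."""
        oldest_allowed = int(now) - self.window_seconds + 1
        for second in [s for s in self._buckets if s < oldest_allowed]:
            del self._buckets[second]

    def record(self, failed: bool, now: float) -> None:
        self._evict(now)
        bucket = self._buckets.setdefault(int(now), [0, 0])
        bucket[0] += 1
        bucket[1] += int(failed)

    def snapshot(self, now: float):
        """Returns (total_calls, failure_rate_percent) for the current window."""
        self._evict(now)
        total = sum(b[0] for b in self._buckets.values())
        failed = sum(b[1] for b in self._buckets.values())
        rate = 100.0 * failed / total if total else 0.0
        return total, rate


w = TimeBasedWindow(window_seconds=10)
for t in range(5):
    w.record(failed=True, now=float(t))  # one failure per second, t = 0..4
print(w.snapshot(now=4.0))   # (5, 100.0)
print(w.snapshot(now=20.0))  # (0, 0.0)
```

Unlike the count-based version, the failures here age out purely with the passage of time, even if no further calls arrive.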

The volume floor stops false trips

A rate alone is dangerous at low traffic: 1 failure out of 1 call is 100%, which would trip instantly on a single blip. The fix is a minimum-volume floor — a breaker must see at least M calls in the window before it is even allowed to trip. Hystrix’s requestVolumeThreshold defaults to 20; resilience4j’s minimumNumberOfCalls to 100. Below the floor the breaker stays closed no matter the rate, because a handful of failures is not a statistically meaningful signal. This is the single most important guard against a breaker that flaps on noise.
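The floor amounts to one extra guard in front of the rate check. The sketch below uses a hypothetical `should_trip` function whose default of 20 mirrors the Hystrix requestVolumeThreshold quoted above; the inputs are assumed to be the counts from the current window.

```python
def should_trip(failed: int, total: int,
                failure_rate_threshold: float = 50.0,
                minimum_calls: int = 20) -> bool:
    """Trip only if the volume floor is met AND the rate reaches the threshold."""
    if total == 0 or total < minimum_calls:
        return False  # too few calls: the rate is not trusted yet
    return 100.0 * failed / total >= failure_rate_threshold


print(should_trip(failed=1, total=1))    # False: 100% rate, but below the floor
print(should_trip(failed=19, total=19))  # False: still one call short
print(should_trip(failed=12, total=20))  # True: floor met, 60% failing
print(should_trip(failed=5, total=20))   # False: floor met, only 25% failing
```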

Slow is a failure too

The subtlest rule: a call that succeeds slowly should often count as a failure. As the first lesson established, slow is the dangerous state: a dependency answering in 5 s is doing as much damage as one that errors. So mature breakers track a slow-call rate separately: resilience4j’s slowCallDurationThreshold (default 60 s) defines what “slow” means, and slowCallRateThreshold (default 100%) is the rate of slow calls that trips the breaker independently of outright errors. With those defaults the slow-call check almost never fires, so both values need tuning to your own latency budget. Without this, a dependency that never errors but crawls would keep the breaker closed while it starves your threads, which is exactly the failure mode the breaker was bought to prevent.
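The following sketch shows why the slow-call rate has to be tracked separately from the error rate. The names are hypothetical, and the 2 s duration threshold is chosen for illustration; it is not a library default.

```python
from dataclasses import dataclass


@dataclass
class CallOutcome:
    duration_s: float
    errored: bool


def rates(calls, slow_call_duration_s: float):
    """Returns (failure_rate_pct, slow_call_rate_pct) over the given calls."""
    if not calls:
        return 0.0, 0.0
    failed = sum(c.errored for c in calls)
    # A call counts as slow when it takes longer than the duration threshold,
    # whether or not it eventually succeeded.
    slow = sum(c.duration_s > slow_call_duration_s for c in calls)
    return 100.0 * failed / len(calls), 100.0 * slow / len(calls)


# A crawling dependency: nothing errors, but 80 of 100 calls take 5 s.
calls = [CallOutcome(5.0, False)] * 80 + [CallOutcome(0.1, False)] * 20
failure_rate, slow_rate = rates(calls, slow_call_duration_s=2.0)
print(failure_rate, slow_rate)  # 0.0 80.0
```

A breaker that looks only at the failure rate sees 0% here and stays closed. One that also checks the slow-call rate against a threshold of, say, 50% would open.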

Why this works

Why gate the breaker on a minimum volume at all — isn’t a 100% failure rate always a real problem? Because at low volume a “rate” is statistical noise, not a measurement. One failed call out of one is 100%, but it tells you almost nothing: it could be a transient packet drop, a single slow GC pause downstream, a one-off deploy hiccup. Tripping on it punishes every subsequent user for a cooldown over a sample size of one. The minimum-volume floor is a confidence requirement: only act on the failure rate once you have seen enough calls that the rate is meaningful. It is the same reason you do not conclude a coin is biased after one flip. The cost structure is asymmetric and informs the numbers — a false trip on a healthy dependency is a self-inflicted outage, while waiting for a few more calls before tripping costs only those few calls’ worth of waiting on a dependency that, if truly broken, will keep failing and cross the floor in moments anyway. So the floor buys real protection against flapping at almost no cost in reaction time, which is why every production breaker has one and why tuning it for low-traffic endpoints matters more than the rate threshold itself.
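The coin-flip intuition can be put into numbers. The sketch below assumes a healthy dependency whose calls fail independently with a 2% baseline probability (both the independence and the 2% figure are simplifying assumptions for illustration) and computes the chance that at least half of n calls fail, which is what a 50% threshold would trip on.

```python
from math import ceil, comb


def false_trip_probability(n: int, p: float, threshold: float = 0.5) -> float:
    """P(at least `threshold` of n independent calls fail), each failing with probability p."""
    k_min = ceil(threshold * n)  # smallest failure count that reaches the threshold
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for n in (1, 5, 20):
    print(n, false_trip_probability(n, p=0.02))
```

With a sample of one, the breaker false-trips about 2% of the time, which is every transient blip. With a floor of 20 calls the probability falls far below one in a billion, and a dependency that is truly broken still crosses the floor quickly.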

Setting	Hystrix default	resilience4j default	What it controls
Failure-rate threshold	50%	50%	Rate of failures that trips the breaker
Window type	Time-based (10 s)	Count-based (last 100)	How “recent” is defined
Window size	10 s / 10 buckets	100 calls	The span the rate is measured over
Minimum volume	20 calls	100 calls	Floor before tripping is allowed
Slow-call handling	Timeout → failure (1 s)	slowCallRate 100% / duration 60 s	Counts slow as failure

Quiz

A low-traffic endpoint sees one timeout at 3 a.m. and the breaker opens for the full cooldown, fast-failing everyone over a single transient error. Which setting prevents this?

Quiz

Why should a breaker count a slow-but-successful call as a failure?

Failure rate ≥ 50%: failureRateThreshold, the scale-invariant trip condition

Volume floor met: minimumNumberOfCalls 100, which rejects noise from sparse traffic

Window active: count-based over the last 100 calls, or time-based over a sliding 10 s

The breaker trips on failure rate only when all three layers are satisfied. A call slower than slowCallDurationThreshold is counted toward the slow-call rate, which is checked against slowCallRateThreshold as a separate trigger, so a crawling dependency still trips the breaker even without errors.
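Put together, the trip decision can be sketched as a single function over the counts from the current window. This is a simplified model with hypothetical names; the default values mirror the resilience4j defaults quoted in this lesson, and the slow-call rate is treated as a second, independent trigger, as described in the slow-call section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripConfig:
    failure_rate_threshold: float = 50.0     # percent
    slow_call_rate_threshold: float = 100.0  # percent
    minimum_number_of_calls: int = 100       # volume floor


def should_open(total: int, failed: int, slow: int,
                cfg: TripConfig = TripConfig()) -> bool:
    """Decide whether to open, given counts from the current sliding window."""
    if total == 0 or total < cfg.minimum_number_of_calls:
        return False  # below the floor: never trip, whatever the rates are
    failure_rate = 100.0 * failed / total
    slow_rate = 100.0 * slow / total
    return (failure_rate >= cfg.failure_rate_threshold
            or slow_rate >= cfg.slow_call_rate_threshold)


print(should_open(total=100, failed=60, slow=0))   # True: failure rate 60%
print(should_open(total=10, failed=10, slow=0))    # False: below the floor
print(should_open(total=100, failed=0, slow=100))  # True: every call is slow
print(should_open(total=100, failed=10, slow=40))  # False: both rates under threshold
```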

key takeaway

A breaker should neither trip on a single failure nor stay closed through a meltdown: the trip condition is a failure rate over a sliding window, gated by a minimum-volume floor. Using a rate (resilience4j failureRateThreshold and Hystrix errorThresholdPercentage both default to 50%) makes the breaker scale-invariant, meaningful at 10 calls/s or 10,000. The window defines “recent”: count-based judges the last N calls (resilience4j default 100), time-based judges the last T seconds in buckets (Hystrix 10 s / 10 one-second buckets), and either way it slides so old calls age out and only recent health counts. The minimum-volume floor (Hystrix requestVolumeThreshold 20, resilience4j minimumNumberOfCalls 100) is the key guard against flapping: below it the breaker stays closed regardless of rate, because a few failures are not a meaningful signal. Finally, a slow-but-successful call should count as a failure (resilience4j slowCallRateThreshold / slowCallDurationThreshold), because slow is the state that starves your threads even when nothing errors.

Recall before you leave

01. Why does a breaker trip on a failure rate over a sliding window rather than a raw failure count, and what are the two window types?
02. What does the minimum-volume floor do, and why must a breaker count slow calls as failures?

Recap

The trip condition is the heart of breaker tuning, and it is never a single number. A breaker trips on a failure rate (resilience4j and Hystrix both default to 50%) because a rate is scale-invariant where a raw count is not. That rate is measured over a sliding window of recent calls, either count-based (the last N, resilience4j’s default of 100) or time-based (the last T seconds in buckets, Hystrix’s 10 s split into ten one-second buckets), and the window slides so only recent health counts. The minimum-volume floor (Hystrix requestVolumeThreshold 20, resilience4j minimumNumberOfCalls 100) keeps the breaker closed until it has seen enough calls for the rate to be meaningful, the single most important guard against flapping on noise. And because slow is the dangerous dependency state, a slow-but-successful call should count as a failure on its own slow-call rate, or a crawling dependency keeps the breaker closed while starving your threads. When you open a breaker configuration file for the first time, the parameters that matter most are the three you learned here: not the cooldown, not the state names, but the rate threshold, the window, and the volume floor. Get those right and the breaker protects you; get them wrong and you either have no protection or a breaker that creates outages instead of preventing them. The next lesson adds bulkheads, which isolate failure domains from one another.
