
Backend Architecture

Timeouts and fallbacks: what to return when it's open

Crux: A breaker only works if a timeout makes a hang look like a failure, and an open breaker only helps if the caller has an answer ready. The senior view: a timeout is the trigger, a fallback is the answer, graceful degradation should be rare, and load shedding is the last line of defense.

A team adds a circuit breaker to a flaky recommendations service and ships it. The next incident, the breaker never trips — the recommendations calls do not error, they just take 30 seconds, and the breaker was only counting errors. With no timeout, a hang is invisible to the breaker, so it sat closed through the whole outage. The team adds a 1-second timeout; now the breaker trips correctly. But the homepage starts returning a hard 500 the instant the breaker opens, because nobody decided what to show when there are no recommendations. Two missing pieces, both essential: a timeout is what turns a hang into a countable failure, and a fallback is what the caller returns once the breaker has tripped. A breaker without both is theater.

The timeout is the trigger

A breaker counts failures, so something has to produce a failure. An erroring dependency does that itself, but the dangerous case from lesson one — the slow dependency — produces no error at all. It just takes forever. Without a timeout, that hang is invisible: the call is neither a success nor a failure, it is simply pending, and the breaker keeps the circuit closed while every caller piles up behind the hang.

So the timeout is what converts a hang into a countable failure. Hystrix wires this in by default — execution.isolation.thread.timeoutInMilliseconds = 1000 — so any call exceeding 1 s is failed and fed to the breaker. The timeout must be set deliberately: long enough to allow a legitimate slow-but-normal response, short enough that a real hang is caught before it pins resources. A breaker on top of a dependency with no timeout is the single most common way a breaker silently does nothing.
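The conversion from hang to countable failure can be sketched in a few lines. This is a minimal illustration, not Hystrix's implementation: `CountingBreaker`, `call_with_timeout`, and `hanging_dependency` are hypothetical names, the breaker is a toy that only counts consecutive failures, and the 50 ms timeout is chosen only to keep the demo fast.

```python
import asyncio


class CountingBreaker:
    """Toy breaker (hypothetical): opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1


async def call_with_timeout(breaker: CountingBreaker, make_call, timeout_s: float):
    """Run make_call() under a deadline so a hang is counted like any other failure."""
    if breaker.is_open:
        raise RuntimeError("circuit open")
    try:
        result = await asyncio.wait_for(make_call(), timeout=timeout_s)
    except Exception:
        # asyncio.TimeoutError lands here too: the hang is now a countable failure.
        breaker.record_failure()
        raise
    breaker.record_success()
    return result


async def hanging_dependency():
    await asyncio.sleep(30)  # stands in for the 30-second recommendations call
    return ["item-1"]


async def demo() -> CountingBreaker:
    breaker = CountingBreaker(threshold=3)
    for _ in range(3):
        try:
            await call_with_timeout(breaker, hanging_dependency, timeout_s=0.05)
        except asyncio.TimeoutError:
            pass  # each timed-out call has already been recorded as a failure
    return breaker
```

Without the `asyncio.wait_for` wrapper, each call in `demo` would simply stay pending for 30 seconds and `breaker.failures` would never move, which is the silent failure mode described above.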

The fallback is the answer

Once the breaker is open, every call is rejected instantly — but rejected into what? A raw exception bubbling to the user is a hard failure; the breaker only converted a slow failure into a fast one. The value comes from giving the caller a fallback: a useful answer when the real one is unavailable. Common fallbacks, roughly best to worst:

  • A degraded-but-correct response. Hide the recommendations carousel and render the rest of the page. The user loses a feature, not the page.
  • Last-known-good cache. Serve a slightly stale value (yesterday’s recommendations, a cached price) instead of nothing.
  • A static default. A generic “popular items” list, an empty array, a sensible zero.
  • Queue for later. For writes, accept the request and process it asynchronously once the dependency recovers (the outbox pattern from the idempotency unit).
  • Fail fast with a clear error. When no fallback is meaningful, a clean 503 with Retry-After beats a hung 30-second request.

The art is choosing per call site. A missing recommendations carousel should never fail the page; a missing payment authorization must fail the order, because a fake success here is worse than an error.
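The per-call-site choice might look like the sketch below. All names here (`CircuitOpenError`, `recommendations_with_fallback`, `authorize_payment`, `POPULAR_ITEMS`) are hypothetical, and the breaker itself is assumed to live inside the callables passed in; the point is only the contrast between a read that degrades and a write that refuses to fake success.

```python
from typing import Callable, Optional, Sequence


class CircuitOpenError(Exception):
    """Raised by the (hypothetical) breaker wrapper when the circuit is open."""


POPULAR_ITEMS = ["popular-1", "popular-2"]  # illustrative static default


def recommendations_with_fallback(
    fetch_live: Callable[[], Sequence[str]],
    last_known_good: Optional[Sequence[str]],
) -> list:
    """Non-critical read: never raise; degrade to cache, then to a static default."""
    try:
        return list(fetch_live())
    except (CircuitOpenError, TimeoutError):
        if last_known_good is not None:
            return list(last_known_good)  # slightly stale beats nothing
        return list(POPULAR_ITEMS)


def authorize_payment(authorize: Callable[[], str]) -> str:
    """Critical call: no fake success, so the failure propagates to the caller."""
    # The caller is expected to map CircuitOpenError to a 503 with Retry-After.
    return authorize()
```

The asymmetry is deliberate: the recommendations function swallows the failure because any list is acceptable, while the payment function has no `except` clause at all because no substitute answer is safe.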

Degradation should be rare, and shedding is the floor

A subtle senior point from Google’s SRE practice: graceful degradation should not trigger often. A fallback path that runs constantly is under-tested, hides the real failure rate, and can mask a chronic problem until it becomes acute. Degradation is for genuine incidents, and you should alert when too many servers enter degraded or fallback modes — that frequency is itself a signal.

When even fallbacks cannot keep up — the whole service is overloaded, not just one dependency — the last line is load shedding: deliberately reject a fraction of requests with a 503 to protect the rest, rather than letting everything degrade into timeouts. Google’s SRE book pairs this with LIFO/CoDel-style queuing that drops requests already queued long enough (~10 s) to have missed their deadline anyway — there is no point spending capacity on a request the user has given up on.
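As a rough illustration of shedding plus deadline-aware queuing, here is a simplified bounded LIFO queue. It is not CoDel and not Google's implementation: `SheddingQueue`, its parameters, and the injectable `clock` are assumptions made for the sketch. It rejects new work when full (the caller would answer 503) and discards entries that have waited longer than `max_wait_s` before serving the newest one.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class SheddingQueue:
    """Bounded LIFO queue: rejects when full, drops entries past their deadline."""

    def __init__(
        self,
        max_size: int,
        max_wait_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.clock = clock
        self._items: Deque[Tuple[float, str]] = deque()  # (enqueue time, request id)
        self.shed = 0     # rejected on arrival; the caller returns 503
        self.expired = 0  # dropped after waiting too long

    def offer(self, request_id: str) -> bool:
        """Enqueue a request, or shed it immediately if the queue is full."""
        if len(self._items) >= self.max_size:
            self.shed += 1
            return False
        self._items.append((self.clock(), request_id))
        return True

    def take(self) -> Optional[str]:
        """Drop expired entries from the old end, then serve the newest request."""
        now = self.clock()
        while self._items and now - self._items[0][0] > self.max_wait_s:
            self._items.popleft()
            self.expired += 1
        if not self._items:
            return None
        return self._items.pop()[1]  # LIFO: newest request first
```

Serving the newest request first is the design choice worth noticing: under overload, the freshest request is the one whose user is most likely still waiting.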

Why this works

Why is a fallback that runs all the time a problem rather than a feature? Because a constantly-firing fallback quietly redefines “working.” If the homepage serves stale recommendations from cache every single request because the live service has been broken for a week, the page looks fine, the dashboards look fine, and nobody is paged — the failure has been laundered into normal operation. Three things rot underneath. First, the real dependency’s failure rate is now invisible, so a small problem grows into a large one undetected. Second, the fallback path itself is now load-bearing yet rarely scrutinized; the day the cache also fails, you discover the fallback had a bug nobody hit in months. Third, you have lost the signal that distinguishes “degraded” from “healthy,” which is the signal on-call decisions depend on. The discipline is to treat fallbacks as exceptional and instrument their rate: a fallback firing 0.1% of the time during a blip is the system working as designed, while the same fallback firing 40% of the time is an incident wearing a healthy mask. This is why mature teams alert on degradation frequency, not just on hard errors — the absence of errors is not the same as health when a fallback is silently absorbing the failures.
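Instrumenting the fallback rate can be as small as a sliding window over recent requests. The sketch below is illustrative only: `FallbackRateMonitor` is a hypothetical name, and the 1000-request window and 5% alert threshold are assumptions, not recommendations from the sources.

```python
from collections import deque


class FallbackRateMonitor:
    """Tracks what fraction of the most recent requests were served by a fallback."""

    def __init__(self, window: int = 1000, alert_ratio: float = 0.05) -> None:
        self.alert_ratio = alert_ratio
        self._outcomes = deque(maxlen=window)  # True means a fallback served it

    def record(self, used_fallback: bool) -> None:
        self._outcomes.append(used_fallback)

    @property
    def ratio(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self) -> bool:
        """True when fallbacks are absorbing more traffic than the threshold allows."""
        return self.ratio > self.alert_ratio
```

With these assumed settings, a blip that pushes one request in a thousand onto the fallback stays quiet, while a fallback serving 40% of traffic trips the alert even though no request returned an error.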

Layer             | Trigger                     | What it returns              | When to use
------------------|-----------------------------|------------------------------|------------------------------
Timeout           | Call exceeds budget (~1 s)  | Failure fed to breaker       | Always, on every network call
Degraded response | Breaker open                | Page minus the broken feature| Non-critical feature
Stale cache       | Breaker open                | Last-known-good value        | Read tolerant of staleness
Static default    | Breaker open                | Generic safe value           | No fresher option
Fail-fast 503     | Breaker open, no fallback   | Clean error + Retry-After    | Critical call (e.g. payment)
Load shedding     | Whole service overloaded    | 503 for a fraction           | Last line, protect the rest

Quiz

A breaker is added to a flaky service but never trips during an outage. The dependency isn't erroring — it's taking 30 s per call. What's missing?

Quiz

Why do experienced teams alert when a graceful-degradation fallback fires too often, instead of treating it as the system simply working?

Recall before you leave

  1. Why is a timeout essential for a circuit breaker to work, and what fallback options does an open breaker have?
  2. Why should graceful degradation be rare, and what is load shedding?

Recap

A breaker is only as good as the two pieces around it. The timeout is the trigger: because the breaker counts failures and the dangerous slow dependency produces none, a hang without a timeout is invisible and the breaker sits closed through the outage — Hystrix’s default 1-second timeout is what converts a hang into a countable failure, and a breaker over an un-timed-out dependency silently does nothing. The fallback is the answer: an open breaker rejects instantly, and the value is what the caller returns instead — a degraded-but-correct page, a last-known-good cache, a static default, a queued write via the outbox, or a clean 503 with Retry-After — chosen per call site, since a missing carousel must never fail the page while a missing payment authorization must fail the order. Graceful degradation should be rare, because a constantly-firing fallback hides the true failure rate and leaves an untested path load-bearing, so alert on its frequency. And when the whole service is overloaded, load shedding with deadline-aware queuing is the floor. Everything so far assumed one process — the final lesson scales out to many instances, where breakers flap, herd on half-open, and collide with retries.

