Timeouts and fallbacks: what to return when it's open

A breaker only works if a timeout makes a hang look like a failure, and an open breaker only helps if the caller has an answer ready. The senior view: a timeout is the trigger, a fallback is the answer, graceful degradation should be rare, and load shedding is the last line of defense.

BE Senior ◷ 16 min

A team adds a circuit breaker to a flaky recommendations service and ships it. The next incident, the breaker never trips — the recommendations calls do not error, they just take 30 seconds, and the breaker was only counting errors. With no timeout, a hang is invisible to the breaker, so it sat closed through the whole outage. The team adds a 1-second timeout; now the breaker trips correctly. But the homepage starts returning a hard 500 the instant the breaker opens, because nobody decided what to show when there are no recommendations. Two missing pieces, both essential: a timeout is what turns a hang into a countable failure, and a fallback is what the caller returns once the breaker has tripped. A breaker without both is theater.

The timeout is the trigger

A breaker counts failures, so something has to produce a failure. An erroring dependency does that itself, but the dangerous case from lesson one — the slow dependency — produces no error at all. It just takes forever. Without a timeout, that hang is invisible: the call is neither a success nor a failure, it is simply pending, and the breaker keeps the circuit closed while every caller piles up behind the hang.

So the timeout is what converts a hang into a countable failure. Hystrix wires this in by default (execution.isolation.thread.timeoutInMilliseconds = 1000), so any call exceeding 1 s is failed and fed to the breaker. The timeout must be set deliberately: long enough to allow a legitimate slow-but-normal response, short enough that a real hang is caught before it pins resources. A breaker on top of a dependency with no timeout is one of the most common ways a breaker silently does nothing.
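To make the trigger concrete, here is a minimal Python sketch of a timeout turning a hang into a failure the breaker can count. It is not Hystrix's implementation: `CountingBreaker`, `BreakerOpen`, and `call_with_timeout` are hypothetical names, and the "open after N consecutive failures" policy is an assumption chosen for brevity.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class BreakerOpen(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CountingBreaker:
    """Toy breaker (hypothetical policy): opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1


_pool = ThreadPoolExecutor(max_workers=8)


def call_with_timeout(breaker: CountingBreaker, fn, timeout_s: float = 1.0):
    """Run fn with a time budget; a hang is reported to the breaker as a failure."""
    if breaker.is_open:
        raise BreakerOpen("rejected without calling the dependency")
    future = _pool.submit(fn)
    try:
        result = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The key step: without this branch the hang would stay "pending"
        # forever and the breaker would never see it.
        breaker.record_failure()
        raise
    except Exception:
        # Ordinary errors are counted as well.
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

One caveat about the sketch: the worker thread keeps running after the caller gives up, so in a real client the timeout should also be set on the socket or HTTP call itself so that the resource is actually released.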

The fallback is the answer

Once the breaker is open, every call is rejected instantly — but rejected into what? A raw exception bubbling to the user is a hard failure; the breaker only converted a slow failure into a fast one. The value comes from giving the caller a fallback: a useful answer when the real one is unavailable. Common fallbacks, roughly best to worst:

A degraded-but-correct response. Hide the recommendations carousel and render the rest of the page. The user loses a feature, not the page.
Last-known-good cache. Serve a slightly stale value (yesterday’s recommendations, a cached price) instead of nothing.
A static default. A generic “popular items” list, an empty array, a sensible zero.
Queue for later. For writes, accept the request and process it asynchronously once the dependency recovers (the outbox pattern from the idempotency unit).
Fail fast with a clear error. When no fallback is meaningful, a clean 503 with Retry-After beats a hung 30-second request.

The art is choosing per call site. A missing recommendations carousel should never fail the page; a missing payment authorization must fail the order, because a fake success here is worse than an error.

An open breaker must return something. Reach for the highest-ranked fallback the call site allows; a clean fail-fast 503 is the floor, and a raw hang is never acceptable.
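The per-call-site choice can be sketched as two handlers with different fallback policies. This is an illustration under assumptions, not a prescribed API: `Response`, `get_recommendations`, `authorize_payment`, `STATIC_DEFAULT`, the in-memory `cache` dict, and the 30-second `Retry-After` value are all hypothetical.

```python
from dataclasses import dataclass, field


class BreakerOpen(Exception):
    """Hypothetical error raised when the breaker rejects a call."""


@dataclass
class Response:
    status: int
    body: object
    headers: dict = field(default_factory=dict)
    source: str = "live"  # which path produced the answer; useful for metrics


# Assumed generic "popular items" list used when nothing fresher exists.
STATIC_DEFAULT = ["bestseller-1", "bestseller-2"]


def get_recommendations(fetch_live, cache: dict, user_id: str) -> Response:
    """Non-critical read: walk down the fallback list, never fail the page."""
    try:
        items = fetch_live(user_id)
        cache[user_id] = items  # refresh the last-known-good value
        return Response(200, items, source="live")
    except (BreakerOpen, TimeoutError):
        if user_id in cache:
            return Response(200, cache[user_id], source="stale-cache")
        return Response(200, STATIC_DEFAULT, source="static-default")


def authorize_payment(authorize_live, order_id: str) -> Response:
    """Critical call: no fake success, so the only fallback is a clean 503."""
    try:
        return Response(200, authorize_live(order_id))
    except (BreakerOpen, TimeoutError):
        return Response(
            503,
            {"error": "payment authorization unavailable"},
            headers={"Retry-After": "30"},  # assumed retry hint, in seconds
            source="fail-fast",
        )
```

Tagging each response with its `source` is a design choice in this sketch: it lets the caller count how often each fallback path is taken instead of hiding that behind a 200.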

Degradation should be rare, and shedding is the floor

Here is where most teams stop — they add the fallback and feel safe. The senior trap is believing that a fallback firing often means the system is resilient. It is the opposite.

A subtle senior point from Google’s SRE practice: graceful degradation should not trigger often. A fallback path that runs constantly is under-tested, hides the real failure rate, and can mask a chronic problem until it becomes acute. Degradation is for genuine incidents, and you should alert when too many servers enter degraded or fallback modes — that frequency is itself a signal.

When even fallbacks cannot keep up — the whole service is overloaded, not just one dependency — the last line is load shedding: deliberately reject a fraction of requests with a 503 to protect the rest, rather than letting everything degrade into timeouts. Google’s SRE book pairs this with LIFO/CoDel-style queuing that drops requests already queued long enough (~10 s) to have missed their deadline anyway — there is no point spending capacity on a request the user has given up on.
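As a rough illustration of deadline-aware shedding, the sketch below keeps a bounded queue, serves the newest request first, and discards anything that has waited past its deadline. It is a simplification, not the CoDel algorithm or Google's implementation: `SheddingQueue`, `max_depth`, and the injectable `clock` are hypothetical, and the 10-second default simply mirrors the figure quoted above.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Queued:
    request_id: str
    enqueued_at: float


class SheddingQueue:
    """Bounded LIFO queue that sheds on overflow and drops expired requests."""

    def __init__(self, max_depth: int = 100, max_wait_s: float = 10.0,
                 clock=time.monotonic) -> None:
        self.max_depth = max_depth
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.dropped = 0  # requests discarded after missing their deadline
        self._items = deque()

    def offer(self, request_id: str) -> bool:
        """Return False when the queue is full; the caller should answer 503."""
        if len(self._items) >= self.max_depth:
            return False
        self._items.append(Queued(request_id, self.clock()))
        return True

    def take(self):
        """Return the newest live request id, or None if nothing is waiting."""
        now = self.clock()
        # Oldest entries sit on the left; discard those that waited too long,
        # since the user has most likely given up on them already.
        while self._items and now - self._items[0].enqueued_at > self.max_wait_s:
            self._items.popleft()
            self.dropped += 1
        if not self._items:
            return None
        # LIFO: the most recent request has the best chance of still mattering.
        return self._items.pop().request_id
```

Passing the clock in as a parameter is only there to make the behavior easy to exercise with a fake clock; a production queue would more likely carry a per-request deadline propagated from the caller.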

Why this works

Why is a fallback that runs all the time a problem rather than a feature? Because a constantly-firing fallback quietly redefines “working.” If the homepage serves stale recommendations from cache every single request because the live service has been broken for a week, the page looks fine, the dashboards look fine, and nobody is paged — the failure has been laundered into normal operation. Three things rot underneath. First, the real dependency’s failure rate is now invisible, so a small problem grows into a large one undetected. Second, the fallback path itself is now load-bearing yet rarely scrutinized; the day the cache also fails, you discover the fallback had a bug nobody hit in months. Third, you have lost the signal that distinguishes “degraded” from “healthy,” which is the signal on-call decisions depend on. The discipline is to treat fallbacks as exceptional and instrument their rate: a fallback firing 0.1% of the time during a blip is the system working as designed, while the same fallback firing 40% of the time is an incident wearing a healthy mask. This is why mature teams alert on degradation frequency, not just on hard errors — the absence of errors is not the same as health when a fallback is silently absorbing the failures.
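Instrumenting the fallback rate can be as small as a sliding window over recent requests. The sketch below is one hypothetical way to do it: `FallbackRateMonitor`, the 1000-request window, and the 5% alert threshold are assumptions for illustration, and a real threshold would be tuned per service.

```python
from collections import deque


class FallbackRateMonitor:
    """Tracks the share of recent requests that were answered by a fallback."""

    def __init__(self, window: int = 1000, alert_ratio: float = 0.05) -> None:
        # Each entry is True if that request used a fallback path.
        self.outcomes = deque(maxlen=window)
        self.alert_ratio = alert_ratio  # assumed threshold, tune per service

    def record(self, used_fallback: bool) -> None:
        self.outcomes.append(used_fallback)

    def rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # A blip stays under the threshold; a chronic failure does not,
        # even though every individual response still looks healthy.
        return self.rate() > self.alert_ratio
```

With these assumed settings, one fallback in a thousand requests stays quiet, while four hundred in a thousand trips the alert, which matches the 0.1% versus 40% contrast described above.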

Layer	Trigger	What it returns	When to use
Timeout	Call exceeds budget (~1 s)	Failure fed to breaker	Always, on every network call
Degraded response	Breaker open	Page minus the broken feature	Non-critical feature
Stale cache	Breaker open	Last-known-good value	Read tolerant of staleness
Static default	Breaker open	Generic safe value	No fresher option
Fail-fast 503	Breaker open, no fallback	Clean error + Retry-After	Critical call (e.g. payment)
Load shedding	Whole service overloaded	503 for a fraction	Last line, protect the rest

Quiz

A breaker is added to a flaky service but never trips during an outage. The dependency isn't erroring — it's taking 30 s per call. What's missing?

Quiz

Why do experienced teams alert when a graceful-degradation fallback fires too often, instead of treating it as the system simply working?

Without a timeout the hang is invisible to the breaker. Once the timeout feeds failures to the breaker and it trips, the fallback path must return something useful — stale cache, degraded response, or a clean 503 with Retry-After.

key takeaway

A breaker needs two pieces around it or it is theater. The timeout is the trigger: a breaker counts failures, but the dangerous slow dependency produces no error — without a timeout (Hystrix defaults to 1 s) a hang is neither success nor failure, invisible to the breaker, which sits closed through the outage. The fallback is the answer: once open, calls are rejected instantly, but into what? Best-to-worst options are a degraded-but-correct response (drop the broken feature, keep the page), a last-known-good cache, a static default, queue-for-later (outbox) for writes, and a clean fail-fast 503 with Retry-After when no fallback is meaningful — chosen per call site, since a missing carousel must never fail the page while a missing payment authorization must fail the order. Graceful degradation should be rare: a fallback that fires constantly hides the real failure rate and leaves an untested path load-bearing, so alert on degradation frequency. When the whole service is overloaded, load shedding (503 a fraction, drop requests queued past their deadline) is the floor.

Recall before you leave

01
Why is a timeout essential for a circuit breaker to work, and what fallback options does an open breaker have?
02
Why should graceful degradation be rare, and what is load shedding?

Recap

A breaker is only as good as the two pieces around it. The timeout is the trigger: because the breaker counts failures and the dangerous slow dependency produces none, a hang without a timeout is invisible and the breaker sits closed through the outage — Hystrix’s default 1-second timeout is what converts a hang into a countable failure, and a breaker over an un-timed-out dependency silently does nothing. The fallback is the answer: an open breaker rejects instantly, and the value is what the caller returns instead — a degraded-but-correct page, a last-known-good cache, a static default, a queued write via the outbox, or a clean 503 with Retry-After — chosen per call site, since a missing carousel must never fail the page while a missing payment authorization must fail the order. Graceful degradation should be rare, because a constantly-firing fallback hides the true failure rate and leaves an untested path load-bearing, so alert on its frequency. And when the whole service is overloaded, load shedding with deadline-aware queuing is the floor. Now when you review a service’s resilience posture, two questions are the diagnostic: does every outbound network call have an explicit timeout, and does every call site have a defined fallback? If either answer is no, the breaker is theater — and you know how to fix it. Everything so far assumed one process — the final lesson scales out to many instances, where breakers flap, herd on half-open, and collide with retries.
