Backend Architecture
Timeouts and fallbacks: what to return when it's open
A team adds a circuit breaker to a flaky recommendations service and ships it. The next incident, the breaker never trips — the recommendations calls do not error, they just take 30 seconds, and the breaker was only counting errors. With no timeout, a hang is invisible to the breaker, so it sat closed through the whole outage. The team adds a 1-second timeout; now the breaker trips correctly. But the homepage starts returning a hard 500 the instant the breaker opens, because nobody decided what to show when there are no recommendations. Two missing pieces, both essential: a timeout is what turns a hang into a countable failure, and a fallback is what the caller returns once the breaker has tripped. A breaker without both is theater.
The timeout is the trigger
A breaker counts failures, so something has to produce a failure. An erroring dependency does that itself, but the dangerous case from lesson one — the slow dependency — produces no error at all. It just takes forever. Without a timeout, that hang is invisible: the call is neither a success nor a failure, it is simply pending, and the breaker keeps the circuit closed while every caller piles up behind the hang.
So the timeout is what converts a hang into a countable failure. Hystrix wires this in by default — execution.isolation.thread.timeoutInMilliseconds = 1000 — so any call exceeding 1 s is failed and fed to the breaker. The timeout must be set deliberately: long enough to allow a legitimate slow-but-normal response, short enough that a real hang is caught before it pins resources. A breaker on top of a dependency with no timeout is the single most common way a breaker silently does nothing.
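To make the trigger concrete, here is a minimal Python sketch of a timeout feeding a breaker. The `CircuitBreaker` class, `CircuitOpenError`, and `call_with_timeout` are hypothetical names invented for illustration; this is not Hystrix's implementation. It assumes a simple consecutive-failure threshold with no half-open state.

```python
import concurrent.futures
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Tiny consecutive-failure breaker (illustrative; no half-open state)."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.failure_threshold

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1


# Shared worker pool so the caller can stop waiting on a hung call.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(breaker, fn, timeout_s):
    """Run fn, counting both errors and timeouts as breaker failures."""
    if breaker.is_open:
        raise CircuitOpenError("circuit open; call rejected without running")
    future = _pool.submit(fn)
    try:
        result = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The hang becomes a countable failure. Note the worker thread is
        # not killed; the caller simply stops waiting for it.
        breaker.record_failure()
        raise
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

Without the `timeout=` argument, `future.result()` would block for the full duration of the hang and `record_failure` would never run, which is the silent-breaker failure mode described above.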
The fallback is the answer
Once the breaker is open, every call is rejected instantly — but rejected into what? A raw exception bubbling to the user is a hard failure; the breaker only converted a slow failure into a fast one. The value comes from giving the caller a fallback: a useful answer when the real one is unavailable. Common fallbacks, roughly best to worst:
- A degraded-but-correct response. Hide the recommendations carousel and render the rest of the page. The user loses a feature, not the page.
- Last-known-good cache. Serve a slightly stale value (yesterday’s recommendations, a cached price) instead of nothing.
- A static default. A generic “popular items” list, an empty array, a sensible zero.
- Queue for later. For writes, accept the request and process it asynchronously once the dependency recovers (the outbox pattern from the idempotency unit).
- Fail fast with a clear error. When no fallback is meaningful, a clean 503 with Retry-After beats a hung 30-second request.
The art is choosing per call site. A missing recommendations carousel should never fail the page; a missing payment authorization must fail the order, because a fake success here is worse than an error.
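The per-call-site choice can be sketched as two handlers that react differently to the same open breaker. Everything here is hypothetical: `CircuitOpenError` stands in for whatever a breaker wrapper raises, the dependency calls are passed in as plain functions, and the 30-second Retry-After value is an arbitrary placeholder.

```python
class CircuitOpenError(Exception):
    """Raised by a (hypothetical) breaker wrapper when the circuit is open."""


def homepage(fetch_recommendations, cache):
    """Non-critical feature: degrade the page, never fail it."""
    page = {"header": "Welcome", "recommendations": None, "degraded": False}
    try:
        recs = fetch_recommendations()
        cache["recs"] = recs  # refresh the last-known-good copy
        page["recommendations"] = recs
    except CircuitOpenError:
        page["degraded"] = True
        # Serve the stale copy if one exists; None means "hide the carousel".
        page["recommendations"] = cache.get("recs")
    return page


def place_order(authorize_payment, order):
    """Critical call: no fake success, fail fast with a clean 503."""
    try:
        auth_id = authorize_payment(order)
    except CircuitOpenError:
        headers = {"Retry-After": "30"}  # placeholder value
        return 503, headers, {"error": "payment service unavailable"}
    return 201, {}, {"order": order, "auth": auth_id}
```

The `degraded` flag is there so the fallback can be counted rather than silently absorbed.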
Degradation should be rare, and shedding is the floor
A subtle point from Google’s SRE practice: graceful degradation should not trigger often. A fallback path that runs constantly becomes load-bearing without ever being scrutinized, hides the real failure rate, and can mask a chronic problem until it becomes acute. Degradation is for genuine incidents, and you should alert when too many servers enter degraded or fallback modes, because that frequency is itself a signal.
When even fallbacks cannot keep up — the whole service is overloaded, not just one dependency — the last line is load shedding: deliberately reject a fraction of requests with a 503 to protect the rest, rather than letting everything degrade into timeouts. Google’s SRE book pairs this with LIFO/CoDel-style queuing that drops requests already queued long enough (~10 s) to have missed their deadline anyway — there is no point spending capacity on a request the user has given up on.
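A simplified sketch of deadline-aware shedding follows. It is not CoDel itself, only a LIFO queue that rejects work when full and discards entries that have waited past a deadline. The `SheddingQueue` class, its parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time
from collections import deque


class SheddingQueue:
    """Bounded LIFO queue that drops requests older than max_wait_s."""

    def __init__(self, max_depth=100, max_wait_s=10.0, clock=time.monotonic):
        self.max_depth = max_depth
        self.max_wait_s = max_wait_s
        self._clock = clock
        self._items = deque()  # (enqueue_time, request), oldest on the left
        self.dropped = 0

    def offer(self, request):
        """Return False when the queue is full; the caller answers 503."""
        if len(self._items) >= self.max_depth:
            return False
        self._items.append((self._clock(), request))
        return True

    def take(self):
        """Return the newest live request, or None if nothing is left."""
        now = self._clock()
        # Discard entries whose caller has most likely given up already.
        while self._items and now - self._items[0][0] > self.max_wait_s:
            self._items.popleft()
            self.dropped += 1
        if not self._items:
            return None
        _, request = self._items.pop()  # LIFO: serve the freshest first
        return request
```

Serving newest-first means that under overload the requests most likely to still have a waiting user get the capacity, while the oldest ones age out and are dropped.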
Why this works
Why is a fallback that runs all the time a problem rather than a feature? Because a constantly-firing fallback quietly redefines “working.” If the homepage serves stale recommendations from cache every single request because the live service has been broken for a week, the page looks fine, the dashboards look fine, and nobody is paged — the failure has been laundered into normal operation. Three things rot underneath. First, the real dependency’s failure rate is now invisible, so a small problem grows into a large one undetected. Second, the fallback path itself is now load-bearing yet rarely scrutinized; the day the cache also fails, you discover the fallback had a bug nobody hit in months. Third, you have lost the signal that distinguishes “degraded” from “healthy,” which is the signal on-call decisions depend on. The discipline is to treat fallbacks as exceptional and instrument their rate: a fallback firing 0.1% of the time during a blip is the system working as designed, while the same fallback firing 40% of the time is an incident wearing a healthy mask. This is why mature teams alert on degradation frequency, not just on hard errors — the absence of errors is not the same as health when a fallback is silently absorbing the failures.
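Instrumenting the fallback rate can be as small as a sliding window over recent requests. The sketch below is illustrative only: `FallbackRateMonitor`, the window size, and the 5% alert threshold are assumptions, and a real system would more likely emit a metric and alert from the monitoring stack.

```python
from collections import deque


class FallbackRateMonitor:
    """Tracks what fraction of the last `window` requests used a fallback."""

    def __init__(self, window=1000, alert_threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = served by fallback
        self.alert_threshold = alert_threshold

    def record(self, used_fallback):
        self.outcomes.append(bool(used_fallback))

    @property
    def rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # A blip stays under the threshold; a sustained outage crosses it
        # even though no request returned a hard error.
        return self.rate > self.alert_threshold
```

With these placeholder settings, one fallback in a thousand requests stays quiet, while four hundred in a thousand triggers the alert.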
| Layer | Trigger | What it returns | When to use |
|---|---|---|---|
| Timeout | Call exceeds budget (~1 s) | Failure fed to breaker | Always, on every network call |
| Degraded response | Breaker open | Page minus the broken feature | Non-critical feature |
| Stale cache | Breaker open | Last-known-good value | Read tolerant of staleness |
| Static default | Breaker open | Generic safe value | No fresher option |
| Fail-fast 503 | Breaker open, no fallback | Clean error + Retry-After | Critical call (e.g. payment) |
| Load shedding | Whole service overloaded | 503 for a fraction | Last line, protect the rest |
A breaker is added to a flaky service but never trips during an outage. The dependency isn't erroring — it's taking 30 s per call. What's missing?
Why do experienced teams alert when a graceful-degradation fallback fires too often, instead of treating it as the system simply working?
1. Why is a timeout essential for a circuit breaker to work, and what fallback options does an open breaker have?
2. Why should graceful degradation be rare, and what is load shedding?
A breaker is only as good as the two pieces around it. The timeout is the trigger: because the breaker counts failures and the dangerous slow dependency produces none, a hang without a timeout is invisible and the breaker sits closed through the outage — Hystrix’s default 1-second timeout is what converts a hang into a countable failure, and a breaker over an un-timed-out dependency silently does nothing. The fallback is the answer: an open breaker rejects instantly, and the value is what the caller returns instead — a degraded-but-correct page, a last-known-good cache, a static default, a queued write via the outbox, or a clean 503 with Retry-After — chosen per call site, since a missing carousel must never fail the page while a missing payment authorization must fail the order. Graceful degradation should be rare, because a constantly-firing fallback hides the true failure rate and leaves an untested path load-bearing, so alert on its frequency. And when the whole service is overloaded, load shedding with deadline-aware queuing is the floor. Everything so far assumed one process — the final lesson scales out to many instances, where breakers flap, herd on half-open, and collide with retries.