
Backend Architecture

Circuit breakers: contain a cascading failure

Crux: a hands-on project. Build a service that cascades when a dependency slows, then add a breaker, bulkhead, timeout, and fallback until one sick downstream no longer takes the whole service down, proving each step with numbers.
◷ 240 min

Reading about cascading failures is not the same as pulling a service out of one. Build a small service with three downstreams, slow the least important one until the whole service falls over, then add the unit’s tools one at a time — timeout, breaker, bulkhead, fallback — and prove with numbers that one sick downstream can no longer take everything down.

Goal

Turn the unit’s mental model into a reproducible engineering loop: reproduce a cascade from a slow dependency, then layer in fast-fail, isolation, and degradation until the blast radius is contained — verifying each defence with before/after measurements under identical load.
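The before/after comparison needs the same summary computed the same way on every run. The following is a minimal sketch of such a helper, assuming latencies are collected as one number per request in milliseconds; the names `percentile`, `summarize`, `RouteStats`, and `format_row` are hypothetical, and the nearest-rank percentile is one common definition among several. Pool saturation is not covered here and would need its own gauge.

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    route: str
    p99_ms: float
    error_rate: float


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(route, latencies_ms, errors):
    """latencies_ms: one entry per request; errors: count of failed requests."""
    return RouteStats(route, percentile(latencies_ms, 99), errors / len(latencies_ms))


def format_row(before, after):
    """One line of the before/after table for a single route."""
    return (
        f"{before.route:<16} "
        f"p99 {before.p99_ms:>7.0f} -> {after.p99_ms:>7.0f} ms   "
        f"errors {before.error_rate:>6.1%} -> {after.error_rate:>6.1%}"
    )
```

Feeding both runs through the same `summarize` call keeps the comparison honest: only the defences change between runs, not the way the numbers are computed.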

Project
Objective

Build a service that calls three downstreams from a shared worker pool, drive it into a cascading failure by slowing the least-critical one, then add a timeout, circuit breaker, bulkhead, and fallback until that slow dependency can no longer starve the critical routes — proving each step with measurements, not assertions.
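The starvation described above comes from every route drawing on one pool. A bulkhead gives each downstream its own small concurrency budget and rejects when that budget is used up. The sketch below is one possible shape, assuming a thread-based service; `Bulkhead`, `BulkheadFull`, and the compartment sizes are hypothetical names and values chosen for illustration.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a compartment has no free slot."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject at once rather than queue
        # behind calls that are stuck on a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per downstream (sizes are illustrative, not tuned).
recommendations = Bulkhead("recommendations", max_concurrent=2)
payments = Bulkhead("payments", max_concurrent=8)


def get_recommendations(fetch, fallback=()):
    """Return the fallback when the compartment is full."""
    try:
        return recommendations.call(fetch)
    except BulkheadFull:
        return fallback
```

With this arrangement a slowed recommendations dependency can hold at most its own two slots; calls routed through `payments` never compete for them.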

Requirements
Acceptance criteria
  • A before/after table: per-route p99 latency and error rate, plus pool saturation, measured under the same injected-fault load — not estimated.
  • Logs or a state trace showing the breaker moving closed to open on the injected fault, waiting the cooldown, probing in half-open with a limited number of trial calls, and returning to closed once the fault is removed.
  • With all defences on, the critical routes (payments, search) stay within SLO while recommendations is slowed for the entire test — the cascade is contained to the one compartment.
  • A one-paragraph write-up naming which defence stopped which failure: timeout converts the hang to a countable failure, breaker fast-fails it, bulkhead isolates the budget, fallback decides what to return.
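The state trace asked for in the criteria above can come from a breaker as small as the following sketch. It assumes a consecutive-failure threshold as the tripping rule (the project does not prescribe one) and a single-threaded caller; a threaded service would need a lock around the state changes. `CircuitBreaker`, `CircuitOpenError`, and all parameter names and defaults are hypothetical. The clock is injectable so the cooldown can be tested without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=10.0,
                 half_open_max_calls=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.half_open_max_calls = half_open_max_calls
        self._clock = clock
        self.state = "closed"
        self.trace = []  # (old_state, new_state) pairs, for the write-up
        self._failures = 0
        self._opened_at = 0.0
        self._trial_calls = 0
        self._trial_successes = 0

    def _transition(self, new_state):
        self.trace.append((self.state, new_state))
        self.state = new_state

    def _open(self):
        self._opened_at = self._clock()
        self._transition("open")

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("open: failing fast")
            # Cooldown elapsed: allow a limited number of probes.
            self._transition("half_open")
            self._trial_calls = 0
            self._trial_successes = 0
        if self.state == "half_open":
            if self._trial_calls >= self.half_open_max_calls:
                raise CircuitOpenError("half-open: trial quota used")
            self._trial_calls += 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half_open":
            self._open()  # a failed probe reopens immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._open()

    def _on_success(self):
        if self.state == "half_open":
            self._trial_successes += 1
            if self._trial_successes >= self.half_open_max_calls:
                self._failures = 0
                self._transition("closed")
        else:
            self._failures = 0
```

Logging each entry of `trace` with a timestamp would give the closed, open, half-open, closed sequence that the acceptance criteria ask to see.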
Senior stretch
  • Add a retry layer beneath the breaker and reproduce a retry storm, then show that three measures together bound the amplification: exponential backoff with jitter, an absolute per-process retry budget (e.g. 60/min), and keeping the breaker above the retries.
  • Scale to multiple instances behind a load balancer and observe per-instance breaker state — show one instance tripped while another keeps calling, then jitter the half-open probe timing and demonstrate the probes no longer herd.
  • Add a one-page on-call runbook: how to read the four panels, how to tell a cascade from a single-dependency failure, the order to apply the defences, and a verification checklist.
  • Add coordinated, priority-based load shedding with deadline-aware queuing, and show that under whole-service overload the fleet sheds low-priority traffic consistently instead of shuffling load between instances.
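For the first stretch item, the retry layer might look like the sketch below. It assumes "full jitter" backoff (a uniform delay between zero and the exponential ceiling) and a rolling-window budget shared by every caller in the process; `RetryBudget`, `backoff_delay`, `call_with_retries`, and the default values are hypothetical. The clock, sleep function, and random source are injectable so the behaviour can be checked deterministically.

```python
import random
import time


class RetryBudget:
    """Absolute cap on retries per rolling window, shared process-wide."""

    def __init__(self, max_retries=60, window_s=60.0, clock=time.monotonic):
        self.max_retries = max_retries
        self.window_s = window_s
        self._clock = clock
        self._stamps = []

    def try_spend(self):
        now = self._clock()
        # Drop retries that have aged out of the window.
        self._stamps = [t for t in self._stamps if now - t < self.window_s]
        if len(self._stamps) >= self.max_retries:
            return False
        self._stamps.append(now)
        return True


def backoff_delay(attempt, base_s=0.1, cap_s=5.0, rng=random):
    """Full jitter: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


def call_with_retries(fn, budget, max_attempts=3, sleep=time.sleep, rng=random):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            is_last = attempt == max_attempts - 1
            # Give up when attempts are exhausted or the budget is spent.
            if is_last or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

To keep the breaker above the retries, the whole `call_with_retries` invocation would be the function the breaker calls, so an open breaker rejects before any retry is attempted and one logical request counts as one failure.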
Recap

This is the loop you will run in every real resilience incident: reproduce the cascade so you can see it, then add the defences in order and prove each one with numbers — a timeout converts a hang into a countable failure, a breaker fast-fails the sick dependency, a bulkhead isolates its budget so it cannot starve the critical routes, and a fallback decides what the caller returns when the breaker is open. Doing it once on a toy service, with before/after measurements under identical load, makes the production version muscle memory — and the stretch goals carry it into the fleet, where retries, herding probes, and shedding need their own answers.
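The first defence in that order, turning a hang into a countable failure, can be sketched with the standard library alone. This assumes blocking downstream calls run on a `concurrent.futures` executor; `call_with_timeout` and `DependencyTimeout` are hypothetical names.

```python
import concurrent.futures


class DependencyTimeout(Exception):
    """A downstream call exceeded its deadline."""


def call_with_timeout(executor, timeout_s, fn, *args):
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started running yet;
        # a call already in progress keeps its worker thread.
        future.cancel()
        raise DependencyTimeout(f"no response within {timeout_s}s") from None
```

The timeout frees the caller, not the worker: the thread running the slow call stays occupied until the call returns. That is why the timeout alone does not contain the cascade, and why the breaker and bulkhead follow it in the order above.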

