Circuit breakers: contain a cascading failure

Hands-on project — build a service that cascades when a dependency slows, then add a breaker, bulkhead, timeout, and fallback until one sick downstream no longer takes the whole service down, proving each step with numbers.

Level: Senior · Estimated time: 240 min

Reading about cascading failures is not the same as pulling a service out of one. Build a small service with three downstreams, slow the least important one until the whole service falls over, then add the unit’s tools one at a time — timeout, breaker, bulkhead, fallback — and prove with numbers that one sick downstream can no longer take everything down.

Goal

Turn the unit’s mental model into a reproducible engineering loop: reproduce a cascade from a slow dependency, then layer in fast-fail, isolation, and degradation until the blast radius is contained — verifying each defence with before/after measurements under identical load.
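The before/after comparison needs the same summary computed the same way on every run. A minimal sketch of that summary, assuming you record one latency and one success/failure flag per request for each route; `RouteStats` and `summarize` are hypothetical names, and nearest-rank is only one of several valid percentile definitions:

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    p99_ms: float
    error_rate: float


def summarize(latencies_ms, errors):
    """Nearest-rank p99 and error rate for one route.

    latencies_ms: one latency per request, in milliseconds.
    errors: parallel sequence of booleans, True when the request failed.
    """
    if not latencies_ms or len(latencies_ms) != len(errors):
        raise ValueError("need one error flag per latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # ceil(0.99 * n) in integer arithmetic, 1-based
    return RouteStats(p99_ms=ordered[rank - 1], error_rate=sum(errors) / n)
```

Run it once per route on the baseline run and once per defence, under the same injected fault, so the rows of the table are directly comparable.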

Project

Objective

Build a service that calls three downstreams from a shared worker pool, drive it into a cascading failure by slowing the least-critical one, then add a timeout, circuit breaker, bulkhead, and fallback until that slow dependency can no longer starve the critical routes — proving each step with measurements, not assertions.
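Why slowing the least-critical downstream starves everything else follows from Little's law: average workers in use equals request rate times time per request. The sketch below works the arithmetic with made-up numbers; the pool size, request rates, and latencies are all assumptions chosen for illustration, not measurements:

```python
POOL_SIZE = 50  # hypothetical size of the shared worker pool


def workers_in_use(rps, latency_s):
    """Little's law: average concurrency = arrival rate x time in system."""
    return rps * latency_s


def pool_demand(routes):
    """Total workers the shared pool would need to keep up with every route."""
    return sum(workers_in_use(rps, latency_s) for rps, latency_s in routes.values())


# route -> (requests per second, seconds per downstream call); illustrative values
healthy = {
    "payments": (100, 0.05),
    "search": (200, 0.05),
    "recommendations": (100, 0.05),
}
# Same traffic, but recommendations now takes 5 s per call.
slowed = dict(healthy, recommendations=(100, 5.0))

for name, routes in (("healthy", healthy), ("slowed", slowed)):
    demand = pool_demand(routes)
    print(f"{name}: needs {demand:.0f} workers, pool has {POOL_SIZE}")
```

With these numbers the slowed route alone asks for far more workers than the pool holds, so payments and search queue behind it even though their own downstreams are healthy.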

Requirements

Acceptance criteria

A before/after table: per-route p99 latency and error rate, plus pool saturation, measured under the same injected-fault load — not estimated.
Logs or a state trace showing the breaker moving from closed to open on the injected fault, waiting out the cooldown, probing in half-open with a limited number of trial calls, and returning to closed once the fault is removed.
With all defences on, the critical routes (payments, search) stay within SLO while recommendations is slowed for the entire test — the cascade is contained to the one compartment.
A one-paragraph write-up naming which defence stopped which failure: timeout converts the hang to a countable failure, breaker fast-fails it, bulkhead isolates the budget, fallback decides what to return.
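The state trace asked for above can come from a breaker as small as the following. This is a sketch under stated assumptions, not a reference implementation: it trips on consecutive failures (a rolling error rate is an equally valid policy), it is not thread-safe, and the class name, parameters, and defaults are hypothetical:

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=10.0,
                 half_open_max_calls=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock              # injectable so tests can fake time
        self.state = "closed"
        self.trace = ["closed"]         # state history for the write-up
        self.failures = 0
        self.opened_at = 0.0
        self.trial_calls = 0
        self.trial_successes = 0

    def _transition(self, new_state):
        self.state = new_state
        self.trace.append(new_state)

    def _open(self):
        self.opened_at = self.clock()
        self._transition("open")

    def allow(self):
        """Return True if the caller may contact the dependency now."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                return False            # fast-fail while cooling down
            self.trial_calls = 0
            self.trial_successes = 0
            self._transition("half_open")
        if self.state == "half_open":
            if self.trial_calls >= self.half_open_max_calls:
                return False            # only a limited number of probes
            self.trial_calls += 1
        return True

    def record_success(self):
        if self.state == "half_open":
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_max_calls:
                self.failures = 0
                self._transition("closed")
        else:
            self.failures = 0

    def record_failure(self):
        if self.state == "half_open":
            self._open()                # a failed probe reopens immediately
        elif self.state == "closed":
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()
```

Callers check `allow()` before each downstream call and report the outcome with `record_success()` or `record_failure()`; a timeout should be reported as a failure. Printing `trace` after a run gives the closed, open, half_open, closed sequence the criterion asks for.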

Senior stretch

Add a retry layer beneath the breaker and reproduce a retry storm. Then show that three measures together bound the amplification: exponential backoff with jitter, an absolute per-process retry budget (e.g. 60/min), and keeping the breaker above the retries so that an open breaker stops retries from being attempted at all.
Scale to multiple instances behind a load balancer and observe per-instance breaker state — show one instance tripped while another keeps calling, then jitter the half-open probe timing and demonstrate the probes no longer herd.
Add a one-page on-call runbook: how to read the four panels, how to tell a cascade from a single-dependency failure, the order to apply the defences, and a verification checklist.
Add coordinated, priority-based load shedding with deadline-aware queuing, and show that under whole-service overload the fleet sheds low-priority traffic consistently instead of shuffling load between instances.
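For the retry stretch goal, the backoff and the budget can be sketched as below. The names `RetryBudget`, `backoff_delay`, and `call_with_retries` are hypothetical, the budget uses a sliding window (a token bucket would also work), the jitter is the "full jitter" variant, and the code is single-threaded for brevity:

```python
import random
import time
from collections import deque


class RetryBudget:
    """Absolute cap on retries per sliding window, shared by the whole process."""

    def __init__(self, max_retries=60, window_s=60.0, clock=time.monotonic):
        self.max_retries = max_retries
        self.window_s = window_s
        self.clock = clock
        self.spent = deque()  # timestamps of retries still inside the window

    def try_spend(self):
        now = self.clock()
        while self.spent and now - self.spent[0] >= self.window_s:
            self.spent.popleft()
        if len(self.spent) >= self.max_retries:
            return False
        self.spent.append(now)
        return True


def backoff_delay(attempt, base_s=0.1, cap_s=5.0):
    """Full jitter: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


def call_with_retries(fn, budget, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying on any exception while attempts and budget remain."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # out of attempts or out of budget: surface the failure
            sleep(backoff_delay(attempt))
```

To keep the breaker above the retries, wrap the whole `call_with_retries` invocation in the breaker check rather than each attempt. Counting downstream calls with and without the budget under the same fault gives the amplification factor to report.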

Recap

This is the loop you will run in every real resilience incident: reproduce the cascade so you can see it, then add the defences in order and prove each one with numbers — a timeout converts a hang into a countable failure, a breaker fast-fails the sick dependency, a bulkhead isolates its budget so it cannot starve the critical routes, and a fallback decides what the caller returns when the breaker is open. Doing it once on a toy service, with before/after measurements under identical load, makes the production version muscle memory — and the stretch goals carry it into the fleet, where retries, herding probes, and shedding need their own answers.
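The remaining three defences from the recap (timeout, bulkhead, fallback) can be combined in one per-dependency wrapper. The sketch below is one way to do it with the standard library, assuming blocking downstream calls; `Compartment` is a hypothetical name, and note that a timed-out call keeps occupying its slot until the underlying function actually returns, which is exactly why the slot limit matters:

```python
import concurrent.futures as cf
import threading


class Compartment:
    """Per-dependency bulkhead: own threads, bounded in-flight calls, timeout, fallback."""

    def __init__(self, max_concurrent, timeout_s, fallback):
        self.pool = cf.ThreadPoolExecutor(max_workers=max_concurrent)
        self.slots = threading.BoundedSemaphore(max_concurrent)
        self.timeout_s = timeout_s
        self.fallback = fallback  # zero-argument callable returning the degraded answer

    def call(self, fn, *args):
        # Bulkhead: if every slot is busy, degrade at once instead of queueing.
        if not self.slots.acquire(blocking=False):
            return self.fallback()

        def run():
            try:
                return fn(*args)
            finally:
                self.slots.release()  # slot is freed only when the work really ends

        try:
            future = self.pool.submit(run)
        except RuntimeError:
            self.slots.release()      # pool was shut down; nothing was started
            return self.fallback()
        try:
            # Timeout: stop waiting after timeout_s and count it as a failure.
            return future.result(timeout=self.timeout_s)
        except Exception:
            # Covers the timeout and any error raised by the dependency.
            return self.fallback()
```

Giving recommendations its own `Compartment` while payments and search keep theirs means a slow recommendations call can exhaust only its own slots. In the full project the breaker check would sit in front of `call`, with timeouts and errors reported to it as failures.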
