Backend Architecture
Putting it together: build a resilient backend service
Reading about the cascade is not the same as building a service that survives it. Stand up a small payment-style service that wires all seven mechanisms onto one code path, then inject the exact faults from this unit — a slow downstream, a retry storm, a deploy mid-load — and prove with numbers that it degrades gracefully instead of collapsing.
Turn the unit’s mental model into a working system: compose pooling, idempotent retries, a circuit breaker, graceful shutdown, and edge load shedding into one service, instrument the seams, and demonstrate under fault-injected load that it bends instead of breaking.
Build a small POST /charge service (Go, JVM, or Node) that fronts a flaky downstream "payment provider" and a database. Combine a bounded pool, idempotent retries, a circuit breaker, graceful shutdown, and edge load shedding, then prove under a fault-injected load test that the service keeps goodput high and loses no data instead of cascading into collapse.
- Under the slow-provider injection, p99 of unrelated requests stays bounded and the pool does not deadlock: the acquire timeout fires and the breaker trips, shown in the metrics rather than asserted.
- Under over-saturation load the shedder returns fast 503s, and goodput (requests completed within their deadline) stays high and never falls to zero. Demonstrate this with a before/after table of goodput, p99, and 503 rate.
- A retry-storm test shows capped retries plus the breaker prevent a metastable collapse: when the provider recovers, the service recovers on its own without a manual bounce.
- A SIGTERM during sustained load drops zero in-flight charges (the idempotency table shows no duplicates and no lost writes) and the drain completes inside the grace period, with drain duration logged.
- A short write-up naming, for each requirement, which mechanism closed which failure mode, and which gates you consciously left partly open given the service's blast radius.
- Add a one-page on-call runbook: triage from the RED + saturation + breaker dashboards, the order of intervention for a metastable cascade (shed, cap retries, reset breakers), and a verification checklist.
- Add a second downstream and a bulkhead so a fault in one dependency cannot consume the whole pool — show that exhausting one bulkhead leaves the other endpoint healthy.
- Define an SLO and error budget for /charge, compute the budget burn from the load-test metrics, and wire an alert that would gate feature shipping when the budget is nearly spent.
- Run the full readiness checklist from lesson 06 against your service, score each of the eight gates, and document the calibration to blast radius with explicit, honest caveats.
This is the system the whole track was building toward: one service where the pool, idempotent retries, the breaker, graceful shutdown, the shedder, and observability all act on the same request under the same load. Building it once, and watching your own fault injection drive it toward the cascade while the mechanisms hold goodput, lose no data, and self-recover, turns the unit’s mental model into something you can stand behind. When a real service is about to take traffic, you can say yes, and show how you know.