Backend Architecture
The service under overload: load shedding and graceful degradation
There is a load level above which your service cannot serve every request, no matter how it is written — and the only choice you have is how it fails when it gets there. The naive instinct is to try to serve everyone: accept every connection, queue every request, retry every failure. That instinct is exactly what produces the collapse from two lessons ago, because accepting work you cannot finish does not help the user whose request you accepted and steals the resources of the users whose requests you could have finished. The counterintuitive truth of overload is that rejecting some requests is how you serve the rest — a service that sheds 20% of load at the door and serves the other 80% fast is vastly healthier than one that accepts 100% and serves all of it too slowly to be useful, then falls over and serves 0%. This is the shift from thinking about throughput (requests handled) to goodput (requests handled usefully, within their deadline). And here is the synthesis: every mechanism you learned is secretly a load-control knob. The pool bounds concurrent work. Timeouts free resources from doomed requests. The breaker sheds load from a failing dependency. Shutdown sheds load from a departing instance. Overload is the condition that makes all of them act at once, and operating a backend means tuning them so the service bends instead of breaking.
Throughput is the wrong target; goodput is the right one
The metric that matters under overload is not how many requests you accept but how many you complete usefully — fast enough that the answer still matters to the caller. Call that goodput. A request that you accept, hold for 8 seconds, and finally answer after the client has already timed out and retried is pure waste: it consumed a connection, a loop slot, and CPU, and produced negative value, because it also delayed real work. Beyond the saturation point, raising accepted load lowers goodput — the classic overload curve where throughput climbs, peaks, then falls off a cliff as the system spends itself on work that no longer matters. The entire discipline of overload handling is keeping the system at the top of that curve instead of sliding down the far side, and that means refusing work you cannot complete in time.
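As a minimal sketch of the distinction, the two metrics can be computed from the same set of completed requests. The `Completed` record and the `throughput_and_goodput` helper are hypothetical names introduced for illustration, and the 2-second deadline is an assumed figure; only the 8-second late answer comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Completed:
    latency_s: float   # time from arrival to response
    deadline_s: float  # how long the caller was willing to wait (assumed known)


def throughput_and_goodput(completed, window_s):
    """Return (throughput, goodput) in requests per second over the window."""
    total = len(completed)
    useful = sum(1 for r in completed if r.latency_s <= r.deadline_s)
    return total / window_s, useful / window_s


# Ten responses in a one-second window; four were answered after 8 s,
# long after the caller's assumed 2 s deadline had passed.
sample = [Completed(0.2, 2.0)] * 6 + [Completed(8.0, 2.0)] * 4
throughput, goodput = throughput_and_goodput(sample, window_s=1.0)
print(throughput, goodput)  # 10.0 6.0
```

Both numbers count the same responses; goodput simply refuses to credit the ones that arrived too late to matter.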
Load shedding: reject early, reject cheaply
Load shedding is the deliberate, early rejection of excess requests so the accepted ones succeed. Three properties make it work:
- Early. Shed at the edge, before the request consumes a pooled connection or a downstream call. A request rejected at admission costs almost nothing; a request rejected after it has acquired resources has already done the damage. This is why an admission check or concurrency limiter sits in front of the expensive layers.
- Cheap and honest. The rejection is a fast `503 Service Unavailable` with a `Retry-After`, not a slow error. It tells the client “don’t hammer me,” which is the opposite of the silent slowness that triggers retry storms.
- Prioritized. Not all load is equal. Shed the least valuable first: background refreshes before user-facing reads, anonymous before paid, retries before first attempts. A good shedder drops the right 20%, preserving goodput where it counts.
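The three properties above can be sketched as a small admission check. This is an illustration under stated assumptions, not a production limiter: `AdmissionController`, `Priority`, `handle`, and the 50% / 75% / 100% thresholds are all hypothetical, and the response is modeled as a plain `(status, headers, body)` tuple rather than any particular framework's type.

```python
import threading
from enum import IntEnum


class Priority(IntEnum):
    BACKGROUND = 0  # e.g. cache refreshes
    RETRY = 1       # a repeat of an earlier attempt
    USER = 2        # first attempt of a user-facing request


class AdmissionController:
    """Cap in-flight work and refuse low-priority requests first as it fills."""

    # Hypothetical thresholds: the fraction of capacity each class may use.
    _FRACTION = {Priority.BACKGROUND: 0.5, Priority.RETRY: 0.75, Priority.USER: 1.0}

    def __init__(self, max_in_flight, retry_after_s=1):
        self.max_in_flight = max_in_flight
        self.retry_after_s = retry_after_s
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self, priority):
        limit = int(self.max_in_flight * self._FRACTION[priority])
        with self._lock:
            if self._in_flight >= limit:
                return False  # shed: nothing has been acquired yet
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(controller, priority, work):
    """Return (status, headers, body). Rejection happens before `work` runs."""
    if not controller.try_admit(priority):
        return 503, {"Retry-After": str(controller.retry_after_s)}, b""
    try:
        return 200, {}, work()
    finally:
        controller.release()


controller = AdmissionController(max_in_flight=4)
order = (Priority.USER, Priority.USER, Priority.BACKGROUND,
         Priority.RETRY, Priority.RETRY, Priority.USER, Priority.USER)
admitted = [controller.try_admit(p) for p in order]
print(admitted)  # [True, True, False, True, False, True, False]

# The controller is now full, so even a user request gets the cheap rejection.
rejected = handle(controller, Priority.USER, lambda: b"ok")
print(rejected)  # (503, {'Retry-After': '1'}, b'')
```

The rejection path touches only a counter and a lock, which is what keeps it cheap. The priority thresholds mean background work and retries are turned away while there is still headroom for first-attempt user traffic.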
Backpressure: pushing the limit upstream
The dual of shedding is backpressure: instead of accepting work and dropping it, you signal upstream to slow down. A bounded queue that refuses new entries when full, a pool that makes acquisition fail fast instead of queueing unboundedly, an HTTP layer that stops reading the socket — each propagates “I am full” back toward the source, so the pressure is felt where it can actually be reduced (the client backs off, the upstream service routes elsewhere) rather than absorbed silently until something breaks. Shedding and backpressure are two halves of the same idea: make the limit explicit and enforce it at the boundary, rather than letting an implicit limit (memory, file descriptors, the pool) enforce itself by collapsing.
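A bounded queue that refuses new entries when full is the simplest of these to sketch. The example below uses Python's standard `queue.Queue`; the `BoundedInbox` wrapper and its `offer` / `take` names are hypothetical, chosen only to make the "I am full" signal explicit to the producer.

```python
import queue


class BoundedInbox:
    """A work queue with an explicit limit that is reported to the producer."""

    def __init__(self, capacity):
        if capacity <= 0:
            # queue.Queue treats maxsize <= 0 as unbounded, which would
            # silently remove the limit this class exists to enforce.
            raise ValueError("capacity must be positive")
        self._q = queue.Queue(maxsize=capacity)

    def offer(self, item):
        """Enqueue without blocking; return False when full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            return False  # the caller must slow down, reroute, or reject

    def take(self):
        """Remove and return the oldest item; raises queue.Empty if none."""
        return self._q.get_nowait()


inbox = BoundedInbox(capacity=2)
accepted = [inbox.offer(n) for n in range(3)]  # third offer is refused
first = inbox.take()                           # a consumer frees one slot
accepted_after_drain = inbox.offer(3)
print(accepted, first, accepted_after_drain)   # [True, True, False] 0 True
```

The important design choice is that `offer` returns immediately instead of blocking or growing the queue, so the producer learns about the limit at the moment it is hit.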
Every mechanism is a load-control knob
This is the synthesis the whole unit has been building toward. The seven mechanisms are not just correctness tools; they are the actuators of load control, and overload is when they combine:
- Pool / concurrency limit — the hard cap on work in flight. The single most important overload defense: it converts “unbounded slowdown” into “bounded work plus explicit rejection.”
- Timeouts — free resources from requests that will not finish in time, so doomed work stops stealing capacity from viable work.
- Circuit breaker — sheds load from a failing dependency, which is overload localized to one downstream.
- Retries (bounded, with backoff and jitter) — the load amplifier you must cap, because uncapped retries turn overload into collapse.
- Graceful shutdown — sheds load from a departing instance without dropping its in-flight work onto the floor.
- Observability — the feedback that tells you where you are on the goodput curve, so shedding can be triggered by real saturation, not guesses.
Tuning these together — pool sized to the slowest dependency, timeouts nested in a budget, retries capped, a shedder at the edge keyed off saturation metrics — is what makes a service degrade gracefully: serve less, but serve it well, and never fall to zero.
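Two of those tuning choices, timeouts nested in a budget and capped retries with backoff and jitter, can be sketched together. Everything here is an illustrative assumption: the `Deadline` class, the `call_with_retries` helper, the default values, and the convention that the wrapped operation takes its timeout as an argument. The clock, sleep, and random source are injectable so the example runs deterministically.

```python
import random
import time


class Deadline:
    """A request-wide time budget that every nested call draws from."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, per_call_cap_s):
        # A nested call may never wait longer than what is left overall.
        return min(per_call_cap_s, self.remaining())


def call_with_retries(op, deadline, max_attempts=3, per_call_cap_s=0.5,
                      base_backoff_s=0.05, sleep=time.sleep, rng=random.random):
    """Run op(timeout_s) with a capped attempt count and full-jitter backoff."""
    last_error = None
    for attempt in range(max_attempts):
        timeout_s = deadline.timeout_for(per_call_cap_s)
        if timeout_s <= 0:
            break  # budget exhausted: stop instead of adding doomed load
        try:
            return op(timeout_s)
        except Exception as exc:  # a real client would retry only transient errors
            last_error = exc
        if attempt + 1 < max_attempts:
            # Full jitter: a random delay up to an exponentially growing cap.
            backoff = rng() * base_backoff_s * (2 ** attempt)
            sleep(min(backoff, deadline.remaining()))
    raise TimeoutError("retry or time budget exhausted") from last_error


# Deterministic demo: a fake clock, and an operation that fails twice.
now = [0.0]
attempts = []


def fake_sleep(seconds):
    now[0] += seconds


def flaky(timeout_s):
    attempts.append(timeout_s)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "ok"


deadline = Deadline(1.0, clock=lambda: now[0])
result = call_with_retries(flaky, deadline, sleep=fake_sleep, rng=lambda: 1.0)
print(result, attempts)  # ok [0.5, 0.5, 0.5]
```

The attempt cap bounds how much one request can amplify load, and the shared deadline ensures that retries stop as soon as the answer could no longer arrive in time.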
Why this works
Why is rejecting requests — deliberately failing some users — the correct behavior under overload, when it feels like the one thing a service exists to avoid? Because the alternative is not “serve everyone,” it is “serve no one,” and the arithmetic is unforgiving. A server has a finite capacity C of useful work per second. When demand D exceeds C, you cannot complete D; you can only choose what happens to the D − C requests you have no capacity for. If you accept them anyway, they do not vanish — they sit in queues and hold resources while they wait, which means even the C requests you could have served now wait behind them, so their latency rises past the point of usefulness too. Accepting work beyond capacity does not add served requests; it subtracts them, because the excess degrades the requests that were within capacity. This is the cruel inversion at the heart of overload: past the saturation point, trying harder to serve everyone serves fewer people, because the bottleneck resource gets spent on coordination, queueing, and timed-out work instead of completion. Shedding the D − C excess at the door — cheaply, before it acquires anything — is what protects the C: it keeps the queue short, so accepted requests stay fast, so they finish within their deadline and become goodput instead of waste. The reason it feels wrong is that a single rejected request looks like a failure, while the cost it would have imposed on others is diffuse and invisible — you see the 503 you returned but not the fifty timeouts you prevented. Senior judgment is precisely the ability to value the invisible many over the visible one: to shed the least-valuable load early and on purpose, so that the system stays on the productive side of the overload curve. 
And it connects every prior lesson — the pool is the mechanism that makes C explicit, the timeout is what enforces the deadline that defines goodput, the breaker is shedding applied to a dependency, and observability is what tells you D has crossed C in time to act. Overload handling is not a separate feature bolted on; it is what all seven mechanisms are, seen from the angle of a system being asked for more than it can give.
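The capacity arithmetic above can be checked with a toy simulation. This is a deliberately simplified model under stated assumptions: time advances in discrete ticks, demand D and capacity C are constant, requests are served first-in first-out, and a response counts as goodput only if it waited at most `deadline_ticks`. The `simulate` function and the specific numbers (D = 12, C = 10) are hypothetical.

```python
from collections import deque


def simulate(demand, capacity, deadline_ticks, ticks, max_queue=None):
    """Return how many requests completed within their deadline (goodput).

    Each tick, `demand` requests arrive and up to `capacity` are served FIFO.
    With max_queue set, arrivals that would overflow the queue are shed.
    """
    waiting = deque()  # arrival tick of each queued request
    good = 0
    for now in range(ticks):
        for _ in range(demand):
            if max_queue is None or len(waiting) < max_queue:
                waiting.append(now)
            # otherwise the request is shed at the door and costs nothing more
        for _ in range(min(capacity, len(waiting))):
            arrived = waiting.popleft()
            if now - arrived <= deadline_ticks:
                good += 1
    return good


# D = 12 per tick against C = 10 per tick, for 100 ticks, deadline of 2 ticks.
accept_everything = simulate(12, 10, deadline_ticks=2, ticks=100)
shed_at_the_door = simulate(12, 10, deadline_ticks=2, ticks=100, max_queue=10)
print(accept_everything, shed_at_the_door)  # 150 1000
```

Both runs serve the same 1000 requests. Without a limit the queue grows by two requests every tick, so after the first stretch almost every response is late; with the queue capped at C, the two excess arrivals per tick are refused and every served request is on time. The model omits the extra cost of holding resources while queued, so a real service would likely fare worse than the unshed figure suggests.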
| Mechanism | Its load-control role | What it prevents |
|---|---|---|
| Pool / concurrency cap | Hard bound on work in flight | Unbounded slowdown from accepting everything |
| Timeout | Free resources from doomed work | Doomed requests stealing viable capacity |
| Circuit breaker | Shed load from a failing dependency | Hammering a downstream that’s already down |
| Bounded retries + backoff | Cap the load amplifier | Retry storm turning overload into collapse |
| Load shedding (edge) | Reject excess cheaply, early | Queue buildup that erases goodput |
| Backpressure | Signal upstream to slow down | Silent absorption until something breaks |
| Graceful shutdown | Shed load from a departing node | Dropped in-flight work on deploy |
A service is past its saturation point: demand exceeds capacity. It currently accepts and queues every request. Why does adding deliberate load shedding (fast 503s at the edge) increase the number of users served well?
Why is 'goodput' a better target than 'throughput' when handling overload?
Order the layers of overload defense from the request edge inward:
1. Shed excess at admission: fast 503 + Retry-After, dropping the least-valuable load first
2. Bound work in flight with a pool / concurrency cap so accepted load is finite
3. Enforce nested timeouts so doomed requests free their resources quickly
4. Apply the breaker to failing dependencies and cap retries so they can't amplify
- 01 Why is rejecting requests the correct behavior under overload, and what is goodput?
- 02 What makes load shedding effective, how does backpressure relate, and how is each mechanism a load-control knob?
Above a certain load no service can serve everyone, and the only real choice is how it fails — so the discipline of overload is to reject some requests on purpose so the rest succeed. The target shifts from throughput (accepted) to goodput (completed usefully within deadline), because past saturation accepted-but-unfinishable work clogs queues and drags the in-capacity requests past their deadlines too: serving everyone serves fewer. Load shedding rejects excess early, cheaply, and by priority; backpressure pushes the limit upstream so it is felt where it can be reduced — two halves of making the limit explicit at the boundary. And the synthesis lands: every mechanism in the track is a load-control knob — the pool bounds work, timeouts free doomed requests, the breaker sheds a failing dependency, capped retries tame the amplifier, shutdown sheds a departing node, observability tells you where you are on the curve — tuned together so the service degrades gracefully instead of collapsing. We have now seen the seven mechanisms cooperate, fail, be observed, and be tuned under pressure. The final lesson collects all of it into a production-readiness review: the checklist that turns the whole track into a launch gate.