Backend Architecture BE · 08 · 05

The service under overload: load shedding and graceful degradation

Every mechanism is also a load-control knob, and overload is where they combine: pool and concurrency caps bound work in flight, timeouts and breakers cut losses, and load shedding with backpressure rejects excess early, so the service degrades gracefully instead of collapsing.


There is a load level above which your service cannot serve every request, no matter how it is written, and the only choice you have is how it fails when it gets there. The naive instinct is to try to serve everyone: accept every connection, queue every request, retry every failure. That instinct is exactly what produces the collapse from two lessons ago, because accepting work you cannot finish does not help the user whose request you accepted, and it steals resources from the users whose requests you could have finished.

The counterintuitive truth of overload is that rejecting some requests is how you serve the rest. A service that sheds 20% of load at the door and serves the other 80% fast is vastly healthier than one that accepts 100%, serves all of it too slowly to be useful, then falls over and serves 0%. This is the shift from thinking about throughput (requests handled) to goodput (requests handled usefully, within their deadline).

And here is the synthesis: every mechanism you learned is secretly a load-control knob. The pool bounds concurrent work. Timeouts free resources from doomed requests. The breaker sheds load from a failing dependency. Shutdown sheds load from a departing instance. Overload is the condition that makes all of them act at once, and operating a backend means tuning them so the service bends instead of breaking.

Throughput is the wrong target; goodput is the right one

The metric that matters under overload is not how many requests you accept but how many you complete usefully, that is, fast enough that the answer still matters to the caller. Call that goodput. A request that you accept, hold for 8 seconds, and finally answer after the client has already timed out and retried is pure waste: it consumed a connection, a loop slot, and CPU, and it produced negative value because it also delayed real work. Beyond the saturation point, raising accepted load lowers goodput. This is the classic overload curve: goodput climbs with load, peaks at saturation, then falls off a cliff as the system spends itself on work that no longer matters. The entire discipline of overload handling is keeping the system at the top of that curve instead of sliding down the far side, and that means refusing work you cannot complete in time.

Below capacity, goodput tracks accepted load; the saturation point at 1×C is the peak. Past it, accepting more lowers goodput — the discipline of overload is staying at the top, not sliding down the cliff.
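The difference between the two metrics can be made concrete with a minimal sketch. The `Completed` record and both function names are hypothetical, chosen for illustration; the only idea taken from the lesson is that a response counts toward goodput only if it arrives within the caller's deadline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Completed:
    """One answered request; both fields are in seconds (hypothetical shape)."""
    latency: float   # time from arrival to response
    deadline: float  # how long the caller was willing to wait


def throughput(requests: list, window_s: float) -> float:
    """Requests answered per second, whether or not anyone was still listening."""
    return len(requests) / window_s


def goodput(requests: list, window_s: float) -> float:
    """Requests answered within their deadline, per second."""
    useful = sum(1 for r in requests if r.latency <= r.deadline)
    return useful / window_s


# Ten responses in a one-second window; four took 8 s against a 2 s deadline.
sample = [Completed(0.2, 2.0)] * 6 + [Completed(8.0, 2.0)] * 4
print(throughput(sample, 1.0), goodput(sample, 1.0))
```

On this sample the throughput figure counts all ten responses while the goodput figure counts only the six that beat their deadline, which is the gap a dashboard tracking only accepted or answered requests would hide.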

Load shedding: reject early, reject cheaply

Load shedding is the deliberate, early rejection of excess requests so the accepted ones succeed. Three properties make it work:

Early. Shed at the edge, before the request consumes a pooled connection or a downstream call. A request rejected at admission costs almost nothing; a request rejected after it has acquired resources has already done the damage. This is why an admission check or concurrency limiter sits in front of the expensive layers.
Cheap and honest. The rejection is a fast 503 Service Unavailable with a Retry-After header, not a slow error. It tells the client “don’t hammer me,” which is the opposite of the silent slowness that triggers retry storms.
Prioritized. Not all load is equal. Shed the least valuable first — background refreshes before user-facing reads, anonymous before paid, retries before first attempts. A good shedder drops the right 20%, preserving goodput where it counts.

Together these three properties mean that load shedding is not a fallback for when things go wrong — it is a first-class design decision you make before traffic arrives. Without the “early” property, the damage is already done by the time you shed; without “prioritized,” you may be protecting background jobs while dropping paying customers.
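The three properties above can be sketched as a small admission check that runs before any pooled resource is touched. Everything named here is an assumption made for illustration: the `Priority` tiers, the percentage thresholds, the `AdmissionController` class, and the two-second `Retry-After` value. The sketch is also not thread-safe; it assumes a single-threaded event loop.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    """Traffic tiers, most valuable first (hypothetical names)."""
    PAID_USER = 0
    ANONYMOUS = 1
    BACKGROUND = 2


# Percent of the in-flight cap at which each tier starts being shed.
# Illustrative assumptions, not recommended values.
SHED_AT_PERCENT = {
    Priority.BACKGROUND: 60,
    Priority.ANONYMOUS: 80,
    Priority.PAID_USER: 100,
}


@dataclass(frozen=True)
class Rejection:
    status: int
    headers: dict


class AdmissionController:
    """Caps work in flight and sheds the least valuable traffic first."""

    def __init__(self, max_in_flight: int, retry_after_s: int = 2) -> None:
        self.max_in_flight = max_in_flight
        self.retry_after_s = retry_after_s
        self.in_flight = 0

    def try_admit(self, priority: Priority) -> Optional[Rejection]:
        """Return None if admitted, or the rejection to send back immediately."""
        limit_percent = SHED_AT_PERCENT[priority]
        if self.in_flight * 100 >= self.max_in_flight * limit_percent:
            # Cheap and honest: a fast 503 that says when to come back.
            return Rejection(503, {"Retry-After": str(self.retry_after_s)})
        self.in_flight += 1
        return None

    def release(self) -> None:
        """Call when an admitted request finishes, successfully or not."""
        self.in_flight = max(0, self.in_flight - 1)
```

A handler would call `try_admit` first, return the rejection unchanged if one comes back, and otherwise do its work inside a `try`/`finally` that calls `release`. Because background traffic is refused at 60% of the cap and anonymous traffic at 80%, the last 20% of capacity stays reserved for the most valuable tier.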

Backpressure: pushing the limit upstream

The dual of shedding is backpressure: instead of accepting work and dropping it, you signal upstream to slow down. A bounded queue that refuses new entries when full, a pool that makes acquisition fail fast instead of queueing unboundedly, an HTTP layer that stops reading the socket — each propagates “I am full” back toward the source, so the pressure is felt where it can actually be reduced (the client backs off, the upstream service routes elsewhere) rather than absorbed silently until something breaks. Shedding and backpressure are two halves of the same idea: make the limit explicit and enforce it at the boundary, rather than letting an implicit limit (memory, file descriptors, the pool) enforce itself by collapsing.
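One way to picture "a bounded queue that refuses new entries when full" is the sketch below, which uses the standard library's `asyncio.Queue` with a `maxsize`. The `BoundedInbox` wrapper and the `Overloaded` exception are hypothetical names; the point is only that the limit is explicit and that a full queue fails fast instead of growing.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the inbox is full; callers map it to a fast rejection."""


class BoundedInbox:
    def __init__(self, max_pending: int) -> None:
        # maxsize makes the limit explicit; the default of 0 means unbounded.
        self._queue = asyncio.Queue(maxsize=max_pending)

    def offer(self, job: str) -> None:
        """Enqueue without waiting, or signal 'I am full' to the producer."""
        try:
            self._queue.put_nowait(job)
        except asyncio.QueueFull:
            raise Overloaded(f"inbox full, refusing {job!r}") from None

    async def take(self) -> str:
        """Consumer side: wait for the next job."""
        return await self._queue.get()


async def demo() -> tuple:
    inbox = BoundedInbox(max_pending=2)
    inbox.offer("a")
    inbox.offer("b")
    try:
        inbox.offer("c")
        refused = False
    except Overloaded:
        refused = True          # the producer learns about the limit right away
    first = await inbox.take()  # draining one job frees one slot
    inbox.offer("c")
    return refused, first
```

Running `asyncio.run(demo())` exercises both halves: the third offer is refused while the queue is full, and it succeeds once a consumer has taken a job. The design choice is that the producer, not the queue's memory footprint, absorbs the pressure.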

Every mechanism is a load-control knob

The synthesis the whole unit has been building toward: the seven mechanisms are not just correctness tools, they are the actuators of load control, and overload is when they combine:

Pool / concurrency limit — the hard cap on work in flight. The single most important overload defense: it converts “unbounded slowdown” into “bounded work plus explicit rejection.”
Timeouts — free resources from requests that will not finish in time, so doomed work stops stealing capacity from viable work.
Circuit breaker — sheds load from a failing dependency, which is overload localized to one downstream.
Retries (bounded, with backoff and jitter) — the load amplifier you must cap, because uncapped retries turn overload into collapse.
Graceful shutdown — sheds load from a departing instance without dropping its in-flight work onto the floor.
Observability — the feedback that tells you where you are on the goodput curve, so shedding can be triggered by real saturation, not guesses.

Tuning these together — pool sized to the slowest dependency, timeouts nested in a budget, retries capped, a shedder at the edge keyed off saturation metrics — is what makes a service degrade gracefully: serve less, but serve it well, and never fall to zero.
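As one illustration of knobs tuned together, the sketch below combines capped retries, exponential backoff with full jitter, and a single time budget that every attempt shares. The function name `call_with_budget`, its parameters, and the default values are assumptions made for this example, and `op` stands in for any downstream call that accepts a timeout in seconds.

```python
import random
import time


class BudgetExhausted(Exception):
    """Raised when attempts or time run out; the last error is chained as cause."""


def call_with_budget(op, *, budget_s, max_attempts=3, base_delay_s=0.05,
                     max_delay_s=1.0, sleep=time.sleep, rng=random.random):
    """Call op(timeout_s) with capped, jittered retries inside one deadline."""
    deadline = time.monotonic() + budget_s
    last_error = None
    attempts = 0
    while attempts < max_attempts:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the caller has stopped waiting; more work is pure waste
        attempts += 1
        try:
            # The nested timeout: each attempt gets only what is left.
            return op(remaining)
        except Exception as exc:  # a real client would catch only retryable errors
            last_error = exc
        if attempts == max_attempts:
            break
        # Full jitter: a random delay up to the capped exponential backoff.
        delay = rng() * min(max_delay_s, base_delay_s * 2 ** (attempts - 1))
        if delay >= deadline - time.monotonic():
            break  # sleeping would outlive the budget
        sleep(delay)
    raise BudgetExhausted(f"gave up after {attempts} attempt(s)") from last_error
```

The `sleep` and `rng` parameters exist only so the behavior can be exercised without real waiting. The cap on attempts bounds the amplification factor, and the shared deadline ensures a retry never runs on behalf of a caller who has already given up.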

Why this works

Why is rejecting requests, deliberately failing some users, the correct behavior under overload, when it feels like the one thing a service exists to avoid? Because the alternative is not “serve everyone,” it is “serve no one,” and the arithmetic is unforgiving. A server has a finite capacity C of useful work per second. When demand D exceeds C, you cannot complete D; you can only choose what happens to the D − C requests you have no capacity for. If you accept them anyway, they do not vanish. They sit in queues and hold resources while they wait, which means even the C requests you could have served now wait behind them, so their latency rises past the point of usefulness too. Accepting work beyond capacity does not add served requests; it subtracts them, because the excess degrades the requests that were within capacity.

This is the cruel inversion at the heart of overload: past the saturation point, trying harder to serve everyone serves fewer people, because the bottleneck resource gets spent on coordination, queueing, and timed-out work instead of completion. Shedding the D − C excess at the door, cheaply and before it acquires anything, is what protects the C: it keeps the queue short, so accepted requests stay fast, so they finish within their deadline and become goodput instead of waste.

The reason it feels wrong is that a single rejected request looks like a failure, while the cost it would have imposed on others is diffuse and invisible: you see the 503 you returned but not the fifty timeouts you prevented. Senior judgment is precisely the ability to value the invisible many over the visible one: to shed the least-valuable load early and on purpose, so that the system stays on the productive side of the overload curve.

This also connects every prior lesson: the pool is the mechanism that makes C explicit, the timeout is what enforces the deadline that defines goodput, the breaker is shedding applied to a dependency, and observability is what tells you D has crossed C in time to act. Overload handling is not a separate feature bolted on; it is what all seven mechanisms are, seen from the angle of a system being asked for more than it can give.
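The D versus C arithmetic can be checked with a toy simulation. This is a deliberately simplified model under stated assumptions: time advances in one-second ticks, the server completes at most `capacity` requests per tick from a FIFO queue, and a request is useful only if it waited at most `deadline_ticks`. The function name and all numbers are illustrative, and the model ignores the extra cost of holding connections and memory, so it understates the damage of accepting everything.

```python
from collections import deque


def simulate(capacity, demand, deadline_ticks, ticks, max_queue=None):
    """Count useful, wasted, and shed requests; max_queue=None accepts everything."""
    queue = deque()  # holds the arrival tick of each waiting request
    useful = wasted = shed = 0
    for now in range(ticks):
        for _ in range(demand):
            if max_queue is not None and len(queue) >= max_queue:
                shed += 1          # rejected at the door, nothing acquired
            else:
                queue.append(now)
        for _ in range(min(capacity, len(queue))):
            arrived = queue.popleft()
            if now - arrived <= deadline_ticks:
                useful += 1        # finished while the caller still cared
            else:
                wasted += 1        # capacity spent on an answer nobody wants
    return {"useful": useful, "wasted": wasted, "shed": shed}


# C = 100 per tick, D = 150 per tick, a 2-tick deadline, 60 ticks of overload.
accept_all = simulate(100, 150, 2, 60)
with_shedding = simulate(100, 150, 2, 60, max_queue=100)
print(accept_all, with_shedding)
```

With these assumed numbers the accept-everything run completes 750 requests within their deadline and wastes the other 5,250 it serves, because the backlog grows by 50 per tick until every answer is late. The shedding run rejects 3,000 requests and completes 6,000 in time, which is the full capacity of the server for the whole period.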

Mechanism	Its load-control role	What it prevents
Pool / concurrency cap	Hard bound on work in flight	Unbounded slowdown from accepting everything
Timeout	Free resources from doomed work	Doomed requests stealing viable capacity
Circuit breaker	Shed load from a failing dependency	Hammering a downstream that’s already down
Bounded retries + backoff	Cap the load amplifier	Retry storm turning overload into collapse
Load shedding (edge)	Reject excess cheaply, early	Queue buildup that erases goodput
Backpressure	Signal upstream to slow down	Silent absorption until something breaks
Graceful shutdown	Shed load from a departing node	Dropped in-flight work on deploy

Quiz

A service is past its saturation point: demand exceeds capacity. It currently accepts and queues every request. Why does adding deliberate load shedding (fast 503s at the edge) increase the number of users served well?

Quiz

Why is 'goodput' a better target than 'throughput' when handling overload?

Order the steps

Order the layers of overload defense from the request edge inward:

1 Shed excess at admission: fast 503 + Retry-After, dropping the least-valuable load first
2 Bound work in flight with a pool / concurrency cap so accepted load is finite
3 Enforce nested timeouts so doomed requests free their resources quickly
4 Apply the breaker to failing dependencies and cap retries so they can't amplify

Past saturation, accepting all load lowers goodput. Shedding the excess at admission keeps accepted requests fast and completable.

Key takeaway

Above some load level a service cannot serve every request, and the only choice is how it fails, so the counterintuitive truth of overload is that rejecting some requests is how you serve the rest. The right target is not throughput (requests accepted) but goodput (requests completed usefully, within their deadline). Past the saturation point, raising accepted load lowers goodput, because accepted-but-unfinishable work holds connections, loop slots, and CPU while it waits, delaying even the requests that were within capacity until they too miss their deadline. Trying to serve everyone serves fewer.

Load shedding is the deliberate, early rejection of excess so the accepted requests succeed: shed at the edge before the request acquires resources (a request rejected at admission costs almost nothing), reject cheaply and honestly with a fast 503 + Retry-After (the opposite of silent slowness that triggers retry storms), and prioritize by dropping the least-valuable load first (background before user-facing, anonymous before paid, retries before first attempts). Its dual is backpressure: signal upstream to slow down (bounded queues that refuse when full, fail-fast acquisition, stop reading the socket) so pressure is felt where it can be reduced rather than absorbed silently until collapse.

The synthesis: every mechanism is a load-control knob, and overload is when they combine. The pool/concurrency cap is the hard bound on work in flight (the single most important defense), timeouts free resources from doomed work, the breaker sheds load from a failing dependency, bounded retries with backoff cap the amplifier, graceful shutdown sheds load from a departing instance, and observability tells you where you are on the goodput curve so shedding triggers on real saturation. Tuned together (pool sized to the slowest dependency, timeouts nested in a budget, retries capped, a shedder at the edge keyed off saturation), they make the service degrade gracefully: serve less, serve it well, never fall to zero.

Recall before you leave

01. Why is rejecting requests the correct behavior under overload, and what is goodput?
02. What makes load shedding effective, how does backpressure relate, and how is each mechanism a load-control knob?

Recap

Above a certain load no service can serve everyone, and the only real choice is how it fails, so the discipline of overload is to reject some requests on purpose so the rest succeed. The target shifts from throughput (accepted) to goodput (completed usefully within deadline), because past saturation accepted-but-unfinishable work clogs queues and drags the in-capacity requests past their deadlines too: serving everyone serves fewer. Load shedding rejects excess early, cheaply, and by priority; backpressure pushes the limit upstream so it is felt where it can be reduced. They are two halves of making the limit explicit at the boundary. And the synthesis lands: every mechanism in the track is a load-control knob (the pool bounds work, timeouts free doomed requests, the breaker sheds a failing dependency, capped retries tame the amplifier, shutdown sheds a departing node, observability tells you where you are on the curve), tuned together so the service degrades gracefully instead of collapsing.

Now, when you see a service that accepts everything and queues everything under load, you can name what will happen next: the queue grows, accepted-but-unfinishable work delays in-capacity work, goodput collapses, and the retry storm follows. The fix is not more servers; it is a shedder at the edge and a bounded concurrency cap behind it.

We have now seen the seven mechanisms cooperate, fail, be observed, and be tuned under pressure. The final lesson collects all of it into a production-readiness review: the checklist that turns the whole track into a launch gate.


Connected lessons

Builds on: Seeing the system: RED metrics, the p99 tail, and breaker states (Senior)
Unlocks and deepens into: Production readiness: the launch checklist that is the whole track (Senior)
Appears again in: Real-world wins (Senior)
