Backend Architecture
Seeing the system: RED metrics, the p99 tail, and breaker state
The last lesson described a cascade you could narrate after the fact, but the engineer living through it at 3am sees none of that story. They see a dashboard. And if the dashboard shows only “average latency: 180ms, error rate: 0.2%,” they are blind, because the cascade hides in exactly the places an average erases. The mean latency stays calm while the p99 quietly triples, because the slow tail is a small fraction of requests drowned in the average. The pool is three acquisitions away from empty, but “connections: 47” without the limit of 50 next to it tells you nothing. The breaker flapped open and closed four times in the last minute, the single most important signal that a downstream is failing, and it appears nowhere, because nobody emitted it as a metric. Every mechanism you built in this track has an internal state that is the early warning, and a backend that does not surface that state is a backend you are operating with your eyes closed. The previous lessons made the system; this one makes it visible. You cannot operate, debug, or escape a failure you cannot see, and the composed failures from the last lesson stay invisible until you instrument the seams.
RED: the three numbers every service owes you
Start with the request-level view. The RED method says every service should emit three things, and together they answer “is this service healthy?” before you know anything about its internals:
- Rate — requests per second. The load the service is carrying.
- Errors — the rate (or fraction) of requests that fail. Not just 5xx; anything the caller experiences as failure, including breaker rejections.
- Duration — the distribution of response times, not the average. This is where the tail lives.
RED is deliberately request-centric: it describes what a caller feels, which is the right outermost view. Underneath it, you add resource-level signals — saturation of the pool, the loop, the queue — that explain why the RED numbers move.
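As a rough illustration of the three RED numbers, here is a minimal in-memory sketch. The `RedRecorder` class, its method names, and the sample values are hypothetical and chosen only for this example; a real service would more likely use a metrics library with histogram buckets than store every sample.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RedRecorder:
    """Collects per-request observations for one reporting window (hypothetical helper)."""
    durations_ms: list = field(default_factory=list)
    failures: int = 0

    def observe(self, duration_ms: float, ok: bool) -> None:
        # Record every request, failed or not, so Rate reflects the full load.
        self.durations_ms.append(duration_ms)
        if not ok:
            self.failures += 1

    def percentile(self, p: float) -> float:
        # Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
        ordered = sorted(self.durations_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self, window_seconds: float) -> dict:
        total = len(self.durations_ms)
        return {
            "rate_per_s": total / window_seconds,                       # Rate
            "error_fraction": self.failures / total if total else 0.0,  # Errors
            "p50_ms": self.percentile(50) if total else None,           # Duration (typical)
            "p99_ms": self.percentile(99) if total else None,           # Duration (tail)
        }


# Made-up window: 196 fast successes and 4 slow failures over 10 seconds.
recorder = RedRecorder()
for _ in range(196):
    recorder.observe(20.0, ok=True)
for _ in range(4):
    recorder.observe(900.0, ok=False)  # anything the caller experiences as failure

snap = recorder.snapshot(window_seconds=10.0)
print(snap)
```

The snapshot reports duration as percentiles rather than a single average, which matches the "distribution, not the average" requirement in the list above.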
The mean is a liar; watch percentiles
The single most important habit this lesson teaches: never trust the average latency. A mean blends fast and slow requests into one number that describes none of them. If 99 requests take 20ms and one takes 4 seconds, the mean is ~60ms — a number that looks fine and is experienced by nobody. The metric that matters is the percentile: p50 (median, the typical request), p99 (the slow tail, one request in a hundred), p999 (the rare disaster). The tail is not noise — it is the signal, because the cascade from the last lesson begins as a rising p99 long before the mean moves at all. A pool starting to saturate, a downstream starting to slow, a GC pause — all show up in the tail first. Watch p99 and you see the cascade forming; watch the mean and you see it only after collapse.
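The arithmetic can be checked with a short sketch. The first sample set is the one from the text; the second is an assumed window of 1,000 requests in which 2% are slow, chosen so that the p99 rank falls inside the slow tail. The `nearest_rank` helper is a hypothetical name for a simple nearest-rank percentile, which is only one of several percentile definitions.

```python
import math
import statistics


def nearest_rank(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# The example from the text: 99 requests at 20 ms and one at 4 seconds.
text_example = [20] * 99 + [4000]
text_mean = statistics.mean(text_example)  # a number no request actually experienced

# Assumed larger window: 980 fast requests and 20 slow ones (2% slow).
window = [20] * 980 + [4000] * 20
window_mean = statistics.mean(window)
p50 = nearest_rank(window, 50)
p99 = nearest_rank(window, 99)

print(f"text example mean: {text_mean} ms")
print(f"window mean: {window_mean} ms, p50: {p50} ms, p99: {p99} ms")
```

In the assumed window the mean sits near 100 ms, a value that describes neither the 20 ms requests nor the 4-second ones, while p50 and p99 report the two experiences users actually had.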
Each mechanism has a state worth emitting
The track’s mechanisms are not just code — each has an internal state that is a leading indicator, and the job is to surface it:
- Pool — emit in-use vs. limit and acquisition wait time / queue depth. Saturation (in-use / limit approaching 1.0) is the earliest sign of the cascade. “47 connections” is meaningless; “47 of 50, 200ms acquire wait” is an alarm.
- Event loop — emit loop lag. Rising lag means something is blocking the loop, starving every concurrent request.
- Breaker — emit state and transitions (closed → open → half-open). A flapping breaker is the clearest possible statement that a dependency is unhealthy; it must be a first-class metric, not a log line.
- Retries — emit retry rate separately from request rate. A retry rate climbing toward the request rate is a storm forming.
- Shutdown — emit drain duration and forced-kill count. Drains creeping toward the grace period predict dropped work on the next deploy.
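Two of the states above can be sketched in a few lines. `PoolGauge` and `BreakerMetrics` are hypothetical names invented for this example, the timestamps are made up, and a production system would export these values through its metrics library rather than hold them in plain objects.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass
class PoolGauge:
    """Pool state reported together with its limit, never as a bare count."""
    in_use: int
    limit: int
    acquire_wait_ms: float

    @property
    def saturation(self) -> float:
        # in-use / limit; values approaching 1.0 are the early warning.
        return self.in_use / self.limit


class BreakerMetrics:
    """Tracks breaker state and counts every transition as a first-class metric."""

    def __init__(self):
        self.state = "closed"
        self.transitions = Counter()  # (from_state, to_state) -> count
        self._opened_at = deque()     # timestamps of transitions into "open"

    def transition(self, new_state: str, now: float) -> None:
        if new_state == self.state:
            return  # not a transition, nothing to emit
        self.transitions[(self.state, new_state)] += 1
        if new_state == "open":
            self._opened_at.append(now)
        self.state = new_state

    def opens_within(self, window_s: float, now: float) -> int:
        # Drop opens older than the window, then count what is left (flap detection).
        while self._opened_at and self._opened_at[0] < now - window_s:
            self._opened_at.popleft()
        return len(self._opened_at)


pool = PoolGauge(in_use=47, limit=50, acquire_wait_ms=200.0)
print(f"pool: {pool.in_use} of {pool.limit}, saturation {pool.saturation:.2f}, "
      f"acquire wait {pool.acquire_wait_ms} ms")

breaker = BreakerMetrics()
events = [(0, "open"), (5, "half_open"), (6, "closed"), (15, "open"), (20, "half_open"),
          (21, "open"), (30, "half_open"), (31, "closed"), (40, "open")]
for timestamp, state in events:
    breaker.transition(state, now=timestamp)

opens_last_minute = breaker.opens_within(window_s=60, now=45)
print(f"breaker state: {breaker.state}, opens in last 60s: {opens_last_minute}")
```

Counting transitions instead of sampling the current state is the design point: a breaker that is "closed" at the moment of each scrape can still have opened four times in between.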
The error budget: turning telemetry into a decision
Observability is not just for debugging; it drives a shipping decision. An SLO (service level objective) sets the target — say, 99.9% of requests succeed under 300ms. The gap between that target and 100% is the error budget: the failures you are allowed to spend. When the budget is healthy, you ship features fast and take risk. When telemetry shows the budget is nearly spent, you stop shipping features and spend the engineering on reliability instead. This converts the whole RED/percentile/saturation picture from passive dashboards into an active governor on the team’s behavior — the bridge from “we can see the system” to “the system’s health controls what we do next.”
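The budget arithmetic is simple enough to sketch. The function names, the request counts, and the 10% freeze threshold below are assumptions made for illustration; the text does not prescribe a threshold, and teams choose their own policy.

```python
def error_budget(slo_target: float, total_requests: int, bad_requests: int):
    """Return (allowed_bad, remaining, fraction_left) for one SLO window.

    A "bad" request is one that failed or missed the latency target.
    """
    allowed_bad = total_requests * (1.0 - slo_target)
    remaining = allowed_bad - bad_requests
    fraction_left = remaining / allowed_bad if allowed_bad else 0.0
    return allowed_bad, remaining, fraction_left


def shipping_mode(fraction_left: float, freeze_below: float = 0.10) -> str:
    # Hypothetical policy: stop feature work once less than 10% of the budget remains.
    return "ship features" if fraction_left > freeze_below else "reliability work"


# SLO from the text: 99.9% of requests succeed under 300ms. Traffic numbers are made up.
allowed, remaining, left = error_budget(0.999, total_requests=1_000_000, bad_requests=400)
print(f"allowed bad: {allowed:.0f}, remaining: {remaining:.0f}, "
      f"left: {left:.0%} -> {shipping_mode(left)}")

_, _, left_late = error_budget(0.999, total_requests=1_000_000, bad_requests=950)
print(f"left: {left_late:.0%} -> {shipping_mode(left_late)}")
```

With one million requests in the window, a 99.9% target allows about 1,000 bad requests; spending 400 of them leaves room to ship, while spending 950 flips the decision to reliability work.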
Why this works
Why insist on percentiles and reject the average so absolutely? Surely the mean latency is some useful summary of how the service is doing? It is not, because the average answers a question nobody is asking and hides the one that matters. No user experiences “the mean”; each user experiences their own request, and the distribution of those individual experiences is the entire point. The mean collapses that distribution into a single number that is mathematically dominated by the bulk and structurally blind to the tail, and the tail is where every interesting failure lives. Consider the arithmetic: at scale, p99 is not a rare curiosity. A page that takes 100 requests to load includes at least one request slower than p99 with probability 1 - 0.99^100, about 63% if the requests are independent, and across a session of several page loads the “one in a hundred” slow request becomes a near-certainty, so p99 is closer to “the experience of an active user” than the median is. Worse, the mean actively conceals the cascade: when a pool starts saturating, a handful of requests get slow while the rest stay fast, so the tail lifts while the mean barely twitches. By the time a rising mean forces your attention, the tail has already been catastrophic for minutes. There is a deeper structural reason too: latency distributions in real systems are not Gaussian. They are heavy-tailed and often multimodal (a fast path and a slow path, e.g. cache hit vs. miss, pool-available vs. pool-wait), and for such distributions the mean is not even a meaningful central tendency; it is an artifact sitting between two humps, describing neither. This is why every mechanism in the track must emit its state as a distribution or a discrete event, never as an average: an averaged breaker state is meaningless, an averaged queue depth hides the spikes that cause drops, an averaged drain time hides the deploy that nearly missed the deadline.
The principle generalizes into a rule of operational maturity: monitor the experience at the tail, because the tail is both where users feel pain and where the system tells you — earliest and most clearly — that it is about to compose a failure.
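How often a user actually meets the tail can be estimated with one line of probability. The sketch below assumes requests are independent, which real traffic only approximates, and `chance_of_hitting_tail` is a hypothetical helper name.

```python
def chance_of_hitting_tail(requests: int, percentile: float = 99.0) -> float:
    """Probability that at least one of `requests` is slower than the given percentile.

    Assumes independent requests: P = 1 - (percentile / 100) ** requests.
    """
    return 1.0 - (percentile / 100.0) ** requests


one_request = chance_of_hitting_tail(1)
one_page = chance_of_hitting_tail(100)       # a page that issues 100 requests
five_pages = chance_of_hitting_tail(500)     # a short session of five such pages

print(f"1 request:    {one_request:.1%}")
print(f"100 requests: {one_page:.1%}")
print(f"500 requests: {five_pages:.1%}")
```

Under this assumption a single request hits the tail 1% of the time, a 100-request page does so on roughly 63% of loads, and a five-page session does so more than 99% of the time.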
| Signal | What it measures | Why it’s the early warning |
|---|---|---|
| Rate | Requests / second | The load the rest of the picture explains |
| Errors | Failed-request fraction | Includes breaker rejections, not just 5xx |
| Duration (p99) | Slow-tail latency | Cascade shows here before the mean moves |
| Pool saturation | in-use / limit, acquire wait | Earliest sign of pool-driven cascade |
| Loop lag | Event-loop delay | Something is blocking the loop |
| Breaker state | closed / open / half-open | Clearest signal a dependency is failing |
| Error budget | SLO target vs. actual | Turns telemetry into a ship/stop decision |
A dashboard shows mean latency steady at 180ms and error rate at 0.2%, yet users are complaining the app is slow. What is the most likely blind spot?
Why is an error budget more than just a dashboard — what decision does it drive?
Order how you'd instrument a service from outermost signal to shipping decision:
1. Emit RED at the request edge: rate, errors, and the duration distribution
2. Watch percentile latency (p99/p999), never the mean, to catch the slow tail
3. Add resource signals (pool saturation, loop lag, breaker state) to explain why RED moves
4. Define an SLO and track the error budget to gate feature shipping against reliability
1. What is the RED method, and why must you watch percentiles instead of the mean latency?
2. Which internal state should each mechanism emit, and how does the error budget turn telemetry into a decision?
The cascade of the last lesson is invisible at 3am if your dashboard shows only averages, so this lesson makes the system visible. RED — rate, errors, duration — is the outermost, caller-centric view, and the iron rule is to watch the duration distribution, never the mean: the cascade surfaces as a rising p99 long before the average twitches, because latency is heavy-tailed and the mean describes a request nobody makes. Every mechanism in the track owns an internal state that is a leading indicator and must be emitted as a metric: pool saturation and acquire wait, event-loop lag, breaker transitions, retry rate, drain duration. And observability is not passive — an SLO and its error budget turn the whole picture into a governor: ship features while the budget is healthy, switch to reliability work when it is nearly spent. Now you can both reason about composed failures and see them forming. The next lesson uses that visibility under the hardest condition — deliberate overload — where every mechanism doubles as a load-control knob and the question is whether the service degrades gracefully or collapses.