Observability OBS · 04 · 01

RED and USE: two checklists, one triage discipline

RED (Rate, Errors, Duration) describes what the user felt. USE (Utilization, Saturation, Errors) describes which resource caused it. Running both in sequence is the senior engineer’s first reflex when the pager fires.

The pager fires. p99 latency on the checkout service has spiked from 80 ms to 1.2 s. You have seconds to find the right dashboard before the incident commander starts asking for updates. The wrong move is opening every panel. The right move is two short checklists.

What RED is

Tom Wilkie introduced RED at a London Prometheus meetup in 2015. RED describes a service from the caller’s perspective:

Rate — requests per second arriving at the service.
Errors — failed requests per second (HTTP 5xx, gRPC non-OK, timeouts).
Duration — the latency distribution (p50 / p95 / p99) of the requests that did complete.

If RED looks sick, the user is sick. Rate tells you whether traffic is normal. Errors tell you whether requests are failing. Duration tells you whether surviving requests are slow.
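As a rough illustration of the three RED signals, the sketch below summarizes one time window of requests. The `Request` record, the `red_summary` helper, and the nearest-rank percentile are assumptions made for this example, not part of any monitoring library; it also treats every non-failed request as "completed", which is a simplification.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time the request took, in milliseconds
    failed: bool        # True for HTTP 5xx, gRPC non-OK, or timeout


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list; p is in (0, 100]."""
    if not sorted_values:
        return None
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize one window of requests as Rate, Errors, Duration."""
    # Duration only looks at requests that did not fail (assumed "completed").
    completed = sorted(r.duration_ms for r in requests if not r.failed)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": sum(r.failed for r in requests) / window_seconds,
        "p50_ms": percentile(completed, 50),
        "p95_ms": percentile(completed, 95),
        "p99_ms": percentile(completed, 99),
    }


# Hypothetical 10-second window: 98 fast requests, one slow, one failed.
window = [Request(80, False)] * 98 + [Request(1200, False), Request(0, True)]
summary = red_summary(window, window_seconds=10)
print(summary)
```

In this made-up window the median stays at 80 ms while p99 lands on the single 1200 ms request, which is why Duration is read as a distribution rather than an average.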

What USE is

Brendan Gregg named USE in 2012 as an emergency checklist for performance debugging. USE describes every resource (CPU, memory, disk, network, locks, thread pools) from three angles:

Utilization — average percentage of time the resource was busy.
Saturation — amount of queued work the resource cannot service yet (run-queue length, wait time).
Errors — count of error events on the resource (ECC errors, disk EIO, NIC CRC, ENOSPC).

If USE looks sick on a box, that resource is a candidate cause for the RED symptom.
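A minimal sketch of the same idea for a single resource, assuming you already have periodic samples of how busy it was, how much work was queued, and how many error events occurred. The `Sample` record and `use_summary` helper are hypothetical names for this example; real collectors expose these numbers in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    busy_fraction: float  # share of the sample interval the resource was busy (0.0 to 1.0)
    queue_length: int     # work items waiting when the sample was taken
    error_count: int      # error events seen during the interval


def use_summary(samples):
    """Summarize samples of one resource as Utilization, Saturation, Errors."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    return {
        "utilization_pct": 100 * sum(s.busy_fraction for s in samples) / n,
        "saturation_avg_queue": sum(s.queue_length for s in samples) / n,
        "saturation_max_queue": max(s.queue_length for s in samples),
        "errors": sum(s.error_count for s in samples),
    }


# Hypothetical CPU samples taken during an incident.
cpu = use_summary([Sample(0.90, 12, 0), Sample(0.94, 16, 0)])
print(cpu)
```

Reporting both the average and the maximum queue length is a design choice of this sketch: the average hides short bursts, and the maximum shows them.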

The layered mental model

Method	What it measures	Answers	Origin
RED	Services (request-driven)	Is the user unhappy? Which symptom?	Tom Wilkie, 2015
USE	Resources (CPU, memory, disk, …)	Which resource caused it?	Brendan Gregg, 2012

The reading rhythm during an incident is always RED first, USE second:

1. Open the RED dashboard for the affected service.
2. Identify which of R / E / D is abnormal: that names the symptom.
3. Switch to the USE dashboard for the resources under that service.
4. Find the resource where utilization or saturation jumped: that names the candidate cause.
5. Drill into traces, logs, or profiling only after RED and USE have narrowed the search space.

Together, these five steps turn a chaotic pager fire into a directed investigation: without step 2 you are just guessing which panel to open, and without step 4 you know the symptom but not the cause.
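The two decision steps (name the symptom, then name the candidate cause) can be sketched as a comparison against a baseline. Everything here is an assumption for illustration: the function names, the dictionary layout, the 2x change threshold, and the baseline figures for error rate and CPU, which the lesson does not give.

```python
def changed(before, after, factor):
    """True when the value moved by at least `factor` in either direction."""
    low, high = sorted((before, after))
    if high == 0:
        return False  # both zero: nothing changed
    if low == 0:
        return True   # zero to non-zero always counts as a change
    return high / low >= factor


def name_symptom(baseline, current, factor=2.0):
    """Return the RED signals that moved away from baseline."""
    signals = ("rate", "errors", "duration_p99")
    return [s for s in signals if changed(baseline[s], current[s], factor)]


def candidate_causes(use_before, use_now, factor=2.0):
    """Return (resource, [metrics that jumped]) for each suspicious resource."""
    hits = []
    for resource, now in use_now.items():
        before = use_before[resource]
        jumped = [m for m in ("utilization", "saturation")
                  if changed(before[m], now[m], factor)]
        if jumped:
            hits.append((resource, jumped))
    return hits


# RED: rate in req/s, errors in percent, p99 in ms (baseline errors assumed).
red_before = {"rate": 400, "errors": 0.05, "duration_p99": 80}
red_now = {"rate": 400, "errors": 0.08, "duration_p99": 1200}

# USE: utilization in percent, saturation as queue length (baselines assumed).
use_before = {"cpu": {"utilization": 45, "saturation": 0},
              "disk": {"utilization": 30, "saturation": 1}}
use_now = {"cpu": {"utilization": 92, "saturation": 14},
           "disk": {"utilization": 32, "saturation": 1}}

symptom = name_symptom(red_before, red_now)
causes = candidate_causes(use_before, use_now)
print(symptom)  # ['duration_p99']
print(causes)   # [('cpu', ['utilization', 'saturation'])]
```

A fixed ratio threshold is crude and noisy for very small values; it stands in here for whatever "abnormal" means on your own dashboards.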

RED watches the service the user touches; USE watches the resources under it. Same shape — three metrics each — different lens: RED names the symptom, USE finds the cause.

The hospital metaphor

Think of a hospital. RED is the patient’s vital signs — pulse, blood pressure, temperature — measured from the outside. USE is the ICU monitoring of each life-support machine — oxygen flow, pump pressure, error LEDs — measured at the equipment itself. You need both. If the patient’s vitals crash you act fast, but to know why you walk over and check the machines. Doctors who watch only one side miss obvious problems.

A concrete triage

On-call engineer Bea gets paged: p99 latency on the checkout service spiked from 80 ms to 1.2 s. The RED dashboard shows Rate steady at 400 req/s, Errors under 0.1%, Duration p99 15× worse. That is the RED triage — requests still arrive and mostly succeed, but they are slow. Bea switches to USE on the boxes: CPU at 92%, run queue jumped from 0 to 14. The boxes are CPU-saturated; threads queue for cycles. Fix: scale out. Diagnosis took under a minute.

▸Why this works

USE’s Saturation is the most diagnostic signal of the three. Utilization tells you how busy a resource was on average — a CPU at 100% utilization is fine if no work is waiting (it is just keeping pace). What matters is the run-queue length. A disk at 80% utilization with queue depth 50 is far worse than a disk at 95% utilization with queue depth 1, because the queue depth is the leading indicator of latency: every job in line pays the queueing delay.

Counterintuitively, the 80%-utilized disk is the slow one: with 50 jobs queued it is saturated, while the 95%-utilized disk with a queue of 1 keeps pace. Read saturation before utilization.
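To put rough numbers on the two disks, the sketch below estimates how long a newly arrived job waits before service starts. It assumes a single FIFO queue and a fixed, hypothetical service time of 5 ms per job, and it ignores the job already in service; real devices serve requests in parallel and reorder them, so read this as an order-of-magnitude illustration only.

```python
def estimated_wait_ms(queue_depth, service_time_ms):
    """Approximate wait for a new job: every queued job is served first."""
    return queue_depth * service_time_ms


SERVICE_TIME_MS = 5.0  # assumed per-job service time, not from the lesson

# Disk A: 80% utilized, queue depth 50. Disk B: 95% utilized, queue depth 1.
wait_a = estimated_wait_ms(50, SERVICE_TIME_MS)
wait_b = estimated_wait_ms(1, SERVICE_TIME_MS)

print(f"disk A (80% busy, queue 50): about {wait_a:.0f} ms of queueing")
print(f"disk B (95% busy, queue 1):  about {wait_b:.0f} ms of queueing")
```

Under these assumptions the less-utilized disk adds about 250 ms of queueing per job against about 5 ms for the busier one, and utilization does not appear in the estimate at all.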

Quiz

A monitoring dashboard shows the service Rate, Errors, and Duration. Which methodology is this?

Quiz

The USE method says: for every resource, check Utilization, Saturation, and Errors. What does 'saturation' mean in USE?

Complete the analogy

Fill in the blank: RED is the methodology for measuring _______, while USE is the methodology for measuring resources.

Order the steps

Order the steps of a typical RED+USE incident response:

1 Pager fires — a symptom is reported (latency, errors, capacity)
2 Open the RED dashboard for the affected service
3 Identify which of R / E / D is abnormal — that names the symptom
4 Switch to the USE dashboard for the resources under that service
5 Find the resource where utilization or saturation jumped — that names the cause
6 Drill into traces, logs, or profiling only after RED and USE have narrowed scope
7 Apply the fix (scale up, restart, throttle, rollback) and watch RED return to baseline

Recall before you leave

01 In two sentences, why is running RED alone or USE alone usually not enough for incident response?
02 What are the three letters of USE and what does each measure?
03 Who introduced RED and when? Who introduced USE and when?

Recap

RED and USE are two short checklists written three years apart that together cover both the service the user touches and the resources that service stands on. RED (Rate, Errors, Duration) measures request flow from the caller’s perspective: if RED looks sick, the user is sick. USE (Utilization, Saturation, Errors) measures every physical and logical resource from three angles: if USE looks sick on a box, that resource is a candidate cause. The senior engineer’s reflex on any incident is RED first (name the symptom), USE second (find the cause), and everything else (logs, traces, profiles) only after those two checklists have narrowed the space. Saturation, not utilization, is the most diagnostic dimension of USE: a queue of waiting work is the leading indicator of user-visible latency even when average utilization is moderate. Now when a pager fires, your first two moves are predetermined: open RED, then USE. The guesswork starts only after both checklists have spoken.
