
Observability

RED and USE: two checklists, one triage discipline

Crux: RED (Rate, Errors, Duration) describes what the user felt. USE (Utilization, Saturation, Errors) describes which resource caused it. Running both in sequence is the senior engineer's first reflex when the pager fires.

The pager fires. p99 latency on the checkout service spiked from 80 ms to 1.2 s. You have seconds to find the right dashboard before the incident commander asks for updates. The wrong move is opening every panel. The right move is working through two short checklists.

What RED is

Tom Wilkie introduced RED at a London Prometheus meetup in 2015. RED describes a service from the caller’s perspective:

  • Rate — requests per second arriving at the service.
  • Errors — failed requests per second (HTTP 5xx, gRPC non-OK, timeouts).
  • Duration — the latency distribution (p50 / p95 / p99) of the requests that did complete.

If RED looks sick, the user is sick. Rate tells you whether traffic is normal. Errors tell you whether requests are failing. Duration tells you whether surviving requests are slow.
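The three RED signals can be derived from a plain list of request records. The sketch below is illustrative only: the `Request` record, the `red_summary` helper, and the nearest-rank percentile are assumptions made for this example, not part of any monitoring library.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    duration_ms: float
    ok: bool  # False for HTTP 5xx, gRPC non-OK, or a timeout


def percentile(sorted_values, p):
    """Nearest-rank percentile over an already sorted list; p in (0, 100]."""
    if not sorted_values:
        return None
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_seconds):
    """Summarise Rate, Errors, and Duration for one observation window."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    # Duration is taken over the requests that did complete successfully.
    durations = sorted(r.duration_ms for r in requests if r.ok)
    return {
        "rate_per_s": total / window_seconds,
        "errors_per_s": failed / window_seconds,
        "error_ratio": failed / total if total else 0.0,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

In production these numbers usually come from counters and histograms in the metrics system, not from raw request lists; the arithmetic is the same.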

What USE is

Brendan Gregg named USE in 2012 as an emergency checklist for performance debugging. USE describes every resource (CPU, memory, disk, network, locks, thread pools) from three angles:

  • Utilization — average percentage of time the resource was busy.
  • Saturation — amount of queued work the resource cannot service yet (run-queue length, wait time).
  • Errors — count of error events on the resource (ECC errors, disk EIO, NIC CRC, ENOSPC).

If USE looks sick on a box, that resource is a candidate cause for the RED symptom.
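A minimal sketch of the same three questions applied to one resource sample. The `ResourceSample` record and the thresholds (90% utilization, one queued item) are hypothetical values chosen for illustration; real thresholds depend on the resource and the workload.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    busy_seconds: float    # time the resource spent servicing work
    window_seconds: float  # length of the sampling window
    queued: float          # average work waiting, e.g. run-queue length
    error_count: int       # error events seen in the window


def use_report(sample):
    """Turn one raw sample into the three USE dimensions."""
    return {
        "resource": sample.name,
        "utilization": sample.busy_seconds / sample.window_seconds,
        "saturation": sample.queued,
        "errors": sample.error_count,
    }


def sick_dimensions(report, util_threshold=0.9, queue_threshold=1.0):
    """List which USE dimensions cross the (assumed) thresholds."""
    reasons = []
    if report["utilization"] >= util_threshold:
        reasons.append("utilization")
    if report["saturation"] >= queue_threshold:
        reasons.append("saturation")
    if report["errors"] > 0:
        reasons.append("errors")
    return reasons
```

Running this for every resource under a service gives a short list of candidates instead of a wall of panels.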

The layered mental model

Method | What it measures | Answers | Origin
RED | Services (request-driven) | Is the user unhappy? Which symptom? | Tom Wilkie, 2015
USE | Resources (CPU, memory, disk, …) | Which resource caused it? | Brendan Gregg, 2012

The reading rhythm during an incident is always RED first, USE second:

  1. Open the RED dashboard for the affected service.
  2. Identify which of R / E / D is abnormal — that names the symptom.
  3. Switch to the USE dashboard for the resources under that service.
  4. Find the resource where utilization or saturation jumped — that names the candidate cause.
  5. Drill into traces, logs, or profiling only after RED and USE have narrowed the search space.

The hospital metaphor

Think of a hospital. RED is the patient’s vital signs — pulse, blood pressure, temperature — measured from the outside. USE is the ICU monitoring of each life-support machine — oxygen flow, pump pressure, error LEDs — measured at the equipment itself. You need both. If the patient’s vitals crash you act fast, but to know why you walk over and check the machines. Doctors who watch only one side miss obvious problems.

A concrete triage

On-call engineer Bea gets paged: p99 latency on the checkout service spiked from 80 ms to 1.2 s. The RED dashboard shows Rate steady at 400 req/s, Errors under 0.1%, Duration p99 15× worse. That is the RED triage — requests still arrive and mostly succeed, but they are slow. Bea switches to USE on the boxes: CPU at 92%, run queue jumped from 0 to 14. The boxes are CPU-saturated; threads queue for cycles. Fix: scale out. Diagnosis took under a minute.
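Bea's two steps can be sketched as two small functions. Everything here is an assumption for illustration: the function names, the 2× deviation threshold, the 1% error threshold, and the memory and disk readings, which the scenario does not give. Only the checkout and CPU numbers come from the triage above.

```python
def red_symptoms(baseline, current, ratio_threshold=2.0, error_ratio_threshold=0.01):
    """Step 1: name which of Rate / Errors / Duration is abnormal."""
    symptoms = []
    rate_ratio = current["rate_per_s"] / baseline["rate_per_s"]
    if rate_ratio >= ratio_threshold or rate_ratio <= 1 / ratio_threshold:
        symptoms.append("rate")
    if current["error_ratio"] >= error_ratio_threshold:
        symptoms.append("errors")
    if current["p99_ms"] / baseline["p99_ms"] >= ratio_threshold:
        symptoms.append("duration")
    return symptoms


def use_candidates(resources, util_threshold=0.9, queue_threshold=1.0):
    """Step 2: flag resources that look sick, most saturated first."""
    flagged = [
        r for r in resources
        if r["utilization"] >= util_threshold
        or r["saturation"] >= queue_threshold
        or r["errors"] > 0
    ]
    flagged.sort(key=lambda r: r["saturation"], reverse=True)
    return [r["name"] for r in flagged]


baseline = {"rate_per_s": 400, "p99_ms": 80}
current = {"rate_per_s": 400, "error_ratio": 0.0005, "p99_ms": 1200}

resources = [
    {"name": "cpu", "utilization": 0.92, "saturation": 14, "errors": 0},
    # Hypothetical readings, not stated in the scenario:
    {"name": "memory", "utilization": 0.60, "saturation": 0, "errors": 0},
    {"name": "disk", "utilization": 0.30, "saturation": 0, "errors": 0},
]

print(red_symptoms(baseline, current))  # ['duration']
print(use_candidates(resources))        # ['cpu']
```

The output names a symptom and a candidate cause, not a verified root cause; traces or profiles would still be needed to confirm it.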

Why this works

USE’s Saturation is the most diagnostic signal of the three. Utilization tells you how busy a resource was on average — a CPU at 100% utilization is fine if no work is waiting (it is just keeping pace). What matters is the run-queue length. A disk at 80% utilization with queue depth 50 is far worse than a disk at 95% utilization with queue depth 1, because the queue depth is the leading indicator of latency: every job in line pays the queueing delay.
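The disk comparison above can be made concrete with a rough estimate. This sketch assumes a single FIFO queue and a constant service time of 2 ms per I/O; both are simplifications, and the 2 ms figure is hypothetical.

```python
def estimated_wait_ms(queue_depth, service_time_ms):
    """Rough FIFO estimate: a new arrival waits for everything already queued."""
    return queue_depth * service_time_ms


SERVICE_TIME_MS = 2.0  # assumed per-I/O service time

# Disk A: 80% utilized, queue depth 50
print(estimated_wait_ms(50, SERVICE_TIME_MS))  # 100.0

# Disk B: 95% utilized, queue depth 1
print(estimated_wait_ms(1, SERVICE_TIME_MS))   # 2.0
```

Under these assumptions the less utilized disk adds roughly 50 times more waiting per request, which is why the queue, not the busy percentage, is the number to read first.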

Quiz

A monitoring dashboard shows the service Rate, Errors, and Duration. Which methodology is this?

Quiz

USE method says: for every resource, check Utilization, Saturation, and Errors. What does 'saturation' mean in USE?

Complete the analogy

Fill in the blank: RED is the methodology for measuring _______, while USE is the methodology for measuring resources.

Order the steps

Order the steps of a typical RED+USE incident response:

  1. Pager fires: a symptom is reported (latency, errors, capacity)
  2. Open the RED dashboard for the affected service
  3. Identify which of R / E / D is abnormal; that names the symptom
  4. Switch to the USE dashboard for the resources under that service
  5. Find the resource where utilization or saturation jumped; that names the candidate cause
  6. Drill into traces, logs, or profiling only after RED and USE have narrowed scope
  7. Apply the fix (scale up, restart, throttle, rollback) and watch RED return to baseline
Recall before you leave
  1. In two sentences, why is running RED alone or USE alone usually not enough for incident response?
  2. What are the three letters of USE and what does each measure?
  3. Who introduced RED and when? Who introduced USE and when?
Recap

RED and USE are two short checklists written three years apart (USE in 2012, RED in 2015) that together cover both the service the user touches and the resources that service stands on. RED (Rate, Errors, Duration) measures request flow from the caller’s perspective: if RED looks sick, the user is sick. USE (Utilization, Saturation, Errors) measures every physical and logical resource from three angles: if USE looks sick on a box, that resource is a candidate cause. The senior engineer’s reflex on any incident is RED first (name the symptom), USE second (find the candidate cause), and everything else (logs, traces, profiles) only after those two checklists have narrowed the space. Saturation, not utilization, is the most diagnostic dimension of USE: a queue of waiting work is the leading indicator of user-visible latency even when average utilization is moderate.
