RED and USE: multiple-choice review

Multiple-choice synthesis across the RED+USE unit — symptom vs cause triage, saturation, histogram aggregation, cardinality, and alert-severity discipline.

OBS Senior ◷ 13 min

Six questions that cut across the whole unit. None is a definition to recite — each mirrors a decision you make mid-incident, when the pager is loud and you have to pick the right dashboard, the right signal, and the right alert grade.

Goal

Confirm you can fuse the two checklists into one triage discipline: RED names the symptom from the caller’s side, USE finds the cause from the kernel’s side, and the supporting decisions — saturation, histogram aggregation, cardinality, alert severity — all serve that reading rhythm.

Quiz

The pager fires on checkout latency: p99 has risen from 80 ms to 1.2 s. RED shows Rate steady, Errors under 0.1%, and Duration p99 15x worse. What is the correct next move, and why?

Quiz

A disk sits at 80% utilization with I/O queue depth 50; another disk sits at 95% utilization with queue depth 1. Which is the worse signal, and which USE dimension tells you?

Quiz

A node's memory utilization is moderate and MemAvailable still reports 500 MB, yet a worker stalls for minutes. Which signal would have caught it, and why do free-RAM dashboards miss it?

Quiz

A team computes fleet-wide p99 by averaging each replica's pre-computed p99 (Prometheus summaries). Why is this wrong, and what is the correct method?

Quiz

To jump from a p99 histogram spike to the exact slow request, an engineer wants to add trace_id as a metric label. What breaks, and what is the designed alternative?

Quiz

When designing alerts for a service, how should RED and USE signals map to severity, and why is that split the strongest lever against alert fatigue?

Recap

The through-line of the unit is one decision tree: RED names the symptom from the caller’s side (Rate, Errors, Duration), and USE finds the cause from the kernel’s side (Utilization, Saturation, Errors). Saturation, read from queue depth and PSI (Linux pressure stall information), is the most diagnostic dimension because waiting work, not average busy-time, is what users feel. Around that core sit the supporting disciplines: Duration must be a histogram aggregated with sum by (le), never an averaged percentile; cardinality is bounded by labelling only route/method/status_class and bridging to traces via exemplars; and alerts page on RED and warn on USE. Every production failure in the unit reduces to one missed signal in that tree.
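
The histogram rule in the recap can be checked with a few lines of arithmetic. The sketch below is illustrative only: the bucket bounds, the two replicas, and their request counts are made up, and the `quantile` helper is a simplified linear-interpolation estimate in the spirit of Prometheus's `histogram_quantile`, not its exact implementation. Per-replica p99 values are derived from buckets here as a stand-in for pre-computed summary quantiles. Summing the per-bucket counts across replicas corresponds to `sum by (le)`.

```python
# Hypothetical `le` bucket upper bounds in seconds; the last bucket is +Inf.
BOUNDS = [0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]


def quantile(q, cumulative):
    """Estimate a quantile from cumulative bucket counts.

    Finds the first bucket whose cumulative count reaches the target rank,
    then interpolates linearly between that bucket's lower and upper bound.
    """
    total = cumulative[-1]
    rank = q * total
    i = next(idx for idx, count in enumerate(cumulative) if count >= rank)
    if BOUNDS[i] == float("inf"):
        # No upper bound to interpolate towards: report the last finite bound.
        return BOUNDS[i - 1]
    lower = BOUNDS[i - 1] if i > 0 else 0.0
    prev = cumulative[i - 1] if i > 0 else 0
    in_bucket = cumulative[i] - prev
    if in_bucket == 0:
        return lower
    return lower + (BOUNDS[i] - lower) * (rank - prev) / in_bucket


# Made-up cumulative counts: a busy, fast replica and a quiet, slow one.
replica_fast = [9800, 9890, 9900, 9900, 9900, 9900]  # 9,900 requests
replica_slow = [0, 0, 10, 40, 100, 100]              # 100 requests

# Wrong: average the per-replica p99 values, ignoring traffic volume.
averaged = (quantile(0.99, replica_fast) + quantile(0.99, replica_slow)) / 2

# Right: sum the counts bucket by bucket, then take the quantile once.
merged_buckets = [a + b for a, b in zip(replica_fast, replica_slow)]
merged = quantile(0.99, merged_buckets)

print(f"averaged per-replica p99: {averaged:.3f} s")
print(f"p99 of merged buckets:    {merged:.3f} s")
```

With these invented counts the merged estimate is 0.375 s, while the average of the two per-replica values lands near 1.29 s. The slow replica serves only 1% of the traffic, yet averaging gives it half the weight; merging buckets weights every request equally.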
