
Observability

RED and USE: multiple-choice review

Crux

Multiple-choice synthesis across the RED+USE unit: symptom vs cause triage, saturation, histogram aggregation, cardinality, and alert-severity discipline.

Six questions that cut across the whole unit. None is a definition to recite — each mirrors a decision you make mid-incident, when the pager is loud and you have to pick the right dashboard, the right signal, and the right alert grade.

Goal

Confirm you can fuse the two checklists into one triage discipline: RED names the symptom from the caller’s side, USE finds the cause from the kernel’s side, and the supporting decisions — saturation, histogram aggregation, cardinality, alert severity — all serve that reading rhythm.

Quiz

The pager fires on checkout p99 (80 ms to 1.2 s). RED shows Rate steady, Errors under 0.1%, Duration p99 15x worse. What is the correct next move, and why?

Quiz

A disk sits at 80% utilization with I/O queue depth 50; another disk sits at 95% utilization with queue depth 1. Which is the worse signal, and which USE dimension tells you?

Quiz

A node's memory utilization is moderate and MemAvailable still shows 500 MB free, yet a worker stalls for minutes. Which signal would have caught it, and why do free-RAM dashboards miss it?
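One saturation signal the unit names is PSI (pressure stall information), which Linux exposes under /proc/pressure/ on kernels that support it. The sketch below parses text in that format and flags a stall. The sample numbers are made up for illustration, and the `is_stalling` helper and its 10% threshold are hypothetical, not a recommended setting.

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/<resource> file into nested dicts.

    Each line looks like: 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {key: float(value)
                        for key, value in (field.split("=") for field in fields)}
    return result


# Illustrative sample only: these values are invented, not measured.
SAMPLE = """\
some avg10=42.10 avg60=18.55 avg300=5.02 total=123456789
full avg10=37.80 avg60=15.20 avg300=4.10 total=98765432
"""


def is_stalling(psi, threshold=10.0):
    """Hypothetical check: were all non-idle tasks stalled for more than
    `threshold` percent of the last 10 seconds?"""
    return psi["full"]["avg10"] > threshold


if __name__ == "__main__":
    psi = parse_psi(SAMPLE)
    print("full avg10:", psi["full"]["avg10"], "stalling:", is_stalling(psi))
```

The avg10, avg60, and avg300 fields are percentages of wall-clock time spent stalled over the last 10, 60, and 300 seconds, so they measure waiting work directly rather than how much memory appears free.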

Quiz

A team computes fleet-wide p99 by averaging each replica's pre-computed p99 (Prometheus summaries). Why is this wrong, and what is the correct method?
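The difference between averaging percentiles and merging buckets can be checked with a small simulation. The sketch below is a simplified model, not Prometheus itself: the bucket bounds, the two replicas, and their latencies are all invented, and `quantile` only approximates the linear interpolation that `histogram_quantile` performs.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (the `le` values); +Inf is last.
BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]


def cumulative_buckets(latencies):
    """Count observations <= each bound, as cumulative histogram buckets do."""
    counts = [0] * len(BOUNDS)
    for value in latencies:
        counts[bisect_left(BOUNDS, value)] += 1  # first bound >= value
    cumulative, running = [], 0
    for count in counts:
        running += count
        cumulative.append(running)
    return cumulative


def quantile(q, cumulative):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            if BOUNDS[i] == float("inf"):
                return lower  # fall back to the highest finite bound
            previous = cumulative[i - 1] if i > 0 else 0
            return lower + (BOUNDS[i] - lower) * (rank - previous) / (count - previous)
    return BOUNDS[-2]


def merge(*replicas):
    """Element-wise sum of bucket counts: the effect of `sum by (le)`."""
    return [sum(column) for column in zip(*replicas)]


# Invented traffic: replica A is healthy, replica B has a slow 10% tail.
replica_a = cumulative_buckets([0.08] * 1000)
replica_b = cumulative_buckets([0.08] * 900 + [2.0] * 100)

averaged_p99 = (quantile(0.99, replica_a) + quantile(0.99, replica_b)) / 2
merged_p99 = quantile(0.99, merge(replica_a, replica_b))

if __name__ == "__main__":
    print(f"average of per-replica p99s: {averaged_p99:.3f} s")
    print(f"p99 of merged buckets:       {merged_p99:.3f} s")
```

With these made-up inputs the averaged figure comes out near 1.2 s while the merged-bucket figure is about 2.2 s, so the average understates what the slowest 1% of callers experience. Bucket counts can be added across replicas because they are plain counters; a pre-computed percentile carries no information about how many requests sit behind it.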

Quiz

To jump from a p99 histogram spike to the exact slow request, an engineer wants to add trace_id as a metric label. What breaks, and what is the designed alternative?

Quiz

Designing alerts for a service, how should RED and USE signals map to severity, and why is the split the strongest lever against alert fatigue?

Recap

The through-line of the unit is one decision tree: RED names the symptom from the caller’s side (Rate, Errors, Duration), USE finds the cause from the kernel’s side (Utilization, Saturation, Errors), and saturation — queue depth, PSI — is the most diagnostic dimension because waiting work, not average busy-time, is what users feel. Around that core sit the supporting disciplines: Duration must be a histogram aggregated with sum by (le) (never an averaged percentile); cardinality is bounded by labelling only route/method/status_class and bridging to traces via exemplars; and alerts page on RED, warn on USE. Every production failure in the unit reduces to one missed signal in that tree.
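The cardinality point in the recap comes down to simple multiplication. The sketch below uses invented label counts, bucket count, and request volume to compare a bounded label set with a per-request identifier such as trace_id. It counts only bucket series and treats the bounded figure as an upper bound, since not every label combination necessarily occurs.

```python
from math import prod


def max_bucket_series(label_cardinalities, buckets):
    """Upper bound on histogram bucket series: every label combination
    multiplied by the number of buckets."""
    return prod(label_cardinalities.values()) * buckets


# Hypothetical sizes for the bounded label set named in the recap.
BOUNDED_LABELS = {"route": 50, "method": 4, "status_class": 5}
BUCKETS = 12  # assumed bucket count, including +Inf

bounded_series = max_bucket_series(BOUNDED_LABELS, BUCKETS)

# A trace_id is unique per request, so each request creates its own set of
# bucket series. The daily request volume here is an assumption.
REQUESTS_PER_DAY = 1_000_000
trace_id_series_per_day = REQUESTS_PER_DAY * BUCKETS

if __name__ == "__main__":
    print("bounded labels, at most:", bounded_series, "series")
    print("with trace_id label:", trace_id_series_per_day, "new series per day")
```

Under these assumptions the bounded label set tops out at 12,000 series and stays there, while a trace_id label adds 12,000,000 new series every day and never stops growing. An exemplar attaches the trace_id to a sample instead of to the series identity, which is why it does not multiply the series count.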
