Observability
RED and USE: multiple-choice review
Six questions that cut across the whole unit. None is a definition to recite — each mirrors a decision you make mid-incident, when the pager is loud and you have to pick the right dashboard, the right signal, and the right alert grade.
Confirm you can fuse the two checklists into one triage discipline: RED names the symptom from the caller’s side, USE finds the cause from the kernel’s side, and the supporting decisions — saturation, histogram aggregation, cardinality, alert severity — all serve that reading rhythm.
The pager fires on checkout p99 (80 ms to 1.2 s). RED shows Rate steady, Errors under 0.1%, Duration p99 15x worse. What is the correct next move, and why?
A disk sits at 80% utilization with I/O queue depth 50; another disk sits at 95% utilization with queue depth 1. Which is the worse signal, and which USE dimension tells you?
A node's memory utilization is moderate and MemAvailable still shows 500 MB free, yet a worker stalls for minutes. Which signal would have caught it, and why do free-RAM dashboards miss it?
A team computes fleet-wide p99 by averaging each replica's pre-computed p99 (Prometheus summaries). Why is this wrong, and what is the correct method?
To jump from a p99 histogram spike to the exact slow request, an engineer wants to add trace_id as a metric label. What breaks, and what is the designed alternative?
Designing alerts for a service, how should RED and USE signals map to severity, and why is the split the strongest lever against alert fatigue?
The through-line of the unit is one decision tree: RED names the symptom from the caller’s side (Rate, Errors, Duration), USE finds the cause from the kernel’s side (Utilization, Saturation, Errors), and saturation — queue depth, PSI — is the most diagnostic dimension because waiting work, not average busy-time, is what users feel. Around that core sit the supporting disciplines: Duration must be a histogram aggregated with sum by (le) (never an averaged percentile); cardinality is bounded by labelling only route/method/status_class and bridging to traces via exemplars; and alerts page on RED, warn on USE. Every production failure in the unit reduces to one missed signal in that tree.