Observability OBS · 04 · 10

RED and USE: build the dashboard and triage an incident

Hands-on project — instrument a service with bounded RED, wire USE/PSI for its hosts, build the layered dashboard, then drive an incident and triage it RED-first, USE-second with evidence.

OBS · Senior · 240 min

Reading about RED+USE triage is not the same as running it under load. Instrument a real service the way a senior engineer would, build the layered dashboard, deliberately break it, and prove you can name the symptom from RED and the cause from USE — with screenshots and numbers at every step.

Goal

Turn the unit’s mental model into a working observability stack and a triage you can defend: bounded RED metrics, USE plus PSI for the hosts, a single layered dashboard, an alert split that pages on symptoms, and a documented RED-first / USE-second incident walkthrough.
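For the PSI half of "USE plus PSI", the raw data lives in the Linux pressure files under /proc/pressure (cpu, memory, io). The following is a minimal sketch of reading them, assuming a kernel with PSI enabled; the function names parse_psi and read_psi are hypothetical, and in a real stack an exporter would normally collect these values for you.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of a /proc/pressure/<resource> file.

    Each line looks like:
        some avg10=1.25 avg60=0.40 avg300=0.10 total=123456
    Returns {"some": {"avg10": 1.25, ...}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        kind, fields = parts[0], parts[1:]
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_psi(resource: str) -> dict:
    """Read PSI for 'cpu', 'memory', or 'io' (Linux only, PSI must be enabled)."""
    with open(f"/proc/pressure/{resource}") as fh:
        return parse_psi(fh.read())


# Illustrative sample text, not captured from a real host.
SAMPLE = (
    "some avg10=1.25 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.50 avg60=0.10 avg300=0.02 total=45678\n"
)
psi = parse_psi(SAMPLE)
```

The avg10, avg60, and avg300 fields are percentages of wall-clock time in which tasks were stalled on the resource over the last 10, 60, and 300 seconds; total is the cumulative stall time in microseconds. That "time spent waiting" reading is what makes PSI a useful saturation signal alongside utilization.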

Project

Objective

Instrument a small HTTP service with correct RED, wire USE and PSI for its hosts, assemble one layered RED-over-USE dashboard, then induce an incident and triage it RED-first / USE-second — proving each conclusion with a panel, a query, or a number.
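One part of "correct RED" is keeping every label value inside a fixed set before it reaches the metrics library. The sketch below shows one way to do that, assuming a hypothetical route table and label names (method, route, status_class); none of these names come from the lesson, and the helper functions are illustrative only.

```python
import re

# Hypothetical route table for this sketch: regex -> bounded route template.
ROUTE_PATTERNS = [
    (re.compile(r"^/users/[^/]+$"), "/users/{id}"),
    (re.compile(r"^/orders/[^/]+/items$"), "/orders/{id}/items"),
    (re.compile(r"^/healthz$"), "/healthz"),
]
ALLOWED_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE"}
STATUS_CLASSES = ["1xx", "2xx", "3xx", "4xx", "5xx", "other"]


def route_label(path: str) -> str:
    """Map a raw path to a route template; unknown paths collapse to 'other'."""
    path = path.split("?", 1)[0]  # drop the query string (may carry PII)
    for pattern, template in ROUTE_PATTERNS:
        if pattern.match(path):
            return template
    return "other"


def method_label(method: str) -> str:
    """Keep only known HTTP methods; anything else becomes 'other'."""
    upper = method.upper()
    return upper if upper in ALLOWED_METHODS else "other"


def status_class(status: int) -> str:
    """Collapse a status code to its class, e.g. 503 -> '5xx'."""
    return f"{status // 100}xx" if 100 <= status <= 599 else "other"


def red_labels(method: str, path: str, status: int) -> dict:
    """Label set to attach to the request counter and duration histogram."""
    return {
        "method": method_label(method),
        "route": route_label(path),
        "status_class": status_class(status),
    }


def max_label_combinations() -> int:
    """Upper bound on distinct label sets: product of each label's value count."""
    methods = len(ALLOWED_METHODS) + 1   # +1 for 'other'
    routes = len(ROUTE_PATTERNS) + 1     # +1 for 'other'
    return methods * routes * len(STATUS_CLASSES)


def max_histogram_series(bucket_count: int) -> int:
    """Series for one classic histogram: buckets (incl. +Inf) plus _sum and _count."""
    return max_label_combinations() * (bucket_count + 2)
```

With this table, red_labels("get", "/users/8421?email=a%40b.c", 200) yields method GET, route /users/{id}, and status_class 2xx: the user ID and the email never become label values. The two max_* helpers give the worst-case series count a cardinality audit can quote (here 6 × 4 × 6 = 144 label combinations).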

Requirements

Acceptance criteria

A screenshot of the layered dashboard with all three rows populated under load, with p99 computed via histogram_quantile over bucket rates aggregated with sum by (le).
A short cardinality audit: list every label on the RED metrics, its bounded value set, and the resulting series count — and confirm no unbounded or PII-bearing label is present.
Two written triage walkthroughs, one per induced incident, each naming which RED signal moved, which USE/PSI signal explained it, and the order you read them — backed by the captured panels.
Evidence that the alert split works: the RED alert paged on the user-facing incident, while the USE/PSI signal stayed on the warning channel (or fired there) and never reached the page channel.
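The first criterion asks for p99 via histogram_quantile with sum by (le) because quantiles cannot be averaged across instances: the buckets have to be added per upper bound first, and the quantile estimated from the merged histogram. The Python sketch below imitates that calculation on made-up cumulative bucket counts. It follows the linear-interpolation approach of classic Prometheus histograms as I understand it, skips many edge cases the real function handles, and works on raw counts where a real query would use rates.

```python
import math


def sum_by_le(instances):
    """Add cumulative bucket counts across instances, keyed by upper bound (le)."""
    total = {}
    for buckets in instances:
        for le, count in buckets.items():
            total[le] = total.get(le, 0.0) + count
    return total


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets {le: count}.

    The dict must contain a math.inf key (the +Inf bucket).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for le in sorted(buckets):
        count = buckets[le]
        if count >= rank:
            if math.isinf(le):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (le - prev_bound) * fraction
        prev_bound, prev_count = le, count
    return math.nan


# Made-up data: one fast instance and one slow instance, 100 requests each.
fast = {0.1: 90, 0.5: 100, 1.0: 100, math.inf: 100}
slow = {0.1: 0, 0.5: 50, 1.0: 100, math.inf: 100}

merged_p99 = histogram_quantile(0.99, sum_by_le([fast, slow]))
averaged_p99 = (histogram_quantile(0.99, fast) + histogram_quantile(0.99, slow)) / 2
```

On this data the merged p99 is roughly 0.98 s, while averaging the two per-instance p99 values gives roughly 0.725 s, which understates what the slowest 1% of users experienced. That gap is the reason the dashboard query aggregates buckets by le before taking the quantile.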

Senior stretch

Add an on-call runbook: the RED-first / USE-second reading rhythm, the four USE resources with their saturation metric, the alert-severity table, and a checklist for confirming a false positive.
Add exemplars to the duration histogram and demonstrate clicking a p99 spike to jump straight to the slow request's trace.
Add a service-mesh sidecar (Envoy/Linkerd) and compare its auto-RED against your application-emitted RED — find one case the sidecar cannot see (e.g. a 200 that returned wrong data).
Add an async or queue-based path and instrument its Saturation signal (consumer lag / queue-depth-in-seconds), showing a backlog the per-job Duration alone would miss.
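For the queue-based stretch item, one common way to express queue depth in seconds is backlog divided by the current drain rate. The sketch below assumes that definition; the function name and the numbers are hypothetical and only illustrate why per-job Duration can look healthy while the queue is saturated.

```python
import math


def queue_depth_seconds(backlog_messages: float, consume_rate_per_s: float) -> float:
    """Estimated time to drain the current backlog at the current consume rate.

    Returns 0.0 for an empty queue and infinity when consumers are stalled.
    """
    if backlog_messages <= 0:
        return 0.0
    if consume_rate_per_s <= 0:
        return math.inf
    return backlog_messages / consume_rate_per_s


# Hypothetical incident: each job still takes 25 ms (Duration looks fine),
# but producers outpaced consumers and 12,000 messages are waiting.
per_job_duration_s = 0.025
backlog = 12_000
consume_rate = 40.0  # messages per second across all consumers

wait_s = queue_depth_seconds(backlog, consume_rate)
```

Here a newly enqueued message waits about 300 seconds before a consumer picks it up, even though each job's own processing time is unchanged. Expressing the backlog in seconds rather than message count also keeps the alert threshold meaningful when throughput changes.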

Recap

This is the loop you run on every real incident: instrument RED with bounded labels, wire USE plus PSI for the resources underneath, put them on one layered dashboard sharing a time axis, page on the symptom and warn on the cause, then triage RED-first to name what the user felt and USE-second to find why. Doing it once on a service you broke on purpose — with screenshots, a cardinality audit, and two written walkthroughs — turns the discipline into reflex for the production version.
