
Observability

RED and USE: build the dashboard and triage an incident

Crux: a hands-on project. Instrument a service with bounded RED, wire USE/PSI for its hosts, build the layered dashboard, then drive an incident and triage it RED-first, USE-second, with evidence.
◷ 240 min

Reading about RED+USE triage is not the same as running it under load. Instrument a real service the way a senior engineer would, build the layered dashboard, deliberately break it, and prove you can name the symptom from RED and the cause from USE — with screenshots and numbers at every step.

Goal

Turn the unit’s mental model into a working observability stack and a triage you can defend: bounded RED metrics, USE plus PSI for the hosts, a single layered dashboard, an alert split that pages on symptoms, and a documented RED-first / USE-second incident walkthrough.
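For the "USE plus PSI" part of the goal, it helps to see what a PSI reading looks like before graphing it. On Linux, the files under /proc/pressure/ (cpu, memory, io) contain lines of the form `some avg10=... avg60=... avg300=... total=...`. The sketch below parses that text into numbers; the function name `parse_psi` and the sample values are illustrative assumptions, not output from a real host.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse the contents of a /proc/pressure/<resource> file.

    Returns e.g. {"some": {"avg10": 12.5, ...}, "full": {...}}.
    """
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()          # "some" or "full", then key=value pairs
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


# Invented sample in the PSI file format (assumption: not real measurements).
SAMPLE = """some avg10=12.50 avg60=4.10 avg300=1.02 total=123456
full avg10=3.00 avg60=0.90 avg300=0.20 total=45678"""

psi = parse_psi(SAMPLE)
print(psi["some"]["avg10"], psi["full"]["avg10"])
```

The `avg10` fields are the percentage of the last ten seconds in which some (or all) tasks were stalled on the resource, which is why they work as a saturation signal next to the USE utilization panels. On a real host you would read the file, for example with `open("/proc/pressure/memory").read()`, or let your metrics agent export it.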

Project
Objective

Instrument a small HTTP service with correct RED, wire USE and PSI for its hosts, assemble one layered RED-over-USE dashboard, then induce an incident and triage it RED-first / USE-second — proving each conclusion with a panel, a query, or a number.
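One way to read "correct RED" is that every label value comes from a fixed set decided ahead of time. The following is a minimal sketch of that idea using only the standard library: it normalizes raw request attributes into bounded labels and computes the worst-case series count. The route templates, label names, and helper names are hypothetical; a real service would feed the same labels into its metrics client and a duration histogram.

```python
from collections import Counter
from math import prod
import re

# Hypothetical bounded label sets for an example service (assumptions).
ROUTES = {"/users/{id}", "/orders/{id}", "/healthz", "other"}
METHODS = {"GET", "POST", "PUT", "DELETE", "other"}
STATUS_CLASSES = {"2xx", "3xx", "4xx", "5xx", "other"}

_NUMERIC_SEGMENT = re.compile(r"/\d+")


def bounded_labels(method: str, path: str, status: int) -> tuple[str, str, str]:
    """Map raw request attributes onto the fixed label sets above."""
    # Replace numeric path segments with a template so IDs never become labels.
    route = _NUMERIC_SEGMENT.sub("/{id}", path)
    if route not in ROUTES:
        route = "other"                      # unknown paths collapse to one value
    verb = method.upper()
    if verb not in METHODS:
        verb = "other"
    status_class = f"{status // 100}xx"
    if status_class not in STATUS_CLASSES:
        status_class = "other"
    return verb, route, status_class


requests_total: Counter = Counter()          # Rate and Errors both derive from this


def observe(method: str, path: str, status: int) -> None:
    requests_total[bounded_labels(method, path, status)] += 1


def max_series() -> int:
    """Worst-case series count: the product of the label set sizes."""
    return prod(len(values) for values in (METHODS, ROUTES, STATUS_CLASSES))


observe("get", "/users/42", 200)
observe("GET", "/users/7", 200)
observe("GET", "/reports/alice@example.com", 500)   # PII-bearing path is dropped
print(requests_total)
print(max_series())
```

Because unknown routes collapse to `other`, an attacker or a buggy client cannot mint new series, and the `max_series` product is the number you would report in the cardinality audit for this metric.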

Requirements
Acceptance criteria
  • A screenshot of the layered dashboard with all three rows populated under load, p99 computed via histogram_quantile with sum by (le).
  • A short cardinality audit: list every label on the RED metrics, its bounded value set, and the resulting series count — and confirm no unbounded or PII-bearing label is present.
  • Two written triage walkthroughs, one per induced incident, each naming which RED signal moved, which USE/PSI signal explained it, and the order you read them — backed by the captured panels.
  • Evidence the alert split works: the RED alert paged on the user-facing incident and the USE/PSI signal stayed on the warning channel (or fired there), not the page channel.
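The first criterion asks for p99 via histogram_quantile with sum by (le). In PromQL that usually looks like `histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`, where the metric name is an assumed example. The sketch below reproduces the same two steps in plain Python on invented cumulative bucket counts: merge buckets across instances by their `le` bound, then interpolate linearly inside the bucket that contains the target rank. It is an approximation of the documented behavior for illustration, not Prometheus source code, and it assumes 0 < q <= 1 and at least one finite bucket.

```python
from collections import defaultdict
import math


def sum_by_le(series):
    """Merge per-instance histograms: add cumulative counts that share an le bound."""
    merged = defaultdict(float)
    for buckets in series:
        for le, count in buckets.items():
            merged[le] += count
    return dict(merged)


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets keyed by upper bound (le)."""
    bounds = sorted(buckets)                 # math.inf sorts last
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == math.inf:
                return bounds[-2]            # cannot interpolate into +Inf
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (le - prev_bound) * fraction
        prev_bound, prev_count = le, count
    return math.nan


# Invented cumulative counts for two instances (le in seconds).
instance_a = {0.1: 90, 0.5: 98, 1.0: 100, math.inf: 100}
instance_b = {0.1: 50, 0.5: 80, 1.0: 99, math.inf: 100}

merged = sum_by_le([instance_a, instance_b])
p99 = histogram_quantile(0.99, merged)
print(merged, p99)
```

The aggregation has to happen on buckets, before the quantile is taken: averaging each instance's own p99 gives a different number and is not a percentile of the combined traffic.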
Senior stretch
  • Add an on-call runbook: the RED-first / USE-second reading rhythm, the four USE resources with their saturation metric, the alert-severity table, and a checklist for confirming a false positive.
  • Add exemplars to the duration histogram and demonstrate clicking a p99 spike to jump straight to the slow request's trace.
  • Add a service-mesh sidecar (Envoy/Linkerd) and compare its auto-RED against your application-emitted RED — find one case the sidecar cannot see (e.g. a 200 that returned wrong data).
  • Add an async or queue-based path and instrument its Saturation signal (consumer lag / queue-depth-in-seconds), showing a backlog the per-job Duration alone would miss.
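For the queue-based stretch item, "queue-depth-in-seconds" can be read as backlog divided by drain rate: how long a newly enqueued job would wait if nothing changed. The sketch below uses invented numbers and a hypothetical helper name to show a case where per-job Duration stays flat while the backlog grows.

```python
import math


def queue_depth_seconds(backlog_jobs: int, drain_rate_per_s: float) -> float:
    """Time to drain the current backlog at the current consumer throughput."""
    if backlog_jobs == 0:
        return 0.0
    if drain_rate_per_s <= 0:
        return math.inf                      # consumers stalled: backlog never drains
    return backlog_jobs / drain_rate_per_s


# Invented scenario (assumptions): per-job duration never changes.
job_duration_s = 0.25
workers = 5
drain_rate = workers / job_duration_s        # jobs per second the consumers can finish
arrival_rate = 25.0                          # jobs per second being enqueued
elapsed_s = 600                              # ten minutes of sustained overload

backlog = int((arrival_rate - drain_rate) * elapsed_s)
depth_s = queue_depth_seconds(backlog, drain_rate)
print(drain_rate, backlog, depth_s)
```

In this scenario a Duration panel would show every job still finishing in a quarter of a second, while the saturation signal shows minutes of queued work, which is the gap the stretch item asks you to demonstrate.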
Recap

This is the loop you run on every real incident: instrument RED with bounded labels, wire USE plus PSI for the resources underneath, put them on one layered dashboard sharing a time axis, page on the symptom and warn on the cause, then triage RED-first to name what the user felt and USE-second to find why. Doing it once on a service you broke on purpose — with screenshots, a cardinality audit, and two written walkthroughs — turns the discipline into reflex for the production version.

