
RED and USE: PromQL and signal reading

Read real PromQL queries, a PSI report, and instrumentation snippets; predict the behaviour and pick the highest-leverage fix a senior engineer makes first.

RED and USE problems are diagnosed by reading queries and kernel reports, not by reciting definitions. Read each snippet, predict what it actually computes or exposes, and choose the fix a senior engineer reaches for first.

Goal

Practise the reading loop of every triage: parse the PromQL or the PSI line, spot the silent defect (a dropped label, a high-cardinality dimension, the wrong aggregation), and reach for the correct repair.

Snippet 1 — the p99 query

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])))

Quiz

A dashboard panel uses this for fleet-wide p99 and shows a flat, suspicious line. What is wrong, and what is the fix?
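One way to check an answer here is to model the aggregation by hand. The sketch below is a simplified stand-in for how a classic-histogram quantile is interpolated from cumulative `le` buckets; it is not Prometheus's implementation, and the per-instance rates, the `sum_by` helper, and the `bucket_quantile` helper are all invented for illustration.

```python
import math

# Hypothetical per-instance bucket rates (cumulative, as classic histograms are).
series = [
    ({"instance": "a", "le": "0.1"}, 90.0),
    ({"instance": "a", "le": "0.25"}, 99.0),
    ({"instance": "a", "le": "0.5"}, 100.0),
    ({"instance": "a", "le": "+Inf"}, 100.0),
    ({"instance": "b", "le": "0.1"}, 50.0),
    ({"instance": "b", "le": "0.25"}, 80.0),
    ({"instance": "b", "le": "0.5"}, 100.0),
    ({"instance": "b", "le": "+Inf"}, 100.0),
]


def sum_by(samples, keep):
    """Model `sum by (<keep>)`: add values, retain only the kept labels."""
    out = {}
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        out[key] = out.get(key, 0.0) + value
    return out


def bucket_quantile(q, buckets):
    """Linear interpolation over (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        return math.nan  # no usable bucket boundaries
    rank = q * buckets[-1][1]
    if rank == 0:
        return math.nan  # no observations
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                return prev_bound  # quantile lies beyond the last finite bound
            width = bound - prev_bound
            return prev_bound + width * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Models sum(rate(...)): one sample, every label including `le` is gone.
collapsed = sum_by(series, keep=[])

# Models sum by (le) (rate(...)): one sample per bucket boundary.
by_le = sum_by(series, keep=["le"])
buckets = [(float(key[0][1]), value) for key, value in by_le.items()]
p99 = bucket_quantile(0.99, buckets)

print(collapsed)  # a single number with no boundaries attached
print(p99)        # falls inside the (0.25, 0.5] bucket
```

The collapsed result carries no `le` label, so there is nothing to interpolate between. Keeping `le` in the aggregation preserves the bucket structure while still merging instances.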

Snippet 2 — the PSI report

/proc/pressure/memory
some avg10=18.40 avg60=12.10 avg300=4.02 total=9534120
full avg10=9.10  avg60=6.55  avg300=2.01 total=4120553

Quiz

MemAvailable on this host reads 600 MB free. Reading the PSI report, what is happening and what alert grade does it deserve?
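To reason about the report mechanically, it helps to parse it. In a PSI file the `avg10`, `avg60`, and `avg300` fields are percentages of wall-clock time stalled over the last 10, 60, and 300 seconds, and `total` is cumulative stall time in microseconds. The sketch below parses the two lines shown above; the `grade` function, its grade names, and its thresholds are assumptions made for illustration, not a standard alerting policy.

```python
REPORT = """\
some avg10=18.40 avg60=12.10 avg300=4.02 total=9534120
full avg10=9.10  avg60=6.55  avg300=2.01 total=4120553
"""


def parse_psi(text):
    """Parse PSI lines into {"some": {...}, "full": {...}} with float values."""
    report = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        report[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return report


def grade(report, full_threshold=0.0):
    """Hypothetical grading: any recent `full` stall pages, `some` alone tickets."""
    full = report.get("full", {})
    some = report.get("some", {})
    if max(full.get("avg10", 0.0), full.get("avg60", 0.0)) > full_threshold:
        return "page"
    if some.get("avg10", 0.0) > 0.0:
        return "ticket"
    return "ok"


report = parse_psi(REPORT)
full = report["full"]

# Shorter windows reading higher than longer ones means pressure is rising.
rising = full["avg10"] > full["avg60"] > full["avg300"]

print(grade(report), "rising" if rising else "not rising")
```

A threshold of exactly zero is the strictest possible reading; a production rule would more likely require the condition to hold for some duration before firing.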

Snippet 3 — the Errors counter

errs.inc({
  method: req.method,
  route: req.route?.path,
  error_message: err.message,   // e.g. "user alice@example.com not found"
});

Quiz

This RED Errors counter passed review until a buggy release tripled the metrics bill overnight and triggered a security ticket. Name both problems and the single fix.
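The shape of a bounded label set can be sketched independently of any metrics client. The example below is written in Python rather than the snippet's JavaScript, uses a plain `Counter` as a stand-in for a real counter metric, and the class names, the `classify` mapping, and the `"unmatched"` route fallback are all hypothetical.

```python
from collections import Counter

# Hypothetical closed set of error classes: the only values the label may take.
ERROR_CLASSES = ("not_found", "timeout", "validation", "other")


def classify(exc):
    """Map an exception to one of ERROR_CLASSES, never to its message."""
    if isinstance(exc, LookupError):
        return "not_found"
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ValueError):
        return "validation"
    return "other"


# Stand-in for a counter metric: one key per distinct label combination.
errors_total = Counter()


def record_error(method, route, exc):
    """Count the error under bounded labels; the message stays in the logs."""
    labels = (method, route or "unmatched", classify(exc))
    errors_total[labels] += 1


# A thousand distinct messages, each carrying a different email address.
for i in range(1000):
    record_error("GET", "/users/:id", LookupError(f"user user{i}@example.com not found"))

print(len(errors_total))  # number of distinct series created
```

Because the label value comes from a fixed tuple, the number of series is capped by methods × route templates × error classes, and no user-supplied text ever reaches the metrics store.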

Snippet 4 — the histogram buckets

Histogram(
    "http_request_duration_seconds",
    "Request duration",
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)
# checkout SLO target: p99 under 200 ms

Quiz

For a service whose SLO is p99 under 200 ms and whose traffic clusters between 50 and 250 ms, why are these default buckets a problem, and what is the better choice?
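A quick way to see the resolution problem is to find which bucket the SLO threshold lands in. The sketch below does that for the boundaries in the snippet and for a finer layout; the `CANDIDATE` boundaries and the `enclosing_bucket` helper are assumptions chosen for illustration, not recommended values.

```python
import bisect

SLO_SECONDS = 0.2

DEFAULT = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Hypothetical layout: 25 ms steps across the 50-250 ms band, SLO on a boundary.
CANDIDATE = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175,
             0.2, 0.225, 0.25, 0.3, 0.5, 1]


def enclosing_bucket(bounds, value):
    """Return (lower, upper) of the `le` bucket an observation falls into."""
    i = bisect.bisect_left(bounds, value)  # value == bound belongs to that bucket
    lower = bounds[i - 1] if i > 0 else 0.0
    upper = bounds[i] if i < len(bounds) else float("inf")
    return lower, upper


def bounds_in_band(bounds, low, high):
    """Count boundaries inside the band where traffic actually clusters."""
    return sum(1 for b in bounds if low <= b <= high)


for name, bounds in (("default", DEFAULT), ("candidate", CANDIDATE)):
    print(
        name,
        enclosing_bucket(bounds, SLO_SECONDS),
        bounds_in_band(bounds, 0.05, 0.25),
        SLO_SECONDS in bounds,
    )
```

With the default layout, any p99 between 100 ms and 250 ms is an interpolated guess inside one wide bucket. When the SLO value is itself a boundary, the fraction of requests at or under 200 ms can be read directly from that bucket's count instead of being estimated.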

Recap

Every RED and USE incident is read in queries and reports. histogram_quantile needs sum by (le): aggregate the le label away and it has no bucket boundaries left to interpolate over, so it cannot return a valid quantile. PSI memory full above zero means every non-idle task was stalled on memory at the same time, which is a page-grade crunch even when MemAvailable looks healthy. RED labels must be bounded error classes, never PII-bearing free text. Bucket boundaries must bracket the SLO with real resolution (or use native histograms). Read the snippet, find the silent defect, then make the bounded, correctness-preserving fix.
