Crux: Read real PromQL queries, a PSI report, and instrumentation snippets; predict the behaviour and pick the highest-leverage fix a senior engineer makes first.
RED and USE problems are diagnosed by reading queries and kernel reports, not by reciting definitions. Read each snippet, predict what it actually computes or exposes, and choose the fix a senior engineer reaches for first.
Goal
Practise the reading loop of every triage: parse the PromQL or the PSI line, spot the silent defect (a dropped label, a high-cardinality dimension, the wrong aggregation), and reach for the correct repair.
A dashboard panel uses this for fleet-wide p99 and shows a flat, suspicious line. What is wrong, and what is the fix?
Heads-up: Window size is a separate tuning concern; 5m is normal. The defect is the missing by (le): without it, histogram_quantile has no bucket vector to interpolate across.
Heads-up: Aggregating per-bucket counts across replicas is exactly what histograms are for, but only when grouped by le. The query is broken by the missing grouping, not by the idea of aggregation.
Heads-up: A collapsed sum without by (le) produces a meaningless result, not a true stable p99. The giveaway is the dropped le label, which histogram_quantile requires.
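To see why the le grouping matters, it can help to sketch what a classic-histogram quantile estimate does with the bucket vector. The code below is a simplified illustration, not Prometheus's implementation: the function names histogramQuantile and sumByLe and the sample counts are hypothetical, and counts stand in for per-bucket rates. It merges two replicas bucket by bucket (the analogue of sum by (le)) and then interpolates linearly inside the bucket that holds the target rank.

```javascript
// Simplified sketch of classic-histogram quantile estimation.
// `buckets`: array of { le, count } with CUMULATIVE counts, sorted by le,
// ending with le = Infinity. Names and data are illustrative only.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      const upper = buckets[i].le;
      const lower = i === 0 ? 0 : buckets[i - 1].le;
      const below = i === 0 ? 0 : buckets[i - 1].count;
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (upper === Infinity) return lower;
      const inBucket = buckets[i].count - below;
      if (inBucket === 0) return upper;
      // Linear interpolation inside the bucket that contains the rank.
      return lower + (upper - lower) * ((rank - below) / inBucket);
    }
  }
  return NaN;
}

// Analogue of `sum by (le)`: add counts across replicas, one total per le.
function sumByLe(replicas) {
  const merged = new Map();
  for (const replica of replicas) {
    for (const { le, count } of replica) {
      merged.set(le, (merged.get(le) || 0) + count);
    }
  }
  return [...merged.entries()]
    .map(([le, count]) => ({ le, count }))
    .sort((a, b) => a.le - b.le);
}

// Two hypothetical replicas with the same bucket layout (seconds).
const replicaA = [
  { le: 0.1, count: 50 }, { le: 0.25, count: 90 },
  { le: 0.5, count: 100 }, { le: Infinity, count: 100 },
];
const replicaB = [
  { le: 0.1, count: 10 }, { le: 0.25, count: 60 },
  { le: 0.5, count: 100 }, { le: Infinity, count: 100 },
];

const fleet = sumByLe([replicaA, replicaB]);
console.log(histogramQuantile(0.99, fleet)); // fleet-wide p99 estimate in seconds
```

Collapsing the same data into one number (a sum with no le) would leave nothing for histogramQuantile to walk: the interpolation needs one cumulative count per boundary.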
MemAvailable on this host reads 600 MB free. Reading the PSI report, what is happening and what alert grade does it deserve?
Heads-up: PSI full at 9% directly contradicts that. MemAvailable reports what is free right now, not the reclaim pressure; the unit's whole point is that PSI full can be high while free RAM looks healthy.
Heads-up: It is the reverse. 'full' (all non-idle tasks stalled) is the more severe signal and is the page-grade one. Memory full sustained above zero warrants an immediate page.
Heads-up: avg10 at 9% full is the live signal; the lower avg300 just means the crunch is recent and building. You act on the short-window pressure, not the cooled-off long average.
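The reading rule above can be sketched as a small parser for the text format of /proc/pressure/memory (lines beginning with some and full, each carrying avg10, avg60, avg300, and total). The function names parsePsi and gradeMemoryPressure, the grade strings, and the sample values are assumptions made for illustration; they are not the lesson's report. A real alert would also require the condition to hold for some duration rather than firing on a single sample.

```javascript
// Parse PSI text into { some: {avg10, avg60, avg300, total}, full: {...} }.
function parsePsi(text) {
  const report = {};
  for (const line of text.trim().split("\n")) {
    const [kind, ...fields] = line.trim().split(/\s+/);
    report[kind] = {};
    for (const field of fields) {
      const [key, value] = field.split("=");
      report[kind][key] = Number(value);
    }
  }
  return report;
}

// Hypothetical grading policy following the lesson's rule:
// memory "full" above zero on the short window is page-grade, and
// avg10 above avg300 means the pressure is recent and building.
function gradeMemoryPressure(report) {
  const full = report.full;
  if (!full || !(full.avg10 > 0)) return { grade: "no-page", trend: "flat" };
  const trend = full.avg10 > full.avg300 ? "building" : "easing";
  return { grade: "page", trend };
}

// Illustrative sample only (values invented for the example).
const sample = [
  "some avg10=23.40 avg60=12.10 avg300=3.20 total=184523",
  "full avg10=9.00 avg60=4.50 avg300=1.10 total=90211",
].join("\n");

console.log(gradeMemoryPressure(parsePsi(sample)));
```

Note that MemAvailable never enters the decision: the grade is driven by stall time alone, which is the point of the question.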
Snippet 3 — the Errors counter
errs.inc({
  method: req.method,
  route: req.route?.path,
  error_message: err.message, // e.g. "user alice@example.com not found"
});
This RED Errors counter passed review until a buggy release tripled the metrics bill overnight and triggered a security ticket. Name both problems and the single fix.
Heads-up: method and route templates are bounded and actionable, so they belong on the metric. The unbounded, PII-bearing dimension is error_message; that is the one to remove.
Heads-up: Counter increments are essentially free. Hosted backends bill per active series, and the cost driver is the number of unique error_message values, not the number of increments.
Heads-up: Scrape interval does not address series count, and the PII leak remains. High-cardinality, sensitive content belongs in logs/traces, not metric labels.
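One way to make the fix concrete is to replace the free-text label with a small, fixed set of error classes. The sketch below is illustrative only: errorClass, errorLabels, the class names, and the convention that errors carry a numeric status or a code property are all assumptions, not part of the lesson's codebase.

```javascript
// Hypothetical mapping from an error to a bounded label value.
// Assumes errors may carry `status` (HTTP-style) or `code`; adjust to taste.
function errorClass(err) {
  const status = err && err.status;
  if (status === 404) return "not_found";
  if (status === 400 || status === 422) return "validation";
  if (status === 401 || status === 403) return "auth";
  if (status === 504 || (err && err.code === "ETIMEDOUT")) return "timeout";
  return "internal";
}

// Build the label set for the Errors counter: every value is bounded,
// and the raw message (possible PII) never reaches the metric.
function errorLabels(req, err) {
  return {
    method: req.method,
    route: (req.route && req.route.path) || "unmatched",
    error_class: errorClass(err),
  };
}

// Usage with the lesson's counter would look like:
//   errs.inc(errorLabels(req, err));
// while err.message goes to the structured log or trace instead.
const demoReq = { method: "GET", route: { path: "/users/:id" } };
const demoErr = { status: 404, message: "user alice@example.com not found" };
console.log(errorLabels(demoReq, demoErr));
```

With five classes, the series count is capped at methods × routes × 5 no matter how many distinct messages a buggy release produces.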
For a service whose SLO is p99 under 200 ms and whose traffic clusters between 50 and 250 ms, why are these default buckets a problem, and what is the better choice?
Heads-up: Eleven le boundaries is a normal, cheap series count; bucket count is not the cost driver here. The problem is bucket placement: no resolution where the SLO actually lives.
Heads-up: Accuracy depends entirely on bucket density near the percentile you care about. With one bucket covering 100-250 ms, interpolated p99 is essentially a guess across that whole span.
Heads-up: The average masks exactly the tail the SLO is about: a regression can push p99 from 200 to 800 ms while the mean barely moves. Fix the buckets (or go native), do not abandon percentiles.
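The placement problem can be checked mechanically. The sketch below assumes the defaults in question are the usual eleven-boundary client-library set (in seconds); SLO_BUCKETS is a hypothetical alternative layout with the same number of boundaries, and bucketContaining and boundariesWithin are illustrative helpers, not library functions.

```javascript
// Assumed default boundaries (seconds): the common eleven-value set.
const DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// Hypothetical layout with the same boundary count, concentrated on
// 50-250 ms and with a boundary exactly at the 200 ms SLO.
const SLO_BUCKETS = [0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.225, 0.25, 0.5, 1];

// Which bucket does a latency fall into, and how wide is it?
function bucketContaining(buckets, seconds) {
  let lower = 0;
  for (const upper of buckets) {
    if (seconds <= upper) return { lower, upper, width: upper - lower };
    lower = upper;
  }
  return { lower, upper: Infinity, width: Infinity };
}

// How many boundaries sit inside the range where traffic clusters?
function boundariesWithin(buckets, lo, hi) {
  return buckets.filter((b) => b >= lo && b <= hi).length;
}

console.log(bucketContaining(DEFAULT_BUCKETS, 0.2)); // bucket spanning 0.1 to 0.25
console.log(bucketContaining(SLO_BUCKETS, 0.2));     // bucket spanning 0.175 to 0.2
console.log(boundariesWithin(DEFAULT_BUCKETS, 0.05, 0.25));
console.log(boundariesWithin(SLO_BUCKETS, 0.05, 0.25));
```

Both layouts cost the same number of series per label set; the second simply spends its boundaries where the SLO decision is made, and a boundary at exactly 200 ms lets the fraction of requests within SLO be read directly from a bucket count instead of being interpolated.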
Recap
Every RED+USE incident is read in queries and reports: histogram_quantile needs sum by (le) or it returns garbage; PSI memory full above zero is a page-grade crunch even when MemAvailable looks healthy; RED labels must be bounded error classes, never PII-bearing free text; and bucket boundaries must bracket the SLO with real resolution (or use native histograms). Read the snippet, find the silent defect, then make the bounded, correctness-preserving fix.