Crux: Read real PromQL recording rules, multiwindow multi-burn-rate (MWMBR) alert expressions, a latency-bucket SLI, and a Sloth YAML, then predict the behaviour and pick the highest-leverage fix.
SLO bugs hide in the PromQL, not in the slides. A wrong bucket, a hard-coded budget rate, a zero denominator that turns the ratio into NaN: each silently corrupts the budget and the alert. Read the rule and choose the fix a senior engineer would make first.
Goal
Practise reading the artefacts SLOs are actually built from: recording rules, burn-rate alert expressions, latency-bucket SLIs, and the YAML a platform generates them from — then spot the defect.
Snippet 1 — the burn-rate page rule
# 99.9% SLO. Intended: page when 1h AND 5m burn both exceed 14.4x.
alert: SLOAvailabilityBurnFast
expr: |
  (1 - job:slo_availability:ratio_rate1h) > (14.4 * 0.001)
  or
  (1 - job:slo_availability:ratio_rate5m) > (14.4 * 0.001)
labels:
  severity: page
Quiz
The author wanted MWMBR behaviour. What is the bug and what does it cause in production?
Heads-up 0.001 is exactly the budget rate for a 99.9% SLO (1 − 0.999). The defect is the OR operator: MWMBR requires AND so that the long window confirms severity while the short window confirms it is still ongoing.
Heads-up In canonical MWMBR the short window uses the same burn threshold as its paired long window — it is the 'still happening now?' gate, not a different severity. The real bug is OR vs AND.
Heads-up MWMBR is defined by AND between a long and a short window. OR is the failed Approach 4 from the SRE workbook precisely because it combines the noise of the short window with the slow reset of the long window.
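The difference between the two operators can be sketched numerically. The following is a minimal Python illustration, not the alerting engine itself: the scenario names and error ratios are hypothetical, and `burn_factor` assumes a 30-day (720 h) SLO period with 2% of the budget spent in one hour, which is where the 14.4 comes from.

```python
# Sketch: compare the rule as written (OR) with the intended MWMBR rule (AND).
BUDGET_RATE = 1 - 0.999                  # error budget of a 99.9% SLO, about 0.001
BURN_FACTOR = 14.4                       # burn rate used by the snippet
THRESHOLD = BURN_FACTOR * BUDGET_RATE    # about 0.0144 as an error ratio


def burn_factor(budget_fraction, window_hours, period_hours=720.0):
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`.

    Assumes a 30-day SLO period (720 hours) unless told otherwise.
    """
    return budget_fraction * period_hours / window_hours


def pages_with_or(err_1h, err_5m):
    """The rule as written: either window alone is enough to page."""
    return err_1h > THRESHOLD or err_5m > THRESHOLD


def pages_with_and(err_1h, err_5m):
    """The intended rule: both windows must exceed the threshold."""
    return err_1h > THRESHOLD and err_5m > THRESHOLD


# Hypothetical scenarios: (1h error ratio, 5m error ratio)
scenarios = {
    "short blip": (0.002, 0.05),
    "recovered incident": (0.03, 0.0),
    "ongoing incident": (0.03, 0.05),
}

for name, (err_1h, err_5m) in scenarios.items():
    print(
        f"{name}: OR pages={pages_with_or(err_1h, err_5m)}, "
        f"AND pages={pages_with_and(err_1h, err_5m)}"
    )
```

Under these assumed inputs the OR rule pages on the short blip and keeps paging after the incident has recovered, while the AND rule pages only while the incident is both severe and still ongoing.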
Snippet 2 — the latency SLI
# Latency SLO target: 99% of requests under 200ms.
# Histogram buckets defined as le: 0.1, 0.25, 0.5, 1, 2.5
latency_sli =
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[1h]))
  /
  sum(rate(http_request_duration_seconds_count[1h]))
Quiz
The SLO threshold is 200ms but the nearest bucket is le=0.25. What is wrong, and what is the fix?
Heads-up A latency SLI is a counter ratio (fast / total), not a percentile estimate. histogram_quantile interpolates within a bucket and introduces error that corrupts the budget. The fix is a bucket boundary exactly at the threshold (le="0.2"), then count the requests in that bucket directly.
Heads-up The range affects smoothing, not correctness. The actual defect is the missing bucket at 0.2s, forcing the SLI to use the 0.25 boundary and misclassify 200-250ms requests as 'fast'.
Heads-up 'Close enough' silently biases the budget: every request between 200ms and 250ms counts as good when it should be a budget burn. The SLO threshold and a histogram bucket boundary must coincide exactly.
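To make the size of that bias concrete, here is a small Python sketch with an invented latency sample (the 1000 requests and their values are assumptions for illustration only). It counts "good" requests the way a cumulative `le` bucket does, once at the 0.25 s boundary the snippet uses and once at the 0.2 s boundary the SLO actually names.

```python
def good_ratio(latencies_s, boundary_s):
    """Fraction of requests at or below the boundary, like a cumulative le bucket."""
    good = sum(1 for latency in latencies_s if latency <= boundary_s)
    return good / len(latencies_s)


# Hypothetical sample of 1000 requests (values in seconds):
# 970 fast, 25 in the 200-250 ms gap, 5 clearly slow.
latencies = [0.05] * 970 + [0.22] * 25 + [0.8] * 5

TARGET = 0.99  # 99% of requests under 200 ms

sli_at_250ms = good_ratio(latencies, 0.25)  # what the snippet measures
sli_at_200ms = good_ratio(latencies, 0.2)   # what the SLO promises

print(f"le=0.25: SLI={sli_at_250ms:.3f}, meets target: {sli_at_250ms >= TARGET}")
print(f"le=0.2:  SLI={sli_at_200ms:.3f}, meets target: {sli_at_200ms >= TARGET}")
```

With this sample the 0.25 s bucket reports the SLO as met, while the 0.2 s boundary shows it is being missed; the 25 requests in the gap are the hidden budget burn.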
Snippet 3 — the gctrace-free NaN
# Recording rule for a low-traffic internal service.
record: job:slo_availability:ratio_rate5m
expr: |
  sum(rate(http_requests_total{status!~"5..",job="reports"}[5m]))
  /
  sum(rate(http_requests_total{job="reports"}[5m]))
# During a quiet window, this series evaluates to NaN.
Quiz
The ratio goes NaN during quiet periods on this low-traffic service. What does that mean for the SLO, and what is the senior remedy?
Heads-up NaN is not 100%; it is 0/0 from a zero denominator. Any comparison against NaN is false, so the burn-rate alert cannot fire and the SLI is simply undefined for that window, which is worse than a visible false negative because nobody notices.
Heads-up The target is unrelated to a zero denominator. The structural fix for low-traffic NaN is synthetic probes (stabilise the denominator), aggregation across related services, or a longer evaluation window — plus a separate no-traffic alert.
Heads-up increase() over a window of no traffic still yields zero, so the ratio is still 0/0. The root cause is no traffic in the denominator, not the choice of rate vs increase.
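The NaN behaviour can be reproduced outside Prometheus. The sketch below is an illustration in Python, not PromQL: `promql_div` is a hypothetical helper that mimics IEEE 754 float division (plain Python raises `ZeroDivisionError` instead), and `evaluate` shows one possible way to surface a separate no-traffic state rather than letting NaN fall through the comparison.

```python
import math


def promql_div(numerator, denominator):
    """Mimic IEEE 754 division, where 0/0 is NaN instead of an exception."""
    if denominator == 0:
        if numerator == 0:
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


def evaluate(good_rate, total_rate, threshold=14.4 * 0.001):
    """Classify a window, treating a NaN ratio as its own state."""
    ratio = promql_div(good_rate, total_rate)
    if math.isnan(ratio):
        return "no-traffic"
    return "burning" if (1 - ratio) > threshold else "ok"


quiet_ratio = promql_div(0.0, 0.0)
print("quiet ratio:", quiet_ratio)
# Every comparison against NaN is False, so a naive alert condition never holds.
print("naive alert condition:", (1 - quiet_ratio) > 14.4 * 0.001)
print("with explicit state:", evaluate(0.0, 0.0))
```

The explicit `"no-traffic"` branch plays the role of the separate no-traffic alert: the quiet window becomes a visible condition instead of a silently undefined SLI.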
A teammate argues that defining the SLO in a Sloth YAML is risky because you must hand-maintain six recording rules and three MWMBR alerts per service. Why is that argument wrong here?
Heads-up Rebasing is exactly what the generator does: it derives the per-objective budget rate (here 0.005) and emits correct alert expressions. Hand-rolling is what gets the 0.005 wrong; Sloth does not.
Heads-up Hand-written PromQL across many services is the documented source of silent SLO bugs (wrong window, wrong budget rate). Declarative generation gives auditability via the YAML plus consistent, correct output.
Heads-up Sloth emits Prometheus recording rules AND MWMBR alert rules (plus optional Grafana dashboards) from the declaration. The whole alerting stack is generated, not just dashboards.
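The rebasing step is simple arithmetic, which is why it is easy to get wrong by hand. The following Python sketch is not Sloth's code or API; `budget_rate` and `thresholds` are hypothetical helpers, and the window pairs and the factor 6 are assumed from the SRE workbook's MWMBR table (only 14.4 appears in the snippets above).

```python
# Assumed burn-rate factors per (long window / short window) pair.
BURN_FACTORS = {"1h/5m": 14.4, "6h/30m": 6.0}


def budget_rate(objective_percent):
    """Error budget as a ratio, e.g. 99.5 -> 0.005."""
    return (100.0 - objective_percent) / 100.0


def thresholds(objective_percent):
    """Error-ratio threshold for each window pair, rebased to the objective."""
    rate = budget_rate(objective_percent)
    return {windows: factor * rate for windows, factor in BURN_FACTORS.items()}


for objective in (99.9, 99.5):
    derived = thresholds(objective)
    print(objective, {k: round(v, 6) for k, v in derived.items()})

# A rule copied from a 99.9% service keeps 0.001 hard-coded.
copied = 14.4 * 0.001
correct = thresholds(99.5)["1h/5m"]
print(f"copied threshold is {correct / copied:.1f}x too strict for 99.5%")
```

For a 99.5% objective the fast-burn threshold should be about 0.072; a rule copied with 0.001 still in it pages at 0.0144, five times too early.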
Recap
Every SLO defect here is visible in the rules themselves. MWMBR needs AND between a long and a short window (OR is the failed Approach 4). A latency SLI needs a histogram bucket exactly at the SLO threshold, or it silently miscounts and biases the budget. A zero denominator on a low-traffic service yields NaN that neither fires nor clears, so add synthetic probes plus a no-traffic alert. A platform like Sloth generates all six recording rules and three MWMBR alerts from one YAML, with the budget rate correctly rebased to the objective, which is why hand-rolled PromQL, not the generator, is the bug source.