SLO and error budgets: PromQL and rule reading

Read real PromQL recording rules, multi-window multi-burn-rate (MWMBR) alert expressions, a latency-bucket SLI, and a Sloth YAML, then predict the behaviour and pick the highest-leverage fix.

OBS Senior ◷ 14 min

SLO bugs hide in the PromQL, not in the slides. A wrong bucket, a hard-coded budget rate, a zero denominator that turns the ratio into NaN: each silently corrupts the budget and the alert. Read the rule and choose the fix a senior engineer would make first.

Goal

Practise reading the artefacts SLOs are actually built from: recording rules, burn-rate alert expressions, latency-bucket SLIs, and the YAML a platform generates them from — then spot the defect.

Snippet 1 — the burn-rate page rule

# 99.9% SLO. Intended: page when 1h AND 5m burn both exceed 14.4x.
alert: SLOAvailabilityBurnFast
expr: |
  (1 - job:slo_availability:ratio_rate1h) > (14.4 * 0.001)
  or
  (1 - job:slo_availability:ratio_rate5m) > (14.4 * 0.001)
labels:
  severity: page
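
The constant 14.4 * 0.001 in the rule above is easier to audit once the arithmetic is spelled out. The sketch below assumes the common convention behind a 14.4x burn rate (a 30-day SLO period, paging when 2% of the budget is spent in one hour); the snippet itself does not state the period, and the function names are illustrative rather than part of any library.

```python
# Sketch only: names are hypothetical, and the 30-day period / 2% figures are assumptions.

def burn_rate(budget_fraction: float, slo_period_hours: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the error budget within `window_hours`."""
    return budget_fraction * slo_period_hours / window_hours


def error_ratio_threshold(objective_percent: float, rate: float) -> float:
    """Error ratio above which the budget burns faster than `rate`."""
    error_budget = 1 - objective_percent / 100
    return rate * error_budget


# Assumed: 30-day SLO period, page when 2% of the budget is spent in 1 hour.
fast_burn = burn_rate(budget_fraction=0.02, slo_period_hours=30 * 24, window_hours=1)

print(round(fast_burn, 1))                               # 14.4
print(round(error_ratio_threshold(99.9, fast_burn), 4))  # 0.0144, i.e. 14.4 * 0.001
print(round(error_ratio_threshold(99.5, fast_burn), 4))  # 0.072
```

The last line shows why the 0.001 factor cannot be copied between services: the same burn rate against a 99.5% objective corresponds to a different error-ratio threshold.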

Quiz

The author wanted MWMBR behaviour. What is the bug and what does it cause in production?

Snippet 2 — the latency SLI

# Latency SLO target: 99% of requests under 200ms.
# Histogram buckets defined as le: 0.1, 0.25, 0.5, 1, 2.5
latency_sli =
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[1h]))
  /
  sum(rate(http_request_duration_seconds_count[1h]))

Quiz

The SLO threshold is 200ms but the nearest bucket is le=0.25. What is wrong, and what is the fix?

Snippet 3 — the low-traffic NaN

# Recording rule for a low-traffic internal service.
record: job:slo_availability:ratio_rate5m
expr: |
  sum(rate(http_requests_total{status!~"5..",job="reports"}[5m]))
  /
  sum(rate(http_requests_total{job="reports"}[5m]))
# During a quiet window, this series evaluates to NaN.
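
Where the NaN comes from can be reproduced outside Prometheus. This is a minimal sketch, assuming PromQL's float arithmetic follows the same IEEE 754 rules as Python floats; `ieee_divide` is a hypothetical helper, needed only because Python raises an exception on 0.0 / 0.0 instead of returning NaN.

```python
import math


def ieee_divide(numerator: float, denominator: float) -> float:
    """Float division with IEEE 754 results: 0/0 gives NaN instead of raising."""
    if denominator == 0:
        if numerator == 0:
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


# Quiet window: the counters exist but did not increase, so both rates are 0.
good_rate = 0.0
total_rate = 0.0
availability = ieee_divide(good_rate, total_rate)

threshold = 14.4 * 0.001  # same constant as the burn-rate rule in Snippet 1

print(math.isnan(availability))         # True
print((1 - availability) > threshold)   # False
print((1 - availability) <= threshold)  # False
```

Both comparisons are false because any ordered comparison against NaN is false, so a threshold expression built on this ratio has nothing to match during the quiet window.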

Quiz

The ratio goes NaN during quiet periods on this low-traffic service. What does that mean for the SLO, and what is the senior remedy?

Snippet 4 — the Sloth declaration

version: "prometheus/v1"
service: checkout
slos:
  - name: availability
    objective: 99.5
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5..",job="checkout"}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="checkout"}[{{.window}}]))
    alerting:
      page_alert:   { labels: { severity: page } }
      ticket_alert: { labels: { severity: ticket } }

Quiz

A teammate argues this is risky because you must hand-maintain six recording rules and three MWMBR alerts per service. Why is that argument wrong here?

Recap

Every SLO defect here can be read straight from the rules. MWMBR needs AND between a long and a short window (OR is the failed Approach 4). A latency SLI needs a histogram bucket exactly at the SLO threshold, or it silently miscounts and biases the budget. A zero denominator on a low-traffic service yields NaN, which neither fires nor clears the alert, so add synthetic probes plus a no-traffic alert. A platform like Sloth generates all six recording rules and three MWMBR alerts from one YAML, with the budget rate correctly rebased to the objective, which is exactly why hand-rolled PromQL is the bug source.
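
The latency-bucket point can be made concrete with a handful of numbers. The sketch below uses invented latencies (not data from the lesson) and a hypothetical helper that mimics what a cumulative histogram bucket counts; it compares the SLI the 200ms target describes with the one a le="0.25" bucket actually reports.

```python
from typing import Sequence

# Hypothetical request latencies in seconds, chosen only for illustration.
latencies = [0.05, 0.08, 0.12, 0.15, 0.19, 0.21, 0.22, 0.24, 0.30, 0.60]


def fraction_at_or_below(samples: Sequence[float], le: float) -> float:
    """Share of samples that a cumulative bucket with upper bound `le` would count."""
    return sum(1 for s in samples if s <= le) / len(samples)


true_sli = fraction_at_or_below(latencies, 0.2)       # what the 200ms SLO means
reported_sli = fraction_at_or_below(latencies, 0.25)  # what le="0.25" measures

print(true_sli)      # 0.5
print(reported_sli)  # 0.8
```

Requests between 200ms and 250ms are counted as good by the wider bucket, so the reported SLI is optimistic and the budget appears healthier than it is. The size of the gap depends entirely on how many requests land in that band.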
