Engineering Practice ENG · 07 · 09

On-call: alert rules and budget math

Read real Prometheus alert rules, a multi-burn-rate SLO rule, an error-budget calculation, and a runbook step, then pick the senior fix or read the math correctly.

ENG Senior ◷ 14 min

Alert rules and error-budget math are where on-call philosophy becomes config. Read each snippet, predict what it will do to the pager, and choose the fix a senior makes first.

Goal

Practise the loop that turns principle into a trustworthy pager: read an alert rule, judge whether it pages on a symptom or a cause, do the burn-rate and budget arithmetic, and spot the runbook step that actually lowers MTTR.

Snippet 1 — a cause-based alert rule

groups:
- name: node
  rules:
  - alert: HighCPU
    expr: instance:node_cpu_utilisation:rate5m > 0.80
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "CPU above 80% on {{ $labels.instance }}"

Quiz

This rule pages with severity: page. What is wrong with it, and what is the highest-leverage fix?

Snippet 2 — a multi-burn-rate SLO rule

# SLO: 99.9% availability over 30 days. Budget = 0.1% of requests may fail.
- alert: ErrorBudgetFastBurn
  expr: |
    (
      job:slo_errors_per_request:ratio_rate1h{job="api"} > (14.4 * 0.001)
    and
      job:slo_errors_per_request:ratio_rate5m{job="api"} > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: page

Quiz

Why does this rule require BOTH the 1h and the 5m error ratio to exceed 14.4× the budgeted error rate (0.1%), and what does the 14.4× factor mean?
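
Once you have your own answer, the threshold arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the numbers stated in the snippet (a 99.9% SLO over a 30-day window). The function names are hypothetical helpers for illustration and are not part of Prometheus or any library.

```python
# Sketch: what a burn-rate factor means for a 30-day SLO window.
# All names here are illustrative assumptions, not a Prometheus API.

SLO_TARGET = 0.999        # 99.9% availability, from the snippet's comment
WINDOW_HOURS = 30 * 24    # 30-day SLO window = 720 hours


def error_ratio_threshold(burn_rate: float, slo_target: float = SLO_TARGET) -> float:
    """Error ratio that burns budget at `burn_rate` times the sustainable pace."""
    return burn_rate * (1 - slo_target)


def budget_fraction_burned(burn_rate: float, hours: float,
                           window_hours: float = WINDOW_HOURS) -> float:
    """Share of the whole window's budget consumed by `hours` at `burn_rate`."""
    return burn_rate * hours / window_hours


def hours_to_exhaustion(burn_rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the full budget is gone if `burn_rate` is sustained."""
    return window_hours / burn_rate


if __name__ == "__main__":
    print(f"threshold error ratio: {error_ratio_threshold(14.4):.4f}")    # 0.0144
    print(f"budget burned in 1h:   {budget_fraction_burned(14.4, 1):.1%}")  # 2.0%
    print(f"hours to exhaustion:   {hours_to_exhaustion(14.4):.0f}")       # 50
```

Read this way, a burn rate of 1 spends the budget exactly over the 30 days, while 14.4 sustained for one hour spends about 2% of it. The 5m window in the rule is a separate concern: it checks that the burn is still happening now, which the arithmetic above does not model.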

Snippet 3 — error-budget arithmetic

SLO            = 99.95% successful requests over 30 days
Traffic        = 2,000 requests/second, steady
Budget         = (1 - 0.9995) = 0.05% of requests may fail
Incident       = a deploy bug returns 5xx on 2% of requests for 30 minutes
Question       = how much of the 30-day error budget did this one incident burn?

Quiz

Roughly what fraction of the monthly error budget did this 30-minute, 2%-error incident consume?
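
After making your own estimate, you can check it by counting requests. The sketch below assumes exactly the figures in Snippet 3 (steady 2,000 requests/second, a 30-day window, 2% errors for 30 minutes); the function names are hypothetical and chosen only for this illustration.

```python
# Sketch: share of a 30-day error budget consumed by one incident.
# Inputs are copied from Snippet 3; helper names are illustrative assumptions.

SLO_TARGET = 0.9995                  # 99.95% successful requests
WINDOW_SECONDS = 30 * 24 * 3600      # 30 days
RPS = 2000                           # steady traffic
INCIDENT_ERROR_RATIO = 0.02          # 2% of requests return 5xx
INCIDENT_SECONDS = 30 * 60           # 30 minutes


def budget_requests(rps: float, window_seconds: float, slo_target: float) -> float:
    """Number of requests allowed to fail over the whole window."""
    return rps * window_seconds * (1 - slo_target)


def failed_requests(rps: float, seconds: float, error_ratio: float) -> float:
    """Number of requests that failed during the incident."""
    return rps * seconds * error_ratio


def fraction_of_budget(failed: float, budget: float) -> float:
    """Incident failures as a share of the window's allowed failures."""
    return failed / budget


if __name__ == "__main__":
    budget = budget_requests(RPS, WINDOW_SECONDS, SLO_TARGET)
    failed = failed_requests(RPS, INCIDENT_SECONDS, INCIDENT_ERROR_RATIO)
    print(f"allowed failures: {budget:,.0f}")                              # 2,592,000
    print(f"incident failures: {failed:,.0f}")                             # 72,000
    print(f"budget consumed: {fraction_of_budget(failed, budget):.1%}")    # 2.8%
```

Because traffic is steady, the request rate cancels out: the same result follows from the burn rate (0.02 / 0.0005 = 40) multiplied by the share of the window the incident lasted (30 minutes out of 43,200). With uneven traffic that shortcut no longer holds and you would need to count requests.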

Snippet 4 — a runbook step

## Runbook: api ErrorBudgetFastBurn
1. Ack the page; open the SLO dashboard (latency, errors, traffic, saturation).
2. Check Deploys panel: did a release land in the last 30 min? If yes, ROLL BACK first.
3. (If no recent deploy) check upstream-dependency error panel and DB saturation.
4. Mitigate to stop the burn; only then root-cause.
5. If error rate not falling within 15 min, escalate to secondary.

Quiz

Step 2 says roll back a recent deploy before root-causing. Why is that ordering correct for an on-call responder mid-incident?

Recap

On-call practice lives in config and arithmetic. A raw CPU threshold tagged severity: page is a cause alert that breeds fatigue. A multi-window, multi-burn-rate rule fires on a real fast burn yet ignores brief flaps. Error-budget math turns a short incident into a concrete percentage of the month’s allowance, so you size urgency to real harm. A good runbook encodes mitigate-before-diagnose with a timed escalation, so the median responder recovers fast. Judge alerts by actionability, do the budget math, and let the runbook carry the 3 a.m. brain.
