Crux Read real Prometheus alert rules, a multi-burn-rate SLO rule, an error-budget calculation, and an Alertmanager route, then pick the senior fix or read the math correctly.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
Alert rules and error-budget math are where on-call philosophy becomes config. Read each snippet, predict what it will do to the pager, and choose the fix a senior makes first.
Goal
Practise the loop that turns principle into a trustworthy pager: read an alert rule, judge whether it pages on a symptom or a cause, do the burn-rate and budget arithmetic, and spot the runbook step that actually lowers MTTR.
This rule pages with severity: page. What is wrong with it, and what is the highest-leverage fix?
Heads-up A longer window only delays the same non-actionable cause page. Sustained high CPU still may harm no user; the defect is that it pages on a cause at all, not the duration.
Heads-up Faster detection of a non-symptom is not the goal — it would page sooner on something that doesn't map to user harm. The problem is what it alerts on, not how quickly.
Heads-up Saturation is one of the golden signals as a symptom of user-facing degradation, but a raw CPU threshold with no tie to latency/errors or SLO is a classic cause alert. Page on the symptom, not the gauge.
Snippet 2 — a multi-burn-rate SLO rule
# SLO: 99.9% availability over 30 days. Budget = 0.1% of requests may fail.- alert: ErrorBudgetFastBurn expr: | ( job:slo_errors_per_request:ratio_rate1h{job="api"} > (14.4 * 0.001) and job:slo_errors_per_request:ratio_rate5m{job="api"} > (14.4 * 0.001) ) for: 2m labels: severity: page
Quiz
Completed
Why does this rule require BOTH a 1h and a 5m window to exceed 14.4× the budget, and what does the 14.4× factor mean?
Heads-up They are not redundant. The 1h window alone pages slowly and can keep firing after a burn ends; the 5m alone fires on transient spikes. Requiring both is what gives fast, low-false-positive detection.
Heads-up 14.4 is a multiplier on the budget burn rate, not an error percentage. The threshold is 14.4 × 0.001 = 1.44% of requests failing, a rate that would drain a 30-day budget in roughly 2 days.
Heads-up for: 2m only requires the combined condition to hold for 2 minutes; it does not provide the long-window confirmation. Removing the windows reintroduces flap-driven false pages.
Snippet 3 — error-budget arithmetic
SLO = 99.95% successful requests over 30 daysTraffic = 2,000 requests/second, steadyBudget = (1 - 0.9995) = 0.05% of requests may failIncident = a deploy bug returns 5xx on 2% of requests for 30 minutesQuestion = how much of the 30-day error budget did this one incident burn?
Quiz
Completed
Roughly what fraction of the monthly error budget did this 30-minute, 2%-error incident consume?
Heads-up That conflates instantaneous burn multiplier with budget fraction. The right calculation is failed requests over allowed failures: 72,000 / ~2.59M ≈ 2.8%, not 56%. The 40× multiplier only tells you the burn would be alarming if sustained for the whole month — it didn't run anywhere near that long.
Heads-up Budget burn depends on failed requests versus allowed failures, not on wall-clock duration as a percentage. The 30 minutes only sets how many requests were affected; do the request math.
Heads-up It absolutely can — 72,000 failures against a ~2.59M allowance is ≈2.8% of the whole month's budget from one short incident. A handful of such incidents exhausts the SLO.
Snippet 4 — a runbook step
## Runbook: api ErrorBudgetFastBurn1. Ack the page; open the SLO dashboard (latency, errors, traffic, saturation).2. Check Deploys panel: did a release land in the last 30 min? If yes, ROLL BACK first.3. (If no recent deploy) check upstream-dependency error panel and DB saturation.4. Mitigate to stop the burn; only then root-cause.5. If error rate not falling within 15 min, escalate to secondary.
Quiz
Completed
Step 2 says roll back a recent deploy before root-causing. Why is that ordering correct for an on-call responder mid-incident?
Heads-up Rollback is a mitigation, not a diagnosis — the deploy may or may not be the true root cause. The point is to stop user harm first; root-cause analysis happens after service is restored, in step 4 and the postmortem.
Heads-up It's not an absolute rule. The runbook orders mitigate-before-diagnose for this specific, high-probability scenario (recent deploy + budget burn). The escalation timer in step 5 exists precisely for when the obvious mitigation doesn't work.
Heads-up During an active SLO burn, root-causing first prolongs user harm and spends more budget. Stopping the bleed with a known mitigation, then diagnosing, is the standard incident-response ordering and what lowers MTTR.
Recap
On-call reads in config and arithmetic: a raw CPU threshold tagged severity: page is a cause alert that breeds fatigue; a multi-window, multi-burn-rate rule fires on a real fast burn yet ignores flaps; error-budget math turns a short incident into a concrete percentage of the month’s allowance, so you size urgency to real harm; and a good runbook encodes mitigate-before-diagnose with a timed escalation so the median responder recovers fast. Judge alerts by actionability, do the budget math, and let the runbook carry the 3 a.m. brain.