Observability OBS · 05 · 07

SLO and error budgets: multiple-choice review

Multiple-choice synthesis across the SLO unit: budget arithmetic, burn-rate derivation, multi-window multi-burn-rate (MWMBR) AND logic, composite ceilings, and the organisational failure modes.

OBS Senior ◷ 13 min

Six questions that cut across the whole unit. Each one is a decision you make in a real reliability review — not a definition to recite, but a number to compute and a tradeoff to defend.

Goal

Confirm you can connect SLI selection, budget arithmetic, burn-rate derivation, MWMBR alerting, composite-journey math, and the organisational layer — the synthesis the eight lessons built toward.

Quiz

A service runs a 99.9% availability SLO over a 28-day window at 1M requests/day. Fourteen days in it has served 14M requests with 21,000 failures. How healthy is the budget?
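
The arithmetic behind this question can be sketched as follows. This is a minimal illustration, not a prescribed tool: the function name `budget_report` and its dictionary keys are hypothetical, and it assumes traffic is spread evenly across the window.

```python
def budget_report(slo, window_days, requests_per_day, served, failed):
    """Summarise error-budget health, assuming uniform traffic over the window."""
    budget_fraction = 1 - slo                       # share of requests allowed to fail
    total_budget = budget_fraction * window_days * requests_per_day
    consumed = failed / total_budget                # share of the budget already spent
    elapsed = served / (window_days * requests_per_day)  # share of the window elapsed
    burn_rate = (failed / served) / budget_fraction # 1.0 = exactly on budget
    exhaustion_day = window_days / burn_rate        # day the budget hits zero if the rate holds
    return {
        "total_budget": total_budget,
        "remaining": total_budget - failed,
        "consumed": consumed,
        "elapsed": elapsed,
        "burn_rate": burn_rate,
        "exhaustion_day": exhaustion_day,
    }


report = budget_report(
    slo=0.999,
    window_days=28,
    requests_per_day=1_000_000,
    served=14_000_000,
    failed=21_000,
)
for key, value in report.items():
    print(f"{key}: {value:.2f}")
```

With the numbers from the question, about 75% of the budget is gone at the halfway point, which is a burn rate of roughly 1.5x. If that rate holds, the budget runs out near day 19 of 28.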

Quiz

An on-call team copies a 1h+5m page rule from a 99.9% SLO service — expr (1 − ratio_rate1h) > (14.4 × 0.001) — onto a service whose SLO is 99.5%, leaving the expression unchanged. What breaks?
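
To see what the copied constant does, it helps to separate the burn-rate multiplier from the budget fraction. The sketch below uses hypothetical helper names (`page_threshold`, `effective_burn`) and only the numbers given in the question.

```python
def page_threshold(slo, burn_rate=14.4):
    """Error-ratio threshold that corresponds to a given burn rate for this SLO."""
    return burn_rate * (1 - slo)


def effective_burn(threshold, slo):
    """Burn rate at which a fixed error-ratio threshold actually fires for this SLO."""
    return threshold / (1 - slo)


copied = page_threshold(0.999)                    # threshold written for the 99.9% service
fires_at = effective_burn(copied, 0.995)          # what it means on the 99.5% service
rebased = page_threshold(0.995)                   # threshold rebased to the 99.5% budget

print(f"copied threshold:  {copied:.4f}")
print(f"fires at burn:     {fires_at:.2f}x")
print(f"rebased threshold: {rebased:.4f}")
```

On the 99.5% service the unchanged rule pages at about 2.88x burn rather than 14.4x, so it is roughly five times more sensitive than intended. Rebasing the `0.001` to `0.005` gives a threshold of about 0.072.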

Quiz

An MWMBR page rule fires when the 1h burn AND the 5m burn both exceed 14.4x. During an incident the 1h burn is 18x but the 5m burn has dropped to 9x. Should it page, and what does each window tell you?
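
The AND condition reduces to a two-term boolean. The function below is an illustrative sketch (the name `should_page` is hypothetical), not the alerting system's own rule syntax.

```python
def should_page(long_burn, short_burn, threshold=14.4):
    """Page only when both the long and the short window exceed the burn threshold."""
    return long_burn > threshold and short_burn > threshold


# Scenario from the question: 1h burn is 18x, 5m burn has dropped to 9x.
decision = should_page(long_burn=18.0, short_burn=9.0)
print("page" if decision else "do not page")
```

The long window says a meaningful share of budget has been spent over the last hour; the short window says whether the burn is still happening at that rate right now. With the 5m burn below threshold the rule does not page. If the rule set has a slower, lower-threshold tier, that tier is the one that would catch a sustained 9x burn.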

Quiz

A checkout journey traverses API gateway, auth, inventory, payment, and database, each holding an independent 99.9% SLO, all green. The user-facing checkout success rate sits at ~99.5%. What is the correct reading and fix?
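
The composite ceiling can be checked in a few lines. This sketch assumes the five dependencies are serial and fail independently, and that every checkout touches all of them; the helper names are hypothetical.

```python
import math


def composite_ceiling(slos):
    """Best-case journey availability when every hop must succeed independently."""
    return math.prod(slos)


def per_hop_target(journey_target, hops):
    """Equal per-hop availability needed for the journey to reach journey_target."""
    return journey_target ** (1 / hops)


hops = ["gateway", "auth", "inventory", "payment", "database"]
ceiling = composite_ceiling([0.999] * len(hops))
needed = per_hop_target(0.999, len(hops))

print(f"journey ceiling with 99.9% per hop: {ceiling:.5f}")
print(f"per-hop target for a 99.9% journey: {needed:.5f}")
```

Five green 99.9% components cap the journey at roughly 99.5%, so the observed number is the expected result rather than an anomaly. Under the same assumptions, a 99.9% journey would need each hop at roughly 99.98%, which is one reason to define an SLO on the journey itself instead of inferring it from the parts.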

Quiz

A checkout availability SLI counts any 2xx as 'good'. It is steadily green, yet customers report duplicate charges and 30-second 'successful' checkouts. Why does the SLO miss this, and what is the senior fix?
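
The gap between a status-code SLI and one that tracks user pain shows up clearly on a handful of events. Everything in this sketch is hypothetical: the `CheckoutEvent` fields, the 2-second latency budget, and the sample data are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CheckoutEvent:
    status: int             # HTTP status returned to the client
    latency_s: float        # end-to-end latency in seconds
    duplicate_charge: bool  # correctness signal, e.g. from payment reconciliation


def is_good_naive(event):
    """Status-only definition: any 2xx counts as good."""
    return 200 <= event.status < 300


def is_good_strict(event, latency_budget_s=2.0):
    """2xx AND within the latency budget AND no duplicate charge."""
    return (
        is_good_naive(event)
        and event.latency_s <= latency_budget_s
        and not event.duplicate_charge
    )


def sli(events, is_good):
    """SLI = good events / total events."""
    return sum(1 for e in events if is_good(e)) / len(events)


events = [
    CheckoutEvent(200, 0.4, False),   # genuinely good
    CheckoutEvent(200, 30.0, False),  # 30-second "successful" checkout
    CheckoutEvent(200, 0.5, True),    # fast, but charged twice
    CheckoutEvent(500, 0.2, False),   # plain failure
]

naive_sli = sli(events, is_good_naive)
strict_sli = sli(events, is_good_strict)
print(f"naive SLI:  {naive_sli:.2f}")
print(f"strict SLI: {strict_sli:.2f}")
```

On this sample the status-only SLI reports 0.75 while the stricter definition reports 0.25. The slow and double-charged checkouts are invisible to the first and counted as bad by the second.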

Quiz

Six months after rolling SLOs to 80 services with correctly generated MWMBR alerts and a signed error budget policy, half the teams ignore the pages and budgets go negative without freezes. What is the most likely root cause?

Recap

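The unit's through-line is one chain of arithmetic made organisational. SLI = good/total, and "good" must track user pain, not just 2xx. The SLO is the target; error budget = (1 − SLO) × events; burn rate = observed error ratio / (1 − SLO), so a burn of 1x spends the budget exactly over the window.

Alerting derives from burn threshold = (budget_fraction × period) / window, where budget_fraction is the share of the budget you tolerate losing inside the alert window. The rule fires only when a long window AND a short window both exceed the threshold, which makes it noise-resistant and fast to reset. The error-ratio threshold is burn × (1 − SLO), so rebase the (1 − SLO) factor whenever the SLO target changes.

Journeys multiply (0.999^5 ≈ 0.995, about 99.5%), iceberg SLIs hide latency and correctness failures behind 2xx, the SLO must sit tighter than the SLA, and the framework only delivers value when a signed, enforced policy turns the number into a decision.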
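
As a worked instance of the threshold formula, the lines below reproduce the 14.4x figure used in the quiz questions. The inputs are assumptions that this lesson does not state: a 30-day (720 h) SLO period and a tolerance of 2% of the budget spent within a 1 h window. With a 28-day (672 h) period the same formula gives 13.44 instead.

```latex
\begin{align*}
\text{burn threshold}
  &= \frac{\text{budget\_fraction} \times \text{period}}{\text{window}}
   = \frac{0.02 \times 720\,\text{h}}{1\,\text{h}} = 14.4 \\
\text{error-ratio threshold}
  &= \text{burn threshold} \times (1 - \text{SLO})
   = 14.4 \times 0.001 = 0.0144 \quad (\text{SLO} = 99.9\%)
\end{align*}
```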
