Observability
SLO and error budgets: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the SLO arithmetic stick when you are deriving it live in a postmortem.
Reconstruct the unit’s spine without looking back at the lessons: what makes a good SLI, the budget and burn-rate arithmetic, why multi-window, multi-burn-rate (MWMBR) alerting uses AND, the composite ceiling, and the organisational policy.
- 01 Define SLI, SLO, and error budget, and state the budget and burn-rate formulas with one worked number.
- 02 Derive the 14.4x / 6x / 1x burn-rate thresholds from first principles.
- 03 Why does MWMBR alerting use AND between a long and a short window, and what fails with single-window or OR?
- 04 Explain the composite-SLO ceiling and two architectural levers that raise it.
- 05 What must an error budget policy contain, why do the signatures matter, and how does it differ from an SLA?
- 06 Why do low-traffic services break naive SLO arithmetic, and what are the four standard remedies?
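Once you have attempted prompts 01 to 03 from memory, the arithmetic can be checked with a minimal sketch like the one below. The function names are hypothetical, and the 30-day period and the 2% / 1 h, 5% / 6 h, 10% / 3 d tiers are assumptions: they are the commonly cited values that reproduce 14.4x / 6x / 1x, and they are not stated in this review.

```python
def error_budget(slo: float, total_events: int) -> float:
    """Allowed bad events over the SLO period: (1 - SLO) * events."""
    return (1.0 - slo) * total_events


def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO)."""
    if total_events <= 0:
        raise ValueError("burn rate is undefined without traffic")
    return (bad_events / total_events) / (1.0 - slo)


def burn_rate_threshold(budget_fraction: float, period_hours: float,
                        window_hours: float) -> float:
    """Burn rate that spends budget_fraction of the budget in one window."""
    return budget_fraction * period_hours / window_hours


def should_alert(long_burn: float, short_burn: float, threshold: float) -> bool:
    """MWMBR condition: the long AND the short window must both exceed the threshold."""
    return long_burn >= threshold and short_burn >= threshold


# Assumed values, not taken from this review: 30-day period and three tiers
# given as (fraction of budget spent, long window in hours).
PERIOD_HOURS = 30 * 24
TIERS = [(0.02, 1), (0.05, 6), (0.10, 72)]

THRESHOLDS = [round(burn_rate_threshold(f, PERIOD_HOURS, w), 2) for f, w in TIERS]
print(THRESHOLDS)  # [14.4, 6.0, 1.0]

# Worked number: a 99.9% SLO over 1,000,000 events allows about 1,000 bad events.
print(round(error_budget(0.999, 1_000_000)))  # 1000
```

The AND in `should_alert` is the point of prompt 03: the long window shows that a meaningful share of the budget has been spent, and the short window shows that the burn is still happening, so an alert needs both to be true.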
If you could reconstruct each answer from memory, you hold the unit’s spine. An SLI is a user-facing good/total ratio; the budget is (1 − SLO) × events, and burn rate normalises the spend. Every MWMBR threshold derives from (budget_fraction × period) / window, with AND binding a long and a short window. Journeys multiply to a 0.999^N ceiling for N serial dependencies at 99.9% each, which retries and hedging can raise. The error budget policy needs signatures and exclusions to have teeth, and the SLO must sit tighter than the SLA. Low-traffic surfaces need probes, aggregation, longer windows, or count-based alerts, plus a NaN guard.
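The composite ceiling and the NaN guard from this summary can be illustrated with a rough sketch. All function names are hypothetical, and the retry model assumes independent failures between attempts, which real systems often violate (correlated outages, retry storms), so treat the numbers as an upper bound and not a prediction.

```python
from typing import Optional


def serial_ceiling(per_hop_availability: float, hops: int) -> float:
    """Best-case availability of a journey that needs every hop to succeed."""
    return per_hop_availability ** hops


def with_retries(availability: float, attempts: int) -> float:
    """Per-hop availability when any one of `attempts` tries may succeed.

    Assumption: failures are independent across attempts.
    """
    return 1.0 - (1.0 - availability) ** attempts


def sli_ratio(good: int, total: int) -> Optional[float]:
    """good/total SLI with a guard: no traffic yields None, not NaN or an error."""
    if total == 0:
        return None
    return good / total


# Ten serial hops at 99.9% each cap the journey at roughly 99.0%.
plain = serial_ceiling(0.999, 10)

# One retry per hop (two attempts) lifts each hop, and therefore the ceiling.
retried = serial_ceiling(with_retries(0.999, 2), 10)

print(f"{plain:.4f} -> {retried:.6f}")
print(sli_ratio(0, 0))  # None
```

Returning `None` for an empty window is one possible design choice: it forces the caller to decide whether no traffic counts as "no data" or as "healthy", so a 0/0 division cannot silently feed an alert rule.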