Observability OBS · 05 · 10

SLO and error budgets: instrument a journey end to end

Hands-on project: define a journey SLI, generate multi-window, multi-burn-rate (MWMBR) alerts with a platform, prove them with a fire drill, and write a signed error budget policy for a multi-service journey.

OBS Senior ◷ 240 min

Reading about SLOs is not the same as being woken at 2 AM by an alert you can trust. Take a small multi-service journey, define an SLI that tracks real user pain, generate MWMBR alerts, prove they fire and clear with a fire drill, and write the policy that turns the budget into a decision.

Goal

Turn the unit’s mental model into a working SLO stack: a journey-level SLI, a platform-generated MWMBR alert set with correct burn thresholds, a verified fire drill, the composite-ceiling math for the journey, and a signed error budget policy — every step backed by evidence.

Project

Objective

Instrument a small multi-service user journey (your own, or a 3-4 service starter such as gateway, order, payment, db) with a journey-level SLO, MWMBR burn-rate alerts generated by a platform, and a signed error budget policy — then prove the alerts fire fast and reset within 5 minutes with a deliberate fire drill.
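To make the burn-rate thresholds concrete, here is a minimal Python sketch of the arithmetic a rule-generating platform roughly performs: the error-ratio threshold for each tier is the burn-rate factor multiplied by the budget rate (1 minus the SLO target). The tier table is an assumption: the 6x factor for 6h+30m matches this lesson, while the 14.4x factor for 1h+5m comes from the commonly cited scheme for a 30-day window, and a platform may derive slightly different factors for a 28-day window. The names `TIERS` and `error_ratio_thresholds` are hypothetical.

```python
from typing import Dict

# Assumed tiers: (long window, short window, burn-rate factor).
# 14.4x is an assumption from the commonly cited 30-day scheme; 6x matches the lesson.
TIERS = [
    ("1h", "5m", 14.4),
    ("6h", "30m", 6.0),
]


def error_ratio_thresholds(slo_target: float) -> Dict[str, float]:
    """Return the error-ratio threshold per tier, rebased to the SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget_rate = 1.0 - slo_target  # allowed fraction of bad events
    return {
        f"{long_w}+{short_w}": factor * budget_rate
        for long_w, short_w, factor in TIERS
    }


if __name__ == "__main__":
    for target in (0.999, 0.995):
        print(target, error_ratio_thresholds(target))
```

The point of the sketch is that the threshold moves with the target: a 99.5% SLO has five times the budget rate of a 99.9% SLO, so an alert expression that still compares against a multiple of 0.001 would page far too early.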

Requirements

Acceptance criteria

A documented SLI spec showing each indicator, its query, the bucket boundary at the latency threshold, and the worst-of join, with a one-line description of the bad user outcome each indicator catches.
The generated recording rules and MWMBR alerts checked into the repo, with the budget rate visibly rebased to the chosen SLO target (not a hard-coded 0.001).
A fire-drill timeline (with timestamps) proving the page fired within minutes of the fault and cleared within ~5 minutes of the fix, measured from Prometheus/Alertmanager rather than estimated.
The composite-ceiling calculation for the journey and a one-paragraph argument for which layer is the authoritative SLO.
The signed error budget policy document with all five mandatory sections and the exclusion list.
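The composite-ceiling calculation in the criteria above can be sketched as a product of per-hop success rates. This is a minimal sketch that assumes the hops are called in series, every hop must succeed, and failures are independent. The availability figures are hypothetical placeholders, not measurements, and `composite_ceiling` is an illustrative name.

```python
import math
from typing import Dict


def composite_ceiling(hop_availability: Dict[str, float]) -> float:
    """Best-case journey success rate when every hop must succeed in series.

    Assumes hop failures are independent of each other.
    """
    return math.prod(hop_availability.values())


# Hypothetical per-hop availabilities for the starter journey.
hops = {"gateway": 0.9995, "order": 0.999, "payment": 0.999, "db": 0.9999}
ceiling = composite_ceiling(hops)

if __name__ == "__main__":
    print(f"journey ceiling: {ceiling:.4%}")
```

With these placeholder numbers the ceiling lands near 99.74%, below every individual hop, which is why a journey target cannot simply be copied from the best service's SLO.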

Senior stretch

Add a second severity tier and demonstrate the 6h+30m page catches a sustained moderate burn (6x) that the 1h+5m page would miss, with a slow-burn fire drill.
Raise the journey ceiling with one architectural lever — idempotent retries with an idempotency key, or parallel hedging on the worst hop — and show the before/after journey success rate and the latency cost.
Build an SLO meta-dashboard with the three self-observability signals: NaN/zero-denominator detection, 3d burn-rate stationarity (target ~1x), and budget-negative events vs freeze activations.
Define the customer SLA looser than the internal SLO by 0.05 to 0.5 percentage points, justify the buffer size from your mean time to detect-and-fix, and show on the burn history that the internal SLO trips before the SLA would.
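For the two-tier stretch item, the following Python sketch shows why a sustained moderate burn slips under the fast page but trips the slow one, and why the short window lets an alert clear soon after a fix. It assumes the usual multi-window condition (both the long and the short window must exceed the tier's burn threshold) and a 14.4x threshold for the 1h+5m tier, which is an assumption not stated in this lesson. The burn values are hypothetical drill readings.

```python
def page_fires(long_burn: float, short_burn: float, threshold: float) -> bool:
    """Multi-window condition: both windows must exceed the burn threshold."""
    return long_burn > threshold and short_burn > threshold


FAST_THRESHOLD = 14.4  # assumed factor for the 1h+5m tier
SLOW_THRESHOLD = 6.0   # factor for the 6h+30m tier, as in the lesson

# Hypothetical sustained moderate burn, slightly above 6x in every window.
sustained = 6.5
fast_page = page_fires(sustained, sustained, FAST_THRESHOLD)
slow_page = page_fires(sustained, sustained, SLOW_THRESHOLD)

# After the fix: the long window still remembers the burn, the short one recovers.
after_fix = page_fires(long_burn=6.5, short_burn=0.2, threshold=SLOW_THRESHOLD)

if __name__ == "__main__":
    print(f"fast page: {fast_page}, slow page: {slow_page}, after fix: {after_fix}")
```

In a real drill the burn values would come from the generated recording rules rather than constants, but the same comparison explains both the missed fast page and the quick reset.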

Recap

This is the loop you run when you bring SLOs to a real service: define the SLI from user-facing bad outcomes (availability, latency at an exact bucket, correctness), set a conservative target on a 28-day window, generate MWMBR alerts from a platform so the budget rate is correctly rebased, prove the alert fires fast and resets within 5 minutes with a fire drill, guard against NaN and low traffic, and sign an error budget policy that turns the burn into an organisational decision. Doing it once on a small journey makes the production rollout — and the quarterly review that keeps the SLO honest — muscle memory.
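The NaN and low-traffic guard mentioned in the recap can be illustrated with a small Python sketch. It assumes the SLI is a ratio of bad events to total events over a window and treats a window with too little traffic as "no verdict" rather than as 0% or 100% errors. The function name `guarded_error_ratio` and the default `min_events` of 100 are hypothetical choices for illustration.

```python
from typing import Optional


def guarded_error_ratio(
    bad: float, total: float, min_events: float = 100.0
) -> Optional[float]:
    """Return bad/total, or None when traffic is too low to judge.

    A zero denominator would otherwise yield NaN (or a division error),
    and a handful of requests can swing the ratio wildly.
    """
    if total <= 0 or total < min_events:
        return None
    return bad / total


if __name__ == "__main__":
    print(guarded_error_ratio(0, 0))      # no traffic: no verdict
    print(guarded_error_ratio(5, 50))     # below the traffic floor: no verdict
    print(guarded_error_ratio(5, 1000))   # enough traffic: a usable ratio
```

An alerting layer can then decide explicitly what "no verdict" means, for example suppressing the burn-rate page while surfacing the gap on the SLO meta-dashboard.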
