Observability
SLO and error budgets: instrument a journey end to end
Reading about SLOs is not the same as being woken at 2 AM by an alert you can trust. Take a small multi-service journey, define an SLI that tracks real user pain, generate multi-window, multi-burn-rate (MWMBR) alerts, prove with a fire drill that they fire and clear, and write the policy that turns the budget into a decision.
Turn the unit’s mental model into a working SLO stack: a journey-level SLI, a platform-generated MWMBR alert set with correct burn thresholds, a verified fire drill, the composite-ceiling math for the journey, and a signed error budget policy — every step backed by evidence.
Instrument a small multi-service user journey (your own, or a 3-4 service starter such as gateway, order, payment, db) with a journey-level SLO, MWMBR burn-rate alerts generated by a platform, and a signed error budget policy. Then run a deliberate fire drill to prove the alerts fire fast and reset within 5 minutes of the fix.
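As a rough sketch of the arithmetic behind those burn-rate alerts (not the output of any particular platform): a burn threshold is the share of the period's budget you are willing to spend over the long window, scaled by period length over window length, and the budget itself is 1 minus the SLO target. The budget shares below (2% in 1 hour, 5% in 6 hours) are an assumption borrowed from the common multi-window convention, and the names `BurnWindow`, `burn_threshold`, `error_budget`, and `should_page` are hypothetical.

```python
from dataclasses import dataclass

SLO_PERIOD_HOURS = 28 * 24  # 28-day SLO window


@dataclass(frozen=True)
class BurnWindow:
    long_hours: float       # long lookback window
    short_hours: float      # short window that lets the alert reset quickly
    budget_fraction: float  # assumed share of the period's budget spent over the long window


def burn_threshold(w: BurnWindow, period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `long_hours`."""
    return w.budget_fraction * period_hours / w.long_hours


def error_budget(slo_target: float) -> float:
    """Allowed error ratio, rebased to the chosen target rather than hard-coded."""
    return 1.0 - slo_target


def should_page(long_err: float, short_err: float, slo_target: float, w: BurnWindow) -> bool:
    """Fire only when both the long and the short window exceed the limit."""
    limit = burn_threshold(w) * error_budget(slo_target)
    return long_err > limit and short_err > limit


# Assumed budget shares: 2% of the budget in 1 hour, 5% in 6 hours.
FAST = BurnWindow(long_hours=1, short_hours=5 / 60, budget_fraction=0.02)
SLOW = BurnWindow(long_hours=6, short_hours=0.5, budget_fraction=0.05)

if __name__ == "__main__":
    for name, window in (("1h+5m", FAST), ("6h+30m", SLOW)):
        print(f"{name}: burn threshold {burn_threshold(window):.2f}x")
```

On a 28-day period these shares work out to roughly 13.44x and 5.6x; the familiar 14.4x and 6x figures are the same shares on a 30-day period. With a 99.5% target, a sustained 6x burn (3% errors) stays under the fast page's limit but trips the slow one, which is why the short window alone is not enough.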
- A documented SLI spec showing each indicator, its query, the bucket boundary at the latency threshold, and the worst-of join, plus a one-line description of the bad user outcome each indicator catches.
- The generated recording rules and MWMBR alerts checked into the repo, with the budget rate visibly rebased to the chosen SLO target (1 minus the target, not a hard-coded 0.001).
- A timestamped fire-drill timeline proving the page fired within minutes of the fault and cleared within ~5 minutes of the fix, measured from Prometheus/Alertmanager, not estimated.
- The composite-ceiling calculation for the journey and a one-paragraph argument for which layer is the authoritative SLO.
- The signed error budget policy document with all five mandatory sections and the exclusion list.
- Add a second severity tier and demonstrate, with a slow-burn fire drill, that the 6h+30m page catches a sustained moderate burn (6x) that the 1h+5m page would miss.
- Raise the journey ceiling with one architectural lever — idempotent retries with an idempotency key, or parallel hedging on the worst hop — and show the before/after journey success rate and the latency cost.
- Build an SLO meta-dashboard with the three self-observability signals: NaN/zero-denominator detection, 3d burn-rate stationarity (target ~1x), and budget-negative events vs freeze activations.
- Define the customer SLA looser than the internal SLO by 0.05 to 0.5 percentage points, justify the buffer size from your mean time to detect and fix, and show on the burn history that the internal SLO trips before the SLA would.
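The composite-ceiling calculation and the retry lever from the lists above can be sketched as follows. This is a minimal model under stated assumptions: every hop is a serial hard dependency, failures are independent between hops and between attempts, and the per-hop availabilities are made-up illustrative numbers, not measurements. The function names are hypothetical.

```python
from math import prod
from typing import Iterable


def composite_ceiling(hop_availabilities: Iterable[float]) -> float:
    """Best-case journey success when every hop is a serial hard dependency."""
    return prod(hop_availabilities)


def with_retries(availability: float, attempts: int) -> float:
    """Per-hop success after up to `attempts` tries, assuming independent failures."""
    return 1.0 - (1.0 - availability) ** attempts


# Hypothetical per-hop availabilities for the starter journey.
hops = {"gateway": 0.9995, "order": 0.999, "payment": 0.998, "db": 0.9995}

before = composite_ceiling(hops.values())

# Lever: one idempotent retry (two attempts in total) on the worst hop.
worst = min(hops, key=lambda name: hops[name])
after_hops = {**hops, worst: with_retries(hops[worst], attempts=2)}
after = composite_ceiling(after_hops.values())

if __name__ == "__main__":
    print(f"worst hop: {worst}")
    print(f"journey ceiling before: {before:.6f}")
    print(f"journey ceiling after:  {after:.6f}")
```

The product is a ceiling, not a forecast: the journey can never be more available than this under the serial assumption, so a journey SLO set above it cannot be met. The sketch does not model the latency cost of the retry, which still has to be measured on the real journey, and correlated failures (a shared db outage, for example) will make retries less effective than the independence assumption suggests.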
This is the loop you run when you bring SLOs to a real service: define the SLI from user-facing bad outcomes (availability, latency at an exact bucket, correctness); set a conservative target on a 28-day window; generate MWMBR alerts from a platform so the budget rate is correctly rebased; run a fire drill to prove the alert fires fast and resets within 5 minutes; guard against NaN and low traffic; and sign an error budget policy that turns the burn into an organisational decision. Doing it once on a small journey makes the production rollout, and the quarterly review that keeps the SLO honest, muscle memory.
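The NaN and low-traffic guard is easy to skip, so here is a minimal sketch of the idea. In Prometheus a ratio with a zero denominator evaluates to NaN, and a comparison against NaN never matches, so an unguarded alert goes quiet exactly when traffic disappears. The `MIN_REQUESTS` floor and the function names are hypothetical; pick the floor from your own traffic profile.

```python
import math
from typing import Optional

MIN_REQUESTS = 100  # hypothetical floor below which a window is too thin to judge


def error_ratio(bad: float, total: float, min_requests: int = MIN_REQUESTS) -> Optional[float]:
    """Return bad/total, or None when the window has no usable signal."""
    if math.isnan(bad) or math.isnan(total) or total < min_requests:
        return None
    return bad / total


def burn_rate(bad: float, total: float, slo_target: float) -> Optional[float]:
    """Error ratio divided by the budget (1 - target); None means 'no data'."""
    ratio = error_ratio(bad, total)
    if ratio is None:
        return None
    return ratio / (1.0 - slo_target)


if __name__ == "__main__":
    print(burn_rate(bad=5, total=1000, slo_target=0.999))
    print(burn_rate(bad=0, total=0, slo_target=0.999))
```

Returning `None` instead of a number keeps "no data" distinct from "no errors", so a dashboard or meta-alert can surface the gap rather than report a healthy zero.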