
Observability

SLO and error budgets: instrument a journey end to end

Crux: Hands-on project. Define a journey SLI, generate multi-window, multi-burn-rate (MWMBR) alerts with a platform, prove them with a fire drill, and write a signed error budget policy for a multi-service journey.
◷ 240 min

Reading about SLOs is not the same as being woken at 2 AM by an alert you can trust. Take a small multi-service journey, define an SLI that tracks real user pain, generate MWMBR alerts, prove they fire and clear with a fire drill, and write the policy that turns the budget into a decision.

Goal

Turn the unit’s mental model into a working SLO stack: a journey-level SLI, a platform-generated MWMBR alert set with correct burn thresholds, a verified fire drill, the composite-ceiling math for the journey, and a signed error budget policy — every step backed by evidence.
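The "correct burn thresholds" in this goal are mostly arithmetic. Below is a minimal sketch, assuming the common convention that each page tier is defined by the fraction of the budget it may consume within its long window. The function names, the 99.5% target, and the 2%-in-1h and 5%-in-6h tiers are illustrative assumptions, not values prescribed by this unit.

```python
def budget_rate(slo_target: float) -> float:
    """Allowed bad-event ratio: the error budget as a fraction of all events."""
    return 1.0 - slo_target


def burn_threshold(budget_fraction: float, long_window_hours: float,
                   slo_window_days: float) -> float:
    """Burn-rate multiple that consumes `budget_fraction` of the budget
    within `long_window_hours`, for an SLO measured over `slo_window_days`."""
    return budget_fraction * (slo_window_days * 24.0) / long_window_hours


# Hypothetical journey target; the budget rate is derived from it, never typed in.
target = 0.995
rate = budget_rate(target)

for fraction, hours in [(0.02, 1.0), (0.05, 6.0)]:
    burn = burn_threshold(fraction, hours, slo_window_days=28)
    # The alert expression compares the measured error ratio against burn * rate.
    print(f"{hours:g}h tier: burn above {burn:.2f}x, error ratio above {burn * rate:.4f}")
```

Under this convention a 30-day window gives 14.4x and 6x for the two tiers, while a 28-day window gives 13.44x and 5.6x. That difference is one reason to let the platform derive the thresholds from the SLO definition instead of copying constants between rule files.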

Project
Objective

Instrument a small multi-service user journey (your own, or a 3-4 service starter such as gateway, order, payment, db) with a journey-level SLO, MWMBR burn-rate alerts generated by a platform, and a signed error budget policy — then prove the alerts fire fast and reset within 5 minutes with a deliberate fire drill.
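Why the reset can be held to about 5 minutes follows from the shape of an MWMBR condition: the page needs both the long and the short window above the threshold, so once the fix lands the short window drains and the alert clears even though the long window is still elevated. The sketch below models that logic in plain Python; the `Tier` type, the function names, the 99.9% target, and the 14.4x threshold are assumptions for illustration, and the real condition lives in the rules your platform generates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    long_window: str
    short_window: str
    burn_threshold: float


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_ratio / (1.0 - slo_target)


def should_page(tier: Tier, long_error_ratio: float, short_error_ratio: float,
                slo_target: float) -> bool:
    """Fire only when BOTH windows burn above the tier threshold."""
    return (burn_rate(long_error_ratio, slo_target) > tier.burn_threshold
            and burn_rate(short_error_ratio, slo_target) > tier.burn_threshold)


# Hypothetical fast-burn tier and target.
fast = Tier(name="page-fast", long_window="1h", short_window="5m", burn_threshold=14.4)
slo_target = 0.999

# Mid-incident: both windows are burning hard, so the page fires.
print(should_page(fast, long_error_ratio=0.02, short_error_ratio=0.03, slo_target=slo_target))

# Five minutes after the fix: the 1h window is still hot, the 5m window is clean.
print(should_page(fast, long_error_ratio=0.018, short_error_ratio=0.0, slo_target=slo_target))
```

In the fire drill, the timestamps you record should show this same sequence: the alert firing once both windows cross the threshold, and resolving once the short window no longer does.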

Requirements
Acceptance criteria
  • A documented SLI spec showing each indicator, its query, the histogram bucket boundary at the latency threshold, and the worst-of join, with a one-line description of the bad user outcome each indicator catches.
  • The generated recording rules and MWMBR alerts checked into the repo, with the error budget rate (1 − target) visibly derived from the chosen SLO target, not a hard-coded 0.001 (which is only correct for a 99.9% target).
  • A fire-drill timeline (timestamps) proving the page fired within minutes and cleared within ~5 minutes of the fix, measured from Prometheus/Alertmanager, not estimated.
  • The composite-ceiling calculation for the journey and a one-paragraph argument for which layer is the authoritative SLO.
  • The signed error budget policy document with all five mandatory sections and the exclusion list.
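For the composite-ceiling criterion, a small calculation is usually enough. The sketch below assumes the hops are hard serial dependencies with independent failures, so the journey cannot succeed more often than the product of the per-hop success rates. The per-hop numbers are hypothetical placeholders for the starter services; replace them with your measured values.

```python
from math import prod


def composite_ceiling(hop_success: dict[str, float]) -> float:
    """Upper bound on journey success when every hop must succeed
    and hop failures are independent (an assumption, not a guarantee)."""
    return prod(hop_success.values())


def weakest_hop(hop_success: dict[str, float]) -> str:
    """Name of the hop with the lowest success rate."""
    return min(hop_success, key=hop_success.get)


# Hypothetical measured success rates per hop.
hops = {"gateway": 0.9995, "order": 0.999, "payment": 0.998, "db": 0.9999}

ceiling = composite_ceiling(hops)
print(f"journey ceiling: {ceiling:.4%}")
print(f"weakest hop: {weakest_hop(hops)}")
```

With these placeholder numbers the ceiling is roughly 99.64%, below every individual hop. A journey target above the ceiling is unreachable without an architectural change, which is the kind of evidence the one-paragraph argument about the authoritative layer can draw on.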
Senior stretch
  • Add a second severity tier and demonstrate that the 6h+30m page catches a sustained moderate burn (around 6x) that the 1h+5m page would miss, using a slow-burn fire drill.
  • Raise the journey ceiling with one architectural lever — idempotent retries with an idempotency key, or parallel hedging on the worst hop — and show the before/after journey success rate and the latency cost.
  • Build an SLO meta-dashboard with the three self-observability signals: NaN/zero-denominator detection, 3d burn-rate stationarity (target ~1x), and budget-negative events vs freeze activations.
  • Define the customer SLA looser than the internal SLO by 0.05 to 0.5 percentage points, justify the buffer size from your mean time to detect and fix, and show on the burn history that the internal SLO trips before the SLA would.
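For the retry lever, it helps to estimate the gain before building it. The sketch below assumes each attempt on the hop succeeds independently with the same probability, which is optimistic because real failures are often correlated, and it assumes the retry is safe because the call carries an idempotency key. The function names and the 99.5% hop are hypothetical.

```python
def success_with_retries(p_success: float, retries: int) -> float:
    """Success probability of one hop when a failed attempt is retried
    up to `retries` times, assuming independent attempts."""
    return 1.0 - (1.0 - p_success) ** (retries + 1)


def expected_attempts(p_success: float, retries: int) -> float:
    """Average number of attempts per request: a rough proxy for the
    extra load and latency the retries introduce."""
    p_fail = 1.0 - p_success
    return sum(p_fail ** k for k in range(retries + 1))


# Hypothetical worst hop on the journey.
worst_hop = 0.995

before = success_with_retries(worst_hop, retries=0)
after = success_with_retries(worst_hop, retries=1)
print(f"hop success before: {before:.4%}, after one retry: {after:.4%}")
print(f"expected attempts with one retry: {expected_attempts(worst_hop, 1):.3f}")
```

Treat the result as an upper bound to compare against what you measure. The before/after journey success rate and the latency cost in the stretch goal should come from the running system, not from this estimate.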
Recap

This is the loop you run when you bring SLOs to a real service: define the SLI from user-facing bad outcomes (availability, latency at an exact bucket, correctness), set a conservative target on a 28-day window, generate MWMBR alerts from a platform so the budget rate is correctly rebased, prove the alert fires fast and resets within 5 minutes with a fire drill, guard against NaN and low traffic, and sign an error budget policy that turns the burn into an organisational decision. Doing it once on a small journey makes the production rollout — and the quarterly review that keeps the SLO honest — muscle memory.
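The NaN and low-traffic guard mentioned in the recap can be stated in a few lines. This is a minimal sketch of the idea only: the function name and the `min_events` floor of 100 are assumptions, and in practice the guard belongs in the recording rules or alert expressions your platform generates.

```python
import math
from typing import Optional


def safe_error_ratio(bad: float, total: float,
                     min_events: float = 100.0) -> Optional[float]:
    """Return bad/total, or None when the window is not trustworthy:
    a NaN input, or fewer than `min_events` events in the denominator."""
    if math.isnan(bad) or math.isnan(total) or total < min_events:
        return None
    return bad / total


print(safe_error_ratio(5, 1000))           # enough traffic: a usable ratio
print(safe_error_ratio(0, 0))              # zero denominator: no signal, not "0% errors"
print(safe_error_ratio(1, 3))              # one failure out of three: too little traffic to page on
print(safe_error_ratio(float("nan"), 50))  # NaN from upstream: flag it on the meta-dashboard
```

Returning an explicit "no signal" value keeps the two failure modes apart: an empty window must not look like a perfect one, and a handful of requests must not burn the budget at hundreds of times the allowed rate.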

