
Observability

SLO platforms and the 90-day rollout

Crux: Hand-rolling multi-window, multi-burn-rate (MWMBR) PromQL per service is error-prone. Tools such as Sloth, Pyrra, Nobl9, and Datadog SLOs generate it from a declarative definition, and OpenSLO aims to standardise that definition. Adoption is the bottleneck: a structured 90-day rollout turns SLO-as-arithmetic into SLO-as-culture.

A team re-derives the same 14.4x burn-rate PromQL for the sixth time this month — different service, same math, different typo on line 4. The recording rule was wrong. The alert never fired during the incident. Nobody noticed until the postmortem.

Why hand-rolling is a trap

The canonical MWMBR setup for a single SLO requires:

  • 6 recording rules (ratio_rate per window: 5m, 30m, 1h, 6h, 3d, and a slow-burn variant)
  • 3 alert rules (14.4x page, 6x page, 1x ticket), each using AND between two windows
  • 2 dashboard panels (burn rate over time, budget remaining)

Hand-writing this per service across 80 services means 480 recording rules and 240 alert expressions. Every threshold of the form 14.4 * 0.001 (burn rate times error budget) is a chance for a mistake, such as the wrong SLO target or an off-by-one window. The error is silent. The recording rule still computes a value, the alert fires or not, and nobody knows the math was wrong until a real incident doesn't page.
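To make the threshold arithmetic concrete, here is a minimal Python sketch that derives the three alert thresholds from an SLO target and renders the paired-window expressions. The window pairs (1h/5m, 6h/30m, 3d/6h) are an assumption based on the commonly cited MWMBR table, and the recording-rule naming scheme `slo:error_ratio:rate<window>` and the function names are hypothetical, not the output of any real tool.

```python
# Burn-rate tiers: (burn multiplier, long window, short window, severity).
# The window pairing is an assumption taken from the commonly cited MWMBR table.
TIERS = [
    (14.4, "1h", "5m", "page"),
    (6.0, "6h", "30m", "page"),
    (1.0, "3d", "6h", "ticket"),
]


def error_budget(objective_percent: float) -> float:
    """Fraction of requests allowed to fail, e.g. 99.9 -> 0.001."""
    return (100.0 - objective_percent) / 100.0


def alert_expressions(service: str, objective_percent: float) -> list:
    """Render one alert per tier; both windows must exceed the threshold."""
    budget = error_budget(objective_percent)
    alerts = []
    for burn, long_w, short_w, severity in TIERS:
        # Threshold = burn rate x error budget, rounded to hide float noise.
        threshold = round(burn * budget, 6)
        # Hypothetical recording-rule names, for illustration only.
        long_rule = f'slo:error_ratio:rate{long_w}{{service="{service}"}}'
        short_rule = f'slo:error_ratio:rate{short_w}{{service="{service}"}}'
        expr = f"{long_rule} > {threshold} and {short_rule} > {threshold}"
        alerts.append({"severity": severity, "threshold": threshold, "expr": expr})
    return alerts


if __name__ == "__main__":
    for alert in alert_expressions("checkout", 99.9):
        print(alert["severity"], alert["expr"])
```

Every number in the output depends on a single input, the objective, which is why a typo in one hand-written threshold is so easy to miss.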

SLO platforms exist to solve this: you describe the SLI and target in a YAML declaration, and the platform generates the PromQL.

The major platforms

| Platform | Type | Output | Key trait |
| --- | --- | --- | --- |
| Sloth | Open source, CLI | Prometheus rules + Grafana dashboards | YAML → recording rules + MWMBR alerts; GitOps-friendly |
| Pyrra | Open source, Kubernetes operator | PrometheusRule CRDs | Kubernetes-native; reconciles SLO objects into PrometheusRule resources |
| OpenSLO | Open spec (CNCF sandbox) | Cross-tool YAML definition | Vendor-neutral spec; Nobl9 founded it; Grafana/Datadog/Honeycomb support it |
| Nobl9 | Commercial SaaS | Multi-datasource SLOs | Connects Datadog, Prometheus, Dynatrace; cross-source SLOs; ships OpenSLO |
| Datadog SLOs | Managed (Datadog-native) | Datadog monitors + dashboards | Tight APM integration; burn-rate alerts built in; calendar and rolling window both supported |

A minimal Sloth declaration looks like:

version: "prometheus/v1"
service: checkout
slos:
  - name: availability
    description: "Checkout requests succeed"
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5..",job="checkout"}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="checkout"}[{{.window}}]))
    objective: 99.9
    alerting:
      name: CheckoutAvailability
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket

Sloth reads {{.window}} as a template variable and emits all six recording rules with correct window values. The burn rates (14.4x / 6x / 1x) and the AND logic between windows are injected automatically. You declare the SLI expression once; the generator keeps the window and threshold math consistent, though the SLI queries themselves are still yours to get right.
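The expansion step can be pictured with a short Python sketch. This is not Sloth's implementation, only an illustration of placeholder substitution under simple assumptions: it uses the five windows named earlier in this lesson, and the output rule names are hypothetical.

```python
# Windows to expand the SLI into (the five named earlier in this lesson).
WINDOWS = ["5m", "30m", "1h", "6h", "3d"]

# SLI queries copied from the declaration above.
ERROR_QUERY = 'sum(rate(http_requests_total{status=~"5..",job="checkout"}[{{.window}}]))'
TOTAL_QUERY = 'sum(rate(http_requests_total{job="checkout"}[{{.window}}]))'


def expand(query: str, window: str) -> str:
    """Substitute the {{.window}} placeholder with a concrete range."""
    return query.replace("{{.window}}", window)


def recording_rules(error_query: str, total_query: str) -> list:
    """Build one error-ratio recording rule per window."""
    rules = []
    for window in WINDOWS:
        expr = f"({expand(error_query, window)}) / ({expand(total_query, window)})"
        # Hypothetical rule name; real generators use their own scheme.
        rules.append({"record": f"slo:error_ratio:rate{window}", "expr": expr})
    return rules


if __name__ == "__main__":
    for rule in recording_rules(ERROR_QUERY, TOTAL_QUERY):
        print(rule["record"], "=", rule["expr"])
```

Because the window value is written in exactly one place, a mismatch between the rule name and the range inside the query cannot occur, which is the class of typo the opening anecdote describes.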

Why this works

OpenSLO is the emerging cross-tool standard — the same YAML drives Sloth, Nobl9, and Grafana SLO configurations. If you standardise on OpenSLO, you can migrate from Prometheus-based alerting to Datadog without rewriting the SLO definitions. Write once, run anywhere is the bet OpenSLO is making. For teams already on the Prometheus stack, Sloth or Pyrra are the pragmatic choice.

The 90-day SLO rollout pattern

SLO adoption fails at the cultural layer more often than the technical one. A common rollout pattern:

Weeks 1–2 — single journey, single team: Pick the most important user journey (checkout, login, the API endpoint customers pay for). Define the SLI at the API gateway level. Set a conservative target (start with 99% even if you think you can do 99.9% — you’ll learn the baseline quickly). Instrument counters, add the Sloth declaration, verify the recording rules appear in Prometheus.

Weeks 3–4 — verify SLO and budget: Watch the SLO for two weeks. Is the baseline error rate what you expected? Does the SLI correlate with real user reports? If the baseline is much better than the target, tighten the target. If it is worse, loosen the target, because an SLO that is too tight keeps the team in a constant freeze. Draft the error budget policy but don't sign it yet.
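When weighing a target during this observation period, it helps to translate the objective into a concrete budget. The sketch below is a rough illustration that assumes a 30-day SLO window (the lesson does not fix one) and uses hypothetical function names; the request counts in the usage example are made up.

```python
def allowed_bad_minutes(objective_percent: float, window_days: int = 30) -> float:
    """Minutes of total failure the budget tolerates over the window.

    The 30-day default is an assumption, not something the lesson specifies.
    """
    budget = (100.0 - objective_percent) / 100.0
    return budget * window_days * 24 * 60


def budget_remaining(total: int, errors: int, objective_percent: float) -> float:
    """Fraction of the error budget left; a negative value means overspent."""
    allowed_errors = total * (100.0 - objective_percent) / 100.0
    if allowed_errors <= 0:
        raise ValueError("objective leaves no error budget to spend")
    return 1.0 - errors / allowed_errors


if __name__ == "__main__":
    for objective in (99.0, 99.9):
        print(objective, round(allowed_bad_minutes(objective), 1), "minutes")
    # Illustrative counts only: 1,000,000 requests, 250 of them failed.
    print(round(budget_remaining(1_000_000, 250, 99.9), 3))
```

Under the 30-day assumption, 99% allows roughly 7.2 hours of full outage while 99.9% allows about 43 minutes, which shows how much slack the conservative starting target buys while the baseline is still unknown.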

Weeks 5–8 — alerting and policy sign-off: Enable MWMBR alerts. Add the SLO dashboard. Run a fire drill: deliberately break the service for 5 minutes and verify the 14.4x alert fires within a few minutes, pages the on-call, and clears within 5 minutes of fix. Sign the error budget policy with director-level approval. Communicate the policy to all stakeholders.
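Before scheduling the fire drill, it can be worth estimating whether the planned outage is long enough to trip the alert at all. The following sketch uses a deliberately simple model: steady traffic, no errors before the drill, a constant error rate during it, and a 1h/5m window pair for the 14.4x tier (an assumption). It ignores rule evaluation intervals and any `for:` delay, and the function names are hypothetical.

```python
from typing import Optional


def seconds_to_cross(burn: float, objective_percent: float,
                     window_seconds: int, error_rate: float = 1.0) -> Optional[float]:
    """Seconds of outage until the window's error ratio reaches the threshold.

    With no prior errors, the ratio after t seconds is error_rate * t / window.
    Returns None if the threshold can never be reached at this error rate.
    """
    threshold = burn * (100.0 - objective_percent) / 100.0
    if error_rate <= 0 or threshold > error_rate:
        return None
    return threshold * window_seconds / error_rate


def drill_fires(outage_seconds: float, burn: float, objective_percent: float,
                long_window: int = 3600, short_window: int = 300,
                error_rate: float = 1.0) -> bool:
    """True if both windows cross the threshold before the outage ends."""
    times = [seconds_to_cross(burn, objective_percent, w, error_rate)
             for w in (long_window, short_window)]
    if any(t is None for t in times):
        return False
    return max(times) <= outage_seconds


if __name__ == "__main__":
    for objective in (99.9, 99.0):
        print(objective, seconds_to_cross(14.4, objective, 3600),
              drill_fires(300, 14.4, objective))
```

Under these assumptions a five-minute full outage crosses the 14.4x threshold after roughly 52 seconds at a 99.9% objective, but does not cross it at 99%, where the 1h window would need about 8.6 minutes of total failure. If the drill runs while the conservative starting target is still in place, a longer outage or a lower burn-rate tier may be the one to verify.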

Weeks 9–12 — rollout to 5–10 more services: Replicate the pattern across other services. Each team uses the same Sloth YAML template. The platform team owns the Sloth/Pyrra generator; product teams own their SLO targets. Hold a first quarterly SLO review: which SLOs fired alerts that mattered, which were noise?

After 90 days — culture check: Survey the on-call teams: are they acting on SLO alerts or ignoring them? Review: are feature freezes actually happening when budgets are exhausted? If not, re-run the policy communication. The 90-day mark is also when to look at whether SLOs are covering the right journeys — gaps are common on the first pass.

Why this works

Adoption is the bottleneck, not tooling. A team with Sloth configured but no error budget policy sign-off and no quarterly review ritual is in the same state as a team with a single slow-window alert: the numbers exist, but the culture doesn’t act on them. The 90-day rollout is the difference between “we have SLOs” and “SLOs govern how we make decisions.”

Order the steps

Order the 90-day SLO adoption steps from first to last:

  1. Pick a single high-value user journey; define the SLI at the API gateway level
  2. Set a conservative SLO target; instrument counters; verify recording rules in Prometheus
  3. Observe the baseline for 2 weeks; adjust the target to match real user tolerance
  4. Enable MWMBR alerts; run a fire drill; sign the error budget policy
  5. Roll out to 5–10 more services using the same Sloth/Pyrra template
  6. Hold the first quarterly SLO review: which alerts mattered, which were noise?
Quiz

What is the primary benefit of using Sloth or Pyrra instead of hand-writing PromQL for SLO alerts?

Quiz

A team sets up Sloth, gets MWMBR alerts working, and considers the SLO adoption complete. What critical step is missing?

Recall before you leave
  1. What does a tool like Sloth generate from a single SLO YAML declaration?
  2. Why does a 90-day SLO rollout start with a single journey at a conservative target rather than immediately applying 99.9% targets to all services?
Recap

Hand-rolling MWMBR PromQL across many services introduces silent math errors — wrong window lengths, wrong burn multipliers, wrong thresholds. SLO platforms (Sloth, Pyrra, OpenSLO, Nobl9, Datadog) solve this by generating recording rules, alert expressions, and dashboards from a declarative SLI definition. OpenSLO is the emerging cross-tool standard. The technical setup is the easy part: the bottleneck is adoption. A 90-day rollout starts with one user journey at a conservative target, observes the baseline for two weeks, runs an alert fire drill, signs the error budget policy, and only then expands to more services. Without the quarterly review ritual and policy sign-off, SLOs become dashboard numbers that no one acts on.


Trademarks belong to their respective owners. Editorial reference only.