SLO platforms and the 90-day rollout

Hand-rolling multi-window, multi-burn-rate (MWMBR) PromQL per service is error-prone. Tools such as Sloth, Pyrra, Nobl9, and Datadog SLOs generate the alerting from a declaration, and OpenSLO standardises the declaration format. Adoption is the bottleneck: a structured 90-day rollout turns SLO-as-arithmetic into SLO-as-culture.

A team re-derives the same 14.4x burn-rate PromQL for the sixth time this month — different service, same math, different typo on line 4. The recording rule was wrong. The alert never fired during the incident. Nobody noticed until the postmortem.

Why hand-rolling is a trap

Imagine multiplying one possible typo across 80 services, each typo silent until an incident exposes it. That is the hand-rolling problem. The canonical MWMBR setup for a single SLO requires:

6 recording rules (ratio_rate per window: 5m, 30m, 1h, 6h, 3d, and a slow-burn variant)
3 alert rules (14.4x page, 6x page, 1x ticket), each using AND between two windows
2 dashboard panels (burn rate over time, budget remaining)

Hand-writing this for 80 services means 480 recording rules and 240 alert expressions. Any threshold of the form 14.4 * 0.001 can be wrong (wrong SLO target, wrong burn multiplier, off-by-one window). The error is silent: the recording rule still computes a value, the alert still evaluates, and nobody knows the math was wrong until a real incident fails to page. With 11 artifacts per SLO, a single wrong expression is enough for that service to miss a page.
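The multiplier arithmetic above can be sketched in a few lines. This is a minimal illustration, not any platform's implementation: the long/short window pairing is assumed from the common MWMBR layout, and the `slo:error_ratio:rate<window>` rule names are hypothetical.

```python
# Sketch of the arithmetic behind each hand-written MWMBR alert threshold.
# Assumptions: window pairs follow the common MWMBR layout; the recording-rule
# names (slo:error_ratio:rate<window>) are hypothetical.

def error_budget(objective_percent: float) -> float:
    """Fraction of requests allowed to fail, e.g. 99.9 -> 0.001."""
    return 1.0 - objective_percent / 100.0


def burn_threshold(burn_rate: float, objective_percent: float) -> float:
    """Error ratio at which a window burns budget at `burn_rate` times the sustainable pace."""
    return burn_rate * error_budget(objective_percent)


# (burn rate, long window, short window, severity)
ALERTS = [
    (14.4, "1h", "5m", "page"),
    (6.0, "6h", "30m", "page"),
    (1.0, "3d", "6h", "ticket"),
]


def alert_expr(burn_rate: float, long_window: str, short_window: str,
               objective_percent: float) -> str:
    """Build the 'long window AND short window' alert condition as text."""
    threshold = burn_threshold(burn_rate, objective_percent)
    prefix = "slo:error_ratio:rate"  # hypothetical recording-rule prefix
    return (
        f"{prefix}{long_window} > {threshold:.5f} "
        f"and {prefix}{short_window} > {threshold:.5f}"
    )


if __name__ == "__main__":
    for burn_rate, long_w, short_w, severity in ALERTS:
        print(severity, alert_expr(burn_rate, long_w, short_w, 99.9))
```

The threshold depends on the target: at 99.9% the 14.4x threshold is 0.0144, at 99% it is 0.144. Copying the 0.144 value into a 99.9% service would delay the page until the burn rate reaches 144x, which is exactly the kind of silent error described above.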

Eleven hand-written artifacts per SLO — six recording rules, three alerts, two panels. Multiply by 80 services and any one of 880 expressions can carry a silent typo that only surfaces when a real incident fails to page.

SLO platforms exist to solve this: you describe the SLI and target in a YAML declaration, and the platform generates the PromQL.

The major platforms

Platform	Type	Output	Key trait
Sloth	Open source, CLI	Prometheus rules + Grafana dashboards	YAML → recording rules + MWMBR alerts; GitOps-friendly
Pyrra	Open source, Kubernetes operator	PrometheusRule CRDs	Kubernetes-native; reconciles SLO objects into PrometheusRule resources
OpenSLO	Open spec (CNCF sandbox)	Cross-tool YAML definition	Vendor-neutral spec; Nobl9 founded it; Grafana/Datadog/Honeycomb support it
Nobl9	Commercial SaaS	Multi-datasource SLOs	Connects Datadog, Prometheus, Dynatrace; cross-source SLOs; ships OpenSLO
Datadog SLOs	Managed (Datadog-native)	Datadog monitors + dashboards	Tight APM integration; burn-rate alerts built in; calendar and rolling window both supported

A minimal Sloth declaration looks like:

version: "prometheus/v1"
service: checkout
slos:
  - name: availability
    description: "Checkout requests succeed"
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5..",job="checkout"}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="checkout"}[{{.window}}]))
    objective: 99.9
    alerting:
      name: CheckoutAvailability
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket

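Sloth treats {{.window}} as a template variable, substitutes each alerting window into it, and emits one SLI recording rule per window. The burn-rate thresholds and the AND logic between the long and short windows are generated automatically. You declare the SLI expression once, and the window values, multipliers, and thresholds all derive from that single declaration. The SLI query itself can still be wrong, but the per-service copy-and-edit step where typos creep in is gone.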
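To make the template step concrete, here is a rough sketch of window expansion. It is not Sloth's code (Sloth is a Go tool with its own templating); the window list is taken from the windows named earlier in this lesson, and the `slo:error_ratio:rate<window>` record names are hypothetical.

```python
# Sketch of expanding one SLI declaration into per-window recording rules.
# Assumptions: plain string substitution stands in for the real templating,
# and the record names are hypothetical.

ERROR_QUERY = 'sum(rate(http_requests_total{status=~"5..",job="checkout"}[{{.window}}]))'
TOTAL_QUERY = 'sum(rate(http_requests_total{job="checkout"}[{{.window}}]))'

# Windows named in this lesson's hand-rolled setup.
WINDOWS = ["5m", "30m", "1h", "6h", "3d"]


def expand(template: str, window: str) -> str:
    """Replace the {{.window}} placeholder with a concrete range such as 5m."""
    return template.replace("{{.window}}", window)


def recording_rule(window: str) -> dict:
    """One recording rule: error ratio = error events / total events."""
    return {
        "record": f"slo:error_ratio:rate{window}",  # hypothetical name
        "expr": f"({expand(ERROR_QUERY, window)}) / ({expand(TOTAL_QUERY, window)})",
    }


rules = [recording_rule(w) for w in WINDOWS]

if __name__ == "__main__":
    for rule in rules:
        print(rule["record"])
        print("  " + rule["expr"])
```

Because every rule comes from the same two query strings, a window can no longer be mistyped in one rule and correct in another.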

Why this works

OpenSLO is the emerging cross-tool standard: one YAML format intended to drive Sloth, Nobl9, and Grafana SLO configurations. If you standardise on OpenSLO, the bet is that you can migrate from Prometheus-based alerting to Datadog without rewriting the SLO definitions from scratch, although the SLI queries inside each definition stay specific to their data source. Write once, run anywhere is the goal OpenSLO is aiming for. For teams already on the Prometheus stack, Sloth or Pyrra is the pragmatic choice.

One SLO spec in git fans out through the CI generator into recording rules, burn-rate alerts, and dashboards — then ships uniformly across every service. No per-service hand-written PromQL.

The 90-day SLO rollout pattern

SLO adoption fails at the cultural layer more often than the technical one. A common rollout pattern:

Weeks 1–2 — single journey, single team: Pick the most important user journey (checkout, login, the API endpoint customers pay for). Define the SLI at the API gateway level. Set a conservative target (start with 99% even if you think you can do 99.9% — you’ll learn the baseline quickly). Instrument counters, add the Sloth declaration, verify the recording rules appear in Prometheus.

Weeks 3–4 — verify SLO and budget: Watch the SLO for two weeks. Is the baseline error rate what you expected? Does the SLI correlate with real user reports? Adjust the target to the baseline: tighten it if the service runs well above target, and loosen it if the service cannot meet it, because an SLO that is too tight keeps the team in a constant freeze. Draft the error budget policy, but don't sign it yet.
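The baseline check in weeks 3–4 comes down to two numbers: the observed availability and how much of the error budget each candidate target would have consumed. A minimal sketch follows; the request counts are made-up illustration values, not measurements.

```python
# Sketch of the two-week baseline check.
# Assumption: the request and error counts below are invented for illustration.

def observed_availability(total: int, errors: int) -> float:
    """Percentage of requests that succeeded over the observation period."""
    return 100.0 * (total - errors) / total


def budget_consumed(total: int, errors: int, objective_percent: float) -> float:
    """Fraction of the error budget used; 1.0 means the budget is exhausted."""
    allowed_errors = total * (1.0 - objective_percent / 100.0)
    return errors / allowed_errors


if __name__ == "__main__":
    total_requests = 12_000_000  # hypothetical two-week request count
    failed_requests = 6_000      # hypothetical two-week error count

    print(f"observed: {observed_availability(total_requests, failed_requests):.3f}%")
    for target in (99.0, 99.9, 99.95):
        used = budget_consumed(total_requests, failed_requests, target)
        print(f"target {target}%: {used:.0%} of budget used")
```

With these illustrative counts the service barely touches a 99% budget, uses half of a 99.9% budget, and exhausts a 99.95% budget, which is the kind of evidence that supports tightening from 99% to 99.9% but no further.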

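Weeks 5–8 — alerting and policy sign-off: Enable MWMBR alerts. Add the SLO dashboard. Run a fire drill: deliberately break the service for 5 minutes and verify that the 14.4x alert fires within a few minutes, pages the on-call, and clears within 5 minutes of the fix. The drill has to push the 1h error ratio past 14.4 times the budget, so at a 99% target even a total outage needs about 9 minutes before the alert fires; lengthen the drill accordingly. Sign the error budget policy with director-level approval. Communicate the policy to all stakeholders.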
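Before running the fire drill it helps to estimate how long the outage must last for the 14.4x alert to fire. The sketch below uses a simplified model: constant traffic, no errors before the drill, a 1h long window paired with a 5m short window, and no allowance for scrape interval, rule evaluation interval, or any `for:` delay on the alert.

```python
# Sketch: estimated minutes of a steady error ratio before the fast-burn alert fires.
# Assumptions: constant traffic, zero errors before the drill, 1h/5m window pair,
# and no scrape, evaluation, or `for:` delays.
import math


def minutes_to_fire(objective_percent: float, error_ratio: float,
                    burn_rate: float = 14.4,
                    long_window_min: float = 60.0,
                    short_window_min: float = 5.0) -> float:
    """Minutes until both windows exceed burn_rate * error budget."""
    threshold = burn_rate * (1.0 - objective_percent / 100.0)
    if error_ratio <= threshold:
        # Even a window filled with this error ratio never crosses the threshold.
        return math.inf
    # After t minutes a window of W minutes averages error_ratio * t / W,
    # so each window crosses at t = W * threshold / error_ratio.
    long_fire = long_window_min * threshold / error_ratio
    short_fire = short_window_min * threshold / error_ratio
    return max(long_fire, short_fire)


if __name__ == "__main__":
    for target in (99.9, 99.0):
        t = minutes_to_fire(target, error_ratio=1.0)  # total outage
        print(f"target {target}%: fires after about {t:.2f} min of total outage")
```

Under this model a total outage fires the alert in under a minute at a 99.9% target but needs roughly 8.6 minutes at a 99% target, so a 5-minute drill that stays silent at a conservative target is expected behaviour rather than a broken alert.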

Weeks 9–12 — rollout to 5–10 more services: Replicate the pattern across other services. Each team uses the same Sloth YAML template. The platform team owns the Sloth/Pyrra generator; product teams own their SLO targets. Hold a first quarterly SLO review: which SLOs fired alerts that mattered, which were noise?

After 90 days — culture check: Survey the on-call teams: are they acting on SLO alerts or ignoring them? Review: are feature freezes actually happening when budgets are exhausted? If not, re-run the policy communication. The 90-day mark is also when to look at whether SLOs are covering the right journeys — gaps are common on the first pass.

Why this works

Adoption is the bottleneck, not tooling. A team with Sloth configured but no error budget policy sign-off and no quarterly review ritual is in the same state as a team with a single slow-window alert: the numbers exist, but the culture doesn’t act on them. The 90-day rollout is the difference between “we have SLOs” and “SLOs govern how we make decisions.”

Order the steps

Order the 90-day SLO adoption steps from first to last:

1 Pick a single high-value user journey; define the SLI at the API gateway level
2 Set a conservative SLO target; instrument counters; verify recording rules in Prometheus
3 Observe the baseline for 2 weeks; adjust the target to match real user tolerance
4 Enable MWMBR alerts; run a fire drill; sign the error budget policy
5 Roll out to 5–10 more services using the same Sloth/Pyrra template
6 Hold the first quarterly SLO review: which alerts mattered, which were noise?

Quiz

What is the primary benefit of using Sloth or Pyrra instead of hand-writing PromQL for SLO alerts?

Quiz

A team sets up Sloth, gets MWMBR alerts working, and considers the SLO adoption complete. What critical step is missing?

Recall before you leave

01 What does a tool like Sloth generate from a single SLO YAML declaration?
02 Why does a 90-day SLO rollout start with a single journey at a conservative target rather than immediately applying 99.9% targets to all services?

Recap

Hand-rolling MWMBR PromQL across many services introduces silent math errors — wrong window lengths, wrong burn multipliers, wrong thresholds. SLO platforms (Sloth, Pyrra, OpenSLO, Nobl9, Datadog) solve this by generating recording rules, alert expressions, and dashboards from a declarative SLI definition. OpenSLO is the emerging cross-tool standard. The technical setup is the easy part: the bottleneck is adoption. A 90-day rollout starts with one user journey at a conservative target, observes the baseline for two weeks, runs an alert fire drill, signs the error budget policy, and only then expands to more services. Without the quarterly review ritual and policy sign-off, SLOs become dashboard numbers that no one acts on. Now when you see a team proud of their Sloth setup but shipping through every freeze — you know exactly which step they skipped.
