Observability

Production SLO failures, self-observability, security, and the big picture

Crux: Stripe, GitHub, Coinbase, and Netflix incidents reveal SLO failure modes. SLO self-observability (NaN denominators, burn rate drift, policy gaps) is the meta-layer. Security intersects SLIs. The framework survives tool migrations because it is a contract, not a config.

A platform team has multi-window, multi-burn-rate (MWMBR) alerts, Sloth-generated recording rules, and a signed error budget policy. Yet half the teams are ignoring SLO alerts. The problem isn’t the tooling: the SLIs don’t correlate with user pain, and the policy was never actually enforced.

Real production failures

Four case studies that reveal the failure modes:

Stripe 2022 — the policy worked: A checkout SLO at 99.99% was internally violated (engineering noticed via burn rate) but the external SLA (99.9%) was met. The team’s pre-defined error budget policy auto-froze new feature deploys for 3 days while the reliability team investigated — preventing a second incident on the still-degraded code path. The policy was triggered, enforced, and the freeze held without executive override. The lesson: when the policy is signed and the team believes it applies, it works exactly as designed. The SLO caught what manual monitoring would have missed.

GitHub 2023 — the SLI was wrong: GitHub’s SLO platform miscounted background-job failures as user-facing events for a quarter, eating the reliability budget and triggering a culture-of-blame conversation. Teams were penalized for “incidents” that users never experienced. Postmortem reset the SLI definition to journey-level only (user-facing GitHub Actions runs, not internal queue processing). The lesson: the SLI definition is the most important decision — getting it wrong poisons an entire quarter’s worth of data and can destroy trust in the SLO program before it’s established.

Coinbase 2024 — the budget policy halted a risky expansion: A multi-region deploy violated the 99.99% trading API SLO for 8 minutes due to a misconfigured load balancer. The error budget policy kicked in within 24 hours, and the team paused new region launches for a week. The pause let the reliability team audit the multi-region deploy tooling and find two additional misconfiguration patterns before they caused incidents. The lesson: the freeze isn’t punishment — it’s a forcing function that directs engineering effort to the fragility that just exposed itself.

Netflix 2024 — the SLO was relaxed on purpose: Netflix’s internal SLO for video playback was loosened from 99.99% to 99.95% after a six-month review showed users couldn’t perceive the difference at 99.95% but the engineering cost to maintain the extra nine was significant. The lesson: SLOs are living targets that evolve with the system and with user research. “Tighter is always better” is false. The quarterly review exists to run this exact experiment.

Common pattern across all four: SLOs drive engineering decisions, not the other way around. The companies that get value from SLOs are the ones where the budget number changes what teams do — freeze, investigate, relax, tighten — not the ones where the SLO is a dashboard metric that no one acts on.

Observability for SLOs themselves

The meta-question: how do you know the SLO platform is working?

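Signal 1: ratio_total must never go to NaN. If the SLI denominator is zero (no traffic, low-traffic edge case, counter reset), the recording rule divides zero by zero and produces NaN. A NaN burn rate is invisible: a threshold comparison such as > against NaN is never true, so the alert neither fires nor clears correctly. Monitor sum(rate(http_request_total[5m])) == 0 and alert on it separately. Also cover the case where the series disappears entirely (for example with absent()), because an empty query result never equals 0. “We have no traffic signal” is itself an alert condition.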
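The NaN problem can be sketched outside PromQL. The following is a minimal Python illustration, not the recording rule itself: the function names (sli_ratio, burn_rate, evaluate) and the "no_signal" state are hypothetical, chosen only to show why a zero denominator needs its own explicit outcome instead of flowing into a threshold comparison.

```python
import math
from typing import Optional


def sli_ratio(good: float, total: float) -> Optional[float]:
    """Return good/total, or None when there is no traffic to measure."""
    if math.isnan(good) or math.isnan(total) or total <= 0:
        return None
    return good / total


def burn_rate(sli: Optional[float], slo_target: float) -> Optional[float]:
    """Burn rate = observed error rate / error rate allowed by the SLO."""
    if sli is None:
        return None
    return (1.0 - sli) / (1.0 - slo_target)


def evaluate(good: float, total: float, slo_target: float, threshold: float) -> str:
    """Classify a window as 'no_signal', 'burning', or 'ok'."""
    rate = burn_rate(sli_ratio(good, total), slo_target)
    if rate is None:
        # Hypothetical third state: surfaced as its own alert condition.
        return "no_signal"
    return "burning" if rate > threshold else "ok"


# A raw NaN would silently fail the threshold check instead:
nan_fires = float("nan") > 14.4  # False, so the alert would never fire
```

Calling evaluate(0, 0, 0.999, 14.4) returns "no_signal" rather than quietly behaving like "ok", which is the property the separate no-traffic alert is meant to provide.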

Signal 2 — long-window burn rate should be stationary: Plot the 3d burn rate over 90 days. It should oscillate around 1x on average (hitting the SLO exactly; some weeks above, some below). A persistent 1.5x average means the SLO target is too tight for the current system — constant stress, constant freeze risk. A persistent 0.3x means the target is too loose — over-engineering for reliability no user needs. Stationary around 1x means the target is calibrated.
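As a rough sketch of the calibration check described above, the following Python function averages a series of long-window burn-rate samples and compares the result with the 1.5x and 0.3x bounds from the text. The function name, the verdict labels, and the choice of a plain arithmetic mean are assumptions for illustration; a real review would also look at the shape of the series, not only its average.

```python
from statistics import mean
from typing import Sequence


def calibration_verdict(burn_rates: Sequence[float],
                        tight_above: float = 1.5,
                        loose_below: float = 0.3) -> str:
    """Classify an SLO target from long-window burn-rate samples.

    burn_rates: e.g. daily samples of the 3d burn rate over 90 days.
    """
    # Drop NaN samples (NaN != NaN), since they carry no signal.
    samples = [b for b in burn_rates if b == b]
    if not samples:
        return "no_data"
    avg = mean(samples)
    if avg > tight_above:
        return "too_tight"    # persistent overspend: relax the target or fix the system
    if avg < loose_below:
        return "too_loose"    # persistent underspend: consider tightening
    return "within_band"      # between the two bounds; review for drift around 1x
```

For example, calibration_verdict([1.2, 0.8, 1.0, 0.9]) returns "within_band", while a series averaging 1.6 returns "too_tight".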

Signal 3 — policy outcomes must match burn history: If the 3-month burn rate history shows three periods where the budget went negative but no freezes were triggered, the policy is being overridden. Either the policy doesn’t have real authority (needs director-level re-sign) or the teams don’t know it applies to them (communication gap). The SLO meta-dashboard should track: number of active SLOs, number currently burning above 1x, average budget remaining, time since last freeze per SLO.
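One way to audit this is to join budget history against freeze records and list the periods where the budget was exhausted but no freeze happened. The sketch below assumes a hypothetical BudgetPeriod record; the field names and the idea that freeze activations are recorded per period are illustrative assumptions, not part of any specific SLO tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BudgetPeriod:
    slo_name: str
    period: str               # hypothetical label, e.g. "2024-03"
    budget_remaining: float   # fraction of budget left; <= 0 means exhausted
    freeze_triggered: bool


def unenforced_periods(history: List[BudgetPeriod]) -> List[BudgetPeriod]:
    """Return periods where the budget was exhausted but no freeze was triggered."""
    return [p for p in history
            if p.budget_remaining <= 0 and not p.freeze_triggered]


history = [
    BudgetPeriod("checkout", "2024-01", 0.40, False),
    BudgetPeriod("checkout", "2024-02", -0.10, False),  # overridden or ignored
    BudgetPeriod("checkout", "2024-03", -0.25, True),   # policy enforced
]
gaps = unenforced_periods(history)
```

Here gaps contains only the "2024-02" period. A non-empty result is the cue to ask whether the policy lacks authority or the team did not know it applied.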

Meta-signal | What it reveals | Action
ratio_total == 0 / NaN | No traffic; SLI denominator broken | Alert on NaN; add synthetic probes
3d burn avg > 1.5x sustained | SLO target too tight for system | Quarterly review: relax or fix
3d burn avg < 0.3x sustained | SLO target too loose | Quarterly review: tighten
Budget ≤ 0 with no freeze triggered | Policy not enforced | Re-sign policy; investigate override

Senior-tier SLO production numbers

  • Error budget at a 99.9% SLO, 1M req/day, 28 days: 28,000 errors
  • Error rate at a 14.4x burn rate against a 99.9% SLO: 1.44% request failure rate
  • Composite ceiling, 5 services at 99.9% each: ~99.5%
  • Typical SLA vs SLO buffer: 0.05–0.5 percentage points
  • Single-incident postmortem trigger: ≥ 20% of 28-day budget burned
  • Netflix SLO relaxation, 99.99% → 99.95%: users could not perceive the difference; engineering cost dropped
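The arithmetic behind the first three numbers and the postmortem trigger can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, and the composite figure assumes the five services are all required for a request to succeed and fail independently.

```python
def error_budget(slo_target: float, requests_per_day: float, window_days: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return (1.0 - slo_target) * requests_per_day * window_days


def error_rate_at_burn(burn_rate: float, slo_target: float) -> float:
    """Request failure rate implied by a given burn rate."""
    return burn_rate * (1.0 - slo_target)


def composite_availability(per_service: float, count: int) -> float:
    """Availability ceiling when `count` independent services must all succeed."""
    return per_service ** count


budget = error_budget(0.999, 1_000_000, 28)   # about 28,000 errors
fast_burn = error_rate_at_burn(14.4, 0.999)   # about 0.0144, i.e. 1.44%
ceiling = composite_availability(0.999, 5)    # about 0.995, i.e. ~99.5%
postmortem_at = 0.20 * budget                 # about 5,600 errors in one incident
```

The last line applies the 20% postmortem trigger to the same budget: under these inputs, a single incident that burns roughly 5,600 errors would cross it.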

Security and SLOs: two intersections

Intersection 1: bot traffic skews the SLI. A successful but malicious request counts as “good” in the availability SLI: a credential-stuffing attack that returns 200 OK passes the SLO. Bot traffic inflates the denominator and can mask real user issues. For example, if legitimate traffic runs at roughly 10 requests per second with a 1% error rate, 1,000 successful bot requests per second dilute the measured error rate to about 0.01%. The senior pattern: compute SLOs over filtered traffic (drop known bots, rate-limited IPs, scanner traffic from security testing). The SLI should track real-user health, not all-traffic health.
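The dilution effect is plain arithmetic, and a short Python sketch makes the size of it concrete. The figure of 10 legitimate requests per second is an assumption introduced here for illustration, as is the function name; the 1% legitimate error rate and the 1,000 bot requests per second come from the text.

```python
from typing import Optional


def measured_error_rate(legit_rps: float, legit_error_rate: float,
                        bot_rps: float, bot_error_rate: float = 0.0) -> Optional[float]:
    """Error rate an all-traffic SLI would report for mixed legit and bot traffic."""
    total = legit_rps + bot_rps
    if total == 0:
        return None  # no traffic, no signal
    errors = legit_rps * legit_error_rate + bot_rps * bot_error_rate
    return errors / total


# Assumed: 10 legitimate req/s failing 1% of the time, 1,000 bot req/s all returning 200.
unfiltered = measured_error_rate(10, 0.01, 1000)  # roughly 0.0001, i.e. about 0.01%
filtered = measured_error_rate(10, 0.01, 0)       # 0.01, i.e. the real 1%
```

Under these assumptions the unfiltered SLI understates the legitimate error rate by about two orders of magnitude, which is the case for computing the SLI over filtered traffic.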

Intersection 2 — SLO burn as a security signal: An availability drop with no infra cause — no deploys, no config changes, no upstream degradation — may be the first symptom of a DDoS or a backend exploit. Several incident-response playbooks include “check SLO burn rate” as a step in the security-incident checklist, alongside log anomaly checks and network traffic analysis. A burn-rate spike at 3 AM on a Saturday with no correlated infra event is worth a security look even before the infrastructure explanation is found.
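A triage rule of this kind could be sketched as follows. Everything here is an assumption for illustration: the InfraEvent record, the one-hour lookback, and the 14.4x default threshold are hypothetical choices, and a real playbook would combine this check with log and network analysis rather than rely on it alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class InfraEvent:
    kind: str      # e.g. "deploy", "config_change", "upstream_degradation"
    at: datetime


def needs_security_review(spike_at: datetime,
                          burn_rate: float,
                          infra_events: Iterable[InfraEvent],
                          threshold: float = 14.4,
                          lookback: timedelta = timedelta(hours=1)) -> bool:
    """Flag a burn-rate spike that has no infra event in the lookback window."""
    if burn_rate <= threshold:
        return False
    window_start = spike_at - lookback
    correlated = any(window_start <= e.at <= spike_at for e in infra_events)
    return not correlated


spike = datetime(2024, 6, 1, 3, 0)  # a Saturday, 3 AM
deploy = InfraEvent("deploy", datetime(2024, 6, 1, 2, 30))
```

With these values, needs_security_review(spike, 20.0, []) returns True, while passing [deploy] returns False because the deploy falls inside the lookback window.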

The bigger picture

An SLO is not a number in a dashboard. It is a contract that converts product decisions into engineering arithmetic. The error budget is the bridge between “we want to ship” and “we want to be reliable.” The MWMBR alert is the bridge between “the budget is being spent” and “wake the engineer up.” The error budget policy is the bridge between the alert and the org chart.

Why this works

The reason the SLO framework outlives every monitoring tool generation is that it doesn’t depend on tools. It depends on the team having committed to one number — the SLO target — that everyone (product, engineering, operations) agrees is the truth. Prometheus gets replaced by Datadog; Datadog gets replaced by something else. The SLO survives every migration because it’s the commitment, not the infrastructure.

Why teams abandon SLOs:

  • SLI doesn’t correlate with user pain → alerts are noise → team learns to ignore them
  • SLO target too tight → constant freezes → product pressure overrides → policy loses teeth
  • Policy never signed at director level → “advisory” SLOs that no one acts on
  • No quarterly review → SLO drifts from actual user needs → wrong signal for two years

Why teams succeed with SLOs:

  • Start with one journey, validate SLI against actual user reports
  • Set a conservative initial target, tighten quarterly
  • Run the fire drill before going live
  • Get the director signature before the first freeze event (not during it)
  • Hold the quarterly review with product present

Quiz

A platform team rolls out SLOs to 80 services. Six months in, half the teams are ignoring SLO alerts even with MWMBR properly configured. What is the most likely root cause?

Quiz

The ratio_total recording rule for a service returns NaN in Prometheus. What does this mean for SLO alerting?

Recall before you leave

  1. What are three organizational failure modes that cause teams to abandon SLOs after initial adoption?
  2. Describe three meta-signals that tell you whether the SLO platform itself is working correctly.
  3. Why does the SLO framework survive tool migrations when specific monitoring tools don't?

Recap

Real SLO production failures reveal the failure modes: GitHub miscounted background jobs as user-facing events and corrupted a quarter’s data; Coinbase’s error budget policy triggered correctly and prevented a cascade; Netflix deliberately relaxed a target after user research showed the extra nine was invisible to users. Observing the SLO system itself requires three meta-signals: ratio_total must never be NaN (no traffic → silent alert failure), long-window burn rate should be stationary around 1x (drift reveals miscalibrated targets), and budget-negative events must produce freeze activations (gap reveals policy with no teeth). Security intersects SLOs in two places: bot traffic dilutes the SLI denominator, and burn-rate spikes with no infra cause may be security incidents. The SLO framework survives every tooling generation because it is a contract — product and engineering committed to one number — not a configuration. Tools migrate; contracts persist.
