Production SLO failures, self-observability, security, and the big picture

Stripe, GitHub, Coinbase, and Netflix incidents reveal SLO failure modes. SLO self-observability — NaN denominators, burn rate drift, policy gaps — is the meta-layer. Security intersects SLIs. The framework survives tool migrations because it is a contract, not a config.

A platform team has MWMBR alerts, Sloth-generated recording rules, and a signed error budget policy. Half the teams are ignoring SLO alerts. The problem isn’t the tooling — it’s that the SLIs don’t correlate with user pain, and the policy was never actually enforced.

Real production failures

Four case studies that reveal the failure modes:

Stripe 2022 — the policy worked: A checkout SLO at 99.99% was internally violated (engineering noticed via burn rate) but the external SLA (99.9%) was met. The team’s pre-defined error budget policy auto-froze new feature deploys for 3 days while the reliability team investigated — preventing a second incident on the still-degraded code path. The policy was triggered, enforced, and the freeze held without executive override. The lesson: when the policy is signed and the team believes it applies, it works exactly as designed. The SLO caught what manual monitoring would have missed.

GitHub 2023 — the SLI was wrong: GitHub’s SLO platform miscounted background-job failures as user-facing events for a quarter, eating the reliability budget and triggering a culture-of-blame conversation. Teams were penalized for “incidents” that users never experienced. The postmortem reset the SLI definition to journey-level only (user-facing GitHub Actions runs, not internal queue processing). The lesson: the SLI definition is the most important decision. Getting it wrong poisons an entire quarter’s worth of data and can destroy trust in the SLO program before it’s established.

Coinbase 2024 — the budget policy halted a risky expansion: A multi-region deploy violated the 99.99% trading API SLO for 8 minutes due to a misconfigured load balancer. The error budget policy kicked in within 24 hours, and the team paused new region launches for a week. The pause let the reliability team audit the multi-region deploy tooling and find two additional misconfiguration patterns before they caused incidents. The lesson: the freeze isn’t punishment — it’s a forcing function that directs engineering effort to the fragility that just exposed itself.

Netflix 2024 — the SLO was relaxed on purpose: Netflix’s internal SLO for video playback was loosened from 99.99% to 99.95% after a six-month review showed users couldn’t perceive the difference at 99.95% but the engineering cost to maintain the extra nine was significant. The lesson: SLOs are living targets that evolve with the system and with user research. “Tighter is always better” is false. The quarterly review exists to run this exact experiment.

Common pattern across all four: SLOs drive engineering decisions, not the other way around. The companies that get value from SLOs are the ones where the budget number changes what teams do — freeze, investigate, relax, tighten — not the ones where the SLO is a dashboard metric that no one acts on.

A server-side SLI (5xx at the load balancer) stays green while users hit timeouts and slow responses — the failure is invisible upstream of the measurement point. The fix: measure at the user-facing edge and alert on burn rate, so the page fires on real pain.

Observability for SLOs themselves

The meta-question: how do you know the SLO platform is working?

Signal 1 — ratio_total must never go to NaN: If the SLI denominator is zero (no traffic, low-traffic edge case, counter reset), the ratio is 0/0 and the recording rule produces NaN. A NaN burn rate is invisible: every threshold comparison against NaN is false, so the alert cannot fire, and an alert that was already firing resolves silently. Monitor sum(rate(http_request_total[5m])) == 0 and alert on it separately, because “we have no traffic signal” is itself an alert condition.
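The NaN trap can be sketched in a few lines of Python. This is an illustration only, not how any particular SLO tool computes its rules: the function names, the "no_traffic" state, and the sample rates are assumptions. Note one difference from Prometheus: plain Python raises ZeroDivisionError on 0.0 / 0.0 instead of returning NaN, so the sketch guards the denominator explicitly.

```python
import math
from typing import Optional


def error_ratio(errors_per_sec: float, requests_per_sec: float) -> Optional[float]:
    """Return the error ratio, or None when there is no traffic to measure."""
    if requests_per_sec <= 0 or math.isnan(requests_per_sec):
        return None  # explicit "no signal" instead of a silent NaN
    return errors_per_sec / requests_per_sec


def burn_rate_state(ratio: Optional[float], slo_target: float, threshold: float) -> str:
    """Classify one evaluation as 'no_traffic', 'burning', or 'ok'."""
    if ratio is None:
        return "no_traffic"
    budget = 1.0 - slo_target  # allowed error ratio
    return "burning" if ratio / budget > threshold else "ok"


# Why the naive form is dangerous: any comparison against NaN is False,
# so a threshold check on a NaN burn rate can never fire.
nan_burn = float("nan")
naive_alert_fires = nan_burn > 14.4

print(naive_alert_fires)
print(burn_rate_state(error_ratio(0.0, 0.0), 0.999, 14.4))
print(burn_rate_state(error_ratio(2.0, 100.0), 0.999, 14.4))
```

Treating "no traffic" as its own state, rather than as a healthy or unhealthy ratio, is what lets a separate alert cover it.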

Signal 2 — long-window burn rate should be stationary: Plot the 3d burn rate over 90 days. It should oscillate around 1x on average (hitting the SLO exactly; some weeks above, some below). A persistent 1.5x average means the SLO target is too tight for the current system — constant stress, constant freeze risk. A persistent 0.3x means the target is too loose — over-engineering for reliability no user needs. Stationary around 1x means the target is calibrated.
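A minimal sketch of the calibration check, assuming the 3d burn-rate samples have already been exported as a list of floats. The 1.5x and 0.3x thresholds come from the text; the function name and the "in_band" label for everything between them are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def classify_target(
    burn_rates_3d: Sequence[float],
    tight_above: float = 1.5,
    loose_below: float = 0.3,
) -> str:
    """Classify an SLO target from the long-run average of its 3d burn rate."""
    if not burn_rates_3d:
        raise ValueError("need at least one burn-rate sample")
    avg = mean(burn_rates_3d)
    if avg > tight_above:
        return "too_tight"   # budget is overspent on average
    if avg < loose_below:
        return "too_loose"   # most of the budget is never used
    return "in_band"         # assumption: anything between the thresholds


print(classify_target([1.6, 1.7, 1.5]))
print(classify_target([0.1, 0.2, 0.3]))
print(classify_target([0.8, 1.2, 1.0]))
```

An average alone hides trends, so a real review would likely also look at the plot over the full 90 days.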

Signal 3 — policy outcomes must match burn history: If the 3-month burn rate history shows three periods where the budget went negative but no freezes were triggered, the policy is being overridden. Either the policy doesn’t have real authority (needs director-level re-sign) or the teams don’t know it applies to them (communication gap). The SLO meta-dashboard should track: number of active SLOs, number currently burning above 1x, average budget remaining, time since last freeze per SLO.

Meta-signal	What it reveals	Action
ratio_total == 0 / NaN	No traffic; SLI denominator broken	Alert on NaN; add synthetic probes
3d burn avg > 1.5x sustained	SLO target too tight for system	Quarterly review: relax or fix
3d burn avg < 0.3x sustained	SLO target too loose	Quarterly review: tighten
Budget ≤ 0 with no freeze triggered	Policy not enforced	Re-sign policy; investigate override
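The last row of the table, budget exhausted with no freeze, can be audited mechanically. The sketch below assumes a hypothetical record per SLO per review period; the BudgetPeriod shape, the service names, and the numbers are invented for illustration and do not come from any real platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BudgetPeriod:
    """One review period for one SLO (hypothetical record shape)."""
    slo_name: str
    budget_remaining: float  # fraction of error budget left; <= 0 means exhausted
    freeze_triggered: bool   # whether the policy's freeze actually happened


def unenforced_periods(history: List[BudgetPeriod]) -> List[BudgetPeriod]:
    """Return periods where the budget ran out but no freeze followed."""
    return [
        p for p in history
        if p.budget_remaining <= 0 and not p.freeze_triggered
    ]


history = [
    BudgetPeriod("checkout", 0.40, False),
    BudgetPeriod("checkout", -0.10, True),
    BudgetPeriod("search", -0.25, False),
    BudgetPeriod("search", 0.00, False),
]

gaps = unenforced_periods(history)
for p in gaps:
    print(f"{p.slo_name}: budget {p.budget_remaining:+.2f}, no freeze")
```

A non-empty result is the cue to ask whether the policy was overridden or simply unknown to the team.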

Senior-tier SLO production numbers

Budget at 99.9% SLO, 1M req/day, 28 days: 28,000 errors
Burn rate 14.4x error rate at 99.9% SLO: 1.44% request failure rate
Composite ceiling: 5 services at 99.9%: ~99.5%
Typical SLA vs SLO buffer: 0.05–0.5 percentage points
Single-incident postmortem trigger: ≥ 20% of 28-day budget burned
Netflix SLO relaxation, 99.99% → 99.95%: users could not perceive the difference; engineering cost dropped
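The first three numbers, plus the postmortem trigger, follow from simple arithmetic, reproduced here as a sketch. The function names are illustrative, and the composite figure assumes the five services sit in series and fail independently, which real systems only approximate.

```python
def error_budget(slo_target: float, requests_per_day: int, window_days: int) -> int:
    """Number of failed requests the SLO tolerates over the window."""
    return round((1.0 - slo_target) * requests_per_day * window_days)


def error_rate_at_burn(slo_target: float, burn_rate: float) -> float:
    """Request failure rate that corresponds to a given burn rate."""
    return burn_rate * (1.0 - slo_target)


def composite_ceiling(slo_target: float, services: int) -> float:
    """Best-case availability of a serial chain of independent services."""
    return slo_target ** services


budget = error_budget(0.999, 1_000_000, 28)
rate = error_rate_at_burn(0.999, 14.4)
ceiling = composite_ceiling(0.999, 5)
postmortem_trigger = round(0.20 * budget)  # 20% of the 28-day budget

print(budget)               # 28000
print(f"{rate:.2%}")        # 1.44%
print(f"{ceiling:.2%}")     # 99.50%
print(postmortem_trigger)   # 5600
```

On these numbers, a single incident that costs 5,600 or more failed requests would cross the 20% postmortem threshold.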

Security and SLOs: two intersections

Intersection 1 — bot traffic skews the SLI: A successful but malicious request counts as “good” in the availability SLI, so a credential-stuffing attack that returns 200 OK passes the SLO. Bot traffic also inflates the denominator and can mask real user issues: if legitimate traffic is about 10 requests per second with a 1% error rate, 1,000 successful bot requests per second dilute the measured error rate to roughly 0.01%. The senior pattern: compute SLOs over filtered traffic (drop known bots, rate-limited IPs, scanner traffic from security testing). The SLI should track real-user health, not all-traffic health.
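The dilution effect is easy to check numerically. The sketch below assumes 10 legitimate requests per second at a 1% error rate and 1,000 bot requests per second that all succeed; those traffic figures and the function name are illustrative assumptions, not measurements.

```python
def measured_error_rate(
    legit_rps: float,
    legit_error_rate: float,
    bot_rps: float,
    bot_error_rate: float = 0.0,
) -> float:
    """Error rate an all-traffic SLI would report for mixed legit and bot traffic."""
    total_rps = legit_rps + bot_rps
    if total_rps <= 0:
        raise ValueError("no traffic to measure")
    errors_per_sec = legit_rps * legit_error_rate + bot_rps * bot_error_rate
    return errors_per_sec / total_rps


unfiltered = measured_error_rate(legit_rps=10, legit_error_rate=0.01, bot_rps=1000)
filtered = measured_error_rate(legit_rps=10, legit_error_rate=0.01, bot_rps=0)

print(f"all traffic:   {unfiltered:.4%}")
print(f"bots filtered: {filtered:.4%}")
```

With bots included the SLI reports about 0.01%, well inside a 99.9% target; with bots filtered it reports the 1% that real users actually see.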

Intersection 2 — SLO burn as a security signal: An availability drop with no infra cause — no deploys, no config changes, no upstream degradation — may be the first symptom of a DDoS or a backend exploit. Several incident-response playbooks include “check SLO burn rate” as a step in the security-incident checklist, alongside log anomaly checks and network traffic analysis. A burn-rate spike at 3 AM on a Saturday with no correlated infra event is worth a security look even before the infrastructure explanation is found.
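One way to make the "no correlated infra event" check concrete is to compare the spike time against a change log. This is a rough sketch: the function name, the two-hour lookback, and the idea that deploys, config changes, and upstream incidents are available as a flat list of timestamps are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable


def needs_security_review(
    spike_at: datetime,
    change_events: Iterable[datetime],
    lookback: timedelta = timedelta(hours=2),
) -> bool:
    """True when no known change event falls in the lookback window before the spike."""
    window_start = spike_at - lookback
    return not any(window_start <= t <= spike_at for t in change_events)


spike = datetime(2024, 6, 1, 3, 0)
deploys = [datetime(2024, 5, 31, 16, 30)]       # last change was hours earlier
recent_deploys = [datetime(2024, 6, 1, 2, 15)]  # a change 45 minutes before

print(needs_security_review(spike, deploys))
print(needs_security_review(spike, recent_deploys))
```

A True result does not prove an attack; it only says the infrastructure explanation is missing, which is the condition the text flags as worth a security look.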

The bigger picture

An SLO is not a number in a dashboard. It is a contract that converts product decisions into engineering arithmetic. The error budget is the bridge between “we want to ship” and “we want to be reliable.” The MWMBR alert is the bridge between “the budget is being spent” and “wake the engineer up.” The error budget policy is the bridge between the alert and the org chart.

Why this works

The reason the SLO framework outlives every monitoring tool generation is that it doesn’t depend on tools. It depends on the team having committed to one number — the SLO target — that everyone (product, engineering, operations) agrees is the truth. Prometheus gets replaced by Datadog; Datadog gets replaced by something else. The SLO survives every migration because it’s the commitment, not the infrastructure.

Why teams abandon SLOs:

SLI doesn’t correlate with user pain → alerts are noise → team learns to ignore them
SLO target too tight → constant freezes → product pressure overrides → policy loses teeth
Policy never signed at director level → “advisory” SLOs that no one acts on
No quarterly review → SLO drifts from actual user needs → wrong signal for two years

Why teams succeed with SLOs:

Start with one journey, validate SLI against actual user reports
Set a conservative initial target, tighten quarterly
Run the fire drill before going live
Get the director signature before the first freeze event (not during it)
Hold the quarterly review with product present

Same failure modes, opposite outcomes. SLO adoption fails or succeeds on the organizational contract (the SLI definition, the signed policy, the quarterly review), not on the monitoring tool.

Quiz

A platform team rolls out SLOs to 80 services. Six months in, half the teams are ignoring SLO alerts even with MWMBR properly configured. What is the most likely root cause?

Quiz

The ratio_total recording rule for a service returns NaN in Prometheus. What does this mean for SLO alerting?

Recall before you leave

01. What are three organizational failure modes that cause teams to abandon SLOs after initial adoption?
02. Describe three meta-signals that tell you whether the SLO platform itself is working correctly.
03. Why does the SLO framework survive tool migrations when specific monitoring tools don't?

Recap

Real SLO production failures reveal the failure modes: GitHub miscounted background jobs as user-facing events and corrupted a quarter’s data; Coinbase’s error budget policy triggered correctly and prevented a cascade; Netflix deliberately relaxed a target after user research showed the extra nine was invisible to users. Observing the SLO system itself requires three meta-signals: ratio_total must never be NaN (no traffic → silent alert failure), long-window burn rate should be stationary around 1x (drift reveals miscalibrated targets), and budget-negative events must produce freeze activations (gap reveals policy with no teeth). Security intersects SLOs in two places: bot traffic dilutes the SLI denominator, and burn-rate spikes with no infra cause may be security incidents. The SLO framework survives every tooling generation because it is a contract — product and engineering committed to one number — not a configuration. Tools migrate; contracts persist. Now when you inherit a Prometheus stack migrating to Datadog, you know your SLOs survive the migration — only the recording rules need regenerating.
