Observability
Production SLO failures, self-observability, security, and the big picture
A platform team has MWMBR alerts, Sloth-generated recording rules, and a signed error budget policy. Half the teams are ignoring SLO alerts. The problem isn’t the tooling — it’s that the SLIs don’t correlate with user pain, and the policy was never actually enforced.
Real production failures
Four case studies that reveal the failure modes:
Stripe 2022 — the policy worked: A checkout SLO at 99.99% was internally violated (engineering noticed via burn rate) but the external SLA (99.9%) was met. The team’s pre-defined error budget policy auto-froze new feature deploys for 3 days while the reliability team investigated — preventing a second incident on the still-degraded code path. The policy was triggered, enforced, and the freeze held without executive override. The lesson: when the policy is signed and the team believes it applies, it works exactly as designed. The SLO caught what manual monitoring would have missed.
GitHub 2023 — the SLI was wrong: GitHub’s SLO platform miscounted background-job failures as user-facing events for a quarter, eating the reliability budget and triggering a culture-of-blame conversation. Teams were penalized for “incidents” that users never experienced. Postmortem reset the SLI definition to journey-level only (user-facing GitHub Actions runs, not internal queue processing). The lesson: the SLI definition is the most important decision — getting it wrong poisons an entire quarter’s worth of data and can destroy trust in the SLO program before it’s established.
Coinbase 2024 — the budget policy halted a risky expansion: A multi-region deploy violated the 99.99% trading API SLO for 8 minutes due to a misconfigured load balancer. The error budget policy kicked in within 24 hours, and the team paused new region launches for a week. The pause let the reliability team audit the multi-region deploy tooling and find two additional misconfiguration patterns before they caused incidents. The lesson: the freeze isn’t punishment — it’s a forcing function that directs engineering effort to the fragility that just exposed itself.
Netflix 2024 — the SLO was relaxed on purpose: Netflix’s internal SLO for video playback was loosened from 99.99% to 99.95% after a six-month review showed users couldn’t perceive the difference at 99.95% but the engineering cost to maintain the extra nine was significant. The lesson: SLOs are living targets that evolve with the system and with user research. “Tighter is always better” is false. The quarterly review exists to run this exact experiment.
Common pattern across all four: SLOs drive engineering decisions, not the other way around. The companies that get value from SLOs are the ones where the budget number changes what teams do — freeze, investigate, relax, tighten — not the ones where the SLO is a dashboard metric that no one acts on.
Observability for SLOs themselves
The meta-question: how do you know the SLO platform is working?
Signal 1 — ratio_total must never go to NaN:
If the SLI denominator is zero (no traffic, low-traffic edge case, counter reset), the recording rule produces NaN. A NaN burn rate is invisible: the alert neither fires nor clears correctly. Monitor sum(rate(http_request_total[5m])) == 0 and alert on it separately — “we have no traffic signal” is itself an alert condition.
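The silent failure described above can be reproduced outside Prometheus. The following is a minimal Python sketch, not the platform's actual rule: the function names `sli_error_ratio`, `should_page`, and `no_traffic_signal` are hypothetical, and the 14.4x threshold is borrowed from the figures later in this section. It relies only on the fact that any ordering comparison against NaN evaluates to false.

```python
import math

def sli_error_ratio(errors: float, total: float) -> float:
    """Error ratio errors/total; a zero denominator yields NaN.

    Python raises ZeroDivisionError on 0/0, so NaN is returned explicitly
    to mirror IEEE 754 float division.
    """
    if total == 0:
        return math.nan
    return errors / total

def should_page(error_ratio: float, slo_target: float, burn_threshold: float) -> bool:
    """Fire when burn rate (error ratio / allowed error ratio) exceeds the threshold."""
    burn_rate = error_ratio / (1.0 - slo_target)
    return burn_rate > burn_threshold  # NaN > x is always False

def no_traffic_signal(total: float) -> bool:
    """Separate alert condition: there is no denominator to measure against."""
    return total == 0

# With zero traffic the burn-rate alert stays quiet, so only the
# dedicated no-traffic check reports that something is wrong.
ratio = sli_error_ratio(errors=0, total=0)
print(math.isnan(ratio), should_page(ratio, 0.999, 14.4), no_traffic_signal(0))
```

The design point is that the no-traffic check is evaluated independently of the burn-rate comparison, so a broken denominator cannot hide behind an alert that simply never fires.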
Signal 2 — long-window burn rate should be stationary: Plot the 3d burn rate over 90 days. It should oscillate around 1x on average (hitting the SLO exactly; some weeks above, some below). A persistent 1.5x average means the SLO target is too tight for the current system — constant stress, constant freeze risk. A persistent 0.3x means the target is too loose — over-engineering for reliability no user needs. Stationary around 1x means the target is calibrated.
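The calibration check can be expressed as a small classifier. This is a sketch under stated assumptions: the function name `classify_target` and the sample series are invented for illustration, and the 1.5x and 0.3x cut-offs are the ones quoted in the paragraph above. Values between the two cut-offs are reported only as "within band", since deciding whether the series is truly stationary around 1x still requires looking at the plot.

```python
from statistics import mean
from typing import Sequence

def classify_target(burn_rates: Sequence[float],
                    tight_above: float = 1.5,
                    loose_below: float = 0.3) -> str:
    """Classify an SLO target from long-window burn-rate samples.

    burn_rates: e.g. daily samples of the 3d burn rate over 90 days.
    Raises statistics.StatisticsError if the sequence is empty.
    """
    avg = mean(burn_rates)
    if avg > tight_above:
        return "too tight"    # constant stress, constant freeze risk
    if avg < loose_below:
        return "too loose"    # reliability no user needs
    return "within band"

# Hypothetical samples, not real measurements.
print(classify_target([1.6, 1.7, 1.5]))
print(classify_target([0.8, 1.2, 1.0]))
```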
Signal 3 — policy outcomes must match burn history: If the 3-month burn rate history shows three periods where the budget went negative but no freezes were triggered, the policy is being overridden. Either the policy doesn’t have real authority (needs director-level re-sign) or the teams don’t know it applies to them (communication gap). The SLO meta-dashboard should track: number of active SLOs, number currently burning above 1x, average budget remaining, time since last freeze per SLO.
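One way to audit this is to join the dates on which a budget went negative against the dates on which a freeze actually started. The sketch below assumes both lists can be exported from wherever they are recorded; the function name `unenforced_breaches` and the two-day grace period are illustrative assumptions, not part of any policy described here.

```python
from datetime import date, timedelta
from typing import List

def unenforced_breaches(breach_dates: List[date],
                        freeze_dates: List[date],
                        grace_days: int = 2) -> List[date]:
    """Return breach dates that were not followed by a freeze in time.

    A breach counts as enforced if some freeze started on or after the
    breach date and no later than grace_days afterwards.
    """
    grace = timedelta(days=grace_days)
    return [
        b for b in breach_dates
        if not any(b <= f <= b + grace for f in freeze_dates)
    ]

# Hypothetical history: three budget-negative events, one freeze.
breaches = [date(2024, 1, 10), date(2024, 2, 3), date(2024, 3, 21)]
freezes = [date(2024, 2, 4)]
print(unenforced_breaches(breaches, freezes))
```

A non-empty result is the gap the paragraph describes: budget-negative events with no matching freeze, which points at either an override or a communication gap.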
| Meta-signal | What it reveals | Action |
|---|---|---|
| ratio_total == 0 / NaN | No traffic; SLI denominator broken | Alert on NaN; add synthetic probes |
| 3d burn avg > 1.5x sustained | SLO target too tight for system | Quarterly review: relax or fix |
| 3d burn avg < 0.3x sustained | SLO target too loose | Quarterly review: tighten |
| Budget ≤ 0 with no freeze triggered | Policy not enforced | Re-sign policy; investigate override |
- Budget at 99.9% SLO, 1M req/day, 28 days: 28,000 errors
- Error rate at 14.4x burn rate, 99.9% SLO: 1.44% request failure rate
- Composite ceiling (5 services at 99.9%): ~99.5%
- Typical SLA vs SLO buffer: 0.05–0.5 percentage points
- Single-incident postmortem trigger: ≥ 20% of 28-day budget burned
- Netflix SLO relaxation (99.99% → 99.95%): users could not perceive the difference; engineering cost dropped
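The first three figures in the list above, and the postmortem trigger, follow from simple arithmetic. The sketch below reproduces them; the function names are hypothetical, and the composite ceiling assumes the five services sit in series and fail independently, which is the usual simplification behind this kind of estimate.

```python
def error_budget(slo: float, requests_per_day: int, window_days: int) -> int:
    """Allowed failed requests over the window: (1 - SLO) x total requests."""
    return round((1 - slo) * requests_per_day * window_days)

def failure_rate_at_burn(slo: float, burn_rate: float) -> float:
    """Request failure rate that corresponds to a given burn rate."""
    return burn_rate * (1 - slo)

def composite_ceiling(slo: float, n_services: int) -> float:
    """Best-case availability of n serial, independent services at the same SLO."""
    return slo ** n_services

budget = error_budget(0.999, 1_000_000, 28)
print(budget)                                      # 28000
print(budget // 5)                                 # 5600 errors = 20% of the budget
print(f"{failure_rate_at_burn(0.999, 14.4):.2%}")  # 1.44%
print(f"{composite_ceiling(0.999, 5):.2%}")        # 99.50%
```

In these terms, the 20% postmortem trigger means a single incident that produces 5,600 or more failed requests against a 28,000-error budget.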
Security and SLOs: two intersections
Intersection 1 — bot traffic skews the SLI: A successful but malicious request counts as “good” in the availability SLI: a credential-stuffing attack that returns 200 OK passes the SLO. Bot traffic inflates the denominator and can mask real user issues. For example, 1,000 successful bot requests per second on top of roughly 10 legitimate requests per second dilute a 1% error rate on the legitimate traffic to a measured error rate of about 0.01%. The senior pattern: compute SLOs over filtered traffic (drop known bots, rate-limited IPs, scanner traffic from security testing). The SLI should track real-user health, not all-traffic health.
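The dilution effect is easy to check numerically. The sketch below is illustrative only: the function name `measured_error_rate` is hypothetical, and the legitimate traffic level of 10 requests per second is an assumption chosen so that 1,000 bot requests per second produce roughly the dilution quoted above. Bot requests are assumed to succeed, as in the credential-stuffing case.

```python
def measured_error_rate(legit_rps: float,
                        legit_error_rate: float,
                        bot_rps: float,
                        bot_error_rate: float = 0.0) -> float:
    """Error rate the SLI reports when bot and legitimate traffic are mixed."""
    errors = legit_rps * legit_error_rate + bot_rps * bot_error_rate
    total = legit_rps + bot_rps
    return errors / total

# Unfiltered: bots dominate the denominator and hide the real-user error rate.
unfiltered = measured_error_rate(legit_rps=10, legit_error_rate=0.01, bot_rps=1000)
# Filtered: dropping bot traffic restores the real-user error rate.
filtered = measured_error_rate(legit_rps=10, legit_error_rate=0.01, bot_rps=0)

print(f"unfiltered: {unfiltered:.4%}  filtered: {filtered:.4%}")
```

Against a 99.9% SLO, the unfiltered figure looks comfortably within budget while real users are failing at ten times the allowed rate, which is the argument for computing the SLI over filtered traffic.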
Intersection 2 — SLO burn as a security signal: An availability drop with no infra cause — no deploys, no config changes, no upstream degradation — may be the first symptom of a DDoS or a backend exploit. Several incident-response playbooks include “check SLO burn rate” as a step in the security-incident checklist, alongside log anomaly checks and network traffic analysis. A burn-rate spike at 3 AM on a Saturday with no correlated infra event is worth a security look even before the infrastructure explanation is found.
The bigger picture
An SLO is not a number in a dashboard. It is a contract that converts product decisions into engineering arithmetic. The error budget is the bridge between “we want to ship” and “we want to be reliable.” The MWMBR alert is the bridge between “the budget is being spent” and “wake the engineer up.” The error budget policy is the bridge between the alert and the org chart.
Why this works
The reason the SLO framework outlives every monitoring tool generation is that it doesn’t depend on tools. It depends on the team having committed to one number — the SLO target — that everyone (product, engineering, operations) agrees is the truth. Prometheus gets replaced by Datadog; Datadog gets replaced by something else. The SLO survives every migration because it’s the commitment, not the infrastructure.
Why teams abandon SLOs:
- SLI doesn’t correlate with user pain → alerts are noise → team learns to ignore them
- SLO target too tight → constant freezes → product pressure overrides → policy loses teeth
- Policy never signed at director level → “advisory” SLOs that no one acts on
- No quarterly review → SLO drifts from actual user needs → wrong signal for two years
Why teams succeed with SLOs:
- Start with one journey, validate SLI against actual user reports
- Set a conservative initial target, tighten quarterly
- Run the fire drill before going live
- Get the director signature before the first freeze event (not during it)
- Hold the quarterly review with product present
A platform team rolls out SLOs to 80 services. Six months in, half the teams are ignoring SLO alerts even with MWMBR properly configured. What is the most likely root cause?
The ratio_total recording rule for a service returns NaN in Prometheus. What does this mean for SLO alerting?
- 01 What are three organizational failure modes that cause teams to abandon SLOs after initial adoption?
- 02 Describe three meta-signals that tell you whether the SLO platform itself is working correctly.
- 03 Why does the SLO framework survive tool migrations when specific monitoring tools don't?
Real SLO production failures reveal the failure modes: GitHub miscounted background jobs as user-facing events and corrupted a quarter’s data; Coinbase’s error budget policy triggered correctly and prevented a cascade; Netflix deliberately relaxed a target after user research showed the extra nine was invisible to users. Observing the SLO system itself requires three meta-signals: ratio_total must never be NaN (no traffic → silent alert failure), long-window burn rate should be stationary around 1x (drift reveals miscalibrated targets), and budget-negative events must produce freeze activations (gap reveals policy with no teeth). Security intersects SLOs in two places: bot traffic dilutes the SLI denominator, and burn-rate spikes with no infra cause may be security incidents. The SLO framework survives every tooling generation because it is a contract — product and engineering committed to one number — not a configuration. Tools migrate; contracts persist.