Observability
Low-traffic SLOs and burn-rate math from first principles
A service at 100 requests per hour fires a 100% error-rate alert. The on-call wakes up. One request failed. The SLO math is technically correct — and completely useless.
Edge case: low-traffic services break naive SLOs
A service at 100 requests per hour faces a structural problem with 99.9% SLO arithmetic.
The math: 99.9% SLO → budget = 0.1% → 1 allowed failure per 1000 requests. At 100 req/h, 1000 requests takes 10 hours. A single failure in a 1-hour alerting window is 1/100 = 1.0% error rate, ten times the 0.1% budget rate, even though one failed request is not necessarily a crisis.
The alert fires, the on-call investigates, the incident was a timeout that resolved itself. This cycle destroys trust in SLO alerting on low-traffic surfaces.
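The arithmetic above can be checked in a few lines of Python. This is a minimal sketch that assumes perfectly uniform traffic; the function names are illustrative and do not come from any SLO library.

```python
def budget_multiple(failures: int, requests: int, slo_target: float) -> float:
    """How many times the observed error rate exceeds the SLO's budget rate."""
    budget_rate = 1.0 - slo_target        # 99.9% SLO -> 0.1% budget
    error_rate = failures / requests
    return error_rate / budget_rate


def hours_per_allowed_failure(requests_per_hour: float, slo_target: float) -> float:
    """Hours of traffic needed before the budget permits one failure."""
    requests_per_failure = 1.0 / (1.0 - slo_target)   # 1000 requests at 99.9%
    return requests_per_failure / requests_per_hour


# One failure among the 100 requests seen in a 1-hour window, 99.9% SLO.
print(round(budget_multiple(1, 100, 0.999), 2))          # 10.0
print(round(hours_per_allowed_failure(100, 0.999), 2))   # 10.0
```

The two results describe the same mismatch from both sides: the budget allows one failure per 10 hours of traffic, while the alert judges each hour on its own.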
Solutions:
Synthetic traffic (preferred): Generate consistent probe requests at a fixed rate so the denominator never drops too low. A probe every 10 seconds = 360 requests/hour, giving the SLO arithmetic a stable base. The probe covers availability (was the endpoint reachable?) but not full user-journey correctness, so combine it with a real-traffic SLO for critical paths.
Aggregation across related services: Treat 10 small internal services as one SLO target. The combined traffic is 10× larger, making the denominator stable. Works when the services are logically equivalent (e.g. 10 shard handlers for the same data type).
Longer evaluation windows: Use a 24h or 7d window for alerting evaluation rather than 1h. A single failure over 7 days at 100 req/h is 1 failure in 16,800 requests = 0.006% error rate, well below the 0.1% budget. The tradeoff: slower incident detection (a 7-day window takes days to accumulate meaningful signal for a new degradation).
Hybrid: event-based alerting: For very low-traffic surfaces, compute the SLO for reporting but alert on raw counts rather than rates. “More than 5 failures in 30 minutes” is actionable; “error rate > 0.1%” may not be.
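The count-based rule "more than 5 failures in 30 minutes" can be sketched as a sliding-window counter. This is an illustration only: the class name `FailureCountAlert` and its methods are hypothetical, and a real alerting platform would express the same rule in its own query language.

```python
from collections import deque
from typing import Deque


class FailureCountAlert:
    """Fires when more than `max_failures` failures fall inside a sliding window."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 30 * 60) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures: Deque[float] = deque()   # failure timestamps, oldest first

    def record_failure(self, timestamp: float) -> None:
        # Timestamps are assumed to arrive in non-decreasing order.
        self._failures.append(timestamp)

    def is_firing(self, now: float) -> bool:
        # Drop failures that have aged out of the window, then compare the count.
        cutoff = now - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) > self.max_failures


alert = FailureCountAlert()
for t in (0, 60, 120, 180, 240):
    alert.record_failure(t)
print(alert.is_firing(now=240))    # False: 5 failures is not "more than 5"
alert.record_failure(300)
print(alert.is_firing(now=300))    # True: 6 failures within 30 minutes
print(alert.is_firing(now=1800))   # False: the failure at t=0 has aged out
```

The rule never divides by request volume, so it behaves the same at 50 requests per hour as at 5,000.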
Google’s workbook and platforms like Nobl9, Pyrra, and Sloth all ship explicit low-traffic accommodations. The pattern is not an edge case: any service that isn’t customer-critical will have low-traffic surfaces.
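To compare the longer-window and synthetic-traffic options numerically, the sketch below computes the error rate that a single failure produces for different evaluation windows, with and without probes. It assumes uniform traffic and a 99.9% SLO; `single_failure_error_rate` is a hypothetical helper, not a platform API.

```python
BUDGET_RATE = 0.001   # 99.9% SLO


def single_failure_error_rate(real_requests_per_hour: float,
                              window_hours: float,
                              probe_interval_s: float = 0.0) -> float:
    """Error rate produced by exactly one failure in the evaluation window."""
    # A probe every N seconds adds 3600 / N requests per hour to the denominator.
    probes_per_hour = 3600.0 / probe_interval_s if probe_interval_s > 0 else 0.0
    total_requests = (real_requests_per_hour + probes_per_hour) * window_hours
    return 1.0 / total_requests


for label, hours in [("1h", 1), ("24h", 24), ("7d", 168)]:
    rate = single_failure_error_rate(100, hours)
    print(f"{label:>3}: {rate:.4%}  over budget: {rate > BUDGET_RATE}")

# Same 1h window with a probe every 10 seconds (360 extra requests per hour).
print(f"1h + probes: {single_failure_error_rate(100, 1, probe_interval_s=10):.4%}")
```

At 100 req/h, one failure reads as 1.0000% over 1h, 0.0417% over 24h, and 0.0060% over 7d. Adding the 10-second probe raises the 1h denominator from 100 to 460 requests.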
The six alerting approaches: why five fail
The Google SRE Workbook classifies six approaches to SLO alerting. Understanding why five fail is what makes Approach 6 legible — not as a recommendation to memorize but as the result of systematically eliminating failure modes.
Approach 1 — alert when error rate > SLO threshold:
error_rate > (1 − SLO)
Simple and wrong. The SLO threshold is the budget rate — designed to be exceeded by normal variation. At 99.9% SLO, any spike above 0.1% fires the alert. A single bad request batch triggers the page. Teams learn to ignore the pager within days — the classic alert fatigue death spiral. Any 5-minute window with even slightly elevated errors is a page.
Approach 2 — require N consecutive minutes above threshold:
for: 5m added to Approach 1.
Better: it eliminates brief spikes. But detection delay scales with the duration: a sustained outage cannot page until it has lasted the full 5 minutes. Reset is still fast, so recovery works. The threshold itself is still wrong, though (budget rate, not burn rate), so sensitivity is still too high.
Approach 3 — single burn rate over a longer window (e.g. 14.4x over 1h):
(1 - ratio_rate1h) > (14.4 * 0.001)
Catches real outages. Burns 2% of the budget before firing. But: after the incident resolves at 12:00, the 1-hour window still contains data from 11:00–12:00. The alert keeps firing until ~12:55. The on-call can’t tell if their fix worked. Long reset time.
Approach 4 — dual burn rate, OR logic (either long OR short fires):
expr: A or B
Addresses reset time by adding a short window. But OR means either window alone is sufficient to trigger. The short window fires on transient spikes (noise). The long window fires late after resolution. OR gives the worst of both.
Approach 5 — dual severity (page at 14.4x, ticket at 1x): Two separate single-window alerts at different thresholds. Better than one threshold, but each alert still uses a single window. The page has the long-reset problem. The ticket is an improvement (it catches slow burns) but is still noisy.
Approach 6 — MWMBR: long AND short, per severity level:
page: (burn_1h > 14.4x) AND (burn_5m > 14.4x)
page: (burn_6h > 6x) AND (burn_30m > 6x)
ticket: (burn_3d > 1x) AND (burn_6h > 1x)
The AND between windows eliminates both failure modes: the long window rejects noise (a brief spike doesn’t sustain 14.4x burn for 1 hour), and the short window enables fast reset (clears within 5 minutes of fix because the short window’s burn rate drops below threshold). Approach 6 is the only one that balances detection latency, noise resistance, and recovery latency.
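The difference in reset behavior can be reproduced with a small simulation. The sketch below is built on assumptions: a hypothetical timeline with one error-rate sample per minute, constant traffic, a 99.9% SLO, and a 30-minute incident at a 10% error rate. It evaluates the 1h-only rule (Approach 3), the OR rule (Approach 4), and the AND rule (Approach 6) at every minute.

```python
from typing import Callable, Dict, List, Tuple

BUDGET_RATE = 0.001   # 99.9% SLO
THRESHOLD = 14.4      # page-tier burn rate
FIX_MINUTE = 90       # first healthy minute after the incident

# Hypothetical timeline: minutes 60-89 run at a 10% error rate, the rest at 0%.
ERROR_RATES: List[float] = [0.10 if 60 <= m < FIX_MINUTE else 0.0 for m in range(180)]


def burn(minute: int, window: int) -> float:
    """Burn rate over the `window` minutes ending at `minute` (inclusive)."""
    start = max(0, minute - window + 1)
    sample = ERROR_RATES[start:minute + 1]
    return (sum(sample) / len(sample)) / BUDGET_RATE


def long_hot(m: int) -> bool:
    return burn(m, 60) > THRESHOLD


def short_hot(m: int) -> bool:
    return burn(m, 5) > THRESHOLD


RULES: Dict[str, Callable[[int], bool]] = {
    "approach 3 (1h only)": long_hot,
    "approach 4 (1h OR 5m)": lambda m: long_hot(m) or short_hot(m),
    "approach 6 (1h AND 5m)": lambda m: long_hot(m) and short_hot(m),
}


def firing_span(rule: Callable[[int], bool]) -> Tuple[int, int]:
    """First and last minute at which the rule evaluates to true."""
    firing = [m for m in range(len(ERROR_RATES)) if rule(m)]
    return firing[0], firing[-1]


for name, rule in RULES.items():
    first, last = firing_span(rule)
    print(f"{name}: fires at minute {first}, clears {last + 1 - FIX_MINUTE} min after the fix")
```

On this timeline the 1h-only rule and the OR rule both keep firing for 51 minutes after the fix, while the AND rule clears 4 minutes after it. The OR rule also fires on the very first bad minute, which is the sensitivity to short spikes described under Approach 4.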
Why this works
Engineers re-deriving SLO alerting from scratch reliably re-invent Approach 1 or 3. The SRE Workbook chapter exists explicitly to spare them that detour. Knowing that Approach 6 is the result of eliminating five failure modes, not an arbitrary configuration, is what makes it robust to challenge: when someone suggests “just lower the threshold” (Approach 1 variant) or “use OR not AND” (Approach 4), you have the systematic answer.
Burn-rate math: derivation from first principles
Every canonical MWMBR threshold derives from one equation:
burn_rate = (budget_fraction × period) / window
Where:
budget_fraction = fraction of the total budget this alert tier should consume if sustained
period = SLO window (e.g. 30 days = 720 hours)
window = the long alert window (e.g. 1h, 6h, 3d)
Deriving 14.4x (the 1h+5m page):
Goal: page when an incident, if sustained, would consume 2% of the budget in 1 hour.
burn_rate = (0.02 × 720h) / 1h = 14.4
Deriving 6x (the 6h+30m page):
Goal: page when sustained burn would consume 5% in 6 hours.
burn_rate = (0.05 × 720h) / 6h = 6.0
Deriving 1x (the 3d+6h ticket):
Goal: ticket when sustained burn would consume 10% in 3 days.
burn_rate = (0.10 × 720h) / 72h = 1.0
The 5-minute and 30-minute short windows use the same burn rate as their paired long window — they don’t change the threshold, only the “is this still happening?” check.
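The three derivations can be reproduced with one function, plus its inverse for checking how much budget a given burn rate consumes. This is a minimal sketch; the function names are illustrative.

```python
def burn_rate_threshold(budget_fraction: float, period_hours: float, window_hours: float) -> float:
    """burn_rate = (budget_fraction x period) / window"""
    return budget_fraction * period_hours / window_hours


def budget_consumed(burn_rate: float, period_hours: float, window_hours: float) -> float:
    """Inverse: fraction of the budget spent if `burn_rate` is sustained for `window_hours`."""
    return burn_rate * window_hours / period_hours


PERIOD_HOURS = 30 * 24   # 30-day SLO window = 720 hours

for name, fraction, window_hours in [
    ("1h page", 0.02, 1),
    ("6h page", 0.05, 6),
    ("3d ticket", 0.10, 72),
]:
    threshold = burn_rate_threshold(fraction, PERIOD_HOURS, window_hours)
    print(f"{name}: {threshold:.1f}x")
```

Running it prints 14.4x, 6.0x, and 1.0x for the three tiers. `budget_consumed(14.4, 720, 1)` returns approximately 0.02, which confirms the 2% figure from the other direction.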
- 14.4x derives from: (2% × 720h) / 1h
- 6x derives from: (5% × 720h) / 6h
- 1x derives from: (10% × 720h) / 72h
- For 28-day window (672h), 14.4x becomes: (2% × 672h) / 1h = 13.44x
- Error rate at 14.4x burn (99.9% SLO): 14.4 × 0.001 = 1.44%
- Alert reset with 5m short window: < 5 minutes after fix
Recomputing for a 28-day window: The Google Workbook examples use 30 days (720h). Most platforms default to 28 days (672h). The thresholds shift slightly: (0.02 × 672) / 1 = 13.44x. Most teams round to 14x for the 28-day window. Any team can recompute — the formula is the same.
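A sketch of that recomputation for all three tiers at once. It assumes the budget fractions (2%, 5%, 10%) and long windows (1h, 6h, 3d) stay as given in the text and only the SLO period changes; the tier names are made up for illustration.

```python
from typing import Dict, Tuple

# (budget fraction, long window in hours) for the three tiers described in the text.
TIERS: Dict[str, Tuple[float, float]] = {
    "page_1h": (0.02, 1),
    "page_6h": (0.05, 6),
    "ticket_3d": (0.10, 72),
}


def thresholds_for(period_days: int) -> Dict[str, float]:
    """Burn-rate threshold per tier for an SLO window of `period_days` days."""
    period_hours = period_days * 24
    return {
        name: round(fraction * period_hours / window_hours, 2)
        for name, (fraction, window_hours) in TIERS.items()
    }


print(thresholds_for(30))   # {'page_1h': 14.4, 'page_6h': 6.0, 'ticket_3d': 1.0}
print(thresholds_for(28))   # {'page_1h': 13.44, 'page_6h': 5.6, 'ticket_3d': 0.93}
```

The 5.6x and 0.93x values for the other two tiers are not quoted in the text; they follow from applying the same formula to 672 hours.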
Recomputing for a different SLO target: A 99.5% SLO has budget = 0.5%. The error rate at 14.4x burn: 14.4 × 0.005 = 7.2%. The burn-rate threshold (14.4x) doesn’t change; the error rate threshold changes because the budget rate changes. The PromQL expression (1 - ratio_rate1h) > (14.4 * 0.001) must be updated to (14.4 * 0.005) for a 99.5% SLO. This is the most common hand-rolling bug: copying a 99.9% alert for a 99.5% SLO and forgetting to update 0.001 → 0.005.
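One way to avoid the copy-paste bug is to generate the expression from the SLO target instead of hard-coding the budget rate. The sketch below is a hypothetical helper rather than part of any tool; it reuses the `ratio_rate1h` name from the expression above and only builds a string.

```python
def error_rate_threshold(burn_rate: float, slo_target: float) -> float:
    """Error rate that corresponds to `burn_rate` for the given SLO target."""
    return burn_rate * (1.0 - slo_target)


def alert_expr(burn_rate: float, slo_target: float, ratio_metric: str = "ratio_rate1h") -> str:
    """Render the alert condition with the budget rate derived from the SLO target."""
    budget_rate = round(1.0 - slo_target, 6)   # rounding hides float noise such as 0.0010000000000000009
    return f"(1 - {ratio_metric}) > ({burn_rate} * {budget_rate})"


print(alert_expr(14.4, 0.999))   # (1 - ratio_rate1h) > (14.4 * 0.001)
print(alert_expr(14.4, 0.995))   # (1 - ratio_rate1h) > (14.4 * 0.005)
print(f"{error_rate_threshold(14.4, 0.995):.1%}")   # 7.2%
```

With the SLO target as the single input, changing 99.9% to 99.5% updates the budget rate everywhere it appears.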
A team has a 28-day SLO window (672 hours). They want a page-grade alert that fires when sustained burn would consume 2% of the budget in 1 hour. What burn-rate threshold should they use?
Approach 4 (OR logic between long and short windows) was proposed to fix the reset-time problem of Approach 3. Why does it fail?
- 01 Walk through the full derivation of why the 1h+5m page alert uses 14.4x burn rate and consumes 2% of the budget.
- 02 A service runs at 50 requests per hour. How would you structure its SLO to avoid meaningless alerts?
- 03 Why does Approach 3 (single-window burn-rate alert) fail after the incident resolves, and how does Approach 6 fix it?
Low-traffic services need special handling because a single failure dominates the error rate when the denominator is small — synthetic probes, aggregation across related services, longer evaluation windows, or event-based alerting all stabilize the SLO signal. The Google SRE Workbook’s six alerting approaches show why five fail: Approaches 1–2 are too sensitive to noise, Approach 3 has a long reset time, Approach 4 (OR) combines the noise problem with the reset problem, Approach 5 is closer but still single-window. Only Approach 6 (MWMBR with AND) handles both noise and recovery correctly. Every burn-rate threshold is derivable from (budget_fraction × period) / window — 14.4x from 2%×720h/1h, 6x from 5%×720h/6h, 1x from 10%×720h/72h. Teams can recompute for any SLO period or target; the formula is the same.