Choosing SLIs and SLO targets: ratios, not feelings

A good SLI is a ratio in [0%, 100%] that correlates with user pain — not CPU usage or queue depth — and the right SLO target is the lowest number users tolerate, because each extra nine multiplies engineering cost by 3–10x.

A team alerts on CPU usage as their primary reliability signal. During autoscaling under healthy traffic, CPU spikes — and the pager fires. No user was ever harmed. The SLI was wrong.

What makes a good SLI

Ask yourself: if this metric looks fine, could users still be suffering? If yes, it is not a good SLI. The Google SRE workbook is firm: a good SLI is a ratio of good events to total events, in [0%, 100%], that correlates with what users feel. The properties below follow from this definition.

It must be a ratio, not a count. “Fewer than 100 errors per day” conflates traffic volume with reliability: at 1,000 req/day, 100 errors is 10% of traffic; at 1,000,000 req/day it is 0.01%. The ratio form makes SLOs comparable across traffic levels and makes the error budget directly computable: budget = (1 − SLO) × total_events.
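To make the count-versus-ratio point concrete, here is a minimal Python sketch. The function names `error_ratio` and `error_budget` are illustrative assumptions, not part of any SLO tool; the traffic figures are the two used in the paragraph above.

```python
def error_ratio(failed: int, total: int) -> float:
    """Fraction of events that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def error_budget(slo: float, total_events: int) -> float:
    """Allowed number of bad events: (1 - SLO) x total_events."""
    return (1 - slo) * total_events


# The same absolute count (100 errors per day) at two traffic levels.
for total in (1_000, 1_000_000):
    print(
        f"{total:>9} req/day: {error_ratio(100, total):.2%} errors, "
        f"budget at a 99.9% SLO = {error_budget(0.999, total):.0f} failed requests"
    )
```

The same 100 errors blow through a 99.9% budget a hundred times over at the low traffic level and use only a tenth of it at the high one, which is why the count alone says little about reliability.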

It must land in [0%, 100%]. This makes dashboards readable (fixed axis, intuitive range) and burn rate computable (error_rate / budget_rate).

It must track user pain, not machine pain. The forbidden anti-pattern is an “internal SLI”: CPU usage, queue length, GC pause duration, heap utilization. These are USE-style operational signals — they describe the machine, not the user. A server at 100% CPU may be serving every request perfectly. An SLI on CPU would fire alerts during a healthy autoscaling event while users are happy.

Signal categories by service type:

Service type	Availability SLI	Latency SLI
Request-driven API	successful_requests / total_requests	requests_under_200ms / total_requests
Data pipeline	records_processed / records_arrived	records_within_SLA / total_records
Storage	data_intact / data_stored	reads_under_threshold / total_reads

Measure at the point closest to the user (load balancer or edge) and classify each request as good or bad by its status and the latency threshold. The SLI is then good / total: every request, good or bad, counts toward the total, and only good requests count toward the numerator.
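The good/bad classification can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `Request` type, the 200 ms threshold (borrowed from the table above), the choice to treat 4xx responses as good, and the convention that an empty window reports 1.0.

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_S = 0.2  # assumed threshold, matching the 200 ms example


@dataclass
class Request:
    status: int        # HTTP status code as seen at the edge
    duration_s: float  # time to serve the request, in seconds


def is_good(req: Request) -> bool:
    """Good = not a server error AND served within the latency threshold."""
    return req.status < 500 and req.duration_s <= LATENCY_THRESHOLD_S


def compute_sli(reqs: list[Request]) -> float:
    """good / total as a fraction in [0, 1]; no traffic reports 1.0 by convention."""
    if not reqs:
        return 1.0
    good = sum(1 for r in reqs if is_good(r))
    return good / len(reqs)


sample = [
    Request(200, 0.05),  # good
    Request(200, 0.35),  # too slow: in the total, not in the numerator
    Request(503, 0.01),  # server error: in the total, not in the numerator
    Request(404, 0.02),  # client error: treated as good in this sketch
]
print(f"SLI = {compute_sli(sample):.0%}")
```

Whether client errors count as good is a policy choice the team has to make explicitly; the sketch only shows where that decision lives.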

Latency SLIs use bucket counts, not percentiles

A latency SLO (“99% of requests under 200ms”) sounds like a percentile, but it is implemented as a counter. The Prometheus histogram at the SLO threshold gives you exactly this:

# latency_sli: requests at or under the 200 ms threshold, divided by all requests, over the last hour
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[1h]))
/
sum(rate(http_request_duration_seconds_count[1h]))

No histogram_quantile required — and no estimation error that would corrupt the budget. This is why your RED-Duration histogram must have a bucket boundary exactly at the SLO threshold: without it, you cannot evaluate the SLO without approximating, and approximations contaminate the budget.
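The same idea, sketched in Python on made-up numbers: with cumulative bucket counts (each bucket counts every observation at or below its bound, as Prometheus histograms do), the SLI is a plain division when a boundary sits at the threshold. The `buckets` values and the `latency_sli` helper are hypothetical.

```python
# Cumulative counts keyed by upper bound in seconds; values are illustrative.
buckets = {0.05: 7_200, 0.1: 9_100, 0.2: 9_930, 0.5: 9_990, float("inf"): 10_000}

SLO_THRESHOLD_S = 0.2


def latency_sli(cumulative: dict[float, int], threshold: float) -> float:
    """Exact fraction of requests served at or under `threshold`.

    Raises if no bucket boundary sits at the threshold, because any answer
    would then be an interpolation rather than a count.
    """
    if threshold not in cumulative:
        raise ValueError(f"no bucket boundary at {threshold}s")
    total = cumulative[float("inf")]  # the +Inf bucket holds every observation
    return cumulative[threshold] / total


print(f"latency SLI = {latency_sli(buckets, SLO_THRESHOLD_S):.2%}")
```

Asking the same function for a 0.3 s threshold fails rather than guessing, which mirrors the point above: without a boundary at the threshold, the only options are approximation or re-instrumenting.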

Choosing the SLO target

The SLO target is a business decision, not an engineering one. It answers: “what is the minimum reliability users will accept before they notice and complain?” Engineering then asks: “what is the cheapest architecture that delivers that?”

SLO targets and what each nine costs

99% SLO, 30 days: 7.2 hours allowed downtime
99.9% SLO, 30 days: 43.2 minutes
99.95% SLO, 30 days: 21.6 minutes
99.99% SLO, 30 days: 4.3 minutes
99.999% SLO, 30 days: ~26 seconds
Engineering cost jump per nine: 3–10x

The pattern:

99% → 99.9%: add monitoring and basic alerting
99.9% → 99.99%: add multi-region with automated failover
99.99% → 99.999%: add N+2 redundancy, chaos engineering, 24/7 on-call that wakes within minutes

Each extra nine cuts allowed downtime ~10x and demands a whole new architecture tier — which is why every nine costs 3–10x more in engineering, infra, and on-call.
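The downtime figures above follow from one multiplication. A minimal sketch, assuming a time-based availability SLO and a 30-day window; `allowed_downtime` is an illustrative name:

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime budget for a time-based availability SLO: (1 - SLO) x window."""
    return window * (1 - slo)


for slo in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    minutes = allowed_downtime(slo).total_seconds() / 60
    print(f"{slo:.3%}: {minutes:8.2f} minutes")
```

Passing `timedelta(days=28)` as the window gives the slightly smaller budgets that apply to a four-week rolling window.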

The error budget arithmetic in full

Once you have the SLO, the budget becomes concrete:

error_budget = (1 − SLO) × total_events_in_window

For a 99.9% SLO over 28 days at 1 million requests per day:

Total events = 28,000,000
Budget = 0.001 × 28,000,000 = 28,000 failed requests

As failures accumulate: budget_remaining = budget_total − failures_so_far

Burn rate at any moment: burn_rate = current_error_rate / (1 − SLO)

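At a 0.1% error rate and a 99.9% SLO: burn rate = 0.001 / 0.001 = 1x (sustainable: the budget lasts exactly the length of the window). At a 1.44% error rate: burn rate = 0.0144 / 0.001 = 14.4x (a full 28-day budget is gone in about 2 days, since 28 / 14.4 ≈ 1.9).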
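The arithmetic in this section fits in a short Python sketch. It assumes steady traffic of 1 million requests per day, as in the example; the helper names and the 10,000-failure figure are hypothetical.

```python
SLO = 0.999
WINDOW_DAYS = 28
REQUESTS_PER_DAY = 1_000_000  # assumed constant for simplicity

# (1 - SLO) x total events in the window: about 28,000 failed requests.
budget_total = (1 - SLO) * WINDOW_DAYS * REQUESTS_PER_DAY


def budget_remaining(failures_so_far: int) -> float:
    """Failed requests the service can still afford in this window."""
    return budget_total - failures_so_far


def burn_rate(current_error_rate: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return current_error_rate / (1 - SLO)


def days_to_exhaustion(current_error_rate: float) -> float:
    """Days until a full budget is gone if this error rate holds steady."""
    return WINDOW_DAYS / burn_rate(current_error_rate)


print(f"remaining after 10,000 failures: {budget_remaining(10_000):.0f}")
for rate in (0.001, 0.0144):
    print(
        f"error rate {rate:.2%}: burn {burn_rate(rate):.1f}x, "
        f"full budget lasts {days_to_exhaustion(rate):.1f} days"
    )
```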

Why 28 days, not “this calendar month”

Calendar months vary (28–31 days) and create cliff effects: a bad week at the end of February impacts the budget differently than the same bad week at the end of March. A 28-day rolling window (four whole weeks) solves both problems: it always covers the same length, and it includes complete weekday/weekend cycles, so traffic patterns normalise. Rolling windows are supported by the common SLO tools (Datadog, Nobl9, Sloth, Pyrra, Google Cloud SLOs), but default window lengths differ between them and some default to 30 days, so set the window explicitly rather than relying on the default.
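A rolling window can be sketched as a fixed-length queue of daily counts: each new day pushes the oldest one out, so a bad day ages out gradually instead of vanishing at a month boundary. The `RollingSLI` class and all traffic numbers below are made up for illustration.

```python
from collections import deque

WINDOW_DAYS = 28  # four whole weeks


class RollingSLI:
    """Keeps the most recent `window_days` of (good, total) daily counts."""

    def __init__(self, window_days: int = WINDOW_DAYS) -> None:
        # deque(maxlen=...) drops the oldest entry automatically on append.
        self.days = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def value(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else 1.0


tracker = RollingSLI()
for day in range(35):  # five weeks of made-up traffic
    total = 400_000 if day % 7 >= 5 else 1_000_000  # quieter weekends
    bad = 20_000 if day == 3 else 500               # one bad day in week one
    tracker.record_day(total - bad, total)

print(f"{len(tracker.days)} days in window, SLI = {tracker.value():.4%}")
```

After 35 days the window holds only days 7 to 34, so the bad day from the first week no longer affects the SLI, and the window always contains exactly four weekends.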

Why this works

The rolling window is also why SLO targets should start loose and tighten quarterly, not be set tightly on day one. A conservative first SLO gives the team time to see what the real baseline error rate is, understand the traffic pattern, and instrument the counters correctly before they are held to a number that is either unmeetable or trivially easy.

Quiz

A 99.9% availability SLO over 28 days serves 1M requests per day. The team is 14 days in and has had 6,000 failed requests. How is the budget?

Quiz

Why is a 28-day rolling window preferred over a calendar month?

Recall before you leave

01 Why should an SLI be expressed as a ratio rather than an absolute count or a machine metric?
02 Why does a latency SLI require a histogram bucket boundary exactly at the SLO threshold?
03 How do you decide which SLO target to pick?

Recap

A good SLI is a ratio of good events to total events — always in [0%, 100%], always tracking what users experience, never a machine metric like CPU or queue length. Latency SLIs use histogram bucket counts at the SLO threshold, not histogram_quantile estimates, so the budget arithmetic is exact. The SLO target is a business decision: pick the lowest reliability users tolerate, because each additional nine multiplies engineering cost 3–10x. Use a 28-day rolling window to normalise traffic patterns and avoid month-boundary cliff effects. Budget = (1 − SLO) × total_events; burn rate = current_error_rate / (1 − SLO). Start loose, tighten quarterly. Now when you look at a dashboard metric and ask “is this a good SLI?” — check whether it is a ratio, whether it tracks user pain, and whether it lives in [0%, 100%].
