Observability OBS · 05 · 01

SLI, SLO, and the error budget: reliability by the numbers

An SLO is a promise about how reliable your service will be, and an error budget is the leftover failure allowance you can spend before that promise breaks. Together they end arguments about “how reliable is reliable enough?”

OBS Junior ◷ 12 min


A product manager asks: “Can we ship the new checkout flow this week?” Without an SLO, the answer is a feeling. With one, it is arithmetic — the error budget either has headroom or it does not.

What the three terms actually mean

Before you can answer “should we ship?”, you need three numbers. Here is what they are and how they connect.

SLI (Service Level Indicator) is a ratio: the fraction of requests (or events) that were “good” out of everything that was attempted. For a request-driven service, “good” typically means: returned a successful status code, within an acceptable latency. The formula is always the same shape — good events / total events — and the result is a number between 0% and 100%.
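As a minimal sketch of the good / total ratio, the snippet below classifies a handful of requests and computes the SLI. The `Request` type, the 2xx/3xx success rule, and the 300 ms latency threshold are illustrative assumptions, not values from this lesson; pick the definition of “good” that matches your own user journey.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical record shape)
    latency_ms: float  # time to respond, in milliseconds


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    """Good = successful status AND fast enough (both rules are assumptions)."""
    return 200 <= req.status < 400 and req.latency_ms <= latency_threshold_ms


def sli(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Return good events / total events as a fraction between 0 and 1."""
    if not requests:
        raise ValueError("SLI is undefined when there are no events")
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),  # successful but too slow: counts as bad
    Request(503, 80.0),   # server error: counts as bad
    Request(200, 95.0),
]
print(f"SLI = {sli(sample):.1%}")  # 2 good out of 4 requests
```

Note that a slow success counts against the SLI just like an error does: the ratio measures what the user experienced, not what the server logged as successful.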

SLO (Service Level Objective) is the target you commit to for that SLI. If you pick 99.9%, you are saying: at least 99.9% of requests must be good. Performance above the target is surplus. Performance below it means the SLO is breached.

Error budget is the gap between the target and 100%: 1 − SLO. A 99.9% SLO leaves a 0.1% budget of failures you are allowed to produce before the SLO is breached.

Term	Formula	Example (99.9% SLO, 30 days)
SLI	good_events / total_events	999,000 / 1,000,000 = 99.9%
SLO	target for the SLI	>= 99.9% of requests must succeed
Error budget	1 − SLO	0.1% = 1,000 failures per million, or ~43 minutes of downtime
Burn rate	actual_error_rate / (1 − SLO)	1.0% error rate = 10x burn (exhausts budget in 3 days)

Figure: the SLI fills the bar up to the SLO line (99.9%). Everything from the SLO line to 100% is the error budget: 1 − SLO = 0.1% ≈ 43 min per 30 days. The budget slice is enlarged for legibility.
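The conversions in the table can be checked with a few lines of arithmetic. This is a sketch with hypothetical function names; the downtime figure assumes that “downtime” means every request fails for that long.

```python
def error_budget(slo: float) -> float:
    """Fraction of events allowed to fail: 1 - SLO."""
    return 1.0 - slo


def allowed_failures(slo: float, total_events: int) -> float:
    """Budget expressed as a count of failed events."""
    return error_budget(slo) * total_events


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Budget expressed as minutes of total outage within the window."""
    return error_budget(slo) * window_days * 24 * 60


slo = 0.999
print(round(allowed_failures(slo, 1_000_000)))  # 1000 failures per million
print(round(allowed_downtime_minutes(slo), 1))  # 43.2 minutes per 30 days
```

The same budget looks different depending on the unit: 0.1% is easy to dismiss, while 43 minutes per month is something a team can plan around.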

The data-plan metaphor

Think of the error budget like a monthly mobile data plan. A 99.9% SLO over 30 days has the same shape as “10 GB per month”: you start the period with a fixed allowance, you spend it as events happen, and if you run out before the period ends, you slow down — no new feature deploys — until the next billing cycle resets.

A small incident is like a 30-second video: barely a dent. A two-hour outage is like streaming HD all weekend: at 99.9%, it overruns the roughly 43-minute monthly allowance almost three times over. Teams that are heavy spenders ration carefully; light spenders ship freely. Unlike the data plan, the budget is not just a metaphor: it is a real resource, and when it goes negative, real consequences follow.

Burn rate: the derived quantity that matters most

Burn rate normalises the error rate against the SLO target:

burn_rate = actual_error_rate / (1 − SLO)

At the 99.9% SLO (budget = 0.1% = 0.001):

0.1% error rate → burn rate 1.0: you would exhaust the budget exactly by month-end
1.44% error rate → burn rate 14.4: you would exhaust the 30-day budget in about 2 days

A burn rate of 1 means “sustainable”: you spend exactly the whole budget over the window. Above 1 means you are on pace to miss the SLO. Alerts and dashboards show burn rate because it is normalised against the SLO target and independent of traffic volume, so 14.4x means the same thing on any service: the budget will be gone in 1/14.4 of the window.

The runway shrinks in inverse proportion to the burn rate: time to exhaustion = window / burn rate. Raising the error rate from 0.1% to 1.44%, a change of barely 1.3 percentage points, collapses a month of headroom into about two days. That cliff is why burn rate, not raw error rate, drives the alert.
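The burn-rate formula and the runway it implies can be sketched as follows. The function names are hypothetical, and the runway calculation assumes the error rate stays constant for the rest of the window.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """actual_error_rate / (1 - SLO)."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return error_rate / budget


def days_to_exhaustion(error_rate: float, slo: float, window_days: int = 30) -> float:
    """Days until a full budget is gone, assuming a constant error rate."""
    rate = burn_rate(error_rate, slo)
    if rate == 0:
        return float("inf")  # no errors, so the budget is never spent
    return window_days / rate


for error_rate in (0.001, 0.01, 0.0144):
    rate = burn_rate(error_rate, 0.999)
    days = days_to_exhaustion(error_rate, 0.999)
    print(f"{error_rate:.2%} errors -> burn rate {rate:.1f}, budget lasts {days:.1f} days")
```

The three rates in the loop are the ones used in this lesson: 0.1% (burn rate 1), 1.0% (burn rate 10), and 1.44% (burn rate 14.4).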

A concrete scenario

A SaaS team sets a 99.5% SLO. Over a month they serve 10 million requests; 0.5% = 50,000 errors allowed. A config bug in week one burns 30,000 errors in 20 minutes — 60% of the monthly budget in one incident. The burn-rate alert fires; the postmortem uses the burned budget as the severity metric. The remaining 40% covers the next 20 days only at the baseline error rate — no room for a risky deploy. The team ships the planned feature behind a feature flag instead.
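The scenario's arithmetic can be reproduced in a short sketch. The SLO, request volume, incident size, and the 20 remaining days come from the scenario; the function name and the assumption that traffic is spread evenly across a 30-day month are illustrative.

```python
def budget_status(slo: float, total_requests: int, errors_so_far: int):
    """Return (allowed errors, fraction consumed, errors remaining)."""
    allowed = (1.0 - slo) * total_requests
    consumed = errors_so_far / allowed
    return allowed, consumed, allowed - errors_so_far


allowed, consumed, remaining = budget_status(
    slo=0.995, total_requests=10_000_000, errors_so_far=30_000
)
print(f"allowed: {allowed:,.0f} errors")      # 50,000
print(f"consumed: {consumed:.0%}")            # 60%
print(f"remaining: {remaining:,.0f} errors")  # 20,000

# Assumption: traffic is uniform, so each day sees 1/30 of the monthly volume.
days_left = 20
requests_per_day = 10_000_000 / 30
max_error_rate = remaining / (days_left * requests_per_day)
print(f"sustainable error rate for the rest of the month: {max_error_rate:.2%}")
```

Under these assumptions the team can tolerate about a 0.30% error rate for the remaining 20 days, well below the 0.5% the SLO would normally allow, which is why a risky deploy no longer fits.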

Why this works

Without an SLO, “should we ship?” is a political argument, and whoever has more authority wins. With an SLO and an error budget, the answer is arithmetic: the budget either has headroom or it does not. This is the highest-leverage cultural shift the framework creates. It turns reliability into a language that product managers, SREs, and executives can all read.

Quiz

A service has a 99.9% availability SLO over 30 days. What is the error budget?

Quiz

A service has a 99.9% SLO and is currently running at a 1.44% error rate. What is the burn rate?

Order the steps

Order the steps of building an SLO from scratch:

1 Identify the user journey that matters (checkout, search, login)
2 Pick an SLI: a measurable good/total ratio (e.g. successful_requests / total_requests)
3 Set the SLO: a target percentage for that ratio (e.g. 99.9%)
4 Compute the error budget: 1 − SLO, over a rolling window (typically 28 days)
5 Instrument multi-window multi-burn-rate alerts on top of the SLO
6 Write the error budget policy: what happens if the budget is exhausted
7 Review the SLO quarterly: tighten, loosen, or change the SLI based on real user impact
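Step 5 can be sketched as a rule table plus a check that a rule fires only when both its long and its short window exceed the threshold. The window pairs, thresholds, and severities below are illustrative assumptions patterned on values often quoted for a 30-day window; they are not prescribed by this lesson and should be tuned to your own SLO window. The `burn_rates` dictionary stands in for whatever your monitoring system reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # catches sustained burn
    short_window: str  # confirms the burn is still happening
    threshold: float   # burn rate both windows must exceed
    severity: str


# Illustrative values only (assumed, for a 30-day SLO window).
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def firing_rules(burn_rates: dict[str, float], rules=RULES) -> list[BurnRateRule]:
    """Return the rules whose long AND short windows both exceed the threshold."""
    return [
        r for r in rules
        if burn_rates[r.long_window] > r.threshold
        and burn_rates[r.short_window] > r.threshold
    ]


# Hypothetical readings during an ongoing fast burn.
ongoing = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 4.0, "3d": 0.8}
print([f"{r.long_window}/{r.short_window}" for r in firing_rules(ongoing)])

# Same incident after recovery: the 1h window is still hot, the 5m window is not.
recovered = {"5m": 0.2, "30m": 0.5, "1h": 16.0, "6h": 4.0, "3d": 0.8}
print(firing_rules(recovered))
```

Requiring both windows is the design choice that matters here: the long window filters out brief blips, and the short window stops the alert from firing long after the problem has been fixed.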

Complete the analogy

Fill in the blank: the error _______ is the failure allowance you can spend before a deployment freeze is triggered.

Recall before you leave

01 In one sentence each, what is an SLI, an SLO, and an error budget?
02 Why is burn rate more useful than raw error rate for alerting?
03 What happens when the error budget reaches zero?

Recap

An SLI is the measurable ratio of good events over total events, and it tells you how the service is performing from the user’s perspective. The SLO sets the target for that ratio: a 99.9% SLO means at least 99.9% of events must be good. The error budget is the 0.1% of failures you are allowed before the SLO is breached, converted into concrete numbers: at 1 million requests per month, that is 1,000 failures, or roughly 43 minutes of full downtime. Burn rate divides the current error rate by the budget (1 − SLO), so a burn rate of 14.4x means the 30-day budget would be gone in about 2 days, and alerts and dashboards become comparable across services. When the budget is exhausted, the error budget policy halts feature deploys until the budget regenerates. Now when someone asks “can we ship this week?”, you have a number, not a feeling.



