Observability

SLI, SLO, and the error budget: reliability by the numbers

Crux: An SLO is a promise about how reliable your service will be, and an error budget is the leftover failure allowance you can spend before that promise breaks. Together they end arguments about “how reliable is reliable enough?”

A product manager asks: “Can we ship the new checkout flow this week?” Without an SLO, the answer is a feeling. With one, it is arithmetic — the error budget either has headroom or it does not.

What the three terms actually mean

SLI (Service Level Indicator) is a ratio: the fraction of requests (or events) that were “good” out of everything that was attempted. For a request-driven service, “good” typically means: returned a successful status code, within an acceptable latency. The formula is always the same shape — good events / total events — and the result is a number between 0% and 100%.
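The good-over-total shape can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Request` record, the 300 ms latency threshold, and the choice to treat only 5xx responses as bad are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to serve the request


def is_good(req: Request, max_latency_ms: float = 300.0) -> bool:
    """Hypothetical definition of 'good': no server error, and fast enough."""
    return req.status < 500 and req.latency_ms <= max_latency_ms


def sli(requests: list[Request]) -> float:
    """Return good events / total events as a fraction between 0.0 and 1.0."""
    if not requests:
        raise ValueError("SLI is undefined with zero events")
    good = sum(1 for r in requests if is_good(r))
    return good / len(requests)


sample = [
    Request(200, 120.0),  # good
    Request(200, 450.0),  # too slow, counts as bad
    Request(503, 80.0),   # server error, counts as bad
    Request(404, 60.0),   # client error, treated as good in this sketch
]
print(f"SLI = {sli(sample):.1%}")  # prints: SLI = 50.0%
```

Whether a 4xx response counts as good is a design choice: here it does, on the assumption that the service behaved correctly and the caller made the mistake.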

SLO (Service Level Objective) is the target you commit to for that SLI. If you pick 99.9%, you are saying: at least 99.9% of requests must be good. Performance above the target is surplus. Falling below it breaches the SLO.

Error budget is what remains under the target: 1 − SLO. A 99.9% SLO leaves a 0.1% budget of failures you are allowed to produce before the SLO is breached.

| Term | Formula | Example (99.9% SLO, 30 days) |
| --- | --- | --- |
| SLI | good_events / total_events | 999,000 / 1,000,000 = 99.9% |
| SLO | target for the SLI | >= 99.9% of requests must succeed |
| Error budget | 1 − SLO | 0.1% = 1,000 failures per million, or ~43 minutes of downtime |
| Burn rate | actual_error_rate / (1 − SLO) | 1.0% error rate = 10x burn (exhausts budget in 3 days) |
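
The budget figures in the table can be reproduced with a short sketch. The function names are illustrative, and the downtime conversion assumes a total outage under evenly spread traffic, which real services rarely have.

```python
def error_budget(slo: float) -> float:
    """Fraction of events allowed to fail: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def allowed_failures(slo: float, total_events: int) -> float:
    """Budget expressed as a count of failed events."""
    return error_budget(slo) * total_events


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Budget expressed as minutes of full outage (assumes uniform traffic)."""
    return error_budget(slo) * window_days * 24 * 60


slo = 0.999
print(round(allowed_failures(slo, 1_000_000)))  # 1000
print(round(allowed_downtime_minutes(slo), 1))  # 43.2
```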

The data-plan metaphor

Think of the error budget like a monthly mobile data plan. A 99.9% SLO over 30 days has the same shape as “10 GB per month”: you start the period with a fixed allowance, you spend it as events happen, and if you run out before the period ends, you slow down — no new feature deploys — until the next billing cycle resets.

A small incident is like a 30-second video: barely a dent. A long outage is like streaming HD all weekend: at 99.9% the whole monthly allowance is only about 43 minutes, so a two-hour outage overspends it almost three times over. Teams that spend heavily ration carefully; light spenders ship freely. The budget behaves like a real resource: when it goes negative, real consequences follow.

Burn rate: the derived quantity that matters most

Burn rate normalises the error rate against the SLO target:

burn_rate = actual_error_rate / (1 − SLO)

At the 99.9% SLO (budget = 0.1% = 0.001):

  • 0.1% error rate → burn rate 1.0: you would exhaust the budget exactly by month-end
  • 1.44% error rate → burn rate 14.4: you would exhaust the 30-day budget in about 2 days

A burn rate of 1 means “sustainable”: the budget lasts exactly as long as the window. Above 1 means you are on pace to miss the SLO. Alerts and dashboards show burn rate because it is normalised against the budget, so it reads the same way regardless of SLO target or traffic volume: 14.4x means the budget is being spent 14.4 times faster than sustainable, on any service.
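
The burn-rate formula and the time-to-exhaustion figures above can be checked with a small sketch. The helper names are illustrative, and `days_to_exhaustion` assumes the burn rate stays constant and the budget starts full.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """actual_error_rate / (1 - SLO)."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


for error_rate in (0.001, 0.01, 0.0144):
    rate = burn_rate(error_rate, slo=0.999)
    print(f"{error_rate:.2%} errors -> burn {rate:.1f}x, "
          f"budget gone in {days_to_exhaustion(rate):.1f} days")
```

At a 1.44% error rate this gives a 14.4x burn and roughly 2.1 days to exhaustion, matching the “about 2 days” figure.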

A concrete scenario

A SaaS team sets a 99.5% SLO. Over a month they serve 10 million requests, so 0.5% = 50,000 errors are allowed. A config bug in week one burns 30,000 errors in 20 minutes: 60% of the monthly budget in one incident. The burn-rate alert fires, and the postmortem uses the burned budget as the severity metric. The remaining 40% (20,000 errors) has to cover the rest of the month, which is enough for the baseline error rate but leaves no room for a risky deploy. The team ships the planned feature behind a feature flag instead.
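
The scenario's bookkeeping can be sketched as a tiny ledger. The `BudgetLedger` class and the 50% headroom rule are hypothetical, invented for this illustration; a real error budget policy would set its own threshold.

```python
from dataclasses import dataclass


@dataclass
class BudgetLedger:
    slo: float              # e.g. 0.995
    expected_events: int    # requests expected over the window
    bad_events: int = 0     # errors recorded so far

    @property
    def allowed_bad_events(self) -> int:
        return round((1 - self.slo) * self.expected_events)

    def record_incident(self, bad_events: int) -> None:
        self.bad_events += bad_events

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent (negative once overspent)."""
        return 1 - self.bad_events / self.allowed_bad_events


ledger = BudgetLedger(slo=0.995, expected_events=10_000_000)
ledger.record_incident(30_000)  # the week-one config bug

# Hypothetical policy knob: require half the budget before a risky deploy.
MIN_HEADROOM_FOR_RISKY_DEPLOY = 0.5
risky_deploy_allowed = ledger.remaining_fraction >= MIN_HEADROOM_FOR_RISKY_DEPLOY

print(ledger.allowed_bad_events)            # 50000
print(f"{ledger.remaining_fraction:.0%}")   # 40%
print(risky_deploy_allowed)                 # False
```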

Why this works

Without an SLO, “should we ship?” is a political argument: whoever has more authority wins. With an SLO and an error budget, the answer is arithmetic: the budget either has headroom or it does not. This is the highest-leverage cultural shift the framework creates. It turns reliability into a language that product managers, SREs, and executives can all read.

Quiz

A service has a 99.9% availability SLO over 30 days. What is the error budget?

Quiz

A service has a 99.9% SLO and is currently running at a 1.44% error rate. What is the burn rate?

Order the steps

Order the steps of building an SLO from scratch:

  1. Identify the user journey that matters (checkout, search, login)
  2. Pick an SLI: a measurable good/total ratio (e.g. successful_requests / total_requests)
  3. Set the SLO: a target percentage for that ratio (e.g. 99.9%)
  4. Compute the error budget: 1 − SLO, over a rolling window (typically 28 days)
  5. Instrument multi-window multi-burn-rate alerts on top of the SLO
  6. Write the error budget policy: what happens if the budget is exhausted
  7. Review the SLO quarterly: tighten, loosen, or change the SLI based on real user impact
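
The multi-window alerting step can be sketched roughly as follows. This is only an illustration of the idea: it fires when both a long and a short window exceed the same burn-rate threshold, so a spike that has already recovered does not page anyone. The 14.4x threshold and the one-hour and five-minute window pairing are assumptions borrowed from commonly cited practice, not values set by this lesson, and `should_page` is a hypothetical helper.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """actual_error_rate / (1 - SLO)."""
    return error_rate / (1.0 - slo)


def should_page(
    long_window_error_rate: float,   # e.g. error rate over the last hour (assumed)
    short_window_error_rate: float,  # e.g. error rate over the last 5 minutes (assumed)
    slo: float,
    threshold: float = 14.4,         # assumed paging threshold
) -> bool:
    """Page only if the burn is both significant (long) and still ongoing (short)."""
    return (
        burn_rate(long_window_error_rate, slo) >= threshold
        and burn_rate(short_window_error_rate, slo) >= threshold
    )


# Ongoing incident: both windows are burning at about 20x.
print(should_page(0.02, 0.02, slo=0.999))    # True
# Recovered incident: the last hour looks bad, the last 5 minutes are healthy.
print(should_page(0.02, 0.0005, slo=0.999))  # False
```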
Complete the analogy

Fill in the blank: the error _______ is the failure allowance you can spend before a deployment freeze is triggered.

Recall before you leave

  1. In one sentence each, what is an SLI, an SLO, and an error budget?
  2. Why is burn rate more useful than raw error rate for alerting?
  3. What happens when the error budget reaches zero?

Recap

An SLI is the measurable ratio — good events over total events — that tells you how the service is performing from the user’s perspective. The SLO sets the target for that ratio: a 99.9% SLO means 99.9% of events must be good. The error budget is the 0.1% of failures you are allowed before the SLO is breached, converted into concrete numbers: at 1 million requests per month, that is 1,000 failures, or roughly 43 minutes of downtime. Burn rate normalises the current error rate against the budget rate — a burn rate of 14.4x means the 30-day budget would be gone in 2 days — making alerts and dashboards comparable across any service. When the budget is exhausted, the error budget policy halts feature deploys until it regenerates.
