Error budget policy, latency SLOs, and composite journeys

A signed error budget policy turns alert burns into organisational action. Latency SLOs require histogram bucket boundaries at the threshold. Composite journeys multiply SLO failures — 5 services at 99.9% give a ~99.5% journey ceiling.

Every service in the checkout path shows 99.9% availability on its own dashboard. The user-facing checkout success rate is 99.5%. Both numbers are correct — the team is monitoring the wrong layer.

The error budget policy: more than a rule

You have the alerts wired up. Now ask: when the page fires and the budget hits zero, what actually happens? Without a written, signed answer, the team improvises, and usually ships anyway. An error budget policy is a written, signed document that specifies what happens when the budget is exhausted. Without the signatures, it is a suggestion that gets overridden during release crunches. With a director-level signer, it is a covenant.

Mandatory contents:

Deployment freeze trigger: “when budget reaches 0, halt feature deploys except P0 fixes and security patches”
Postmortem trigger: “if a single incident burns >20% of the budget, a postmortem is mandatory”
Freeze exit: “focus on reliability work until budget regenerates above 50% remaining”
Escalation path: “if the freeze lasts >2 weeks, escalate to VP/CTO”

Out-of-scope exclusions (events that do not burn the budget):

Load test and penetration test requests (tagged by header or IP range)
Known bot and scanner traffic
Requests during planned maintenance windows
Requests rejected by rate limiting at the edge

These exclusion categories must be explicit in the policy document and enforced as label filters in the SLO platform — otherwise every postmortem becomes an argument about whether the incident “should count.”
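
As a rough illustration of exclusions enforced as label filters, the sketch below drops out-of-scope events before the good/total ratio is computed. The label names (`traffic_class`, `in_maintenance`, `edge_rate_limited`) and their values are hypothetical; a real SLO platform would use its own label schema.

```python
from dataclasses import dataclass

# Hypothetical label values for excluded traffic; not taken from any real platform.
EXCLUDED_TRAFFIC_CLASSES = {"load_test", "pen_test", "bot", "scanner"}


@dataclass(frozen=True)
class RequestEvent:
    status: int
    traffic_class: str = "user"      # assumed label, set from header or IP range
    in_maintenance: bool = False     # assumed label for planned maintenance windows
    edge_rate_limited: bool = False  # assumed label for requests rejected at the edge


def counts_toward_budget(event: RequestEvent) -> bool:
    """Return True if the event is in scope for the SLO."""
    if event.traffic_class in EXCLUDED_TRAFFIC_CLASSES:
        return False
    if event.in_maintenance or event.edge_rate_limited:
        return False
    return True


def availability_sli(events: list[RequestEvent]) -> float:
    """Good/total over in-scope events only; a 5xx status counts as bad."""
    in_scope = [e for e in events if counts_toward_budget(e)]
    if not in_scope:
        return 1.0  # no in-scope traffic, so nothing was burned
    good = sum(1 for e in in_scope if e.status < 500)
    return good / len(in_scope)
```

Because the filter is code rather than a postmortem debate, whether an event "should count" is decided once, in the policy, and applied the same way every time.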

The policy gate reads the remaining budget: above zero, feature deploys continue; at zero, releases freeze and the team does reliability work until the budget regenerates above the exit threshold.
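
The gate can be sketched as a small state machine, assuming the thresholds quoted in the policy above (freeze at 0, exit above 50% remaining, postmortem above a 20% single-incident burn). The class name, method names, and change-kind strings are illustrative, not part of any real deploy tool.

```python
from dataclasses import dataclass

FREEZE_AT = 0.0            # freeze when the remaining budget fraction reaches 0
EXIT_ABOVE = 0.5           # lift the freeze once remaining budget is above 50%
POSTMORTEM_BURN = 0.2      # a single incident burning more than 20% needs a postmortem
ALWAYS_ALLOWED = {"p0_fix", "security_patch"}  # hypothetical change-kind names


@dataclass
class PolicyGate:
    frozen: bool = False

    def update(self, budget_remaining: float) -> bool:
        """Update freeze state from the remaining budget fraction (1.0 = untouched)."""
        if budget_remaining <= FREEZE_AT:
            self.frozen = True
        elif self.frozen and budget_remaining > EXIT_ABOVE:
            self.frozen = False
        return self.frozen

    def may_deploy(self, change_kind: str) -> bool:
        """P0 fixes and security patches pass even during a freeze."""
        if change_kind in ALWAYS_ALLOWED:
            return True
        return not self.frozen


def needs_postmortem(incident_burn_fraction: float) -> bool:
    """True if one incident consumed more than the policy's burn threshold."""
    return incident_burn_fraction > POSTMORTEM_BURN
```

Note the hysteresis: once frozen, the gate stays frozen at 30% remaining and only reopens above the exit threshold, so the team does not flap between freeze and release.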

Latency SLOs: buckets, not quantiles

An availability SLO is trivially a ratio of counts: good requests over total requests, (total_count - 5xx_count) / total_count. A latency SLO requires expressing “X% of requests under Y ms.” This sounds like a percentile, but it is implemented as a ratio of counters:

latency_sli = sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

The histogram bucket at the SLO threshold gives this counter directly — no histogram_quantile needed, and no estimation error contaminating the budget. This is why the RED-Duration histogram must have a bucket boundary exactly at the SLO threshold: without it, the SLO is unevaluable without approximation.
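
A minimal sketch of the same idea outside the query language, assuming a Prometheus-style cumulative histogram in which each bucket counts requests at or under its boundary. The function name and the sample counts are made up for illustration.

```python
def latency_sli(cumulative_buckets: dict[float, int], total: int, threshold_s: float) -> float:
    """Fraction of requests at or under threshold_s, read straight from a bucket."""
    if threshold_s not in cumulative_buckets:
        # Without a boundary at the threshold, the count would have to be interpolated.
        raise KeyError(f"no bucket boundary at {threshold_s}s")
    if total == 0:
        return 1.0  # no traffic, so nothing was slow
    return cumulative_buckets[threshold_s] / total


# Illustrative counts: boundary in seconds -> requests at or under that boundary.
buckets = {0.05: 700, 0.1: 900, 0.2: 985, 0.5: 998}
sli_200ms = latency_sli(buckets, total=1000, threshold_s=0.2)
```

Asking the same histogram for a 250ms threshold raises `KeyError`, which is the point: the answer exists only as an estimate between the 0.2 and 0.5 boundaries.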

Multi-threshold latency SLO: “90% of requests under 100ms AND 99% of requests under 500ms” catches degradation that a single threshold hides. A service might pass the 99%-under-500ms check while typical latency has drifted upward, for example from 50ms to 300ms, which fails the 90%-under-100ms check. The reverse case, a healthy median with a long tail, is caught by the 500ms check. Production-grade SLO platforms support multiple thresholds per SLO via separate SLI ratios joined with worst-of logic.
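
One plausible reading of worst-of logic, sketched below, is to compute the remaining error budget for each threshold separately and report the smallest. The function names and the sample SLI values are assumptions for illustration; SLO platforms may define the join differently.

```python
def budget_remaining(sli: float, target: float) -> float:
    """Fraction of error budget left: 1.0 untouched, 0.0 exhausted, negative overspent."""
    allowed_bad = 1.0 - target
    actual_bad = 1.0 - sli
    return 1.0 - actual_bad / allowed_bad


def worst_of(slis: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Given name -> (measured SLI, target), return the threshold with the least budget left."""
    remaining = {name: budget_remaining(sli, target) for name, (sli, target) in slis.items()}
    worst_name = min(remaining, key=remaining.get)
    return worst_name, remaining[worst_name]


# Illustrative numbers: the tail check passes while the 100ms check is overspent.
measured = {
    "under_100ms": (0.85, 0.90),   # 85% measured against a 90% target
    "under_500ms": (0.995, 0.99),  # 99.5% measured against a 99% target
}
name, remaining = worst_of(measured)
```

Here the SLO as a whole is reported as overspent because of `under_100ms`, even though the 500ms check alone would look healthy.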

Composite SLOs: the multiplication problem

Services in series	Per-service availability	Journey ceiling
1	99.9%	99.9%
2	99.9%	99.8%
5	99.9%	~99.5%
10	99.9%	~99.0%

A checkout journey through API gateway → auth → inventory → payment → database means a request is “good” only if all five services were good. Each service independently at 99.9% gives a journey ceiling of 0.999⁵ ≈ 99.5%, not 99.9%. Independence is itself an assumption: with dependent failures (shared dependencies, regional incidents), failures cluster in time, so the real figure drifts away from this product and has to be measured rather than derived.

Reframe the 99.9 → 99.0 ceiling as its error budget: chaining 5 services at 99.9% burns roughly 5× the failure budget of one (0.5% vs 0.1%), and 10 services burn roughly 10× (1.0% vs 0.1%). The 'nines' framing hides how fast composite risk compounds.
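
The arithmetic can be checked with a short sketch, assuming independent failures and identical per-service availability. The function names are illustrative.

```python
def journey_ceiling(per_service_availability: float, n_services: int) -> float:
    """Probability that all n services in series are good, assuming independence."""
    return per_service_availability ** n_services


def budget_multiple(per_service_availability: float, n_services: int) -> float:
    """Journey failure budget expressed as a multiple of one service's budget."""
    journey_bad = 1.0 - journey_ceiling(per_service_availability, n_services)
    service_bad = 1.0 - per_service_availability
    return journey_bad / service_bad


for n in (1, 2, 5, 10):
    ceiling = journey_ceiling(0.999, n)
    multiple = budget_multiple(0.999, n)
    print(f"{n:>2} services: ceiling {ceiling:.3%}, budget x{multiple:.2f}")
```

The multiple comes out slightly under n because a request that fails in two services still counts as only one failed journey.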

The fix: add a journey-level SLI at the API gateway — count successful checkout completions / total checkout attempts, not per-service 200 codes. Per-service SLOs become diagnostic context; the headline is the journey SLO. This is the layer users actually experience.
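
To show why the two layers disagree, the sketch below models each checkout attempt as a record of per-hop outcomes. This is a teaching model only: in production the journey SLI would come from completion counts at the gateway, and the hop names and record shape here are assumptions.

```python
HOPS = ("gateway", "auth", "inventory", "payment", "database")


def journey_sli(attempts: list[dict[str, bool]]) -> float:
    """An attempt is good only if every hop it touched was good."""
    if not attempts:
        return 1.0
    good = sum(1 for hops in attempts if all(hops.values()))
    return good / len(attempts)


def per_service_sli(attempts: list[dict[str, bool]], service: str) -> float:
    """Good/total for one service, looking only at attempts that reached it."""
    seen = [hops[service] for hops in attempts if service in hops]
    if not seen:
        return 1.0
    return sum(seen) / len(seen)


# Ten attempts: one fails at auth, a different one fails at payment.
attempts = [{hop: True for hop in HOPS} for _ in range(10)]
attempts[0]["auth"] = False
attempts[1]["payment"] = False

journey = journey_sli(attempts)                 # two of ten journeys failed
auth_only = per_service_sli(attempts, "auth")   # only one of ten auth calls failed
```

Each service sees a single failure, yet the user-facing ratio absorbs both, which is the gap the journey-level SLI is meant to expose.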

Finding the dominant contributor: when the journey SLO is red and per-service dashboards are green, pull per-service failure rates and rank them. Pareto applies: 80% of journey failures typically come from 1–2 services. Fix those first; they give maximum leverage per engineering-hour.
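
A small sketch of the ranking step, assuming each failed journey has already been attributed to one service (for example, the first hop that returned an error). The attribution method and the sample data are assumptions.

```python
from collections import Counter


def rank_contributors(blamed_services: list[str]) -> list[tuple[str, int, float]]:
    """Rank services by failed journeys attributed to them.

    Returns (service, failures, cumulative share of all failures), largest first.
    """
    counts = Counter(blamed_services)
    total = sum(counts.values())
    ranked = []
    running = 0
    for service, failures in counts.most_common():
        running += failures
        ranked.append((service, failures, running / total))
    return ranked


# Illustrative attribution for ten failed checkout journeys.
blamed = ["payment"] * 6 + ["inventory"] * 2 + ["auth"] + ["gateway"]
ranking = rank_contributors(blamed)
```

Reading down the cumulative share column shows where to stop: in this made-up sample the top two services account for 80% of failed journeys.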

Why this works

Correlated failures (e.g. a shared database, a noisy neighbour, a regional network event) break the independence assumption behind the composite calculation. All five services may fail simultaneously because they share a dependency, so multiplying per-service numbers no longer predicts what users see. This is why the journey-level SLI at the API gateway is the authoritative measurement: it captures correlated failures that per-service SLOs miss by definition.

Order the steps

Order the SLO instrumentation pipeline from raw signal to actionable alert:

1 Define the SLI (good/total ratio at the user journey level, not per service)
2 Pick an SLO target (e.g. 99.9% over 28 days)
3 Instrument the counters with bucket boundaries exactly at the SLO threshold
4 Compute recording rules for ratio_rate per window (5m, 30m, 1h, 6h, 3d), paired short/long as 5m/1h, 30m/6h, 6h/3d
5 Define multi-window, multi-burn-rate (MWMBR) alerts (pages at 14.4x and 6x; ticket at 1x)
6 Write the error budget policy (freeze threshold, postmortem trigger, signatures)
7 Quarterly: review whether the SLO matches actual user impact
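
Steps 4 and 5 can be sketched together, assuming the windows in step 4 pair up as long/short (1h with 5m, 6h with 30m, 3d with 6h) and that an alert fires only when both windows of a pair exceed the burn-rate threshold. The function names and the sample error ratios are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


# (long window, short window, burn-rate threshold, action); pairing is an assumption.
RULES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def evaluate(error_ratio_by_window: dict[str, float], slo_target: float) -> list[str]:
    """Return the alerts that fire: both windows of a pair must exceed the threshold."""
    fired = []
    for long_w, short_w, threshold, action in RULES:
        long_burn = burn_rate(error_ratio_by_window[long_w], slo_target)
        short_burn = burn_rate(error_ratio_by_window[short_w], slo_target)
        if long_burn > threshold and short_burn > threshold:
            fired.append(f"{action}:{long_w}/{short_w}")
    return fired


# Illustrative error ratios per window for a 99.9% SLO: a sharp, recent spike.
ratios = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
alerts = evaluate(ratios, slo_target=0.999)
```

With these numbers only the fast pair fires: the spike is roughly 20x burn over the last hour, but it has not yet lifted the 6h or 3d ratios past their thresholds.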

Quiz

A checkout journey passes through 5 services, each with 99.9% availability SLO. What is the theoretical journey availability ceiling (assuming independent failures)?

Quiz

Why must a latency SLO histogram have a bucket boundary exactly at the SLO threshold (e.g. 200ms)?

Quiz

A team with a 99.95% SLO has 5% of its error budget remaining 18 days into the 28-day window. Per the policy, what happens next?

Recall before you leave

01 What must an error budget policy contain, and why do the signatures matter?
02 A checkout request traverses 5 services. Per-service dashboards all show 99.9%. The user-facing success rate is 99.5%. What is the right diagnostic and fix?

Recap

An error budget policy is a signed document that specifies what happens when the budget runs out: feature deploys halt, reliability work takes priority, postmortems are mandatory for large burns. Signatures are what give it teeth — without a director-level signer, on-call teams get overridden under pressure. Latency SLOs are implemented as histogram bucket ratios, not percentile estimates — the histogram must have a bucket boundary exactly at the SLO threshold. Composite journeys through N services multiply: 5 services at 99.9% give a ~99.5% journey ceiling, not 99.9%. The fix is a journey-level SLI at the API gateway and ranking per-service failure rates to find the dominant contributor. Now when you see every per-service dashboard green but the user-facing success rate red, you know the problem — and where to look first.
