
Observability

Error budget policy, latency SLOs, and composite journeys

Crux: A signed error budget policy turns budget burn into organisational action. Latency SLOs require histogram bucket boundaries at the threshold. Composite journeys multiply SLO failures: 5 services at 99.9% give a ~99.5% journey ceiling.

Every service in the checkout path shows 99.9% availability on its own dashboard. The user-facing checkout success rate is 99.5%. Both numbers are correct — the team is monitoring the wrong layer.

The error budget policy: more than a rule

An error budget policy is a written, signed document that specifies what happens when the budget is exhausted. Without the signatures, it is a suggestion that pressured on-call teams ignore during release crunches. With a director-level signer, it is a covenant.

Mandatory contents:

  • Deployment freeze trigger: “when budget reaches 0, halt feature deploys except P0 fixes and security patches”
  • Postmortem trigger: “if a single incident burns >20% of the budget, a postmortem is mandatory”
  • Freeze exit: “focus on reliability work until budget regenerates above 50% remaining”
  • Escalation path: “if the freeze lasts >2 weeks, escalate to VP/CTO”
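
The four triggers above can be read as a small decision function. The following is a minimal sketch, assuming the example thresholds quoted in the list (0 remaining, 20% single-incident burn, 50% exit, 2 weeks); `BudgetState`, `policy_actions` and the action names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetState:
    remaining_fraction: float     # 1.0 = untouched budget, 0.0 = exhausted
    largest_incident_burn: float  # share of the total budget burned by one incident
    freeze_active: bool           # whether a deploy freeze is already in force
    freeze_days: int              # how long the current freeze has lasted


def policy_actions(state: BudgetState) -> list[str]:
    """Return the actions the example policy would require for this state."""
    actions = []
    # Freeze starts at 0 remaining and only lifts once the budget is back above 50%.
    if state.remaining_fraction <= 0.0:
        frozen = True
    elif state.freeze_active and state.remaining_fraction <= 0.5:
        frozen = True
    else:
        frozen = False
    if frozen:
        actions.append("freeze_feature_deploys")
    # Postmortem trigger: one incident burned more than 20% of the budget.
    if state.largest_incident_burn > 0.20:
        actions.append("mandatory_postmortem")
    # Escalation path: the freeze has lasted longer than two weeks.
    if frozen and state.freeze_days > 14:
        actions.append("escalate_to_vp_cto")
    return actions


print(policy_actions(BudgetState(0.0, 0.25, False, 0)))
print(policy_actions(BudgetState(0.3, 0.0, True, 15)))
```

The freeze is deliberately sticky: a budget at 30% remaining does not start a freeze, but it does keep an existing one in place until the 50% exit condition is met.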

Out-of-scope exclusions (events that do not burn the budget):

  • Load test and penetration test requests (tagged by header or IP range)
  • Known bot and scanner traffic
  • Requests during planned maintenance windows
  • Requests rejected by rate limiting at the edge

These exclusion categories must be explicit in the policy document and enforced as label filters in the SLO platform — otherwise every postmortem becomes an argument about whether the incident “should count.”
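
As a rough illustration of exclusions enforced as label filters, the sketch below drops out-of-scope traffic before computing the good/total ratio. The `traffic_class` label and its values are assumptions made for this example; a real SLO platform would use whatever labels the policy document names.

```python
from dataclasses import dataclass

# Hypothetical exclusion labels; the policy document would define the real set.
EXCLUDED_TRAFFIC = {"load_test", "pen_test", "bot", "maintenance", "edge_rate_limited"}


@dataclass
class Request:
    status: int
    traffic_class: str  # e.g. "user", "load_test", "bot"


def availability_sli(requests: list[Request]) -> float:
    """Good/total ratio computed over in-scope requests only."""
    in_scope = [r for r in requests if r.traffic_class not in EXCLUDED_TRAFFIC]
    if not in_scope:
        return 1.0  # no in-scope traffic, so nothing burned the budget
    good = sum(1 for r in in_scope if r.status < 500)
    return good / len(in_scope)


sample = [
    Request(200, "user"),
    Request(500, "load_test"),  # excluded: does not burn the budget
    Request(503, "user"),
    Request(200, "user"),
    Request(200, "user"),
    Request(500, "bot"),        # excluded: does not burn the budget
]
print(availability_sli(sample))  # 3 good out of 4 in-scope requests
```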

Latency SLOs: buckets, not quantiles

An availability SLI is trivially a ratio of counts: good requests over total requests, i.e. (total_count - 5xx_count) / total_count. A latency SLO requires expressing “X% of requests under Y ms.” This sounds like a percentile, but it is implemented as a ratio of two counters:

latency_sli = sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

The histogram bucket at the SLO threshold gives this ratio directly: no histogram_quantile needed, and no interpolation error contaminating the budget. This is why the RED duration histogram must have a bucket boundary exactly at the SLO threshold: without one, the SLI can only be approximated by interpolating between neighbouring buckets.
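
The same ratio can be sketched outside PromQL. The example below assumes cumulative bucket counts keyed by upper bound in seconds, as a Prometheus-style histogram exposes them; the function name and the sample counts are made up for illustration. It refuses to answer when no bucket sits at the threshold, which is exactly the case where approximation would creep in.

```python
def latency_sli(cumulative_buckets: dict[float, int], total: int, threshold_s: float) -> float:
    """Fraction of requests at or under threshold_s, read from cumulative bucket counts."""
    if threshold_s not in cumulative_buckets:
        raise ValueError(
            f"no bucket boundary at {threshold_s}s; the SLI would need interpolation"
        )
    if total == 0:
        return 1.0  # no traffic in the window, so no bad events
    return cumulative_buckets[threshold_s] / total


# Illustrative cumulative counts: 970 of 1000 requests finished within 200 ms.
buckets = {0.05: 600, 0.1: 850, 0.2: 970, 0.5: 995, 1.0: 1000}
print(latency_sli(buckets, total=1000, threshold_s=0.2))  # 0.97
```

Asking for a 250 ms threshold against these buckets raises `ValueError`, because the true count lies somewhere between the 0.2 and 0.5 boundaries and cannot be read off exactly.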

Multi-threshold latency SLO: “90% of requests under 100ms AND 99% of requests under 500ms” catches degradation that a single threshold hides. A service can pass 99%-under-500ms while a large share of ordinary requests has slowed past 100ms, which only the 90%-under-100ms check detects; conversely, it can pass 90%-under-100ms while the slowest few percent blow past 500ms. Production-grade SLO platforms support multiple thresholds per SLO via separate SLI ratios joined with worst-of logic.
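
Worst-of logic can be sketched as one ratio per threshold, with the SLO passing only when every ratio meets its target. The function name, the data layout and the sample counts below are assumptions for illustration, not any particular platform's API.

```python
def multi_threshold_status(cumulative_buckets, total, objectives):
    """Evaluate several (threshold_s, target_fraction) pairs with worst-of logic.

    Returns (overall_ok, detail), where detail lists
    (threshold_s, achieved_fraction, ok) for each objective.
    """
    detail = []
    for threshold_s, target in objectives:
        achieved = cumulative_buckets[threshold_s] / total
        detail.append((threshold_s, achieved, achieved >= target))
    # Worst-of: a single failing threshold fails the whole SLO.
    overall_ok = all(ok for _, _, ok in detail)
    return overall_ok, detail


# Illustrative counts: the 500 ms check passes, the 100 ms check does not.
sample_buckets = {0.1: 870, 0.5: 995}
sample_objectives = [(0.1, 0.90), (0.5, 0.99)]
ok, detail = multi_threshold_status(sample_buckets, 1000, sample_objectives)
print(ok)
for threshold_s, achieved, passed in detail:
    print(threshold_s, achieved, passed)
```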

Composite SLOs: the multiplication problem

Services in series | Each at 99.9% | Journey ceiling
1 service          | 99.9%         | 99.9%
2 services         | 99.9% each    | 99.8%
5 services         | 99.9% each    | ~99.5%
10 services        | 99.9% each    | ~99.0%

A checkout journey through API gateway → auth → inventory → payment → database means a request is “good” only if all five services were good. Each service independently at 99.9% gives a journey ceiling of 0.999⁵ ≈ 99.5%, not 99.9%. The product assumes independent failures. With dependent failures (shared dependencies, regional incidents) it is only an estimate: failures cluster in time, so the budget burns in concentrated bursts rather than evenly, and the measured journey rate can differ from the product.
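
The multiplication is easy to check. This is a minimal sketch under the same independence assumption; `journey_ceiling` is a name chosen for this example.

```python
import math


def journey_ceiling(availabilities: list[float]) -> float:
    """Product of per-service availabilities, assuming independent failures."""
    return math.prod(availabilities)


for n in (1, 2, 5, 10):
    # Every service in the chain is assumed to sit at exactly 99.9%.
    print(n, f"{journey_ceiling([0.999] * n):.2%}")
```

Mixed targets work the same way: a chain of 99.9%, 99.95% and 99.5% services is bounded by the product of the three, which is always below the weakest single figure.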

The fix: add a journey-level SLI at the API gateway — count successful checkout completions / total checkout attempts, not per-service 200 codes. Per-service SLOs become diagnostic context; the headline is the journey SLO. This is the layer users actually experience.

Finding the dominant contributor: when the journey SLO is red and per-service dashboards are green, pull per-service failure rates and rank them. A Pareto pattern is common: most journey failures tend to come from one or two services. Fix those first; they give maximum leverage per engineering-hour.
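
The ranking step might look like the sketch below. It assumes each failed journey has already been attributed to one service (for example, the first hop that returned an error), which is a simplification; the service names follow the checkout path above and the failure counts are invented for illustration.

```python
def rank_contributors(failures_by_service: dict[str, int]) -> list[tuple[str, float]]:
    """Rank services by their share of attributed journey failures, largest first."""
    total = sum(failures_by_service.values())
    if total == 0:
        return []
    ranked = sorted(failures_by_service.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, count / total) for name, count in ranked]


# Invented counts of failed checkout journeys attributed to each hop.
failures = {"gateway": 12, "auth": 30, "inventory": 48, "payment": 410, "database": 0}
for name, share in rank_contributors(failures):
    print(f"{name:10s} {share:.0%}")
```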

Why this works

Correlated failures (e.g. a shared database, a noisy neighbour, a regional network event) break the independence assumption behind the composite calculation. All five services may fail simultaneously because they share a dependency, so the real journey number cannot be derived by multiplying per-service figures. This is why the journey-level SLI at the API gateway is the authoritative measurement: it records what users actually experienced, including correlated failures that per-service SLOs cannot see in combination.

Order the steps

Order the SLO instrumentation pipeline from raw signal to actionable alert:

  1. Define the SLI (good/total ratio at the user journey level, not per service)
  2. Pick an SLO target (e.g. 99.9% over 28 days)
  3. Instrument the counters with bucket boundaries exactly at the SLO threshold
  4. Compute recording rules for ratio_rate per window, in short/long pairs (5m/1h, 30m/6h, 6h/3d)
  5. Define multi-window, multi-burn-rate (MWMBR) alerts (pages at 14.4x and 6x; ticket at 1x)
  6. Write the error budget policy (freeze threshold, postmortem trigger, signatures)
  7. Quarterly: review whether the SLO matches actual user impact
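
Steps 4 and 5 can be sketched as follows. Burn rate is the observed error ratio divided by the error budget, and an alert fires only when both the long and the short window of a pair exceed the same threshold. The pairing of windows to thresholds below follows the common multi-window layout and is an assumption here, as are the function names and the sample error ratios.

```python
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET  # allowed error ratio

# (long window, short window, burn-rate threshold, action)
RULES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, budget: float = BUDGET) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / budget


def evaluate(error_ratio_by_window: dict[str, float]) -> list[tuple[str, str, str]]:
    """Return the (long, short, action) rules whose two windows both exceed the threshold."""
    fired = []
    for long_w, short_w, threshold, action in RULES:
        long_burn = burn_rate(error_ratio_by_window[long_w])
        short_burn = burn_rate(error_ratio_by_window[short_w])
        if long_burn >= threshold and short_burn >= threshold:
            fired.append((long_w, short_w, action))
    return fired


# Invented error ratios per window: a sharp, recent spike.
observed = {"5m": 0.02, "1h": 0.016, "30m": 0.004, "6h": 0.003, "3d": 0.0005}
print(evaluate(observed))
```

Requiring the short window as well means the alert stops firing soon after the errors stop, rather than lingering until the long window drains.
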
Quiz

A checkout journey passes through 5 services, each with 99.9% availability SLO. What is the theoretical journey availability ceiling (assuming independent failures)?

Quiz

Why must a latency SLO histogram have a bucket boundary exactly at the SLO threshold (e.g. 200ms)?

Quiz

A team with a 99.95% SLO has 5% of its error budget remaining 18 days into the 28-day window. Per the policy, what happens next?

Recall before you leave
  1. What must an error budget policy contain, and why do the signatures matter?
  2. A checkout request traverses 5 services. Per-service dashboards all show 99.9%. The user-facing success rate is 99.5%. What is the right diagnostic and fix?
Recap

An error budget policy is a signed document that specifies what happens when the budget runs out: feature deploys halt, reliability work takes priority, postmortems are mandatory for large burns. Signatures are what give it teeth — without a director-level signer, on-call teams get overridden under pressure. Latency SLOs are implemented as histogram bucket ratios, not percentile estimates — the histogram must have a bucket boundary exactly at the SLO threshold. Composite journeys through N services multiply: 5 services at 99.9% give a ~99.5% journey ceiling, not 99.9%. The fix is a journey-level SLI at the API gateway and ranking per-service failure rates to find the dominant contributor.
