
Native histograms, SLO tie-in, and production failure patterns

Native (exponential) histograms eliminate manual bucket tuning. RED's Duration and Errors feed the SLO error budget. Real production failures (Cloudflare, GitHub, Stripe, Discord) each trace to a specific RED+USE gap.


GitHub 2023: an SLO alert for code-search latency stayed silent through a regression. Investigation found the alert was on average latency: a change that pushed p99 from 200 ms to 800 ms barely moved the mean. The SLO was reworked to multi-window p99 burn-rate alerts per the Google SRE Workbook pattern. The next incident was caught in under 4 minutes.

Native histograms (Prometheus 2.40+)

How often have you shipped a service, then realised three months later that your buckets are too coarse in exactly the latency range that matters, so your p99 reads as a wide smear instead of a number? Classic histograms require choosing bucket boundaries at instrumentation time. If your service’s latency distribution shifts (say a slow path suddenly dominates the 1–5 s range), you need to redeploy with different buckets.

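Native histograms (Prometheus 2.40+, called exponential histograms in OTel) use a logarithmic scale with dynamically adjusting resolution. Each bucket boundary is the previous one multiplied by a fixed growth factor of 2^(2^-scale), so there is no static boundary list, and a higher scale means narrower buckets. At a suitable scale every observation lands in a bucket sized at ~1% relative precision, wherever it falls in the range. Benefits: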

No manual bucket tuning.
Uniform p99 accuracy across the latency range (not just near the SLO).
Similar storage cost to a well-tuned classic histogram.
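PromQL support: histogram_quantile works on native histograms too; the query simply drops the _bucket suffix and the le grouping, because the buckets travel inside each sample.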
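As a rough illustration of the exponential bucketing described above, the sketch below maps a latency to its bucket using the growth factor 2^(2^-scale). The function names and the choice of scale = 5 are assumptions made for this example only; real client libraries choose and adjust the resolution themselves.

```python
import math

def bucket_index(value: float, scale: int) -> int:
    """Return i such that base**i < value <= base**(i + 1), with base = 2**(2**-scale)."""
    return math.ceil(math.log2(value) * 2 ** scale) - 1

def bucket_bounds(index: int, scale: int) -> tuple[float, float]:
    """Lower (exclusive) and upper (inclusive) boundary of bucket `index`."""
    base = 2 ** (2 ** -scale)
    return base ** index, base ** (index + 1)

SCALE = 5  # assumed resolution for illustration: 2**5 = 32 buckets per doubling
BASE = 2 ** (2 ** -SCALE)

for latency_s in (0.003, 0.2, 1.7, 42.0):
    low, high = bucket_bounds(bucket_index(latency_s, SCALE), SCALE)
    print(f"{latency_s:>7} s -> ({low:.5f}, {high:.5f}]  width ratio {high / low:.4f}")

# Half the bucket width, relative to the bucket's arithmetic midpoint.
print(f"relative error bound ~ {(BASE - 1) / (BASE + 1):.4f}")
```

Every bucket has the same width ratio no matter whether the latency is 3 ms or 42 s, which is why the relative precision is uniform across the range. At the assumed scale of 5 the bound works out to roughly 1%.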

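OTel’s exponential histograms use the same base-2 exponential bucketing. In the OTel SDKs they are typically an opt-in aggregation; explicit-bucket histograms remain the specification's default for histogram instruments.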

Compatibility note: some older storage backends (Cortex < 1.15, Thanos < 0.32) need version upgrades to read native histograms. Verify the storage tier before switching.

Histogram type	Bucket tuning	p99 accuracy	Compatibility
Classic (fixed buckets)	Manual, per-service	Good near SLO only	Universal
Native / exponential	None needed	~1% everywhere	Prometheus 2.40+, OTel SDK (opt-in)
Summary (per-replica)	None	Correct per-replica, wrong aggregated	Cannot aggregate across replicas

Why averaging p99 across replicas is wrong

Averaging pre-computed percentiles (summary metrics) across replicas is a famous anti-pattern:

Example: replica A serves 100 requests all at 100 ms. Replica B serves 100 requests all at 1000 ms. p99 of A is 100 ms; p99 of B is 1000 ms; their average is 550 ms. The true p99 of the merged 200 requests is 1000 ms: the 99th percentile sits at rank 198 of 200, and ranks 101 through 200 are all replica B's 1000 ms requests. The averaged figure understates the real tail by nearly 2×.

Percentiles are not additive: averaging two replicas' p99s reads 550 ms, but the true fleet-wide p99 is 1000 ms — off by nearly 2x.

The correct pattern: histograms emit per-replica bucket counts; sum by (le) (rate(...)) aggregates them at query time; histogram_quantile computes the fleet-wide percentile from the aggregated counts. The result is an estimate whose error is bounded by the width of the bucket the quantile falls in, and no replica's distribution is treated in isolation.
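The two-replica example can be checked numerically. The sketch below is an illustration only: the bucket boundaries are an assumed layout, and the histogram_quantile function is a simplified stand-in loosely modeled on linear interpolation within a bucket, not Prometheus's actual implementation.

```python
import math

def nearest_rank_percentile(samples, q):
    """Exact percentile by the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def cumulative_counts(samples, bounds):
    """Classic-histogram style counts: observations <= each upper bound (le)."""
    return [sum(1 for s in samples if s <= b) for b in bounds]

def histogram_quantile(q, bounds, cumulative):
    """Estimate a quantile (0 < q <= 1) from cumulative bucket counts."""
    rank = q * cumulative[-1]
    for i, count in enumerate(cumulative):
        if count >= rank:
            if math.isinf(bounds[i]):
                return bounds[i - 1]  # cannot interpolate inside the +Inf bucket
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            return lower + (bounds[i] - lower) * (rank - below) / (count - below)
    return bounds[-1]

BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0, math.inf]  # assumed bucket layout, in seconds
replica_a = [0.1] * 100   # 100 requests at 100 ms
replica_b = [1.0] * 100   # 100 requests at 1000 ms

# Anti-pattern: average the per-replica percentiles.
averaged_p99 = (nearest_rank_percentile(replica_a, 0.99)
                + nearest_rank_percentile(replica_b, 0.99)) / 2

# Ground truth over the merged raw samples.
true_p99 = nearest_rank_percentile(replica_a + replica_b, 0.99)

# Correct pattern: sum bucket counts across replicas, then take the quantile.
merged = [a + b for a, b in zip(cumulative_counts(replica_a, BOUNDS),
                                cumulative_counts(replica_b, BOUNDS))]
estimated_p99 = histogram_quantile(0.99, BOUNDS, merged)

print(averaged_p99, true_p99, round(estimated_p99, 3))
```

The summed-bucket estimate lands inside the 0.5 to 1.0 s bucket that actually contains the true p99, while the averaged figure falls in a range where no request was ever observed.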

RED feeds the SLO error budget

An SLO is a promise: 99.9% of requests complete successfully within 200 ms over 30 days. RED’s Duration and Errors are exactly the metrics that populate this promise.

Availability SLO: 1 - rate(errors[window]) / rate(total[window]) — “99.9% of requests must not return 5xx”
Latency SLO: histogram_quantile(0.99, ...) < 0.200 — “99th percentile must be under 200 ms”
Error budget: minutes or request-count remaining before the SLO is breached for the month.
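To make the error-budget arithmetic concrete, here is a minimal sketch. The function names and the request counts are invented for illustration; only the 99.9% target and the 30-day window come from the text above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the request-count budget still unspent (negative once breached)."""
    allowed_bad = (1.0 - slo) * total_requests
    return 1.0 - bad_requests / allowed_bad

SLO = 0.999
print(f"time budget: {error_budget_minutes(SLO):.1f} min per 30 days")

# Hypothetical month so far: 10 million requests, 2,500 of them bad.
print(f"remaining:   {budget_remaining(SLO, 10_000_000, 2_500):.0%}")
```

For a latency SLO, a "bad" request would be one slower than the 200 ms target rather than one that returned a 5xx.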

Without RED instrumentation, you do not have an SLO — you have a wish.

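Multi-window burn-rate alerting (Google SRE Workbook): alert when the error budget is burning too fast, evaluated over two window lengths:

1-hour window (fast burn): catches sudden spikes (deployment gone wrong).
6-hour window (slow burn): catches slow regressions that threshold alerts miss.

In the Workbook's recommended setup, each of these is additionally paired with a much shorter window (5 minutes for the 1-hour alert, 30 minutes for the 6-hour alert). The alert fires only while both windows of a pair exceed the burn-rate threshold, so it resets quickly once the problem stops.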

A threshold alert on absolute p99 (“fire if p99 > 200 ms”) misses slow regressions (p99 goes from 150 to 190 ms over two weeks, always below the threshold) and fires false positives on brief spikes. Burn-rate alerts catch both.
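A burn rate is the observed bad-request ratio divided by the ratio the SLO allows. The sketch below shows how the thresholds fall out of "spend X% of the 30-day budget within one window"; the 2% and 5% budget fractions follow the SRE Workbook's example table, while the function names and the sample ratios are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)

def threshold_for(budget_fraction: float, window_hours: float,
                  period_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the period's budget in one window."""
    return budget_fraction * period_hours / window_hours

def should_alert(long_ratio: float, short_ratio: float,
                 slo: float, threshold: float) -> bool:
    """Fire only while both the long and the short window exceed the threshold."""
    return (burn_rate(long_ratio, slo) > threshold
            and burn_rate(short_ratio, slo) > threshold)

SLO = 0.999
FAST = threshold_for(0.02, window_hours=1)   # about 14.4
SLOW = threshold_for(0.05, window_hours=6)   # about 6

# Hypothetical bad-request ratios (long window, short window).
spike = should_alert(0.020, 0.030, SLO, FAST)       # sudden breakage
drift_fast = should_alert(0.007, 0.008, SLO, FAST)  # too slow for the fast alert
drift_slow = should_alert(0.007, 0.008, SLO, SLOW)  # caught by the slow alert
recovered = should_alert(0.020, 0.0005, SLO, FAST)  # short window has cleared

print(spike, drift_fast, drift_slow, recovered)
```

The drift case never trips the fast alert but still burns budget seven to eight times faster than sustainable, which the slower alert picks up.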

Production failure patterns in RED+USE

GitHub 2023 (average latency alert): SLO alert on average latency. A regression pushed p99 from 200 to 800 ms; mean moved from 45 to 55 ms. Alert never fired. Fix: switch to p99 burn-rate alerts with 1-hour and 6-hour windows per the SRE Workbook.

Cloudflare 2022 (cardinality explosion): A new label on request_duration caused Prometheus servers to OOM. Rule-evaluation latency grew from 100 ms to 4+ s, masking service alerts. Prometheus’s own prometheus_tsdb_head_series was the leading signal — but there was no alert on it. Fix: per-team cardinality budget enforced in CI; meta-monitoring on Prometheus’s own RED.

Stripe 2024 (PSI memory pressure invisible to free-RAM): Webhook worker stalled for 11 minutes. Free-RAM dashboards showed 500 MB available. PSI memory full was at 90% — kernel was thrashing reclaim. Fix: migrated saturation alerting from MemAvailable < threshold to PSI-based alerting; next incident caught in 30 seconds.

Discord 2023 (rule-evaluation latency growth): A refactor moved a hot label from one metric to two. Cardinality doubled silently. Prometheus rule-evaluation latency grew from 100 ms to 4 s over two weeks, delaying alerts during a customer-facing incident. Fix: rule-evaluation duration added as a first-class panel in the meta-monitoring dashboard.
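For the PSI signal in the Stripe case, the kernel exposes pressure figures as plain text in /proc/pressure/memory on Linux kernels with PSI enabled. The sketch below parses that format; the sample values and the 10% threshold are made up for illustration and are not taken from the incident.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse the contents of a /proc/pressure/<resource> file."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        result[kind] = {key: float(value)
                        for key, value in (field.split("=", 1) for field in fields)}
    return result

def memory_stalled(psi: dict[str, dict[str, float]],
                   full_avg10_limit: float = 10.0) -> bool:
    """True when all non-idle tasks were stalled on memory for more than the
    given percentage of the last 10 seconds (threshold is an assumption)."""
    return psi["full"]["avg10"] > full_avg10_limit

# Invented sample in the kernel's PSI text format.
SAMPLE = """\
some avg10=92.10 avg60=71.44 avg300=30.02 total=184467210
full avg10=90.00 avg60=68.31 avg300=27.90 total=171230044
"""

psi = parse_psi(SAMPLE)
print(psi["full"]["avg10"], memory_stalled(psi))
```

On a real host the text would come from reading /proc/pressure/memory instead of SAMPLE. A free-RAM check would see nothing unusual in this situation, because reclaim stalls do not require MemAvailable to reach zero.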

Why this works

Self-referential observability: Prometheus itself emits RED and USE metrics. Key Prometheus self-metrics to watch: prometheus_tsdb_head_series (cardinality growth), prometheus_engine_query_duration_seconds (query latency; it is a summary, so read the quantile="0.99" series), and prometheus_rule_evaluation_duration_seconds (alert delays; also a summary with a quantile label). Every observability layer (OTel Collectors, Fluent Bit, Vector) emits its own RED. In a mature stack, every layer of observability is itself observed. The collapse mode for an unobserved monitoring tier is silent failure: your services look fine, but the monitoring pipeline stopped delivering their data 20 minutes ago.
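The Cloudflare fix mentions a per-team cardinality budget enforced in CI. The lesson gives no details of that check, so the following is a purely hypothetical sketch of the idea: count distinct label sets per metric and flag any metric above its budget. All names, labels, and limits here are invented.

```python
from collections import defaultdict
from itertools import product

def series_per_metric(samples):
    """Count distinct label sets (time series) per metric name."""
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}

def over_budget(samples, budgets, default_budget=1000):
    """Return the metrics whose series count exceeds their budget."""
    counts = series_per_metric(samples)
    return {name: count for name, count in counts.items()
            if count > budgets.get(name, default_budget)}

ROUTES = [f"/api/v{i}" for i in range(5)]
STATUSES = ["200", "404", "500"]
USERS = [f"user-{i}" for i in range(200)]

# Before: 5 routes x 3 statuses = 15 series.
before = [("request_duration", {"route": r, "status": s})
          for r, s in product(ROUTES, STATUSES)]

# After adding a hypothetical per-user label: 15 x 200 = 3000 series.
after = [("request_duration", {"route": r, "status": s, "user_id": u})
         for r, s, u in product(ROUTES, STATUSES, USERS)]

BUDGETS = {"request_duration": 100}  # assumed per-metric limit
print(over_budget(before, BUDGETS), over_budget(after, BUDGETS))
```

Series count is the product of the label value counts, which is why a single unbounded label can multiply cardinality by orders of magnitude.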

Quiz

A team uses Prometheus summary metrics (pre-computed p99 per replica) and averages them to get fleet-wide p99. Why is this mathematically wrong?

Quiz

A threshold alert fires when p99 > 200 ms. p99 drifts from 140 ms to 185 ms over three weeks (never crossing 200 ms). The SLO allows 99.9% of requests to complete within 200 ms. What does the threshold alert miss that a burn-rate alert would catch?

The average lands in the dense hump near 50 ms and looks healthy. Percentiles read the tail directly: p95 and p99 sit past the 200 ms SLO line, exposing slow requests the mean hides.
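The gap between mean and tail is easy to reproduce. The distribution below is constructed for illustration so that it roughly mirrors the figures quoted in this lesson; it is not real incident data.

```python
import math
from statistics import mean

def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

# 1,000 requests: 98% fast, 2% on a slow path (latencies in ms).
before = [40] * 980 + [200] * 20
after = [40] * 980 + [800] * 20   # the slow path regresses 4x

print(f"mean: {mean(before):.1f} -> {mean(after):.1f} ms")
print(f"p99:  {p99(before)} -> {p99(after)} ms")
```

The mean shifts by about 12 ms while the p99 quadruples, so an alert on the mean with any reasonable margin stays silent.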

Recall before you leave

01. What is a native (exponential) histogram and what problem does it solve over classic histograms?
02. How does RED tie into SLO error budgets? What is the formula for a latency SLO?
03. What are the two burn-rate windows in multi-window alerting and what does each catch?

Recap

Native (exponential) histograms, available in Prometheus 2.40+ and as an opt-in aggregation in the OTel SDKs, use a logarithmic bucket scale at ~1% relative precision, eliminating the manual bucket tuning required by classic histograms. RED Duration histograms and Error counters are the direct substrate for SLO measurement: the availability SLO is 1 - error_rate and the latency SLO is histogram_quantile(0.99, ...) < target. Multi-window burn-rate alerting (1-hour and 6-hour windows) catches both sudden regressions and the slow drift that threshold alerts miss; it is the pattern the GitHub 2023 fix adopted. Every major production failure in this unit (Cloudflare 2022, GitHub 2023, Stripe 2024, Discord 2023) reduced to one specific missed RED or USE signal: cardinality explosion invisible without Prometheus self-monitoring, average latency masking p99, PSI memory pressure invisible to free-RAM dashboards, and rule-evaluation latency growth delaying alerts. Self-referential observability, meaning monitoring the monitoring pipeline itself with the same RED+USE discipline, is the last defense. Now when you review an SLO definition or an alert rule, you will ask: “Which RED signal feeds this, and does burn-rate alerting cover the slow-drift case?” Cloudflare, GitHub, Stripe, and Discord already paid the cost of not asking.
