Instrumenting RED in Prometheus: counters, histograms, and cardinality discipline

Crux: The three canonical Prometheus metrics for RED, why Duration must be a histogram (never an average), how histogram_quantile works, and the iron label discipline that keeps cardinality under control.

A team alerts on average request latency. A bug fix pushes p99 from 200 ms to 800 ms but barely moves the mean. The on-call misses the incident for 40 minutes. The SLO review finds that the average-latency alert has never fired on real user impact. A histogram-based p99 alert would have fired within 2 minutes.

The three canonical RED metrics

Every HTTP service should emit exactly three metric groups, named consistently:

http_requests_total        # counter — Rate
http_request_errors_total  # counter — Errors (5xx only, or a status label)
http_request_duration_seconds  # histogram — Duration

PromQL then gives you all three RED dimensions:

  • Rate: rate(http_requests_total[5m])
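  • Error rate: sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) (aggregate both sides first: the two counters carry different label sets, so a series-by-series division matches nothing)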
  • Duration p99: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
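
To make the first two queries concrete, here is a rough sketch of the arithmetic behind rate() and the error ratio in plain JavaScript. It is a simplification: real Prometheus also extrapolates toward the window edges, and the { t, v } sample shape and the function names are assumptions made for illustration only.

```javascript
// Simplified model of PromQL rate(): per-second increase of a counter over
// a window, treating any drop in value as a counter reset (process restart).
// Samples are hypothetical { t: seconds, v: counterValue } pairs, oldest first.
function simpleRate(samples) {
  if (samples.length < 2) return NaN; // two samples are the minimum for a rate
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // After a reset the counter restarts from zero, so the new value is the increase.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsed;
}

// Error ratio = error rate / request rate over the same window.
function errorRatio(errorSamples, requestSamples) {
  return simpleRate(errorSamples) / simpleRate(requestSamples);
}

const requests = [{ t: 0, v: 1000 }, { t: 150, v: 1600 }, { t: 300, v: 2200 }];
const errors = [{ t: 0, v: 10 }, { t: 150, v: 16 }, { t: 300, v: 22 }];
console.log(simpleRate(requests));         // 4 requests per second
console.log(errorRatio(errors, requests)); // about 0.01, i.e. 1% of requests failed
```

The reset branch is the reason counters are queried through rate() rather than read directly: a restart drops the raw value to zero, but the computed rate stays continuous.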

Why Duration must be a histogram

The average hides everything users notice. A service with 99% of requests at 50 ms and 1% at 5000 ms has nearly the same mean latency (99.5 ms) as one with all requests at 100 ms. In the first, one request in a hundred stalls for five seconds and triggers timeouts and retries; in the second, nobody notices anything.
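
The comparison can be checked with a few lines of JavaScript. This is a sketch on synthetic data matching the two services described above; the helper names are illustrative, and the percentile here is a nearest-rank calculation on raw observations, not the histogram estimate Prometheus produces. Because exactly 1% of requests are slow, the tail first becomes visible just above p99, so the sketch reads p99.9.

```javascript
// Two synthetic latency distributions in milliseconds, 1,000 requests each.
// Service A: 99% fast, 1% very slow. Service B: every request takes 100 ms.
const serviceA = [...Array(990).fill(50), ...Array(10).fill(5000)];
const serviceB = Array(1000).fill(100);

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile on raw observations.
function percentile(values, q) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

console.log(mean(serviceA), mean(serviceB));   // 99.5 100
console.log(percentile(serviceA, 0.5));        // 50
console.log(percentile(serviceA, 0.999));      // 5000
console.log(percentile(serviceB, 0.999));      // 100
```

The means differ by half a millisecond while the tails differ by a factor of fifty, which is the gap an average-based alert cannot see.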

Prometheus’s histogram_quantile(q, buckets) reads per-bucket counts accumulated over a time window and estimates the q-th percentile by linear interpolation between adjacent buckets. Accuracy depends entirely on bucket density near the percentile you care about.
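
The interpolation step can be sketched as follows. This is a simplified model of the classic-bucket calculation, not the Prometheus source: the { le, count } input shape and the function name are assumptions, counts are cumulative as in the exposition format, and q is expected to lie in (0, 1].

```javascript
// Sketch of the interpolation behind histogram_quantile for classic buckets.
// `buckets` is a hypothetical array of { le, count } with cumulative counts,
// sorted by upper bound and ending with le = Infinity (the +Inf bucket).
function estimateQuantile(q, buckets) {
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Quantile falls in the +Inf bucket: report the highest finite boundary.
  if (i === buckets.length - 1) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

const example = [
  { le: 0.1, count: 900 },
  { le: 0.2, count: 980 },
  { le: 0.4, count: 995 },
  { le: Infinity, count: 1000 },
];
console.log(estimateQuantile(0.99, example));  // about 0.333
console.log(estimateQuantile(0.999, example)); // 0.4
```

In the example the 99th-percentile request lands somewhere in the 0.2 s to 0.4 s bucket, and the estimate of roughly 0.333 s is only a guess at where. The true value could be anywhere in that bucket, which is why bucket density near the percentile matters.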

The by (le) requirement. The correct form is always:

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Dropping by (le) collapses all label dimensions including le (the bucket boundary label), leaving histogram_quantile with a single number rather than a distribution. Prometheus ignores bucket samples that have no le label, so the query yields an empty result or NaN, never a real p99. This is a common mistake, and it fails quietly: the panel shows nothing and the alert never fires.
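
A small sketch of the two aggregations makes the difference visible. The per-replica rates below are made-up numbers standing in for the output of rate(http_request_duration_seconds_bucket[5m]), and the function names are illustrative.

```javascript
// Each replica exposes cumulative bucket rates keyed by the `le` label.
// Values are hypothetical per-second rates, one row per (replica, le) series.
const replicaBuckets = [
  { replica: 'a', le: '0.1', rate: 90 },
  { replica: 'a', le: '0.2', rate: 98 },
  { replica: 'a', le: '+Inf', rate: 100 },
  { replica: 'b', le: '0.1', rate: 40 },
  { replica: 'b', le: '0.2', rate: 45 },
  { replica: 'b', le: '+Inf', rate: 50 },
];

// sum by (le): one series per bucket boundary, replicas added together.
function sumByLe(series) {
  const out = new Map();
  for (const s of series) out.set(s.le, (out.get(s.le) || 0) + s.rate);
  return out;
}

// sum without by (le): every label, including le, collapses into one number.
function sumAll(series) {
  return series.reduce((acc, s) => acc + s.rate, 0);
}

console.log(sumByLe(replicaBuckets)); // three entries: 0.1 => 130, 0.2 => 143, +Inf => 150
console.log(sumAll(replicaBuckets));  // 423, with no bucket boundary attached
```

The first result is still a fleet-wide histogram that a quantile can be read from. The second is a single number that adds overlapping cumulative counts together and carries no boundary information at all.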

Latency signal | What it hides | Use it for
Average (sum/count) | Slow-tail behavior that users notice | Never for SLO alerts
Prometheus summary | Cannot aggregate across replicas | Single-replica-owns-data only
Prometheus histogram | Accuracy depends on bucket density | Fleet-wide p99 alerts

Bucket strategy

Default Prometheus client buckets, [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds, are wrong for most services. For a checkout API with a 200 ms SLO, most traffic falls between 50 ms and 250 ms. The defaults put only two buckets across that range, and the one containing the SLO target spans 100 ms to 250 ms, so the interpolated p99 could be anywhere in it: a 180 ms p99 and a 240 ms p99 are indistinguishable.

Production rule: 10–15 buckets, densest around the SLO target. For a 200 ms SLO:

[0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 10]

Four boundaries below 200 ms (10, 25, 50, 100 ms) give resolution on the fast path, four above (400, 800, 1600, 3200 ms) track the tail, and the top bucket caps at the service timeout (10 s). Adjacent buckets differ by ≤2× near the SLO.
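
The two layouts can be compared directly. The sketch below uses the default and tuned boundary lists from this section; the helper names and the 180 ms / 220 ms probe values are illustrative assumptions.

```javascript
const defaultBuckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];
const tunedBuckets = [0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 10];

// Returns the [lower, upper] bounds of the bucket a value lands in.
// Prometheus buckets include their upper bound (value <= le).
function bucketFor(value, bounds) {
  const i = bounds.findIndex((le) => value <= le);
  if (i === -1) return [bounds[bounds.length - 1], Infinity];
  return [i === 0 ? 0 : bounds[i - 1], bounds[i]];
}

// Largest ratio between adjacent boundaries that both lie inside [from, to].
function maxStepRatio(bounds, from, to) {
  let worst = 1;
  for (let i = 1; i < bounds.length; i++) {
    if (bounds[i - 1] >= from && bounds[i] <= to) {
      worst = Math.max(worst, bounds[i] / bounds[i - 1]);
    }
  }
  return worst;
}

// A 180 ms request (inside the SLO) and a 220 ms request (outside it).
console.log(bucketFor(0.18, defaultBuckets), bucketFor(0.22, defaultBuckets));
console.log(bucketFor(0.18, tunedBuckets), bucketFor(0.22, tunedBuckets));
console.log(maxStepRatio(tunedBuckets, 0.05, 0.8)); // 2
```

With the defaults both requests fall into the same 100 ms to 250 ms bucket, so the histogram cannot say which side of the SLO they were on. With the tuned layout the 200 ms boundary separates them.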

Label discipline — the iron rule

Every unique combination of label values on a Prometheus metric creates a separate time series. A naive RED instrumentation labelled by user_id in a service with 100k active users grows from a few hundred series to hundreds of thousands within hours.

What belongs in labels:

  • route: the URL template (/cart, not /cart?u=12345)
  • method: HTTP verb (GET / POST / …)
  • status_class: 2xx / 3xx / 4xx / 5xx (not the exact code)
  • service: injected by the deployment as a meta-label

Forbidden in labels: user IDs, request IDs, customer emails, session tokens, and query strings. All of these have unbounded cardinality. Country code is acceptable only when the set of values is small and bounded.

The cost math: collapsing 200/201/204 into 2xx cuts 60 unique status codes down to 4 classes. For 20 routes × 4 methods: 60 × 20 × 4 = 4,800 series → 4 × 20 × 4 = 320 series, a 15× reduction with no loss of useful alerting power.
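
The same arithmetic as a small calculator, useful for checking a label set before shipping it. The function names are illustrative. The histogram multiplier assumes the client exposes one series per configured bucket plus the +Inf bucket, _sum, and _count, which is how classic Prometheus histograms are exposed.

```javascript
// Worst-case series count is the product of the distinct values of every label.
function seriesCount(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((n, c) => n * c, 1);
}

const exactStatus = seriesCount({ status: 60, route: 20, method: 4 });
const statusClass = seriesCount({ status_class: 4, route: 20, method: 4 });
console.log(exactStatus, statusClass, exactStatus / statusClass); // 4800 320 15

// A histogram multiplies again: one series per configured bucket,
// plus the +Inf bucket, _sum and _count.
function histogramSeries(labelCardinalities, bucketCount) {
  return seriesCount(labelCardinalities) * (bucketCount + 3);
}
console.log(histogramSeries({ status_class: 4, route: 20, method: 4 }, 10)); // 4160

// Adding a user_id label with 100k values to the counter alone:
console.log(seriesCount({ status_class: 4, route: 20, method: 4, user_id: 100000 })); // 32000000
```

These are upper bounds, since not every combination occurs in practice, but an unbounded label such as user_id pushes the bound far past what any real traffic pattern will rescue.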

Why this works

If you genuinely need to alert on a specific status code on a specific route, build that alert from logs, not from a metric with a high-cardinality label. Logs are the natural home of high-cardinality data (each event is one record). Metrics are the home of aggregated, time-series counts (each series is a separate in-memory counter). The split is architectural, not a matter of preference.

A Node.js RED middleware

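// Assumes an Express app: req.route is an Express concept.
const express = require('express');
const client = require('prom-client');

const app = express();

// Rate: every request, labelled only with bounded values.
const reqs = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_class'],
});
// Errors: server-side failures (5xx) only.
const errs = new client.Counter({
  name: 'http_request_errors_total',
  help: 'Failed HTTP requests (5xx)',
  labelNames: ['method', 'route'],
});
// Duration: histogram with buckets densest around the 200 ms SLO.
const dur = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration',
  labelNames: ['method', 'route', 'status_class'],
  buckets: [0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 10],
});

// Register this before any routes so every request is timed.
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  // 'finish' fires after routing has completed, so req.route is already
  // populated when this callback reads it.
  res.on('finish', () => {
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    const sclass = `${Math.floor(res.statusCode / 100)}xx`;
    // Matched route template, or one shared value for unmatched requests.
    const route = req.route?.path || 'unknown';
    reqs.inc({ method: req.method, route, status_class: sclass });
    dur.observe({ method: req.method, route, status_class: sclass }, seconds);
    if (res.statusCode >= 500) errs.inc({ method: req.method, route });
  });
  next();
});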

req.route.path gives the matched template (/cart), not req.url which includes query strings. That one line prevents cardinality explosion.
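
One caveat when routes live in mounted routers: req.route.path is relative to the mount point, so two routers can both report /cart. A possible refinement, sketched here as a hypothetical helper (routeLabel is not part of Express or prom-client), is to prefix req.baseUrl. It reads only req.baseUrl and req.route.path, so it can be exercised with plain objects.

```javascript
// Hypothetical helper: derive a bounded `route` label from an Express-style request.
function routeLabel(req) {
  // No matched string route (404s, scanners probing random URLs): one shared value.
  if (!req.route || typeof req.route.path !== 'string') return 'unknown';
  // baseUrl is the path the router was mounted on, e.g. '/api'; empty at the top level.
  return (req.baseUrl || '') + req.route.path;
}

console.log(routeLabel({ baseUrl: '/api', route: { path: '/cart/:id' } })); // /api/cart/:id
console.log(routeLabel({ baseUrl: '', route: { path: '/cart' } }));         // /cart
console.log(routeLabel({ url: '/cart?u=12345' }));                          // unknown
```

The label set stays bounded by the number of route templates in the codebase, regardless of what URLs clients send.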

Quiz

A team alerts on the AVERAGE request latency across all replicas. Why is this dangerous?

Quiz

A service emits an Errors counter labelled by exact error_message string. After a buggy release that throws unique stack traces, the metrics backend bill triples overnight. Why?

Quiz

A senior engineer claims histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m]))) (no 'by (le)') gives the fleet-wide p99. Why is this wrong?

Recall before you leave
  1. Why must RED Duration be a histogram rather than sum/count (average)?
  2. What does the 'by (le)' clause do in a histogram_quantile query, and what happens without it?
  3. Name three label values forbidden on RED metrics and one label value that is always allowed.

Recap

RED in Prometheus is three metric groups: http_requests_total (counter for Rate), http_request_errors_total (counter for Errors), and http_request_duration_seconds (histogram for Duration). Duration must be a histogram because the average masks tail behavior that users feel — histogram_quantile reads per-bucket counts and interpolates the percentile, but only when sum by (le) preserves the bucket-boundary label. Bucket selection decides p99 accuracy: choose 10–15 buckets densest around the SLO target with adjacent buckets differing by ≤2× near the SLO. Label discipline is the other half: use route templates, HTTP method, and status class — never user IDs, request IDs, or exact error messages. Each unique label combination is a separate time series, billed separately, and stored in RAM on the Prometheus server.

