Cardinality as a cost driver: labels, PII, exemplars, and sampling

Every unique label combination is a separate time series charged separately in RAM and in hosted billing. The discipline is iron: only bounded, actionable labels go on metrics — everything else lives in logs or traces, joined by exemplars.

Cloudflare 2022: a global outage was preceded by Prometheus servers OOMing under cardinality from a new label on the request-duration metric. The fix landed in 90 minutes — but the post-mortem mandated a per-team cardinality budget and a CI check that rejects new label dimensions over a threshold. The alert came from Prometheus’s own meta-monitoring, not from the services it was supposed to watch.

The math of cardinality

Why does one innocent label addition turn a predictable system into an OOM incident overnight? Because cardinality multiplies — it does not add. Every Prometheus metric series lives in the TSDB head block at roughly 3 KB of RAM, and the number of series is the product, not the sum, of all label-value cardinalities:

series count = |route| × |method| × |status_class| × |service| × |region|

For a service with 200 routes, 5 methods, 4 status classes, and 50 pods in each of 3 regions (the per-pod label is the fourth factor):

200 × 5 × 4 × 50 × 3 = 600,000 series
600,000 × 3 KB = ~1.8 GB just for this one metric

Now add user_id with 100k active users:

600,000 × 100,000 = 60 billion series

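Sixty billion is the theoretical ceiling, since not every user hits every route, but the ceiling does not need to be reached. At ~3 KB per series, a 16 GB Prometheus server holds roughly 5 million series in the head block, so it runs out of memory before even 0.01% of those combinations have appeared. The TSDB cannot index that many series, and the append path serializes on the head mutex.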
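The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the lesson's rough figure of 3 KB per series; the label names and the helper functions `series_count` and `head_memory_gb` are illustrative, not part of any Prometheus API.

```python
from math import prod

# Rough per-series head-block cost, taken from the lesson's ~3 KB estimate.
BYTES_PER_SERIES = 3_000


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series: the product of every label's distinct value count."""
    return prod(label_cardinalities.values())


def head_memory_gb(series: int, bytes_per_series: int = BYTES_PER_SERIES) -> float:
    """Approximate head-block RAM in GB for a given series count."""
    return series * bytes_per_series / 1e9


# Hypothetical label set matching the worked example.
labels = {"route": 200, "method": 5, "status_class": 4, "pod": 50, "region": 3}

bounded = series_count(labels)
print(bounded, head_memory_gb(bounded))  # 600000 1.8

# One unbounded label multiplies every existing series by its value count.
with_user_id = series_count({**labels, "user_id": 100_000})
print(with_user_id)  # 60000000000
```

Running the estimate before a label ships is cheaper than discovering the product in production.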

Bounded labels multiply to a fixed, plannable series count. Adding one unbounded label (user_id) multiplies every existing series by its value count — 600K becomes 60B and the TSDB OOMs.

The cost in hosted backends

At roughly $0.05 per custom metric per month on Datadog (2024 list pricing, where each unique tag combination counts as one custom metric), an unbounded user_id label that grows to 1M series adds about $50k/month, overnight, from one careless label.

Because cost scales linearly with series count, a cardinality explosion is a financial incident, not just a performance issue. The next section covers why it can also be a security incident.
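The linear relationship between series and bill can be sketched as follows. The flat rate of 5 cents per series per month is an assumption taken from the figure above, and `monthly_cost_usd` and `added_cost_usd` are hypothetical helpers; real hosted pricing usually has tiers and allotments that this ignores.

```python
# Assumed flat rate in cents per series per month (from the lesson's ~$0.05 figure).
PRICE_CENTS_PER_SERIES = 5


def monthly_cost_usd(series: int, price_cents: int = PRICE_CENTS_PER_SERIES) -> float:
    """Monthly bill in USD for a given number of billed series."""
    return series * price_cents / 100


def added_cost_usd(current_series: int, new_label_values: int) -> float:
    """Extra monthly cost if every existing series splits new_label_values ways."""
    extra_series = current_series * new_label_values - current_series
    return monthly_cost_usd(extra_series)


print(monthly_cost_usd(1_000_000))  # 50000.0
print(added_cost_usd(600_000, 3))   # 60000.0
```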

The PII security angle

A naive Errors counter labelled by error_message or stack_trace publishes exception text into the metrics scrape, which is often less access-controlled than the application database. If the message contains user input — “could not find user alice@example.com” — that PII lands in a metrics backend that the entire engineering org can read.

Real incident: a payments service in 2021 leaked customer phone numbers via a poorly named failed_phone label. The post-mortem mandated a global pre-commit hook that flags any new label whose name matches a known-PII pattern.

Label audit rule: label by error class (auth_failed, db_timeout, parse_error), never by error content. Audit label names as a security review item, not just a performance review item.
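One way to follow the rule is to map exceptions onto a fixed set of class names before anything reaches a metric. The sketch below is an assumption about how such a mapping might look; the exception-to-class choices and the `record_error` helper are hypothetical, and a plain `Counter` stands in for a real metrics client.

```python
from collections import Counter

# The complete, bounded set of values the error label may take.
ERROR_CLASSES = ("auth_failed", "db_timeout", "parse_error", "other")

# Stand-in for a metrics counter keyed by error class.
errors_total: Counter = Counter()


def error_class(exc: BaseException) -> str:
    """Map an exception to a bounded class name; never inspect its message."""
    if isinstance(exc, PermissionError):
        return "auth_failed"
    if isinstance(exc, TimeoutError):
        return "db_timeout"
    if isinstance(exc, ValueError):
        return "parse_error"
    return "other"


def record_error(exc: BaseException) -> str:
    """Increment the counter under the class label and return that label."""
    label = error_class(exc)
    errors_total[label] += 1
    return label


# The message contains PII, but only the class name becomes a label value.
print(record_error(PermissionError("could not find user alice@example.com")))
# auth_failed
```

The exception text can still go to a log line with stricter access control; the metric only ever sees one of four values.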

Label type	Example	Where it belongs
Bounded, actionable	route, method, status_class, region	Metric labels ✓
Unbounded, high-cardinality	user_id, request_id, session_token	Logs / traces only ✗
PII content	email, phone, ip_address, stack_trace	Never in metrics ✗✗
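The table can be turned into a rough label-name audit of the kind a pre-commit hook might run. This is a minimal sketch: the two patterns cover only the examples in the table plus error_message, and `audit_label` and `audit_labels` are hypothetical names. A real hook would need a longer, team-maintained pattern list.

```python
import re

# Name fragments that suggest PII content (from the table above).
PII_PATTERN = re.compile(r"email|phone|ip_address|stack_trace|error_message")
# Name endings that suggest an unbounded identifier.
UNBOUNDED_PATTERN = re.compile(r"(user|request|session|trace)_(id|token)$")


def audit_label(name: str) -> str:
    """Classify a label name as 'pii', 'unbounded', or 'ok'."""
    lowered = name.lower()
    if PII_PATTERN.search(lowered):
        return "pii"
    if UNBOUNDED_PATTERN.search(lowered):
        return "unbounded"
    return "ok"


def audit_labels(names: list[str]) -> dict[str, str]:
    """Return only the label names that fail the audit, with the reason."""
    return {n: verdict for n in names if (verdict := audit_label(n)) != "ok"}


print(audit_labels(["route", "method", "failed_phone", "user_id"]))
# {'failed_phone': 'pii', 'user_id': 'unbounded'}
```

Name matching catches careless labels, not determined ones; it complements a review of the values, it does not replace it.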

Exemplars: the bridge between metrics and traces

If you cannot put trace_id in a metric label (unbounded cardinality), how do you jump from a p99 spike to the slow request that caused it? Exemplars.

Prometheus 2.32+ and OpenTelemetry’s histogram implementation both support exemplars: sampled trace IDs attached to individual histogram observations. In Prometheus they travel in the OpenMetrics exposition format and are stored only when the server runs with the exemplar-storage feature flag enabled. When histogram_quantile shows p99 at 800 ms, clicking the spike in Grafana reveals the exemplar, a trace ID from a request that landed in that bucket. One click jumps to the full span tree.

# HELP http_request_duration_seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.2"} 14324 # {trace_id="abc123"} 0.183
http_request_duration_seconds_bucket{le="0.4"} 14329

The exemplar trace_id="abc123" is attached to the specific observation 0.183, not added as a label to the metric. Cardinality stays flat; drilldown is preserved.
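To make the line format concrete, the sketch below renders bucket lines like the ones above from plain values. It is a hand-rolled formatter for illustration only, not a client library; `bucket_line` is a hypothetical helper, and real services would let their metrics client emit this text.

```python
from typing import Optional, Tuple


def bucket_line(
    metric: str,
    le: str,
    cumulative_count: int,
    exemplar: Optional[Tuple[str, float]] = None,
) -> str:
    """Render one histogram bucket line, optionally with a trace_id exemplar."""
    line = f'{metric}_bucket{{le="{le}"}} {cumulative_count}'
    if exemplar is not None:
        trace_id, observed_value = exemplar
        # The exemplar follows '#': it annotates the sample, it is not a label.
        line += f' # {{trace_id="{trace_id}"}} {observed_value}'
    return line


name = "http_request_duration_seconds"
print(bucket_line(name, "0.2", 14324, ("abc123", 0.183)))
print(bucket_line(name, "0.4", 14329))
```

Both lines belong to the same two series (le="0.2" and le="0.4") no matter how many different trace IDs appear after the `#`.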

Aggregation vs sampling

RED + USE metrics are pre-aggregated — they summarise across all requests or all wall-clock time without sampling. A histogram’s bucket counts are incrementally updated; you never throw away an observation.
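The incremental update is easy to see in a toy histogram. This is a simplified sketch, not how any particular client library stores its buckets; the `Histogram` class and its bucket bounds are illustrative assumptions.

```python
from bisect import bisect_left


class Histogram:
    """Toy fixed-bucket histogram: memory stays constant as observations arrive."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        # One counter per upper bound, plus a final slot for +Inf.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        """Count the observation in the first bucket whose bound is >= value."""
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> list[int]:
        """Cumulative counts in exposition order (each 'le' includes lower buckets)."""
        running, out = 0, []
        for count in self.counts:
            running += count
            out.append(running)
        return out


h = Histogram([0.2, 0.4])
for latency in (0.183, 0.2, 0.3, 0.9):
    h.observe(latency)
print(h.cumulative())  # [2, 3, 4]
```

Every observation is counted, yet storage is three integers and a sum regardless of traffic volume.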

Traces are the opposite: sampled (typically 0.1–5%) because each trace carries the full request path with all spans. The senior pattern:

Pre-aggregate RED + USE at the source — 100% coverage, bounded storage.
Sample traces: head-based at 5% to cap baseline cost, plus tail-based rules that keep 100% of errors and slow requests (duration > SLO target), so the rare slow path always has a trace.
Exemplars bridge the two: the metric shows the spike (aggregate), the exemplar points to a specific trace (sample).
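The sampling rule in the list above can be sketched as a single keep-or-drop decision. This is a simplified illustration under stated assumptions: `head_sampled` and `keep_trace` are hypothetical helpers, the decision is made on a finished trace, and real collectors split head and tail decisions across different components.

```python
import hashlib


def head_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic probabilistic sampling: hash the trace ID into [0, 1)."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate


def keep_trace(
    trace_id: str,
    duration_s: float,
    is_error: bool,
    slo_s: float,
    head_rate: float = 0.05,
) -> bool:
    """Keep every error and every request slower than the SLO; sample the rest."""
    if is_error or duration_s > slo_s:
        return True
    return head_sampled(trace_id, head_rate)


print(keep_trace("trace-1", duration_s=2.0, is_error=False, slo_s=0.5))  # True
print(keep_trace("trace-2", duration_s=0.1, is_error=True, slo_s=0.5))   # True
```

Hashing the trace ID instead of calling a random generator means every service that sees the same trace reaches the same decision.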

Metrics are pre-aggregated at full coverage but must stay low-cardinality; traces carry the unbounded data but only for a sampled subset. Exemplars link a metric spike to one of those sampled traces.

The four-signal stack — RED metrics, USE metrics, sampled traces, sampled profiles — composes if and only if they share label keys (http.route, service.name, status_class). OpenTelemetry’s semantic conventions formalise these join keys.

Why this works

Self-referential observability: Prometheus itself emits RED and USE metrics. Three of them are the signals that caught the Cloudflare and Discord 2022–2023 incidents: prometheus_tsdb_head_series (growing too fast → cardinality explosion), the 0.99 quantile of prometheus_engine_query_duration_seconds (too slow → queries timing out), and the 0.99 quantile of prometheus_rule_evaluation_duration_seconds (too slow → alert delays). In both cases Prometheus’s own meta-monitoring fired before the affected services’ RED alerts did. Monitoring the monitor is not optional.
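A growth check on the head series count might look like the sketch below. It assumes the samples of prometheus_tsdb_head_series have already been fetched as (timestamp, value) pairs; `growth_ratio`, `cardinality_alert`, and the 20% threshold are hypothetical choices, and in practice this logic would normally live in an alerting rule rather than in application code.

```python
def growth_ratio(samples: list[tuple[int, int]]) -> float:
    """Relative growth between the first and last (unix_seconds, head_series) sample."""
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    if s0 <= 0 or t1 <= t0:
        raise ValueError("need a positive starting count and increasing timestamps")
    return (s1 - s0) / s0


def cardinality_alert(samples: list[tuple[int, int]], max_growth: float = 0.2) -> bool:
    """Fire when head series grew by more than max_growth over the window."""
    return growth_ratio(samples) > max_growth


# One hour apart: 600k -> 900k series is 50% growth, well over the threshold.
print(cardinality_alert([(0, 600_000), (3600, 900_000)]))  # True
print(cardinality_alert([(0, 600_000), (3600, 610_000)]))  # False
```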

Quiz

A team adds a new label 'country_code' (220 possible values) to their existing RED metrics. Their current series count is 10,000. Roughly how many series will they have after the change?

Quiz

An engineer wants to jump from a p99 latency spike in a Prometheus histogram to the specific slow request. The team cannot add trace_id as a metric label (cardinality). What is the correct solution?

Recall before you leave

01
A service with 50 routes × 5 methods × 4 status classes has 1,000 series for its RED metrics. The team adds 'customer_tier' with 3 values. How many series now, and why?
02
What is the PII risk of labelling metrics by error_message, and what is the correct alternative?
03
What are exemplars and how do they solve the trace_id cardinality problem?

Recap

Cardinality is the number of unique label-value combinations on a Prometheus metric — each combination is a separate time series stored in RAM at ~3 KB and billed separately in hosted backends. One unbounded label (user_id, request_id, error_message content) can grow a 200-series service to millions of series and crash the Prometheus TSDB or add tens of thousands of dollars to the monthly bill overnight. The iron rule: only bounded, actionable labels go on metrics — route templates, HTTP methods, status classes, service name, region. Everything high-cardinality (trace IDs, user IDs, error content) lives in logs and traces. Exemplars bridge the gap: Prometheus 2.32+ and OTel histograms support attaching a sampled trace ID to specific observations, letting Grafana jump from a p99 spike to the slow request’s full span tree without adding trace_id as a cardinality-multiplying label. PII in labels is both a cardinality problem and a data-leak problem — audit label names as a security review item. Now when you review a pull request that adds a new label, your first question is “what is the upper bound on this value?” — and if the answer is “users” or “requests,” it belongs in a log, not a metric.
