
Observability

Cardinality as a cost driver: labels, PII, exemplars, and sampling

Crux: Every unique label combination is a separate time series, charged separately in RAM and in hosted billing. The discipline is iron: only bounded, actionable labels go on metrics; everything else lives in logs or traces, joined by exemplars.

Cloudflare 2022: a global outage was preceded by Prometheus servers OOMing under cardinality from a new label on the request-duration metric. The fix landed in 90 minutes — but the post-mortem mandated a per-team cardinality budget and a CI check that rejects new label dimensions over a threshold. The alert came from Prometheus’s own meta-monitoring, not from the services it was supposed to watch.

The math of cardinality

Every active Prometheus series occupies roughly 3 KB of RAM in the TSDB head block. In the worst case, the number of series for one metric is the product of the number of distinct values of each of its labels:

series count = |route| × |method| × |status_class| × |pod| × |region|

For a service with 200 routes, 5 methods, 4 status classes, and 50 pods in each of 3 regions:

200 × 5 × 4 × 50 × 3 = 600,000 series
600,000 × 3 KB = ~1.8 GB just for this one metric

Now add user_id with 100k active users:

600,000 × 100,000 = 60 billion series

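This is a worst-case bound, since a series exists only once its label combination has actually been observed. But even 1% of it, 600 million series at ~3 KB each, would need about 1.8 TB of RAM, so a 16 GB Prometheus server is OOM-killed almost immediately. The TSDB cannot index that many series, and the append path serializes on the head mutex.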
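The arithmetic above can be reproduced with a short sketch. The ~3 KB per series figure is the rough estimate used in this section, and the function names are illustrative, not part of any Prometheus tooling.

```python
from math import prod

BYTES_PER_SERIES = 3_000  # ~3 KB per head series, the rough figure used in the text


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count: the product of every label's distinct values."""
    return prod(label_cardinalities.values())


def head_ram_gb(series: int, bytes_per_series: int = BYTES_PER_SERIES) -> float:
    """Approximate head-block RAM in decimal gigabytes."""
    return series * bytes_per_series / 1e9


labels = {"route": 200, "method": 5, "status_class": 4, "pod": 50, "region": 3}
base = series_count(labels)
with_user_id = series_count({**labels, "user_id": 100_000})

print(f"{base:,} series, ~{head_ram_gb(base):.1f} GB")
print(f"{with_user_id:,} series once user_id is added")
```

Because the count is a product, adding one label multiplies the total by that label's cardinality; no other label has to change for the series count to explode.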

The cost in hosted backends

At Datadog's ~$0.05 / custom metric / host / month (2024 pricing), one unbounded user_id label that grows to 1M series adds 1,000,000 × $0.05 = ~$50k/month, and it can do so overnight.

Cost scales linearly with series count, which is what makes a cardinality explosion a financial incident and, once label values carry user data, a security incident, not just a performance issue.
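A minimal sketch of that linear cost model, assuming a flat per-series price. The $0.05 default is the figure quoted above; real hosted pricing has tiers and allotments that this sketch ignores.

```python
def monthly_cost_usd(series: int, price_per_series: float = 0.05) -> float:
    """Linear billing model: every series is charged the same flat monthly rate."""
    return series * price_per_series


for series in (10_000, 100_000, 1_000_000):
    print(f"{series:>9,} series -> ${monthly_cost_usd(series):>9,.0f}/month")
```

Under this model there is no economy of scale: ten times the series is ten times the bill.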

The PII security angle

A naive Errors counter labelled by error_message or stack_trace publishes exception text into the metrics scrape, which is often less access-controlled than the application database. If the message contains user input — “could not find user alice@example.com” — that PII lands in a metrics backend that the entire engineering org can read.

Real incident: a payments service in 2021 leaked customer phone numbers via a poorly-named failed_phone label. The post-mortem mandated a global pre-commit hook that flags any new label named with a known-PII pattern.

Label audit rule: label by error class (auth_failed, db_timeout, parse_error), never by error content. Audit label names as a security review item, not just a performance review item.
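One way to enforce "class, never content" is to map exceptions to a closed set of label values before anything reaches the metric. The sketch below uses a `collections.Counter` as a stand-in for a real metrics counter, and the exception-to-class mapping is hypothetical.

```python
from collections import Counter
from enum import Enum


class ErrorClass(str, Enum):
    AUTH_FAILED = "auth_failed"
    DB_TIMEOUT = "db_timeout"
    PARSE_ERROR = "parse_error"
    UNKNOWN = "unknown"


# Hypothetical mapping from exception types to bounded label values.
_CLASS_BY_EXCEPTION = {
    PermissionError: ErrorClass.AUTH_FAILED,
    TimeoutError: ErrorClass.DB_TIMEOUT,
    ValueError: ErrorClass.PARSE_ERROR,
}


def classify(exc: BaseException) -> ErrorClass:
    """Return a bounded error class; the exception message is never read."""
    for exc_type, error_class in _CLASS_BY_EXCEPTION.items():
        if isinstance(exc, exc_type):
            return error_class
    return ErrorClass.UNKNOWN


# Stand-in for a metrics counter keyed by the error_class label.
errors_total = Counter()


def record_error(exc: BaseException) -> None:
    errors_total[classify(exc).value] += 1


record_error(ValueError("could not find user alice@example.com"))
record_error(TimeoutError("query exceeded 5s"))
print(dict(errors_total))
```

The label can only ever take four values, and the email address in the first exception never leaves the process through the metrics path. The full message still belongs in a log line or span event with proper access control.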

Label type | Example | Where it belongs
Bounded, actionable | route, method, status_class, region | Metric labels ✓
Unbounded, high-cardinality | user_id, request_id, session_token | Logs / traces only ✗
PII content | email, phone, ip_address, stack_trace | Never in metrics ✗✗
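The classification above can be approximated by a name-based check of the kind a pre-commit hook or CI step might run. This is a sketch only: the patterns are illustrative assumptions, and a name check cannot prove a label is bounded, it can only flag suspicious names for review.

```python
import re

# Illustrative patterns only; a real hook would maintain its own reviewed list.
PII_PATTERN = re.compile(r"(email|phone|ip_addr|ssn|stack_trace|error_message)", re.I)
UNBOUNDED_PATTERN = re.compile(r"(^|_)(id|uuid|token)$", re.I)


def audit_label(name: str) -> str:
    """Classify a label name as 'ok' or a rejection reason, by name alone."""
    if PII_PATTERN.search(name):
        return "reject: possible PII"
    if UNBOUNDED_PATTERN.search(name):
        return "reject: likely unbounded"
    return "ok"


for label in ("route", "status_class", "user_id", "session_token", "failed_phone"):
    print(f"{label:>14}: {audit_label(label)}")
```

The PII check runs first so that a label such as a hypothetical `phone_id` is reported as a data-leak risk rather than only as a cardinality risk.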

Exemplars: the bridge between metrics and traces

If you cannot put trace_id in a metric label (unbounded cardinality), how do you jump from a p99 spike to the slow request that caused it? Exemplars.

Prometheus 2.32+ and OpenTelemetry’s histogram implementation both support exemplars: sampled trace IDs attached to individual histogram observations. When histogram_quantile shows p99 at 800 ms, clicking the spike in Grafana reveals the exemplar — a trace ID from a request that landed in that bucket. One click jumps to the full span tree.

# HELP http_request_duration_seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.2"} 14324 # {trace_id="abc123"} 0.183
http_request_duration_seconds_bucket{le="0.4"} 14329

The exemplar trace_id="abc123" is attached to the specific observation 0.183, not added as a label to the metric. Cardinality stays flat; drilldown is preserved.
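To make the distinction concrete, the sketch below splits a sample line like the ones above into its label set and its exemplar. It is a simplified regex for illustration, not a complete OpenMetrics parser (it ignores timestamps and escaped characters), and `parse_sample` is a made-up helper name.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'           # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'                    # optional label set
    r'\s+(?P<value>\S+)'                             # sample value
    r'(?:\s+#\s+\{(?P<ex_labels>[^}]*)\}\s+(?P<ex_value>\S+))?\s*$'  # optional exemplar
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str) -> dict:
    """Split one sample line into series labels, value, and optional exemplar."""
    m = SAMPLE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognised sample line: {line!r}")
    exemplar = None
    if m["ex_labels"] is not None:
        exemplar = {
            "labels": dict(LABEL_RE.findall(m["ex_labels"])),
            "value": float(m["ex_value"]),
        }
    return {
        "name": m["name"],
        "labels": dict(LABEL_RE.findall(m["labels"] or "")),
        "value": float(m["value"]),
        "exemplar": exemplar,
    }


with_exemplar = parse_sample(
    'http_request_duration_seconds_bucket{le="0.2"} 14324 # {trace_id="abc123"} 0.183'
)
without_exemplar = parse_sample('http_request_duration_seconds_bucket{le="0.4"} 14329')
print(with_exemplar["labels"], with_exemplar["exemplar"])
print(without_exemplar["labels"], without_exemplar["exemplar"])
```

Both lines identify their series by `le` alone. The trace ID appears only in the exemplar, so however many distinct trace IDs are sampled, the number of series does not change.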

Aggregation vs sampling

RED + USE metrics are pre-aggregated — they summarise across all requests or all wall-clock time without sampling. A histogram’s bucket counts are incrementally updated; you never throw away an observation.

Traces are the opposite: sampled (typically 0.1–5%) because each trace carries the full request path with all spans. The senior pattern:

  • Pre-aggregate RED + USE at the source — 100% coverage, bounded storage.
  • Sample traces: head-based at 5% for cost, tail-based at 100% for errors and slow requests (duration > SLO target) — so the rare slow path always has a trace.
  • Exemplars bridge the two: the metric shows the spike (aggregate), the exemplar points to a specific trace (sample).
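The sampling rules in the list above can be sketched as a single keep/drop decision. This is a simplification under stated assumptions: in a real pipeline the head decision is made when the trace starts and the tail decision only after it completes (typically in a collector), and the function names and hashing scheme here are illustrative.

```python
import hashlib

HEAD_SAMPLE_RATE = 0.05  # the 5% head-based rate from the list above


def head_sampled(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Deterministic head decision: hash the trace ID into [0, 1) and compare to the rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def keep_trace(trace_id: str, is_error: bool, duration_s: float, slo_s: float) -> bool:
    """Tail rule keeps every error and every over-SLO trace; the rest fall back to the head rate."""
    if is_error or duration_s > slo_s:
        return True
    return head_sampled(trace_id)


print(keep_trace("trace-1", is_error=True, duration_s=0.02, slo_s=0.3))
print(keep_trace("trace-2", is_error=False, duration_s=0.9, slo_s=0.3))
kept = sum(keep_trace(f"trace-{i}", False, 0.05, 0.3) for i in range(10_000))
print(f"kept {kept} of 10,000 fast, successful traces")
```

Hashing the trace ID rather than calling a random number generator means every service that sees the same trace ID makes the same head decision, so sampled traces are not left with missing spans.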

The four-signal stack — RED metrics, USE metrics, sampled traces, sampled profiles — composes if and only if they share label keys (http.route, service.name, status_class). OpenTelemetry’s semantic conventions formalise these join keys.

Why this works

Self-referential observability: Prometheus itself emits RED and USE metrics. Three of them matter most here: prometheus_tsdb_head_series (growing too fast signals a cardinality explosion), the 0.99 quantile of prometheus_engine_query_duration_seconds (too slow means queries are timing out), and the 0.99 quantile of prometheus_rule_evaluation_duration_seconds (too slow means alerts are delayed). These are the signals that caught the Cloudflare and Discord 2022–2023 incidents. In both cases Prometheus's own meta-monitoring fired before the affected services' RED alerts did. Monitoring the monitor is not optional.
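As a rough illustration of the first signal, the sketch below flags a head-series count that is growing too fast. It assumes samples of prometheus_tsdb_head_series have already been fetched as (unix timestamp, value) pairs; the 10% per hour threshold and the function names are hypothetical, and in practice this check would more likely live in an alerting rule than in application code.

```python
def growth_per_hour(samples: list[tuple[float, float]]) -> float:
    """Relative growth per hour between the first and last (unix_ts, series) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600
    if hours <= 0 or s0 <= 0:
        raise ValueError("samples must span positive time and start above zero")
    return (s1 - s0) / s0 / hours


def cardinality_alert(samples: list[tuple[float, float]], max_growth_per_hour: float = 0.10) -> bool:
    """True when the series count grows faster than the (hypothetical) hourly budget."""
    return growth_per_hour(samples) > max_growth_per_hour


steady = [(0, 600_000), (3600, 603_000)]     # +0.5% over one hour
exploding = [(0, 600_000), (1800, 900_000)]  # +50% in thirty minutes
print(cardinality_alert(steady), cardinality_alert(exploding))
```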

Quiz

A team adds a new label 'country_code' (220 possible values) to their existing RED metrics. Their current series count is 10,000. Roughly how many series will they have after the change?

Quiz

An engineer wants to jump from a p99 latency spike in a Prometheus histogram to the specific slow request. The team cannot add trace_id as a metric label (cardinality). What is the correct solution?

Recall before you leave
  1. A service with 50 routes × 5 methods × 4 status classes has 1,000 series for its RED metrics. The team adds 'customer_tier' with 3 values. How many series now, and why?
  2. What is the PII risk of labelling metrics by error_message, and what is the correct alternative?
  3. What are exemplars and how do they solve the trace_id cardinality problem?
Recap

Cardinality is the number of unique label-value combinations on a Prometheus metric — each combination is a separate time series stored in RAM at ~3 KB and billed separately in hosted backends. One unbounded label (user_id, request_id, error_message content) can grow a 200-series service to millions of series and crash the Prometheus TSDB or add tens of thousands of dollars to the monthly bill overnight. The iron rule: only bounded, actionable labels go on metrics — route templates, HTTP methods, status classes, service name, region. Everything high-cardinality (trace IDs, user IDs, error content) lives in logs and traces. Exemplars bridge the gap: Prometheus 2.32+ and OTel histograms support attaching a sampled trace ID to specific observations, letting Grafana jump from a p99 spike to the slow request’s full span tree without adding trace_id as a cardinality-multiplying label. PII in labels is both a cardinality problem and a data-leak problem — audit label names as a security review item.


Trademarks belong to their respective owners. Editorial reference only.