Observability
Cardinality as a cost driver: labels, PII, exemplars, and sampling
Cloudflare 2022: a global outage was preceded by Prometheus servers OOMing under cardinality from a new label on the request-duration metric. The fix landed in 90 minutes — but the post-mortem mandated a per-team cardinality budget and a CI check that rejects new label dimensions over a threshold. The alert came from Prometheus’s own meta-monitoring, not from the services it was supposed to watch.
The math of cardinality
Every Prometheus metric series lives in the TSDB head block at roughly 3 KB of RAM. The number of series is the product of all label-value cardinalities:
series count = |route| × |method| × |status_class| × |pod| × |region|
For a service with 200 routes, 5 methods, 4 status classes, and 50 pods in each of 3 regions:
200 × 5 × 4 × 50 × 3 = 600,000 series
600,000 × 3 KB = ~1.8 GB just for this one metric
Now add user_id with 100k active users:
600,000 × 100,000 = 60 billion series
That is the worst case, where every user touches every label combination, but even a small fraction of it crashes a 16 GB Prometheus server in seconds. The TSDB cannot index that many series, and the append path serializes on the head mutex.
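The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming the rough 3 KB per series figure from the text; the function names and the `SERIES_BUDGET` value are hypothetical, chosen only to illustrate what a per-team budget check might compare against.

```python
from math import prod

BYTES_PER_SERIES = 3_000   # ~3 KB per head series, the rough figure used in the text
SERIES_BUDGET = 1_000_000  # hypothetical per-team budget, not a Prometheus setting


def series_count(label_cardinalities: dict) -> int:
    """Worst-case series count: the product of every label's distinct-value count."""
    return prod(label_cardinalities.values())


def head_ram_bytes(series: int, bytes_per_series: int = BYTES_PER_SERIES) -> int:
    """Approximate head-block RAM for the given number of series."""
    return series * bytes_per_series


def within_budget(label_cardinalities: dict, budget: int = SERIES_BUDGET) -> bool:
    """True if the worst-case series count fits inside the budget."""
    return series_count(label_cardinalities) <= budget


labels = {"route": 200, "method": 5, "status_class": 4, "pod": 50, "region": 3}
base = series_count(labels)
with_user = series_count({**labels, "user_id": 100_000})

print(base)                         # 600000
print(head_ram_bytes(base) / 1e9)   # 1.8 (GB)
print(with_user)                    # 60000000000
print(within_budget({**labels, "user_id": 100_000}))  # False
```

Because the count is a product, one extra label multiplies the whole total, which is why a budget check has to look at the label set rather than at any single label.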
The cost in hosted backends
At Datadog’s ~$0.05 / custom metric / host / month (2024 pricing), an unbounded user_id label that grows to 1M series adds 1,000,000 × $0.05 = ~$50k/month overnight, all from one careless label.
The cardinality-to-cost linearity is what makes this a security and financial incident, not just a performance issue.
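The linear relationship can be made concrete with a small estimator. This is a sketch only: the price is the approximate figure quoted above, held in cents to keep the arithmetic exact, and real contracts differ, so treat `PRICE_CENTS_PER_SERIES_MONTH` as an assumption.

```python
PRICE_CENTS_PER_SERIES_MONTH = 5  # ~$0.05 per series per month, the approximate figure from the text


def monthly_cost_usd(series: int, price_cents: int = PRICE_CENTS_PER_SERIES_MONTH) -> float:
    """Monthly bill in USD; cost grows linearly with the number of series."""
    return series * price_cents / 100


for series in (10_000, 100_000, 1_000_000):
    print(series, monthly_cost_usd(series))
# 10000 500.0
# 100000 5000.0
# 1000000 50000.0
```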
The PII security angle
A naive Errors counter labelled by error_message or stack_trace publishes exception text into the metrics scrape, which is often less access-controlled than the application database. If the message contains user input — “could not find user alice@example.com” — that PII lands in a metrics backend that the entire engineering org can read.
Real incident: a payments service in 2021 leaked customer phone numbers via a poorly-named failed_phone label. The post-mortem mandated a global pre-commit hook that flags any new label named with a known-PII pattern.
Label audit rule: label by error class (auth_failed, db_timeout, parse_error), never by error content. Audit label names as a security review item, not just a performance review item.
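One way to follow the rule is to map exceptions to a small fixed set of class names before anything reaches a metric. The sketch below uses a plain `collections.Counter` as a stand-in for a labelled counter; the `ERROR_CLASSES` mapping and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical mapping from exception type to a bounded error class.
ERROR_CLASSES = {
    PermissionError: "auth_failed",
    TimeoutError: "db_timeout",
    ValueError: "parse_error",
}

errors_total = Counter()  # stand-in for a counter metric labelled by error class


def error_class(exc: BaseException) -> str:
    """Return a bounded class name; the exception message is never inspected."""
    for exc_type, name in ERROR_CLASSES.items():
        if isinstance(exc, exc_type):
            return name
    return "unknown"


def record_error(exc: BaseException) -> None:
    """Count the error under its class, so user input cannot become a label value."""
    errors_total[error_class(exc)] += 1


record_error(ValueError("could not find user alice@example.com"))
record_error(TimeoutError("query exceeded 5s"))
print(dict(errors_total))  # {'parse_error': 1, 'db_timeout': 1}
```

The label can only ever take one of four values here, and the message text stays in the log line or the trace, where access control is usually tighter.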
| Label type | Example | Where it belongs |
|---|---|---|
| Bounded, actionable | route, method, status_class, region | Metric labels ✓ |
| Unbounded, high-cardinality | user_id, request_id, session_token | Logs / traces only ✗ |
| PII content | email, phone, ip_address, stack_trace | Never in metrics ✗✗ |
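The table above can be turned into a mechanical check on label names, in the spirit of the pre-commit hook mentioned earlier. This is a minimal sketch: the patterns are taken from the table's examples, the names are hypothetical, and a real hook would load its deny-list from a reviewed policy file.

```python
import re

# Name fragments taken from the table above; an illustrative deny-list, not a complete policy.
PII_NAME = re.compile(r"email|phone|ip_address|stack_trace|error_message", re.IGNORECASE)
UNBOUNDED_NAMES = {"user_id", "request_id", "trace_id", "session_token"}


def audit_labels(label_names):
    """Return (label, reason) pairs for every label name that should not be on a metric."""
    findings = []
    for name in label_names:
        if PII_NAME.search(name):
            findings.append((name, "pii"))
        elif name.lower() in UNBOUNDED_NAMES:
            findings.append((name, "unbounded"))
    return findings


print(audit_labels(["route", "method", "failed_phone", "user_id"]))
# [('failed_phone', 'pii'), ('user_id', 'unbounded')]
```

A name-based check only catches labels that are named honestly, so it complements a review of label values rather than replacing it.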
Exemplars: the bridge between metrics and traces
If you cannot put trace_id in a metric label (unbounded cardinality), how do you jump from a p99 spike to the slow request that caused it? Exemplars.
Prometheus 2.32+ and OpenTelemetry’s histogram implementation both support exemplars: sampled trace IDs attached to individual histogram observations. When histogram_quantile shows p99 at 800 ms, clicking the spike in Grafana reveals the exemplar — a trace ID from a request that landed in that bucket. One click jumps to the full span tree.
# HELP http_request_duration_seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.2"} 14324 # {trace_id="abc123"} 0.183
http_request_duration_seconds_bucket{le="0.4"} 14329
The exemplar trace_id="abc123" is attached to the specific observation 0.183, not added as a label to the metric. Cardinality stays flat; drilldown is preserved.
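To see why cardinality stays flat, it can help to model a histogram that stores at most one exemplar per bucket. The class below is a toy written for illustration, not the API of any Prometheus client library; `ToyHistogram` and its methods are hypothetical names.

```python
import bisect


class ToyHistogram:
    """Toy cumulative histogram keeping one exemplar per bucket (not a real client library)."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)                       # finite upper bounds; +Inf is implicit
        self.counts = [0] * (len(self.bounds) + 1)         # per-bucket, non-cumulative counts
        self.exemplars = [None] * (len(self.bounds) + 1)   # at most one exemplar per bucket

    def observe(self, value, trace_id=None):
        i = bisect.bisect_left(self.bounds, value)         # first bucket whose bound >= value
        self.counts[i] += 1
        if trace_id is not None:
            self.exemplars[i] = (trace_id, value)          # the latest exemplar replaces the old one

    def render(self):
        """Render bucket lines in the exposition style shown above."""
        lines, running = [], 0
        for i, bound in enumerate(self.bounds + [float("inf")]):
            running += self.counts[i]                      # buckets are cumulative on output
            le = "+Inf" if bound == float("inf") else str(bound)
            line = f'{self.name}_bucket{{le="{le}"}} {running}'
            if self.exemplars[i]:
                trace_id, value = self.exemplars[i]
                line += f' # {{trace_id="{trace_id}"}} {value}'
            lines.append(line)
        return lines


h = ToyHistogram("http_request_duration_seconds", [0.2, 0.4])
h.observe(0.183, trace_id="abc123")
h.observe(0.31)
h.observe(0.9, trace_id="def456")
print("\n".join(h.render()))
# http_request_duration_seconds_bucket{le="0.2"} 1 # {trace_id="abc123"} 0.183
# http_request_duration_seconds_bucket{le="0.4"} 2
# http_request_duration_seconds_bucket{le="+Inf"} 3 # {trace_id="def456"} 0.9
```

However many trace IDs are observed, the histogram still renders exactly three bucket lines: the exemplar slot is overwritten rather than multiplied into new series.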
Aggregation vs sampling
RED + USE metrics are pre-aggregated — they summarise across all requests or all wall-clock time without sampling. A histogram’s bucket counts are incrementally updated; you never throw away an observation.
Traces are the opposite: sampled (typically 0.1–5%) because each trace carries the full request path with all spans. The senior pattern:
- Pre-aggregate RED + USE at the source — 100% coverage, bounded storage.
- Sample traces: head-based at 5% for cost, tail-based at 100% for errors and slow requests (duration > SLO target) — so the rare slow path always has a trace.
- Exemplars bridge the two: the metric shows the spike (aggregate), the exemplar points to a specific trace (sample).
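The sampling policy in the list above can be sketched as a single keep-or-drop decision. This is an illustration under stated assumptions: the 5% rate comes from the text, `SLO_TARGET_SECONDS` is a hypothetical value, and a real tail sampler runs in a collector that buffers whole traces rather than in one function.

```python
import hashlib

HEAD_SAMPLE_RATE = 0.05     # 5% head-based sampling, as in the list above
SLO_TARGET_SECONDS = 0.5    # hypothetical SLO latency target


def head_sampled(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Deterministic head decision: hash the trace ID into [0, 1) and compare to the rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < rate


def keep_trace(trace_id: str, duration_s: float, is_error: bool) -> bool:
    """Tail decision, made once the request has finished."""
    if is_error or duration_s > SLO_TARGET_SECONDS:
        return True                  # errors and slow requests are always kept
    return head_sampled(trace_id)    # everything else falls back to the 5% head sample


print(keep_trace("trace-1", duration_s=0.12, is_error=True))   # True
print(keep_trace("trace-2", duration_s=2.40, is_error=False))  # True
```

Hashing the trace ID instead of drawing a random number means every service that sees the same trace makes the same head decision, so sampled traces are not left with missing spans.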
The four-signal stack — RED metrics, USE metrics, sampled traces, sampled profiles — composes if and only if they share label keys (http.route, service.name, status_class). OpenTelemetry’s semantic conventions formalise these join keys.
Why this works
Self-referential observability: Prometheus itself emits RED and USE metrics. Its own prometheus_tsdb_head_series (growing too fast → cardinality explosion), the 0.99 quantile of prometheus_engine_query_duration_seconds (too slow → queries timing out), and the 0.99 quantile of prometheus_rule_evaluation_duration_seconds (too slow → alert delays) are the signals that caught the Cloudflare and Discord 2022–2023 incidents. In both cases Prometheus’s own meta-monitoring fired before the affected services’ RED alerts did. Monitoring the monitor is not optional.
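The head-series signal lends itself to a simple growth check. The sketch below assumes the samples are time-ordered values of prometheus_tsdb_head_series that were already fetched by some other means; the function names and the `max_ratio` threshold are hypothetical.

```python
def series_growth_ratio(samples):
    """samples: time-ordered (unix_ts, head_series) pairs; returns last value / first value."""
    (_, first), (_, last) = samples[0], samples[-1]
    return last / first if first else float("inf")


def cardinality_alert(samples, max_ratio=1.5):
    """Flag a window in which head series grew by more than max_ratio (hypothetical threshold)."""
    return series_growth_ratio(samples) > max_ratio


steady = [(0, 600_000), (1800, 610_000), (3600, 620_000)]
exploding = [(0, 600_000), (1800, 1_500_000), (3600, 2_400_000)]

print(cardinality_alert(steady))     # False
print(cardinality_alert(exploding))  # True
```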
A team adds a new label 'country_code' (220 possible values) to their existing RED metrics. Their current series count is 10,000. Roughly how many series will they have after the change?
An engineer wants to jump from a p99 latency spike in a Prometheus histogram to the specific slow request. The team cannot add trace_id as a metric label (cardinality). What is the correct solution?
- 01. A service with 50 routes × 5 methods × 4 status classes has 1,000 series for its RED metrics. The team adds 'customer_tier' with 3 values. How many series now, and why?
- 02. What is the PII risk of labelling metrics by error_message, and what is the correct alternative?
- 03. What are exemplars and how do they solve the trace_id cardinality problem?
Cardinality is the number of unique label-value combinations on a Prometheus metric — each combination is a separate time series stored in RAM at ~3 KB and billed separately in hosted backends. One unbounded label (user_id, request_id, error_message content) can grow a 200-series service to millions of series and crash the Prometheus TSDB or add tens of thousands of dollars to the monthly bill overnight. The iron rule: only bounded, actionable labels go on metrics — route templates, HTTP methods, status classes, service name, region. Everything high-cardinality (trace IDs, user IDs, error content) lives in logs and traces. Exemplars bridge the gap: Prometheus 2.32+ and OTel histograms support attaching a sampled trace ID to specific observations, letting Grafana jump from a p99 spike to the slow request’s full span tree without adding trace_id as a cardinality-multiplying label. PII in labels is both a cardinality problem and a data-leak problem — audit label names as a security review item.