Observability
Instrumenting RED in Prometheus: counters, histograms, and cardinality discipline
A team alerts on average request latency. A bug fix pushes p99 from 200 ms to 800 ms but barely moves the mean. The on-call engineer misses the incident for 40 minutes. The SLO review finds that the average-latency alert has never fired on real user impact. A histogram-based p99 alert would have fired in 2 minutes.
The three canonical RED metrics
Every HTTP service should emit exactly three metric groups, named consistently:
http_requests_total # counter — Rate
http_request_errors_total # counter — Errors (5xx only, or a status label)
http_request_duration_seconds # histogram — Duration
PromQL then gives you all three RED dimensions:
- Rate: rate(http_requests_total[5m])
- Error rate: rate(http_request_errors_total[5m]) / rate(http_requests_total[5m])
- Duration p99: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
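To make the Rate and Error-rate queries concrete, here is a minimal sketch of what rate() computes from raw counter samples. It is a simplification: it handles counter resets but omits the extrapolation toward the window boundaries that Prometheus applies. The function name simpleRate and all sample values are hypothetical.

```javascript
// Simplified model of PromQL rate(): per-second increase of a counter over a
// window, treating any drop in value as a counter reset (process restart).
// Assumption: samples are [{ t: seconds, v: counterValue }] sorted by time.
// Real Prometheus additionally extrapolates toward the window boundaries.
function simpleRate(samples) {
  if (samples.length < 2) return NaN; // need at least two points
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // After a reset the counter restarts from 0, so the new value is the increase.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsed;
}

// Hypothetical scrapes, 60 s apart, over a 5-minute window.
const requests = [
  { t: 0, v: 1000 }, { t: 60, v: 1600 }, { t: 120, v: 2200 },
  { t: 180, v: 300 }, // process restarted: counter reset
  { t: 240, v: 900 }, { t: 300, v: 1500 },
];
const errors = [
  { t: 0, v: 10 }, { t: 60, v: 16 }, { t: 120, v: 22 },
  { t: 180, v: 3 }, // same restart
  { t: 240, v: 9 }, { t: 300, v: 15 },
];

const requestRate = simpleRate(requests);            // requests per second
const errorRatio = simpleRate(errors) / requestRate; // fraction of failed requests
console.log({ requestRate, errorRatio });
```

Dividing one rate by the other is what the Error rate query does; because both counters reset together, the ratio stays meaningful across the restart.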
Why Duration must be a histogram
The average hides everything users notice. A service with 99% of requests at 50 ms and 1% at 5000 ms has the same mean latency (~100 ms) as one with all requests at 100 ms. The first kills users on retries; the second does not.
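The arithmetic behind that comparison can be checked with a few lines of JavaScript. This is a sketch on raw, made-up samples (1000 requests per service); the helper names are hypothetical. With exactly 1% slow requests, p99 sits right on the boundary between the two groups, so the sketch reads p99.5 to land clearly inside the tail.

```javascript
// Two hypothetical latency populations in milliseconds, 1000 requests each.
const bimodal = [
  ...Array(990).fill(50),  // 99% of requests are fast
  ...Array(10).fill(5000), // 1% are very slow
];
const uniform = Array(1000).fill(100);

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Nearest-rank percentile on raw samples: the smallest value that has at
// least p percent of the samples at or below it.
function percentile(xs, p) {
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

console.log('mean  ', mean(bimodal), mean(uniform));                         // 99.5 vs 100
console.log('p50   ', percentile(bimodal, 50), percentile(uniform, 50));     // 50 vs 100
console.log('p99.5 ', percentile(bimodal, 99.5), percentile(uniform, 99.5)); // 5000 vs 100
```

The means differ by half a millisecond while the tail differs by a factor of 50, which is the gap an average-based alert cannot see.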
Prometheus’s histogram_quantile(q, buckets) reads per-bucket counts accumulated over a time window and estimates the q-quantile (q = 0.99 for p99) by linear interpolation inside the bucket that contains it. Accuracy depends entirely on bucket density near the percentile you care about.
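The interpolation can be sketched in JavaScript as follows. This is an illustrative model of the estimate for classic (bucketed) histograms, not the Prometheus source; the function name histogramQuantile and the bucket counts are assumptions made for the example.

```javascript
// Sketch of the estimate behind histogram_quantile for classic histograms.
// buckets: [{ le: upperBoundSeconds, count: cumulativeCount }], sorted by le,
// with the last entry being the +Inf bucket (le: Infinity).
function histogramQuantile(q, buckets) {
  if (buckets.length < 2) return NaN; // a single bucket is not a distribution
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total; // how many observations lie at or below the answer
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const upper = buckets[i].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

// Hypothetical cumulative counts for 1000 requests.
const observed = [
  { le: 0.05, count: 400 },
  { le: 0.1, count: 900 },
  { le: 0.25, count: 995 },
  { le: 0.5, count: 1000 },
  { le: Infinity, count: 1000 },
];

const p99 = histogramQuantile(0.99, observed);
console.log(p99.toFixed(3)); // about 0.242 s
```

The estimate lands near 0.242 s only because the 0.1 to 0.25 bucket is assumed to be evenly filled. If the real slow requests all took 0.12 s, the histogram could not tell, which is why bucket density near the target matters.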
The by (le) requirement. The correct form always keeps le in the aggregation:
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Dropping by (le) collapses every label dimension, including le (the bucket-boundary label), so histogram_quantile receives a single number rather than a distribution. The result is empty or NaN, not a percentile. This is a common mistake, and it fails silently: the query parses and runs without an error.
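A small model of the aggregation step shows what is lost. The sketch below imitates sum by (...) over a handful of hypothetical bucket series from two replicas; sumBy and the sample values are illustrative, not a Prometheus API.

```javascript
// Each element models one series sample: its labels plus a value (here a
// per-second bucket rate). Instances and numbers are hypothetical.
const series = [
  { labels: { instance: 'a', le: '0.1' }, value: 40 },
  { labels: { instance: 'a', le: '0.2' }, value: 49 },
  { labels: { instance: 'a', le: '+Inf' }, value: 50 },
  { labels: { instance: 'b', le: '0.1' }, value: 20 },
  { labels: { instance: 'b', le: '0.2' }, value: 45 },
  { labels: { instance: 'b', le: '+Inf' }, value: 50 },
];

// Model of `sum by (<keep>) (...)`: group by the kept labels, drop the rest.
function sumBy(keep, input) {
  const groups = new Map();
  for (const { labels, value } of input) {
    const kept = Object.fromEntries(keep.map((k) => [k, labels[k]]));
    const key = JSON.stringify(kept);
    const group = groups.get(key) || { labels: kept, value: 0 };
    group.value += value;
    groups.set(key, group);
  }
  return [...groups.values()];
}

const perBucket = sumBy(['le'], series); // three fleet-wide buckets, le preserved
const collapsed = sumBy([], series);     // one number, no le label left

console.log(perBucket);
console.log(collapsed);
```

With le kept, the two replicas merge into one fleet-wide distribution of three buckets. With nothing kept, the buckets are added into a single total that carries no boundary information, so there is nothing left to interpolate.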
| Latency signal | Main weakness | Use it for |
|---|---|---|
| Average (sum/count) | Hides the slow-tail behavior that users notice | Never for SLO alerts |
| Prometheus summary | Quantiles cannot be aggregated across replicas | Only when a single replica owns the data |
| Prometheus histogram | Accuracy depends on bucket density | Fleet-wide p99 alerts |
Bucket strategy
The default Prometheus client buckets ([0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds) are wrong for most services. For a checkout API with a 200 ms SLO, most traffic falls between 50 ms and 250 ms. A single default bucket (100 ms to 250 ms) covers the entire region around the SLO, so a p99 near the target could be anywhere inside it, and the estimate is too coarse to alert on.
Production rule: 10–15 buckets, densest around the SLO target. For a 200 ms SLO:
[0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 10]
Four boundaries below 200 ms (10, 25, 50, 100 ms) give resolution on the fast path, four above (400, 800, 1600, 3200 ms) cover the tail, and the last boundary is a hard cap at the service timeout (10 s). Adjacent buckets differ by ≤2× near the SLO.
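The spacing rule can be checked mechanically before a bucket list ships. In this sketch, "near the SLO" is taken to mean steps that lie within a factor of two of the target on either side; that threshold, like the helper name adjacentRatios, is an assumption for illustration.

```javascript
const buckets = [0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 10];
const slo = 0.2; // seconds

// Ratio of each boundary to the one before it.
function adjacentRatios(bounds) {
  return bounds.slice(1).map((to, i) => ({ from: bounds[i], to, ratio: to / bounds[i] }));
}

const steps = adjacentRatios(buckets);

// Assumption: "near the SLO" means the step lies within [slo / 2, slo * 2].
const nearSlo = steps.filter((s) => s.from >= slo / 2 && s.to <= slo * 2);

// A small tolerance absorbs floating-point noise in the division.
const spacingOk = nearSlo.every((s) => s.ratio <= 2 + 1e-9);
const widestStep = Math.max(...steps.map((s) => s.ratio));

console.log({ spacingOk, nearSlo, widestStep });
```

The widest step in this list is the jump from 3.2 s to the 10 s timeout cap, which is acceptable because no alert threshold lives out there.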
Label discipline — the iron rule
Every unique combination of label values on a Prometheus metric creates a separate time series. A naive RED instrumentation labelled by user_id in a service with 100k active users grows from a few hundred series to hundreds of thousands within hours.
What belongs in labels:
- route: the URL template (/cart, not /cart?u=12345)
- method: HTTP verb (GET / POST / …)
- status_class: 2xx / 4xx / 5xx (not the exact code)
- service: injected by the deployment as a meta-label
Forbidden in labels: user IDs, request IDs, customer emails, session tokens, and query strings. All of these have unbounded cardinality. A country code is acceptable only when the set of values is small and bounded.
The cost math: collapsing 200/201/204 into 2xx cuts 60 unique status codes down to 4 classes. For 20 routes × 4 methods: 60 × 20 × 4 = 4,800 series → 4 × 20 × 4 = 320 series, a 15× reduction with no loss of useful alerting power.
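The same arithmetic as a sketch, so the numbers can be replayed with your own label sizes. The counts are worst-case upper bounds (every combination observed at least once). The last line goes one step beyond the text: a classic histogram exposes one _bucket series per boundary plus +Inf, plus _sum and _count, so the duration metric multiplies the count again. The 10 boundaries are the bucket list from the bucket strategy section.

```javascript
// Upper bound on series for one metric: the product of the number of
// distinct values of each label.
const seriesCount = (labelSizes) =>
  Object.values(labelSizes).reduce((n, size) => n * size, 1);

const exactCodes = seriesCount({ status: 60, route: 20, method: 4 });         // 4800
const statusClasses = seriesCount({ status_class: 4, route: 20, method: 4 }); // 320
const reduction = exactCodes / statusClasses;                                 // 15

// Histogram overhead per label combination: 10 boundaries + the +Inf bucket
// + the _sum and _count series.
const seriesPerHistogram = 10 + 1 + 2;
const histogramSeries = statusClasses * seriesPerHistogram; // 4160

console.log({ exactCodes, statusClasses, reduction, histogramSeries });
```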
Why this works
If you genuinely need to alert on a specific status code on a specific route, build that alert from logs, not from a metric with a high-cardinality label. Logs are the natural home of high-cardinality data (each event is one record). Metrics are the home of aggregated, time-series counts (each series is a separate in-memory counter). The split is architectural, not a matter of preference.
A Node.js RED middleware
const express = require('express');
const client = require('prom-client');

const app = express();

// Rate: every request, labelled with bounded values only.
const reqs = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_class'],
});

// Errors: server-side failures (5xx) only.
const errs = new client.Counter({
  name: 'http_request_errors_total',
  help: 'Failed HTTP requests (5xx)',
  labelNames: ['method', 'route'],
});

// Duration: histogram with buckets densest around the 200 ms SLO.
const dur = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'route', 'status_class'],
  buckets: [0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 10],
});

// Register this before the routes so every request passes through it.
app.use((req, res, next) => {
  const start = process.hrtime.bigint(); // monotonic clock, nanoseconds
  res.on('finish', () => {
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    const sclass = `${Math.floor(res.statusCode / 100)}xx`;
    // Matched route template; unmatched requests (for example 404s) share one value.
    const route = req.route?.path || 'unknown';
    reqs.inc({ method: req.method, route, status_class: sclass });
    dur.observe({ method: req.method, route, status_class: sclass }, seconds);
    if (res.statusCode >= 500) errs.inc({ method: req.method, route });
  });
  next();
});
req.route.path gives the matched template (/cart), not req.url, which includes query strings. That one line prevents cardinality explosion.
A team alerts on the AVERAGE request latency across all replicas. Why is this dangerous?
A service emits an Errors counter labelled by exact error_message string. After a buggy release that throws unique stack traces, the metrics backend bill triples overnight. Why?
A senior engineer claims histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m]))) (no 'by (le)') gives the fleet-wide p99. Why is this wrong?
- 01. Why must RED Duration be a histogram rather than sum/count (average)?
- 02. What does the 'by (le)' clause do in a histogram_quantile query, and what happens without it?
- 03. Name three label values forbidden on RED metrics and one label value that is always allowed.
RED in Prometheus is three metric groups: http_requests_total (counter for Rate), http_request_errors_total (counter for Errors), and http_request_duration_seconds (histogram for Duration). Duration must be a histogram because the average masks tail behavior that users feel — histogram_quantile reads per-bucket counts and interpolates the percentile, but only when sum by (le) preserves the bucket-boundary label. Bucket selection decides p99 accuracy: choose 10–15 buckets densest around the SLO target with adjacent buckets differing by ≤2× near the SLO. Label discipline is the other half: use route templates, HTTP method, and status class — never user IDs, request IDs, or exact error messages. Each unique label combination is a separate time series, billed separately, and stored in RAM on the Prometheus server.