Performance PERF · 08 · 03

Observability stack and CI gates: catching regressions before they ship

Five integrated data streams — metrics, logs, traces, profiles, RUM — plus pre-merge CI gates that fail regressions before they reach users. The cheapest fix is the one that never ships.

PERF Middle ◷ 14 min

An on-call engineer gets a p99 SLO burn alert at 2 am. She has continuous profiling, distributed traces, and RUM, and she finds the root cause in 4 minutes. An engineer at a different company gets the same alert but has only server logs, and is still triaging at 6 am. The difference is not the incident; it is the observability stack each of them had before the incident.

The five production signals

If you have ever stared at a p99 spike with nothing but server logs, you already know why each of these signals matters. Production-grade performance work needs five integrated data streams. Each answers a different question; together they give a complete picture.

Signal	Tool examples	Question answered	Refresh / retention
Metrics	Prometheus + Grafana	Is the service healthy? p50/p99 latency, RPS, error rate	15–60 s / 13 months
Logs	Loki / ELK / Datadog Logs	What happened for this request?	On-demand / 7–90 days
Traces	Tempo / Jaeger / Honeycomb	Which span owns the latency?	Per-request / 7–30 days
Profiles	Pyroscope / Parca / Polar Signals	Which function ate the CPU / allocations?	Continuous / 7–30 days
RUM	Sentry / Datadog RUM / Vercel Speed Insights	What did real users experience? Core Web Vitals	Per-session / 30–90 days

Linking the signals: the unified dashboard

The power is not in each signal in isolation — it is in the links between them. An SLO burn alert links to the RED dashboard. The RED dashboard links to the trace view filtered to the burn window. The trace links to the profile filtered by the trace-id. The profile links to the source code.

A well-instrumented service lets you travel this chain in under five minutes:

1. SLO alert fires (Prometheus → Alertmanager).
2. Open RED dashboard: which of p99/error-rate/RPS moved?
3. Jump to traces filtered to the last 10 minutes: which span is slow?
4. Click “profile this trace” in Pyroscope: which function is hot?
5. Click “git blame” on the function: which commit and deploy?

Without the links (cross-signal join by trace-id), each step requires manual correlation: copying a timestamp, searching in a different tool, guessing which deployment window. Typical MTTR without signal linking is 30 to 90 minutes; with it, 3 to 10 minutes.
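The cross-signal join can be sketched in a few lines, assuming every record from every backend carries the same trace_id field. The record shapes, field names, and sample values here are hypothetical and do not reflect any particular backend's schema.

```python
from collections import defaultdict

# Hypothetical records from three backends; each carries a trace_id.
spans = [
    {"trace_id": "t1", "span": "GET /orders", "duration_ms": 1480},
    {"trace_id": "t1", "span": "db.query", "duration_ms": 1410},
    {"trace_id": "t2", "span": "GET /orders", "duration_ms": 32},
]
logs = [
    {"trace_id": "t1", "message": "fetched order rows one by one"},
    {"trace_id": "t2", "message": "cache hit"},
]
profile_samples = [
    {"trace_id": "t1", "function": "load_order_items", "cpu_ms": 900},
    {"trace_id": "t1", "function": "serialize", "cpu_ms": 120},
]


def join_by_trace_id(spans, logs, profile_samples):
    """Group all three signals under their shared trace_id."""
    joined = defaultdict(lambda: {"spans": [], "logs": [], "profile": []})
    for span in spans:
        joined[span["trace_id"]]["spans"].append(span)
    for entry in logs:
        joined[entry["trace_id"]]["logs"].append(entry)
    for sample in profile_samples:
        joined[sample["trace_id"]]["profile"].append(sample)
    return dict(joined)


def slowest_trace(joined):
    """Return the trace_id whose longest span is the slowest overall."""
    def longest_span(trace_id):
        durations = (s["duration_ms"] for s in joined[trace_id]["spans"])
        return max(durations, default=0)
    return max(joined, key=longest_span)


def hottest_function(joined, trace_id):
    """Return the function with the most CPU time inside one trace, if any."""
    samples = joined[trace_id]["profile"]
    if not samples:
        return None
    return max(samples, key=lambda s: s["cpu_ms"])["function"]


joined = join_by_trace_id(spans, logs, profile_samples)
worst = slowest_trace(joined)
print(worst, hottest_function(joined, worst))
```

With the join key in place, going from the slow trace to its hot function is a dictionary lookup rather than a timestamp search across tools.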

OpenTelemetry standardises the wire format (OTLP) and SDKs across signals. Teams that instrument with OTel can swap backends, for example from Datadog to a self-hosted stack, without reinstrumenting. The profile signal (being standardised in OTel over 2024-2026) is designed to join metrics, logs, and traces in the same collector pipeline.

Pre-merge CI gates: catch it before it ships

After a postmortem, the question is always “what would have caught this before it shipped?” The answer is almost always a gate. The cheapest fix is the one that never reaches production. Four gates run on every PR. The first three take 1 to 5 minutes, the load-test diff takes 5 to 15, and together they catch 90%+ of regressions in known classes:

Bundle-size gate — fails the PR if any route’s JS bundle exceeds the per-route budget. Set the budget at the post-fix level after a bundle optimisation; any PR that grows the bundle beyond it needs explicit sign-off. Tools: bundlewatch, Lighthouse CI, Next.js bundle analyzer in CI.
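A minimal sketch of the budget check behind a bundle-size gate. In practice a tool such as bundlewatch does this job; the function name, route names, and kB figures below are hypothetical and only illustrate the comparison.

```python
def check_bundle_budgets(sizes_kb, budgets_kb):
    """Return one message per route whose JS bundle breaks its budget.

    A route with no budget entry is also reported, so a new route
    cannot ship without someone choosing a budget for it.
    """
    violations = []
    for route, size in sorted(sizes_kb.items()):
        budget = budgets_kb.get(route)
        if budget is None:
            violations.append(f"{route}: no budget defined ({size} kB)")
        elif size > budget:
            violations.append(f"{route}: {size} kB exceeds budget {budget} kB")
    return violations


# Hypothetical per-route budgets (kB of JS), set at the post-fix level.
budgets_kb = {"/": 120, "/orders": 180, "/checkout": 150}
# Hypothetical sizes measured from the PR build.
sizes_kb = {"/": 118, "/orders": 205, "/checkout": 150}

violations = check_bundle_budgets(sizes_kb, budgets_kb)
for violation in violations:
    print("FAIL", violation)

# A CI script would hand this to sys.exit so the PR check goes red.
exit_code = 1 if violations else 0
```

A bundle exactly at its budget passes; only growth beyond it needs sign-off, which matches the rule described above.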

Query-count gate — fails the PR if any endpoint introduces N+1 queries. Implementation: a middleware in the test suite that counts DB queries per request and asserts the count is under a threshold. Catches the most common backend regression class before merge.
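One way to sketch the query counter, assuming a SQLite-backed test database and using the standard library's `sqlite3.Connection.set_trace_callback`. The schema, the two handler functions, and the `MAX_QUERIES` threshold are hypothetical; a real suite would hook its own driver or ORM in the same spirit.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def count_selects(conn):
    """Count SELECT statements executed on conn inside the with-block."""
    counter = {"n": 0}

    def on_statement(sql):
        if sql.lstrip().upper().startswith("SELECT"):
            counter["n"] += 1

    conn.set_trace_callback(on_statement)
    try:
        yield counter
    finally:
        conn.set_trace_callback(None)  # stop tracing


def setup_db():
    """Create a small in-memory database with five orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE items (order_id INTEGER, name TEXT)")
    for order_id in range(1, 6):
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        conn.execute("INSERT INTO items VALUES (?, ?)", (order_id, "widget"))
    conn.commit()
    return conn


def list_orders_n_plus_one(conn):
    """One query for the orders, then one more per order: the N+1 pattern."""
    orders = conn.execute("SELECT id FROM orders").fetchall()
    query = "SELECT name FROM items WHERE order_id = ?"
    return {oid: conn.execute(query, (oid,)).fetchall() for (oid,) in orders}


def list_orders_joined(conn):
    """The same data fetched with a single JOIN."""
    query = "SELECT o.id, i.name FROM orders o JOIN items i ON i.order_id = o.id"
    return conn.execute(query).fetchall()


MAX_QUERIES = 3  # hypothetical per-request threshold

conn = setup_db()
with count_selects(conn) as bad:
    list_orders_n_plus_one(conn)
with count_selects(conn) as good:
    list_orders_joined(conn)

print("n+1 handler:", bad["n"], "joined handler:", good["n"])
```

In a test, the gate is then a single assertion such as `assert counter["n"] <= MAX_QUERIES` around each request under test. The N+1 handler's count grows with the number of orders, so it crosses any fixed threshold as soon as the fixture data is large enough.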

Allocation-rate diff — fails the PR if the allocation rate for a critical path benchmark regresses beyond a threshold. A benchmarking harness (go test -bench, criterion for Rust, JMH for Java) runs on the hot path and asserts alloc/op is within budget.
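A sketch of the diff step, assuming the benchmarks were run with `go test -bench . -benchmem` on both the base branch and the PR, and that each result line ends with the usual `ns/op`, `B/op`, and `allocs/op` columns. The sample output, benchmark names, and the 10% threshold are invented for illustration.

```python
import re

# Matches lines such as:
# BenchmarkEncode-8    1000000    1052 ns/op    256 B/op    4 allocs/op
BENCH_LINE = re.compile(
    r"^(Benchmark\S+)\s+\d+\s+[\d.]+ ns/op\s+(\d+) B/op\s+(\d+) allocs/op"
)


def parse_benchmem(output):
    """Map benchmark name to allocs/op from -benchmem output."""
    result = {}
    for line in output.splitlines():
        match = BENCH_LINE.match(line.strip())
        if match:
            result[match.group(1)] = int(match.group(3))
    return result


def alloc_regressions(base, pr, max_increase_pct=10.0):
    """Return (name, before, after) for benchmarks whose allocs/op grew too much."""
    regressions = []
    for name in sorted(pr):
        if name not in base:
            continue  # new benchmark: nothing to diff against
        before, after = base[name], pr[name]
        if before == 0:
            regressed = after > 0  # any allocation on a zero-alloc path
        else:
            regressed = (after - before) / before * 100 > max_increase_pct
        if regressed:
            regressions.append((name, before, after))
    return regressions


# Hypothetical benchmark output from the base branch and from the PR.
BASE_OUTPUT = """\
BenchmarkEncode-8    1000000    1052 ns/op    256 B/op    4 allocs/op
BenchmarkDecode-8     500000    2310 ns/op    512 B/op   10 allocs/op
"""
PR_OUTPUT = """\
BenchmarkEncode-8    1000000    1049 ns/op    256 B/op    4 allocs/op
BenchmarkDecode-8     480000    2502 ns/op    680 B/op   13 allocs/op
"""

regressions = alloc_regressions(parse_benchmem(BASE_OUTPUT), parse_benchmem(PR_OUTPUT))
for name, before, after in regressions:
    print(f"FAIL {name}: {before} -> {after} allocs/op")
```

Diffing allocs/op rather than ns/op is a deliberate choice in this sketch: allocation counts are far less sensitive to noisy CI runners than wall-clock timings.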

Load-test diff — runs a brief load scenario against the PR and fails if p99 on the critical path regressed beyond X%. Heavier than the others (5 to 15 minutes) but catches architectural regressions that microbenchmarks miss.
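The comparison at the end of a load-test diff might look like the sketch below, assuming the load tool exports raw per-request latencies for the base build and the PR build. The nearest-rank percentile, the sample latencies, and the 10% threshold are assumptions chosen for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_diff(base_ms, pr_ms, max_regression_pct):
    """Compare p99 latency of two runs; latencies are assumed positive."""
    base_p99 = percentile(base_ms, 99)
    pr_p99 = percentile(pr_ms, 99)
    change_pct = (pr_p99 - base_p99) / base_p99 * 100
    return change_pct > max_regression_pct, base_p99, pr_p99, change_pct


# Hypothetical latencies (ms) from a short run against each build.
base_ms = [20] * 98 + [100, 120]
pr_ms = [20] * 98 + [130, 150]

regressed, base_p99, pr_p99, change_pct = p99_diff(base_ms, pr_ms, 10)
print(f"p99 {base_p99} ms -> {pr_p99} ms ({change_pct:+.0f}%)")
```

Both sample runs have the same median, so a gate on averages or p50 would pass this PR. Only the tail comparison flags it.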

Why this works

The mental model for gates: each gate enforces the lesson from one past incident. Bundle gate = the lesson from the time a third-party script ballooned LCP to 4s. Query gate = the lesson from the N+1 that took /orders from 30ms to 1.5s. Allocation gate = the lesson from the logger allocation that spiked GC. Every incident retro should end with “what gate would have caught this?” and that gate should be added. The gate set is never finished — it grows with the system’s failure history.

Observability investment numbers

MTTR with full stack + signal linking: 3–10 minutes
MTTR without signal linking: 30–90 minutes
Pre-merge gate catch rate for known regression classes: 90%+
Pyroscope OSS infra cost per month (self-hosted): ~$500
Sentry RUM per year (small team): ~$30k
Engineer-time for performance firefighting without discipline: 20–40%
Engineer-time with discipline (steady state): 5–10%

The same 2 am alert, two different nights: five linked signals plus pre-merge gates turn hours of triage into minutes — and stop most regressions before they ship.

Quiz

A team has CI gates for bundle, query count, and allocation rate. They still get a production p99 regression once per quarter. Most likely gap?

Quiz

Why is the trace-id the critical join key between signals in a unified observability stack?

Order the steps

Order the five production signals from broadest (whole service view) to narrowest (single function in single request):

1 Metrics — p99 latency, RPS, error rate across all requests
2 RUM — Core Web Vitals from real user devices per route
3 Traces — per-request waterfall showing which span is slow
4 Logs — structured event detail for a specific request
5 Profiles — which function inside the span consumed CPU or allocations

Figure: the five signals stacked from broad to narrow (metrics, RUM, traces, logs, profiles), joined by trace-id, with CI gates as the base layer. The cheapest fix is the one that never ships.

Recall before you leave

01
What is the role of the trace-id in linking the five observability signals?
02
Name the four CI gate types described, what each catches, and how it is implemented.
03
Why do CI gates and observability complement each other rather than substitute for each other?

Recap

Production-grade performance work runs on five integrated data streams: metrics (aggregate health), logs (per-request events), traces (per-request span waterfall), profiles (per-function CPU and allocation), and RUM (Real User Monitoring), which captures Core Web Vitals from actual user devices. The trace-id is the join key that links them: an SLO alert links to a trace, the trace links to a profile, the profile names the function. With the full stack and signal linking, MTTR drops from 30–90 minutes to 3–10 minutes. Before any signal fires, four CI gates catch 90%+ of regressions in known classes before they ship: bundle-size, query-count, allocation-rate diff, and load-test diff. Each gate encodes the lesson from a past incident. The gate set is never complete; it grows after every production regression, through incident retros that answer “what gate would have caught this?” When you run your next postmortem, you have a checklist: which of the five signals was missing, and which gate would have caught it first.

Connected lessons

builds on

Continuous profiling at scale: costs, CI gates, trace correlation, and anti-patterns (senior)
