
Performance

Observability stack and CI gates: catching regressions before they ship

Crux: Five integrated data streams (metrics, logs, traces, profiles, RUM) plus pre-merge CI gates that fail regressions before they reach users. The cheapest fix is the one that never ships.

An on-call engineer gets a p99 SLO burn alert at 2 am. She has continuous profiling, distributed traces, and RUM, and she finds the root cause in 4 minutes. A colleague at a different company gets the same alert but has only server logs, and is still triaging at 6 am. The difference is not the incident; it is the observability stack each of them had in place before it.

The five production signals

Production-grade performance work needs five integrated data streams. Each answers a different question; together they give a complete picture.

| Signal | Tool examples | Question answered | Refresh / retention |
|---|---|---|---|
| Metrics | Prometheus + Grafana | Is the service healthy? p50/p99 latency, RPS, error rate | 15–60 s / 13 months |
| Logs | Loki / ELK / Datadog Logs | What happened for this request? | On-demand / 7–90 days |
| Traces | Tempo / Jaeger / Honeycomb | Which span owns the latency? | Per-request / 7–30 days |
| Profiles | Pyroscope / Parca / Polar Signals | Which function ate the CPU / allocations? | Continuous / 7–30 days |
| RUM | Sentry / Datadog RUM / Vercel Speed Insights | What did real users experience? Core Web Vitals | Per-session / 30–90 days |

Linking the signals: the unified dashboard

The power is not in each signal in isolation — it is in the links between them. An SLO burn alert links to the RED dashboard. The RED dashboard links to the trace view filtered to the burn window. The trace links to the profile filtered by the trace-id. The profile links to the source code.

A well-instrumented service lets you travel this chain in under five minutes:

  1. SLO alert fires (Prometheus → alertmanager).
  2. Open RED dashboard: which of p99/error-rate/RPS moved?
  3. Jump to traces filtered to the last 10 minutes: which span is slow?
  4. Click “profile this trace” in Pyroscope: which function is hot?
  5. Click “git blame” on the function: which commit and deploy?

Without the links (cross-signal join by trace-id), each step requires manual correlation — copying a timestamp, searching in a different tool, guessing which deployment window. The median MTTR without signal linking is 30 to 90 minutes; with it, 3 to 10 minutes.
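The cross-signal join described above can be sketched in a few lines, assuming every log line, span, and profile sample is stamped with the same trace-id when it is emitted. The record shapes and the helpers `join_by_trace_id` and `slowest_span` are hypothetical illustrations, not the API of any particular backend.

```python
from collections import defaultdict


def join_by_trace_id(logs, spans, profile_samples):
    """Group records from three signals under their shared trace_id."""
    joined = defaultdict(lambda: {"logs": [], "spans": [], "profiles": []})
    for rec in logs:
        joined[rec["trace_id"]]["logs"].append(rec)
    for rec in spans:
        joined[rec["trace_id"]]["spans"].append(rec)
    for rec in profile_samples:
        joined[rec["trace_id"]]["profiles"].append(rec)
    return dict(joined)


def slowest_span(bundle):
    """Return the span with the largest duration in one joined trace."""
    return max(bundle["spans"], key=lambda s: s["duration_ms"])


# Hypothetical records, as three separate backends might return them.
logs = [
    {"trace_id": "t1", "msg": "GET /orders 200"},
    {"trace_id": "t2", "msg": "GET /health 200"},
]
spans = [
    {"trace_id": "t1", "name": "db.query", "duration_ms": 1400},
    {"trace_id": "t1", "name": "render", "duration_ms": 40},
    {"trace_id": "t2", "name": "handler", "duration_ms": 3},
]
profile_samples = [
    {"trace_id": "t1", "function": "load_order_items", "cpu_ms": 900},
]

joined = join_by_trace_id(logs, spans, profile_samples)
hot = slowest_span(joined["t1"])
```

With the shared key in place, going from the slow trace to its hot span and its profile sample is a dictionary lookup. Without it, the same question means matching timestamps by hand across tools.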

OpenTelemetry standardises the wire format (OTLP) and the SDKs for metrics, logs, and traces. Teams that instrument with OTel can swap backends, for example from Datadog to a self-hosted stack, without reinstrumenting. Profiles are being added as a further OTel signal (work in progress across 2024-2026), so that they can flow through the same collector pipeline as metrics, logs, and traces.

Pre-merge CI gates: catch it before it ships

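The cheapest fix is the one that never reaches production. Four gates run on every PR. The first three take 1 to 5 minutes each; the load-test gate is heavier at 5 to 15 minutes. Together they catch 90%+ of would-be regressions in the classes they cover: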

Bundle-size gate — fails the PR if any route’s JS bundle exceeds the per-route budget. Set the budget at the post-fix level after a bundle optimisation; any PR that grows the bundle beyond it needs explicit sign-off. Tools: bundlewatch, Lighthouse CI, Next.js bundle analyzer in CI.
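A minimal sketch of the budget check behind such a gate, assuming an earlier build step has already produced per-route bundle sizes. The route names, sizes, and the `check_bundle_budgets` helper are hypothetical; tools such as bundlewatch do this comparison for you from their own configuration.

```python
def check_bundle_budgets(sizes_kb, budgets_kb):
    """Return violation messages; an empty list means the gate passes."""
    violations = []
    for route, size in sorted(sizes_kb.items()):
        budget = budgets_kb.get(route)
        if budget is None:
            # A route without a budget is treated as a failure so that
            # new routes cannot slip past the gate unnoticed.
            violations.append(f"{route}: no budget defined")
        elif size > budget:
            violations.append(f"{route}: {size} kB exceeds budget {budget} kB")
    return violations


# Hypothetical numbers: budgets set at the post-optimisation level.
budgets_kb = {"/": 200, "/orders": 250}
sizes_kb = {"/": 180, "/orders": 260}

violations = check_bundle_budgets(sizes_kb, budgets_kb)
gate_passed = not violations
```

In CI the script would exit non-zero when `violations` is non-empty, which is what fails the PR.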

Query-count gate — fails the PR if any endpoint introduces N+1 queries. Implementation: a middleware in the test suite that counts DB queries per request and asserts the count is under a threshold. Catches the most common backend regression class before merge.
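One way to sketch the query counter, using SQLite's statement trace hook from the Python standard library in place of a real ORM middleware. The schema, the two loader functions, and the `assert_max_queries` helper are hypothetical; a production suite would hook the equivalent callback in its own database driver or ORM.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def assert_max_queries(conn, limit):
    """Fail if more than `limit` SQL statements run inside the block."""
    statements = []
    conn.set_trace_callback(statements.append)  # called once per statement
    try:
        yield statements
    finally:
        conn.set_trace_callback(None)
    if len(statements) > limit:
        raise AssertionError(
            f"{len(statements)} queries executed, limit is {limit}"
        )


# Hypothetical schema; autocommit mode avoids implicit BEGIN statements.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT)")
conn.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])


def load_orders_n_plus_one(conn):
    """One query for the orders, then one more per order: the N+1 shape."""
    orders = conn.execute("SELECT id FROM orders").fetchall()
    return {
        oid: conn.execute(
            "SELECT sku FROM items WHERE order_id = ?", (oid,)
        ).fetchall()
        for (oid,) in orders
    }


def load_orders_joined(conn):
    """The same data in a single joined query."""
    return conn.execute(
        "SELECT o.id, i.sku FROM orders o JOIN items i ON i.order_id = o.id"
    ).fetchall()


# Passes: the joined loader stays under the threshold.
with assert_max_queries(conn, limit=2) as seen:
    load_orders_joined(conn)
```

Wrapping `load_orders_n_plus_one` in the same context manager raises `AssertionError`, because it issues one query per order on top of the initial one.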

Allocation-rate diff — fails the PR if the allocation rate for a critical path benchmark regresses beyond a threshold. A benchmarking harness (go test -bench, criterion for Rust, JMH for Java) runs on the hot path and asserts alloc/op is within budget.
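A sketch of the diff step, assuming the harness is `go test -bench` run with `-benchmem` so that each result line ends in `allocs/op`. The benchmark names, the numbers, and the 10% threshold are hypothetical; the parser only handles the standard one-line-per-benchmark text output.

```python
import re

# Matches lines such as:
# BenchmarkEncode-8   1000000   1052 ns/op   256 B/op   4 allocs/op
_BENCH_LINE = re.compile(
    r"^(Benchmark\S+)\s+\d+\s.*?(\d+)\s+allocs/op", re.MULTILINE
)


def parse_allocs(bench_output):
    """Map benchmark name to allocs/op from Go benchmark text output."""
    return {name: int(allocs) for name, allocs in _BENCH_LINE.findall(bench_output)}


def alloc_regressions(base, head, max_increase_pct=10.0):
    """Return {name: (base, head)} for benchmarks that exceed the budget."""
    regressions = {}
    for name, head_allocs in head.items():
        base_allocs = base.get(name)
        if base_allocs is None:
            continue  # new benchmark, nothing to compare against
        allowed = base_allocs * (1 + max_increase_pct / 100)
        if head_allocs > allowed:
            regressions[name] = (base_allocs, head_allocs)
    return regressions


# Hypothetical output from the main branch and from the PR branch.
base_output = """\
BenchmarkEncode-8   1000000   1052 ns/op   256 B/op   4 allocs/op
BenchmarkDecode-8    500000   2210 ns/op   512 B/op   9 allocs/op
"""
head_output = """\
BenchmarkEncode-8    800000   1490 ns/op   448 B/op   7 allocs/op
BenchmarkDecode-8    500000   2198 ns/op   512 B/op   9 allocs/op
"""

regressions = alloc_regressions(parse_allocs(base_output), parse_allocs(head_output))
```

Comparing against the base branch's numbers from the same CI runner, not against a fixed constant, keeps the gate stable when the hardware changes.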

Load-test diff — runs a brief load scenario against the PR and fails if p99 on the critical path regressed beyond X%. Heavier than the others (5 to 15 minutes) but catches architectural regressions that microbenchmarks miss.
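The comparison at the end of a load-test gate can be sketched as below, assuming the load tool has exported raw latency samples for the baseline and for the PR build. The sample data, the nearest-rank percentile method, and the 10% threshold standing in for "X%" are all assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_regressed(baseline_ms, candidate_ms, max_regression_pct):
    """Return (regressed, change_pct) comparing p99 of two sample sets."""
    base = percentile(baseline_ms, 99)
    cand = percentile(candidate_ms, 99)
    change_pct = (cand - base) / base * 100
    return change_pct > max_regression_pct, change_pct


# Hypothetical latencies in milliseconds: the PR build has a slower tail.
baseline_ms = list(range(1, 101))
candidate_ms = list(range(1, 99)) + [130, 140]

regressed, change_pct = p99_regressed(baseline_ms, candidate_ms, max_regression_pct=10)
```

A brief scenario yields few samples, so p99 is noisy. Running baseline and candidate back to back on the same runner, and choosing a threshold above the observed run-to-run noise, reduces false failures.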

Why this works

The mental model for gates: each gate enforces the lesson from one past incident. Bundle gate = the lesson from the time a third-party script ballooned LCP to 4s. Query gate = the lesson from the N+1 that took /orders from 30ms to 1.5s. Allocation gate = the lesson from the logger allocation that spiked GC. Every incident retro should end with “what gate would have caught this?” and that gate should be added. The gate set is never finished — it grows with the system’s failure history.

Observability investment numbers
MTTR with full stack + signal linking: 3–10 minutes
MTTR without continuous profiling: 30–90 minutes
Pre-merge gate catch rate for known regression classes: 90%+
Pyroscope OSS infra cost per month (self-hosted): ~$500
Sentry RUM per year (small team): ~$30k
Engineer-time for performance firefighting without discipline: 20–40%
Engineer-time with discipline (steady state): 5–10%

Quiz

A team has CI gates for bundle, query count, and allocation rate. They still get a production p99 regression once per quarter. Most likely gap?

Quiz

Why is the trace-id the critical join key between signals in a unified observability stack?

Order the steps

Order the five production signals from broadest (whole service view) to narrowest (single function in single request):

  1. Metrics: p99 latency, RPS, error rate across all requests
  2. RUM: Core Web Vitals from real user devices per route
  3. Traces: per-request waterfall showing which span is slow
  4. Logs: structured event detail for a specific request
  5. Profiles: which function inside the span consumed CPU or allocations

Recall before you leave

  1. What is the role of the trace-id in linking the five observability signals?
  2. Name the four CI gate types described, what each catches, and how it is implemented.
  3. Why do CI gates and observability complement each other rather than substitute for each other?

Recap

Production-grade performance work runs on five integrated data streams: metrics (aggregate health), logs (per-request events), traces (per-request span waterfall), profiles (per-function CPU and allocation), and RUM (real user Core Web Vitals). The trace-id is the join key that links them: an SLO alert links to a trace, the trace links to a profile, the profile names the function. With the full stack and signal linking, MTTR drops from 30–90 minutes to 3–10 minutes. Upstream of those signals, four pre-merge CI gates catch 90%+ of regressions in known classes before they ship: bundle-size, query-count, allocation-rate diff, and load-test diff. Each gate encodes the lesson from a past incident. The gate set is never complete: it grows after every production regression, via incident retros that answer “what gate would have caught this?”


Trademarks belong to their respective owners. Editorial reference only.