Observability

Profiling in production: security, war stories, OTel profiles, and the infrastructure design

Crux: Profiles expose function names and call patterns, so treat them like debugger output. Five war stories show what continuous profiling catches and what dashboards miss. The OTel profile signal is the fourth pillar, closing the SLO → trace → profile loop.

Stripe’s continuous profiler caught a regression two days after a deploy that no dashboard showed. A new feature flag was reading from disk on every request instead of from in-memory cache. CPU profile looked normal; off-CPU profile showed the disk wait. The fix was one line. The detection would have taken weeks without continuous profiling.

Profiles are security-sensitive artefacts

A profile contains function names (often private), call patterns, and sometimes allocation arguments — enough to reverse-engineer business logic. Some profilers capture argument values at allocation sites; poorly configured allocation profilers have leaked credentials.

In hostile contexts, a leaked profile of a competitor’s binary can reveal proprietary algorithms; function names alone often telegraph what a service does. eBPF profilers attach to the shared host kernel and can in principle observe other tenants’ execution, which is why profiling-type eBPF programs require explicit capabilities (CAP_PERFMON and CAP_BPF on Linux 5.8 and later, CAP_SYS_ADMIN on older kernels).

Production discipline:

  • Profiles are RBAC-gated by team (Pyroscope tenancy model).
  • Retention limited to 30-90 days; exports require approval.
  • Never shipped outside the organisation.
  • eBPF agent runs with CAP_PERFMON only, not full root.
  • Audit log of who pulled which profile.
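
The first, second, and last rules above can be sketched as a single access check. This is a minimal illustration, not Pyroscope's tenancy API: `ProfileMeta`, `AccessGate`, and the in-memory audit list are hypothetical names, and the 90-day constant is simply the upper bound of the retention range above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # assumption: upper bound of the 30-90 day range


@dataclass
class ProfileMeta:
    profile_id: str
    owner_team: str
    captured_at: datetime


@dataclass
class AccessGate:
    # Hypothetical in-memory audit log; a real system would use an append-only store.
    audit_log: list = field(default_factory=list)

    def can_read(self, user: str, user_teams: set, profile: ProfileMeta, now: datetime) -> bool:
        expired = now - profile.captured_at > RETENTION
        allowed = profile.owner_team in user_teams and not expired
        # Record every attempt, including denials, so pulls can be audited later.
        self.audit_log.append((now.isoformat(), user, profile.profile_id, allowed))
        return allowed


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
gate = AccessGate()
recent = ProfileMeta("p1", "payments", now - timedelta(days=10))
print(gate.can_read("alice", {"payments"}, recent, now))  # team member, within retention
print(gate.can_read("bob", {"search"}, recent, now))      # different team
```

Logging denials as well as grants is a deliberate choice here: a burst of denied reads is itself a signal worth reviewing.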

Production war stories

Discord 2020: a chat service ran at 80% CPU with mysterious tail latency. CPU profile pointed at JSON serialisation. Switching to a faster JSON library dropped CPU to 30% and tail latency to baseline.

GitHub 2021: Ruby workers were OOMing on certain endpoints. Allocation profile showed a single template-rendering function allocating 200 MB per request because of an unbounded loop concatenating strings.

Stripe 2022: continuous profiling caught a regression two days after deploy. A new feature flag read from disk on every request instead of from in-memory cache. CPU profile looked normal; off-CPU profile showed the disk wait. Fix was one line.

Cloudflare 2023: a Worker runtime regression appeared in eBPF profiles as time spent in V8’s GC. The team rolled back a V8 update that introduced more aggressive collection.

Slack 2024: PHP service was spending 30% of CPU on autoloader. Profiler-guided opcache tuning cut it to 5%.

The shared pattern across these stories: dashboards showed normal while the profile showed the bottleneck. In each case the fix was obvious from the flame graph and hard to find without one.

Company / Year | Symptom | Profile type | Root cause
Discord 2020 | 80% CPU, tail latency | CPU flame graph | JSON serialisation hotspot
GitHub 2021 | OOM on endpoints | Allocation profile | String concat loop, 200 MB/req
Stripe 2022 | Post-deploy regression | Off-CPU profile | Feature flag disk read on every req
Cloudflare 2023 | Worker runtime regression | eBPF CPU profile | V8 GC update, more aggressive collection
Slack 2024 | High PHP CPU | CPU flame graph | Autoloader: 30% CPU, fixed with opcache

OTel profile signal: the fourth pillar

OpenTelemetry is standardising profiles as a fourth signal (after logs, metrics, traces). The spec defines:

  • A profile data model: samples with stacks, labels, and time ranges.
  • A transport: OTLP profile signal (added in 2024).
  • Integration with context propagation: trace-id tagging on every sample.
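
To make the three items above concrete, here is a rough sketch of what such a data model looks like in memory. It is an illustration only and does not mirror the actual OTLP protobuf schema; the class and field names (`Sample`, `Profile`, `for_trace`) are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Sample:
    stack: Tuple[str, ...]            # frames, root first, leaf last
    value: int                        # e.g. sample count or CPU nanoseconds
    labels: Dict[str, str] = field(default_factory=dict)
    trace_id: Optional[str] = None    # set when the sample was taken inside a traced request
    span_id: Optional[str] = None


@dataclass
class Profile:
    service: str
    start_unix_ns: int                # time range covered by the samples
    end_unix_ns: int
    samples: List[Sample] = field(default_factory=list)

    def for_trace(self, trace_id: str) -> List[Sample]:
        """Return only the samples tagged with the given trace id."""
        return [s for s in self.samples if s.trace_id == trace_id]


profile = Profile("checkout", 0, 10_000_000_000)
profile.samples.append(Sample(("main", "handle", "json_encode"), 70, {"pod": "a"}, trace_id="t1"))
profile.samples.append(Sample(("main", "handle", "db_query"), 20, {"pod": "a"}))
print(len(profile.for_trace("t1")))
```

The trace-id field on every sample is what makes the later trace-to-profile drill possible: without it, a profile can only be sliced by time and labels.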

Adoption status: Datadog, Grafana, Honeycomb, and Splunk are implementing OTel profile ingestion, and agents (the OTel Collector plus the profiler side) emit OTel-formatted profiles. The OTel profile spec is in beta as of 2026, so most production deployments still use tool- or vendor-specific formats (pprof, JFR, Pyroscope-native). Choosing a tool today commits you to a format for 2-3 years, so the OTel trajectory is worth tracking.

The promise: cross-vendor portability and a unified collector pipeline — the same architecture as logs, metrics, and traces. The catch: the spec is young and implementations diverge at the edges.

Designing continuous profiling infrastructure

Consider a 200-service polyglot platform (Go, Java, Node, Python) that must surface deploy regressions within 1 hour and support trace-to-profile drill-down in under 30 seconds. One layered design:

Layer 1 — Collection: eBPF DaemonSet on every node (Parca-style or Pyroscope eBPF) as the universal baseline — covers all languages, one agent per node. Per-language agents as supplements: pprof for Go, async-profiler for Java, py-spy for Python. The eBPF agent is the catchall; per-language agents provide allocation and mutex profiles.

Layer 2 — Backend: self-hosted Pyroscope 2.0 cluster. Object storage (S3 / GCS) with 30-day fine-grained retention and 90-day downsampled. Symbol deduplication keeps per-service storage under 10 GB/month.

Layer 3 — Trace correlation: profiles carry trace-id and span-id labels. Grafana links trace span → Pyroscope filtered by trace-id. Sub-30-second drill.
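
The drill-down in Layer 3 boils down to filtering samples by a trace-id label and ranking what is left. The sketch below shows that logic on plain dictionaries; it is not the Grafana or Pyroscope query API, and the sample layout and the `hot_functions_for_trace` helper are hypothetical.

```python
from collections import Counter


def hot_functions_for_trace(samples, trace_id, top=3):
    """Rank leaf frames by self time, using only samples tagged with trace_id."""
    self_time = Counter()
    for sample in samples:
        if sample["labels"].get("trace_id") == trace_id:
            leaf = sample["stack"][-1]  # last frame is where the time was spent
            self_time[leaf] += sample["value"]
    return self_time.most_common(top)


samples = [
    {"stack": ("main", "handle", "json_encode"), "value": 70, "labels": {"trace_id": "t1"}},
    {"stack": ("main", "handle", "db_query"), "value": 20, "labels": {"trace_id": "t1"}},
    {"stack": ("main", "handle", "json_encode"), "value": 50, "labels": {"trace_id": "t2"}},
]
print(hot_functions_for_trace(samples, "t1"))
# → [('json_encode', 70), ('db_query', 20)]
```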

Layer 4 — Regression detection: CI job on every deploy: capture 5-minute profile of new version under canary traffic, diff against previous version’s profile, post flame-graph diff as PR comment, fail CI if a new function appears in top 5 by self-CPU. Hourly production diff against same-hour-yesterday baseline; Slack alert on shape changes.
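
The CI rule in Layer 4 ("fail if a new function appears in the top 5 by self-CPU") can be sketched as follows. This assumes profiles have already been reduced to `(stack, value)` pairs; the function names `self_cpu`, `top_n`, and `new_hot_functions` are illustrative, not part of any profiler's tooling.

```python
from collections import Counter


def self_cpu(samples):
    """Sum sample values by leaf frame, i.e. self time per function."""
    totals = Counter()
    for stack, value in samples:
        totals[stack[-1]] += value
    return totals


def top_n(totals, n=5):
    return [name for name, _ in totals.most_common(n)]


def new_hot_functions(baseline, candidate, n=5):
    """Functions in the candidate's top n by self-CPU that were not in the baseline's top n."""
    before = set(top_n(self_cpu(baseline), n))
    return [name for name in top_n(self_cpu(candidate), n) if name not in before]


baseline = [(("main", "json_encode"), 50), (("main", "db_query"), 30), (("main", "render"), 20)]
candidate = baseline + [(("main", "load_flags", "read_flag_file"), 40)]

regressions = new_hot_functions(baseline, candidate)
if regressions:
    # In CI this would be a non-zero exit; here it is just printed.
    print("FAIL: new hot functions:", regressions)
```

Comparing ranks rather than absolute CPU values keeps the gate tolerant of traffic differences between the canary and the previous version, at the cost of missing regressions inside functions that were already hot.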

Layer 5 — Cost controls: sample rate per service configurable in service.yaml (default 99 Hz; drop to 19 Hz for cheap baseline services). Budget alert at 80% of monthly cost ceiling.
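
A small sketch of the two controls in Layer 5. It assumes `service.yaml` has already been parsed into a dictionary; the key names `profiling` and `sample_hz` are hypothetical, since the layout of that file is not specified here.

```python
DEFAULT_SAMPLE_HZ = 99   # default from the design above
ALERT_THRESHOLD = 0.80   # alert at 80% of the monthly ceiling


def sample_rate_hz(service_cfg: dict) -> int:
    """Per-service sample rate, falling back to the platform default."""
    return service_cfg.get("profiling", {}).get("sample_hz", DEFAULT_SAMPLE_HZ)


def budget_alert(month_to_date_cost: float, monthly_ceiling: float) -> bool:
    """True once spend reaches the alert threshold of the ceiling."""
    return month_to_date_cost >= ALERT_THRESHOLD * monthly_ceiling


print(sample_rate_hz({}))                                # falls back to 99
print(sample_rate_hz({"profiling": {"sample_hz": 19}}))  # cheap baseline service
print(budget_alert(850.0, 1000.0))                       # past 80% of the ceiling
```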

Profiling infrastructure: design targets

  • Trace-to-profile drill time: <30 seconds
  • Deploy regression detection window: <1 hour
  • Pager-to-git-blame MTTR: <90 seconds
  • Storage per service per month: <10 GB (Pyroscope 2.0)
  • eBPF capability required: CAP_PERFMON only
  • Profile RBAC: per-team tenancy

Quiz

A profile from your service leaks to a vendor's support team. What is the security concern?

Quiz

The OTel profile signal is in beta as of 2026. What is the practical implication for teams choosing a profiling tool today?

Recall before you leave

  1. Why are profiles treated as security-sensitive artefacts rather than just operational data?
  2. Design the profiling CI gate for a 50-service platform to catch CPU regressions at deploy time.
  3. What is the OTel profile signal and what does it standardise?

Recap

Profiles contain function names, call patterns, and sometimes allocation argument values — treat them as security-sensitive artefacts with RBAC, audit logs, and retention limits, never shared externally without approval. Five industry war stories (Discord, GitHub, Stripe, Cloudflare, Slack) follow the same pattern: dashboards showed normal, the profile showed the bottleneck, the fix was obvious from the flame graph. The OTel profile signal standardises profiles as the fourth observability pillar with a data model, OTLP transport, and trace-id integration; it is in beta as of 2026 but worth tracking when choosing tooling. Production profiling infrastructure for a 200-service polyglot fleet combines an eBPF DaemonSet (universal baseline), per-language native agents (depth), Pyroscope 2.0 self-hosted (storage), trace-id correlation (30-second drill), and CI differential profiles (1-hour regression detection). The cultural shift: senior on-call engineers in 2026 open the profile dashboard the same reflexive way they opened traces two years ago.


Trademarks belong to their respective owners. Editorial reference only.