Profiling in production: security, war stories, OTel profiles, and the infrastructure design

Profiles expose function names and call patterns, so treat them like debugger output. Five war stories show what continuous profiling catches and dashboards miss. The OTel profile signal is the fourth pillar, closing the SLO → trace → profile loop.

Stripe’s continuous profiler caught a regression two days after a deploy that no dashboard showed. A new feature flag was being read from disk on every request instead of from the in-memory cache. The CPU profile looked normal; the off-CPU profile showed the disk wait. The fix was one line. Without continuous profiling, detection would have taken weeks.

Profiles are security-sensitive artefacts

A profile contains function names (often private), call patterns, and sometimes allocation arguments — enough to reverse-engineer business logic. Some profilers capture argument values at allocation sites; poorly configured allocation profilers have leaked credentials.
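One way to reduce the credential-leak risk is to scrub labels before a profile leaves the host. The sketch below is illustrative only: the `scrub_labels` helper, the key pattern, and the `[REDACTED]` marker are assumptions, not part of any particular profiler. It matches on label keys, so a secret stored under an innocuous key would still get through.

```python
import re

# Hypothetical pattern: label keys that commonly carry secrets.
SENSITIVE_KEY = re.compile(
    r"(password|passwd|secret|token|api[_-]?key|authorization)",
    re.IGNORECASE,
)


def scrub_labels(labels: dict) -> dict:
    """Return a copy of labels with values redacted for sensitive-looking keys."""
    return {
        key: "[REDACTED]" if SENSITIVE_KEY.search(key) else value
        for key, value in labels.items()
    }
```

A scrubber like this is a backstop. The primary control is still configuring the allocation profiler not to capture argument values in the first place.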

In hostile contexts, a profile of a competitor’s binary can reveal proprietary algorithms; function names alone often telegraph what a service does. eBPF profilers running on shared kernels can in principle observe other tenants’ execution, because eBPF programs run in kernel context and are not confined by container namespaces. This is why loading them requires explicit capabilities (CAP_BPF and CAP_PERFMON on Linux 5.8 and later, CAP_SYS_ADMIN on older kernels).

Production discipline:

Profiles are RBAC-gated by team (Pyroscope tenancy model).
Retention limited to 30-90 days; exports require approval.
Never shipped outside the organisation.
eBPF agent runs with a minimal capability set (CAP_PERFMON, plus CAP_BPF to load programs on Linux 5.8+), not full root.
Audit log of who pulled which profile.

Together these controls treat a profile the same way you would treat a heap dump or a debugger session: useful internally, dangerous externally. Skip any one of them and you have created an unaudited window into your codebase’s internals.
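As a rough illustration of how team-scoped access, retention, and auditing fit together, here is a minimal in-memory sketch. The `ProfileGate` and `ProfileRecord` names are hypothetical, and it assumes the 90-day upper retention bound from the list above. A real deployment would rely on the profiling backend's tenancy features and a durable audit store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

MAX_RETENTION = timedelta(days=90)  # assumed upper bound of the retention window


@dataclass
class ProfileRecord:
    profile_id: str
    owner_team: str
    captured_at: datetime


@dataclass
class ProfileGate:
    audit_log: list = field(default_factory=list)

    def fetch(self, user: str, user_teams: set, record: ProfileRecord,
              now: datetime) -> bool:
        """Allow access only to the owning team and only within retention."""
        expired = now - record.captured_at > MAX_RETENTION
        allowed = record.owner_team in user_teams and not expired
        # Record every attempt, including denials, so the trail is complete.
        self.audit_log.append({
            "user": user,
            "profile": record.profile_id,
            "allowed": allowed,
            "at": now.isoformat(),
        })
        return allowed
```

Logging denials as well as grants is a deliberate choice here: a burst of denied requests is often the more interesting signal.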

Production war stories

Discord 2020: a chat service ran at 80% CPU with mysterious tail latency. The CPU profile pointed at JSON serialisation. Switching to a faster JSON library dropped CPU to 30% and tail latency to baseline.

GitHub 2021: Ruby workers were OOMing on certain endpoints. The allocation profile showed a single template-rendering function allocating 200 MB per request because of an unbounded loop concatenating strings.

Stripe 2022: continuous profiling caught a regression two days after deploy. A new feature flag was read from disk on every request instead of from the in-memory cache. The CPU profile looked normal; the off-CPU profile showed the disk wait. The fix was one line.

Cloudflare 2023: a Worker runtime regression appeared in eBPF profiles as time spent in V8’s GC. The team rolled back a V8 update that introduced more aggressive collection.

Slack 2024: a PHP service was spending 30% of CPU in the autoloader. Profiler-guided opcache tuning cut it to 5%.

Many large engineering orgs have a profiling war story, and these five share a pattern: dashboards showed normal, but the profile showed the bottleneck. The fix was obvious from the flame graph and very hard to find without one.

Company / Year	Symptom	Profile type	Root cause
Discord 2020	80% CPU, tail latency	CPU flame graph	JSON serialisation hotspot
GitHub 2021	OOM on endpoints	Allocation profile	String concat loop, 200 MB/req
Stripe 2022	Post-deploy regression	Off-CPU profile	Feature flag disk read on every req
Cloudflare 2023	Worker runtime regression	eBPF CPU profile	V8 GC update, more aggressive collection
Slack 2024	High PHP CPU	CPU flame graph	Autoloader: 30% CPU, fixed with opcache

OTel profile signal: the fourth pillar

OpenTelemetry is standardising profiles as a fourth signal (after logs, metrics, traces). The spec defines:

A profile data model: samples with stacks, labels, and time ranges.
A transport: OTLP profile signal (added in 2024).
Integration with context propagation: trace-id tagging on every sample.
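The three parts listed above can be pictured as plain data types. The sketch below is a simplified illustration, not the OTLP protobuf schema: the class names, field names, and the `cpu_nanoseconds` sample type are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sample:
    stack: tuple                     # frame names, root first
    value: int                       # e.g. CPU nanoseconds attributed to this stack
    labels: dict = field(default_factory=dict)
    trace_id: Optional[str] = None   # set when the sample was taken inside a traced request
    span_id: Optional[str] = None


@dataclass
class Profile:
    start_unix_nano: int
    end_unix_nano: int
    sample_type: str                 # hypothetical name, e.g. "cpu_nanoseconds"
    samples: list = field(default_factory=list)

    def duration_seconds(self) -> float:
        return (self.end_unix_nano - self.start_unix_nano) / 1e9

    def total_value(self) -> int:
        return sum(s.value for s in self.samples)
```

Carrying `trace_id` and `span_id` on each sample, rather than on the profile as a whole, is what lets one profile window be sliced per request later.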

Adoption status: Datadog, Grafana, Honeycomb, and Splunk are implementing OTel profile ingestion. On the agent side, profilers and the OTel Collector are starting to emit OTel-formatted profiles. The OTel profile spec is in beta as of 2026; most production deployments still use tool-specific formats (pprof, JFR, Pyroscope-native). Choosing a tool today commits you to a format for 2-3 years, so the OTel trajectory is worth tracking.

The promise: cross-vendor portability and a unified collector pipeline — the same architecture as logs, metrics, and traces. The catch: the spec is young and implementations diverge at the edges.

Designing continuous profiling infrastructure

Consider a 200-service polyglot platform (Go, Java, Node, Python) that must surface deploy regressions within 1 hour and support trace-to-profile drill-down in under 30 seconds. One workable design has five layers:

Layer 1 — Collection: eBPF DaemonSet on every node (Parca-style or Pyroscope eBPF) as the universal baseline — covers all languages, one agent per node. Per-language agents as supplements: pprof for Go, async-profiler for Java, py-spy for Python. The eBPF agent is the catchall; per-language agents provide allocation and mutex profiles.

Layer 2 — Backend: self-hosted Pyroscope 2.0 cluster. Object storage (S3 / GCS) with 30-day fine-grained retention and 90-day downsampled. Symbol deduplication keeps per-service storage under 10 GB/month.

Layer 3 — Trace correlation: profiles carry trace-id and span-id labels. Grafana links trace span → Pyroscope filtered by trace-id. Sub-30-second drill.
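To make the drill-down concrete, the sketch below filters samples by trace-id (and optionally span-id) and aggregates them into folded stacks, the `frame;frame;frame value` form that flame-graph tools commonly consume. The sample shape (dicts with `stack`, `value`, and `labels`) and the function name are assumptions for illustration. In the design above, the backend performs this filtering on the query path.

```python
from collections import Counter


def folded_stacks_for_span(samples, trace_id, span_id=None):
    """Aggregate sample values per stack for one trace, optionally one span."""
    folded = Counter()
    for sample in samples:
        labels = sample["labels"]
        if labels.get("trace_id") != trace_id:
            continue
        if span_id is not None and labels.get("span_id") != span_id:
            continue
        folded[";".join(sample["stack"])] += sample["value"]
    return dict(folded)
```

Samples without a trace-id label simply fall out of the result, which is the expected behaviour for background work that is not tied to a request.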

Layer 4 — Regression detection: CI job on every deploy: capture 5-minute profile of new version under canary traffic, diff against previous version’s profile, post flame-graph diff as PR comment, fail CI if a new function appears in top 5 by self-CPU. Hourly production diff against same-hour-yesterday baseline; Slack alert on shape changes.
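The CI gate can be sketched as a comparison of top-N functions by self-CPU. The code assumes profiles have already been exported as folded stacks (a mapping from `frame;frame;leaf` to sample value), and it reads "a new function appears in top 5" as "a function is in the candidate's top 5 but not in the baseline's top 5". That is one reasonable interpretation, not the only one. The function names are hypothetical.

```python
from collections import Counter


def self_cpu(folded: dict) -> Counter:
    """Self time per function: each stack's value is attributed to its leaf frame."""
    totals = Counter()
    for stack, value in folded.items():
        leaf = stack.split(";")[-1]
        totals[leaf] += value
    return totals


def top_n(folded: dict, n: int = 5) -> list:
    """Names of the n functions with the highest self time, highest first."""
    return [name for name, _ in self_cpu(folded).most_common(n)]


def new_hot_functions(baseline: dict, candidate: dict, n: int = 5) -> list:
    """Functions in the candidate's top n that were not in the baseline's top n."""
    baseline_top = set(top_n(baseline, n))
    return [name for name in top_n(candidate, n) if name not in baseline_top]
```

A CI step would exit non-zero when `new_hot_functions(...)` returns a non-empty list and attach the flame-graph diff to the PR. Comparing set membership rather than absolute values makes the gate tolerant of traffic differences between the two capture windows, at the cost of missing a regression in a function that was already hot.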

Layer 5 — Cost controls: sample rate per service configurable in service.yaml (default 99 Hz; drop to 19 Hz for cheap baseline services). Budget alert at 80% of monthly cost ceiling.
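The cost controls reduce to two small decisions: which sample rate a service gets, and when spend crosses the alert threshold. In the sketch below, the `profiling.sample_rate_hz` key is a hypothetical layout for service.yaml (the lesson does not specify one), and the config is assumed to be already parsed into a dict.

```python
DEFAULT_SAMPLE_RATE_HZ = 99   # default stated above
ALERT_FRACTION = 0.80         # alert at 80% of the monthly ceiling


def sample_rate_hz(service_config: dict) -> int:
    """Read the per-service sample rate, falling back to the default."""
    profiling = service_config.get("profiling", {})
    return int(profiling.get("sample_rate_hz", DEFAULT_SAMPLE_RATE_HZ))


def budget_alert(spend_to_date: float, monthly_ceiling: float) -> bool:
    """True once spend reaches the alert fraction of the monthly ceiling."""
    return spend_to_date >= ALERT_FRACTION * monthly_ceiling
```

For example, a cheap baseline service would set the rate to 19 in its config, while a service with no profiling section keeps the 99 Hz default.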

The production blueprint reads bottom-up: an eBPF baseline collects across every language, Pyroscope stores it, trace-ids make profiles drillable in under 30s, CI diffs catch deploy regressions within an hour, and per-service sample rates cap the bill.

Profiling infrastructure: design targets

Trace-to-profile drill time: <30 seconds
Deploy regression detection window: <1 hour
Pager-to-git-blame MTTR: <90 seconds
Storage per service per month: <10 GB (Pyroscope 2.0)
eBPF capabilities required: CAP_PERFMON and CAP_BPF (no full root)
Profile RBAC: Per-team tenancy

Quiz

A profile from your service leaks to a vendor's support team. What is the security concern?

Quiz

The OTel profile signal is in beta as of 2026. What is the practical implication for teams choosing a profiling tool today?

Low-frequency sampling (≈100 Hz) under a per-host overhead budget keeps profiling overhead below roughly 1%; each profile is tagged with version and commit so regressions are attributable when compared across deploys.

Recall before you leave

01 Why are profiles treated as security-sensitive artefacts rather than just operational data?
02 Design the profiling CI gate for a 50-service platform to catch CPU regressions at deploy time.
03 What is the OTel profile signal and what does it standardise?

Recap

Profiles contain function names, call patterns, and sometimes allocation argument values — treat them as security-sensitive artefacts with RBAC, audit logs, and retention limits, never shared externally without approval. Five industry war stories (Discord, GitHub, Stripe, Cloudflare, Slack) follow the same pattern: dashboards showed normal, the profile showed the bottleneck, the fix was obvious from the flame graph. The OTel profile signal standardises profiles as the fourth observability pillar with a data model, OTLP transport, and trace-id integration; it is in beta as of 2026 but worth tracking when choosing tooling. Production profiling infrastructure for a 200-service polyglot fleet combines an eBPF DaemonSet (universal baseline), per-language native agents (depth), Pyroscope 2.0 self-hosted (storage), trace-id correlation (30-second drill), and CI differential profiles (1-hour regression detection). The cultural shift: senior on-call engineers in 2026 open the profile dashboard the same reflexive way they opened traces two years ago.
