Continuous profiling: always-on flame graphs with eBPF and trace-id correlation

Continuous profiling runs all the time at 2-5% CPU overhead, so when an SLO burns the flame graph is already saved. eBPF captures polyglot stacks without language hooks; trace-id correlation drills from any slow span to the exact function in about 30 seconds.

An SLO burns at 2:14 AM. The pager wakes the on-call. Traditional profiling requires them to SSH in, reproduce the issue under load, capture a profile, and parse it — at 2:14 AM, under pressure. Continuous profiling already has the flame graph for 2:14 AM waiting in a dashboard.

Traditional vs continuous profiling

Traditional (on-demand) profiling: SSH in, run perf record or hit /debug/pprof/profile, gather data, analyse, leave. Cost: only during capture. Limitation: requires you to be present and the issue to be actively occurring. For intermittent issues or incidents that self-resolve, you lose the evidence.

Continuous profiling: an agent runs on every host or container, sampling 100 times per second continuously, batching and shipping compressed profiles every 10-15 seconds to a backend. The backend stores them indexed by service, host, and time range. Storage: ~50-200 MB/day per service, ~1.5-6 GB/month. Overhead: 2-5% CPU. The critical win: when an SLO burns, the profile of the burning minutes is already saved.

The standing 2-5% CPU and ~50-200 MB/day are the price of having the incident's flame graph already there — on-demand pays nothing until capture, but self-resolving incidents leave no evidence to capture.

The sampler in every process feeds a per-host agent that ships compressed stacks to an aggregator and profile store; because collection is always-on, the flame-graph UI can query any past moment retroactively — no reproduction needed.
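
To make the sample-batch-ship loop concrete, here is a minimal in-process sketch in Python. It is not how Pyroscope or Parca agents are implemented; it only illustrates the idea of sampling at a fixed rate, folding stacks into counts, and flushing a batch on a timer. The names `run_sampler`, `busy_work`, and `demo` are hypothetical, and `flush` stands in for whatever would compress and upload a batch.

```python
import collections
import sys
import threading
import time
import traceback


def sample_stacks(counts, skip_thread_id):
    """Take one sample: record the current stack of every other thread."""
    for thread_id, frame in sys._current_frames().items():
        if thread_id == skip_thread_id:
            continue  # do not profile the sampler itself
        # extract_stack returns frames root-first; fold them into "a;b;c".
        names = [f.name for f in traceback.extract_stack(frame)]
        counts[";".join(names)] += 1


def run_sampler(stop_event, flush, hz=100, flush_every_s=10.0):
    """Sample at roughly `hz` and hand each batch of folded stacks to `flush`."""
    counts = collections.Counter()
    interval = 1.0 / hz
    last_flush = time.monotonic()
    me = threading.get_ident()
    while not stop_event.is_set():
        sample_stacks(counts, me)
        if time.monotonic() - last_flush >= flush_every_s:
            flush(dict(counts))  # a real agent would compress and upload here
            counts.clear()
            last_flush = time.monotonic()
        time.sleep(interval)
    if counts:
        flush(dict(counts))  # ship whatever is left on shutdown


def busy_work(stop_event):
    """Hypothetical workload so the sampler has something to observe."""
    while not stop_event.is_set():
        sum(i * i for i in range(1000))


def demo(duration_s=0.3):
    """Run the workload and the sampler side by side; return the batches."""
    batches = []
    stop = threading.Event()
    worker = threading.Thread(target=busy_work, args=(stop,))
    sampler = threading.Thread(
        target=run_sampler,
        args=(stop, batches.append),
        kwargs={"flush_every_s": 0.1},
    )
    worker.start()
    sampler.start()
    time.sleep(duration_s)
    stop.set()
    worker.join()
    sampler.join()
    return batches
```

Each batch is a mapping from a folded stack such as `...;run;busy_work` to the number of samples that landed in it, which is the raw material a flame graph is drawn from.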

eBPF: language-agnostic profiling

Traditional language profilers (Go pprof, JFR, async-profiler, py-spy) require language-specific support — the runtime must expose stack walking APIs. For Python, Ruby, and older PHP interpreters, this requires hooks the runtime team must provide.

eBPF profilers (Pyroscope eBPF mode, Parca) read stacks from the kernel side: the kernel’s perf_event_open syscall plus a BPF-attached probe captures user-space stacks at sample time. This means:

Works for any language, any binary, with no application code change.
One agent covers all services on the host — Go, Java, Node, Python.
Catches third-party library overhead that language-specific profilers might miss.

The catch: symbol resolution. The kernel sees memory addresses; the profiler maps them back to function names using debug info (DWARF, BTF, JIT-emitted symbol files for V8 or JVM). Most production eBPF profilers handle this; occasional [unknown] frames appear when DWARF is stripped or JIT code is too volatile.
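
Symbol resolution can be pictured as a range lookup. The sketch below assumes a tiny, made-up symbol table of (start address, size, name) entries, which is roughly the information debug info provides; the addresses and function names are hypothetical. Any address that falls outside every known range becomes `[unknown]`, which is what happens when symbols are stripped.

```python
import bisect

# Hypothetical symbol table: (start_address, size_in_bytes, function_name).
SYMBOLS = sorted([
    (0x1000, 0x200, "main"),
    (0x1200, 0x100, "parse_request"),
    (0x1400, 0x300, "compress_payload"),
])
STARTS = [start for start, _, _ in SYMBOLS]


def resolve(address):
    """Map one instruction address to a function name, or "[unknown]"."""
    # Find the last symbol whose start address is <= address.
    i = bisect.bisect_right(STARTS, address) - 1
    if i < 0:
        return "[unknown]"
    start, size, name = SYMBOLS[i]
    # The address must also fall inside that symbol's byte range.
    if address < start + size:
        return name
    return "[unknown]"


def symbolize(stack):
    """Resolve every address in a captured stack."""
    return [resolve(address) for address in stack]
```

Here `symbolize([0x1010, 0x1300, 0x1450])` resolves the first and last addresses, while `0x1300` sits in a gap between symbols and stays `[unknown]`.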

Cross-language profiler coverage

Language	Native profiler	eBPF coverage
Go	pprof (built-in)	Full: frame pointers are standard
Java	JFR, async-profiler	Partial: needs JIT symbol maps
Python	py-spy, cProfile	Limited: interpreter frames are opaque
Node.js	--prof, clinic.js, 0x	Partial: V8 needs the --perf-prof flag
Rust / C / C++	perf, pprof-rs	Full when compiled with frame pointers

Trace-id correlation: from slow span to flame graph in 30 seconds

Each profile sample can carry the trace-id of the request being processed at the moment of sampling, read from thread-local context. When a slow trace appears in the trace view, the profiler keeps only the samples carrying that trace-id and renders them as a flame graph for that specific request.
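
A rough sketch of the tagging step, in Python. One assumption to flag: a sampler thread cannot read another thread's `threading.local`, so this sketch swaps thread-local storage for a small registry keyed by thread id, which serves the same purpose. `trace_scope`, `take_sample`, `handle_request`, and the trace-id value are all hypothetical names, not any tracer's real API.

```python
import sys
import threading
import traceback

# Maps thread id -> trace-id of the request that thread is currently serving.
_active_trace = {}
_lock = threading.Lock()


class trace_scope:
    """Context manager a (hypothetical) tracing middleware would wrap around a request."""

    def __init__(self, trace_id):
        self.trace_id = trace_id

    def __enter__(self):
        with _lock:
            _active_trace[threading.get_ident()] = self.trace_id
        return self

    def __exit__(self, *exc):
        with _lock:
            _active_trace.pop(threading.get_ident(), None)
        return False  # never swallow exceptions


def take_sample():
    """Return one (trace_id, folded_stack) pair per live thread."""
    with _lock:
        active = dict(_active_trace)
    samples = []
    for thread_id, frame in sys._current_frames().items():
        names = [f.name for f in traceback.extract_stack(frame)]
        # Threads that are not serving a request get trace_id None.
        samples.append((active.get(thread_id), ";".join(names)))
    return samples


def handle_request():
    """Hypothetical request handler that happens to be sampled mid-flight."""
    with trace_scope("trace-abc123"):
        return take_sample()
```

The sketch does not handle nested scopes on one thread; a real integration would restore the previous trace-id on exit.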

This is the bridge between “where did time go in the request” (trace span) and “what code ate the CPU” (profile). The workflow:

SLO alert fires — p99 latency over budget.
Open trace view — find slow spans, note trace-id.
Open profile view filtered by trace-id — flame graph for that exact request appears.
Widest frame is the function to fix.
Done, typically in about 30 seconds.

These five steps close the loop that dashboards and traces alone cannot: when you see a slow span, you have a direct path to the function responsible, with no reproduction and no guessing.
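
The filtering step in the workflow above can be sketched in a few lines of Python. The sample data, trace-ids, and function names are invented for illustration; the point is only that selecting by trace-id and counting folded stacks yields the widths a flame graph would draw.

```python
from collections import Counter

# Hypothetical tagged samples: (trace_id, folded stack from root to leaf).
SAMPLES = [
    ("trace-a", "handler;load_user;db_query"),
    ("trace-a", "handler;render;serialize_json"),
    ("trace-a", "handler;render;serialize_json"),
    ("trace-a", "handler;render;serialize_json"),
    ("trace-b", "handler;load_user;db_query"),
    (None, "gc;sweep"),
]


def folded_for_trace(samples, trace_id):
    """Keep only samples carrying `trace_id` and count identical stacks."""
    return Counter(stack for tid, stack in samples if tid == trace_id)


def frame_widths(folded):
    """Inclusive width per function: number of samples it appears in."""
    widths = Counter()
    for stack, count in folded.items():
        for name in set(stack.split(";")):
            widths[name] += count
    return widths


def hottest_leaf(folded):
    """Function most often at the top of the stack, i.e. burning CPU itself."""
    leaves = Counter()
    for stack, count in folded.items():
        leaves[stack.split(";")[-1]] += count
    return leaves.most_common(1)[0]
```

For `trace-a`, the root `handler` is trivially the widest frame because every sample passes through it, so the more useful answer is usually the hottest leaf, here `serialize_json` with 3 of 4 samples.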

OpenTelemetry’s profile signal (stabilising in 2025-2026) standardises this linkage. Production-grade observability platforms (Datadog, Grafana with Pyroscope, Honeycomb) ship this drilldown out of the box.

Profile storage economics

Continuous profiling: cost and storage

Profile size per 30-second capture: ~50-500 KB compressed
Profiles per hour (15-s intervals): 240
Storage per service per day: ~50-200 MB
Storage per service per month: ~1.5-6 GB
Fleet of 200 services: 300 GB - 1.2 TB/month
Object storage cost: ~$0.02/GB per month ≈ $6-24/month for the fleet
Pyroscope 2.0 storage improvement: ~3x vs v1 via symbol deduplication
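
The monthly and fleet figures follow from the per-day range by simple multiplication. A small Python sketch of that arithmetic, assuming a 30-day month, decimal units (1 GB = 1000 MB), and the quoted $0.02/GB applied per month of storage; the function name is hypothetical:

```python
def fleet_storage(mb_per_day, services=200, days=30, usd_per_gb_month=0.02):
    """Return (GB per service per month, fleet GB per month, fleet USD per month)."""
    gb_per_service_month = mb_per_day * days / 1000  # decimal units: 1 GB = 1000 MB
    fleet_gb = gb_per_service_month * services
    return gb_per_service_month, fleet_gb, fleet_gb * usd_per_gb_month


profiles_per_hour = 3600 // 15   # one profile every 15 seconds -> 240

low = fleet_storage(50)    # 50 MB/day  -> 1.5 GB/service, 300 GB fleet, about $6
high = fleet_storage(200)  # 200 MB/day -> 6 GB/service, 1200 GB fleet, about $24
```

At this scale the storage bill is small next to the standing CPU overhead, which is the cost that actually deserves scrutiny.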

Pyroscope 2.0 (released April 2026) cut storage 3x by deduplicating symbols across profiles from the same binary — function names and source paths are shared in a common symbol table instead of repeated in every profile.
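
The idea behind symbol deduplication can be illustrated with a shared string table. This is a conceptual sketch only and does not reflect Pyroscope's actual storage format; `SymbolTable`, `encode_profile`, and `decode_profile` are hypothetical names.

```python
class SymbolTable:
    """Stores each function name once and hands out small integer ids."""

    def __init__(self):
        self._ids = {}
        self.names = []

    def intern(self, name):
        """Return the id for `name`, adding it to the table on first sight."""
        if name not in self._ids:
            self._ids[name] = len(self.names)
            self.names.append(name)
        return self._ids[name]


def encode_profile(folded, table):
    """Replace every function name in a folded profile with its table id."""
    return {
        tuple(table.intern(name) for name in stack.split(";")): count
        for stack, count in folded.items()
    }


def decode_profile(encoded, table):
    """Rebuild the folded-stack form from ids plus the shared table."""
    return {
        ";".join(table.names[i] for i in ids): count
        for ids, count in encoded.items()
    }
```

Two profiles from the same binary that both contain `main;handler;serialize_json` then store that path as the same three integers, and the strings themselves live in the table exactly once.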

Retention strategy: 7 days full-fidelity for active debugging, 30 days downsampled (one profile per 5 minutes), 90 days for long-term trend analysis. Budget-conscious teams cap at 14 days fine + 60 days coarse.
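
A tiered retention policy like this one can be sketched as a filter over profile timestamps. The 7-day and 30-day tiers follow the text; the granularity of the 90-day tier is not specified, so the one-profile-per-hour bucket below is purely an assumption, as is the function name `apply_retention`.

```python
from datetime import datetime, timedelta, timezone

FULL_FIDELITY = timedelta(days=7)        # keep every profile
DOWNSAMPLED = timedelta(days=30)         # keep one profile per 5 minutes
LONG_TERM = timedelta(days=90)           # keep one profile per LONG_TERM_BUCKET
DOWNSAMPLED_BUCKET = timedelta(minutes=5)
LONG_TERM_BUCKET = timedelta(hours=1)    # assumption: not stated in the text


def apply_retention(timestamps, now):
    """Return the timestamps of the profiles that survive the policy."""
    kept = []
    seen_buckets = set()
    for ts in sorted(timestamps):
        age = now - ts
        if age <= FULL_FIDELITY:
            kept.append(ts)
            continue
        if age <= DOWNSAMPLED:
            bucket_size = DOWNSAMPLED_BUCKET
        elif age <= LONG_TERM:
            bucket_size = LONG_TERM_BUCKET
        else:
            continue  # older than the longest tier: drop
        # Keep only the first profile that falls into each time bucket.
        bucket = (bucket_size, int(ts.timestamp() // bucket_size.total_seconds()))
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            kept.append(ts)
    return kept
```

A real backend would more likely merge the profiles in a bucket than discard all but one, since merging preserves rare stacks; dropping is used here only to keep the sketch short.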

Quiz

An eBPF profiler shows many '[unknown]' frames for a Python service. What is the cause?

What does trace-id correlation in continuous profiling enable that a standalone CPU profile cannot provide?

Recall before you leave

01 What is the critical operational advantage of continuous profiling over on-demand profiling during incidents?
02 Why does an eBPF profiler work for Go and Rust but produce [unknown] frames for Python?
03 How does trace-id correlation work mechanically?

Recap

Continuous profiling agents run on every host, sample stacks 100 times per second, and ship compressed profiles every 10-15 seconds to a backend such as Pyroscope or Parca. At 2-5% CPU overhead, this is cheap enough to leave always on. eBPF agents capture stacks from the kernel side without language-specific hooks: one agent per host covers Go, Java, Node, and Python, though JIT-compiled and interpreted runtimes need extra support for accurate symbol resolution. Trace-id labels on samples let you filter a flame graph down to one specific request in about 30 seconds. Pyroscope 2.0's symbol deduplication cut storage roughly 3x, keeping per-service monthly storage under 10 GB. The SLO → trace → profile workflow cuts the diagnosis step of a CPU-bound incident to under 90 seconds. When an SLO alert fires at 2 AM, your first move is the profile dashboard, not SSH and not grep, because the data is already there.
