Observability OBS · 07 · 02

Sampling vs instrumentation profiling: why 99 Hz wins in production

Instrumentation wraps every function call for exact data but collapses under production load; sampling captures one stack every 10 ms at bounded overhead, making it the only viable production profiling strategy.

OBS Middle ◷ 13 min


Your staging server runs under profiling with no problem. The same profiler in production raises latency by 40%. The profiler didn’t change — the load did. Understanding why is the difference between a tool you can run always and one you can only use in emergencies.

Two ways to profile

Instrumentation profiling works by wrapping every function entry and exit with timing code. The runtime measures exactly how many times each function was called and exactly how long it ran. The cost is proportional to function call frequency — at 10 million calls per second, you pay 10 million timing measurements per second. On a dev machine with light load this is fine; under production traffic it can add 20-100% overhead. Instrumentation is great for small, isolated benchmarks; it is not viable for always-on production profiling.
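A minimal sketch of what "wrapping every function entry and exit" looks like, written as a Python decorator. The names `instrumented`, `call_counts`, `total_seconds`, and `parse_header` are hypothetical and exist only for this illustration; real instrumentation profilers hook calls inside the runtime rather than through decorators.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function totals collected by the wrapper.
call_counts = defaultdict(int)
total_seconds = defaultdict(float)

def instrumented(func):
    """Wrap func so every call pays for two clock reads and two dict updates."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            call_counts[func.__name__] += 1
            total_seconds[func.__name__] += elapsed
    return wrapper

@instrumented
def parse_header(line):  # hypothetical hot function
    return line.split(":", 1)

for _ in range(1000):
    parse_header("content-type: text/plain")

print(call_counts["parse_header"], total_seconds["parse_header"])
```

The call count is exact, but the bookkeeping runs once per call, so its total cost grows with call frequency.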

Sampling profiling works differently: a timer fires N times per second and captures the current call stack. The cost is N stack walks per second regardless of how many function calls happen between them. At 100 Hz that is 100 stack walks per second. At ~5-20 microseconds each, the walks themselves take 0.5-2 ms of CPU per second, or 0.05-0.2% of one core; once costs such as timer interrupts, cache disturbance, and sample aggregation are counted, overhead on a busy CPU is typically quoted at roughly 0.5-2%. Modern production profilers (Go pprof, Linux perf, Java async-profiler, py-spy, eBPF-based profilers) are all sampling-based. The cost is bounded.
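The idea can be sketched in a few lines of Python: a background thread wakes up on a timer and records the stack of the thread being profiled. This is an illustration only, not how pprof, perf, or py-spy are implemented. It relies on `sys._current_frames()`, which is CPython-specific, and the names `sample_stacks`, `profile`, and `busy_loop` are hypothetical.

```python
import sys
import threading
import time
from collections import Counter

def sample_stacks(target_thread_id, rate_hz, stop_event, samples):
    """Record the target thread's stack roughly rate_hz times per second."""
    interval = 1.0 / rate_hz
    # wait() returns True once stop_event is set, which ends the loop.
    while not stop_event.wait(interval):
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:  # walk from the innermost frame outwards
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            samples[tuple(reversed(stack))] += 1

def profile(func, rate_hz=100):
    """Run func on the calling thread while a background thread samples it."""
    samples = Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), rate_hz, stop_event, samples),
        daemon=True,
    )
    sampler.start()
    try:
        func()
    finally:
        stop_event.set()
        sampler.join()
    return samples

def busy_loop(seconds=0.3):  # hypothetical workload
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass

samples = profile(busy_loop)
for stack, count in samples.most_common(3):
    print(count, " > ".join(stack))
```

The sampler does the same amount of work per second whether `busy_loop` makes ten calls or ten million; only the sample rate and the stack depth change its cost.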

Sampling's cost is fixed by sample rate (low, statistical); instrumentation's cost scales with call frequency (high, exact) — the same program, two opposite overhead profiles.
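To put numbers on the two cost models, here is a back-of-the-envelope sketch. The 50 ns per instrumented call is an assumption chosen for illustration, not a figure from any particular profiler; the 20 μs per sample is the upper end of the stack walk range quoted above. The function names are hypothetical.

```python
def instrumentation_overhead(calls_per_second, cost_per_call_ns):
    """CPU-seconds spent in timing code per wall-clock second."""
    return calls_per_second * cost_per_call_ns / 1_000_000_000

def sampling_overhead(rate_hz, cost_per_sample_ns):
    """CPU-seconds spent walking stacks per wall-clock second."""
    return rate_hz * cost_per_sample_ns / 1_000_000_000

# 10 million calls/s at an assumed 50 ns of timing code per call.
instr = instrumentation_overhead(10_000_000, 50)

# 100 samples/s at 20 microseconds per stack walk.
samp = sampling_overhead(100, 20_000)

print(f"instrumentation: {instr:.1%} of one core")
print(f"sampling:        {samp:.1%} of one core")
```

Under these assumptions instrumentation consumes half a core while the stack walks consume 0.2% of one. The model counts only the walks themselves, so measured sampling overhead can be higher, but doubling the application's call rate doubles the first number and leaves the second unchanged.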

The key statistical property

At any sample rate, a function that uses X% of CPU will appear in about X% of samples on average, regardless of how many times it was called. The statistic reported is “fraction of CPU time,” not “call count.” This is exactly the right metric for finding bottlenecks: you want to know which function is on the CPU most often, not which function is called most often.

The consequence: sampling at 100 Hz is sufficient to find functions consuming more than a few percent of CPU. A function using 10% of CPU will hit ~10 samples per second. A 30-second profile gives 300 samples of that function — plenty for a reliable estimate.
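How reliable is "plenty"? A rough sketch, assuming one fully busy core and treating each sample as an independent draw (a simplification), so that the binomial standard error applies. The function name `sample_estimate` is hypothetical.

```python
import math

def sample_estimate(cpu_share, rate_hz, seconds):
    """Expected hits and standard error of the estimated CPU share."""
    total_samples = rate_hz * seconds
    expected_hits = cpu_share * total_samples
    # Binomial standard error of a proportion: sqrt(p * (1 - p) / n).
    std_error = math.sqrt(cpu_share * (1 - cpu_share) / total_samples)
    return expected_hits, std_error

hits, err = sample_estimate(cpu_share=0.10, rate_hz=100, seconds=30)
print(f"expected hits: {hits:.0f}")
print(f"estimate: 10% +/- {1.96 * err:.1%} (95% interval)")
```

With 3,000 samples the 10% function is pinned down to roughly plus or minus one percentage point. A function at 0.5% of CPU would get only about 15 hits in the same window, which is why very small consumers need longer profiles.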

Sample rate choices in the wild

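Linux perf is conventionally run at 99 Hz rather than 100 Hz (for example, perf record -F 99). This is a convention, not perf's built-in default, which is much higher. The odd number is deliberate: sampling at exactly 100 Hz can fall into lockstep with periodic kernel timers and other activity that runs on round-number intervals, so the profiler keeps landing on the same point in each cycle and over- or under-counts it. 99 Hz drifts relative to those cycles and avoids the resonance. Go pprof defaults to 100 Hz. eBPF-based continuous profilers typically use 19 or 49 Hz to minimise impact on very busy containers.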
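The lockstep effect is easy to reproduce in a toy simulation. Assume a periodic task that wakes every 10 ms and runs for 1 ms, so its true CPU share is 10%, and assume the first sample lands exactly at the start of a period. Both the task and the function name `observed_share` are hypothetical.

```python
def observed_share(rate_hz, seconds, period_us=10_000, busy_us=1_000):
    """Fraction of samples that land while the periodic task is running.

    The task starts every period_us microseconds and runs for busy_us,
    so its true CPU share is busy_us / period_us (10% with the defaults).
    """
    total_samples = rate_hz * seconds
    hits = 0
    for k in range(total_samples):
        t_us = k * 1_000_000 // rate_hz  # time of the k-th sample
        if t_us % period_us < busy_us:
            hits += 1
    return hits / total_samples

share_100 = observed_share(100, 30)  # every sample hits the start of a period
share_99 = observed_share(99, 30)    # sample phase drifts across the period

print(f"100 Hz reports {share_100:.1%}")  # 100.0%
print(f" 99 Hz reports {share_99:.1%}")   # 10.1%
```

At 100 Hz every sample lands in the same 1 ms window, so the task looks like it owns the CPU; with a different starting offset it would never be seen at all. At 99 Hz the sample point shifts by about 101 μs per period and sweeps across the whole cycle, recovering the true 10%.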

Approach	Cost model	Typical overhead	Use case
Instrumentation	Per function call	20-100%	Dev benchmarks, targeted micro-benchmarks
Sampling (Go pprof)	Per sample (100/s)	0.5-2%	Production on-demand or continuous
Sampling (eBPF)	Per sample (19-49/s)	1-3%	Continuous, polyglot fleets
Continuous profiler (full)	Sampling + batching + shipping	2-5%	Always-on production fleet

Instrumentation's overhead can hit 100% while every sampling approach stays at or under 5% — that cliff is the entire reason production profiling is sampling-based.

Sample rates and overhead

Go pprof default sample rate: 100 Hz
Linux perf conventional rate: 99 Hz (avoids timer sync)
eBPF profiler typical rate: 19 or 49 Hz
Stack walk cost per sample: ~5-20 μs
Overhead at 100 Hz, busy CPU: ~0.5-2%
Continuous profiler (full pipeline): 2-5% CPU

When the claimed 2-5% overhead becomes 12%

A continuous profiler claims 2-5% overhead; you measure 12% in production. The most common causes:

Sample rate was accidentally set 10x higher than default.
Stack walking is expensive because the runtime executes interpreted or JIT-compiled code, so frames must be resolved to symbols at sample time (Python, JVM without native hooks).
The agent is doing heavy work such as symbol decompression or profile compression on an application thread instead of asynchronously.
Average stack depth is 120+ frames (deep middleware chains) — each sample costs proportionally more to walk.

Always check profiler config before assuming the tool is misbehaving.
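A rough way to reason about the first and last causes is a linear cost model: each sample costs a fixed amount plus something per frame walked. The per-frame cost of 0.5 μs and the baseline depth of 30 frames below are assumptions for illustration only, and `profiler_cpu_fraction` is a hypothetical helper, not part of any profiler.

```python
def profiler_cpu_fraction(rate_hz, stack_depth, per_frame_us, fixed_us=0.0):
    """Fraction of one core spent capturing samples under a linear cost model."""
    per_sample_us = fixed_us + stack_depth * per_frame_us
    return rate_hz * per_sample_us / 1_000_000

# Assumed baseline: 100 Hz, 30-frame stacks, 0.5 microseconds per frame.
baseline = profiler_cpu_fraction(rate_hz=100, stack_depth=30, per_frame_us=0.5)

# Misconfiguration: sample rate set 10x higher than intended.
ten_x_rate = profiler_cpu_fraction(rate_hz=1000, stack_depth=30, per_frame_us=0.5)

# Deep middleware chains: 120 frames per stack instead of 30.
deep_stacks = profiler_cpu_fraction(rate_hz=100, stack_depth=120, per_frame_us=0.5)

print(f"10x rate multiplies capture cost by {ten_x_rate / baseline:.0f}")
print(f"120-frame stacks multiply it by {deep_stacks / baseline:.0f}")
```

The absolute numbers depend entirely on the assumed per-frame cost, but the ratios show why rate and stack depth are the first things to check: they multiply together, so a 10x rate on 4x deeper stacks is a 40x increase in capture cost.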

Quiz

A service makes 5 million function calls per second. An instrumentation profiler adds 1 μs per call. A sampling profiler runs at 100 Hz and costs 10 μs per sample. Which adds less overhead?

Quiz

Linux perf is conventionally run at 99 Hz, not 100 Hz. Why?

Recall before you leave

01
Why is instrumentation profiling impractical for always-on production profiling?
02
A sampling profiler reports a function at 8% of samples. What does this mean in terms of CPU usage?
03
Why is Linux perf typically run at 99 Hz instead of 100 Hz?

Recap

Instrumentation profiling wraps every function call, giving exact data but collapsing under production load as overhead scales with call frequency. Sampling profiling fires a timer N times per second and captures the current stack, so overhead is bounded regardless of how many functions are called between samples. At 100 Hz and ~10 μs per stack walk, the walks alone cost about 0.1% of a core and total overhead is typically 0.5-2%; a full continuous profiler pipeline (with batching and shipping) brings the total to 2-5%. The statistical property that makes sampling powerful: a function consuming X% of CPU appears in about X% of samples, so you get CPU share estimates without touching every call. Linux perf is conventionally run at 99 Hz rather than 100 Hz to avoid synchronising with periodic kernel timers, a subtle correctness detail every senior engineer should know. Now when you see a profiler claiming 2-5% overhead and then measuring 12%, you know where to look first: sample rate, stack depth, symbol resolution at sample time, and whether the agent is doing heavy work on the application thread.


Connected lessons

builds on

Flame graphs: reading the picture that shows where time goes (junior)

unlocks

Profile types: CPU, memory, off-CPU, mutex, and which one to reach for (middle)

