
Hardware counters, cold-start profiles, and profile security

A flame graph names a hot function. Hardware performance counters tell you why it is hot: memory-stalled at 0.5 IPC or compute-bound at 3.0 IPC. Cold-start profiles drive different fixes than steady-state ones. Profiles leak code structure and must be RBAC-gated.



A JSON parser appears as 30% of a service’s CPU in the flame graph. The team switches to a faster parser and saves 10%. An engineer then runs perf with cache-miss counters and finds an IPC of 0.4 and a cache-miss rate of 12%: the parser is memory-stalled, not compute-bound. Restructuring the input data layout saves 50%.

Hardware performance counters (HPCs)

A flame graph names a function. It does not tell you what the CPU is doing inside that function. Hardware performance counters expose the silicon-level cost.

Key counters:

cycles: raw CPU cycles consumed
instructions: instructions retired. IPC = instructions / cycles.
cache-misses: last-level cache (typically L3) miss count; a miss served from DRAM stalls the load for roughly 100 ns
branch-misses: branch misprediction count (each miss costs roughly 15 cycles)
page-faults: OS page fault count (a software event reported by the kernel rather than a hardware counter)
dTLB-load-misses: data TLB (translation lookaside buffer) load miss count

Interpreting IPC:

IPC < 1.0: likely memory-bound. The CPU spends most of its cycles stalled waiting for data from cache or RAM; confirm with the cache-miss counters. A faster algorithm with the same access pattern rarely helps; data-layout fixes (struct-of-arrays, cache-friendly traversal, prefetching) are the lever.
IPC 1.0–2.5: mixed. Investigate specific misses.
IPC > 2.5: compute-bound. The algorithm is doing useful work; vectorisation or smarter math is the lever.

Same wide leaf on the flame graph, opposite fixes: IPC below 1.0 means memory-stalled (fix data layout); above 2.5 means compute-bound (fix the algorithm).
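The triage rule above can be written down as a small function. This is a minimal sketch, not a definitive classifier: the IPC thresholds come from the text, while `HIGH_MISS_RATE` (5%) and the function name are assumptions chosen for illustration. Real cut-offs depend on the CPU and the workload.

```python
# Assumption: a 5% cache-miss rate counts as "high". The lesson gives no
# numeric cut-off, so tune this for your own hardware and workload.
HIGH_MISS_RATE = 0.05


def classify_ipc(cycles: int, instructions: int, cache_miss_rate: float) -> str:
    """Rough triage of a hot function using the IPC thresholds from the text."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    ipc = instructions / cycles
    if ipc < 1.0 and cache_miss_rate >= HIGH_MISS_RATE:
        return "memory-stalled"   # fix direction: data layout, prefetch
    if ipc > 2.5 and cache_miss_rate < HIGH_MISS_RATE:
        return "compute-bound"    # fix direction: vectorisation, algorithm
    return "mixed"                # investigate specific misses


if __name__ == "__main__":
    # The opening scenario: IPC 0.4 with a 12% cache-miss rate.
    print(classify_ipc(cycles=1_000_000, instructions=400_000, cache_miss_rate=0.12))
```

Treat the result as a prompt for the next measurement, not a verdict: a low IPC with a low miss rate falls into "mixed" on purpose, because branch misses or dependency chains may be the cause.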

Usage on Linux:

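# Whole-run totals; the instructions line also reports IPC ("insn per cycle")
perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses \
    ./myapp workload
# Per-function attribution: sample the same events with call graphs
perf record -e cycles,instructions,cache-references,cache-misses,branch-misses \
    -g ./myapp workload
perf report  # shows per-function sample breakdowns, one view per event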
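For dashboards or CI checks it can help to turn raw counter totals into IPC and a cache-miss rate automatically. The sketch below parses the comma-separated output that `perf stat -x,` emits. The field positions (value first, event name third) follow the CSV format described in the perf-stat man page, but they are an assumption here and can vary between perf versions and options. The `SAMPLE` text is made up for illustration, not captured from a real run.

```python
# Illustrative input only: numbers are invented to match the lesson's scenario.
SAMPLE = """\
2500000000,,cycles,1000000000,100.00,,
1000000000,,instructions,1000000000,100.00,0.40,insn per cycle
40000000,,cache-references,1000000000,100.00,,
4800000,,cache-misses,1000000000,100.00,12.00,of all cache refs
<not counted>,,branch-misses,0,0.00,,
"""


def parse_perf_stat_csv(text: str) -> dict:
    """Map event name -> counter value, skipping rows without a numeric count."""
    counts = {}
    for line in text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # blank lines, comments, "<not counted>" rows
        event = fields[2].split(":")[0]  # drop modifiers such as ":u"
        counts[event] = int(fields[0])
    return counts


def summarise(counts: dict) -> tuple:
    """Return (IPC, cache-miss rate) from the parsed counters."""
    ipc = counts["instructions"] / counts["cycles"]
    miss_rate = counts["cache-misses"] / counts["cache-references"]
    return ipc, miss_rate


if __name__ == "__main__":
    ipc, miss_rate = summarise(parse_perf_stat_csv(SAMPLE))
    print(f"IPC={ipc:.2f} cache-miss rate={miss_rate:.1%}")
    # prints: IPC=0.40 cache-miss rate=12.0%
```

Rows that perf could not count are dropped rather than treated as zero, so a missing counter raises a `KeyError` in `summarise` instead of silently producing a misleading ratio.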

Signal	Question answered	Fix direction
IPC < 1.0 + high cache-miss	Memory-stalled: CPU waits for RAM	Data layout, prefetch, smaller structs
IPC > 2.5 + low cache-miss	Compute-bound: algorithm is the limit	Vectorisation, SIMD, smarter algorithm
High branch-misses	Branch predictor failing on irregular data	Branchless code, sorted input, lookup tables

Cold-start vs steady-state profiles

If you have only ever profiled a running service under load, you have been missing half the picture: a profile of the first 60 seconds after process start looks nothing like a profile after an hour of traffic.

Cold-start phase:

JIT runtimes compile hot code paths (HotSpot, V8, .NET CLR) — compilation shows up as CPU cost.
Caches are cold: connection pools establishing, lazy-loaded modules loading, L3 cache empty.
Optimisations: AOT compilation (GraalVM native-image, .NET ReadyToRun), eager module loading, connection pre-warming.

Steady-state phase:

JIT is fully optimised; caches are warm.
Optimisations: algorithmic fixes, data-layout changes, lock reduction.

Confusing the two is a common failure: a team optimises the steady-state hotspot and is surprised when autoscaler scale-out events still degrade tail latency — the cold-start path was never measured.

Production-grade profiling captures both: a cold-start profile (first 30-60 seconds post-launch) and a steady-state profile (after warmup, under representative load). Maintain separate dashboards for both phases.
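One way to keep the two phases apart is to tag every sample with the process uptime at which it was taken and bucket it before aggregation. The sketch below assumes samples arrive as `(uptime_seconds, stack)` pairs and uses a 60-second warmup window, the upper end of the 30-60 second range above. Both the sample shape and the function names are hypothetical; a real profiler would attach the phase as a label on the profile instead.

```python
from collections import defaultdict

# Assumption: warmup ends 60 s after process start. Measure your own service
# (JIT tiers, pool warmup) before settling on a value.
WARMUP_SECONDS = 60.0


def profile_phase(uptime_seconds: float, warmup_seconds: float = WARMUP_SECONDS) -> str:
    """Label a sample by the phase of the process lifetime it was taken in."""
    if uptime_seconds < 0:
        raise ValueError("uptime cannot be negative")
    return "cold-start" if uptime_seconds < warmup_seconds else "steady-state"


def split_by_phase(samples, warmup_seconds: float = WARMUP_SECONDS) -> dict:
    """Group (uptime_seconds, stack) pairs into one bucket per phase."""
    buckets = defaultdict(list)
    for uptime, stack in samples:
        buckets[profile_phase(uptime, warmup_seconds)].append(stack)
    return dict(buckets)


if __name__ == "__main__":
    samples = [(2.0, "jit_compile"), (45.0, "pool_connect"), (900.0, "parse_json")]
    print(split_by_phase(samples))
```

Keeping the buckets separate stops a long steady-state run from drowning out the few cold-start samples that explain scale-out latency.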

Profile security

A profile contains function names — often including private internal APIs, undocumented endpoints, and build paths revealing the deploy environment. Memory profiles can include allocation arguments (string contents, JSON bodies) when poorly configured.

Real incidents include pprof endpoints accidentally exposed via /debug/pprof on a public port, leaking source paths and feature-flag names, and allocation profilers leaking session tokens from query strings.
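The query-string leak can be reduced by scrubbing URLs before they are attached to a profile as labels or captured as allocation arguments. The sketch below is one possible approach using only the standard library. `SENSITIVE_KEYS` and `redact_query` are hypothetical names, and the key list is an assumption that would need to match your own parameter names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameter names carry secrets in your service.
SENSITIVE_KEYS = {"token", "session", "api_key"}


def redact_query(url: str) -> str:
    """Replace the values of sensitive query parameters, keep everything else."""
    parts = urlsplit(url)
    pairs = [
        (key, "REDACTED" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


if __name__ == "__main__":
    print(redact_query("/v1/items?id=7&token=abc123"))
    # prints: /v1/items?id=7&token=REDACTED
```

A deny-list like this misses any secret parameter it does not know about; dropping the query string entirely is the stricter option when labels do not need it.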

Production discipline:

pprof endpoints bound to localhost, or served only on an authenticated admin-only path.
eBPF-based profilers run with minimal capabilities (CAP_PERFMON on Linux 5.8+, not CAP_SYS_ADMIN).
Continuous-profile backends RBAC-gated by team.
Profile exports require manager approval.

Profiles are operational data with security implications, not “ops-only artefacts safe to share.”
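The first rule in the list can be checked mechanically, for example in a deploy-time config lint. The sketch below assumes the profiling endpoint's listen address is available as a `host:port` string. `is_loopback_bind` is a hypothetical helper, not part of any profiler's API.

```python
import ipaddress


def is_loopback_bind(addr: str) -> bool:
    """Return True only if a host:port listen address is bound to loopback."""
    host, _, _port = addr.rpartition(":")
    host = host.strip("[]")  # IPv6 literals are written as [::1]:6060
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Empty host (":6060" listens on every interface) or another
        # hostname: treat as exposed rather than guess.
        return False


if __name__ == "__main__":
    for addr in ("127.0.0.1:6060", "[::1]:6060", ":6060", "0.0.0.0:6060"):
        print(addr, is_loopback_bind(addr))
```

The check fails closed: anything it cannot prove is loopback is reported as exposed, which is the safer default for a lint.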

Why this works

Linux 5.8 (2020) split the profiling capability out of CAP_SYS_ADMIN into a dedicated CAP_PERFMON capability, specifically so that profiling tools can run without full system administration access. On multi-tenant Kubernetes clusters, eBPF profilers should run with CAP_PERFMON (plus CAP_BPF, added in the same release, to load eBPF programs) rather than CAP_SYS_ADMIN, and should be scoped so that one tenant cannot see another tenant's stack frames.
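A profiler agent can verify its own privileges at startup by decoding the effective capability mask that Linux exposes in `/proc/self/status`. The sketch below uses the capability numbers from the kernel headers (CAP_SYS_ADMIN is 21, CAP_PERFMON is 38). The function names are hypothetical, and `read_effective_caps` only works on Linux.

```python
CAP_SYS_ADMIN = 21  # bit numbers as defined in linux/capability.h
CAP_PERFMON = 38    # added in Linux 5.8


def has_cap(mask: int, cap: int) -> bool:
    """True if capability number `cap` is set in a capability bitmask."""
    return bool((mask >> cap) & 1)


def read_effective_caps(status_path: str = "/proc/self/status") -> int:
    """Read the CapEff hex mask for the current process (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("CapEff line not found")


def audit(mask: int) -> dict:
    """Summarise whether the mask follows the minimal-capability rule."""
    perfmon = has_cap(mask, CAP_PERFMON)
    sys_admin = has_cap(mask, CAP_SYS_ADMIN)
    return {
        "has_perfmon": perfmon,
        "has_sys_admin": sys_admin,
        "minimal": perfmon and not sys_admin,
    }


if __name__ == "__main__":
    print(audit(1 << CAP_PERFMON))  # a process granted only CAP_PERFMON
```

On a real host, `audit(read_effective_caps())` gives the same summary for the running process; logging it at startup makes an over-privileged deployment visible early.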

Quiz

A flame graph shows a JSON deserialisation function consuming 35% of CPU. Hardware counters show IPC = 0.4 and cache-miss rate = 11%. What kind of fix is most likely to help?

Quiz

A team optimises the steady-state CPU hotspot. After deploy, scale-out events still cause high tail latency for 60 seconds. What measurement did they miss?

Capture two profiles: cold-start (first 30-60s, drives JIT/warmup fixes) and steady-state (drives algorithmic fixes).

Recall before you leave

01. Explain why hardware performance counters are necessary alongside stack-sampling profilers, and walk through a concrete diagnosis scenario where the flame graph alone would mislead.
02. What are the production security constraints for running profilers, and what is the minimum-capability principle?

Recap

Stack-sampling profilers name the hot function; hardware performance counters say why it is hot. IPC below 1.0 with a high cache-miss rate identifies memory-stalled code, where data-layout fixes (smaller structs, cache-friendly traversal) outperform algorithmic rewrites. IPC above 2.5 identifies compute-bound code, where vectorisation or algorithm improvements are the lever. Cold-start profiles capture the JIT-compilation and cache-warming phase that dominates the first 30-60 seconds after a new process launches, which is what determines tail latency during autoscaler scale-out. Steady-state profiles capture production behaviour after warmup. Profiles expose function names and may expose allocation payloads: bind pprof endpoints to localhost, run eBPF profilers with CAP_PERFMON (not CAP_SYS_ADMIN), and RBAC-gate profile backend access. When a flame graph names a hot function, ask the next question: is it memory-stalled or compute-bound? One perf stat command gives you the answer.


