Performance

Hardware counters, cold-start profiles, and profile security

Crux: A flame graph names a hot function. Hardware performance counters tell you why it is hot: memory-stalled at 0.5 IPC or compute-bound at 3.0 IPC. Cold-start profiles drive different fixes than steady-state ones. Profiles leak code structure and must be RBAC-gated.

A JSON parser appears as 30% of a service’s CPU in the flame graph. The team switches to a faster parser and saves 10%. An engineer then runs perf with cache-miss counters: IPC is 0.4 and the cache-miss rate is 12%. The parser is memory-stalled, not compute-bound. Restructuring the input data layout saves 50%.

Hardware performance counters (HPCs)

A flame graph names a function. It does not tell you what the CPU is doing inside that function. Hardware performance counters expose the silicon-level cost.

Key counters:

  • cycles: raw CPU cycles consumed
  • instructions: instructions retired. IPC = instructions / cycles.
  • cache-misses: last-level cache miss count (typically L3; a miss that goes to DRAM stalls for roughly 100 ns)
  • branch-misses: branch misprediction count (each miss costs roughly 15 cycles)
  • page-faults: OS page fault count (a software event reported by the kernel rather than a hardware counter, but collected through the same perf interface)
  • dTLB-load-misses: data TLB (translation lookaside buffer) miss count on loads
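The per-miss costs above can be turned into a back-of-the-envelope estimate of how much of a run was lost to stalls. The sketch below is illustrative only: the function name `stall_fraction` and the 3 GHz clock are assumptions, and the result is a rough upper bound because real CPUs overlap many misses with other work.

```python
def stall_fraction(cycles, cache_misses, branch_misses,
                   clock_ghz=3.0, miss_ns=100.0, branch_penalty_cycles=15.0):
    """Rough upper bound on the share of cycles lost to each kind of miss.

    Assumes every miss stalls the core for its full cost, which overstates
    the loss on out-of-order CPUs. clock_ghz is an assumed clock speed.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    # ns per miss * cycles per ns = cycles per miss
    cache_stall_cycles = cache_misses * miss_ns * clock_ghz
    branch_stall_cycles = branch_misses * branch_penalty_cycles
    return {
        "cache": min(1.0, cache_stall_cycles / cycles),
        "branch": min(1.0, branch_stall_cycles / cycles),
    }


# One second of work at 3 GHz with 5 M cache misses and 20 M branch misses
estimate = stall_fraction(3_000_000_000, 5_000_000, 20_000_000)
print(estimate)
```

Even with generous error bars, an estimate like this shows which counter deserves attention first.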

Interpreting IPC:

  • IPC < 1.0: usually memory-bound. The CPU is stalled waiting for data from cache or RAM. Swapping in a faster algorithm with the same access pattern rarely helps; data-layout fixes (struct-of-arrays, cache-friendly traversal, prefetching) are the lever. Confirm with the cache-miss and branch-miss counters, since heavy misprediction also depresses IPC.
  • IPC 1.0–2.5: mixed. Investigate specific misses.
  • IPC > 2.5: compute-bound. The CPU is retiring useful work at a high rate; vectorisation or smarter math is the lever.
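The thresholds above are simple enough to express directly. This is a minimal sketch, with `classify_ipc` as a hypothetical helper name; the cut-offs are the ones listed above and should be read as rules of thumb, not hard limits.

```python
def classify_ipc(instructions, cycles):
    """Return (ipc, label) using the rule-of-thumb thresholds from the text."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    ipc = instructions / cycles
    if ipc < 1.0:
        label = "memory-bound"
    elif ipc <= 2.5:
        label = "mixed"
    else:
        label = "compute-bound"
    return ipc, label


# Counter totals in the ratio of the JSON parser example: IPC of 0.4
print(classify_ipc(instructions=400_000_000, cycles=1_000_000_000))
```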

Usage on Linux:

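# Whole-run totals; perf stat also prints IPC as "insn per cycle"
perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses \
    ./myapp workload

# Sample the same events with call graphs to attribute them to functions
perf record -e cycles,instructions,cache-references,cache-misses,branch-misses \
    -g ./myapp workload
perf report  # per-function breakdown, one section per event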
| Signal | Question answered | Fix direction |
|---|---|---|
| IPC < 1.0 + high cache-miss | Memory-stalled: CPU waits for RAM | Data layout, prefetch, smaller structs |
| IPC > 2.5 + low cache-miss | Compute-bound: algorithm is the limit | Vectorisation, SIMD, smarter algorithm |
| High branch-misses | Branch predictor failing on irregular data | Branchless code, sorted input, lookup tables |
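The signal-to-fix mapping can be sketched as a small lookup over counter totals. The function name `fix_directions` and both "high" thresholds (10% cache-miss rate, 5% branch-miss rate) are assumptions chosen for illustration; the text does not define what counts as high, and sensible values vary by workload and CPU.

```python
def fix_directions(ipc, cache_references, cache_misses, branch_miss_rate,
                   high_cache_miss=0.10, high_branch_miss=0.05):
    """Map counter signals to fix directions.

    high_cache_miss and high_branch_miss are assumed thresholds,
    not values taken from any standard.
    """
    cache_miss_rate = cache_misses / cache_references if cache_references else 0.0
    hints = []
    if ipc < 1.0 and cache_miss_rate >= high_cache_miss:
        hints.append("memory-stalled: data layout, prefetch, smaller structs")
    if ipc > 2.5 and cache_miss_rate < high_cache_miss:
        hints.append("compute-bound: vectorisation, SIMD, smarter algorithm")
    if branch_miss_rate >= high_branch_miss:
        hints.append("branch-heavy: branchless code, sorted input, lookup tables")
    return hints


# IPC 0.4 with a 12% cache-miss rate and few branch misses
print(fix_directions(0.4, cache_references=1000, cache_misses=120,
                     branch_miss_rate=0.01))
```

An empty result means none of the three signals fired, which corresponds to the mixed case where individual misses need a closer look.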

Cold-start vs steady-state profiles

A profile of the first 60 seconds after process start looks nothing like a profile after an hour of traffic.

Cold-start phase:

  • JIT runtimes compile hot code paths (HotSpot, V8, .NET CLR) — compilation shows up as CPU cost.
  • Caches are cold: connection pools establishing, lazy-loaded modules loading, L3 cache empty.
  • Optimisations: AOT compilation (GraalVM native-image, .NET ReadyToRun), eager module loading, connection pre-warming.

Steady-state phase:

  • JIT is fully optimised; caches are warm.
  • Optimisations: algorithmic fixes, data-layout changes, lock reduction.

Confusing the two is a common failure: a team optimises the steady-state hotspot and is surprised when autoscaler scale-out events still degrade tail latency — the cold-start path was never measured.

Production-grade profiling captures both: a cold-start profile (first 30-60 seconds post-launch) and a steady-state profile (after warmup, under representative load). Maintain separate dashboards for both phases.
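One way to keep the two phases apart is to bucket stack samples by process age before aggregating them. The sketch below assumes samples arrive as `(timestamp_seconds, function_name)` pairs and uses a 60-second warmup window; both the input shape and the name `split_by_phase` are hypothetical, not any particular profiler's format.

```python
from collections import Counter


def split_by_phase(samples, process_start, warmup_seconds=60.0):
    """Count samples per function, separately for cold-start and steady state.

    samples: iterable of (timestamp_seconds, function_name) pairs
             (an assumed shape for illustration).
    """
    cold, steady = Counter(), Counter()
    for timestamp, function in samples:
        age = timestamp - process_start
        bucket = cold if age < warmup_seconds else steady
        bucket[function] += 1
    return cold, steady


samples = [
    (1000.5, "jit_compile"), (1002.0, "jit_compile"), (1010.0, "open_conn"),
    (1500.0, "parse_json"), (1501.0, "parse_json"), (1502.0, "serialize"),
]
cold, steady = split_by_phase(samples, process_start=1000.0)
print(cold.most_common(1), steady.most_common(1))
```

Feeding each counter to its own dashboard keeps a warmup-only hotspot from being averaged away by hours of steady-state samples.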

Profile security

A profile contains function names — often including private internal APIs, undocumented endpoints, and build paths revealing the deploy environment. Memory profiles can include allocation arguments (string contents, JSON bodies) when poorly configured.

Real incidents: pprof endpoints accidentally exposed via /debug/pprof on a public port, leaking source paths and feature flag names. Allocation profilers leaking session tokens from query strings.

Production discipline:

  • pprof endpoints bound only to localhost or served behind an authenticated, admin-only path.
  • eBPF-based profilers run with minimal capabilities (CAP_PERFMON on Linux 5.8+, plus CAP_BPF where the profiler loads its own eBPF programs; not CAP_SYS_ADMIN).
  • Continuous-profile backends RBAC-gated by team.
  • Profile exports require manager approval.
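The first rule in the list above is easy to check mechanically, for example in a config lint. The following is a minimal sketch: `exposed_debug_listeners` is a hypothetical helper, and the `host:port` strings are an assumed config format. It treats an empty host such as `:6060` as exposed, since that form listens on all interfaces in Go's net/http.

```python
import ipaddress


def exposed_debug_listeners(listen_addrs):
    """Return the listen addresses that are not provably loopback-only."""
    exposed = []
    for addr in listen_addrs:
        host, _, _port = addr.rpartition(":")
        host = host.strip("[]")  # unwrap bracketed IPv6 literals like [::1]
        if host in ("", "*"):
            exposed.append(addr)  # empty host means all interfaces
            continue
        if host == "localhost":
            continue
        try:
            if not ipaddress.ip_address(host).is_loopback:
                exposed.append(addr)
        except ValueError:
            # A hostname cannot be proven loopback here, so flag it.
            exposed.append(addr)
    return exposed


print(exposed_debug_listeners(
    ["127.0.0.1:6060", "0.0.0.0:6060", ":6060", "[::1]:6060", "localhost:6060"]
))
```

Flagging unknown hostnames errs on the side of a false alarm, which is the cheaper mistake for a debug endpoint.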

Profiles are operational data with security implications, not “ops-only artefacts safe to share.”

Why this works

Linux 5.8 (2020) split performance-monitoring privileges out of CAP_SYS_ADMIN into a dedicated CAP_PERFMON capability, specifically so that profiling tools can run without full system administration access. The same release added CAP_BPF for loading eBPF programs. On multi-tenant Kubernetes clusters, eBPF profilers should run with CAP_PERFMON (plus CAP_BPF where the profiler loads its own programs) rather than CAP_SYS_ADMIN, scoped to a namespace, to prevent tenants from seeing each other's stack frames.
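A running profiler's effective capabilities can be audited from the `CapEff` line of `/proc/<pid>/status`, which is a hexadecimal bitmask. The sketch below uses the capability numbers from the Linux headers (CAP_SYS_ADMIN is 21, CAP_PERFMON is 38); the helper names and the pass/fail policy are illustrative assumptions, not an established tool.

```python
CAP_SYS_ADMIN = 21  # bit positions from linux/capability.h
CAP_PERFMON = 38


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    return bool((mask >> cap) & 1)


def profiler_caps_ok(mask):
    """Assumed policy: CAP_PERFMON present, CAP_SYS_ADMIN absent."""
    return has_cap(mask, CAP_PERFMON) and not has_cap(mask, CAP_SYS_ADMIN)


# Sample status text for a process holding only CAP_PERFMON
status = "Name:\tprofiler\nCapEff:\t0000004000000000\n"
print(profiler_caps_ok(parse_cap_eff(status)))
```

On a live host the status text would come from reading `/proc/<pid>/status` for the profiler's process.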

Quiz

A flame graph shows a JSON deserialisation function consuming 35% of CPU. Hardware counters show IPC = 0.4 and cache-miss rate = 11%. What kind of fix is most likely to help?

Quiz

A team optimises the steady-state CPU hotspot. After deploy, scale-out events still cause high tail latency for 60 seconds. What measurement did they miss?

Recall before you leave
  1. Explain why hardware performance counters are necessary alongside stack-sampling profilers, and walk through a concrete diagnosis scenario where the flame graph alone would mislead.
  2. What are the production security constraints for running profilers, and what is the minimum-capability principle?
Recap

Stack-sampling profilers name the hot function; hardware performance counters name why it is hot. IPC below 1.0 with a high cache-miss rate identifies memory-stalled code where data-layout fixes (smaller structs, cache-friendly traversal) outperform algorithmic rewrites. IPC above 2.5 identifies compute-bound code where vectorisation or algorithm improvements are the lever. Cold-start profiles capture the JIT compilation and cache-warming phase that dominates the first 30-60 seconds after a new process launches, which is what determines tail latency during autoscaler scale-out. Steady-state profiles capture production behaviour after warmup. Profiles expose function names and may expose allocation payloads: bind pprof endpoints to localhost, run eBPF profilers with CAP_PERFMON (not CAP_SYS_ADMIN), and RBAC-gate access to the profile backend.

Trademarks belong to their respective owners. Editorial reference only.