Performance
Hardware counters, cold-start profiles, and profile security
A JSON parser appears as 30% of a service’s CPU in the flame graph. The team switches to a faster parser and saves 10%. An engineer then runs perf with cache-miss counters: IPC is 0.4 and the cache-miss rate is 12%. The parser is memory-stalled, not compute-bound. Restructuring the input data layout saves 50%.
Hardware performance counters (HPCs)
A flame graph names a function. It does not tell you what the CPU is doing inside that function. Hardware performance counters expose the silicon-level cost.
Key counters:
- cycles: raw CPU cycles consumed
- instructions: instructions retired. IPC = instructions / cycles.
- cache-misses: last-level cache misses, typically L3 (each miss stalls for roughly 100 ns while data comes from RAM)
- branch-misses: branch misprediction count (each miss costs roughly 15 cycles)
- page-faults: OS page fault count (a kernel software event that perf reports alongside the hardware counters)
- dTLB-load-misses: data TLB (translation lookaside buffer) load misses
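The per-event costs in this list allow a back-of-the-envelope estimate of how much of a run could be stall time. The sketch below is illustrative only: the counts and the 3 GHz clock are hypothetical, the function name is made up for this example, and because out-of-order execution overlaps many stalls, the result is best read as a ceiling rather than a measurement.

```python
# Rough per-event costs taken from the counter list; real values vary by CPU.
CACHE_MISS_NS = 100.0       # ~100 ns per last-level cache miss
BRANCH_MISS_CYCLES = 15.0   # ~15 cycles per branch misprediction


def estimated_stall_fraction(cycles, cache_misses, branch_misses, clock_ghz):
    """Return (cache_fraction, branch_fraction) of total cycles, as upper bounds.

    clock_ghz is an assumed fixed clock rate used to convert ns into cycles.
    """
    cache_stall_cycles = cache_misses * CACHE_MISS_NS * clock_ghz  # ns * cycles/ns
    branch_stall_cycles = branch_misses * BRANCH_MISS_CYCLES
    return (min(cache_stall_cycles / cycles, 1.0),
            min(branch_stall_cycles / cycles, 1.0))


# Hypothetical counts for one second of execution on a 3 GHz core.
cache_frac, branch_frac = estimated_stall_fraction(
    cycles=3_000_000_000,
    cache_misses=6_000_000,
    branch_misses=10_000_000,
    clock_ghz=3.0,
)
print(f"cache stalls <= {cache_frac:.0%}, branch stalls <= {branch_frac:.0%}")
```

With these made-up numbers, cache misses could account for up to 60% of cycles and branch misses for up to 5%, which is the kind of gap that points toward data layout rather than branch behaviour.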
Interpreting IPC:
- IPC < 1.0: likely memory-bound. The CPU spends most cycles stalled waiting for data from cache or RAM. A faster algorithm over the same data layout rarely helps; data-layout fixes (struct-of-arrays, cache-friendly traversal, prefetching) are the lever. Check branch-misses as well, since heavy misprediction also lowers IPC.
- IPC 1.0–2.5: mixed. Investigate specific misses.
- IPC > 2.5: compute-bound. The algorithm is doing useful work; vectorisation or smarter math is the lever.
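To see why data layout is the lever for memory-bound code, it can help to count cache lines. The sketch below estimates how many lines a scan over one field touches under array-of-structs versus struct-of-arrays. The 64-byte line size, the record sizes, and the function name are assumptions for illustration, and the estimate ignores alignment and hardware prefetching.

```python
import math

CACHE_LINE_BYTES = 64  # common on x86-64 and many Arm cores; an assumption here


def cache_lines_touched(n_records, stride_bytes):
    """Estimate cache lines read when scanning one field of every record.

    stride_bytes is the distance between consecutive values of that field:
    the full record size for array-of-structs, the field size for
    struct-of-arrays.
    """
    if stride_bytes >= CACHE_LINE_BYTES:
        return n_records  # each value sits on its own cache line
    return math.ceil(n_records * stride_bytes / CACHE_LINE_BYTES)


n = 1_000_000
aos = cache_lines_touched(n, stride_bytes=64)  # 64-byte records, one 8-byte field read
soa = cache_lines_touched(n, stride_bytes=8)   # the same field stored contiguously
print(aos, soa, aos // soa)
```

Under these assumptions the struct-of-arrays scan touches one eighth of the cache lines, with no change to the algorithm itself.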
Usage on Linux:
# Profile cycles, instructions, cache-misses, branch-misses together
perf record -e cycles,instructions,cache-references,cache-misses,branch-misses \
-g ./myapp workload
perf report # shows per-function counter breakdowns

| Signal | Diagnosis | Fix direction |
|---|---|---|
| IPC < 1.0 + high cache-miss | Memory-stalled: CPU waits for RAM | Data layout, prefetch, smaller structs |
| IPC > 2.5 + low cache-miss | Compute-bound: algorithm is the limit | Vectorisation, SIMD, smarter algorithm |
| High branch-misses | Branch predictor failing on irregular data | Branchless code, sorted input, lookup tables |
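The table's rules can be sketched as a small script over machine-readable counter output. The example assumes the CSV layout that `perf stat -x,` emits (value, unit, event name, then run-time fields); event names can carry suffixes such as `:u` on some setups, which this sketch does not handle. The sample numbers are invented, and the thresholds for a "high" miss rate are assumptions, since the table does not define them.

```python
import csv
import io

# Illustrative `perf stat -x, -e ...` output; the numbers are made up.
SAMPLE = """\
5000000000,,cycles,1000000000,100.00,,
2000000000,,instructions,1000000000,100.00,0.40,insn per cycle
400000000,,cache-references,1000000000,100.00,,
48000000,,cache-misses,1000000000,100.00,12.00,of all cache refs
300000000,,branches,1000000000,100.00,,
3000000,,branch-misses,1000000000,100.00,1.00,of all branches
"""


def parse_perf_stat(text):
    """Map event name -> count, skipping rows without a numeric count."""
    counters = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or not row[0].isdigit():
            continue  # blank lines or "<not counted>" / "<not supported>" rows
        counters[row[2]] = int(row[0])
    return counters


def diagnose(c, high_cache_miss=0.10, high_branch_miss=0.05):
    """Apply the signal table; both thresholds are assumed, not standard values."""
    ipc = c["instructions"] / c["cycles"]
    cache_miss_rate = c["cache-misses"] / c["cache-references"]
    branch_miss_rate = c["branch-misses"] / c["branches"]
    findings = []
    if ipc < 1.0 and cache_miss_rate >= high_cache_miss:
        findings.append("memory-stalled: data layout, prefetch, smaller structs")
    if ipc > 2.5 and cache_miss_rate < high_cache_miss:
        findings.append("compute-bound: vectorisation, SIMD, smarter algorithm")
    if branch_miss_rate >= high_branch_miss:
        findings.append("branch-heavy: branchless code, sorted input, lookup tables")
    return ipc, cache_miss_rate, findings or ["mixed: investigate specific misses"]


ipc, miss_rate, findings = diagnose(parse_perf_stat(SAMPLE))
print(f"IPC={ipc:.2f} cache-miss rate={miss_rate:.0%} -> {findings}")
```

The sample reproduces the opening scenario: IPC 0.4 with a 12% cache-miss rate lands in the memory-stalled row.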
Cold-start vs steady-state profiles
A profile of the first 60 seconds after process start looks nothing like a profile after an hour of traffic.
Cold-start phase:
- JIT runtimes compile hot code paths (HotSpot, V8, .NET CLR) — compilation shows up as CPU cost.
- Caches are cold: connection pools establishing, lazy-loaded modules loading, L3 cache empty.
- Optimisations: AOT compilation (GraalVM native-image, .NET ReadyToRun), eager module loading, connection pre-warming.
Steady-state phase:
- JIT is fully optimised; caches are warm.
- Optimisations: algorithmic fixes, data-layout changes, lock reduction.
Confusing the two is a common failure: a team optimises the steady-state hotspot and is surprised when autoscaler scale-out events still degrade tail latency — the cold-start path was never measured.
Production-grade profiling captures both: a cold-start profile (first 30-60 seconds post-launch) and a steady-state profile (after warmup, under representative load). Maintain a separate dashboard for each phase.
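One way to keep the two phases apart is to bucket stack samples by their age relative to process start. The sketch below assumes samples arrive as hypothetical `(timestamp_seconds, function_name)` pairs and uses a 60-second warmup window; both the shape of the data and the window are assumptions to be tuned per service.

```python
from collections import Counter


def split_by_phase(samples, process_start, warmup_seconds=60.0):
    """Count samples per function, separately for cold start and steady state."""
    phases = {"cold_start": Counter(), "steady_state": Counter()}
    for ts, func in samples:
        in_warmup = ts - process_start < warmup_seconds
        phases["cold_start" if in_warmup else "steady_state"][func] += 1
    return phases


# Invented samples: the process starts at t=1000 s.
samples = [
    (1000.5, "jit_compile"),
    (1002.0, "jit_compile"),
    (1010.0, "open_db_connection"),
    (1030.0, "parse_json"),
    (4600.0, "parse_json"),
    (4601.0, "parse_json"),
    (4602.0, "acquire_lock"),
]
phases = split_by_phase(samples, process_start=1000.0)
print(phases["cold_start"].most_common(1))    # [('jit_compile', 2)]
print(phases["steady_state"].most_common(1))  # [('parse_json', 2)]
```

The top function differs between the two buckets, which is the point: a single merged profile would hide the cold-start hotspot behind the much larger steady-state sample count.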
Profile security
A profile contains function names — often including private internal APIs, undocumented endpoints, and build paths revealing the deploy environment. Memory profiles can include allocation arguments (string contents, JSON bodies) when poorly configured.
Real incidents: pprof endpoints accidentally exposed via /debug/pprof on a public port, leaking source paths and feature flag names. Allocation profilers leaking session tokens from query strings.
Production discipline:
- pprof endpoints bound to localhost, or served only on an authenticated, admin-only path.
- eBPF-based profilers run with minimal capabilities (CAP_PERFMON on Linux 5.8+, not CAP_SYS_ADMIN).
- Continuous-profile backends RBAC-gated by team.
- Profile exports require manager approval.
Profiles are operational data with security implications, not “ops-only artefacts safe to share.”
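The localhost rule is easy to check mechanically, for example in a config lint step. The sketch below is a minimal version, assuming listen addresses are written as `host:port` strings; the function name is hypothetical, and treating the name `localhost` as loopback assumes it resolves that way on the host.

```python
import ipaddress


def is_loopback_bind(addr):
    """True if a 'host:port' listen address is reachable only from the local machine."""
    host, _, _port = addr.rpartition(":")
    host = host.strip("[]")  # "[::1]:6060" -> "::1"
    if host == "localhost":
        return True  # assumes localhost resolves to a loopback address
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Empty host (":6060" listens on every interface) or another hostname.
        return False


for addr in ["127.0.0.1:6060", "[::1]:6060", "0.0.0.0:6060", ":6060"]:
    print(addr, is_loopback_bind(addr))
```

An address with an empty host, such as `:6060`, is rejected because it listens on all interfaces, which is how debug endpoints typically end up on a public port.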
Why this works
Linux 5.8 (2020) introduced CAP_PERFMON, a dedicated capability that separates performance monitoring from CAP_SYS_ADMIN, so that profiling tools can run without full system administration access. On multi-tenant Kubernetes clusters, eBPF profilers should run with CAP_PERFMON rather than CAP_SYS_ADMIN, and be scoped to a single namespace, to prevent tenants from seeing each other's stack frames.
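Whether a running profiler actually follows this rule can be verified from the `CapEff` line of `/proc/<pid>/status`, which holds the effective capability set as a hex bitmask. The sketch below decodes that mask; the capability bit numbers come from `linux/capability.h`, while the sample status text and the helper names are made up for illustration.

```python
# Capability bit numbers as defined in linux/capability.h.
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38
CAP_BPF = 39


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    return bool((mask >> cap) & 1)


def is_least_privilege(mask):
    """CAP_PERFMON present and CAP_SYS_ADMIN absent."""
    return has_cap(mask, CAP_PERFMON) and not has_cap(mask, CAP_SYS_ADMIN)


# Invented examples: a profiler holding CAP_PERFMON + CAP_BPF, and a process
# holding every capability.
profiler_status = "Name:\tprofiler\nCapEff:\t000000c000000000\n"
root_status = "Name:\troot-tool\nCapEff:\t000001ffffffffff\n"

print(is_least_privilege(parse_cap_eff(profiler_status)))  # True
print(is_least_privilege(parse_cap_eff(root_status)))      # False
```

On a live system the input would come from reading `/proc/<pid>/status` for the profiler's process ID.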
A flame graph shows a JSON deserialisation function consuming 35% of CPU. Hardware counters show IPC = 0.4 and cache-miss rate = 11%. What kind of fix is most likely to help?
A team optimises the steady-state CPU hotspot. After deploy, scale-out events still cause high tail latency for 60 seconds. What measurement did they miss?
- 01 Explain why hardware performance counters are necessary alongside stack-sampling profilers, and walk through a concrete diagnosis scenario where the flame graph alone would mislead.
- 02 What are the production security constraints for running profilers, and what is the minimum-capability principle?
Stack-sampling profilers name the hot function; hardware performance counters name why it is hot. IPC below 1.0 with high cache-miss rate identifies memory-stalled code where data-layout fixes (smaller structs, cache-friendly traversal) outperform algorithmic rewrites. IPC above 2.5 identifies compute-bound code where vectorisation or algorithm improvements are the lever. Cold-start profiles capture the JIT compilation and cache-warm phase that dominates the first 30-60 seconds after a new process launches — critical for autoscaler scale-out correctness. Steady-state profiles capture production behaviour after warmup. Profiles expose function names and may expose allocation payloads; gate pprof endpoints on localhost, run eBPF profilers with CAP_PERFMON (not CAP_SYS_ADMIN), and RBAC-gate profile backend access.