Performance
JIT deopt, the fix-and-verify loop, and PR-time profiling
A Node service has a wide leaf frame that the flame graph attributes to the V8 interpreter (InterpreterCallStub), not to TurboFan-compiled code. The function is hot, but the JIT is not optimising it, so every call pays interpreter overhead. Switching to a faster algorithm gains little until the deopt is fixed. The diagnosis is understanding why the JIT bailed.
JIT deoptimisation: a sixth shape
JIT runtimes (V8, JVM HotSpot, .NET, PyPy) compile hot code to native machine code under typed assumptions. If assumptions break — a function receives an unexpected type, a hidden class transitions, a megamorphic call site fans out — the JIT bails to the interpreter or a slower compilation tier.
Signature in the flame graph: the function shows wide, but the wide frame is the interpreter (Interpreter::execute, InterpreterCallStub) or a baseline JIT frame (V8 Sparkplug) instead of the optimised compiler’s frame (V8 TurboFan, HotSpot C2).
Cost: a single deopt is microseconds. A deopt loop (deopt → recompile → deopt) can multiply per-call cost 10–100x silently. Latency spikes that don’t correlate with traffic, periodic pauses without GC running, and baseline-tier frames intermittently dominating the flame graph are all deopt-loop symptoms.
Fix: stabilise types.
- V8: keep hot object shapes to ≤4 hidden classes; no late property addition in JS inside hot loops.
- HotSpot: monitor -XX:+PrintCompilation for repeated deopts; avoid boxing in hot code.
- PyPy: watch jit-summary for guard failures; write type-stable loops.
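The V8 advice above can be illustrated with a minimal sketch. The `Point` type and the function names are hypothetical, invented for this example. The sketch assumes that objects created with the same properties in the same order share a hidden class, which is the usual behaviour but is ultimately an engine implementation detail.

```typescript
// Hypothetical record type, used only for illustration.
interface Point {
  x: number;
  y: number;
  label: string | null;
}

// Shape-unstable: `label` is added late and only on some paths, so objects
// reaching a hot loop can arrive with different property layouts.
function makePointUnstable(x: number, y: number, label?: string): Partial<Point> {
  const p: Partial<Point> = { x, y };
  if (label !== undefined) {
    p.label = label; // late property addition
  }
  return p;
}

// Shape-stable: every field is initialised, in the same order, on every path.
function makePoint(x: number, y: number, label: string | null = null): Point {
  return { x, y, label };
}

// Hot loop: it only ever receives objects built by makePoint.
function sumX(points: Point[]): number {
  let total = 0;
  for (const p of points) {
    total += p.x;
  }
  return total;
}

const stablePoints: Point[] = [makePoint(1, 2), makePoint(3, 4, "b"), makePoint(5, 6)];
const totalX = sumX(stablePoints); // 9
```

The design choice is to pay for one unused `null` field per object in exchange for a single layout at the hot site. Whether that trade is worth it in a given service is something only a re-captured profile can confirm.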
Verification: re-profile and check that the optimised compiler’s frame (TurboFan, C2) is back in the hot stack.
| Runtime | Deopt signal in profile | Diagnosis tool |
|---|---|---|
| V8 (Node.js) | Sparkplug / Interpreter frames instead of TurboFan | --trace-deopt |
| JVM HotSpot | C1 compiled frames instead of C2 | -XX:+PrintCompilation -XX:+TraceDeoptimization |
| .NET RyuJIT | Interpreter / tier-0 frames | PerfView with Tiered JIT counters |
| PyPy | Interpreter frames; jit-summary guard failures | --jit-summary |
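One way to turn the "deopt signal in profile" column into a number is to measure how often a function is sampled underneath a non-optimised tier frame. The sketch below assumes the profile has been exported in folded-stack form (one `frame;frame;leaf count` line per stack). The function name `scoreItem`, the sample data, and the marker strings are illustrative assumptions; the markers reuse frame names from the text, and real profilers may label tiers differently.

```typescript
interface FoldedSample {
  frames: string[]; // root first, leaf last
  count: number;
}

// Parse folded-stack text: "frameA;frameB;leaf 42" per line.
function parseFolded(text: string): FoldedSample[] {
  const samples: FoldedSample[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    const cut = line.lastIndexOf(" ");
    if (cut < 0) continue; // blank or malformed line
    const count = Number(line.slice(cut + 1));
    if (!isFinite(count)) continue;
    samples.push({ frames: line.slice(0, cut).split(";"), count });
  }
  return samples;
}

// Of all samples whose stack contains `fn`, return the fraction whose stack
// also contains a frame matching one of the non-optimised tier markers.
function nonOptimisedShare(samples: FoldedSample[], fn: string, markers: string[]): number {
  let withFn = 0;
  let nonOptimised = 0;
  for (const s of samples) {
    if (s.frames.indexOf(fn) === -1) continue;
    withFn += s.count;
    const hit = s.frames.some((f) => markers.some((m) => f.indexOf(m) !== -1));
    if (hit) nonOptimised += s.count;
  }
  return withFn === 0 ? 0 : nonOptimised / withFn;
}

// Assumed marker strings, taken from the frame names used in the text.
const tierMarkers = ["InterpreterCallStub", "Sparkplug"];

// Hypothetical profiles captured before and after a type-stability fix.
const foldedBefore = parseFolded(`
main;handleRequest;InterpreterCallStub;scoreItem 80
main;handleRequest;scoreItem 20
main;handleRequest;parseBody 50
`);
const foldedAfter = parseFolded(`
main;handleRequest;InterpreterCallStub;scoreItem 5
main;handleRequest;scoreItem 95
main;handleRequest;parseBody 50
`);

const shareBefore = nonOptimisedShare(foldedBefore, "scoreItem", tierMarkers); // 0.8
const shareAfter = nonOptimisedShare(foldedAfter, "scoreItem", tierMarkers); // 0.05
```

A share that stays high after a fix suggests the optimised tier is still not being reached, whatever the local timing says.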
The fix-and-verify loop
Every performance fix has five required steps:
1. Name the hotspot and classify it (one of the six shapes, including JIT deopt).
2. Pick the categorical fix family that matches the classification.
3. Write the fix with no scope creep: only the change predicted in step 2.
4. Capture a profile under the same load and diff against the baseline.
5. Verify both: the local frame shrank AND the headline metric improved (p99, throughput, CPU%, whatever the SLO names).
If the frame shrank but the metric did not move: look at where the time went instead — often a second hotspot is now visible that was masked by the first. This is not failure; it is the next iteration.
If the metric moved but the frame did not shrink: the fix worked through a side effect you did not predict. Investigate; you may have hit something orthogonal. Both outcomes require evidence and drive the next move.
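The verification step and its outcomes can be written down as a small decision function. This is a sketch under stated assumptions: the `ProfileSnapshot` fields, the verdict names, and the 5% minimum relative gain are all hypothetical, and the headline metric is assumed to be one where lower is better (such as p99 latency). The fourth verdict, where neither number moves, is not discussed in the text and is included only for completeness.

```typescript
interface ProfileSnapshot {
  frameShare: number; // fraction of samples in the targeted frame, 0..1
  headlineMetric: number; // e.g. p99 latency in ms; lower is better here
}

type Verdict = "landed" | "masked-hotspot" | "side-effect" | "no-effect";

// Classify a before/after pair captured under the same load.
// minRelativeGain is an assumed noise threshold, not a recommended value.
function verifyFix(
  baseline: ProfileSnapshot,
  after: ProfileSnapshot,
  minRelativeGain: number = 0.05,
): Verdict {
  const keep = 1 - minRelativeGain;
  const frameShrank = after.frameShare <= baseline.frameShare * keep;
  const metricImproved = after.headlineMetric <= baseline.headlineMetric * keep;

  if (frameShrank && metricImproved) return "landed";
  if (frameShrank) return "masked-hotspot"; // look for the next hotspot
  if (metricImproved) return "side-effect"; // investigate the unpredicted cause
  return "no-effect";
}

const verdictExample = verifyFix(
  { frameShare: 0.3, headlineMetric: 120 },
  { frameShare: 0.1, headlineMetric: 119 },
); // "masked-hotspot"
```

Returning a named verdict rather than a boolean keeps the two partial outcomes visible, since each one points to a different next step.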
The loop is the senior performance habit: fix one thing, prove it landed, find the next.
Microbenchmark-driven vs production-profile-driven fixes
A microbenchmark in isolation may say a new algorithm is 5x faster. The production profile may show that algorithm is now 8% of total time instead of 15%, but other paths got slower because the new algorithm allocates more and pushed GC pressure up.
The fix-and-verify loop catches this: capturing the production profile after the change tells you the whole-system effect, not just the local one. Microbenchmark claims are predictions; production profile diffs are the verdict.
Production-grade teams require both: a microbenchmark that shows the local change does what is claimed, AND a production profile diff that shows the system-wide effect is positive. PRs with only one or the other ship regressions that look like wins.
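The gap between a local win and a system-wide result is easier to see in absolute CPU time than in percentages. The sketch below works through the 15% to 8% scenario with invented numbers: the 1000 ms total and the assumption that total CPU stays flat are hypothetical, chosen only to make the arithmetic visible.

```typescript
interface CpuSplit {
  targetMs: number; // CPU time in the function that was changed
  restMs: number; // CPU time everywhere else (other paths, GC, ...)
}

// Convert a flame-graph percentage into absolute CPU time.
function splitCpu(totalCpuMs: number, targetPct: number): CpuSplit {
  const targetMs = (totalCpuMs * targetPct) / 100;
  return { targetMs, restMs: totalCpuMs - targetMs };
}

// Absolute change per bucket between two captures under the same load.
function cpuDelta(before: CpuSplit, after: CpuSplit): { target: number; rest: number; total: number } {
  const target = after.targetMs - before.targetMs;
  const rest = after.restMs - before.restMs;
  return { target, rest, total: target + rest };
}

// Hypothetical captures: total CPU is 1000 ms in both, the function's share
// drops from 15% to 8%.
const cpuBefore = splitCpu(1000, 15); // { targetMs: 150, restMs: 850 }
const cpuAfter = splitCpu(1000, 8); // { targetMs: 80, restMs: 920 }
const cpuChange = cpuDelta(cpuBefore, cpuAfter); // { target: -70, rest: 70, total: 0 }
```

In this made-up case the function saved 70 ms and the rest of the system spent 70 ms more, so the net effect is zero. A microbenchmark would only ever report the first number.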
PR-time vs incident-time profiling
Two modes of applying hot-path methodology:
Incident-time: the service is on fire, on-call catches the hotspot in minutes, fixes, verifies, ships. Reactive mode — same methodology, clock ticking.
PR-time: before release, CI captures the PR’s profile against the main branch baseline and flags regressions before they reach production. Proactive mode — same methodology, no pressure.
Senior teams invest in both: incident-time runbooks for on-call, PR-time CI gates for prevention. Every incident retro adds one rule to the PR-time gate: if the exact regression could have been caught in CI, encode the signature. Over time the PR-time gate catches most regressions before release; incident-time runbooks handle the rest.
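A PR-time gate of this kind can be sketched as a list of rules, each added by a retro, checked against the PR profile and the main-branch baseline. Everything named here is hypothetical: the `GateRule` shape, the rule id, the frame names, and the percentage-point threshold are illustrations, not the format of any particular CI tool.

```typescript
// Frame name -> share of samples in percent, captured under a fixed load.
type FrameShares = { [frame: string]: number | undefined };

interface GateRule {
  id: string; // label for the retro that introduced the rule
  frame: string; // frame name to watch
  maxIncreasePts: number; // allowed growth in percentage points vs main
}

interface GateViolation {
  ruleId: string;
  frame: string;
  mainPct: number;
  prPct: number;
}

// Compare the PR profile with the main baseline and report every rule whose
// watched frame grew by more than the rule allows.
function checkGate(mainProfile: FrameShares, prProfile: FrameShares, rules: GateRule[]): GateViolation[] {
  const violations: GateViolation[] = [];
  for (const rule of rules) {
    const mainPct = mainProfile[rule.frame] ?? 0; // absent frame counts as 0%
    const prPct = prProfile[rule.frame] ?? 0;
    if (prPct - mainPct > rule.maxIncreasePts) {
      violations.push({ ruleId: rule.id, frame: rule.frame, mainPct, prPct });
    }
  }
  return violations;
}

// One rule, as it might be encoded after a deopt incident.
const gateRules: GateRule[] = [
  { id: "retro-deopt-scoreItem", frame: "InterpreterCallStub", maxIncreasePts: 2 },
];

const gateResult = checkGate(
  { InterpreterCallStub: 1, parseBody: 10 },
  { InterpreterCallStub: 9, parseBody: 10 },
  gateRules,
); // one violation: the interpreter frame grew by 8 points
```

Keeping the rules as data means each retro adds one entry rather than new gate logic, which matches the one-rule-per-incident habit described above.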
Why this works
Cross-pollination between incident-time and PR-time is the mechanism that makes performance discipline self-compounding. Each incident retro that encodes a CI rule reduces future on-call load by one class of regression. The mature signature: perf incidents per quarter trending down, not flat. Teams that do not cross-pollinate stay stuck at the “heroic on-call” stage indefinitely.
Order the five steps of the fix-and-verify loop:
1. Name the hotspot and classify it (CPU, alloc, cache, lock, syscall, or JIT deopt)
2. Pick the categorical fix family matching the classification
3. Write only the predicted change, with no scope creep
4. Capture a new profile under the same load and diff against baseline
5. Verify: local frame shrank AND headline metric improved (both required)
A Node flame graph shows InterpreterCallStub frames dominating a function that should be hot. What is the most likely cause and fix?
A microbenchmark shows a new algorithm is 5x faster locally. The production profile diff shows the function dropped from 15% to 8% CPU, but total CPU% is unchanged and p99 is worse. What is the most likely explanation?
1. What are the tell-tale signs of a JIT deopt loop in a flame graph, and what is the fix for V8 specifically?
2. Why must the fix-and-verify loop check BOTH the local frame and the headline metric, and what does each failure mode mean?
JIT deoptimisation is a sixth hotspot shape: the flame graph shows interpreter or baseline-JIT frames where an optimised compiler’s output should appear. The fix is type stabilisation, not algorithmic rewrite. The fix-and-verify loop applies to all six shapes: classify, write one targeted change, capture a diff profile under the same load, verify both local shrinkage and headline improvement. Microbenchmarks are predictions; production diffs are verdicts. PR-time CI gates that encode lessons from incident retros turn reactive performance work into proactive prevention.