
Performance

JIT deopt, the fix-and-verify loop, and PR-time profiling

Crux: JIT deopt loops can silently multiply hot-path cost 10–100x. The fix-and-verify loop is the discipline that proves a fix landed. PR-time profiling catches regressions before production.

A Node service has a wide leaf frame that the flame graph attributes to the V8 interpreter (InterpreterCallStub) rather than to TurboFan-optimised code. The function is hot, yet the JIT is not optimising it, so every call pays interpreter overhead. Switching to a faster algorithm buys little until the deopt is fixed. The diagnosis starts with understanding why the JIT bailed out.

JIT deoptimisation: a sixth shape

JIT runtimes (V8, JVM HotSpot, .NET, PyPy) compile hot code to native machine code under speculative type assumptions. If those assumptions break (a function receives an unexpected type, a hidden class transitions, a megamorphic call site fans out), the JIT bails out to the interpreter or a slower compilation tier.

Signature in the flame graph: the function shows wide, but the wide frame is the interpreter (Interpreter::execute, InterpreterCallStub) or a baseline JIT frame (V8 Sparkplug) instead of the optimised compiler’s frame (V8 TurboFan, HotSpot C2).

Cost: a single deopt is microseconds. A deopt loop (deopt → recompile → deopt) can multiply per-call cost 10–100x silently. Latency spikes that don’t correlate with traffic, periodic pauses without GC running, and baseline-tier frames intermittently dominating the flame graph are all deopt-loop symptoms.
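The deopt → recompile → deopt cycle can be made concrete with a small counter. This is a minimal sketch, assuming the optimise and deoptimise events have already been extracted into a hypothetical `{ fn, type }` shape. Real `--trace-deopt` output is free-form text, and parsing it is not shown. The function names and the `minCycles` threshold are illustrative.

```javascript
// Hypothetical event shape: { fn: string, type: 'opt' | 'deopt' }.
// Extracting these events from real trace output is out of scope here.
function findDeoptLoops(events, minCycles = 3) {
  const state = new Map(); // fn -> { optimised: boolean, cycles: number }
  for (const { fn, type } of events) {
    const s = state.get(fn) ?? { optimised: false, cycles: 0 };
    if (type === 'opt') {
      s.optimised = true;
    } else if (type === 'deopt' && s.optimised) {
      // One full optimise -> deoptimise cycle completed.
      s.optimised = false;
      s.cycles += 1;
    }
    state.set(fn, s);
  }
  // Keep only functions that cycled at least minCycles times.
  return [...state]
    .filter(([, s]) => s.cycles >= minCycles)
    .map(([fn, s]) => ({ fn, cycles: s.cycles }));
}

const events = [
  { fn: 'parseRow', type: 'opt' }, { fn: 'parseRow', type: 'deopt' },
  { fn: 'parseRow', type: 'opt' }, { fn: 'parseRow', type: 'deopt' },
  { fn: 'parseRow', type: 'opt' }, { fn: 'parseRow', type: 'deopt' },
  { fn: 'render', type: 'opt' }, { fn: 'render', type: 'deopt' },
];
console.log(findDeoptLoops(events));
```

A function that deoptimises once is normal warm-up behaviour. The repeated cycle is what marks a loop, which is why the sketch counts completed cycles rather than raw deopt events.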

Fix: stabilise types.

  • V8: keep hot object shapes to ≤4 hidden classes; no late property addition in JS inside hot loops.
  • HotSpot: monitor -XX:+PrintCompilation for repeated deopts; avoid boxing in hot code.
  • PyPy: watch jit-summary for guard failures; write type-stable loops.
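For the V8 case, type stabilisation often comes down to how objects are constructed. The sketch below contrasts a constructor whose property set depends on its input with one that always produces the same properties in the same order. All function and field names are hypothetical. Plain JavaScript cannot observe hidden classes, so the code shows only the pattern, not the engine's internal state.

```javascript
// Unstable: the property set depends on the input, so objects reaching the
// hot loop can arrive with several different shapes.
function makeOrderUnstable(id, price, discount) {
  const order = { id };
  if (discount !== undefined) order.discount = discount; // sometimes present
  order.price = price; // added late, after a conditional property
  return order;
}

// Stable: every object gets the same properties in the same order.
function makeOrderStable(id, price, discount) {
  return { id, price, discount: discount ?? 0 };
}

// Hot-path consumer; it returns the same result for both constructors.
function total(orders) {
  let sum = 0;
  for (const o of orders) sum += o.price - (o.discount ?? 0);
  return sum;
}

const inputs = [[1, 100, undefined], [2, 50, 5], [3, 20, undefined]];
const unstable = inputs.map((args) => makeOrderUnstable(...args));
const stable = inputs.map((args) => makeOrderStable(...args));
console.log(total(unstable), total(stable));
```

Both versions compute the same total. The difference is that the stable constructor gives the loop one consistent object layout to specialise on.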

Verification: re-profile and check that the optimised compiler’s frame (TurboFan, C2) is back in the hot stack.

| Runtime | Deopt signal in profile | Diagnosis tool |
|---|---|---|
| V8 (Node.js) | Sparkplug / Interpreter frames instead of TurboFan | --trace-deopt |
| JVM HotSpot | C1 compiled frames instead of C2 | -XX:+PrintCompilation -XX:+TraceDeoptimization |
| .NET RyuJIT | Interpreter / tier-0 frames | PerfView with Tiered JIT counters |
| PyPy | Interpreter frames; jit-summary guard failures | jit-summary log |

The fix-and-verify loop

Every performance fix has five required steps:

  1. Name the hotspot and classify it (one of the six shapes including JIT deopt).
  2. Pick the categorical fix family that matches the classification.
  3. Write the fix with no scope creep — only the change predicted in step 2.
  4. Capture a profile under the same load and diff against the baseline.
  5. Verify both: the local frame shrank AND the headline metric improved (p99, throughput, CPU%, whatever the SLO names).

If the frame shrank but the metric did not move: look at where the time went instead — often a second hotspot is now visible that was masked by the first. This is not failure; it is the next iteration.

If the metric moved but the frame did not shrink: the fix worked through a side effect you did not predict. Investigate; you may have hit something orthogonal. Both outcomes require evidence and drive the next move.
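The step 5 check and its two partial outcomes can be written as a small decision function. This is a sketch under assumptions: the measurement shape (`framePct`, `metric`), the thresholds, and the outcome labels are all hypothetical, and the headline metric is assumed to be one where lower is better, such as p99 latency. The `no-effect` branch is added for completeness and is not discussed in the text above.

```javascript
// Hypothetical measurement shape:
//   framePct: share of samples in the targeted frame, in percent
//   metric:   headline metric where lower is better (e.g. p99 in ms)
function classifyOutcome(before, after, opts = {}) {
  const { minFrameDrop = 1, minMetricGain = 0.02 } = opts;
  const frameShrank = before.framePct - after.framePct >= minFrameDrop;
  const metricImproved =
    (before.metric - after.metric) / before.metric >= minMetricGain;

  if (frameShrank && metricImproved) return 'landed';
  if (frameShrank) return 'frame-only'; // look for the next hotspot
  if (metricImproved) return 'metric-only'; // investigate the side effect
  return 'no-effect';
}

console.log(classifyOutcome({ framePct: 15, metric: 200 }, { framePct: 8, metric: 170 }));
console.log(classifyOutcome({ framePct: 15, metric: 200 }, { framePct: 8, metric: 200 }));
console.log(classifyOutcome({ framePct: 15, metric: 200 }, { framePct: 15, metric: 170 }));
```

The thresholds exist so that run-to-run noise is not read as a result. Suitable values depend on how stable the load and the profiler are.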

The loop is the senior performance habit: fix one thing, prove it landed, find the next.

Microbenchmark-driven vs production-profile-driven fixes

A microbenchmark in isolation may say a new algorithm is 5x faster. The production profile may show that algorithm is now 8% of total time instead of 15%, but other paths got slower because the new algorithm allocates more and pushed GC pressure up.

The fix-and-verify loop catches this: capturing the production profile after the change tells you the whole-system effect, not just the local one. Microbenchmark claims are predictions; production profile diffs are the verdict.

Production-grade teams require both: a microbenchmark that shows the local change does what is claimed, AND a production profile diff that shows the system-wide effect is positive. PRs with only one or the other ship regressions that look like wins.
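The scenario above, where a local win is absorbed elsewhere, shows up clearly in a profile diff. The sketch below assumes a hypothetical profile shape that maps each frame name to its share of total CPU samples. The frame names and numbers are invented to mirror the 15% to 8% example and are not measurements.

```javascript
// Hypothetical profile shape: frame name -> percent of total CPU samples.
function diffProfiles(baseline, candidate) {
  const frames = new Set([...Object.keys(baseline), ...Object.keys(candidate)]);
  return [...frames]
    .map((frame) => ({
      frame,
      delta: (candidate[frame] ?? 0) - (baseline[frame] ?? 0),
    }))
    .sort((a, b) => b.delta - a.delta); // largest growth first
}

// Illustrative numbers only.
const baseline = { rankItems: 15, gc: 10, serialize: 20 };
const candidate = { rankItems: 8, gc: 17, serialize: 20 };
const diff = diffProfiles(baseline, candidate);
console.log(diff);
```

Sorting by growth puts the frame that absorbed the savings at the top. In this invented data that frame is `gc`, which is the kind of whole-system effect a microbenchmark cannot show.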

PR-time vs incident-time profiling

Two modes of applying hot-path methodology:

Incident-time: the service is on fire, on-call catches the hotspot in minutes, fixes, verifies, ships. Reactive mode — same methodology, clock ticking.

PR-time: before release, CI captures the PR’s profile against the main branch baseline and flags regressions before they reach production. Proactive mode — same methodology, no pressure.

Senior teams invest in both: incident-time runbooks for on-call, PR-time CI gates for prevention. Every incident retro adds one rule to the PR-time gate: if the exact regression could have been caught in CI, encode the signature. Over time the PR-time gate catches most regressions before release; incident-time runbooks handle the rest.
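A PR-time gate of this kind can be sketched as a list of rules, each one encoding a lesson from an incident retro. Everything here is an assumption made for illustration: the profile shape (frame name mapped to percent of samples), the rule format, the thresholds, and the sample frames. Capturing the two profiles in CI is not shown.

```javascript
// Hypothetical rule shape: each check returns the frames that violate it.
const rules = [
  {
    name: 'no frame grows by more than 2 points',
    check: (base, pr) =>
      Object.keys(pr).filter((f) => pr[f] - (base[f] ?? 0) > 2),
  },
  {
    // Signature added after a (hypothetical) deopt-loop incident.
    name: 'interpreter share stays under 5%',
    check: (_base, pr) =>
      Object.keys(pr).filter((f) => f.startsWith('Interpreter') && pr[f] >= 5),
  },
];

function runGate(baseline, prProfile, gateRules = rules) {
  const failures = [];
  for (const rule of gateRules) {
    for (const frame of rule.check(baseline, prProfile)) {
      failures.push({ rule: rule.name, frame });
    }
  }
  return { pass: failures.length === 0, failures };
}

const mainProfile = { handleRequest: 30, parseRow: 12, InterpreterCallStub: 1 };
const prProfile = { handleRequest: 31, parseRow: 12, InterpreterCallStub: 9 };
const result = runGate(mainProfile, prProfile);
console.log(result);
// In CI, a failing gate would typically set a non-zero exit code:
// if (!result.pass) process.exitCode = 1;
```

Keeping rules as data means a retro can add a signature by appending one entry, without touching the gate logic.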

Why this works

Cross-pollination between incident-time and PR-time is the mechanism that makes performance discipline self-compounding. Each incident retro that encodes a CI rule removes one class of regression from future on-call load. The mature signature: perf incidents per quarter trending down, not flat. Teams that do not cross-pollinate stay in the “heroic on-call” stage indefinitely.

Order the steps

Order the five steps of the fix-and-verify loop:

  1. Name the hotspot and classify it (CPU, alloc, cache, lock, syscall, or JIT deopt)
  2. Pick the categorical fix family matching the classification
  3. Write only the predicted change, with no scope creep
  4. Capture a new profile under the same load and diff against baseline
  5. Verify: local frame shrank AND headline metric improved (both required)

Quiz

A Node flame graph shows InterpreterCallStub frames dominating a function that should be hot. What is the most likely cause and fix?

Quiz

A microbenchmark shows a new algorithm is 5x faster locally. The production profile diff shows the function dropped from 15% to 8% CPU, but total CPU% is unchanged and p99 is worse. What is the most likely explanation?

Recall before you leave

  1. What are the tell-tale signs of a JIT deopt loop in a flame graph, and what is the fix for V8 specifically?
  2. Why must the fix-and-verify loop check BOTH the local frame and the headline metric, and what does each failure mode mean?

Recap

JIT deoptimisation is a sixth hotspot shape: the flamegraph shows interpreter or baseline-JIT frames where an optimised compiler’s output should appear. The fix is type stabilisation, not algorithmic rewrite. The fix-and-verify loop applies to all six shapes: classify, write one targeted change, capture a diff profile under the same load, verify both local shrinkage and headline improvement. Microbenchmarks are predictions; production diffs are verdicts. PR-time CI gates that encode lessons from incident retros turn reactive performance work into proactive prevention.


Trademarks belong to their respective owners. Editorial reference only.