False sharing and native-bridge hot paths

Two classes of hot path that defeat naive fixes: false sharing (lock-free code that performs worse than locked) and native-bridge overhead (stub wider than the native function it calls). Both need hardware counters or cross-language profilers to see.

The team spent a week making a counter array lock-free with atomic operations. Under load, it is slower than the locked version was. The flame graph shows updateCounter as a wide frame, and IPC is 0.42. Meanwhile, a Rust crypto library is idle 92% of the time. The Node service calls it 10,000 times per second, yet 40% of CPU is in the N-API stub, not the crypto code.

False sharing happens when multiple threads write to different fields that happen to share the same cache line. The hardware’s MESI coherency protocol treats a cache line as the atomic unit of ownership. When one CPU writes to any byte in a 64-byte line, it acquires exclusive ownership and invalidates the line in every other CPU’s cache. Every other CPU that subsequently reads or writes any byte in that line must re-fetch it through the coherency fabric — at L3 or DRAM latency (~150–300 cycles), not L1 (~5 cycles).

The result: atomic operations that appear non-contending at the code level contend heavily at the hardware level because their data lives on the same cache line.

Signature in profiles

False sharing does not look like lock contention in a standard CPU profile. There is no visible mutex, no blocked thread. Instead:

IPC collapses (typically 0.3–0.6 on affected code, compared to 2–4 for compute-bound code).
Cache-miss rate is extreme (60–80%), even though the data is small and “should” be hot.
The hot function is innocent-looking — an atomic increment, a simple field write.
Performance degrades, rather than improves, as thread count increases.

Hardware counters that expose it

The hardware event MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM (Intel) counts loads that were satisfied by a modified copy in another CPU’s cache, the direct signature of a cache line bouncing between writers. On Linux, perf stat -e cycles,instructions,cache-references,cache-misses, repeated at increasing thread counts, exposes it indirectly: IPC falls and the miss rate climbs as threads are added.

Observation	False sharing suspect	Lock contention suspect
CPU profile width	Wide (CPU is stalled on memory)	Narrow in CPU, wide in off-CPU
IPC	0.3–0.6 (memory-stalled)	Near 0 (thread not running)
Off-CPU profile	Narrow (not waiting on lock)	Wide (futex wait / monitor wait)
Scales with threads	Gets worse (more writers, more bounces)	Gets worse (more waiters)
XSNP_HITM counter	Very high	Low

Fix: cache-line padding

The fix is to ensure each independently-written field occupies its own cache line. On x86, a cache line is 64 bytes; on ARM, 64 or 128 bytes.

// BEFORE: 16 uint64 counters share 2 cache lines (8 per line)
var counters [16]uint64

// AFTER: each counter on its own 64-byte line
type paddedCounter struct {
    value uint64
    _     [56]byte // pad to 64 bytes
}
var counters [16]paddedCounter

In Java, @Contended (sun.misc.Contended in JDK 8, jdk.internal.vm.annotation.Contended from JDK 9) inserts padding automatically; outside the JDK’s own classes it only takes effect with -XX:-RestrictContended. In Rust, crossbeam_utils::CachePadded (re-exported as crossbeam::utils::CachePadded) wraps values. In C++, alignas(64) on struct members does the same job. The Disruptor (Java) and DPDK (C) bake explicit cache-line padding into their core data structures as a non-negotiable invariant.

Debug this

Diagnose a false-sharing regression from perf counter output

log

# perf stat -e cache-references,cache-misses,L1-dcache-load-misses,instructions ./service

 1,250,000,000      cache-references
   950,000,000      cache-misses           # 76% miss rate — extreme
 1,200,000,000      L1-dcache-load-misses  # nearly every L1 access misses
 3,000,000,000      instructions
                    IPC = 0.42             # far below the 2–4 of compute-bound code: mostly memory-stalled

# Profile shows hot leaf:
#   updateCounter(idx int):
#     atomic.AddUint64(&counters[idx], 1)   # supposed lock-free fast path

# counters[] is a flat array of 16 uint64 values, accessed by 16 worker
# goroutines (each goroutine increments its own index).
# CPU is 16-core. Each uint64 is 8 bytes; cache line is 64 bytes.

A lock-free counter array shows IPC 0.42 (memory-stalled) despite using atomic operations and per-thread indices. Cache-miss rate 76%. What's the diagnosis and the fix?

▸Why this works

The Linux kernel’s task_struct, Java’s Disruptor ring buffer, and DPDK’s per-core packet queues all carry explicit cache-line alignment annotations. Senior performance engineers add the same discipline to any struct whose fields are written by multiple CPUs simultaneously. Reviewers should flag struct definitions that pack multiple atomically-written fields tightly.

Native-bridge hot paths: the FFI overhead trap

Ask yourself: if the native function is only 40 ns of work, what happens when the crossing to reach it costs 160 ns? The stub becomes the bottleneck, and a standard single-language profiler will never show it.

Modern runtimes bridge to native code via FFI: Node’s N-API, Java’s JNI, Python’s ctypes / cffi / Cython, Go’s cgo. Each bridge crossing carries fixed overhead:

N-API (Node → native addon): ~50–200 ns per call.
JNI (Java → native): ~100–500 ns per call.
cgo (Go → C): ~200–500 ns per call (includes goroutine stack switch).
Python ctypes: ~1–5 μs per call.

When the bridged function is expensive (milliseconds), this overhead is irrelevant. When the bridged function is cheap (nanoseconds), the bridge stub can dominate.

Signature in a cross-language flame graph

A standard single-language profiler shows only its own stack. A cross-language profile (eBPF, Datadog continuous profiler, or a manually stitched perf + async-profiler capture) shows both stacks. The native-bridge overhead signature in such a profile:

The native function itself is narrow (small self-time).
The bridge stub (runtime.cgocall, JNIEnv::CallStaticVoidMethod, napi_call_function) is wide.

Real-world example

A Node service called a Rust crypto routine via N-API: 10,000 calls per second, each call computing a 32-byte HMAC. The Rust function itself took ~40 ns. The N-API stub added ~160 ns per call — 4x the work. CPU profile: 40% in the stub, 8% in the actual crypto function.

Fix: batch 64 operations per N-API call. The Rust function receives a slice of 64 inputs and returns a slice of 64 outputs. Per-item cost drops from 200 ns (40 ns work + 160 ns stub) to about 42.5 ns (40 ns work + 160 ns stub / 64 items). CPU profile after: 12% crypto function, stub invisible.

Batching 64 ops per crossing amortises the 160 ns N-API stub to 2.5 ns per item, so the 40 ns native work dominates again — the ~4.7x per-item drop the bridge overhead was hiding.
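The amortisation arithmetic can be checked with a few lines of Go. This is a sketch of the cost model only, using the 40 ns and 160 ns figures from the example; the `perItemCost` name is an assumption introduced for illustration, and the model ignores any extra marshalling cost a larger batch might add.

```go
package main

import "fmt"

// perItemCost returns the average cost in nanoseconds of one item when
// batchSize items share a single bridge crossing: the native work is paid
// per item, the stub overhead once per batch.
func perItemCost(nativeNs, stubNs float64, batchSize int) float64 {
	return nativeNs + stubNs/float64(batchSize)
}

func main() {
	const nativeNs, stubNs = 40.0, 160.0 // figures from the example above

	unbatched := perItemCost(nativeNs, stubNs, 1)
	for _, n := range []int{1, 8, 64} {
		c := perItemCost(nativeNs, stubNs, n)
		fmt.Printf("batch=%d per-item=%.1f ns speedup=%.2fx\n", n, c, unbatched/c)
	}
}
```

At a batch size of 64 the model gives 42.5 ns per item, of which only 2.5 ns is bridge overhead.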

FFI	Per-call overhead	Break-even threshold (native work needed to amortise)
N-API (Node)	50–200 ns	~500 ns native work per call
JNI (Java)	100–500 ns	~1 μs native work per call
cgo (Go)	200–500 ns	~2 μs native work per call
ctypes (Python)	1–5 μs	~10 μs native work per call

Fix families for native-bridge overhead:

Batch per crossing — pass a slice of inputs, receive a slice of outputs. Amortise the fixed overhead over N items.
Push the loop into native — instead of calling native N times, call native once with the loop body inside the native function.
Raise the boundary — move the FFI boundary to a coarser operation so fewer crossings happen per unit of work.

Quiz

A lock-free atomic counter array shows IPC 0.4 and 72% cache-miss rate as thread count rises. The correct diagnosis is:

Edge cases where “wider frame = bigger problem” lies

Three situations where the widest leaf is not the right attack target.

1. Sampled-out short hot paths

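A function called 500,000 times per second for 200 ns each runs for 100 ms/s total, or 10% of one CPU. At a standard 100 Hz sampling rate the profiler takes 100 samples per second on that CPU, so on average only about 10 land in this function each second, and any single invocation has a 0.002% chance of being sampled at all (200 ns out of a 10 ms interval). When those calls are spread across many call sites, or the capture window is short, each resulting frame collects 0 or 1 samples, depending on alignment.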

The frame is narrow in the flame graph, but it is a top consumer. Diagnosis: instrument with cheap counters (atomic increments + a Prometheus histogram), or raise sample rate temporarily to 1000 Hz during a dedicated profiling window.

2. Spin-wait dominating the CPU profile

A CPU profile shows a function wide because the program spin-waited inside it — busy-looping until a condition holds. The thread is on CPU, consuming cycles, but doing no real work. The fix is not to optimise the spin’s body; it is to convert the spin into a proper wait (futex, condition variable, channel).

Signature: function body is a tight branch back to itself; IPC is low despite being CPU-bound in the profile; context-switch rate is low (the thread never yields).

3. Symbol resolution failures

A wide [unknown] frame is not a function — it is a stack the profiler cannot resolve. Common causes: JIT-compiled code without perf maps (Node needs --perf-basic-prof; JVM needs -XX:+PreserveFramePointer), stripped DWARF debug info, missing kernel symbols.

Before treating [unknown] as a target, fix the symbol resolution. The underlying function may be the real hot path, hidden by a diagnostic gap.

Order the steps

Order the steps to diagnose and fix a false-sharing regression:

1 Observe: IPC <1, high cache-miss rate, performance worsens with thread count
2 Run perf stat with XSNP_HITM (or cache-misses) to confirm cache-line bouncing
3 Identify which struct fields are written by multiple threads simultaneously
4 Calculate how many fields fit on one 64-byte cache line
5 Pad each independently-written field to occupy a full cache line
6 Re-run perf stat: IPC should rise, cache-miss rate should drop, throughput should increase

Quiz

A Node service calls a native Rust function via N-API 10,000 times/s. The Rust function takes 40 ns. The N-API stub takes 160 ns per call. What is the right fix?

Core 0 writes var A, Core 1 writes var B — different variables, but the same cache line. Each write invalidates the line in the other core's cache (MESI), so every write costs an L3/DRAM re-fetch (~150–300 cycles) instead of L1 (~5).

Recall before you leave

01 Walk through diagnosing false sharing: what does the profile show, which hardware counter confirms it, and what is the fix?
02 Give two concrete examples of hot paths that appear wide in a flame graph but are NOT the right fix target, and explain why.

Recap

False sharing and native-bridge overhead are two senior-level hot-path gotchas invisible to naive profiling. False sharing occurs when threads write to different fields on the same cache line; the MESI protocol serialises the writes at hardware level, collapsing IPC and spiking cache-miss rate despite lock-free code. The fix is cache-line padding. Native-bridge overhead occurs when the FFI stub (N-API, JNI, cgo) costs more than the native function it calls; the fix is batching operations per crossing. Both require hardware counters or cross-language profilers to diagnose. Three edge cases subvert the “widest frame = biggest problem” heuristic: sampled-out short hot paths, spin-waits burning CPU without doing work, and symbol-resolution gaps showing as [unknown]. Now when you see lock-free code underperforming a locked version, you will check IPC and cache-miss rate before blaming atomics, because the hardware may be telling a very different story.
