Performance
False sharing and native-bridge hot paths
The team spent a week making a counter array lock-free with atomic operations. Under load, it’s slower than the locked version was. The flame graph shows updateCounter wide, IPC 0.42. Meanwhile, a Rust crypto library is idle 92% of the time. The Node service calls it 10,000 times per second — and 40% of CPU is in the N-API stub, not the crypto code.
False sharing: when “lock-free” is slower than locked
False sharing happens when multiple threads write to different fields that happen to share the same cache line. The hardware’s MESI coherency protocol treats a cache line as the atomic unit of ownership. When one CPU writes to any byte in a 64-byte line, it acquires exclusive ownership and invalidates the line in every other CPU’s cache. Every other CPU that subsequently reads or writes any byte in that line must re-fetch it through the coherency fabric — at L3 or DRAM latency (~150–300 cycles), not L1 (~5 cycles).
The result: atomic operations that appear non-contending at the code level contend heavily at the hardware level because their data lives on the same cache line.
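To make the "same cache line" condition concrete, here is a minimal sketch that maps array elements to cache lines. It assumes a 64-byte line and an array whose base address is 64-byte aligned (Go does not guarantee that alignment); `cacheLineIndex` and `sharers` are hypothetical helper names, not part of any library.

```go
package main

import "fmt"

// cacheLineSize is an assumption: 64 bytes, as on typical x86-64 parts.
const cacheLineSize = 64

// cacheLineIndex returns which cache line a byte offset falls on,
// relative to a base address assumed to be 64-byte aligned.
func cacheLineIndex(byteOffset int) int {
	return byteOffset / cacheLineSize
}

// sharers groups the element indices of an array by the cache line
// each element lands on. Elements in the same group bounce the same
// line between cores when different threads write to them.
func sharers(elemCount, elemSize int) map[int][]int {
	lines := make(map[int][]int)
	for i := 0; i < elemCount; i++ {
		line := cacheLineIndex(i * elemSize)
		lines[line] = append(lines[line], i)
	}
	return lines
}

func main() {
	// 16 counters of 8 bytes each, as in a [16]uint64 array.
	lines := sharers(16, 8)
	for line := 0; line < len(lines); line++ {
		fmt.Printf("line %d holds counters %v\n", line, lines[line])
	}
}
```

Under these assumptions, counters 0 to 7 land on one line and counters 8 to 15 on the next, so eight writers contend for each line even though no two of them touch the same variable.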
Signature in profiles
False sharing does not look like lock contention in a standard CPU profile. There is no visible mutex, no blocked thread. Instead:
- IPC collapses (typically 0.3–0.6 on affected code, compared to 2–4 for compute-bound code).
- Cache-miss rate is extreme (60–80%), even though the data is small and “should” be hot.
- The hot function is innocent-looking — an atomic increment, a simple field write.
- Performance degrades as thread count increases, not improves.
Hardware counters that expose it
The hardware event MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM (Intel) counts loads that were satisfied by a modified copy in another CPU’s cache — a direct false-sharing signal. On Linux, perf stat -e cache-references,cache-misses,instructions paired with scaling the thread count exposes it indirectly.
| Observation | False sharing suspect | Lock contention suspect |
|---|---|---|
| CPU profile width | Wide (CPU is stalled on memory) | Narrow in CPU, wide in off-CPU |
| IPC | 0.3–0.6 (memory-stalled) | Near 0 (thread not running) |
| Off-CPU profile | Narrow (not waiting on lock) | Wide (futex wait / monitor wait) |
| Scales with threads | Gets worse (more writers, more bounces) | Gets worse (more waiters) |
| XSNP_HITM counter | Very high | Low |
Fix: cache-line padding
The fix is to ensure each independently-written field occupies its own cache line. On x86, a cache line is 64 bytes; on ARM, 64 or 128 bytes.
```go
// BEFORE: 16 uint64 counters share 2 cache lines (8 per line)
var counters [16]uint64

// AFTER: each counter on its own 64-byte line
type paddedCounter struct {
	value uint64
	_     [56]byte // pad to 64 bytes
}

var counters [16]paddedCounter
```

In Java, @Contended (sun.misc.Contended in Java 8, jdk.internal.vm.annotation.Contended from Java 9) inserts padding automatically; outside the JDK's own classes it only takes effect with -XX:-RestrictContended. In Rust, crossbeam::utils::CachePadded wraps values. In C++, apply alignas(64) to struct fields. The Disruptor (Java) and DPDK (C) bake explicit cache-line padding into their core data structures as a non-negotiable invariant.
Diagnose a false-sharing regression from perf counter output
```
# perf stat -e cache-references,cache-misses,L1-dcache-load-misses,instructions ./service
  1,250,000,000  cache-references
    950,000,000  cache-misses            # 76% miss rate (extreme)
  1,200,000,000  L1-dcache-load-misses   # about 0.4 L1 load misses per instruction
  3,000,000,000  instructions
IPC = 0.42                               # far below the 2–4 of compute-bound code

# Profile shows hot leaf:
#   updateCounter(idx int):
#     atomic.AddUint64(&counters[idx], 1)   # supposed lock-free fast path
# counters[] is a flat array of 16 uint64 values, accessed by 16 worker
# goroutines (each goroutine increments its own index).
# CPU is 16-core. Each uint64 is 8 bytes; cache line is 64 bytes.
```

A lock-free counter array shows IPC 0.42 (memory-stalled) despite using atomic operations and per-thread indices. The cache-miss rate is 76%. What is the diagnosis, and what is the fix?
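The arithmetic behind this exercise can be checked mechanically. The sketch below recomputes the miss rate from the two counter values shown and works out how many 8-byte counters share a 64-byte line; the function names are hypothetical and the inputs are copied from the output above.

```go
package main

import "fmt"

// missRate is cache-misses divided by cache-references.
func missRate(misses, references uint64) float64 {
	return float64(misses) / float64(references)
}

// countersPerLine is how many fixed-size counters fit on one cache line.
func countersPerLine(lineBytes, counterBytes int) int {
	return lineBytes / counterBytes
}

// linesNeeded is how many cache lines n tightly packed counters span,
// assuming the array starts on a line boundary.
func linesNeeded(n, perLine int) int {
	return (n + perLine - 1) / perLine
}

func main() {
	rate := missRate(950_000_000, 1_250_000_000)
	perLine := countersPerLine(64, 8)

	fmt.Printf("cache-miss rate: %.0f%%\n", rate*100)
	fmt.Printf("counters per line: %d\n", perLine)
	fmt.Printf("lines spanned by 16 counters: %d\n", linesNeeded(16, perLine))
}
```

With 16 writers spread over only 2 lines, each line has 8 cores competing for exclusive ownership, which is consistent with the false-sharing signature described earlier and points to cache-line padding as the fix.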
Why this works
The Linux kernel’s task_struct, Java’s Disruptor ring buffer, and DPDK’s per-core packet queues all carry explicit cache-line alignment annotations. Senior performance engineers add the same discipline to any struct whose fields are written by multiple CPUs simultaneously. Reviewers should flag struct definitions that pack multiple atomically-written fields tightly.
Native-bridge hot paths: the FFI overhead trap
Modern runtimes bridge to native code via FFI: Node’s N-API, Java’s JNI, Python’s ctypes / cffi / Cython, Go’s cgo. Each bridge crossing carries fixed overhead:
- N-API (Node → native addon): ~50–200 ns per call.
- JNI (Java → native): ~100–500 ns per call.
- cgo (Go → C): ~200–500 ns per call (includes goroutine stack switch).
- Python ctypes: ~1–5 μs per call.
When the bridged function is expensive (milliseconds), this overhead is irrelevant. When the bridged function is cheap (nanoseconds), the bridge stub can dominate.
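A quick way to see when the stub dominates is to compute its share of each call. The sketch below is plain arithmetic with a hypothetical helper, `stubShare`; the 200 ns overhead is an illustrative value taken from the upper end of the N-API range listed above, not a measurement.

```go
package main

import "fmt"

// stubShare returns the fraction of one call's time spent in the
// bridge stub, given the fixed crossing overhead and the native work.
func stubShare(overheadNs, nativeWorkNs float64) float64 {
	return overheadNs / (overheadNs + nativeWorkNs)
}

func main() {
	// Expensive native function: 1 ms of work behind a 200 ns crossing.
	fmt.Printf("1 ms of work:  stub share %.4f\n", stubShare(200, 1_000_000))

	// Cheap native function: 40 ns of work behind the same crossing.
	fmt.Printf("40 ns of work: stub share %.2f\n", stubShare(200, 40))
}
```

In the first case the stub is a rounding error; in the second it accounts for most of the call, which is the situation the rest of this section addresses.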
Signature in a cross-language flame graph
A standard single-language profiler shows only its own stack. A cross-language profile (eBPF, Datadog continuous profiler, or a manually stitched perf + async-profiler capture) shows both stacks. The bridge-overhead signature from the profiler’s perspective:
- The native function itself is narrow (small self-time).
- The bridge stub (Cgo_runtime_cgocall, JNIEnv::CallStaticVoidMethod, napi_call_function) is wide.
Real-world example
A Node service called a Rust crypto routine via N-API: 10,000 calls per second, each call computing a 32-byte HMAC. The Rust function itself took ~40 ns. The N-API stub added ~160 ns per call — 4x the work. CPU profile: 40% in the stub, 8% in the actual crypto function.
Fix: batch 64 operations per N-API call. The Rust function receives a slice of 64 inputs and returns a slice of 64 outputs. Per-item cost drops from 200 ns (40 ns of work + 160 ns of stub) to about 43 ns (40 ns of work + 160 ns / 64 ≈ 2.5 ns of stub). CPU profile after: 12% crypto function, stub invisible.
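The amortisation can be written as a one-line formula: per-item cost is the native work plus the stub cost divided by the batch size. The sketch below applies it to the figures from this example (40 ns of work, 160 ns of stub); `perItemCostNs` is a hypothetical helper, and the model assumes the stub cost does not grow with batch size.

```go
package main

import "fmt"

// perItemCostNs models the cost of one item when `batch` items share
// a single bridge crossing. It assumes the stub cost is fixed per
// crossing and independent of the batch size.
func perItemCostNs(workNs, stubNs float64, batch int) float64 {
	return workNs + stubNs/float64(batch)
}

func main() {
	for _, batch := range []int{1, 8, 64} {
		fmt.Printf("batch %2d: %.1f ns per item\n",
			batch, perItemCostNs(40, 160, batch))
	}
}
```

Most of the gain arrives early: a batch of 8 already brings the per-item cost to 60 ns, and larger batches mainly trade latency for the last few nanoseconds.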
| FFI | Per-call overhead | Break-even threshold (native work needed to amortise) |
|---|---|---|
| N-API (Node) | 50–200 ns | ~500 ns native work per call |
| JNI (Java) | 100–500 ns | ~1 μs native work per call |
| cgo (Go) | 200–500 ns | ~2 μs native work per call |
| ctypes (Python) | 1–5 μs | ~10 μs native work per call |
Fix families for native-bridge overhead:
- Batch per crossing — pass a slice of inputs, receive a slice of outputs. Amortise the fixed overhead over N items.
- Push the loop into native — instead of calling native N times, call native once with the loop body inside the native function.
- Raise the boundary — move the FFI boundary to a coarser operation so fewer crossings happen per unit of work.
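The first two fix families can be sketched as an API shape. In the example below, `bridge` is a hypothetical stand-in for a native addon: it runs in-process and simply counts how often the boundary would be crossed, so the difference between a per-item entry point and a batched one is visible without a real FFI. The HMAC-SHA256 routine and the key are placeholders for whatever the native side actually computes.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// bridge is a hypothetical stand-in for a native addon. crossings
// counts how many times the language boundary would be crossed.
type bridge struct {
	key       []byte
	crossings int
}

// sum is the "native" work: one HMAC-SHA256 over msg.
func (b *bridge) sum(msg []byte) [32]byte {
	mac := hmac.New(sha256.New, b.key)
	mac.Write(msg)
	var out [32]byte
	copy(out[:], mac.Sum(nil))
	return out
}

// hmacOne models the per-item entry point: one crossing per message.
func (b *bridge) hmacOne(msg []byte) [32]byte {
	b.crossings++
	return b.sum(msg)
}

// hmacBatch models the batched entry point: one crossing for the
// whole slice, with the loop running on the "native" side.
func (b *bridge) hmacBatch(msgs [][]byte) [][32]byte {
	b.crossings++
	out := make([][32]byte, len(msgs))
	for i, m := range msgs {
		out[i] = b.sum(m)
	}
	return out
}

func main() {
	msgs := make([][]byte, 64)
	for i := range msgs {
		msgs[i] = []byte(fmt.Sprintf("message-%d", i))
	}

	perCall := &bridge{key: []byte("demo-key")}
	for _, m := range msgs {
		perCall.hmacOne(m)
	}

	batched := &bridge{key: []byte("demo-key")}
	batched.hmacBatch(msgs)

	fmt.Println("per-call crossings:", perCall.crossings)
	fmt.Println("batched crossings: ", batched.crossings)
}
```

Both paths produce the same digests; only the number of crossings changes, from 64 to 1. A real batched binding also has to decide how inputs are laid out in memory across the boundary, which this sketch does not model.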
Edge cases where “wider frame = bigger problem” lies
Three situations where the widest leaf is not the right attack target.
1. Sampled-out short hot paths
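A function called 500,000 times per second for 200 ns each runs for 100 ms/s total: 10% of one CPU-second. At a standard 100 Hz sampling rate, the profiler takes 100 samples per second per CPU, so only about 10 of them are expected to land in this function per second of capture. In a short capture window those few samples are noisy, and if the calls run in lockstep with the sampling timer they can be missed almost entirely, depending on alignment.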
The frame is narrow in the flame graph, but it is a top consumer. Diagnosis: instrument with cheap counters (atomic increments + a Prometheus histogram), or raise sample rate temporarily to 1000 Hz during a dedicated profiling window.
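The sampling arithmetic for this case can be sketched as follows. The helper names are hypothetical, and the model assumes samples are spread uniformly over on-CPU time on a single CPU, which is exactly the assumption that breaks when work runs in lockstep with the timer.

```go
package main

import "fmt"

// busyFraction is the share of one CPU-second a function consumes,
// given its call rate and per-call duration.
func busyFraction(callsPerSec, nsPerCall float64) float64 {
	return callsPerSec * nsPerCall / 1e9
}

// expectedSamples is how many samples should land in the function,
// assuming samples are uniformly distributed over on-CPU time.
func expectedSamples(rateHz, captureSec, cpuFraction float64) float64 {
	return rateHz * captureSec * cpuFraction
}

func main() {
	frac := busyFraction(500_000, 200)
	fmt.Printf("CPU fraction: %.0f%%\n", frac*100)
	fmt.Printf("samples at 100 Hz over 1 s:  %.0f\n", expectedSamples(100, 1, frac))
	fmt.Printf("samples at 1000 Hz over 1 s: %.0f\n", expectedSamples(1000, 1, frac))
}
```

Raising the rate tenfold gives ten times as many samples in the same window, which narrows the relative error enough to trust the frame's width.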
2. Spin-wait dominating the CPU profile
A CPU profile shows a function wide because the program spin-waited inside it — busy-looping until a condition holds. The thread is on CPU, consuming cycles, but doing no real work. The fix is not to optimise the spin’s body; it is to convert the spin into a proper wait (futex, condition variable, channel).
Signature: function body is a tight branch back to itself; IPC is low despite being CPU-bound in the profile; context-switch rate is low (the thread never yields).
3. Symbol resolution failures
A wide [unknown] frame is not a function — it is a stack the profiler cannot resolve. Common causes: JIT-compiled code without perf maps (Node needs --perf-basic-prof; JVM needs -XX:+PreserveFramePointer), stripped DWARF debug info, missing kernel symbols.
Before treating [unknown] as a target, fix the symbol resolution. The underlying function may be the real hot path, hidden by a diagnostic gap.
Steps to diagnose and fix a false-sharing regression, in order:
1. Observe: IPC < 1, high cache-miss rate, performance worsens with thread count.
2. Run perf stat with XSNP_HITM (or cache-misses) to confirm cache-line bouncing.
3. Identify which struct fields are written by multiple threads simultaneously.
4. Calculate how many fields fit on one 64-byte cache line.
5. Pad each independently-written field to occupy a full cache line.
6. Re-run perf stat: IPC should rise, cache-miss rate should drop, throughput should increase.
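For the before/after comparison, a small timing harness can complement the hardware counters. The sketch below is an assumption-laden micro-benchmark, not a substitute for perf stat: `hammer` is a hypothetical helper that spaces per-worker counters `stride` elements apart in one `[]uint64`, so a stride of 1 packs them tightly and a stride of 8 puts 64 bytes between them.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// hammer runs `workers` goroutines, each doing `perWorker` atomic
// increments on its own counter. Counters sit `stride` uint64 slots
// apart: stride 1 packs 8 per 64-byte line, stride 8 spaces them
// 64 bytes apart. It returns the elapsed time and the sum of counters.
func hammer(stride, workers, perWorker int) (time.Duration, uint64) {
	slots := make([]uint64, workers*stride)
	var wg sync.WaitGroup

	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(p *uint64) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddUint64(p, 1)
			}
		}(&slots[w*stride])
	}
	wg.Wait()
	elapsed := time.Since(start)

	var total uint64
	for _, v := range slots {
		total += v
	}
	return elapsed, total
}

func main() {
	const workers, perWorker = 16, 1_000_000

	packed, n1 := hammer(1, workers, perWorker)
	padded, n2 := hammer(8, workers, perWorker)

	fmt.Printf("packed: %v for %d increments\n", packed, n1)
	fmt.Printf("padded: %v for %d increments\n", padded, n2)
}
```

The timings depend on core count, scheduler placement, and hardware, so no specific ratio should be expected; on a single-core machine the two layouts will look the same because no line ever moves between caches.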
1. Walk through diagnosing false sharing: what does the profile show, which hardware counter confirms it, and what is the fix?
2. Give two concrete examples of hot paths that appear wide in a flame graph but are NOT the right fix target, and explain why.
False sharing and native-bridge overhead are two senior-level hot-path gotchas invisible to naive profiling. False sharing occurs when threads write to different fields on the same cache line; the MESI protocol serialises the writes at hardware level, collapsing IPC and spiking cache-miss rate despite lock-free code. The fix is cache-line padding. Native-bridge overhead occurs when the FFI stub (N-API, JNI, cgo) costs more than the native function it calls; the fix is batching operations per crossing. Both require hardware counters or cross-language profilers to diagnose. Three edge cases subvert the “widest frame = biggest problem” heuristic: sampled-out short hot paths, spin-waits that burn CPU without doing work, and symbol-resolution gaps showing as [unknown].