awesome-everything RU
↑ Back to the climb

Performance

Cache lines, struct layout, and false sharing

Crux A 64-byte cache line is the atom of memory transfer — misaligned structs, over-packed fields, and threads sharing a line all pay the same 100x penalty despite appearing unrelated.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

A team makes a counter array lock-free with atomic operations. Under load, it is slower than the locked version was. Nobody modified the algorithm. The problem is that eight counters share two cache lines — and every atomic write on any of them invalidates the others.

Cache lines as ownership units

The CPU cache stores data in 64-byte lines (128 bytes on some Apple Silicon and ARM server cores). When one CPU core modifies any byte in a cache line, it must acquire exclusive ownership of that line. The MESI protocol (Modified / Exclusive / Shared / Invalid) coordinates this:

  • Modified: this core owns the line and has written to it; all other cores’ copies are Invalid.
  • Shared: multiple cores have a read-only copy.
  • Invalid: this core’s copy is stale; it must re-fetch before reading.

Every write sends a coherency message to every other core that holds a copy. Those cores mark their copy Invalid. The next read by any other core triggers a re-fetch from L3 or RAM — costing 30–100+ cycles.

Struct size and field packing

How many structs fit on one cache line determines how many loads you need per iteration. Consider a struct with 6 fields totalling 48 bytes:

struct Item { uint64 a; uint64 b; uint64 c; uint32 d; uint32 e; uint32 f; }
// Total: 3×8 + 3×4 = 36 bytes without padding, 40 bytes with typical alignment

With proper alignment, one 64-byte cache line holds one struct (with 16–24 bytes of space). A hot loop that touches only field a still loads all 48 bytes into L1 for every iteration. If the hot-path fields are the first in the struct, the first access per element is the only miss. If the hot-path field is the last, it might straddle two cache lines.

Struct splitting addresses this: separate the hot fields (accessed in the tight loop) from cold fields (accessed rarely) into two parallel arrays.

// Before: AoS with hot+cold mixed
struct Node { uint64 hot_key; uint64 hot_val; char[200] cold_metadata; }
Node nodes[1M];

// After: SoA or struct split
uint64 hot_keys[1M];
uint64 hot_vals[1M];
char cold_metadata[1M][200]; // accessed separately

Hot-field access now loads 8 keys per cache line instead of loading 200+ bytes per element. Working set in L1 shrinks by 12x.

Cache vs big-O: numbers that matter
L1 latency
~1 ns (3–5 cycles)
L2 latency
~3 ns (10–15 cycles)
L3 latency
~10 ns (30–50 cycles)
RAM latency (DDR5)
~70–100 ns
Cache line size
64 B (x86, most ARM)
Array vs linked-list iter
~17x slowdown
Row-major vs column-major
~9x slowdown

False sharing in multithreaded code

False sharing is the most common “invisible” cache performance bug. Two threads write to different variables that happen to share the same 64-byte cache line. From each thread’s perspective, it is working on independent data. From MESI’s perspective, every write by one thread invalidates the other thread’s copy of the entire line.

The result: the line ping-pongs between CPU cores at L3 latency (~30–50 cycles per bounce). IPC collapses to 0.3–0.6. Lock-free code performs worse than locked code.

Example:

// 16 counters, each 8 bytes = 128 bytes total = 2 cache lines
// counters[0..7] share one line; counters[8..15] share another
var counters [16]uint64

// Each goroutine writes to its own counter, but 8 goroutines share a line
// → every write invalidates the other 7's L1 copies
func worker(idx int) {
    atomic.AddUint64(&counters[idx], 1)
}

Fix: pad to a cache line boundary

type paddedCounter struct {
    value uint64
    _     [56]byte // pad to 64 bytes total
}
var counters [16]paddedCounter
// Now each counter occupies its own cache line
// → goroutines on different lines no longer interfere

In Java, @Contended (from jdk.internal.vm.annotation) inserts padding automatically. In Rust, crossbeam::CachePadded wraps values. In C++, alignas(64) on a struct field achieves the same.

Why this works

The Linux kernel scheduler’s per-CPU run queues, Java’s Disruptor ring buffer, and DPDK’s per-core packet queues all carry explicit cache-line padding as a first-class design invariant. Code review for any struct whose fields are written by multiple CPUs simultaneously should flag tight packing as a defect.

ObservationFalse sharing suspectNormal lock contention
CPU profile widthWide (CPU stalled on memory)Narrow in CPU, wide in off-CPU
IPC0.3–0.6 (memory-stalled)Near 0 (thread not running)
Scales with thread countGets worseGets worse
Cache-miss rateVery high (60–80%)Normal
Quiz

A tight loop scans an array of 1M structs, each containing 6 fields totalling 48 bytes. Cache lines are 64 bytes. How many structs fit per cache line?

Quiz

A lock-free counter array shows IPC 0.4 and 72% cache-miss rate as thread count rises. The most likely cause is:

Order the steps

Order the steps to diagnose and fix a false-sharing regression:

  1. 1 Observe: IPC <1, high cache-miss rate, performance worsens with thread count
  2. 2 Run perf stat with cache-misses and XSNP_HITM (Intel) to confirm cache-line bouncing
  3. 3 Identify which struct fields are written by multiple threads simultaneously
  4. 4 Calculate how many fields share one 64-byte line
  5. 5 Pad each independently-written field to its own 64-byte cache line
  6. 6 Re-run perf stat: IPC should rise, cache-miss rate should drop, throughput increase
Recall before you leave
  1. 01
    Explain how cache coherency (MESI) creates false sharing as a multithreaded performance bug, and how to detect and fix it.
  2. 02
    What is struct splitting, and when should you apply it?
Recap

Cache lines are the 64-byte units of hardware ownership. The MESI protocol ensures that any core writing to a cache line invalidates copies in all other cores, costing 30–100 cycles per bounce. False sharing — two threads writing to different variables on the same line — causes lock-free code to perform worse than locked code: IPC collapses and cache-miss rate spikes while threads never actually wait on each other. Detect it via perf c2c or XSNP_HITM hardware counters; fix it by padding per-thread fields to 64-byte boundaries. The same principle applies to struct layout: separate hot fields from cold ones to keep L1’s working set small and dense.

Connected lessons
appears again in167
Continue the climb ↑Branch prediction and branchless code
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.