Performance PERF · 03 · 03

Cache lines, struct layout, and false sharing

A 64-byte cache line is the atom of memory transfer — misaligned structs, over-packed fields, and threads sharing a line all pay the same 100x penalty despite appearing unrelated.

PERF Middle ◷ 14 min

Level

FoundationsJuniorMiddleSenior

A team makes a counter array lock-free with atomic operations. Under load, it is slower than the locked version was. Nobody modified the algorithm. The problem is that eight counters share two cache lines — and every atomic write on any of them invalidates the others.

Cache lines as ownership units

Have you ever made code lock-free and watched it get slower? Understanding cache lines as hardware ownership units — not just transfer units — is the key to diagnosing the most common “I parallelised it and it regressed” failure.

The CPU cache stores data in 64-byte lines (128 bytes on some Apple Silicon and ARM server cores). When one CPU core modifies any byte in a cache line, it must acquire exclusive ownership of that line. The MESI protocol (Modified / Exclusive / Shared / Invalid) coordinates this:

Modified: this core owns the line and has written to it; all other cores’ copies are Invalid.
Shared: multiple cores have a read-only copy.
Invalid: this core’s copy is stale; it must re-fetch before reading.

Every write sends a coherency message to every other core that holds a copy. Those cores mark their copy Invalid. The next read by any other core triggers a re-fetch from L3 or RAM — costing 30–100+ cycles.

Struct size and field packing

How many structs fit on one cache line determines how many loads you need per iteration. Consider a struct with 6 fields totalling 48 bytes:

struct Item { uint64 a; uint64 b; uint64 c; uint32 d; uint32 e; uint32 f; }
// Total: 3×8 + 3×4 = 36 bytes without padding, 40 bytes with typical alignment

With proper alignment, one 64-byte cache line holds one struct (with 16–24 bytes of space). A hot loop that touches only field a still loads all 48 bytes into L1 for every iteration. If the hot-path fields are the first in the struct, the first access per element is the only miss. If the hot-path field is the last, it might straddle two cache lines.

Struct splitting addresses this: separate the hot fields (accessed in the tight loop) from cold fields (accessed rarely) into two parallel arrays.

// Before: AoS with hot+cold mixed
struct Node { uint64 hot_key; uint64 hot_val; char[200] cold_metadata; }
Node nodes[1M];

// After: SoA or struct split
uint64 hot_keys[1M];
uint64 hot_vals[1M];
char cold_metadata[1M][200]; // accessed separately

Hot-field access now loads 8 keys per cache line instead of loading 200+ bytes per element. Working set in L1 shrinks by 12x.

Cache vs big-O: numbers that matter

L1 latency: ~1 ns (3–5 cycles)
L2 latency: ~3 ns (10–15 cycles)
L3 latency: ~10 ns (30–50 cycles)
RAM latency (DDR5): ~70–100 ns
Cache line size: 64 B (x86, most ARM)
Array vs linked-list iter: ~17x slowdown
Row-major vs column-major: ~9x slowdown

The whole reason cache lines matter: a hit in L1 costs ~1 ns, but a miss that reaches RAM costs ~85 ns — roughly 85x more. Layout that keeps the hot working set in L1 is what avoids paying that cliff every access.

False sharing is the most common “invisible” cache performance bug. Two threads write to different variables that happen to share the same 64-byte cache line. From each thread’s perspective, it is working on independent data. From MESI’s perspective, every write by one thread invalidates the other thread’s copy of the entire line.

The result: the line ping-pongs between CPU cores at L3 latency (~30–50 cycles per bounce). IPC collapses to 0.3–0.6. Lock-free code performs worse than locked code.

Example:

// 16 counters, each 8 bytes = 128 bytes total = 2 cache lines
// counters[0..7] share one line; counters[8..15] share another
var counters [16]uint64

// Each goroutine writes to its own counter, but 8 goroutines share a line
// → every write invalidates the other 7's L1 copies
func worker(idx int) {
    atomic.AddUint64(&counters[idx], 1)
}

Fix: pad to a cache line boundary

type paddedCounter struct {
    value uint64
    _     [56]byte // pad to 64 bytes total
}
var counters [16]paddedCounter
// Now each counter occupies its own cache line
// → goroutines on different lines no longer interfere

In Java, @Contended (from jdk.internal.vm.annotation) inserts padding automatically. In Rust, crossbeam::CachePadded wraps values. In C++, alignas(64) on a struct field achieves the same.

▸Why this works

The Linux kernel scheduler’s per-CPU run queues, Java’s Disruptor ring buffer, and DPDK’s per-core packet queues all carry explicit cache-line padding as a first-class design invariant. Code review for any struct whose fields are written by multiple CPUs simultaneously should flag tight packing as a defect.

Observation	False sharing suspect	Normal lock contention
CPU profile width	Wide (CPU stalled on memory)	Narrow in CPU, wide in off-CPU
IPC	0.3–0.6 (memory-stalled)	Near 0 (thread not running)
Scales with thread count	Gets worse	Gets worse
Cache-miss rate	Very high (60–80%)	Normal

Quiz

A tight loop scans an array of 1M structs, each containing 6 fields totalling 48 bytes. Cache lines are 64 bytes. How many structs fit per cache line?

Quiz

A lock-free counter array shows IPC 0.4 and 72% cache-miss rate as thread count rises. The most likely cause is:

Order the steps

Order the steps to diagnose and fix a false-sharing regression:

1 Observe: IPC <1, high cache-miss rate, performance worsens with thread count
2 Run perf stat with cache-misses and XSNP_HITM (Intel) to confirm cache-line bouncing
3 Identify which struct fields are written by multiple threads simultaneously
4 Calculate how many fields share one 64-byte line
5 Pad each independently-written field to its own 64-byte cache line
6 Re-run perf stat: IPC should rise, cache-miss rate should drop, throughput increase

counter[0] and counter[1] sit on the same 64-byte line. Each write makes the other core's whole-line copy Invalid — a MESI bounce at L3 latency, ~30–50 cycles per ping.

Recall before you leave

01
Explain how cache coherency (MESI) creates false sharing as a multithreaded performance bug, and how to detect and fix it.
02
What is struct splitting, and when should you apply it?

Recap

Cache lines are the 64-byte units of hardware ownership. The MESI protocol ensures that any core writing to a cache line invalidates copies in all other cores, costing 30–100 cycles per bounce. False sharing — two threads writing to different variables on the same line — causes lock-free code to perform worse than locked code: IPC collapses and cache-miss rate spikes while threads never actually wait on each other. Detect it via perf c2c or XSNP_HITM hardware counters; fix it by padding per-thread fields to 64-byte boundaries. The same principle applies to struct layout: separate hot fields from cold ones to keep L1’s working set small and dense. Now when you see a parallel workload that slows down as you add threads — with high CPU utilisation but low throughput — reach for perf c2c before suspecting a lock.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 7 done

Connected lessons

builds on

Memory hierarchy: why the same O(N) loop can be 17x slowerjunior

unlocks

deepens into

appears again in193

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.