awesome-everything RU
↑ Back to the climb

Performance

Hardware prefetcher, TLB, and memory-level parallelism

Crux The hardware prefetcher is a lubricant — it helps predictable patterns but cannot infer pointer-chasing. TLB misses add 5–50 cycles before any cache check. MLP lets the CPU stack multiple outstanding misses.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 15 min

A graph traversal loop shows high L3 miss rate even though the graph fits in L2 cache. The prefetcher is silent. Separately, a database scan that streams through 50 GB is bandwidth-bound despite cache being clean and hits being high. Two different mechanisms — same symptom.

Hardware prefetcher: what it can and cannot do

Modern CPUs have multiple hardware prefetchers per core:

  • Sequential prefetcher: detects forward-sequential access and prefetches up to 16 lines ahead. Activates after 2–3 consecutive misses in the same direction.
  • Stride detector: detects “every N-th byte” patterns (e.g. column access at a fixed stride). Works up to strides of ~256 bytes.
  • Stream prefetcher (some chips): detects multiple independent sequential streams simultaneously.

The prefetcher cannot infer the next address when it depends on the current data value — i.e. pointer chasing. A graph traversal where next = node->next_ptr requires loading the pointer value before knowing where to prefetch next. The prefetcher watches address patterns, not data values.

What this means in practice:

  • Sequential array iterations: prefetcher fills L1 continuously, essentially free.
  • Regular strides (every 32 bytes, every 128 bytes): stride detector fires, most misses hidden.
  • Pointer-chasing (linked lists, trees, graphs): every dereference is a cold miss unless data happens to fit entirely in L2/L3.
  • Graph traversal on a graph that fits in L2: still misses because access order is data-dependent, not because data is absent.

Software prefetch is the manual escape hatch. Insert __builtin_prefetch(addr, 0, 1) (GCC/Clang) or _mm_prefetch a few iterations ahead of when the data is needed. Useful when you know the next address at the code level but the hardware cannot infer it. Used in database hash-join implementations, some graph processing libraries.

TLB: the address-translation cache

Before any L1/L2/L3 check, the CPU must translate the virtual address to a physical address. The Translation Lookaside Buffer (TLB) caches recent virtual-to-physical translations:

  • TLB hit: 0–1 cycle overhead.
  • TLB miss: the CPU must walk the page table — 5–50 cycles (sometimes more with deep table nesting).

Default page size: 4 KB. A workload touching 1 GB of data touches 262 144 different pages. The TLB holds 64–1024 entries (depending on level and CPU). Most entries evict frequently — TLB thrashing adds 5–50 cycles per access on top of the L3/RAM latency.

Huge pages (2 MB on x86, optional 1 GB):

  • 2 MB pages: 512x fewer TLB entries needed to cover the same address space (2 MB / 4 KB = 512).
  • 1 GB pages: 262 144x reduction.
  • For code touching hundreds of MB (databases, ML inference, image processing), huge pages can provide 10–20% throughput improvement independently of cache locality.

Enabling: Linux madvise(MADV_HUGEPAGE), hugetlbfs, or LD_PRELOAD with jemalloc/mimalloc configured for transparent huge pages.

Access pathTLB stateTotal latency
L1 hitTLB hit~4 cycles total
L1 hitTLB miss (page walk)~50 cycles total
RAM missTLB hit~300 cycles total
RAM missTLB miss (page walk)~350 cycles total

Memory-level parallelism (MLP)

Modern CPUs can have 8–16 cache misses outstanding simultaneously. When code issues independent loads, the CPU queues them in its load buffer and executes them in parallel — 10 independent misses cost ~150 cycles (one miss latency + dispatch overhead), not 10 × 300 cycles.

Dependent-load chains defeat MLP entirely:

// Pointer chasing: each load depends on the previous result
x = arr[0];         // must wait for arr[0]
y = arr[x];         // must wait for x → cannot start until arr[0] completes
z = arr[y];         // must wait for y → cannot start until arr[x] completes
// Three misses in serial: 3 × 300 ns = 900 ns

Independent loads benefit from MLP:

// Multiple independent loads: CPU can issue all in parallel
a = arr[0]; b = arr[1000]; c = arr[2000]; d = arr[3000];
// Four misses in parallel: ~300 ns total (+ small overhead)

Optimising for MLP: transform dependent pointer-chains into index-based arrays where future indices can be prefetched while current data is processed.

Non-temporal stores

Streaming through gigabytes of data (DB scans, log processing, memcpy) produces data that is read once and never reused. Writing this data through the cache wastes L3 capacity evicting hot data. Non-temporal store instructions bypass the cache and write directly to memory:

// Standard store: goes through L1 → L2 → L3 → RAM
store[i] = value;

// Non-temporal store: bypasses cache, writes directly to RAM
// Frees L3 for hot data; improves sustained bandwidth 20–30%
_mm_stream_si64((long long*)&store[i], value);  // x86 AVX
__builtin_nontemporal_store(value, &store[i]);   // GCC/Clang

Use when: write-once-then-discard pattern is confirmed by profiling (perf stat mem-loads-retired.l3-miss shows writes dominate).

Cache associativity and conflict misses

L1 and L2 caches are not fully associative — each address maps to a specific “set” (typically 8–16 ways associative). If many hot addresses hash to the same set, they evict each other even though cache capacity is available elsewhere.

Classic case: two large arrays accessed at identical offsets that are multiples of the cache size (e.g. two arrays of size 1 MB accessed at stride 1 MB). On a 512 KB L2, offsets 0 and 512 KB map to the same set — they evict each other.

Detection: cachegrind reports D1mr/D1mw and DLmr/DLmw per source line. Conflict misses appear as high miss rate on small data that “should” fit in cache. Fix: pad arrays by a cache-line amount to break the alignment pattern.

Senior-tier cache and CPU numbers
TLB miss page walk
5–50 cycles
Huge page (x86)
2 MB / 1 GB
L1 cache associativity
typically 8-way
MLP outstanding misses
8–16 typical
Non-temporal store bandwidth gain
20–30% streaming
Apple M3 pipeline depth
~10 stages
DDR5-6000 bandwidth
~50 GB/s per channel
Quiz

Why does the hardware prefetcher fail on graph traversals even when the graph fits entirely in L2 cache?

Quiz

A database scan streams through 80 GB of data. Increasing L3 cache from 32 MB to 64 MB provides no speedup. Why, and what would actually help?

Recall before you leave
  1. 01
    A perf-critical graph algorithm shows high L3 miss rate even though the graph fits in L2 cache. What is happening and what are the fixes?
  2. 02
    Explain Memory-Level Parallelism (MLP) and give a concrete code example of how dependent pointer chains defeat it.
Recap

The hardware prefetcher accelerates sequential and strided access — it cannot infer the next address when it depends on data values (pointer-chasing in linked structures, graphs). Beyond L1/L2/L3, every access also pays a TLB translation cost; a miss adds 5–50 cycles before any cache level is checked. Huge pages (2 MB on x86) cut TLB pressure 512x for large working sets. Memory-level parallelism lets the CPU queue 8–16 independent misses simultaneously — dependent pointer-chains serialize all misses into sequential round-trips. Non-temporal stores bypass cache for write-once streaming workloads, protecting L3 for hot data. Cache associativity conflict misses occur when hot addresses hash to the same cache set; fix by padding arrays by a cache line to break alignment patterns.

Connected lessons
appears again in159
Continue the climb ↑Cache-oblivious algorithms, PGO, and production failures
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.