Performance

SIMD, SoA vs AoS, and memory bandwidth

Crux: Array-of-Structures (AoS) layout obstructs SIMD; Structure-of-Arrays (SoA) enables it. Memory bandwidth and NUMA are two constraints that cache-miss-rate profiling alone does not reveal.

A hot ML inference loop is 5x slower than the reference C++ implementation, despite identical algorithm and the same number of multiplications. The profiler shows IPC 0.8 and 25% L3 miss rate. Switching to SIMD intrinsics is suggested — but the fix that actually works is changing how the data is stored, not how it is computed.

SIMD: one instruction, multiple values

Modern CPUs carry wide vector registers: 256-bit (AVX2, introduced on x86 in 2013), 512-bit (AVX-512, available on server Intel and recent Ryzen), 128-bit NEON on ARM. One SIMD instruction can add, multiply, or compare 4–16 floats simultaneously, depending on the register width.

The critical requirement: the values must be consecutive in memory. A single load instruction fills the vector register from a contiguous block; the CPU then operates on all lanes in parallel.
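A minimal sketch of the contiguous-load pattern, assuming an x86 compiler that accepts the AVX intrinsics from immintrin.h (for example when built with -mavx2). The helper name add_arrays is hypothetical, and the scalar fallback keeps the function usable when AVX is not enabled.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Element-wise add: out[i] = a[i] + b[i].
// add_arrays is a hypothetical helper name, not part of any library.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Each iteration loads 8 consecutive floats from each input,
    // adds all 8 lanes with one instruction, and stores 8 results.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar tail; this is also the whole loop when AVX is not enabled.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The unaligned load and store variants are used so the sketch does not depend on how the caller allocated the arrays.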

Array of Structures vs Structure of Arrays

AoS (Array of Structures):

struct Point { float x; float y; float z; };
Point points[N];  // layout: x0 y0 z0 x1 y1 z1 x2 y2 z2 ...

To add all x values with SIMD, you need points[0].x, points[1].x, points[2].x, points[3].x, but they sit at byte offsets 0, 12, 24, 36. A SIMD load at address 0 gives x0 y0 z0 x1: mixed fields, not four x values. You must use a gather instruction (or several loads followed by shuffles) to pull out just the x values, which costs roughly 10x a contiguous load.

SoA (Structure of Arrays):

float xs[N];  // layout: x0 x1 x2 x3 x4 x5 x6 x7 ...
float ys[N];
float zs[N];

To add all x values with SIMD: load 8 consecutive floats starting at xs[0], done. With AVX2, one instruction produces 8 results, so each instruction in the loop body does up to 8x the work of its scalar equivalent.
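The two layouts can be compared side by side. This is a sketch, not a benchmark: it reads "add to all x values" as shifting every x by a constant, and the names PointsSoA, shift_x_aos, and shift_x_soa are hypothetical.

```cpp
#include <vector>

struct Point { float x; float y; float z; };
static_assert(sizeof(Point) == 12, "in AoS, consecutive x values sit 12 bytes apart");

// AoS: the loop touches every third float, so a plain vector load
// would also pull in y and z values that this loop does not need.
void shift_x_aos(std::vector<Point>& points, float dx) {
    for (Point& p : points) {
        p.x += dx;
    }
}

// SoA: one flat array per field (hypothetical container for illustration).
struct PointsSoA {
    std::vector<float> xs, ys, zs;
};

// SoA: xs is contiguous, so a vectorising compiler can process
// several x values per load, add, and store.
void shift_x_soa(PointsSoA& points, float dx) {
    for (float& x : points.xs) {
        x += dx;
    }
}
```

Both functions compute the same result; only the memory stride differs (12 bytes in AoS, 4 bytes in SoA).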

Layout | SIMD compatibility | Cache efficiency | Best use case
AoS | Requires gather/scatter (slow) | Good when all fields used | Single-element operations, OOP
SoA | Native contiguous load | Excellent when one field used | Batch processing, ML, game physics

Game engines (Unity ECS, Bevy), ML inference engines, audio processing, and database column stores commonly use SoA-style layouts. In JavaScript, TypedArrays give hot loops a similar SoA-like flat layout that engines such as V8 (Chrome’s JavaScript engine) handle well. In the previous lesson’s ML example, changing from AoS (x,y,z,w) structs to xs[], ys[], zs[], ws[] drops the L3 miss rate from 25% to 5% and raises IPC from 0.8 to 3.0.

Auto-vectorisation

Compilers automatically convert simple loops to SIMD when the data layout allows. Conditions for auto-vectorisation success:

  • No pointer aliasing (two pointers don’t point to overlapping memory — use restrict in C, Rust handles this via borrow checker).
  • Contiguous memory access (not strided or gather/scatter).
  • Predictable trip count.
  • No data-dependent inner branches.

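Check what the compiler emitted: -fopt-info-vec (GCC) or -Rpass=loop-vectorize (Clang) reports the loops that were vectorised; add -fopt-info-vec-missed (GCC) or -Rpass-missed=loop-vectorize and -Rpass-analysis=loop-vectorize (Clang) to see which loops failed and why. For hot loops that fail auto-vectorisation, manual SIMD intrinsics or libraries (Google Highway, SIMD Everywhere) close the gap.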
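A loop that meets all four conditions might look like the sketch below. The function name scale_add is hypothetical, and __restrict is a compiler extension accepted by GCC, Clang, and MSVC (standard C++ has no restrict keyword), so treat it as an assumption about your toolchain.

```cpp
#include <cstddef>

// y[i] = a * x[i] + y[i] over n contiguous floats.
// __restrict promises the compiler that x and y do not overlap,
// which removes the aliasing obstacle to vectorisation.
void scale_add(float a, const float* __restrict x, float* __restrict y,
               std::size_t n) {
    // Contiguous access, trip count known on loop entry, no branches in the body.
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Building this with optimisation enabled (for example -O3) together with -fopt-info-vec on GCC or -Rpass=loop-vectorize on Clang should print a remark for the loop if the compiler vectorised it.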

Memory bandwidth: the other constraint

Cache hit rate is one axis of performance; memory bandwidth is another. A workload that streams through 100 GB of data will hit the RAM bandwidth ceiling (~50–100 GB/s on DDR5) regardless of cache behaviour. Bandwidth-bound code is fixed by:

  • Reducing data volume: more compact types (float32 instead of float64 when precision allows), on-the-fly computation instead of materialised intermediate tables.
  • Compression at rest, decompression on-the-fly.
  • Non-temporal stores (covered in the senior lesson): bypass cache for write-once data.
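The bandwidth ceiling translates into a simple lower bound on run time. The helper below is a back-of-the-envelope sketch with a hypothetical name; the bandwidth you pass in is an assumption about your machine, not a measured value.

```cpp
// Lower bound on the time needed to stream `bytes` from RAM
// at a sustained bandwidth of `bytes_per_second`.
// No amount of caching or SIMD makes the loop finish faster than this.
double min_stream_seconds(double bytes, double bytes_per_second) {
    return bytes / bytes_per_second;
}
```

At 50 GB/s, min_stream_seconds(100e9, 50e9) gives 2 seconds for the 100 GB workload above. Storing the same number of elements as float32 instead of float64 halves the byte count and therefore halves the bound.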

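perf stat -e cache-misses,mem_load_retired.l3_miss reports the miss counts (the second event name is Intel-specific and varies by CPU generation; check perf list). The counts alone do not separate cache-miss-bound from bandwidth-bound code, so read them against data volume: if L3 misses are high and the loop streams a huge volume of data, you are bandwidth-bound; if L3 misses are high but the data volume is small and the access pattern is random, you are cache-miss-bound.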

NUMA: multi-socket memory access

Servers with 2+ CPU sockets have Non-Uniform Memory Access (NUMA). Each socket reaches its local RAM in ~70 ns and the other socket's RAM in ~120–150 ns. A thread that runs on socket 1 but whose memory was allocated on socket 0 pays the remote-access tax on every load that misses cache. That is a 1.7–2.1x latency penalty, and generic perf stat counters will not flag it: a remote access is counted as an ordinary L3 miss, with nothing to show that it was served across the interconnect.

Mitigations:

  • Pin threads to sockets (taskset, hwloc).
  • Allocate on the local NUMA node (numactl --membind, jemalloc NUMA-aware mode).
  • Distribute work so each thread touches data from its local socket.
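Thread pinning can also be done from inside the program. The sketch below is Linux-only and assumes glibc's sched_getaffinity and sched_setaffinity; the function name pin_to_first_allowed_cpu is hypothetical. It pins to a single CPU rather than a whole socket, and a real service would choose CPUs on the NUMA node that holds its data (for example via hwloc), which is outside this sketch.

```cpp
#if defined(__linux__)
#include <sched.h>
#endif

// Pins the calling thread to the first CPU it is currently allowed to run on.
// Returns that CPU index, or -1 on failure or on non-Linux platforms.
int pin_to_first_allowed_cpu() {
#if defined(__linux__)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t target;
            CPU_ZERO(&target);
            CPU_SET(cpu, &target);
            if (sched_setaffinity(0, sizeof(target), &target) != 0) {
                return -1;
            }
            return cpu;
        }
    }
#endif
    return -1;
}
```

Once pinned, the thread can no longer migrate to another socket, so memory it allocates and first touches afterwards is typically placed on its local node under Linux's default policy.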

SIMD and data layout numbers

AVX2 float throughput: 8 floats per instruction
AVX-512 float throughput: 16 floats per instruction
NEON (ARM) throughput: 4 floats per instruction
Gather/scatter vs contiguous load: ~10x slower
DDR5-6000 bandwidth: ~50 GB/s per channel
NUMA local vs remote latency: 70 ns vs 120–150 ns
ML loop AoS→SoA IPC gain: 0.8 → 3.0 (example)

Why this works

Inlining affects the instruction cache (I-cache), not data cache. Aggressive inlining of a function called from 100 sites adds 100 copies of its body to the binary. If those copies pollute I-cache they evict other hot functions. The fix: inline tiny functions (1–3 lines) freely; inline larger functions only on the hottest call sites. PGO makes these decisions based on real call frequencies. Monitoring L1-icache-load-misses in perf stat catches instruction-cache pressure after layout changes.

Quiz

A hot ML inference loop is 5x slower than a reference C++ port despite identical algorithm. perf stat shows IPC 0.8 and 25% L3 miss rate. What is the primary diagnosis?

Quiz

A backend service runs on a 2-socket NUMA server. A thread allocates a large working set on socket 0, then migrates to socket 1 under load. What perf signature does this produce?

Recall before you leave

  1. Explain the AoS vs SoA trade-off for a hot loop that accesses only one field of a multi-field struct.
  2. A perf stat run shows high L3 miss rate, but the working set is only 50 MB (well within L3 capacity). What alternative explanation should you investigate, and how?

Recap

SIMD instructions operate on 4–16 consecutive floats per instruction. AoS layout interleaves fields, forcing gather operations that cost roughly 10x a contiguous load, while SoA stores each field in a flat array that supports native SIMD loads. The ML example’s 5x gap came down to data layout: switching from AoS to SoA fixed it with no algorithm change. Beyond cache locality, two separate constraints limit throughput: memory bandwidth (how many GB/s you can stream from RAM) and NUMA topology (remote-socket memory costs roughly 1.7–2.1x the latency of local memory). The L3 miss rate from perf stat hints at both, but it takes data-volume context and NUMA-specific counters to tell them apart. Fix layout to SoA first, verify the loop becomes compute-bound, then apply SIMD intrinsics or rely on auto-vectorisation.


Trademarks belong to their respective owners. Editorial reference only.