SIMD, SoA vs AoS, and memory bandwidth

Array-of-Structures (AoS) layouts get in the way of SIMD; Structure-of-Arrays (SoA) layouts enable it. Memory bandwidth and NUMA are two further constraints that a cache-miss rate alone does not reveal.

A hot ML inference loop runs 5x slower than the reference C++ implementation, despite using the same algorithm and the same number of multiplications. The profiler shows an IPC of 0.8 and a 25% L3 miss rate. Switching to SIMD intrinsics is the usual suggestion, but the fix that actually works is changing how the data is stored, not how it is computed.

SIMD: one instruction, multiple values

If you have ever wondered why the same arithmetic loop runs 4–8x faster in one codebase than another without any apparent algorithmic difference, data layout is very often the answer.

Modern CPUs carry wide vector registers: 128-bit NEON on ARM, 256-bit AVX2 (on mainstream x86 since Intel Haswell in 2013), and 512-bit AVX-512 (on Intel server parts and AMD Zen 4 and later). One SIMD instruction can add, multiply, or compare 4, 8, or 16 float32 values at once, depending on the register width.

The prize a contiguous (SoA) layout unlocks is 4–16 float32 values per instruction, depending on register width. AoS forfeits it to a gather that costs roughly 10x a contiguous load, so layout decides whether you operate at 1x or at up to 16x.

The critical requirement: the values must be consecutive in memory. A single load instruction fills the vector register from a contiguous block; the CPU then operates on every lane in parallel.

Array of Structures vs Structure of Arrays

AoS (Array of Structures):

struct Point { float x; float y; float z; };
Point points[N];  // layout: x0 y0 z0 x1 y1 z1 x2 y2 z2 ...

To add all x values with SIMD, you need points[0].x, points[1].x, points[2].x, points[3].x, but they sit at byte offsets 0, 12, 24, 36. A 128-bit SIMD load at offset 0 returns x0 y0 z0 x1: three different fields, only two of them x. You must use gather instructions or extra shuffles to pull out just the x values, which costs roughly 10x a contiguous load.
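The offsets quoted above can be reproduced with a minimal sketch. It assumes a 4-byte float and no padding inside the struct, which holds on mainstream ABIs; the helper name aos_x_offset is made up for this illustration.

```c
#include <stddef.h>  /* offsetof, size_t */

struct Point { float x; float y; float z; };

/* Byte offset of points[i].x from the start of an AoS array.
 * Assumes sizeof(struct Point) == 12 (three 4-byte floats, no padding). */
size_t aos_x_offset(size_t i) {
    return i * sizeof(struct Point) + offsetof(struct Point, x);
}
```

The stride between neighbouring x values is sizeof(struct Point), 12 bytes, rather than the 4 bytes a vector load expects.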

SoA (Structure of Arrays):

float xs[N];  // layout: x0 x1 x2 x3 x4 x5 x6 x7 ...
float ys[N];
float zs[N];

To add all x values with SIMD: load 8 consecutive floats starting at xs[0] into one 256-bit register, done. One instruction, 8 results. Each arithmetic instruction in the loop body now does 8x the work.
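As a rough illustration, here is the same elementwise update written against both layouts. The function names are hypothetical, and whether a given compiler vectorises either loop depends on the target and flags; the point is only that the SoA loop touches adjacent floats while the AoS loop steps over y and z on every iteration.

```c
#include <stddef.h>

struct Point { float x; float y; float z; };

/* AoS: consecutive x values are sizeof(struct Point) bytes apart,
 * so the loop walks memory with a stride and drags y and z along. */
void translate_x_aos(struct Point *points, size_t n, float dx) {
    for (size_t i = 0; i < n; i++) {
        points[i].x += dx;
    }
}

/* SoA: x values are adjacent, so a vector load can fill every lane
 * with useful data. */
void translate_x_soa(float *xs, size_t n, float dx) {
    for (size_t i = 0; i < n; i++) {
        xs[i] += dx;
    }
}
```

Both functions compute the same result; only the memory access pattern differs.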

Layout	SIMD compatibility	Cache efficiency	Best use case
AoS	Requires gather/scatter (slow)	Good when all fields used	Single-element operations, OOP
SoA	Native contiguous load	Excellent when one field used	Batch processing, ML, game physics

Game engines (Unity ECS, Bevy), ML inference engines, audio processing, and database column stores all use SoA. JavaScript gets a similar effect on V8 (Chrome’s JavaScript engine) from TypedArrays, which hold one numeric type in a contiguous buffer. In the previous lesson’s ML example, changing from AoS (x,y,z,w) structs to xs[], ys[], zs[], ws[] drops the L3 miss rate from 25% to 5% and raises IPC from 0.8 to 3.0.
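A one-off conversion from (x,y,z,w) structs to four flat arrays might look like the sketch below. The struct name Vec4 and the function aos_to_soa are assumptions for illustration, not names from the lesson's benchmark, and the caller is assumed to provide output arrays with room for n floats each.

```c
#include <stddef.h>

struct Vec4 { float x; float y; float z; float w; };

/* Split an array of n Vec4 structs into four contiguous field arrays.
 * Each output array must have room for at least n floats. */
void aos_to_soa(const struct Vec4 *in, size_t n,
                float *xs, float *ys, float *zs, float *ws) {
    for (size_t i = 0; i < n; i++) {
        xs[i] = in[i].x;
        ys[i] = in[i].y;
        zs[i] = in[i].z;
        ws[i] = in[i].w;
    }
}
```

The conversion costs one full pass over the data, so it pays off when the hot loop runs many times over the same arrays or when the data can be produced in SoA form from the start.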

Auto-vectorisation

Compilers automatically convert simple loops to SIMD when the data layout allows. Conditions for auto-vectorisation success:

No pointer aliasing: two pointers must not refer to overlapping memory. Use restrict in C; in Rust, the borrow checker’s exclusive &mut references give the compiler the same guarantee.
Contiguous access (not strided or gather/scatter).
A trip count the compiler can compute before the loop starts.
No data-dependent branches in the inner loop.

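Check what the compiler emitted. GCC: -fopt-info-vec reports vectorised loops and -fopt-info-vec-missed reports the ones it gave up on, with the reason. Clang: -Rpass=loop-vectorize reports successes; -Rpass-missed=loop-vectorize and -Rpass-analysis=loop-vectorize report failures and their causes. For hot loops that fail auto-vectorisation, manual SIMD intrinsics or portable libraries (Google Highway, SIMDe) close the gap.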
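The two loops below are a small sketch of code shaped to meet those conditions: restrict-qualified pointers, unit-stride access, a trip count known on entry, and a select instead of a branch around the store. The function names are illustrative, and whether a particular compiler actually vectorises them should be confirmed with its report flags, for example gcc -O3 -fopt-info-vec-missed or clang -O2 -Rpass=loop-vectorize.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. restrict promises the compiler that x and y
 * do not overlap, so it may load several elements before storing any. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Clamp negatives to zero. The conditional is written as a value select,
 * which maps onto a vector max or blend rather than a per-element jump. */
void relu(size_t n, const float *restrict in, float *restrict out) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}
```

Passing overlapping buffers to either function breaks the restrict promise and is undefined behaviour, so the qualifier belongs only where the caller can guarantee it.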

Memory bandwidth: the other constraint

Cache hit rate is one axis of performance; memory bandwidth is another. A workload that streams through 100 GB of data will hit the RAM bandwidth ceiling (~50–100 GB/s on DDR5) regardless of cache behaviour. Bandwidth-bound code is fixed by:

Reducing data volume: more compact types (float32 instead of float64 when precision allows), on-the-fly computation instead of materialised intermediate tables.
Compression at rest, decompression on-the-fly.
Non-temporal stores (covered in the senior lesson): bypass cache for write-once data.
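The bandwidth ceiling turns into a simple lower bound on run time: bytes moved divided by bytes per second. The sketch below works that arithmetic through; the function names are made up for illustration, and the 50 GB/s figure used in the note after it is the low end of the DDR5 range quoted above, not a measurement.

```c
/* Total bytes streamed for n_elems elements of bytes_per_elem each. */
double stream_bytes(double n_elems, double bytes_per_elem) {
    return n_elems * bytes_per_elem;
}

/* Lower bound on run time, in seconds, for reading `bytes` once from RAM
 * at a sustained `bytes_per_sec`. Real code can only be slower than this. */
double min_stream_seconds(double bytes, double bytes_per_sec) {
    return bytes / bytes_per_sec;
}
```

With 12.5 billion float64 values (100 GB) at 50 GB/s the floor is 2 seconds however well the caches behave; storing the same values as float32 halves the volume and the floor to 1 second.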

perf stat -e cache-misses,mem_load_retired.l3_miss (an Intel event name; it varies by microarchitecture, so check perf list) helps separate cache-miss-bound from bandwidth-bound code: if L3 misses are high and the total data volume is huge, you are bandwidth-bound; if L3 misses are high but the data volume is small (it fits in L3 but the access pattern is random), you are cache-miss-bound.

NUMA: multi-socket memory access

Servers with 2+ CPU sockets are Non-Uniform Memory Access (NUMA) machines. Each socket reaches its local RAM in ~70 ns and the other socket’s RAM in ~120–150 ns. A thread whose memory lives on socket 0 but which runs on socket 1 pays the remote-access tax on every load that misses cache. This is a 1.7–2.1x latency penalty that a plain L3-miss count does not distinguish from a local miss: both are counted as L3 misses, but only the remote one also crosses the interconnect.

Mitigations:

Pin threads to sockets (taskset, hwloc).
Allocate on the local NUMA node (numactl --membind, jemalloc NUMA-aware mode).
Distribute work so each thread touches data from its local socket.
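To see why these mitigations matter, the average DRAM latency can be estimated as a weighted mix of local and remote accesses. This is a deliberately simple model with a hypothetical function name; it ignores interconnect contention and uses the lesson's ballpark latencies, not measured values.

```c
/* Average DRAM load latency in ns when `remote_fraction` (0.0 to 1.0)
 * of the accesses that miss cache go to the other socket's memory. */
double avg_dram_latency_ns(double remote_fraction,
                           double local_ns, double remote_ns) {
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns;
}
```

With 70 ns local and 150 ns remote, a thread whose data is split evenly across sockets averages 110 ns per miss; pinning and local allocation aim to push the remote fraction toward zero.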

SIMD and data layout numbers

AVX2 float throughput: 8 floats per instruction
AVX-512 float throughput: 16 floats per instruction
NEON (ARM) throughput: 4 floats per instruction
Gather/scatter vs contiguous load: ~10x slower
DDR5-6000 bandwidth: ~50 GB/s per channel
NUMA local vs remote latency: 70 ns vs 120–150 ns
ML loop AoS→SoA IPC gain: 0.8 → 3.0 (example)

▸Why this works

Inlining affects the instruction cache (I-cache), not data cache. Aggressive inlining of a function called from 100 sites adds 100 copies of its body to the binary. If those copies pollute I-cache they evict other hot functions. The fix: inline tiny functions (1–3 lines) freely; inline larger functions only on the hottest call sites. PGO makes these decisions based on real call frequencies. Monitoring L1-icache-load-misses in perf stat catches instruction-cache pressure after layout changes.

Quiz

A hot ML inference loop is 5x slower than a reference C++ port despite identical algorithm. perf stat shows IPC 0.8 and 25% L3 miss rate. What is the primary diagnosis?

Quiz

A backend service runs on a 2-socket NUMA server. A thread allocates a large working set on socket 0, then migrates to socket 1 under load. What perf signature does this produce?

To sum x with SIMD: AoS forces a gather (x values 12 bytes apart, ~10x slower); SoA loads 4–8 consecutive x in one instruction.

Recall before you leave

01
Explain the AoS vs SoA trade-off for a hot loop that accesses only one field of a multi-field struct.
02
A perf stat run shows a high L3 miss rate, but the working set is only 50 MB (within L3 capacity on a large server part). What alternative explanation should you investigate, and how?

Recap

SIMD instructions operate on 4–16 consecutive values per instruction. AoS layout interleaves the fields of each element, forcing gather operations that cost roughly 10x a contiguous load, while SoA stores each field in a flat array and enables native SIMD loads. The ML benchmark’s 5x gap was closed entirely by changing AoS to SoA: no algorithm change, just layout. Beyond cache locality, two separate constraints limit throughput: memory bandwidth (how many GB/s you can stream from RAM) and NUMA topology (remote-socket memory costs about 1.7–2.1x the latency of local memory). perf stat exposes both via L3-miss rate and NUMA counters. Fix the layout to SoA first, verify the loop becomes compute-bound, then apply SIMD intrinsics or rely on auto-vectorisation. Now when you see a hot compute loop that a profiler says should be faster than it is, ask: are the values this loop processes stored contiguously, or interleaved with fields it never touches?
