Crux Read real snippets — a strided loop, an AoS struct, a padded counter, and a pointer chain — predict the cache and pipeline behaviour, then pick the highest-leverage fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
Cache problems are diagnosed in the code and the profile, not in the big-O. Read each snippet, predict where the wall-clock time actually goes, and pick the fix a senior engineer reaches for first.
Goal
Practise the loop you run in every performance incident: read the hot path, find the layout or access-pattern defect the profiler will confirm, and choose the highest-leverage fix before touching SIMD intrinsics or compiler flags.
Snippet 1 — the strided sum
double sum = 0;// matrix is row-major: matrix[i][j] at offset (i*N + j)*8 bytesfor (int j = 0; j < N; j++) // outer: column index for (int i = 0; i < N; i++) // inner: row index sum += matrix[i][j];
Quiz
Completed
On a 10 000×10 000 double matrix, why is this loop ~9x slower than swapping the two loops, and what is the fix?
Heads-up The floating-point add is nearly free; the CPU stalls waiting on memory. SIMD cannot load a strided column efficiently anyway — fix the access order first.
Heads-up They touch the same elements but in a different memory order. Column-order on row-major storage misses cache on essentially every access; the wall-clock gap is ~9x.
Heads-up Alignment does not change an 80 000-byte stride. The defect is iterating across the storage layout; reorder the loops or tile the matrix.
Snippet 2 — the struct layout
struct Particle { float x, y, z, vx, vy, vz; char name[40]; }; // 64 bytesParticle parts[1_000_000];float total_x = 0;for (int i = 0; i < 1_000_000; i++) total_x += parts[i].x; // only x is read
Quiz
Completed
The loop reads only `x`, yet the profile shows a high L3 miss rate. What is the layout defect and the fix?
Heads-up Size is not the root cause — wasted bandwidth is. SoA shrinks the bytes touched ~16x for this loop, so far more of it stays cache-resident and the prefetcher streams it.
Heads-up Prefetch helps a little, but you would still pull 64 bytes per element to use 4. The structural fix is SoA, which removes the wasted 60 bytes per line.
Heads-up It is already 64 bytes with no slack; packing changes nothing. The waste is the cold fields (`y…name`) co-located with the hot field, which only struct-splitting/SoA fixes.
Snippet 3 — the padded counter
type paddedCounter struct { value uint64 _ [56]byte // pad to a full 64-byte cache line}var counters [16]paddedCounterfunc worker(idx int) { // one goroutine per idx atomic.AddUint64(&counters[idx].value, 1)}
Quiz
Completed
What does the `[56]byte` padding buy, and what failure does it prevent?
Heads-up The atomic instruction is unchanged. Padding removes inter-core coherency traffic (line invalidations), not per-instruction cost.
Heads-up atomic.AddUint64 already makes the increment race-free. Padding addresses a performance bug (false sharing), not correctness.
Heads-up It does the opposite — it inflates each counter from 8 to 64 bytes. The trade is memory for the elimination of cache-line contention.
Snippet 4 — the pointer chain
// idx[] holds the next index to visit; following it is data-dependentint i = start;for (int step = 0; step < N; step++) { process(data[i]); i = idx[i]; // next address depends on this load's result}
Quiz
Completed
Even when `data` and `idx` fit in L2, this runs far slower than a sequential scan of the same arrays. Why, and what is the structural fix?
Heads-up They fit in L2 — capacity is fine. The cost is the dependency chain serializing the misses; that is an access-pattern problem, not a capacity one.
Heads-up The stalls are on the dependent `idx[i]` loads. Even a trivial process() leaves the loop memory-latency-bound because each step waits on the previous load.
Heads-up Recursion does not break the data dependency. You must materialise the order into independent, sequential accesses so MLP and the prefetcher can engage.
Recap
Every snippet here is a constant-factor defect big-O cannot see: a column-order loop on row-major storage misses on every step (reorder or tile); an AoS struct drags cold fields into the line for a one-field loop (split to SoA); unpadded per-thread counters ping-pong cache lines (pad to 64 bytes); and a dependent pointer chain serializes misses and defeats both MLP and the prefetcher (precompute a contiguous order). Read the access pattern, name the cause, fix the layout — then re-profile to confirm.