awesome-everything RU
↑ Back to the climb

Performance

Cache vs big-O: code and layout reading

Crux Read real snippets — a strided loop, an AoS struct, a padded counter, and a pointer chain — predict the cache and pipeline behaviour, then pick the highest-leverage fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min

Cache problems are diagnosed in the code and the profile, not in the big-O. Read each snippet, predict where the wall-clock time actually goes, and pick the fix a senior engineer reaches for first.

Goal

Practise the loop you run in every performance incident: read the hot path, find the layout or access-pattern defect the profiler will confirm, and choose the highest-leverage fix before touching SIMD intrinsics or compiler flags.

Snippet 1 — the strided sum

double sum = 0;
// matrix is row-major: matrix[i][j] at offset (i*N + j)*8 bytes
for (int j = 0; j < N; j++)        // outer: column index
    for (int i = 0; i < N; i++)    // inner: row index
        sum += matrix[i][j];
Quiz

On a 10 000×10 000 double matrix, why is this loop ~9x slower than swapping the two loops, and what is the fix?

Snippet 2 — the struct layout

struct Particle { float x, y, z, vx, vy, vz; char name[40]; };  // 64 bytes
Particle parts[1_000_000];

float total_x = 0;
for (int i = 0; i < 1_000_000; i++)
    total_x += parts[i].x;          // only x is read
Quiz

The loop reads only `x`, yet the profile shows a high L3 miss rate. What is the layout defect and the fix?

Snippet 3 — the padded counter

type paddedCounter struct {
    value uint64
    _     [56]byte   // pad to a full 64-byte cache line
}
var counters [16]paddedCounter

func worker(idx int) {                     // one goroutine per idx
    atomic.AddUint64(&counters[idx].value, 1)
}
Quiz

What does the `[56]byte` padding buy, and what failure does it prevent?

Snippet 4 — the pointer chain

// idx[] holds the next index to visit; following it is data-dependent
int i = start;
for (int step = 0; step < N; step++) {
    process(data[i]);
    i = idx[i];          // next address depends on this load's result
}
Quiz

Even when `data` and `idx` fit in L2, this runs far slower than a sequential scan of the same arrays. Why, and what is the structural fix?

Recap

Every snippet here is a constant-factor defect big-O cannot see: a column-order loop on row-major storage misses on every step (reorder or tile); an AoS struct drags cold fields into the line for a one-field loop (split to SoA); unpadded per-thread counters ping-pong cache lines (pad to 64 bytes); and a dependent pointer chain serializes misses and defeats both MLP and the prefetcher (precompute a contiguous order). Read the access pattern, name the cause, fix the layout — then re-profile to confirm.

Continue the climb ↑Cache vs big-O: beat the textbook
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources2
expand
  1. 01
  2. 02

Trademarks belong to their respective owners. Editorial reference only.