awesome-everything RU
↑ Back to the climb

Performance

Row-major vs column-major: access order and the 9x gap

Crux Changing loop order on the same matrix can produce a 9x wall-clock difference — because the step from row to column flips every access from a cache hit to a cache miss.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at junior altitude — the surface
◷ 12 min

Two matrix-multiply loops. Same algorithm. Same 10 000 × 10 000 matrix. Same number of additions and multiplications. One finishes in 300 ms. The other in 2 800 ms. The only difference: the order of two for loops.

How 2D arrays are laid out in memory

In C, Java, Rust, Python (NumPy default), and most other languages, a 2D array arr[rows][cols] is stored in row-major order: all elements of row 0 are contiguous, then all of row 1, and so on.

For a 10 000 × 10 000 matrix of 8-byte doubles, the layout is:

arr[0][0], arr[0][1], ..., arr[0][9999],   ← row 0: bytes 0–79999
arr[1][0], arr[1][1], ..., arr[1][9999],   ← row 1: bytes 80000–159999
...

What happens inside each loop order

Row-major iteration (for i: for j: arr[i][j]):

  • Accesses arr[i][0], arr[i][1], arr[i][2], …
  • Each step moves 8 bytes forward in memory.
  • A cache line is 64 bytes (8 doubles). The first read loads 8 elements at once; the next 7 reads are free L1 hits.
  • The prefetcher detects the sequential pattern and pre-loads the next line ahead.
  • Cost per element: ~1 ns.

Column-major iteration (for j: for i: arr[i][j]):

  • Accesses arr[0][j], arr[1][j], arr[2][j], …
  • Each step moves 10 000 × 8 = 80 000 bytes forward — well past any loaded cache line.
  • Every access is in a fresh cache line that was not prefetched.
  • Cost per element: ~70–100 ns (full RAM latency).
Loop orderStep size in memoryCache behaviourTime (10k×10k doubles)
for i: for j: arr[i][j]8 bytes (sequential)7 of 8 accesses hit L1~300 ms
for j: for i: arr[i][j]80 000 bytes (random)Almost every access misses~2 800 ms

The instruction count is identical. The arithmetic is identical. The 9x difference is entirely from cache-miss latency.

Fortran exception and language conventions

Fortran stores arrays in column-major order — all elements of column 0 first, then column 1. NumPy has a Fortran-order option (order='F'). If your code or library uses Fortran layout, the loop that is “obviously right” in C is the worst case in Fortran. Always verify your language’s memory layout before tuning inner loops.

Cache-oblivious blocking

Sometimes both loop orders are necessary — for example in matrix transposition or certain linear algebra routines. The fix is cache-oblivious blocking (also called tiling): divide the matrix into small tiles (~64 rows × 64 columns, so each tile fits in L2), and process each tile before moving to the next. Within a tile, accesses are in order; the whole tile fits in cache while being processed, and both row and column dimensions see mostly cache hits.

Why this works

The 9x example uses a 10 000 × 10 000 matrix because it guarantees the working set far exceeds L3 cache. On smaller matrices (100 × 100) the entire matrix fits in L1 and both loop orders are fast. The performance cliff only appears when the matrix outgrows the cache — which is the common production scenario for real data processing workloads.

Quiz

A 10 000 × 10 000 matrix of doubles is stored row-major. The inner loop iterates columns (arr[i][j], j from 0 to N). Why is this fast?

Quiz

A team switches a matrix multiply from row-major to column-major loop order and observes a 9x slowdown with no algorithm change. The root cause is:

Order the steps

Trace what happens when row-major code accesses arr[i][0] through arr[i][7] (8 consecutive doubles, 64 bytes total):

  1. 1 CPU requests arr[i][0] (address X)
  2. 2 Cache miss: load 64-byte line at X into L1 (costs ~100 ns once)
  3. 3 arr[i][0] through arr[i][7] are all in the loaded line
  4. 4 Accesses arr[i][1] through arr[i][7] each hit L1 (~1 cycle each)
  5. 5 Prefetcher fires and loads the next 64-byte line before it is needed
  6. 6 arr[i][8] through arr[i][15] arrive with zero wait
Recall before you leave
  1. 01
    Why is row-major matrix iteration 9x faster than column-major on a 10 000×10 000 matrix, when both perform the same number of element accesses?
  2. 02
    What is cache-oblivious blocking and when should you use it?
Recap

In a row-major language (C, Java, Rust, Python/NumPy), elements of the same row occupy consecutive addresses. A 64-byte cache line holds 8 doubles; sequential (row-major) iteration reads 8 elements per cache miss. Column-major iteration on the same array steps 80 000 bytes per element on a 10 000-column matrix, landing in a fresh cache line every time and paying full RAM latency on every load. The result is a 9x wall-clock difference with no algorithmic change. Always know your language’s memory layout and align loop order to match it; use cache-oblivious blocking when both access directions are unavoidable.

Connected lessons
appears again in159
Continue the climb ↑Cache lines, struct layout, and false sharing
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.