Performance PERF · 03 · 02

Row-major vs column-major: access order and the 9x gap

Changing loop order on the same matrix can produce a 9x wall-clock difference — because the step from row to column flips every access from a cache hit to a cache miss.

PERF Junior ◷ 12 min

Level

FoundationsJuniorMiddleSenior

Two matrix-multiply loops. Same algorithm. Same 10 000 × 10 000 matrix. Same number of additions and multiplications. One finishes in 300 ms. The other in 2 800 ms. The only difference: the order of two for loops.

How 2D arrays are laid out in memory

Before you tune a single inner loop, you need to know how the language stores your matrix in memory — because that choice determines whether each step is a 1 ns cache hit or a 100 ns round-trip to RAM.

In C, Java, Rust, Python (NumPy default), and most other languages, a 2D array arr[rows][cols] is stored in row-major order: all elements of row 0 are contiguous, then all of row 1, and so on.

For a 10 000 × 10 000 matrix of 8-byte doubles, the layout is:

arr[0][0], arr[0][1], ..., arr[0][9999],   ← row 0: bytes 0–79999
arr[1][0], arr[1][1], ..., arr[1][9999],   ← row 1: bytes 80000–159999
...

What happens inside each loop order

Row-major iteration (for i: for j: arr[i][j]):

Accesses arr[i][0], arr[i][1], arr[i][2], …
Each step moves 8 bytes forward in memory.
A cache line is 64 bytes (8 doubles). The first read loads 8 elements at once; the next 7 reads are free L1 hits.
The prefetcher detects the sequential pattern and pre-loads the next line ahead.
Cost per element: ~1 ns.

Column-major iteration (for j: for i: arr[i][j]):

Accesses arr[0][j], arr[1][j], arr[2][j], …
Each step moves 10 000 × 8 = 80 000 bytes forward — well past any loaded cache line.
Every access is in a fresh cache line that was not prefetched.
Cost per element: ~70–100 ns (full RAM latency).

Loop order	Step size in memory	Cache behaviour	Time (10k×10k doubles)
`for i: for j: arr[i][j]`	8 bytes (sequential)	7 of 8 accesses hit L1	~300 ms
`for j: for i: arr[i][j]`	80 000 bytes (random)	Almost every access misses	~2 800 ms

The instruction count is identical. The arithmetic is identical. The 9x difference is entirely from cache-miss latency.

Same O(N²), same arithmetic — the bar height is pure cache-miss latency. Column-major's 80 000-byte stride turns nearly every access into a RAM round-trip.

Fortran exception and language conventions

Fortran stores arrays in column-major order — all elements of column 0 first, then column 1. NumPy has a Fortran-order option (order='F'). If your code or library uses Fortran layout, the loop that is “obviously right” in C is the worst case in Fortran. Always verify your language’s memory layout before tuning inner loops.

Cache-oblivious blocking

Sometimes both loop orders are necessary — for example in matrix transposition or certain linear algebra routines. The fix is cache-oblivious blocking (also called tiling): divide the matrix into small tiles (~64 rows × 64 columns, so each tile fits in L2), and process each tile before moving to the next. Within a tile, accesses are in order; the whole tile fits in cache while being processed, and both row and column dimensions see mostly cache hits.

▸Why this works

The 9x example uses a 10 000 × 10 000 matrix because it guarantees the working set far exceeds L3 cache. On smaller matrices (100 × 100) the entire matrix fits in L1 and both loop orders are fast. The performance cliff only appears when the matrix outgrows the cache — which is the common production scenario for real data processing workloads.

Quiz

A 10 000 × 10 000 matrix of doubles is stored row-major. The inner loop iterates columns (arr[i][j], j from 0 to N). Why is this fast?

Quiz

A team switches a matrix multiply from row-major to column-major loop order and observes a 9x slowdown with no algorithm change. The root cause is:

Order the steps

Trace what happens when row-major code accesses arr[i][0] through arr[i][7] (8 consecutive doubles, 64 bytes total):

1 CPU requests arr[i][0] (address X)
2 Cache miss: load 64-byte line at X into L1 (costs ~100 ns once)
3 arr[i][0] through arr[i][7] are all in the loaded line
4 Accesses arr[i][1] through arr[i][7] each hit L1 (~1 cycle each)
5 Prefetcher fires and loads the next 64-byte line before it is needed
6 arr[i][8] through arr[i][15] arrive with zero wait

Row-major walks one cache line (8 bytes/step); column-major jumps a whole row (80 000 bytes/step), missing every time.

Recall before you leave

01
Why is row-major matrix iteration 9x faster than column-major on a 10 000×10 000 matrix, when both perform the same number of element accesses?
02
What is cache-oblivious blocking and when should you use it?

Recap

In a row-major language (C, Java, Rust, Python/NumPy), elements of the same row occupy consecutive addresses. A 64-byte cache line holds 8 doubles; sequential (row-major) iteration reads 8 elements per cache miss. Column-major iteration on the same array steps 80 000 bytes per element on a 10 000-column matrix, landing in a fresh cache line every time and paying full RAM latency on every load. The result is a 9x wall-clock difference with no algorithmic change. Always know your language’s memory layout and align loop order to match it; use cache-oblivious blocking when both access directions are unavoidable. Now when you see a matrix-processing function that is suspiciously slow, flip the loop order before reaching for any other optimization — it takes one line to fix and one benchmark to confirm.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 6 done

Connected lessons

builds on

Memory hierarchy: why the same O(N) loop can be 17x slowerjunior

deepens into

appears again in162

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.