awesome-everything RU
↑ Back to the climb

Performance

Memory hierarchy: why the same O(N) loop can be 17x slower

Crux Modern CPUs have five storage levels separated by 100x latency gaps — the level where your data lives decides wall-clock cost more than operation count does.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at junior altitude — the surface
◷ 14 min

Two engineers benchmark a list of 1 million numbers. Both loops are O(N). One takes 2 ms, the other 35 ms. Big-O predicts them equal. The CPU disagrees — by 17x.

The memory hierarchy

A modern CPU is not a uniform memory machine. Data does not sit in one place — it flows through five levels, each faster and smaller than the one below it.

LevelSize (typical)LatencyWho controls it
Register~16 × 8 bytes<1 cycleCompiler / CPU
L1 cache32–64 KB per core~1 ns (3–5 cycles)Hardware
L2 cache256 KB–2 MB per core~3 ns (10–15 cycles)Hardware
L3 cache4–64 MB shared~10 ns (30–50 cycles)Hardware
Main memory (DDR5)GBs~70–100 nsOS + allocator

The number that matters most: L1 is 100x faster than RAM. A single cache miss to main memory costs as much as ~30 floating-point multiplications would take in registers.

The library metaphor

Think of a library with five storage zones. 10 books are on your desk (L1), 100 on a nearby shelf (L2), 10 000 in the same room (L3), and 10 million in a warehouse two blocks away (RAM). A “fast” algorithm that needs to walk to the warehouse for every step loses to a “slow” algorithm that processes every book on the desk in order. Distance matters more than operation count.

Cache lines: the unit of transfer

Data does not move between RAM and cache one byte at a time. It moves in cache lines — 64 bytes on x86 and most ARM chips. Reading one byte at address X loads bytes X through X+63 into cache. Every subsequent read of any byte in that range is a free L1 hit.

This is why sequential access patterns are fast: after the first load, the next seven or eight reads cost almost nothing. The hardware fires proactively — before you ask for byte 64, the prefetcher already loaded that line.

Array vs linked list: the canonical case

Both data structures hold a million integers. Both require N iterations. Big-O says equal. The benchmark says 17x different.

  • An array is one contiguous allocation. Elements sit at fixed offsets: address, address+8, address+16, … The prefetcher reads cache lines ahead of your loop. Each element is in L1 by the time the loop reaches it. Cost per element: ~1 ns.
  • A linked list allocates each node separately on the heap. The .next pointer in each node points to a random location. The prefetcher cannot predict this. Every pointer dereference likely hits a cold cache line at L3 or RAM. Cost per node: ~100 ns.

1 000 000 × 100 ns = 100 ms. The array case: ~1 ms. The 17x benchmark gap is conservative — it assumes some hits from recently-touched nodes. At larger sizes the gap widens.

Why this works

Big-O is defined as asymptotic — it tells you how cost grows with N, not the absolute cost. The hidden constant in front of the N is what the memory hierarchy controls. A well-placed O(N) can have a constant of 1 ns; a pointer-chasing O(log N) can have a constant of 100 ns per step, making the O(N) win until N is in the millions.

Quiz

Why can an O(N) array iteration beat an O(log N) tree traversal in real benchmarks?

Quiz

What is a cache line?

Order the steps

Order the levels of memory the CPU accesses, from fastest to slowest:

  1. 1 Register (inside the CPU core)
  2. 2 L1 cache (per core, ~1 ns)
  3. 3 L2 cache (per core, ~3 ns)
  4. 4 L3 cache (shared, ~10 ns)
  5. 5 Main memory / RAM (~70–100 ns)
Complete the analogy

Fill in the blank: when memory accesses are sequential, the CPU's hardware _______ predicts and pre-loads the next bytes before you ask for them.

Recall before you leave
  1. 01
    In one paragraph: why is the iteration cost of a linked list of 1 million nodes 17x slower than an array of 1 million numbers, when both are O(N)?
  2. 02
    What is a cache line, and why does it make sequential access patterns much cheaper than random access patterns?
Recap

A modern CPU has five memory levels separated by up to 100x latency gaps; data in L1 costs 1 ns, data in RAM costs 70–100 ns. The hardware works in 64-byte cache lines: one read fills a line, making sequential accesses cheap and random accesses expensive. An array of 1 million integers iterates 17x faster than a linked list of the same size because array elements live in contiguous memory — the prefetcher fills cache lines ahead of the loop — while linked list nodes scatter across the heap and each pointer dereference lands in cold memory. Big-O counts operations; the memory hierarchy decides the price per operation. Data layout is the first lever in production performance work.

Connected lessons
appears again in159
Continue the climb ↑Row-major vs column-major: access order and the 9x gap
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.