Performance PERF · 03 · 01

Memory hierarchy: why the same O(N) loop can be 17x slower

Modern CPUs have five storage levels separated by 100x latency gaps — the level where your data lives decides wall-clock cost more than operation count does.

PERF Junior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Two engineers benchmark a list of 1 million numbers. Both loops are O(N). One takes 2 ms, the other 35 ms. Big-O predicts them equal. The CPU disagrees — by 17x.

The memory hierarchy

By the end of this lesson you will know exactly why that 17x gap appears and which data structures pay the price in production every day.

A modern CPU is not a uniform memory machine. Data does not sit in one place — it flows through five levels, each faster and smaller than the one below it.

Level	Size (typical)	Latency	Who controls it
Register	~16 × 8 bytes	<1 cycle	Compiler / CPU
L1 cache	32–64 KB per core	~1 ns (3–5 cycles)	Hardware
L2 cache	256 KB–2 MB per core	~3 ns (10–15 cycles)	Hardware
L3 cache	4–64 MB shared	~10 ns (30–50 cycles)	Hardware
Main memory (DDR5)	GBs	~70–100 ns	OS + allocator

The number that matters most: L1 is 100x faster than RAM. A single cache miss to main memory costs as much as ~30 floating-point multiplications would take in registers.

The library metaphor

When you see a benchmark claiming two algorithms with the same complexity run at wildly different speeds, ask where the data lives before questioning the code.

Think of a library with five storage zones. 10 books are on your desk (L1), 100 on a nearby shelf (L2), 10 000 in the same room (L3), and 10 million in a warehouse two blocks away (RAM). A “fast” algorithm that needs to walk to the warehouse for every step loses to a “slow” algorithm that processes every book on the desk in order. Distance matters more than operation count.

Cache lines: the unit of transfer

Data does not move between RAM and cache one byte at a time. It moves in cache lines — 64 bytes on x86 and most ARM chips. Reading one byte at address X loads bytes X through X+63 into cache. Every subsequent read of any byte in that range is a free L1 hit.

This is why sequential access patterns are fast: after the first load, the next seven or eight reads cost almost nothing. The hardware fires proactively — before you ask for byte 64, the prefetcher already loaded that line.

Array vs linked list: the canonical case

Both data structures hold a million integers. Both require N iterations. Big-O says equal. The benchmark says 17x different.

An array is one contiguous allocation. Elements sit at fixed offsets: address, address+8, address+16, … The prefetcher reads cache lines ahead of your loop. Each element is in L1 by the time the loop reaches it. Cost per element: ~1 ns.
A linked list allocates each node separately on the heap. The .next pointer in each node points to a random location. The prefetcher cannot predict this. Every pointer dereference likely hits a cold cache line at L3 or RAM. Cost per node: ~100 ns.

1 000 000 × 100 ns = 100 ms. The array case: ~1 ms. The 17x benchmark gap is conservative — it assumes some hits from recently-touched nodes. At larger sizes the gap widens.

Same O(N), same N iterations — but contiguous layout keeps the array in L1 while pointer-chasing pays a full RAM round-trip per node, producing the 17x gap.

▸Why this works

Big-O is defined as asymptotic — it tells you how cost grows with N, not the absolute cost. The hidden constant in front of the N is what the memory hierarchy controls. A well-placed O(N) can have a constant of 1 ns; a pointer-chasing O(log N) can have a constant of 100 ns per step, making the O(N) win until N is in the millions.

Quiz

Why can an O(N) array iteration beat an O(log N) tree traversal in real benchmarks?

Quiz

What is a cache line?

Order the steps

Order the levels of memory the CPU accesses, from fastest to slowest:

1 Register (inside the CPU core)
2 L1 cache (per core, ~1 ns)
3 L2 cache (per core, ~3 ns)
4 L3 cache (shared, ~10 ns)
5 Main memory / RAM (~70–100 ns)

Complete the analogy

Fill in the blank: when memory accesses are sequential, the CPU's hardware _______ predicts and pre-loads the next bytes before you ask for them.

L1 cache ~1 ns, 32–64 KB

L2 cache ~3 ns, 256 KB–2 MB

L3 cache ~10 ns, 4–64 MB

Main memory (DDR5) ~70–100 ns, GBs

L1 is ~100x faster than RAM — the level where data lives sets the price.

Recall before you leave

01
In one paragraph: why is the iteration cost of a linked list of 1 million nodes 17x slower than an array of 1 million numbers, when both are O(N)?
02
What is a cache line, and why does it make sequential access patterns much cheaper than random access patterns?

Recap

A modern CPU has five memory levels separated by up to 100x latency gaps; data in L1 costs 1 ns, data in RAM costs 70–100 ns. The hardware works in 64-byte cache lines: one read fills a line, making sequential accesses cheap and random accesses expensive. An array of 1 million integers iterates 17x faster than a linked list of the same size because array elements live in contiguous memory — the prefetcher fills cache lines ahead of the loop — while linked list nodes scatter across the heap and each pointer dereference lands in cold memory. Big-O counts operations; the memory hierarchy decides the price per operation. Now when you see a performance gap that Big-O cannot explain, check the data structure’s memory layout before reaching for a faster algorithm.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

unlocks

deepens into

appears again in162

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Virtual data gridRender and smooth-scroll 100k rows at 60fps with windowing/virtualization, sticky headers, and full keyboard navigation — no library, just math.