Performance PERF · 03 · 10

Cache vs big-O: beat the textbook

Hands-on project — take a benchmark where an O(N) layout beats an O(log N) one, profile the constant-factor causes with perf, and drive a 3–5x speedup through layout, branch, and locality fixes with before/after numbers.

PERF Senior ◷ 240 min

Level

FoundationsJuniorMiddleSenior

Reading that cache dominates big-O is not the same as making the “slow” algorithm win on the clock. Build a benchmark where an O(N) contiguous structure beats an O(log N) pointer-based one, then profile and fix a hot loop until perf stat proves the constant factors are gone.

Goal

Turn the unit’s mental model into a reproducible engineering loop: measure with perf, name the constant-factor cause (locality, branch, false sharing, bandwidth), apply the matching layout fix, and verify a real wall-clock speedup with before/after numbers under identical input.

Project

0 of 7

Objective

Take a memory-bound hot loop (your own, or the two starters below) and drive a 3–5x wall-clock speedup by fixing constant factors — layout, access order, and branch predictability — without changing the algorithm's big-O, proving each step with perf measurements.

Requirements

Acceptance criteria

A before/after table: wall-clock time, IPC, LLC miss rate, and branch-miss rate — all measured under identical input on the same machine, not estimated.
A perf profile (perf record + annotate, or equivalent) showing the hot line's miss/stall share shrinking after the fix — re-profiled, not assumed.
A 3–5x speedup on at least one hot loop, with the dominant metric moved into its healthy band (IPC >2.0, or LLC miss <10%, or branch miss <5%, matching the diagnosed cause).
A one-paragraph write-up naming the constant-factor cause, the lever you used, and the crossover N at which the asymptotically-better structure stops losing.

Senior stretch

Add a cache-oblivious or tiled variant (blocked matrix transpose / multiply) and show it wins at every cache level versus the naive loop, without hard-coding cache sizes.
Parallelise the aggregation and deliberately induce false sharing on a shared counter array; show the regression in perf c2c / IPC, then fix it with 64-byte padding and re-measure.
Add huge pages (madvise(MADV_HUGEPAGE) or hugetlbfs) for a >1 GB working set and measure the TLB-driven throughput change independently of cache locality.
Add a CI performance gate: run the benchmark on a canary, diff IPC and LLC miss rate against main, and fail the build if either regresses beyond a threshold — the Discord/Bun-style guard against silent layout regressions.

Recap

This is the loop you will run in every real performance incident: measure with perf first, name the constant-factor cause from the metrics, fix at the layout level (SoA, struct split, loop reorder, tiling, padding) before reaching for SIMD or a different big-O, and verify with before/after numbers under identical input. Proving the textbook lie once on a benchmark — watching the contiguous O(N) structure beat the pointer-based O(log N) one and then finding the crossover — makes the production instinct muscle memory.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.