awesome-everything RU
↑ Back to the climb

Performance

Cache vs big-O: beat the textbook

Crux Hands-on project — take a benchmark where an O(N) layout beats an O(log N) one, profile the constant-factor causes with perf, and drive a 3–5x speedup through layout, branch, and locality fixes with before/after numbers.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 240 min

Reading that cache dominates big-O is not the same as making the “slow” algorithm win on the clock. Build a benchmark where an O(N) contiguous structure beats an O(log N) pointer-based one, then profile and fix a hot loop until perf stat proves the constant factors are gone.

Goal

Turn the unit’s mental model into a reproducible engineering loop: measure with perf, name the constant-factor cause (locality, branch, false sharing, bandwidth), apply the matching layout fix, and verify a real wall-clock speedup with before/after numbers under identical input.

Project
0 of 7
Objective

Take a memory-bound hot loop (your own, or the two starters below) and drive a 3–5x wall-clock speedup by fixing constant factors — layout, access order, and branch predictability — without changing the algorithm's big-O, proving each step with perf measurements.

Requirements
Acceptance criteria
  • A before/after table: wall-clock time, IPC, LLC miss rate, and branch-miss rate — all measured under identical input on the same machine, not estimated.
  • A perf profile (perf record + annotate, or equivalent) showing the hot line's miss/stall share shrinking after the fix — re-profiled, not assumed.
  • A 3–5x speedup on at least one hot loop, with the dominant metric moved into its healthy band (IPC >2.0, or LLC miss <10%, or branch miss <5%, matching the diagnosed cause).
  • A one-paragraph write-up naming the constant-factor cause, the lever you used, and the crossover N at which the asymptotically-better structure stops losing.
Senior stretch
  • Add a cache-oblivious or tiled variant (blocked matrix transpose / multiply) and show it wins at every cache level versus the naive loop, without hard-coding cache sizes.
  • Parallelise the aggregation and deliberately induce false sharing on a shared counter array; show the regression in perf c2c / IPC, then fix it with 64-byte padding and re-measure.
  • Add huge pages (madvise(MADV_HUGEPAGE) or hugetlbfs) for a >1 GB working set and measure the TLB-driven throughput change independently of cache locality.
  • Add a CI performance gate: run the benchmark on a canary, diff IPC and LLC miss rate against main, and fail the build if either regresses beyond a threshold — the Discord/Bun-style guard against silent layout regressions.
Recap

This is the loop you will run in every real performance incident: measure with perf first, name the constant-factor cause from the metrics, fix at the layout level (SoA, struct split, loop reorder, tiling, padding) before reaching for SIMD or a different big-O, and verify with before/after numbers under identical input. Proving the textbook lie once on a benchmark — watching the contiguous O(N) structure beat the pointer-based O(log N) one and then finding the crossover — makes the production instinct muscle memory.

Continue the climb ↑GC basics: what the runtime taxes you for
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources2
expand
  1. 01
  2. 02

Trademarks belong to their respective owners. Editorial reference only.