Performance
Cache vs big-O: beat the textbook
Reading that cache dominates big-O is not the same as making the “slow” algorithm win on the clock. Build a benchmark where an O(N) contiguous structure beats an O(log N) pointer-based one, then profile and fix a hot loop until perf stat proves the constant factors are gone.
Turn the unit’s mental model into a reproducible engineering loop: measure with perf, name the constant-factor cause (locality, branch, false sharing, bandwidth), apply the matching layout fix, and verify a real wall-clock speedup with before/after numbers under identical input.
Take a memory-bound hot loop (your own, or the two starters below) and drive a 3–5x wall-clock speedup by fixing constant factors — layout, access order, and branch predictability — without changing the algorithm's big-O, proving each step with perf measurements.
- A before/after table: wall-clock time, IPC, LLC miss rate, and branch-miss rate — all measured under identical input on the same machine, not estimated.
- A perf profile (perf record + annotate, or equivalent) showing the hot line's miss/stall share shrinking after the fix — re-profiled, not assumed.
- A 3–5x speedup on at least one hot loop, with the dominant metric moved into its healthy band (IPC >2.0, or LLC miss <10%, or branch miss <5%, matching the diagnosed cause).
- A one-paragraph write-up naming the constant-factor cause, the lever you used, and the crossover N at which the asymptotically-better structure stops losing.
- Add a cache-oblivious or tiled variant (blocked matrix transpose / multiply) and show it wins at every cache level versus the naive loop, without hard-coding cache sizes.
- Parallelise the aggregation and deliberately induce false sharing on a shared counter array; show the regression in perf c2c / IPC, then fix it with 64-byte padding and re-measure.
- Add huge pages (madvise(MADV_HUGEPAGE) or hugetlbfs) for a >1 GB working set and measure the TLB-driven throughput change independently of cache locality.
- Add a CI performance gate: run the benchmark on a canary, diff IPC and LLC miss rate against main, and fail the build if either regresses beyond a threshold — the Discord/Bun-style guard against silent layout regressions.
This is the loop you will run in every real performance incident: measure with perf first, name the constant-factor cause from the metrics, fix at the layout level (SoA, struct split, loop reorder, tiling, padding) before reaching for SIMD or a different big-O, and verify with before/after numbers under identical input. Proving the textbook lie once on a benchmark — watching the contiguous O(N) structure beat the pointer-based O(log N) one and then finding the crossover — makes the production instinct muscle memory.