Performance PERF · 03 · 07

Cache-oblivious algorithms, PGO, and production failures

Cache-oblivious recursion, profile-guided optimisation, real regressions from Discord/Mojang/Bun, and the full perf-stat observability toolkit that ties it all together.

PERF Senior ◷ 18 min

Level

FoundationsJuniorMiddleSenior

A Rust struct refactor moved one hot field to the end of a 96-byte struct. No algorithmic change. No added work. p99 latency on the message-render path grew 40%. A production outage disguised as a code cleanup.

Cache-oblivious algorithms

The algorithms in this section are used by database engines, compilers, and game engines precisely because they keep data in the right cache level automatically — understanding why they work gives you a mental model for designing your own cache-aware data structures.

A cache-oblivious algorithm performs well at every cache level — L1, L2, L3 — without knowing the specific sizes. The canonical example is recursive blocked matrix multiply.

The naive three-nested-loop matrix multiply has poor cache behaviour because the inner loop exhausts a cache level before moving on. The recursive blocked version divides the matrix into four sub-matrices and recurses until a base case. At each recursion level, the active sub-matrix fits in some cache level naturally, even though the algorithm doesn’t know which one. The hardware adapts.

The same principle applies to:

Cache-oblivious mergesort: recursive halving keeps active data in the smallest available cache level.
Cache-oblivious B-trees: van Emde Boas layout arranges tree levels so that each half-tree fits in a cache level — no pointer chasing past the level boundary.
FFT: recursive Cooley-Tukey naturally aligns butterfly groups with cache lines.
Matrix transposition: naive transposition is cache-hostile; recursive blocked transposition achieves good spatial locality at all levels.

The 2–4x speedup over naive iteration often dominates whatever code complexity the recursion adds.

Profile-guided optimisation (PGO)

Compilers improve dramatically when given runtime frequency data. The process:

Build with instrumentation (-fprofile-generate / clang -fprofile-instr-generate).
Run a representative workload — this populates a .profdata file.
Rebuild with the profile (-fprofile-use / -fprofile-instr-use).

What PGO changes:

Inlining decisions: inline hot callees, don’t inline cold ones. Without PGO, compilers inline based on code size heuristics and often get it wrong.
Branch layout: predicted-taken branches are placed sequentially in memory. Non-taken branches jump forward. This keeps the hot path contiguous in the instruction cache.
Basic block reordering: hot basic blocks are grouped in nearby pages, improving I-cache hit rate for the common path.
Function placement: functions called together are placed near each other in the binary — fewer I-TLB misses.

Real workload speedups from PGO:

Chrome: 10–15% on typical pages.
Firefox: 12–18% on JIT warmup.
Postgres: 8–15% on OLTP benchmarks.
Linux kernel (BOLT post-link optimisation, similar concept): 3–8%.

PGO cost: build pipeline complexity and the need for a representative profiling workload. The profiling workload must match production traffic — a toy benchmark gives bad profiles and can hurt performance.

Real production failures

These failures all share the same shape: a “small” code change moved bytes around without changing the algorithm. The CPU noticed.

Discord 2023 — struct field reorder. A Rust struct refactor moved a hot field (message_timestamp) from byte offset 0 to byte offset 72 in a 96-byte struct. The field crossed a cache-line boundary. Every render of a message list — the hottest path in the client — now loaded two cache lines instead of one to get that field. p99 latency on the message-render path grew 40%. Rollback confirmed the cause. Fix: reorder struct fields so the hot fields live in the first 64 bytes (one cache line). Rule of thumb: profile your struct access patterns and put the top-3 most-read fields first.

Mojang Minecraft Java Edition 2018 — HashMap → contiguous array. The internal entity tracker used a HashMap<EntityID, EntityData> to store all world entities. Entity simulation (movement, AI, collision) accessed these entries in a tight game-tick loop. The HashMap’s chained buckets scattered EntityData structs across the heap — each entity lookup was a pointer chase to a random allocation. Replacing the entity store with a contiguous array (Entity Component System architecture, ECS) gave 3–5x throughput on the game-tick loop. Every entity’s component data was now sequential; the prefetcher loaded the next entity’s data while the CPU processed the current one. Discord’s 2024 game-server work extended this pattern to SIMD-aware ECS structs: further 2x on certain simulation paths.

Bun 2024 — startup cold-path regression. A new dependency added a HashMap initialised at module load time. Due to allocation order during JIT warmup, the HashMap’s backing array landed in the middle of the hot-code region, evicting frequently-used JIT-compiled functions from L1 during startup. The symptom was a 30ms startup regression — invisible in micro-benchmarks, visible in automated startup regression tests. Fix: lazy-init the HashMap (create on first use, not at module load). Pattern: anything allocated on a cold path that grows over time can pollute cache for the hot path. Lazy-init is the standard mitigation.

The structural pattern. In all three cases:

A “refactor” changed byte offsets or allocation timing — the algorithm was identical.
The hot path silently crossed a cache-line or TLB boundary.
The regression appeared as p99 growth or throughput drop, not as a crash or obvious error.
Root cause required perf c2c or perf stat to surface; code review would not have caught it.

Observability for cache and branch performance

The essential perf stat invocation:

perf stat -e \
  cache-misses,cache-references,\
  branch-misses,branches,\
  cycles,instructions \
  -- ./your_binary

Metric	Healthy	Problem threshold	Likely cause
IPC (instructions / cycles)	2.0–4.0	<1.0	Memory stalls or branch mispredicts
L1 miss rate (cache-misses / refs)	<1%	>10%	Bad spatial/temporal locality
LLC miss rate	<5%	>20%	Working set exceeds L3, or pointer chasing
Branch miss rate (branch-misses / branches)	1–2%	>5%	Unpredictable data-dependent branch in hot path

Further tools:

perf record + perf annotate: sample hot lines with source annotation. Shows which load instruction accumulates the most cycles.
perf c2c: reports cache-line contention between cores — the direct tool for false sharing.
perf top: live per-symbol miss attribution.
Intel VTune / AMD uProf: per-line attribution, memory access patterns, NUMA topology view. Requires platform-specific tools.
BCC cachestat / llcstat: BPF-based, zero-instrumentation, production-safe sampling. Reports L1/LLC hit rates system-wide or per-process without rebuilding the binary.
ARM PMU + Apple Instruments: same metrics on ARM; Instruments has “CPU Counters” template with MEMORY_CYCLES and LAST_LEVEL_CACHE_MISS.

The performance-engineering loop:

Measure with perf stat — get IPC, miss rates, branch miss rate.
Identify dominant cost: cache miss (layout problem), branch miss (prediction problem), low IPC from bandwidth (data volume problem), low IPC from instruction issue (SIMD opportunity).
Choose matching fix: layout → SoA / struct split; branches → sort / branchless; bandwidth → non-temporal stores / compression; SIMD → AoS-to-SoA + autovectorise.
Re-measure — confirm improvement, check for new bottleneck.

The senior move is not to guess: measure first, then apply the fix that matches the bottleneck perf stat actually shows.

▸Why this works

Big-O remains the right model for scalability with N. It is silent on constant factors that vary 100x with layout. Production reality: constants dominate well into millions for many problem sizes. The performance-engineering loop starts with measurement, not with algorithm selection.

Senior-tier cache and CPU numbers

TLB miss page walk: 5–50 cycles
Huge page (x86): 2 MB / 1 GB
L1 cache associativity: typ. 8-way
MLP outstanding misses: 8–16 typical
PGO typical speedup: 10–30%
Non-temporal store bandwidth gain: 20–30% streaming
Apple M3 pipeline depth: ~10 stages
DDR5-6000 bandwidth: ~50 GB/s per channel

Trace it

1/5

A perf-critical service has IPC of 0.6 and L3 miss rate of 30%. Trace the optimisation path.

Step 1 of 5

Step 1: IPC 0.6, L3 miss 30%. What does this mean?

Locked

Step 2: where to look first?

Locked

Step 3: hot function operates on a vector<MyStruct> where MyStruct has 200 bytes (mix of hot and cold fields). What's the diagnosis?

Locked

Step 4: after splitting, IPC up to 1.5. Still memory-bound. Next step?

Locked

Step 5: long-term governance?

Debug this

Diagnose the perf stat output — what is the bottleneck?

log

Performance counter stats for './hot_loop':

     5,238,194,672      cycles
     2,094,127,491      instructions             #  0.40 insn per cycle
       413,824,917      branches
       105,209,832      branch-misses            # 25.42% of all branches
       684,283,194      cache-references
       287,401,952      cache-misses             # 41.99% of all cache refs
        12,498,373      LLC-loads
         5,927,491      LLC-load-misses          # 47.43% of all LL-cache hits

         2.851 seconds time elapsed

IPC is 0.40, branch miss rate 25%, cache miss rate 42%, LLC miss 47%. What is the dominant bottleneck and how do you fix it?

Pick the best fit

Pick the data structure for a perf-critical lookup table with 1M entries, mostly reads, occasional inserts, no range queries.

Quiz

Why does the hardware prefetcher fail on graph traversals even when the graph is small enough to fit in L2?

Which RFC?

Which specification defines the cache-coherency protocols (MESI / MOESI) used by modern multi-core x86 CPUs?

Design challenge

Design the data layer for a real-time analytics service: 100M rows of telemetry data, p99 query latency under 10ms, 99% read, 1% write, mostly aggregation queries over column subsets.

Working set: 100M rows × 200 bytes = 20 GB.
Server has 256 GB RAM, so dataset fits in memory.
Most queries touch 1–3 columns out of 20.
Need to support compressed-on-disk + uncompressed-in-RAM.
Throughput target: 100k queries/sec.

PGO feeds real branch frequencies back into the compiler: hot blocks placed sequentially, hot callees inlined, cold paths split out — 10–30% speedup, no code change.

Recall before you leave

01
Explain how cache coherency (MESI/MOESI) creates 'false sharing' as a multi-thread performance bug, and how to detect and fix it.
02
A 'fast' O(log N) algorithm performs worse than a 'slow' O(N) one in production. List the diagnostic steps to confirm cache-locality is the cause.

Recap

Cache-oblivious algorithms (recursive blocked matrix multiply, van Emde Boas trees, recursive mergesort) achieve good cache behaviour at every level simultaneously without knowing cache sizes. PGO turns a representative runtime profile into inlining decisions, branch layout, and function placement — typically 10–30% speedup with no code changes. Real production failures share one shape: a struct reorder (Discord, +40% p99), a HashMap replacing contiguous arrays (Mojang, 3–5x throughput drop), or a cold-path HashMap polluting cache during warmup (Bun startup regression). The observability loop is: perf stat for IPC and miss rates, perf c2c for false sharing, perf record + annotate for per-line attribution, BPF tools (cachestat, llcstat) in production. IPC below 1.0 means memory-bound; branch miss rate above 5% means an unpredictable hot branch. Fix matches the bottleneck type: layout for cache misses, sort/branchless for branches, non-temporal stores for streaming bandwidth. Big-O is the right scalability model; cache constants are the right performance model. Now when you review a PR that “just refactors a struct” — check whether any hot-path field crossed a 64-byte boundary before approving it.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

Hardware prefetcher, TLB, and memory-level parallelismsenior

appears again in193

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.