Performance PERF · 03 · 04

Branch prediction and branchless code

Modern CPUs predict every branch before executing it — a miss flushes the pipeline at 10–30 cycles. Unpredictable branches in hot loops are the branch equivalent of a cache miss.

PERF Middle ◷ 13 min

Level

FoundationsJuniorMiddleSenior

An inner loop sums elements above a threshold on an unsorted array. The same loop on a sorted array of the same data runs 2–3x faster — despite doing the same arithmetic. The sort cost is O(N log N). The branch cost is why it pays back.

How the CPU handles branches

Modern CPUs are deeply pipelined: many instructions are in flight simultaneously. When the pipeline reaches a branch (if, loop boundary, function call), it must decide which instruction to fetch next before the branch condition is computed. The branch predictor makes that guess.

If the guess is right: execution continues without interruption. If the guess is wrong: the pipeline must be flushed — all partially-executed instructions that followed the wrong path are discarded, and the CPU starts over from the correct path. On modern x86, this costs 10–30 cycles. On deep pipelines (Apple M1/M2/M3 with ~10 stages), a miss can waste 40+ instruction slots.

Branch predictor accuracy

Modern hardware predictors — Intel’s TAGE (TAgged GEometric history), AMD’s similar design — achieve 95–99% accuracy on real production code. Their techniques:

Branch target buffer (BTB): caches recent jump targets.
Pattern history table (PHT): tracks taken/not-taken history per branch.
TAGE predictor: combines multiple history lengths, predicting long-period patterns.

For code with regular behaviour (loop bounds, type-stable hot paths), the predictor is essentially free. The cost appears for data-dependent branches — branches whose taken/not-taken outcome depends on the values being processed, not the code structure.

The sorted array trick

Classic example: summing elements above a threshold in an unsorted vs sorted array.

# Unsorted: threshold crossing happens at unpredictable positions
# The branch `if data[i] > THRESHOLD` is taken ~50% of the time, randomly
# Predictor accuracy: ~50% (random), misprediction every other iteration

# Sorted: all elements below threshold first, then all above
# The branch is predictably not-taken, then predictably taken — crossed once
# Predictor accuracy: ~99.5%

The sort costs O(N log N). For large N, the branch misprediction savings easily outweigh the sort. For small N where the sorted data stays in L1, the win is immediate. On a benchmark with 100k elements: sorted version runs 2.5–3x faster despite the sort cost.

Branchless code

For branches that cannot be sorted away (e.g. input data is inherently random), the alternative is to eliminate the branch entirely:

// Branched version: if (x > 0) y = a; else y = b;
// At 50/50 miss rate: 15-30 cycles per iteration wasted

// Branchless version using arithmetic select:
y = (x > 0) * a + (x <= 0) * b;

// Or conditional move (CMOV) — compiler often emits this:
// cmovg rax, rbx  (move rbx into rax if greater)

The branchless version always computes both paths; no branch, no misprediction. The tipping point: if a branch is predictable (95%+), the branched version is faster (skips one path’s compute). If unpredictable (50/50), branchless wins by 2–5x by eliminating the pipeline flush.

Common uses: sort partition, packet classification loops, image-processing inner loops.

Predictability is the deciding variable: keep the branch when it is predictable, go branchless only when the miss rate is high.

Scenario	Branch miss rate	Best approach	Cost per iteration
Predictable branch (sorted data)	<1%	Normal if/else	~0 extra cycles
Unpredictable branch (50/50 random)	~50%	Branchless / CMOV	15–30 cycles wasted
Somewhat predictable (70/30)	~10%	Depends; measure both	~3 cycles wasted

Branch prediction numbers

Modern predictor accuracy: 95–99% (real code)
Branch miss penalty (x86): 10–30 cycles
Apple M3 pipeline stages: ~10 stages
Sorted vs unsorted loop: 2–3x speedup
Branchless vs 50/50 branch: 2–5x speedup

Profile-guided optimisation (PGO) for branches

When you cannot change the data or the branch structure, Profile-Guided Optimisation (PGO) lets the compiler help. Build with instrumentation, run a representative workload, recompile with the profile. The compiler uses the branch-taken data to:

Arrange “taken” paths sequentially (better instruction-cache hit rate).
Insert __builtin_expect-style hints baked into code layout.
Make better inlining decisions based on actual call frequency.

PGO wins: 10–30% speedup on real workloads. Used in Chrome, Firefox, Postgres, Linux kernel.

▸Why this works

Compilers sometimes auto-emit branchless code when they can prove correctness (no overflow, known types). Check what your compiler actually emitted with objdump -d or Compiler Explorer (godbolt.org) if branch miss rate is high despite simple-looking code. The perf stat counter branch-misses / branches gives the overall miss rate; a value above 5% on a hot function warrants investigation.

Quiz

A loop has an if-branch that goes true 50% of the time, unpredictably. What is the typical performance impact?

Quiz

Why does sorting an array before summing values above a threshold provide a 2–3x speedup, even though sorting adds O(N log N) work?

Order the steps

Order the steps of a branch pipeline flush when the predictor mispredicts:

1 CPU predicts branch taken and starts fetching instructions from predicted path
2 Predicted-path instructions flow through decode, execute stages
3 Branch condition is resolved — actual path is not-taken
4 All in-flight instructions from the predicted path are squashed (discarded)
5 Pipeline restarts from the correct path — 10–30 cycles wasted
6 Predictor updates its history with the correct outcome for next time

A correct guess is nearly free; a misprediction discards in-flight work and restarts from the right path.

Recall before you leave

01
When should you prefer branchless code over a normal if/else, and when is branchless the wrong choice?
02
Explain how Profile-Guided Optimisation (PGO) improves branch prediction at the code layout level.

Recap

Every branch is a prediction: the CPU guesses the path before the condition is known, then pays a 10–30 cycle pipeline flush if wrong. Modern predictors reach 95–99% accuracy on regular code but collapse to ~50% for data-dependent branches on random input. The sorted-array trick — O(N log N) sort before the O(N) loop — converts unpredictable branches into predictable ones, yielding 2–3x speedup. When data cannot be sorted, branchless arithmetic (conditional moves, CMOV) eliminates the prediction entirely — always compute both outcomes, pay no penalty. Only apply branchless when the miss rate is actually high; for predictable branches, normal if/else is cheaper. Now when you see an inner loop with a data-dependent condition that feels slower than the instruction count suggests, check perf stat branch-misses before rewriting the algorithm.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 7 done

Connected lessons

builds on

Memory hierarchy: why the same O(N) loop can be 17x slowerjunior

deepens into

appears again in162

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.