Performance
Branch prediction and branchless code
An inner loop sums elements above a threshold on an unsorted array. The same loop on a sorted array of the same data runs 2–3x faster — despite doing the same arithmetic. The sort cost is O(N log N). The branch cost is why it pays back.
How the CPU handles branches
Modern CPUs are deeply pipelined: many instructions are in flight simultaneously. When the pipeline reaches a branch (if, loop boundary, function call), it must decide which instruction to fetch next before the branch condition is computed. The branch predictor makes that guess.
If the guess is right: execution continues without interruption. If the guess is wrong: the pipeline must be flushed — all partially-executed instructions that followed the wrong path are discarded, and the CPU starts over from the correct path. On modern x86, this costs 10–30 cycles. On deep pipelines (Apple M1/M2/M3 with ~10 stages), a miss can waste 40+ instruction slots.
Branch predictor accuracy
Modern hardware predictors — Intel’s TAGE (TAgged GEometric history), AMD’s similar design — achieve 95–99% accuracy on real production code. Their techniques:
- Branch target buffer (BTB): caches recent jump targets.
- Pattern history table (PHT): tracks taken/not-taken history per branch.
- TAGE predictor: combines multiple history lengths, predicting long-period patterns.
For code with regular behaviour (loop bounds, type-stable hot paths), the predictor is essentially free. The cost appears for data-dependent branches — branches whose taken/not-taken outcome depends on the values being processed, not the code structure.
The sorted array trick
Classic example: summing elements above a threshold in an unsorted vs sorted array.
# Unsorted: threshold crossing happens at unpredictable positions
# The branch `if data[i] > THRESHOLD` is taken ~50% of the time, randomly
# Predictor accuracy: ~50% (random), misprediction every other iteration
# Sorted: all elements below threshold first, then all above
# The branch is predictably not-taken, then predictably taken — crossed once
# Predictor accuracy: ~99.5%The sort costs O(N log N). For large N, the branch misprediction savings easily outweigh the sort. For small N where the sorted data stays in L1, the win is immediate. On a benchmark with 100k elements: sorted version runs 2.5–3x faster despite the sort cost.
Branchless code
For branches that cannot be sorted away (e.g. input data is inherently random), the alternative is to eliminate the branch entirely:
// Branched version: if (x > 0) y = a; else y = b;
// At 50/50 miss rate: 15-30 cycles per iteration wasted
// Branchless version using arithmetic select:
y = (x > 0) * a + (x <= 0) * b;
// Or conditional move (CMOV) — compiler often emits this:
// cmovg rax, rbx (move rbx into rax if greater)The branchless version always computes both paths; no branch, no misprediction. The tipping point: if a branch is predictable (95%+), the branched version is faster (skips one path’s compute). If unpredictable (50/50), branchless wins by 2–5x by eliminating the pipeline flush.
Common uses: sort partition, packet classification loops, image-processing inner loops.
| Scenario | Branch miss rate | Best approach | Cost per iteration |
|---|---|---|---|
| Predictable branch (sorted data) | <1% | Normal if/else | ~0 extra cycles |
| Unpredictable branch (50/50 random) | ~50% | Branchless / CMOV | 15–30 cycles wasted |
| Somewhat predictable (70/30) | ~10% | Depends; measure both | ~3 cycles wasted |
- Modern predictor accuracy
- 95–99% (real code)
- Branch miss penalty (x86)
- 10–30 cycles
- Apple M3 pipeline stages
- ~10 stages
- Sorted vs unsorted loop
- 2–3x speedup
- Branchless vs 50/50 branch
- 2–5x speedup
Profile-guided optimisation (PGO) for branches
When you cannot change the data or the branch structure, Profile-Guided Optimisation (PGO) lets the compiler help. Build with instrumentation, run a representative workload, recompile with the profile. The compiler uses the branch-taken data to:
- Arrange “taken” paths sequentially (better instruction-cache hit rate).
- Insert
__builtin_expect-style hints baked into code layout. - Make better inlining decisions based on actual call frequency.
PGO wins: 10–30% speedup on real workloads. Used in Chrome, Firefox, Postgres, Linux kernel.
Why this works
Compilers sometimes auto-emit branchless code when they can prove correctness (no overflow, known types). Check what your compiler actually emitted with objdump -d or Compiler Explorer (godbolt.org) if branch miss rate is high despite simple-looking code. The perf stat counter branch-misses / branches gives the overall miss rate; a value above 5% on a hot function warrants investigation.
A loop has an if-branch that goes true 50% of the time, unpredictably. What is the typical performance impact?
Why does sorting an array before summing values above a threshold provide a 2–3x speedup, even though sorting adds O(N log N) work?
Order the steps of a branch pipeline flush when the predictor mispredicts:
- 1 CPU predicts branch taken and starts fetching instructions from predicted path
- 2 Predicted-path instructions flow through decode, execute stages
- 3 Branch condition is resolved — actual path is not-taken
- 4 All in-flight instructions from the predicted path are squashed (discarded)
- 5 Pipeline restarts from the correct path — 10–30 cycles wasted
- 6 Predictor updates its history with the correct outcome for next time
- 01When should you prefer branchless code over a normal if/else, and when is branchless the wrong choice?
- 02Explain how Profile-Guided Optimisation (PGO) improves branch prediction at the code layout level.
Every branch is a prediction: the CPU guesses the path before the condition is known, then pays a 10–30 cycle pipeline flush if wrong. Modern predictors reach 95–99% accuracy on regular code but collapse to ~50% for data-dependent branches on random input. The sorted-array trick — O(N log N) sort before the O(N) loop — converts unpredictable branches into predictable ones, yielding 2–3x speedup. When data cannot be sorted, branchless arithmetic (conditional moves, CMOV) eliminates the prediction entirely — always compute both outcomes, pay no penalty. Only apply branchless when the miss rate is actually high; for predictable branches, normal if/else is cheaper.
appears again in159
- The journey of a request: seven stops from socket to responsejunior
- Accept and parse: from kernel queue to a typed requestmiddle
- Routing and middleware: choosing what runs, and in what ordermiddle
- Handler and response: from business logic to bytes on the wiremiddle
- Streaming and backpressure: when the client reads slower than you writesenior
- Timeouts and tail latency: budgets, deadlines, and the fan-out trapsenior
- Middleware and DI: the two patterns that shape every backendjunior
- Writing middleware: signatures, next(), and the three framework modelsmiddle
- Inversion of control: how dependencies reach a classmiddle
- DI scopes and lifecycles: singleton, request, transientmiddle
- DI as a testing seam: fakes, mocks, and the boundary that matterssenior
- DI containers in production: resolution graphs, circular deps, and when not tosenior
- Blocking vs non-blocking I/O: two ways to waitjunior
- The event loop: one thread, ordered phasesmiddle
- What blocks the loop: CPU work and sync callsmiddle
- Offloading CPU work: worker threads and the libuv poolmiddle
- Backpressure and bounded concurrencysenior
- Throughput under load: tail latency and saturationsenior
- Why pool: the cost of creating a connectionjunior
- Pool sizing: why bigger is not fastermiddle
- Acquisition and timeouts: the wait queue is the real latency dialmiddle
- Retry strategies: backoff, jitter, and thundering herdmiddle
- Observability, production failures, and global-scale designsenior
- Tasks, microtasks, and scheduler.yield()middle
- Timer accuracy, throttling, and idle workmiddle
- Node.js event loop: phases, nextTick, and loop lagsenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- LCP: four phases, one dominant costmiddle
- INP: input delay, processing, presentationmiddle
- Lab vs field: why the two disagree and how to use eachmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- What an index is and how it speeds up queriesjunior
- The leading-column rule and composite index designmiddle
- Partial, expression, and covering indexesmiddle
- Index types: GIN, GiST, BRIN, Hash, Bloom, and HOT updatesmiddle
- Index-only scans, the Visibility Map, and INCLUDEsenior
- Production failure modes and the index audit playbooksenior
- Index design exercise: full-text search strategysenior
- EXPLAIN and execution plans: what the planner decides and whyjunior
- Scan types: Seq, Index, Bitmap, Index-Onlymiddle
- Join algorithms and the row-estimate cascademiddle
- pg_statistic, ANALYZE, and production observabilitymiddle
- Extended statistics: fixing correlated-column estimate failuressenior
- Plan cache, cost-constant tuning, and planner internalssenior
- Production failure modes and plan stabilitysenior
- Connection pools: amortising the cost of a Postgres backendjunior
- PgBouncer session, transaction, and statement modesmiddle
- Pool sizing: the (cores × 2) + spindles formula and the two-layer stackmiddle
- Pool exhaustion and idle-in-transaction: the 3 AM failure modemiddle
- Migrating to transaction mode: rollout playbook and PgBouncer 1.21 prepared statementsmiddle
- The Postgres process model and why raising max_connections degrades throughputsenior
- Pooler landscape 2026, serverless connection storms, and the full failure-mode taxonomysenior
- ADD COLUMN: instant in PG 11+ vs rewrite in older Postgresjunior
- The lock-queue failure mode: why instant DDL can freeze the databasemiddle
- Safe DDL patterns: NOT VALID, CONCURRENTLY, and unsafe-op fixesmiddle
- Migration failure taxonomy and production disciplinesenior
- Shard-key selection: hash, range, list, and directory strategiesmiddle
- Co-location and Citus: the invariant that makes sharding usablemiddle
- The hot-shard failure mode: detection, isolation, and durable policymiddle
- Online resharding, 2PC, and the operational cost of shardingsenior
- The seven acts: from CREATE TABLE to Citusjunior
- Acts 1–3 in depth: schema, indexes, and planner statisticsmiddle
- Acts 4–6 in depth: MVCC bloat, connection pooling, and safe migrationsmiddle
- Act 7 in depth: sharding, co-location, and the seven-tier tradeoff cascademiddle
- Observability, anti-patterns, and production triagesenior
- Bits on the wirejunior
- Latency mathmiddle
- Bufferbloat and congestionsenior
- The physical frontiersenior
- Sequence numbers and connection statemiddle
- Flow control and congestion controlmiddle
- BBR, production observability, and beyond TCPsenior
- CDN: putting content next doorjunior
- Anycast and GeoDNS: routing to the nearest edgemiddle
- Tiered cache and Cache-Controlmiddle
- Vary header and cache keysmiddle
- Stale-while-revalidate and cache stampedesenior
- Edge workers and edge-side compositionsenior
- CDN operations and observabilitysenior
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket vs SSE vs long-polling: choosing the right transportmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Balancing algorithms: round-robin to power-of-two-choicesmiddle
- L4 vs L7 load balancing and client-IP preservationmiddle
- Health checks, connection draining, and slow startmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- QUIC streams and head-of-line blockingjunior
- Integrated handshake and 1-RTTmiddle
- Connection IDs and network migrationmiddle
- Loss detection and congestion controlmiddle
- 0-RTT resumption and packet encryptionsenior
- Deployment tradeoffs and CPU costsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- The twelve layers: one URL, seven actorsjunior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Critical render path and Core Web Vitalsmiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Metrics and cardinality: the cost model of a time-series databasemiddle
- Logs and volume: the cost model of structured loggingmiddle
- Traces and sampling: the cost model of distributed tracingmiddle
- Join keys and exemplars: making the three signals composemiddle
- Observability 2.0: wide events and the cost shiftsenior
- Failure modes and engineering practice: cardinality budgets, PII, and samplingsenior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- Log levels and alert routingmiddle
- Sampling strategies and log costmiddle
- PII redaction and log injectionsenior
- Trace context propagation in logssenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- OTel signals, Semantic Conventions, and the OTLP wire formatmiddle
- Auto-instrumentation and manual spans: the 80/20 of OTelmiddle
- The OTel Collector: receivers, processors, exporters, and deployment patternsmiddle
- Sampling strategies: head, tail, and parent-basedmiddle
- Vendor neutrality, eBPF instrumentation, the Operator, and browser/serverless OTelsenior
- Operating the OTel Collector: reliability, version skew, failure modes, and governancesenior
- RED and USE: two checklists, one triage disciplinejunior
- Instrumenting RED in Prometheus: counters, histograms, and cardinality disciplinemiddle
- USE on Linux: CPU, memory, disk, network, and PSImiddle
- Golden signals, dashboard layout, and service mesh auto-REDmiddle
- Cardinality as a cost driver: labels, PII, exemplars, and samplingmiddle
- Native histograms, SLO tie-in, and production failure patternsmiddle
- Choosing SLIs and SLO targets: ratios, not feelingsmiddle
- Multi-window multi-burn-rate alerting: why AND beats ORmiddle
- Error budget policy, latency SLOs, and composite journeysmiddle
- Iceberg SLIs, composite SLO math, and SLA vs SLOsenior
- Flame graphs: reading the picture that shows where time goesjunior
- Sampling vs instrumentation profiling: why 99 Hz wins in productionmiddle
- Profile types: CPU, memory, off-CPU, mutex — which one to reach formiddle
- Continuous profiling: always-on flame graphs with eBPF and trace-id correlationmiddle
- How flame graphs are built from samples, and the production workflows that use themmiddle
- Linux perf, eBPF internals, PGO, and the limits of samplingsenior
- Profiling in production: security, war stories, OTel profiles, and the infrastructure designsenior
- The debugging funnel: SLO → RED → trace → profilejunior
- OTel architecture: one SDK, four signals, one wire formatmiddle
- Cost discipline: keeping observability under 5% of infra spendmiddle
- Scale, security, and the ROI of observable systemssenior