Performance
Cache-oblivious algorithms, PGO, and production failures
A Rust struct refactor moved one hot field to the end of a 96-byte struct. No algorithmic change. No added work. p99 latency on the message-render path grew 40%. A production outage disguised as a code cleanup.
Cache-oblivious algorithms
A cache-oblivious algorithm performs well at every cache level — L1, L2, L3 — without knowing the specific sizes. The canonical example is recursive blocked matrix multiply.
The naive three-nested-loop matrix multiply has poor cache behaviour because the inner loop exhausts a cache level before moving on. The recursive blocked version divides the matrix into four sub-matrices and recurses until a base case. At each recursion level, the active sub-matrix fits in some cache level naturally, even though the algorithm doesn’t know which one. The hardware adapts.
The same principle applies to:
- Cache-oblivious mergesort: recursive halving keeps active data in the smallest available cache level.
- Cache-oblivious B-trees: van Emde Boas layout arranges tree levels so that each half-tree fits in a cache level — no pointer chasing past the level boundary.
- FFT: recursive Cooley-Tukey naturally aligns butterfly groups with cache lines.
- Matrix transposition: naive transposition is cache-hostile; recursive blocked transposition achieves good spatial locality at all levels.
The 2–4x speedup over naive iteration often dominates whatever code complexity the recursion adds.
Profile-guided optimisation (PGO)
Compilers improve dramatically when given runtime frequency data. The process:
- Build with instrumentation (
-fprofile-generate/clang -fprofile-instr-generate). - Run a representative workload — this populates a
.profdatafile. - Rebuild with the profile (
-fprofile-use/-fprofile-instr-use).
What PGO changes:
- Inlining decisions: inline hot callees, don’t inline cold ones. Without PGO, compilers inline based on code size heuristics and often get it wrong.
- Branch layout: predicted-taken branches are placed sequentially in memory. Non-taken branches jump forward. This keeps the hot path contiguous in the instruction cache.
- Basic block reordering: hot basic blocks are grouped in nearby pages, improving I-cache hit rate for the common path.
- Function placement: functions called together are placed near each other in the binary — fewer I-TLB misses.
Real workload speedups from PGO:
- Chrome: 10–15% on typical pages.
- Firefox: 12–18% on JIT warmup.
- Postgres: 8–15% on OLTP benchmarks.
- Linux kernel (BOLT post-link optimisation, similar concept): 3–8%.
PGO cost: build pipeline complexity and the need for a representative profiling workload. The profiling workload must match production traffic — a toy benchmark gives bad profiles and can hurt performance.
Real production failures
These failures all share the same shape: a “small” code change moved bytes around without changing the algorithm. The CPU noticed.
Discord 2023 — struct field reorder. A Rust struct refactor moved a hot field (message_timestamp) from byte offset 0 to byte offset 72 in a 96-byte struct. The field crossed a cache-line boundary. Every render of a message list — the hottest path in the client — now loaded two cache lines instead of one to get that field. p99 latency on the message-render path grew 40%. Rollback confirmed the cause. Fix: reorder struct fields so the hot fields live in the first 64 bytes (one cache line). Rule of thumb: profile your struct access patterns and put the top-3 most-read fields first.
Mojang Minecraft Java Edition 2018 — HashMap → contiguous array. The internal entity tracker used a HashMap<EntityID, EntityData> to store all world entities. Entity simulation (movement, AI, collision) accessed these entries in a tight game-tick loop. The HashMap’s chained buckets scattered EntityData structs across the heap — each entity lookup was a pointer chase to a random allocation. Replacing the entity store with a contiguous array (Entity Component System architecture, ECS) gave 3–5x throughput on the game-tick loop. Every entity’s component data was now sequential; the prefetcher loaded the next entity’s data while the CPU processed the current one. Discord’s 2024 game-server work extended this pattern to SIMD-aware ECS structs: further 2x on certain simulation paths.
Bun 2024 — startup cold-path regression. A new dependency added a HashMap initialised at module load time. Due to allocation order during JIT warmup, the HashMap’s backing array landed in the middle of the hot-code region, evicting frequently-used JIT-compiled functions from L1 during startup. The symptom was a 30ms startup regression — invisible in micro-benchmarks, visible in automated startup regression tests. Fix: lazy-init the HashMap (create on first use, not at module load). Pattern: anything allocated on a cold path that grows over time can pollute cache for the hot path. Lazy-init is the standard mitigation.
The structural pattern. In all three cases:
- A “refactor” changed byte offsets or allocation timing — the algorithm was identical.
- The hot path silently crossed a cache-line or TLB boundary.
- The regression appeared as p99 growth or throughput drop, not as a crash or obvious error.
- Root cause required
perf c2corperf statto surface; code review would not have caught it.
Observability for cache and branch performance
The essential perf stat invocation:
perf stat -e \
cache-misses,cache-references,\
branch-misses,branches,\
cycles,instructions \
-- ./your_binary| Metric | Healthy | Problem threshold | Likely cause |
|---|---|---|---|
| IPC (instructions / cycles) | 2.0–4.0 | <1.0 | Memory stalls or branch mispredicts |
| L1 miss rate (cache-misses / refs) | <1% | >10% | Bad spatial/temporal locality |
| LLC miss rate | <5% | >20% | Working set exceeds L3, or pointer chasing |
| Branch miss rate (branch-misses / branches) | 1–2% | >5% | Unpredictable data-dependent branch in hot path |
Further tools:
perf record+perf annotate: sample hot lines with source annotation. Shows which load instruction accumulates the most cycles.perf c2c: reports cache-line contention between cores — the direct tool for false sharing.perf top: live per-symbol miss attribution.- Intel VTune / AMD uProf: per-line attribution, memory access patterns, NUMA topology view. Requires platform-specific tools.
- BCC
cachestat/llcstat: BPF-based, zero-instrumentation, production-safe sampling. Reports L1/LLC hit rates system-wide or per-process without rebuilding the binary. - ARM PMU + Apple Instruments: same metrics on ARM; Instruments has “CPU Counters” template with
MEMORY_CYCLESandLAST_LEVEL_CACHE_MISS.
The performance-engineering loop:
- Measure with
perf stat— get IPC, miss rates, branch miss rate. - Identify dominant cost: cache miss (layout problem), branch miss (prediction problem), low IPC from bandwidth (data volume problem), low IPC from instruction issue (SIMD opportunity).
- Choose matching fix: layout → SoA / struct split; branches → sort / branchless; bandwidth → non-temporal stores / compression; SIMD → AoS-to-SoA + autovectorise.
- Re-measure — confirm improvement, check for new bottleneck.
Why this works
Big-O remains the right model for scalability with N. It is silent on constant factors that vary 100x with layout. Production reality: constants dominate well into millions for many problem sizes. The performance-engineering loop starts with measurement, not with algorithm selection.
- TLB miss page walk
- 5–50 cycles
- Huge page (x86)
- 2 MB / 1 GB
- L1 cache associativity
- typ. 8-way
- MLP outstanding misses
- 8–16 typical
- PGO typical speedup
- 10–30%
- Non-temporal store bandwidth gain
- 20–30% streaming
- Apple M3 pipeline depth
- ~10 stages
- DDR5-6000 bandwidth
- ~50 GB/s per channel
A perf-critical service has IPC of 0.6 and L3 miss rate of 30%. Trace the optimisation path.
Diagnose the perf stat output — what is the bottleneck?
Performance counter stats for './hot_loop':
5,238,194,672 cycles
2,094,127,491 instructions # 0.40 insn per cycle
413,824,917 branches
105,209,832 branch-misses # 25.42% of all branches
684,283,194 cache-references
287,401,952 cache-misses # 41.99% of all cache refs
12,498,373 LLC-loads
5,927,491 LLC-load-misses # 47.43% of all LL-cache hits
2.851 seconds time elapsed IPC is 0.40, branch miss rate 25%, cache miss rate 42%, LLC miss 47%. What is the dominant bottleneck and how do you fix it?
Pick the data structure for a perf-critical lookup table with 1M entries, mostly reads, occasional inserts, no range queries.
Why does the hardware prefetcher fail on graph traversals even when the graph is small enough to fit in L2?
Which specification defines the cache-coherency protocols (MESI / MOESI) used by modern multi-core x86 CPUs?
Design the data layer for a real-time analytics service: 100M rows of telemetry data, p99 query latency under 10ms, 99% read, 1% write, mostly aggregation queries over column subsets.
- Working set: 100M rows × 200 bytes = 20 GB.
- Server has 256 GB RAM, so dataset fits in memory.
- Most queries touch 1–3 columns out of 20.
- Need to support compressed-on-disk + uncompressed-in-RAM.
- Throughput target: 100k queries/sec.
- Column-store (SoA) reduces memory bandwidth 5-10x for narrow queries.
- Each column sized to fit in L3 keeps hot data cache-resident.
- SIMD operations on contiguous columns achieve near-peak throughput.
- Append-only writes + compaction balances read perf with write throughput.
- Observability ties IPC + cache miss rate to query latency dashboards.
- 01Explain how cache coherency (MESI/MOESI) creates 'false sharing' as a multi-thread performance bug, and how to detect and fix it.
- 02A 'fast' O(log N) algorithm performs worse than a 'slow' O(N) one in production. List the diagnostic steps to confirm cache-locality is the cause.
Cache-oblivious algorithms (recursive blocked matrix multiply, van Emde Boas trees, recursive mergesort) achieve good cache behaviour at every level simultaneously without knowing cache sizes. PGO turns a representative runtime profile into inlining decisions, branch layout, and function placement — typically 10–30% speedup with no code changes. Real production failures share one shape: a struct reorder (Discord, +40% p99), a HashMap replacing contiguous arrays (Mojang, 3–5x throughput drop), or a cold-path HashMap polluting cache during warmup (Bun startup regression). The observability loop is: perf stat for IPC and miss rates, perf c2c for false sharing, perf record + annotate for per-line attribution, BPF tools (cachestat, llcstat) in production. IPC below 1.0 means memory-bound; branch miss rate above 5% means an unpredictable hot branch. Fix matches the bottleneck type: layout for cache misses, sort/branchless for branches, non-temporal stores for streaming bandwidth. Big-O is the right scalability model; cache constants are the right performance model.
appears again in167
- Why GraphQL gets N+1junior
- DataLoader mechanics: tick-boundary batchingmiddle
- Batch function contracts: ordering, shapes, errorsmiddle
- Federation and lookahead: batching beyond DataLoadermiddle
- Query complexity defences: depth, cost, persisted queriesmiddle
- Senior GraphQL API: scheduling contract, tenant isolation, observabilitysenior
- Why idempotency: making retries safejunior
- Server-side state machine: four states of an idempotency keymiddle
- Outbox and inbox: effectively-once across the dual-write boundarymiddle
- Concurrency and cache architecture for idempotency at scalesenior
- Observability, production failures, and global-scale designsenior
- The event loop: one thread, three queuesjunior
- Tasks, microtasks, and scheduler.yield()middle
- Microtask starvation, Long Tasks, and LoAFsenior
- Node.js event loop: phases, nextTick, and loop lagsenior
- React, Vue, and INP observability in productionsenior
- The render pipeline: six stages from bytes to pixelsjunior
- Stage costs and the renderer process modelmiddle
- Invalidation, dirty bits, and containmiddle
- Compositor layers: promotion, overlap, and GPU memorymiddle
- DevTools flame strip and the frame lifecyclemiddle
- Layout thrash: forced synchronous layoutsenior
- BeginMainFrame, compositor-driven animations, and GPU memorysenior
- Production observability: LoAF, INP, and the full attack surfacesenior
- What V8 is and why performance varies 100×junior
- V8''''s four-tier JIT pipeline and profile-guided tieringmiddle
- Hidden classes, transition trees, and memory layoutmiddle
- Inline caches, IC states, and deoptimizationmiddle
- Orinoco GC: parallel scavenger, concurrent marking, and write barriersmiddle
- TurboFan''''s speculative engine and the deopt-loop trapsenior
- V8 in production: isolates, pointer compression, and real failuressenior
- Service worker lifecycle and cache strategiesmiddle
- Service worker edge cases: version skew, durability, and navigation trapssenior
- What the reconciler does: render vs commitjunior
- The fiber object and the double-buffer treemiddle
- Render phase purity and commit phase sub-stepsmiddle
- Reconciliation: diffing heuristics and the key trapmiddle
- Priority lanes, time-slicing, and useTransitionmiddle
- Bailout, memoisation, and tearingsenior
- React Profiler, the Compiler, and production observabilitysenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Hydration mismatch: causes, detection, and the determinism rulesenior
- RSC, per-route strategy, and production observabilitysenior
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- CLS: why layout shifts happen and how to stop themmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- What is a cache stampede and why it makes things worsejunior
- Lock and single-flight: bounding concurrent rebuildsmiddle
- XFetch: coordination-free probabilistic early expirationmiddle
- Stale-while-revalidate and CDN request coalescingmiddle
- Detecting stampedes and designing TTL for productionmiddle
- Metastable failure, fencing tokens, and production postmortemssenior
- What a relation is: tables, rows, keys, and constraintsjunior
- Constraints, keys, and Postgres data typesmiddle
- Normal forms, denormalization, and why schemas stickmiddle
- JSONB, arrays, and when a side table winsmiddle
- Heap storage, TOAST, and column alignmentsenior
- Schema integrity: deferral, versioning, and production failure modessenior
- Relational vs document, wide-column, graph, and key-valuesenior
- Index-only scans, the Visibility Map, and INCLUDEsenior
- Production failure modes and the index audit playbooksenior
- pg_statistic, ANALYZE, and production observabilitymiddle
- Production failure modes and plan stabilitysenior
- MVCC: why readers and writers never wait for each otherjunior
- Row versions and snapshots: the on-disk mechanicsmiddle
- HOT updates and isolation levels: what you gain and what you paymiddle
- Vacuum and bloat: keeping the storage tax boundedmiddle
- CLOG, XID wraparound, and MultiXact: deep visibility internalssenior
- SSI internals and production autovacuum tuningsenior
- Real-world MVCC failures, deployment patterns, and distributed snapshotssenior
- Connection pools: amortising the cost of a Postgres backendjunior
- PgBouncer session, transaction, and statement modesmiddle
- Pool sizing: the (cores × 2) + spindles formula and the two-layer stackmiddle
- Pool exhaustion and idle-in-transaction: the 3 AM failure modemiddle
- Migrating to transaction mode: rollout playbook and PgBouncer 1.21 prepared statementsmiddle
- The Postgres process model and why raising max_connections degrades throughputsenior
- Pooler landscape 2026, serverless connection storms, and the full failure-mode taxonomysenior
- What a schema migration is and why it replaces ad-hoc DDLjunior
- ADD COLUMN: instant in PG 11+ vs rewrite in older Postgresjunior
- The lock-queue failure mode: why instant DDL can freeze the databasemiddle
- Safe DDL patterns: NOT VALID, CONCURRENTLY, and unsafe-op fixesmiddle
- Expand-contract: zero-downtime for breaking schema changesmiddle
- Advisory locks, migration tools, and deploy coordinationsenior
- Migration failure taxonomy and production disciplinesenior
- Why sharding exists: the single-Postgres ceilingjunior
- Shard-key selection: hash, range, list, and directory strategiesmiddle
- Partitioning vs sharding: same word, two different thingsmiddle
- Co-location and Citus: the invariant that makes sharding usablemiddle
- The hot-shard failure mode: detection, isolation, and durable policymiddle
- Schema-based sharding and multi-tenancy alternativessenior
- Online resharding, 2PC, and the operational cost of shardingsenior
- The seven acts: from CREATE TABLE to Citusjunior
- Acts 1–3 in depth: schema, indexes, and planner statisticsmiddle
- Acts 4–6 in depth: MVCC bloat, connection pooling, and safe migrationsmiddle
- Act 7 in depth: sharding, co-location, and the seven-tier tradeoff cascademiddle
- Observability, anti-patterns, and production triagesenior
- Raft roles, terms, and why majority quorums prevent split brainjunior
- How Raft replicates a log entry and decides it is safe to commitmiddle
- Raft leader election: timeouts, voting rules, and the four safety propertiesmiddle
- Raft in the real world: partitions, slow disks, and client routingmiddle
- Raft extensions: pre-vote, learners, snapshots, and linearizable readssenior
- Raft in production: membership changes, Multi-Raft, and observabilitysenior
- Where data fetching happens — and why it decides LCPjunior
- Fetch waterfalls — diagnosis and the Promise.all curemiddle
- React Server Components and Suspense streamingmiddle
- Client-side cache: TanStack Query, SWR, and stale-while-revalidatemiddle
- LCP, prefetch, and race conditions in interactive fetchingmiddle
- Senior internals: RSC payload, caching layers, and production failure modessenior
- The three-way handshakejunior
- Sequence numbers and connection statemiddle
- DNS: what it does and why it existsjunior
- The resolver walk: referrals, record types, and gluemiddle
- TTL, caching, and DNS propagationmiddle
- The 1-RTT handshake: key shares and ECDHEmiddle
- Session resumption and 0-RTTmiddle
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket frame format: opcodes, masking, fragmentationmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Health checks, connection draining, and slow startmiddle
- Session affinity, consistent hashing, and the right fixmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- Connection IDs and network migrationmiddle
- 0-RTT resumption and packet encryptionsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- PII redaction and log injectionsenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- SLI, SLO, and the error budget: reliability by the numbersjunior
- Error budget policy, latency SLOs, and composite journeysmiddle
- Production SLO failures, self-observability, security, and the big picturesenior
- The incident loop: from pager to postmortem to preventionmiddle
- At-most-once, at-least-once, exactly-once: the three delivery contractsjunior
- The three failure legs — where duplicates and losses actually happenmiddle
- Consumer-side dedup: the cheapest path to exactly-once processingmiddle
- Kafka exactly-once semantics: idempotent producer and transactionsmiddle
- SQS visibility timeout, DLQ, and the outbox patternmiddle
- Exactly-once in production: impossibility proof, hybrid patterns, and real incidentssenior
- What OAuth is and why passwords are not the answerjunior
- Authorization code flow with PKCEmiddle
- ID token validation and JWKS cache managementmiddle
- Refresh token rotation and scope-based least privilegemiddle
- Sender-constrained tokens: DPoP and mTLSsenior
- OAuth in production: audience attacks, observability, and real failuressenior