Performance
SIMD, SoA vs AoS, and memory bandwidth
A hot ML inference loop is 5x slower than the reference C++ implementation, despite an identical algorithm and the same number of multiplications. The profiler shows an IPC of 0.8 and a 25% L3 miss rate. Switching to SIMD intrinsics is suggested, but the fix that actually works is changing how the data is stored, not how it is computed.
SIMD: one instruction, multiple values
Modern CPUs carry wide vector registers: 256-bit (AVX2, standard on x86 since 2013), 512-bit (AVX-512, available on server Intel and recent Ryzen), and 128-bit NEON on ARM. One SIMD instruction can add, multiply, or compare 4, 8, or 16 32-bit floats at once, depending on the register width.
The critical requirement: the values must be consecutive in memory. A single load instruction fills the vector register from a contiguous block; the CPU then operates on all lanes of the register in parallel.
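A minimal sketch of the contiguous-load idea, assuming an x86 target. The helper name add_arrays is hypothetical. The 256-bit float add shown here needs only AVX (AVX2 adds integer operations), so the guard checks `__AVX__`; without it the function falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Element-wise add: out[i] = a[i] + b[i] for n contiguous floats.
// add_arrays is a hypothetical helper used only for this sketch.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__AVX__)
    // Each iteration loads 8 consecutive floats from each input with one
    // unaligned 256-bit load, adds all 8 lanes, and stores 8 results.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar tail, and the whole loop when AVX is not enabled at compile time.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The vector path only works because `a + i` points at 8 floats that are adjacent in memory, which is the property the layout discussion below depends on.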
Array of Structures vs Structure of Arrays
AoS (Array of Structures):
struct Point { float x; float y; float z; };
Point points[N]; // layout: x0 y0 z0 x1 y1 z1 x2 y2 z2 ...
To add all x values with SIMD, you need points[0].x, points[1].x, points[2].x, points[3].x, but they sit at byte offsets 0, 12, 24, 36. A 4-float SIMD load at address 0 gives x0 y0 z0 x1: x values interleaved with y and z fields. You must use a gather operation (or extra loads and shuffles) to pull out just the x values, which costs roughly 10x a contiguous load.
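The 12-byte stride can be checked at compile time. This is a small sketch assuming 4-byte floats and no struct padding, which holds on mainstream platforms; x_offset is a hypothetical helper, not part of any library.

```cpp
#include <cstddef>

struct Point { float x; float y; float z; };

// With 4-byte floats and no padding, each Point occupies 12 bytes.
static_assert(sizeof(Point) == 3 * sizeof(float),
              "Point is assumed to be tightly packed");

// Byte offset of points[i].x from the start of the array.
// x_offset is a hypothetical helper for illustration only.
constexpr std::size_t x_offset(std::size_t i) {
    return i * sizeof(Point) + offsetof(Point, x);
}
```

Successive x values are `sizeof(Point)` bytes apart, so a 32-byte vector load starting at any x covers at most three of them, alongside y and z values the loop does not want.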
SoA (Structure of Arrays):
float xs[N]; // layout: x0 x1 x2 x3 x4 x5 x6 x7 ...
float ys[N];
float zs[N];
To add all x values with SIMD: load 8 consecutive floats from xs[0], done. One instruction, 8 results. The loop body becomes 8x more productive per instruction.
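The same per-field operation written against both layouts, as a hedged sketch. PointsSoA, to_soa, shift_x_aos, and shift_x_soa are hypothetical names chosen for this example; std::vector stands in for the fixed-size arrays above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { float x; float y; float z; };

// SoA container: one contiguous array per field (hypothetical name).
struct PointsSoA {
    std::vector<float> xs, ys, zs;
};

// Splits each struct into the three per-field arrays.
PointsSoA to_soa(const std::vector<Point>& points) {
    PointsSoA out;
    out.xs.reserve(points.size());
    out.ys.reserve(points.size());
    out.zs.reserve(points.size());
    for (const Point& p : points) {
        out.xs.push_back(p.x);
        out.ys.push_back(p.y);
        out.zs.push_back(p.z);
    }
    return out;
}

// AoS: consecutive x values are sizeof(Point) bytes apart (strided access).
void shift_x_aos(std::vector<Point>& points, float dx) {
    for (Point& p : points) p.x += dx;
}

// SoA: consecutive x values are adjacent, so every byte loaded is used.
void shift_x_soa(PointsSoA& points, float dx) {
    assert(points.xs.size() == points.ys.size() &&
           points.ys.size() == points.zs.size());
    for (float& x : points.xs) x += dx;
}
```

Both functions compute the same result. The SoA loop is a plain unit-stride pass over one array, which is the shape compilers vectorise most readily, and it never pulls the unused y and z values into cache.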
| Layout | SIMD compatibility | Cache efficiency | Best use case |
|---|---|---|---|
| AoS | Requires gather/scatter (slow) | Good when all fields used | Single-element operations, OOP |
| SoA | Native contiguous load | Excellent when one field used | Batch processing, ML, game physics |
Game engines (Unity ECS, Bevy), ML inference engines, audio processing, and database column stores all use SoA. JavaScript code running on V8 (Chrome’s JavaScript engine) gets a similar contiguous layout by keeping hot-loop data in TypedArrays. In the previous lesson’s ML example, changing from AoS (x,y,z,w) structs to xs[], ys[], zs[], ws[] drops the L3 miss rate from 25% to 5% and raises IPC from 0.8 to 3.0.
Auto-vectorisation
Compilers automatically convert simple loops to SIMD when the data layout allows. Conditions for auto-vectorisation success:
- No pointer aliasing (two pointers don’t point to overlapping memory; use restrict in C, Rust handles this via the borrow checker).
- Contiguous access (not strided or gather/scatter).
- Predictable trip count.
- No data-dependent inner branches.
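A loop shaped to meet all four conditions might look like the sketch below. Standard C++ has no `restrict` keyword, so this uses `__restrict`, a compiler extension accepted by GCC, Clang, and MSVC; saxpy is simply a conventional name for this operation.

```cpp
#include <cassert>
#include <cstddef>

// y[i] = a * x[i] + y[i] over n contiguous floats.
// __restrict is a compiler extension, not standard C++; it promises the
// compiler that x and y never overlap.
void saxpy(float a, const float* __restrict x, float* __restrict y,
           std::size_t n) {
    assert(x != y || n == 0);  // cheap sanity check, not a full overlap test
    for (std::size_t i = 0; i < n; ++i) {  // trip count fixed before the loop
        y[i] = a * x[i] + y[i];            // unit stride, no branch in the body
    }
}
```

Whether a given compiler actually vectorises this depends on the optimisation level and target flags, so it is worth confirming with the compiler's vectorisation report rather than assuming.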
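Check what the compiler emitted: -fopt-info-vec (GCC) or -Rpass=loop-vectorize (Clang) reports the loops that were vectorised; -fopt-info-vec-missed (GCC) or -Rpass-missed=loop-vectorize together with -Rpass-analysis=loop-vectorize (Clang) reports the loops that were not, and why. For hot loops that fail auto-vec, manual SIMD intrinsics or portable SIMD libraries (Google's Highway, SIMD Everywhere) close the gap.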
Memory bandwidth: the other constraint
Cache hit rate is one axis of performance; memory bandwidth is another. A workload that streams through 100 GB of data will hit the RAM bandwidth ceiling (~50–100 GB/s on DDR5) regardless of cache behaviour. Bandwidth-bound code is fixed by:
- Reducing data volume: more compact types (float32 instead of float64 when precision allows), on-the-fly computation instead of materialised intermediate tables.
- Compression at rest, decompression on-the-fly.
- Non-temporal stores (covered in the senior lesson): bypass cache for write-once data.
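The bandwidth ceiling is simple arithmetic, sketched here with two hypothetical helpers. The channel figure assumes a 64-bit-wide data path (8 bytes per transfer) and is a theoretical peak; sustained bandwidth is lower in practice.

```cpp
// Peak theoretical bandwidth of one 64-bit memory channel, in GB/s:
// transfers per second times 8 bytes per transfer.
// Both helpers are hypothetical names for this sketch.
constexpr double channel_bandwidth_gbs(double megatransfers_per_s) {
    return megatransfers_per_s * 1e6 * 8.0 / 1e9;
}

// Lower bound on the time needed to stream data_gb gigabytes through a
// memory system that sustains bandwidth_gbs GB/s.
constexpr double min_stream_seconds(double data_gb, double bandwidth_gbs) {
    return data_gb / bandwidth_gbs;
}
```

For example, `channel_bandwidth_gbs(6000)` gives 48 GB/s for DDR5-6000, and `min_stream_seconds(100, 50)` gives 2 seconds: no amount of cache tuning gets a 100 GB streaming pass below that floor at 50 GB/s, which is why halving the data volume (float32 instead of float64) halves the floor.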
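perf stat -e cache-misses,mem_load_retired.l3_miss helps separate cache-miss-bound from bandwidth-bound code (the second event name is Intel-specific and varies by microarchitecture; check perf list for the name on your CPU). If L3 misses are high and the total data volume is huge, you are bandwidth-bound; if L3 misses are high but the data volume is small (it fits in L3 but the access pattern is random), you are cache-miss-bound.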
NUMA: multi-socket memory access
Servers with 2+ CPU sockets are Non-Uniform Memory Access (NUMA) machines. Each socket has local RAM (~70 ns) and can reach the other socket's remote RAM (~120–150 ns). A thread that allocates memory on socket 0 but runs on socket 1 pays the remote-access tax on every load that misses cache. This is a 1.7x to 2.1x latency penalty, and in perf stat it shows up only as ordinary L3 misses: the load misses the local L3 and is served over the interconnect, and generic miss counters do not say which socket's RAM answered.
Mitigations:
- Pin threads to sockets (taskset, hwloc).
- Allocate on the local NUMA node (numactl --membind, jemalloc NUMA-aware mode).
- Distribute work so each thread touches data from its local socket.
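Pinning can also be done from inside the process. The sketch below is Linux-specific and assumes glibc, where `<sched.h>` exposes these calls when `_GNU_SOURCE` is defined (g++ defines it by default). first_allowed_cpu and pin_current_thread are hypothetical helper names; mapping a CPU index to its socket is left to tools such as hwloc or lscpu.

```cpp
#include <cassert>
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros

// Returns the lowest-numbered CPU the calling thread may run on, or -1.
// Hypothetical helper for this sketch.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // A pid of 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
    return -1;
}

// Restricts the calling thread to a single CPU. Returns true on success.
bool pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t target;
    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    return sched_setaffinity(0, sizeof(target), &target) == 0;
}
```

Pinning before the thread allocates and first writes its working set matters: under Linux's default local-allocation policy, pages are placed on the node of the CPU that first touches them, so a pinned thread keeps its data on its own socket.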
| Quantity | Value |
|---|---|
| AVX2 float throughput | 8 floats per instruction |
| AVX-512 float throughput | 16 floats per instruction |
| NEON (ARM) throughput | 4 floats per instruction |
| Gather/scatter vs contiguous load | ~10x slower |
| DDR5-6000 bandwidth | ~50 GB/s per channel |
| NUMA local vs remote latency | 70 ns vs 120–150 ns |
| ML loop AoS→SoA IPC gain | 0.8 → 3.0 (example) |
Why this works
Inlining affects the instruction cache (I-cache), not data cache. Aggressive inlining of a function called from 100 sites adds 100 copies of its body to the binary. If those copies pollute I-cache they evict other hot functions. The fix: inline tiny functions (1–3 lines) freely; inline larger functions only on the hottest call sites. PGO makes these decisions based on real call frequencies. Monitoring L1-icache-load-misses in perf stat catches instruction-cache pressure after layout changes.
A hot ML inference loop is 5x slower than a reference C++ port despite identical algorithm. perf stat shows IPC 0.8 and 25% L3 miss rate. What is the primary diagnosis?
A backend service runs on a 2-socket NUMA server. A thread allocates a large working set on socket 0, then migrates to socket 1 under load. What perf signature does this produce?
- 01 Explain the AoS vs SoA trade-off for a hot loop that accesses only one field of a multi-field struct.
- 02 A perf stat run shows high L3 miss rate, but the working set is only 50 MB (well within L3 capacity). What alternative explanation should you investigate, and how?
SIMD instructions operate on 4–16 consecutive values per instruction. AoS layout interleaves fields, forcing roughly 10x-slower gather operations, while SoA stores each field in a flat array, enabling native SIMD loads. The ML benchmark’s 5x gap came entirely from the AoS layout and was closed by switching to SoA: no algorithm change, just layout. Beyond cache locality, two separate constraints limit throughput: memory bandwidth (how many GB/s you can stream from RAM) and NUMA topology (remote-socket memory costs about 1.7x more latency than local, or worse). perf stat exposes both via L3-miss rate and NUMA counters. Fix layout to SoA first, verify the loop becomes compute-bound, then apply SIMD intrinsics or rely on auto-vectorisation.