Performance
Row-major vs column-major: access order and the 9x gap
Two matrix-multiply loops. Same algorithm. Same 10 000 × 10 000 matrix. Same number of additions and multiplications. One finishes in 300 ms. The other in 2 800 ms. The only difference: the order of two for loops.
How 2D arrays are laid out in memory
In C, Java, Rust, Python (NumPy default), and most other languages, a 2D array arr[rows][cols] is stored in row-major order: all elements of row 0 are contiguous, then all of row 1, and so on.
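The row-major rule reduces to one line of address arithmetic. The sketch below is illustrative only: the helper name `row_major_offset` is hypothetical, and it assumes 8-byte elements in a single contiguous block.

```python
ELEM_BYTES = 8  # assumed size of one double

def row_major_offset(i, j, cols, elem_bytes=ELEM_BYTES):
    """Byte offset of arr[i][j] from the start of a contiguous row-major array."""
    return (i * cols + j) * elem_bytes

cols = 10_000
print(row_major_offset(0, 0, cols))      # 0
print(row_major_offset(0, 9_999, cols))  # 79992 (last element of row 0 starts here)
print(row_major_offset(1, 0, cols))      # 80000 (row 1 begins right after row 0)
```

Moving one step along j changes the offset by 8 bytes; moving one step along i changes it by a full row.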
For a 10 000 × 10 000 matrix of 8-byte doubles, the layout is:
arr[0][0], arr[0][1], ..., arr[0][9999], ← row 0: bytes 0–79999
arr[1][0], arr[1][1], ..., arr[1][9999], ← row 1: bytes 80000–159999
...
What happens inside each loop order
Row-major iteration (for i: for j: arr[i][j]):
- Accesses arr[i][0], arr[i][1], arr[i][2], …
- Each step moves 8 bytes forward in memory.
- A cache line is 64 bytes (8 doubles). The first read loads 8 elements at once; the next 7 reads are free L1 hits.
- The prefetcher detects the sequential pattern and pre-loads the next line ahead.
- Cost per element: ~1 ns.
Column-major iteration (for j: for i: arr[i][j]):
- Accesses arr[0][j], arr[1][j], arr[2][j], …
- Each step moves 10 000 × 8 = 80 000 bytes forward, well past any loaded cache line.
- Every access is in a fresh cache line that was not prefetched.
- Cost per element: ~70–100 ns (full RAM latency).
| Loop order | Step size in memory | Cache behaviour | Time (10k×10k doubles) |
|---|---|---|---|
| for i: for j: arr[i][j] | 8 bytes (sequential) | 7 of 8 accesses hit L1 | ~300 ms |
| for j: for i: arr[i][j] | 80 000 bytes (strided) | Almost every access misses | ~2 800 ms |
The instruction count is identical. The arithmetic is identical. The 9x difference is entirely from cache-miss latency.
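The miss pattern can be reproduced with a toy model. The sketch below is an assumption-heavy simplification, not a model of a real CPU: it uses a small fully associative LRU cache of 64-byte lines, no prefetcher, and a 256 × 256 matrix so it runs quickly. The function name `count_misses` and its parameters are hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 64  # cache line size used in the text
ELEM_BYTES = 8   # one double

def count_misses(rows, cols, order, cache_lines):
    """Count line misses for one full traversal under a toy LRU cache."""
    cache = OrderedDict()  # keys are cache-line numbers, oldest first
    misses = 0
    if order == "row":
        indices = ((i, j) for i in range(rows) for j in range(cols))
    else:  # "col": j in the outer loop, i in the inner loop
        indices = ((i, j) for j in range(cols) for i in range(rows))
    for i, j in indices:
        line = ((i * cols + j) * ELEM_BYTES) // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used line
    return misses

print(count_misses(256, 256, "row", cache_lines=64))  # 8192  (one miss per 8 elements)
print(count_misses(256, 256, "col", cache_lines=64))  # 65536 (every access misses)
```

In this model the column-order walk touches 256 different lines before it returns to any of them, so a 64-line cache has already evicted each line by the time it is needed again. The model predicts miss counts only; wall-clock ratios on real hardware also depend on prefetching and on how many misses the CPU can overlap.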
Fortran exception and language conventions
Fortran stores arrays in column-major order — all elements of column 0 first, then column 1. NumPy has a Fortran-order option (order='F'). If your code or library uses Fortran layout, the loop that is “obviously right” in C is the worst case in Fortran. Always verify your language’s memory layout before tuning inner loops.
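One way to check which loop belongs on the inside is to compare byte strides per axis. The sketch below is a hypothetical helper, not a library API; the tuples it returns follow the same (axis 0, axis 1) convention that NumPy reports in an array's `strides` attribute for contiguous arrays.

```python
def strides(rows, cols, order, elem_bytes=8):
    """Byte step for a unit move along (axis 0, axis 1) of a contiguous 2D array."""
    if order == "C":   # row-major: a row is contiguous
        return (cols * elem_bytes, elem_bytes)
    if order == "F":   # column-major: a column is contiguous
        return (elem_bytes, rows * elem_bytes)
    raise ValueError("order must be 'C' or 'F'")

def innermost_axis(st):
    """Axis with the smallest stride; iterate over it in the inner loop."""
    return min(range(len(st)), key=lambda axis: st[axis])

c_strides = strides(10_000, 10_000, "C")
f_strides = strides(10_000, 10_000, "F")
print(c_strides, innermost_axis(c_strides))  # (80000, 8) 1 -> inner loop over j
print(f_strides, innermost_axis(f_strides))  # (8, 80000) 0 -> inner loop over i
```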
Cache-oblivious blocking
Sometimes both loop orders are necessary, for example in matrix transposition or certain linear algebra routines. The fix is blocking (also called tiling): divide the matrix into small tiles (~64 rows × 64 columns of doubles is 32 KB, so each tile fits in L2), and process each tile before moving to the next. Within a tile, accesses are in order; the whole tile fits in cache while being processed, and both row and column dimensions see mostly cache hits. Strictly, a hand-picked tile size makes this cache-aware blocking; a cache-oblivious algorithm reaches the same effect by recursively splitting the matrix until the pieces fit, with no tile size tuned to a specific cache.
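A minimal sketch of fixed-size tiling applied to a transpose, assuming a square matrix stored as a flat row-major list. The function name `transpose_tiled` and the default tile size are illustrative. Pure Python lists hold references rather than packed doubles, so this shows the loop structure only, not the speedup; the same loop nest in C or over a packed buffer is where the cache benefit appears.

```python
def transpose_tiled(src, n, tile=64):
    """Transpose an n x n matrix stored as a flat row-major list, tile by tile."""
    dst = [0.0] * (n * n)
    for ii in range(0, n, tile):            # tile origin, row direction
        for jj in range(0, n, tile):        # tile origin, column direction
            # Finish one tile before moving on: reads walk rows of src,
            # writes walk columns of dst, both confined to a tile-sized window.
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    dst[j * n + i] = src[i * n + j]
    return dst

n = 5
src = list(range(n * n))
print(transpose_tiled(src, n, tile=2)[:5])  # [0, 5, 10, 15, 20]
```

The `min(...)` bounds handle matrices whose size is not a multiple of the tile size.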
Why this works
The 9x example uses a 10 000 × 10 000 matrix because its 800 MB working set far exceeds any L3 cache. A small matrix (100 × 100 doubles, 80 KB) fits entirely in cache, typically L2, and both loop orders are fast. The performance cliff only appears when the matrix outgrows the cache, which is the common production scenario for real data processing workloads.
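The working-set comparison is simple arithmetic. In the sketch below the cache capacities are assumed round numbers for illustration, not measurements of any particular CPU, and the helper names are hypothetical.

```python
# Assumed capacities in bytes (hypothetical; check your own hardware).
ASSUMED_CACHES = [
    ("L1", 32 * 1024),
    ("L2", 1024 * 1024),
    ("L3", 32 * 1024 * 1024),
]

def working_set_bytes(rows, cols, elem_bytes=8):
    """Total bytes occupied by a rows x cols matrix of fixed-size elements."""
    return rows * cols * elem_bytes

def smallest_level_that_fits(size_bytes, caches=ASSUMED_CACHES):
    """Name of the first assumed cache level that holds the whole working set."""
    for name, capacity in caches:
        if size_bytes <= capacity:
            return name
    return "RAM"

small = working_set_bytes(100, 100)
large = working_set_bytes(10_000, 10_000)
print(small, smallest_level_that_fits(small))  # 80000 L2
print(large, smallest_level_that_fits(large))  # 800000000 RAM
```

Under these assumed sizes the 100 × 100 matrix stays resident in cache after the first pass, while the 10 000 × 10 000 matrix is about 24 times larger than the assumed L3.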
A 10 000 × 10 000 matrix of doubles is stored row-major. The inner loop iterates columns (arr[i][j], j from 0 to N). Why is this fast?
A team switches a matrix multiply from row-major to column-major loop order and observes a 9x slowdown with no algorithm change. The root cause is:
Trace what happens when row-major code accesses arr[i][0] through arr[i][7] (8 consecutive doubles, 64 bytes total):
1. CPU requests arr[i][0] (address X)
2. Cache miss: load 64-byte line at X into L1 (costs ~100 ns once)
3. arr[i][0] through arr[i][7] are all in the loaded line
4. Accesses arr[i][1] through arr[i][7] each hit L1 (a few cycles each)
5. Prefetcher fires and loads the next 64-byte line before it is needed
6. arr[i][8] through arr[i][15] arrive with almost no wait
- 01 Why is row-major matrix iteration 9x faster than column-major on a 10 000×10 000 matrix, when both perform the same number of element accesses?
- 02 What is cache-oblivious blocking and when should you use it?
In a row-major language (C, Java, Rust, Python/NumPy), elements of the same row occupy consecutive addresses. A 64-byte cache line holds 8 doubles; sequential (row-major) iteration reads 8 elements per cache miss. Column-major iteration on the same array steps 80 000 bytes per element on a 10 000-column matrix, landing in a fresh cache line every time and paying full RAM latency on every load. The result is a 9x wall-clock difference with no algorithmic change. Always know your language’s memory layout and align loop order to match it; use cache-oblivious blocking when both access directions are unavoidable.