Databases
Why sharding exists: the single-Postgres ceiling
Otto’s Postgres is running at 95% CPU, the working set has grown past what fits in RAM, and writes are saturating the single primary. Read replicas help for reads, but the write bottleneck is the machine itself. There is no bigger instance left to buy.
The single-Postgres ceiling
A single Postgres instance has hard limits:
- Working set vs RAM: once the hot portion of your data no longer fits in
shared_buffersplus OS page cache, every query starts hitting disk. Cache-hit ratio (blks_hit / (blks_hit + blks_read)) dropping below 95–99% is the leading signal. - Write throughput: a single primary can sustain roughly 10–50k transactions per second depending on hardware. Above that, WAL generation dominates CPU and the primary becomes the bottleneck.
- Storage: practical ceiling around 1–10 TB of working set in RAM on the largest available machines.
Vertical scaling (more RAM, faster NVMe) buys time — often years. The right order is: fix slow queries first, add indexes, add read replicas, vertically scale. Sharding is the last resort, not the first.
What sharding does
Sharding takes one logical dataset (all your orders, all your users) and splits it across N physical Postgres instances. Each row lives on exactly one shard, determined by a shard key — a column or combination of columns (typically tenant_id for B2B SaaS, user_id for B2C).
| Single Postgres | Sharded (4 instances) |
|---|---|
| All rows on one machine | Rows split by shard key across 4 machines |
| One CPU/IO ceiling | ~4× the CPU/IO capacity |
| Simple operations (1 backup, 1 migration) | 4× operations (4 backups, migration runs 4×) |
| Cross-table joins: free | Cross-shard joins: expensive |
The win is linear-ish scale-out: 10 shards give roughly 5–8× the throughput (overhead from coordination reduces the theoretical 10×). The cost is that every operational task — backups, migrations, upgrades, VACUUM — must run on every shard.
The hot-shard failure mode
The most important failure mode to design against from day one is the hot shard: one shard receives far more traffic than the others because the shard key was poorly chosen, or because one tenant grew 1000× larger than the median.
Example: a SaaS shards orders by customer_id. An enterprise customer signs and their orders are 5× all other customers combined. Their shard hits 95% CPU while the other shards sit at 10%. The cluster effectively runs at single-shard capacity for the dominant customer.
The fix — moving that tenant to a dedicated shard — is covered in lesson 05. The lesson here is: shard-key selection is the single most consequential design decision in a sharded system, and bad choices are lived with for years.
The library metaphor
Sharding is the difference between one library with a million books (one building, one lookup desk, find anything by walking around) and ten libraries with 100k books each (ten buildings, you need a directory of “which library has this book”). The single library is simpler. The ten-library setup needs a directory (the shard map) and a rule for which building to go to before you set off.
Books that get checked out a million times a year fill one library while the others sit empty — that is the hot-shard failure.
Why this works
Why not just use a bigger cloud instance? Vertical scaling is almost always cheaper than sharding for the first few years. The decision point is when you have measured that the largest available instance is still insufficient at peak — usually 10 TB working set, 50k+ QPS sustained write load, or multi-year projections that cross those thresholds. Sharding is a one-way door: once your schema and app logic are written for a sharded world, reversing it is a multi-month project.
What does a shard key determine?
A B2B SaaS shards 'orders' by customer_id. One enterprise customer generates 5× the traffic of all others combined. What is the result?
Order the scaling strategies from cheapest to most expensive (operational complexity), before reaching for sharding:
- 1 Fix slow queries: add missing indexes, rewrite inefficient SQL
- 2 Add read replicas to offload read-heavy traffic from the primary
- 3 Add caching layers (Redis, in-process) for hot read paths
- 4 Vertical scaling: upgrade to a larger instance (more RAM, faster NVMe)
- 5 Shard: split the dataset across multiple Postgres instances by a shard key
- 01What are the three hard ceilings of a single Postgres instance that sharding addresses?
- 02In one sentence: what is the hot-shard failure mode and what causes it?
- 03Why is the operational cost of sharding described as 'N×'?
A single Postgres instance has hard ceilings on working set (RAM), write throughput, and storage that vertical scaling can only delay, not eliminate. Sharding splits the dataset across N physical Postgres instances by a shard key, giving roughly N× the capacity at the cost of N× the operational complexity — every migration, backup, and upgrade runs once per shard. The central failure mode is the hot shard: one shard saturating because the key was poorly chosen or a single tenant outgrew the rest. Senior engineers exhaust indexing, caching, read replicas, and vertical scaling before sharding — and when they do shard, they pick the key deliberately and design the hot-shard response before the first incident.
appears again in140
- Why GraphQL gets N+1junior
- DataLoader mechanics: tick-boundary batchingmiddle
- Batch function contracts: ordering, shapes, errorsmiddle
- Federation and lookahead: batching beyond DataLoadermiddle
- Query complexity defences: depth, cost, persisted queriesmiddle
- Senior GraphQL API: scheduling contract, tenant isolation, observabilitysenior
- Why idempotency: making retries safejunior
- Server-side state machine: four states of an idempotency keymiddle
- Outbox and inbox: effectively-once across the dual-write boundarymiddle
- Concurrency and cache architecture for idempotency at scalesenior
- Observability, production failures, and global-scale designsenior
- The event loop: one thread, three queuesjunior
- Tasks, microtasks, and scheduler.yield()middle
- Microtask starvation, Long Tasks, and LoAFsenior
- Node.js event loop: phases, nextTick, and loop lagsenior
- React, Vue, and INP observability in productionsenior
- The render pipeline: six stages from bytes to pixelsjunior
- Stage costs and the renderer process modelmiddle
- Invalidation, dirty bits, and containmiddle
- Compositor layers: promotion, overlap, and GPU memorymiddle
- DevTools flame strip and the frame lifecyclemiddle
- Layout thrash: forced synchronous layoutsenior
- BeginMainFrame, compositor-driven animations, and GPU memorysenior
- Production observability: LoAF, INP, and the full attack surfacesenior
- What V8 is and why performance varies 100×junior
- V8''''s four-tier JIT pipeline and profile-guided tieringmiddle
- Hidden classes, transition trees, and memory layoutmiddle
- Inline caches, IC states, and deoptimizationmiddle
- Orinoco GC: parallel scavenger, concurrent marking, and write barriersmiddle
- TurboFan''''s speculative engine and the deopt-loop trapsenior
- V8 in production: isolates, pointer compression, and real failuressenior
- Service worker lifecycle and cache strategiesmiddle
- Service worker edge cases: version skew, durability, and navigation trapssenior
- What the reconciler does: render vs commitjunior
- The fiber object and the double-buffer treemiddle
- Render phase purity and commit phase sub-stepsmiddle
- Reconciliation: diffing heuristics and the key trapmiddle
- Priority lanes, time-slicing, and useTransitionmiddle
- Bailout, memoisation, and tearingsenior
- React Profiler, the Compiler, and production observabilitysenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Hydration mismatch: causes, detection, and the determinism rulesenior
- RSC, per-route strategy, and production observabilitysenior
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- CLS: why layout shifts happen and how to stop themmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- What is a cache stampede and why it makes things worsejunior
- Lock and single-flight: bounding concurrent rebuildsmiddle
- XFetch: coordination-free probabilistic early expirationmiddle
- Stale-while-revalidate and CDN request coalescingmiddle
- Detecting stampedes and designing TTL for productionmiddle
- Metastable failure, fencing tokens, and production postmortemssenior
- Raft roles, terms, and why majority quorums prevent split brainjunior
- How Raft replicates a log entry and decides it is safe to commitmiddle
- Raft leader election: timeouts, voting rules, and the four safety propertiesmiddle
- Raft in the real world: partitions, slow disks, and client routingmiddle
- Raft extensions: pre-vote, learners, snapshots, and linearizable readssenior
- Raft in production: membership changes, Multi-Raft, and observabilitysenior
- Where data fetching happens — and why it decides LCPjunior
- Fetch waterfalls — diagnosis and the Promise.all curemiddle
- React Server Components and Suspense streamingmiddle
- Client-side cache: TanStack Query, SWR, and stale-while-revalidatemiddle
- LCP, prefetch, and race conditions in interactive fetchingmiddle
- Senior internals: RSC payload, caching layers, and production failure modessenior
- The three-way handshakejunior
- Sequence numbers and connection statemiddle
- DNS: what it does and why it existsjunior
- The resolver walk: referrals, record types, and gluemiddle
- TTL, caching, and DNS propagationmiddle
- The 1-RTT handshake: key shares and ECDHEmiddle
- Session resumption and 0-RTTmiddle
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket frame format: opcodes, masking, fragmentationmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Health checks, connection draining, and slow startmiddle
- Session affinity, consistent hashing, and the right fixmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- Connection IDs and network migrationmiddle
- 0-RTT resumption and packet encryptionsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- PII redaction and log injectionsenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- SLI, SLO, and the error budget: reliability by the numbersjunior
- Error budget policy, latency SLOs, and composite journeysmiddle
- Production SLO failures, self-observability, security, and the big picturesenior
- The incident loop: from pager to postmortem to preventionmiddle
- Cache lines, struct layout, and false sharingmiddle
- SIMD, SoA vs AoS, and memory bandwidthmiddle
- Cache-oblivious algorithms, PGO, and production failuressenior
- GC in production: observability, security, edge cases, and fleet governancesenior
- Batching: amortize fixed cost per operationjunior
- The batching window: size and wait timemiddle
- Batching in Kafka and Postgresmiddle
- io_uring and observability of batchingmiddle
- From Nagle to io_uring: evolution of batchingmiddle
- Backpressure, failure isolation, and batch security in productionsenior
- CI enforcement and RUM: making budgets stickmiddle
- V8 JIT pipeline, HTTP priorities, and bundle securitysenior
- The performance loop: discipline, not a projectjunior
- Classify and fix: matching bottleneck families to remediesmiddle
- Observability stack and CI gates: catching regressions before they shipmiddle
- Incident to enforcement: SLO burn to verified fix in 35 minutesmiddle
- Culture, economics, and org-scale performancesenior
- At-most-once, at-least-once, exactly-once: the three delivery contractsjunior
- The three failure legs — where duplicates and losses actually happenmiddle
- Consumer-side dedup: the cheapest path to exactly-once processingmiddle
- Kafka exactly-once semantics: idempotent producer and transactionsmiddle
- SQS visibility timeout, DLQ, and the outbox patternmiddle
- Exactly-once in production: impossibility proof, hybrid patterns, and real incidentssenior
- What OAuth is and why passwords are not the answerjunior
- Authorization code flow with PKCEmiddle
- ID token validation and JWKS cache managementmiddle
- Refresh token rotation and scope-based least privilegemiddle
- Sender-constrained tokens: DPoP and mTLSsenior
- OAuth in production: audience attacks, observability, and real failuressenior