Databases
SSI internals and production autovacuum tuning
A production OLTP cluster suddenly logs hundreds of “could not serialize access” errors per minute on the orders table. You have 15 minutes before the on-call engineer escalates. The diagnosis path — isolation level confirmation, predicate-lock count, workload genuineness — is a step-by-step procedure, not guesswork.
Serializable Snapshot Isolation (SSI) internals
Postgres 9.1 introduced SSI per Cahill, Röhm, and Fekete’s 2008 paper. The mechanism:
- Every read tracks “predicate locks” — lightweight, in-memory references to “I read this set”
- Every write tracks rw-antidependencies — “this transaction wrote something my predicate matched after I read”
- At commit time, if a transaction is part of a “dangerous structure” — a cycle of rw-antidependencies — SSI aborts it with SQLSTATE 40001
could not serialize access due to read/write dependencies
The math is conservative: SSI sometimes aborts transactions that would have committed safely (false positives are correct but cost throughput), but it never allows a non-serializable schedule.
Predicate-lock memory is bounded by max_pred_locks_per_transaction × max_connections. Under memory pressure the locks coarsen from row → page → relation, raising false-positive rate further.
| False-positive rate | Likely cause | Fix |
|---|---|---|
| 0.1–1% | Normal SSI behavior | No action needed |
| 1–5% | Predicate-lock coarsening | Raise max_pred_locks_per_transaction |
| >5% | Genuine workload conflict or memory pressure | App-level partitioning or switch to RC+FOR UPDATE |
A production OLTP cluster suddenly logs hundreds of 'could not serialize access' errors per minute on the orders table. Trace the diagnosis.
Heap fillfactor and HOT chain length
pg_stat_all_tables.n_tup_hot_upd divided by n_tup_upd gives the HOT ratio. On UPDATE-heavy schemas you want this above 80%. Below 50% means fillfactor is too aggressive or the schema has the wrong index set for the workload.
The pgstattuple extension exposes tuple_count, dead_tuple_count, and free_space per table for deeper introspection.
Production tuning of autovacuum
The default knobs are too lax for any high-write table. Per-table overrides via storage parameters:
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_cost_limit = 2000,
autovacuum_vacuum_cost_delay = 2
);This turns on autovacuum at 5% dead-tuple ratio (instead of 20%), with higher cost-budget per cycle and shorter delay between IO chunks. On NVMe-backed clusters the cost_delay can be zero.
For tables with bursty write patterns, an explicit VACUUM (PARALLEL 4) orders; overnight via cron is common in addition to autovacuum.
autovacuum_max_workers bumps the worker pool from 3 to 6 or 8 on large hosts.
Watch pg_stat_progress_vacuum.phase to see live progress:
scanning heapvacuuming indexesvacuuming heapcleaning up indexestruncating heapperforming final cleanup
Observability: the metrics that actually move
Minimum-viable Postgres bloat dashboard:
-- Dead tuple ratios
SELECT relname, n_dead_tup, n_live_tup,
n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables ORDER BY dead_ratio DESC LIMIT 20;
-- Snapshot pinners
SELECT pid, backend_xmin, now() - xact_start AS duration, state, query
FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin LIMIT 5;
-- Orphan replication slots
SELECT slot_name, xmin, confirmed_flush_lsn FROM pg_replication_slots WHERE xmin IS NOT NULL;
-- Wraparound ages
SELECT datname, age(datfrozenxid), age(datminmxid) FROM pg_database;A bank's transfer service moves money between accounts. Pick the isolation strategy that best balances correctness, throughput, and operational complexity.
Design an autovacuum policy for a high-write OLTP table receiving 5,000 row inserts + updates per second, with a target dead-tuple ratio below 10% and zero impact on customer-facing latency.
- Table is 200 GB, growing 2-3 GB per day.
- Underlying storage is NVMe, single-tenant.
- Application opens 5-minute analytics transactions intermittently from a replica.
- Customer-facing p99 must stay under 50 ms during autovacuum runs.
- No downtime windows.
- Per-table scale_factor much lower than the global 0.2 default.
- cost_delay = 0 only because NVMe can absorb the IO; rotational storage needs cost_delay > 0.
- hot_standby_feedback = OFF on the replica is the single most important setting — it decouples analytics from primary bloat.
- Fillfactor 80 preserves room for HOT updates without wasting too much space.
- Explicit nightly VACUUM is belt-and-suspenders but worth it for planner stats.
Postgres logs 'cannot remove tuples deleted by transactions in progress' during autovacuum on a 200 GB orders table. What is the most likely root cause?
A bank insists on SERIALIZABLE isolation for transfers. Throughput at peak is 5x lower than under READ COMMITTED. The DBA proposes READ COMMITTED + SELECT FOR UPDATE. The compliance officer worries about correctness. Who is right and why?
- 01Explain how SSI detects write skew — what data structure does it use and when does it abort?
- 02Explain how a HOT chain breaks and what observable symptoms tell you it is happening too often.
- 03Why is hot_standby_feedback = OFF the recommended setting on analytics replicas, and what is the risk of turning it ON?
SSI (Serializable Snapshot Isolation, Postgres 9.1) detects write skew by tracking predicate locks (what I read) and rw-antidependencies (what I wrote that others were watching). A cycle means a dangerous schedule; SSI aborts one participant. False-positive rate of 0.1–1% is normal; above 5% indicates predicate-lock coarsening due to memory pressure or genuine workload conflict. Production autovacuum on high-write tables requires per-table overrides: scale_factor = 0.02–0.05, cost_delay = 0 on NVMe, max_workers = 6–8. Analytics replicas should run with hot_standby_feedback = OFF to decouple primary bloat from replica query duration. The minimum-viable bloat dashboard is dead_tuple ratio + backend_xmin age + 40001 error rate, plus replication slot xmin monitoring.
appears again in140
- Why GraphQL gets N+1junior
- DataLoader mechanics: tick-boundary batchingmiddle
- Batch function contracts: ordering, shapes, errorsmiddle
- Federation and lookahead: batching beyond DataLoadermiddle
- Query complexity defences: depth, cost, persisted queriesmiddle
- Senior GraphQL API: scheduling contract, tenant isolation, observabilitysenior
- Why idempotency: making retries safejunior
- Server-side state machine: four states of an idempotency keymiddle
- Outbox and inbox: effectively-once across the dual-write boundarymiddle
- Concurrency and cache architecture for idempotency at scalesenior
- Observability, production failures, and global-scale designsenior
- The event loop: one thread, three queuesjunior
- Tasks, microtasks, and scheduler.yield()middle
- Microtask starvation, Long Tasks, and LoAFsenior
- Node.js event loop: phases, nextTick, and loop lagsenior
- React, Vue, and INP observability in productionsenior
- The render pipeline: six stages from bytes to pixelsjunior
- Stage costs and the renderer process modelmiddle
- Invalidation, dirty bits, and containmiddle
- Compositor layers: promotion, overlap, and GPU memorymiddle
- DevTools flame strip and the frame lifecyclemiddle
- Layout thrash: forced synchronous layoutsenior
- BeginMainFrame, compositor-driven animations, and GPU memorysenior
- Production observability: LoAF, INP, and the full attack surfacesenior
- What V8 is and why performance varies 100×junior
- V8''''s four-tier JIT pipeline and profile-guided tieringmiddle
- Hidden classes, transition trees, and memory layoutmiddle
- Inline caches, IC states, and deoptimizationmiddle
- Orinoco GC: parallel scavenger, concurrent marking, and write barriersmiddle
- TurboFan''''s speculative engine and the deopt-loop trapsenior
- V8 in production: isolates, pointer compression, and real failuressenior
- Service worker lifecycle and cache strategiesmiddle
- Service worker edge cases: version skew, durability, and navigation trapssenior
- What the reconciler does: render vs commitjunior
- The fiber object and the double-buffer treemiddle
- Render phase purity and commit phase sub-stepsmiddle
- Reconciliation: diffing heuristics and the key trapmiddle
- Priority lanes, time-slicing, and useTransitionmiddle
- Bailout, memoisation, and tearingsenior
- React Profiler, the Compiler, and production observabilitysenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Hydration mismatch: causes, detection, and the determinism rulesenior
- RSC, per-route strategy, and production observabilitysenior
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- CLS: why layout shifts happen and how to stop themmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- What is a cache stampede and why it makes things worsejunior
- Lock and single-flight: bounding concurrent rebuildsmiddle
- XFetch: coordination-free probabilistic early expirationmiddle
- Stale-while-revalidate and CDN request coalescingmiddle
- Detecting stampedes and designing TTL for productionmiddle
- Metastable failure, fencing tokens, and production postmortemssenior
- Raft roles, terms, and why majority quorums prevent split brainjunior
- How Raft replicates a log entry and decides it is safe to commitmiddle
- Raft leader election: timeouts, voting rules, and the four safety propertiesmiddle
- Raft in the real world: partitions, slow disks, and client routingmiddle
- Raft extensions: pre-vote, learners, snapshots, and linearizable readssenior
- Raft in production: membership changes, Multi-Raft, and observabilitysenior
- Where data fetching happens — and why it decides LCPjunior
- Fetch waterfalls — diagnosis and the Promise.all curemiddle
- React Server Components and Suspense streamingmiddle
- Client-side cache: TanStack Query, SWR, and stale-while-revalidatemiddle
- LCP, prefetch, and race conditions in interactive fetchingmiddle
- Senior internals: RSC payload, caching layers, and production failure modessenior
- The three-way handshakejunior
- Sequence numbers and connection statemiddle
- DNS: what it does and why it existsjunior
- The resolver walk: referrals, record types, and gluemiddle
- TTL, caching, and DNS propagationmiddle
- The 1-RTT handshake: key shares and ECDHEmiddle
- Session resumption and 0-RTTmiddle
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket frame format: opcodes, masking, fragmentationmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Health checks, connection draining, and slow startmiddle
- Session affinity, consistent hashing, and the right fixmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- Connection IDs and network migrationmiddle
- 0-RTT resumption and packet encryptionsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- PII redaction and log injectionsenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- SLI, SLO, and the error budget: reliability by the numbersjunior
- Error budget policy, latency SLOs, and composite journeysmiddle
- Production SLO failures, self-observability, security, and the big picturesenior
- The incident loop: from pager to postmortem to preventionmiddle
- Cache lines, struct layout, and false sharingmiddle
- SIMD, SoA vs AoS, and memory bandwidthmiddle
- Cache-oblivious algorithms, PGO, and production failuressenior
- GC in production: observability, security, edge cases, and fleet governancesenior
- Batching: amortize fixed cost per operationjunior
- The batching window: size and wait timemiddle
- Batching in Kafka and Postgresmiddle
- io_uring and observability of batchingmiddle
- From Nagle to io_uring: evolution of batchingmiddle
- Backpressure, failure isolation, and batch security in productionsenior
- CI enforcement and RUM: making budgets stickmiddle
- V8 JIT pipeline, HTTP priorities, and bundle securitysenior
- The performance loop: discipline, not a projectjunior
- Classify and fix: matching bottleneck families to remediesmiddle
- Observability stack and CI gates: catching regressions before they shipmiddle
- Incident to enforcement: SLO burn to verified fix in 35 minutesmiddle
- Culture, economics, and org-scale performancesenior
- At-most-once, at-least-once, exactly-once: the three delivery contractsjunior
- The three failure legs — where duplicates and losses actually happenmiddle
- Consumer-side dedup: the cheapest path to exactly-once processingmiddle
- Kafka exactly-once semantics: idempotent producer and transactionsmiddle
- SQS visibility timeout, DLQ, and the outbox patternmiddle
- Exactly-once in production: impossibility proof, hybrid patterns, and real incidentssenior
- What OAuth is and why passwords are not the answerjunior
- Authorization code flow with PKCEmiddle
- ID token validation and JWKS cache managementmiddle
- Refresh token rotation and scope-based least privilegemiddle
- Sender-constrained tokens: DPoP and mTLSsenior
- OAuth in production: audience attacks, observability, and real failuressenior