Databases
Vacuum and bloat: keeping the storage tax bounded
An autovacuum worker ran for 30 minutes on your orders table, removed zero dead tuples, and logged “287 million are not yet removable.” The table keeps growing. This is not an autovacuum configuration problem — it is a snapshot pinning problem, and knowing which is which determines the fix.
Where the cost is paid
Each UPDATE writes one fresh tuple and marks the old tuple dead. Both versions sit in the heap on disk, and unless the update qualifies as a HOT update, both have index entries. Until VACUUM clears the dead tuples, every sequential scan walks past them and the table occupies more pages than its live data needs.
A common rule of thumb at production scale: target dead-tuple ratio below 20% for tables above 50 GB. Smaller tables tolerate higher bloat (30–50%) because the absolute waste is small.
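To check a table against that rule of thumb, a query along the following lines can help. It is a minimal sketch that assumes "dead-tuple ratio" means dead tuples divided by live plus dead tuples; the counters are estimates from the cumulative statistics views, not exact counts.

```sql
-- Dead-tuple ratio per table, largest tables first.
-- Assumption: ratio = n_dead_tup / (n_live_tup + n_dead_tup).
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_live_tup,
       n_dead_tup,
       -- NULLIF avoids division by zero on empty tables
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
```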
Plain VACUUM does not shrink the file: it marks space reusable inside the file. The one exception is that it can truncate completely empty pages at the very end of the table. Reclaiming disk back to the operating system requires one of:
- VACUUM FULL: rewrites the whole table while holding an ACCESS EXCLUSIVE lock, blocking every reader and writer
- pg_repack extension: rewrites concurrently, swaps at the end (covered in lesson 07)
| Operation | What it reclaims | Blocks reads/writes? | Shrinks file? |
|---|---|---|---|
| Regular VACUUM / autovacuum | Dead tuple slots → reusable | No (SHARE UPDATE EXCLUSIVE) | No (except empty pages at the end of the file) |
| VACUUM FULL | All dead space, by rewriting the whole table | Yes (ACCESS EXCLUSIVE) | Yes |
| pg_repack | All dead space, by rewriting concurrently | Brief swap only | Yes |
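The "Shrinks file?" column can be observed directly by comparing on-disk size before and after each operation. This is a small sketch using the `orders` table from the introduction as the example name.

```sql
-- Heap size vs. total size (heap + indexes + TOAST) for one table.
-- Run before and after VACUUM, VACUUM FULL, or pg_repack and compare.
SELECT pg_size_pretty(pg_relation_size('orders'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('orders')) AS total_size;
```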
How autovacuum actually decides to run
Autovacuum is a background process pool (3 workers by default) that wakes every minute and inspects every table for two threshold conditions:
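- Dead-tuple threshold: autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × n_live_tup (default: 50 + 20% of live tuples). PostgreSQL takes the row count from pg_class.reltuples, which normally tracks n_live_tup.
- Insert threshold (PostgreSQL 13 and later): autovacuum_vacuum_insert_threshold + autovacuum_vacuum_insert_scale_factor × n_live_tup (default: 1000 + 20%), so insert-only tables still get vacuumed. A separate, similarly shaped pair of autovacuum_analyze_* settings decides when ANALYZE runs.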
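As a worked example of the dead-tuple formula with the defaults: a table with 10,000,000 live tuples crosses the threshold at 50 + 0.2 × 10,000,000 = 2,000,050 dead tuples. The query below is a rough sketch that applies the same formula to every table using the global settings; it ignores per-table overrides stored in `pg_class.reloptions`, so treat the result as an approximation.

```sql
-- Approximate dead-tuple trigger point per table, using global settings only.
-- Assumption: n_live_tup stands in for the row count; PostgreSQL itself
-- reads pg_class.reltuples, which is usually close.
SELECT relname,
       n_dead_tup,
       round(current_setting('autovacuum_vacuum_threshold')::bigint
             + current_setting('autovacuum_vacuum_scale_factor')::float8 * n_live_tup)
         AS vacuum_triggers_at
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```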
When a table crosses a threshold, autovacuum dispatches a worker. The worker takes a SHARE UPDATE EXCLUSIVE lock — strong enough to block other VACUUMs and schema changes, weak enough to let regular reads and writes proceed in parallel.
The worker computes the global oldest xmin (the cap on reclaimable tuples) across all sessions, replication slots, and prepared transactions. It then scans the heap, identifies dead tuples whose t_xmax is older than the oldest xmin, marks their slots reusable, walks each index to remove stale pointers, and updates the table’s free-space map.
Throughout, autovacuum is rate-limited by a cost-based delay: every page read costs a few units, every page write costs more units, and once the worker exceeds autovacuum_vacuum_cost_limit, it sleeps for autovacuum_vacuum_cost_delay milliseconds.
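The actual unit costs, limit, and delay differ between PostgreSQL versions, so it is safer to read them from the running server than to assume values. A minimal sketch:

```sql
-- Current cost-based throttling settings for VACUUM and autovacuum.
-- autovacuum_vacuum_cost_limit = -1 means "fall back to vacuum_cost_limit".
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('vacuum_cost_page_hit',
               'vacuum_cost_page_miss',
               'vacuum_cost_page_dirty',
               'vacuum_cost_limit',
               'autovacuum_vacuum_cost_limit',
               'autovacuum_vacuum_cost_delay')
ORDER BY name;
```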
Put the steps of an autovacuum run on a bloated table in order:
- 1 Scheduler wakes up; checks pg_stat_all_tables.n_dead_tup vs autovacuum_vacuum_threshold + scale_factor × n_live_tup
- 2 Worker connects to the database; acquires a SHARE UPDATE EXCLUSIVE lock on the table (concurrent reads + writes continue)
- 3 Worker computes the oldest xmin across all sessions (pg_stat_activity.backend_xmin) — only tuples older than this can be removed
- 4 Heap scan: visit each page, identify dead tuples whose t_xmax < oldest xmin
- 5 Mark dead tuple slots reusable inside each page; rebuild the page's free-space map entry
- 6 Index cleanup: walk each index and remove entries pointing at reclaimed heap slots
- 7 Update pg_class.relfrozenxid + pg_stat_all_tables counters; release lock
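While a worker is running, its current phase is visible in the `pg_stat_progress_vacuum` view (PostgreSQL 9.6 and later). The sketch below sticks to columns that have kept the same names across versions; phase names in the view are PostgreSQL's own and do not match the step labels above word for word.

```sql
-- One row per VACUUM currently running in this database, including autovacuum workers.
SELECT p.pid,
       p.relid::regclass AS table_name,
       p.phase,                -- e.g. 'scanning heap', 'vacuuming indexes', 'vacuuming heap'
       p.heap_blks_total,
       p.heap_blks_scanned,
       p.heap_blks_vacuumed,
       p.index_vacuum_count    -- completed index-cleanup passes so far
FROM pg_stat_progress_vacuum AS p;
```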
What pins the oldest xmin
The most common cause of bloat that autovacuum cannot reclaim is a long-running transaction or orphaned replication slot holding back the global oldest xmin.
Diagnose with:
-- Sessions holding the oldest snapshots. The xid type has no ordering
-- operator, so sort by age() instead of by backend_xmin directly.
SELECT pid, backend_xmin, now() - xact_start AS duration
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC LIMIT 5;
-- Replication slots that hold an xmin.
SELECT slot_name, xmin FROM pg_replication_slots WHERE xmin IS NOT NULL;
Fix: terminate the long transaction (SELECT pg_terminate_backend(pid)) or drop the unused slot (SELECT pg_drop_replication_slot('name')). The very next autovacuum cycle then reclaims the pinned bloat.
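If neither query points at a culprit, two follow-up checks may be worth running. This is a sketch, not an exhaustive list: it looks for abandoned prepared (two-phase) transactions, which also hold back the horizon but do not appear in pg_stat_activity, and it shows how far behind each slot's xmin is.

```sql
-- Prepared transactions that were never committed or rolled back.
SELECT gid, prepared, owner, database, age(transaction) AS xid_age
FROM pg_prepared_xacts
ORDER BY age(transaction) DESC;

-- How many transactions old each slot's xmin is; a large value on an
-- inactive slot suggests the slot is orphaned.
SELECT slot_name, active, age(xmin) AS xmin_age
FROM pg_replication_slots
WHERE xmin IS NOT NULL
ORDER BY age(xmin) DESC;
```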
Prevention: set idle_in_transaction_session_timeout to 5–15 minutes in production. This kills sessions that have an open transaction without activity, preventing them from pinning xmin indefinitely.
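One way to apply that setting is sketched below. The 10-minute and 5-minute values are illustrative picks, and `app_user` is a hypothetical role name; ALTER SYSTEM requires superuser privileges.

```sql
-- Cluster-wide default, applied on configuration reload.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();

-- Optional stricter value for one application role (hypothetical name).
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';

-- Sessions that are idle inside an open transaction right now.
SELECT pid, usename, now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change;
```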
The hot_standby_feedback wrinkle
Streaming replicas can be configured with hot_standby_feedback = on, which sends the replica’s oldest active xmin back to the primary so the primary’s autovacuum knows not to reclaim tuples a replica still needs. Without it, a long analytical query on a replica can fail with canceling statement due to conflict with recovery.
With it, the primary’s bloat is held hostage by the replica’s longest-running query. Most production setups choose off on the replica and accept the occasional query cancellation, because pinning primary bloat indefinitely is the worse failure mode. The alternative is to route long analytics to a logical replica that maintains its own snapshot policy independently.
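To see whether a replica is currently holding back the primary, the feedback it reports can be inspected on both sides. A minimal sketch:

```sql
-- On the replica: is feedback enabled?
SHOW hot_standby_feedback;

-- On the primary: the xmin each standby reports through hot_standby_feedback.
-- backend_xmin is NULL when the standby sends no feedback.
SELECT application_name,
       client_addr,
       state,
       backend_xmin,
       age(backend_xmin) AS xmin_age
FROM pg_stat_replication;
```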
Why this works
The cost model for autovacuum’s IO throttle was designed for spinning disks. On NVMe storage, autovacuum_vacuum_cost_delay can usually be set to 0, because the disk can absorb continuous VACUUM IO without a noticeable impact on foreground queries. On spinning disks, keep it above 0.
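The delay can be changed for a single table or for the whole cluster. The sketch below uses the `orders` table from the introduction as the example; whether 0 is appropriate depends on the storage and workload, so it is worth watching foreground latency after the change.

```sql
-- Remove the throttle for one hot table only.
ALTER TABLE orders SET (autovacuum_vacuum_cost_delay = 0);

-- Or remove it cluster-wide; takes effect on configuration reload.
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;
SELECT pg_reload_conf();

-- To undo the per-table override later:
--   ALTER TABLE orders RESET (autovacuum_vacuum_cost_delay);
```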
A long-running batch job has accumulated 10 GB of dead tuples on an orders table. Autovacuum has been running but n_dead_tup is not dropping. What is the likely cause and how do you confirm?
After VACUUM (not VACUUM FULL) runs successfully on a 200 GB table with 40% bloat, what happens to the file size?
hot_standby_feedback = on on a replica sends what information to the primary, and what is the risk?
- 01 What is the global oldest xmin, where does it come from, and why does it matter for autovacuum?
- 02 What is the autovacuum dead-tuple threshold formula, and what does each parameter do?
- 03 What lock does autovacuum take, and what does it block?
Every UPDATE leaves a dead tuple on disk; autovacuum reclaims those slots by marking them reusable, but it does not return space to the OS apart from truncating empty pages at the end of the file. VACUUM FULL or pg_repack is needed to return the rest, each with its own downtime cost. Autovacuum triggers when n_dead_tup exceeds a per-table threshold, takes a non-blocking SHARE UPDATE EXCLUSIVE lock, and can only reclaim tuples older than the cluster’s global oldest xmin. A long-running transaction or orphaned replication slot pins that xmin and makes autovacuum’s work silently futile; diagnose with pg_stat_activity.backend_xmin and fix by terminating the session or dropping the slot. Setting idle_in_transaction_session_timeout prevents this at the infrastructure level.