Databases DB · 08 · 05

Observability, anti-patterns, and production triage

USE + RED maps which lever to pull; pg_stat_statements, pg_stat_activity, and pg_stat_user_tables are the eyes. An inherited Postgres at 200M rows, 600 backends, and 40 GB bloat has a three-week triage order — the seven-act model doubles as an emergency protocol.

DB Senior ◷ 16 min

Level

FoundationsJuniorMiddleSenior

You inherit a Postgres at 200M rows, p95 1500 ms, 600 concurrent backends, 40 GB bloat, no pooler, single node. Three weeks to stabilize. The seven-act model is not just a growth story — it is also the triage protocol. Work from the most acute failure mode to the most structural one, not the other way around.

USE + RED for Postgres

Without a systematic signal-to-act mapping, engineers pull the wrong lever under pressure — running VACUUM FULL on a bloated table while the real killer is an idle connection holding the xmin horizon. USE + RED gives you that mapping.

The USE method (Utilization, Saturation, Errors) maps to Postgres signals:

Utilization: backend count vs max_connections (pg_stat_activity), CPU from pg_stat_bgwriter.checkpoints_req ratio, buffer cache hit rate (pg_buffercache).
Saturation: autovacuum lag (pg_stat_user_tables.last_autovacuum), replication lag (pg_stat_replication.replay_lag), connection queue depth (PgBouncer SHOW STATS).
Errors: deadlock count (pg_stat_database.deadlocks), checksum failures, replication errors from server logs.

The RED method (Rate, Errors, Duration) on top of pg_stat_statements:

Queries/second per queryid.
Error rate from app logs (timeouts, deadlocks).
p95/p99 duration per queryid.

Together they tell you which lever to pull:

Signal	Likely act
High backend count, low CPU	Act 5 (pooling problem)
High dead tuples, stable row count	Act 4 (vacuum/bloat problem)
Specific queryid with rising duration	Act 2 or 3 (index or stats problem)
Rising p95 + one tenant’s shards hot	Act 7 (hot shard problem)

The signal you observe names the act — and therefore the lever. Read the symptom, then pull the matching act's fix; do not guess under pressure.

Diagnostic tooling per act

Act 1 (schema): no good tooling — read the table definition and trace migrations. \d+ tablename in psql.
Act 2 (indexes): pg_indexes for existence; pg_stat_user_indexes (idx_scan column) to find unused indexes — an index with 0 scans after a week of traffic is a write-overhead tax.
Act 3 (planning): pg_stat_statements for top-N queries by total time; EXPLAIN (ANALYZE, BUFFERS) for the actual plan; auto_explain.log_min_duration = '500ms' for tail-latency triage.
Act 4 (bloat): pg_stat_user_tables.n_dead_tup vs n_live_tup; pg_table_size(relname) for actual disk; pgstattuple for precise bloat percentage without a full scan.
Act 5 (pooling): pg_stat_activity for backend count, state distribution, idle in transaction sessions; PgBouncer SHOW STATS for queue depth and avg query time.
Act 6 (migrations): pg_locks joined to pg_stat_activity to detect lock contention and blockers before running ALTER.
Act 7 (sharding): application-level metrics — tail latency per shard, write variance across shards, cross-shard join frequency. Missing or misinterpreted tooling is why teams skip acts: “We can’t see the bloat” means autovacuum visibility is off, not that bloat does not exist.

The three-week triage order

You inherit a Postgres at 200M rows, p95 1500 ms, 600 concurrent backends, 40 GB bloat, no pooler:

Week 1 — stop the bleeding:

Kill long transactions: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND query_start < now() - interval '5 minutes'.
Set idle_in_transaction_session_timeout = 60s so the problem cannot recur overnight.
Deploy PgBouncer in transaction-mode in front of Postgres. Size the pool on active concurrent transactions, not worker count (step 1 reveals actual concurrency).
Run pg_repack on the bloated tables online — no AccessExclusiveLock required.

Week 2 — re-establish plan quality:

ANALYZE on the most-queried tables.
CREATE STATISTICS for the top correlated column pairs from pg_stat_statements slow queries.
Review pg_stat_statements top-N by total_time; drop unused indexes (idx_scan = 0); add obviously-missing ones.

Week 3 — capacity assessment:

Measure actual single-node ceiling with the pool in place. Plot p95 vs QPS curve.
If single node cannot keep up at expected load: evaluate declarative partitioning (single node, fewer ops) before sharding.
Only then plan Act 7 — with shard key, co-location, and online resharding rehearsed in staging.

XID wraparound and autovacuum freeze

Every row carries a 32-bit transaction ID (xmin). Once the cluster reaches 2^31 transactions (~2.1 billion), IDs wrap around and rows can appear to be “from the future.” Postgres protects against this by freezing old rows (rewriting xmin to a special “frozen” value) before wraparound. autovacuum_freeze_max_age controls the trigger.

On a write-heavy cluster at 10K TPS, that is ~864M transactions per day — the freeze runs every ~2.5 days. A misconfiguration (very high autovacuum_freeze_max_age) can lead to anti-wraparound emergency vacuums that hold an exclusive lock on the table.

Monitor: SELECT datname, age(datfrozenxid) FROM pg_database; — alert at 80% of autovacuum_freeze_max_age.

The anti-pattern catalogue when an act is skipped

Skip Act 1: a year later you spend three weeks renaming a column because the schema lacks a surrogate key and every FK must change.
Skip Act 2: every dashboard query is a 30-second seq-scan; the app team adds caches everywhere; the cache becomes the new source of truth and goes inconsistent.
Skip Act 3: you have indexes, but the planner ignores them half the time; you write a runbook that says “if slow, restart Postgres” because that flushes the plan cache.
Skip Act 4: the OOM-killer takes Postgres down on Friday; the postmortem says “we’ll add monitoring”; it never happens.
Skip Act 5: every deploy triggers a connection storm; CI now spreads pod rollouts across 10 minutes “as a workaround.”
Skip Act 6: migrations only run during a quarterly maintenance window because nothing else is safe; the queue of pending migrations grows to 40 entries.
Skip Act 7: you spend six months explaining to leadership that one tenant is using 40% of capacity and the only fix is a re-architecture.

Key Postgres observability thresholds

Cache hit ratio — alert threshold: below 95–99%
n_dead_tup / n_live_tup — bloat alert: above 20%
age(datfrozenxid) — wraparound alert: 80% of autovacuum_freeze_max_age
PgBouncer SHOW STATS: avg_wait_time alert: above 5 ms sustained
pg_stat_statements total_time alert: queries > 1s consuming > 5% of total
Replication lag alert: above 30 seconds on async replica

▸Why this works

The seven-act model is both a growth framework (design levers in order) and a triage framework (apply fixes in reverse of failure acuteness). During an outage, you do not blame the schema (Act 1); you kill long transactions (Act 4), then deploy a pooler (Act 5), then optimize queries (Act 2–3). Once stability is restored, you postmortem and write runbooks for each act to prevent the next incident. The order of acts is not just a design constraint — it is an emergency protocol.

Quiz

pg_stat_user_tables shows n_dead_tup = 80M on a table with n_live_tup = 100M. What is the correct diagnosis and first lever?

Quiz

What is the role of backend_xmin in pg_stat_activity for diagnosing bloat?

Quiz

Why does hot_standby_feedback = on shift bloat from the replica to the primary?

Act 4 — Kill long transactions, repack bloat stop the bleeding

Act 5 — Deploy pooler, size on active tx stabilize connections

Act 3 — ANALYZE, CREATE STATISTICS plan quality

Act 2 — Add missing, drop unused indexes access paths

Act 1 — Schema fixes structural, last

Act 7 — Shard only if single node maxed structural, last

Triage runs in reverse of failure acuteness: top = most acute (fix first), bottom = most structural (fix last). The seven-act growth order doubles as the emergency protocol.

Recall before you leave

01
Map the USE method to the Postgres signals that indicate each of the seven acts.
02
You inherit a production Postgres at 200M rows, p95 1500 ms, 600 backends, 40 GB bloat, no pooler. Give the triage order for the first three weeks and justify each step.
03
What is XID wraparound, why is it a correctness cliff (not a performance issue), and how do you monitor it?

Recap

The USE + RED framework maps directly to the seven acts: high backend count is Act 5, high dead-tuple ratio is Act 4, a specific query’s rising duration is Act 2 or 3, and diverging per-shard tail latency is Act 7. Each act has its diagnostic tool — pg_stat_user_tables for bloat, pg_stat_activity for connection storms, pg_stat_statements for slow queries, pg_locks for migration blockers, and application-level shard metrics for Act 7. When inheriting a degraded Postgres, the correct triage order is Act 4 first (kill long transactions, stop bleeding), then Act 5 (deploy pooler), then Acts 3 and 2 (plan quality), then structural Acts 1 and 7. XID wraparound is a correctness cliff — it does not slow queries, it corrupts them — monitor age(datfrozenxid) per database and alert at 80% of autovacuum_freeze_max_age. The seven-act model is simultaneously a growth framework and an emergency protocol: the order of acts is the order of triage when everything is on fire. Now when you inherit a degraded Postgres, open pg_stat_activity before touching anything else — it tells you which act you are in.

Connected lessons

builds on

appears again in292

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.