awesome-everything RU
↑ Back to the climb

Databases

Observability, anti-patterns, and production triage

Crux USE + RED maps which lever to pull; pg_stat_statements, pg_stat_activity, and pg_stat_user_tables are the eyes. An inherited Postgres at 200M rows, 600 backends, and 40 GB bloat has a three-week triage order — the seven-act model doubles as an emergency protocol.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 16 min

You inherit a Postgres at 200M rows, p95 1500 ms, 600 concurrent backends, 40 GB bloat, no pooler, single node. Three weeks to stabilize. The seven-act model is not just a growth story — it is also the triage protocol. Work from the most acute failure mode to the most structural one, not the other way around.

USE + RED for Postgres

The USE method (Utilization, Saturation, Errors) maps to Postgres signals:

  • Utilization: backend count vs max_connections (pg_stat_activity), CPU from pg_stat_bgwriter.checkpoints_req ratio, buffer cache hit rate (pg_buffercache).
  • Saturation: autovacuum lag (pg_stat_user_tables.last_autovacuum), replication lag (pg_stat_replication.replay_lag), connection queue depth (PgBouncer SHOW STATS).
  • Errors: deadlock count (pg_stat_database.deadlocks), checksum failures, replication errors from server logs.

The RED method (Rate, Errors, Duration) on top of pg_stat_statements:

  • Queries/second per queryid.
  • Error rate from app logs (timeouts, deadlocks).
  • p95/p99 duration per queryid.

Together they tell you which lever to pull:

SignalLikely act
High backend count, low CPUAct 5 (pooling problem)
High dead tuples, stable row countAct 4 (vacuum/bloat problem)
Specific queryid with rising durationAct 2 or 3 (index or stats problem)
Rising p95 + one tenant’s shards hotAct 7 (hot shard problem)

Diagnostic tooling per act

  • Act 1 (schema): no good tooling — read the table definition and trace migrations. \d+ tablename in psql.
  • Act 2 (indexes): pg_indexes for existence; pg_stat_user_indexes (idx_scan column) to find unused indexes — an index with 0 scans after a week of traffic is a write-overhead tax.
  • Act 3 (planning): pg_stat_statements for top-N queries by total time; EXPLAIN (ANALYZE, BUFFERS) for the actual plan; auto_explain.log_min_duration = '500ms' for tail-latency triage.
  • Act 4 (bloat): pg_stat_user_tables.n_dead_tup vs n_live_tup; pg_table_size(relname) for actual disk; pgstattuple for precise bloat percentage without a full scan.
  • Act 5 (pooling): pg_stat_activity for backend count, state distribution, idle in transaction sessions; PgBouncer SHOW STATS for queue depth and avg query time.
  • Act 6 (migrations): pg_locks joined to pg_stat_activity to detect lock contention and blockers before running ALTER.
  • Act 7 (sharding): application-level metrics — tail latency per shard, write variance across shards, cross-shard join frequency. Missing or misinterpreted tooling is why teams skip acts: “We can’t see the bloat” means autovacuum visibility is off, not that bloat does not exist.

The three-week triage order

You inherit a Postgres at 200M rows, p95 1500 ms, 600 concurrent backends, 40 GB bloat, no pooler:

Week 1 — stop the bleeding:

  1. Kill long transactions: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND query_start < now() - interval '5 minutes'.
  2. Set idle_in_transaction_session_timeout = 60s so the problem cannot recur overnight.
  3. Deploy PgBouncer in transaction-mode in front of Postgres. Size the pool on active concurrent transactions, not worker count (step 1 reveals actual concurrency).
  4. Run pg_repack on the bloated tables online — no AccessExclusiveLock required.

Week 2 — re-establish plan quality:

  1. ANALYZE on the most-queried tables.
  2. CREATE STATISTICS for the top correlated column pairs from pg_stat_statements slow queries.
  3. Review pg_stat_statements top-N by total_time; drop unused indexes (idx_scan = 0); add obviously-missing ones.

Week 3 — capacity assessment:

  1. Measure actual single-node ceiling with the pool in place. Plot p95 vs QPS curve.
  2. If single node cannot keep up at expected load: evaluate declarative partitioning (single node, fewer ops) before sharding.
  3. Only then plan Act 7 — with shard key, co-location, and online resharding rehearsed in staging.

XID wraparound and autovacuum freeze

Every row carries a 32-bit transaction ID (xmin). Once the cluster reaches 2^31 transactions (~2.1 billion), IDs wrap around and rows can appear to be “from the future.” Postgres protects against this by freezing old rows (rewriting xmin to a special “frozen” value) before wraparound. autovacuum_freeze_max_age controls the trigger.

On a write-heavy cluster at 10K TPS, that is ~864M transactions per day — the freeze runs every ~2.5 days. A misconfiguration (very high autovacuum_freeze_max_age) can lead to anti-wraparound emergency vacuums that hold an exclusive lock on the table.

Monitor: SELECT datname, age(datfrozenxid) FROM pg_database; — alert at 80% of autovacuum_freeze_max_age.

The anti-pattern catalogue when an act is skipped

  • Skip Act 1: a year later you spend three weeks renaming a column because the schema lacks a surrogate key and every FK must change.
  • Skip Act 2: every dashboard query is a 30-second seq-scan; the app team adds caches everywhere; the cache becomes the new source of truth and goes inconsistent.
  • Skip Act 3: you have indexes, but the planner ignores them half the time; you write a runbook that says “if slow, restart Postgres” because that flushes the plan cache.
  • Skip Act 4: the OOM-killer takes Postgres down on Friday; the postmortem says “we’ll add monitoring”; it never happens.
  • Skip Act 5: every deploy triggers a connection storm; CI now spreads pod rollouts across 10 minutes “as a workaround.”
  • Skip Act 6: migrations only run during a quarterly maintenance window because nothing else is safe; the queue of pending migrations grows to 40 entries.
  • Skip Act 7: you spend six months explaining to leadership that one tenant is using 40% of capacity and the only fix is a re-architecture.
Key Postgres observability thresholds
Cache hit ratio — alert threshold
below 95–99%
n_dead_tup / n_live_tup — bloat alert
above 20%
age(datfrozenxid) — wraparound alert
80% of autovacuum_freeze_max_age
PgBouncer SHOW STATS: avg_wait_time alert
above 5 ms sustained
pg_stat_statements total_time alert
queries > 1s consuming > 5% of total
Replication lag alert
above 30 seconds on async replica
Why this works

The seven-act model is both a growth framework (design levers in order) and a triage framework (apply fixes in reverse of failure acuteness). During an outage, you do not blame the schema (Act 1); you kill long transactions (Act 4), then deploy a pooler (Act 5), then optimize queries (Act 2–3). Once stability is restored, you postmortem and write runbooks for each act to prevent the next incident. The order of acts is not just a design constraint — it is an emergency protocol.

Quiz

pg_stat_user_tables shows n_dead_tup = 80M on a table with n_live_tup = 100M. What is the correct diagnosis and first lever?

Quiz

What is the role of backend_xmin in pg_stat_activity for diagnosing bloat?

Quiz

Why does hot_standby_feedback = on shift bloat from the replica to the primary?

Recall before you leave
  1. 01
    Map the USE method to the Postgres signals that indicate each of the seven acts.
  2. 02
    You inherit a production Postgres at 200M rows, p95 1500 ms, 600 backends, 40 GB bloat, no pooler. Give the triage order for the first three weeks and justify each step.
  3. 03
    What is XID wraparound, why is it a correctness cliff (not a performance issue), and how do you monitor it?
Recap

The USE + RED framework maps directly to the seven acts: high backend count is Act 5, high dead-tuple ratio is Act 4, a specific query’s rising duration is Act 2 or 3, and diverging per-shard tail latency is Act 7. Each act has its diagnostic tool — pg_stat_user_tables for bloat, pg_stat_activity for connection storms, pg_stat_statements for slow queries, pg_locks for migration blockers, and application-level shard metrics for Act 7. When inheriting a degraded Postgres, the correct triage order is Act 4 first (kill long transactions, stop bleeding), then Act 5 (deploy pooler), then Acts 3 and 2 (plan quality), then structural Acts 1 and 7. XID wraparound is a correctness cliff — it does not slow queries, it corrupts them — monitor age(datfrozenxid) per database and alert at 80% of autovacuum_freeze_max_age. The seven-act model is simultaneously a growth framework and an emergency protocol: the order of acts is the order of triage when everything is on fire.

Connected lessons
appears again in263
Continue the climb ↑Database track: multiple-choice synthesis
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources6
expand
  1. 01
  2. 02
  3. 03
  4. 04
  5. 05
  6. 06

Trademarks belong to their respective owners. Editorial reference only.