Observability
The incident loop: from pager to postmortem to prevention
Two teams have identical observability tooling. Team A’s MTTR is 45 minutes. Team B’s is 8 minutes. The difference is not dashboards, not vendor, not headcount. Team B has a blameless postmortem culture, a runbook on every alert, a signed error budget policy, and monthly game days. Tools collect data; culture decides what to do with it.
The full incident loop, end to end
A production incident resolved correctly looks like this:
T+0: SLO burn-rate alert fires, paging the on-call. The alert payload contains: service name, SLO id, current burn rate, time window, and four deeplinks — RED dashboard, trace view filtered to the burn window, profile view filtered to the burn window, runbook.
T+30 s: On-call acks via PagerDuty. The first deeplink (RED) auto-opens.
T+1 min: On-call reads RED’s three panels and identifies which of Rate / Errors / Duration moved and the shape (spike vs drift vs plateau).
T+1 min 30 s: On-call clicks the trace deeplink, sees 5–10 representative slow or errored traces, identifies which service and which span has the bulk of the latency.
T+2 min: On-call clicks the profile deeplink (pre-filtered by the trace ID from the previous step), sees the flame graph, and identifies the widest leaf frame: the function that consumed the time.
T+2 min 30 s: git blame on the function reveals commit, author, date. Cross-reference with the deploy timeline; cause confirmed.
T+3 min: Rollback initiated or hotfix drafted.
T+5–10 min: Burn rate returns to baseline. Alert clears.
T+1 h: Blameless postmortem document created with timeline and root cause. Action items filed.
T+1 day: Action items begin work. Runbook updated with the new pattern.
T+1 week: Action items complete. The next incident of this class is prevented.
The loop is reproducible. It gets faster with practice. It does not require heroics.
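The T+0 alert payload can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not any vendor's schema: the field names, the `build_payload` helper, the `obs.example.com` base URL, and the URL paths are all hypothetical. Only the list of contents (service name, SLO id, burn rate, time window, four deeplinks) comes from the loop above.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class BurnAlertPayload:
    service: str
    slo_id: str
    burn_rate: float
    window_start: str  # ISO 8601 start of the burn window
    window_end: str    # ISO 8601 end of the burn window
    deeplinks: dict    # link name -> URL


def build_payload(service, slo_id, burn_rate, window_start, window_end,
                  base_url="https://obs.example.com"):  # hypothetical host
    """Build a payload carrying the four deeplinks the funnel relies on."""
    # Every view is pre-filtered to the same service and burn window,
    # so the on-call never has to set a time range by hand.
    window = urlencode({"service": service, "from": window_start, "to": window_end})
    deeplinks = {
        "red_dashboard": f"{base_url}/dashboards/red?{window}",
        "traces": f"{base_url}/traces?{window}",
        "profiles": f"{base_url}/profiles?{window}",
        "runbook": f"{base_url}/runbooks/{slo_id}",
    }
    return BurnAlertPayload(service, slo_id, burn_rate,
                            window_start, window_end, deeplinks)


payload = build_payload("checkout", "checkout-availability", 14.4,
                        "2024-01-01T03:00:00Z", "2024-01-01T04:00:00Z")
for name, url in payload.deeplinks.items():
    print(name, url)
```

The design point is that the window is computed once and reused in every link, which is what lets the on-call move from RED to traces to profiles without re-entering filters.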
| Phase | Time | Action |
|---|---|---|
| Detect | T+0 | SLO burn alert fires, on-call paged |
| Diagnose | T+0 to T+3 min | Follow funnel: RED → trace → profile → git blame |
| Resolve | T+3 to T+10 min | Rollback or hotfix; watch burn rate return |
| Learn | T+1 h | Blameless postmortem, action items filed |
| Prevent | T+1 day to T+1 week | Action items complete; runbook updated |
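The text does not define the burn rate that drives the Detect phase. A common definition, assumed here, is the observed error ratio divided by the error ratio the SLO allows. The sketch below uses that definition; the function names, the 30-day SLO period, and the sample numbers are illustrative assumptions.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget


def hours_to_exhaustion(rate: float, slo_period_days: int = 30) -> float:
    """Hours until the whole budget is spent if the current rate holds."""
    if rate <= 0:
        return float("inf")
    return slo_period_days * 24 / rate


# 144 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(errors=144, total=10_000, slo_target=0.999)
print(round(rate, 1))                       # 14.4
print(round(hours_to_exhaustion(rate), 1))  # 50.0
```

A burn rate of 1 spends the budget exactly over the SLO period; anything well above 1 is what justifies a page rather than a ticket.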
The five cultural mechanisms
Each technical piece in this unit only pays off when the team has the following in place.
1. Signed error budget policy. A written agreement — signed at director level — that freezes non-critical deploys when the error budget is exhausted. Without it, engineers ship anyway “just this once” and the SLO becomes a metric no one acts on. The policy is what makes the SLO a real contract between engineering and the business.
2. Blameless postmortem culture. Every SEV-1 and SEV-2 incident produces a postmortem within 24–48 hours. The document records: timeline, root cause (system failure, not personal failure), and concrete action items. Action items are tracked and completed like product work. Without this, the same incident recurs. With it, each incident makes the next class of incident either impossible or fast to diagnose.
3. Runbooks on every alert. Every alert links to a runbook owned by a named engineer and reviewed quarterly. The runbook contains: what the alert means, what the on-call should check first, what the likely causes are, and how to fix each. An on-call paged at 3 am who opens a good runbook for a recurring incident resolves it in minutes. An on-call with no runbook re-investigates from scratch every time.
4. Game days. Scheduled exercises where engineering injects a realistic fault (kill a pod, slow a downstream, take out a region) and observes the on-call response: Was the funnel followed? Did the runbook help? Did the alert fire fast enough? Each game day produces runbook updates and dashboard improvements. Teams that run monthly game days build muscle memory that converts 3 am incidents into 10-minute resolutions.
5. Cost reviews. Observability spend is audited quarterly the same way infra spend is audited. Each team sees its own signal volume, cardinality, and cost. Teams that leak budget get engineering attention before they become the next Datadog 2021 story ($680k → $2M in a week from one misconfigured metric).
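The error budget policy in mechanism 1 only works if something enforces it. One way to do that, sketched here as an assumption rather than a prescribed implementation, is a gate in the deploy pipeline. The `Deploy` type, the `critical` flag, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Deploy:
    service: str
    critical: bool  # e.g. a security patch or an incident fix


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left in the SLO period (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total
    if allowed_bad == 0:
        return 0.0
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def may_deploy(deploy: Deploy, remaining: float) -> bool:
    """Policy: non-critical deploys freeze once the budget is exhausted."""
    return deploy.critical or remaining > 0.0


# 1,500 bad requests against roughly 1,000 allowed: the budget is overspent.
remaining = budget_remaining(slo_target=0.999, good=998_500, total=1_000_000)
print(may_deploy(Deploy("checkout", critical=False), remaining))  # False
print(may_deploy(Deploy("checkout", critical=True), remaining))   # True
```

Putting the check in the pipeline removes the "just this once" decision from individual engineers, which is the failure mode the signed policy exists to prevent.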
The action-item flywheel
The most valuable reliability asset an incident produces is the postmortem's action items, not the incident itself. The pattern over 12 months:
- Action items that recur across postmortems become higher-priority policy work (“we keep deploying schema changes without backwards compat” → “backwards compat is now required in CI”).
- Pattern detection across postmortems (“60% of incidents come from one team’s deploy pipeline”) guides architectural investment.
- Action-item completion rate becomes a team-health metric — tracked at the VP level, reviewed monthly.
An org that runs this flywheel for a year sees: MTTR more than halved (45 → 20 min), incident count down 30%, observability cost flat or down despite 2x traffic growth, and team satisfaction up (fewer 3 am pages).
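Both flywheel signals, action-item completion rate and themes that recur across postmortems, can be computed from a flat list of action items. The sketch below assumes each item carries a free-form `theme` label and filing and completion dates. The `ActionItem` type, its fields, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    postmortem_id: str
    theme: str                 # hypothetical label, e.g. "schema-compat"
    filed: date
    completed: Optional[date]  # None while the item is still open


def completion_rate(items, within_days=30):
    """Share of items closed within `within_days` of being filed."""
    if not items:
        return 0.0
    done = sum(1 for i in items
               if i.completed is not None
               and (i.completed - i.filed).days <= within_days)
    return done / len(items)


def recurring_themes(items, min_postmortems=2):
    """Themes that appear in at least `min_postmortems` distinct postmortems."""
    seen = {(i.postmortem_id, i.theme) for i in items}
    counts = Counter(theme for _, theme in seen)
    return sorted(t for t, n in counts.items() if n >= min_postmortems)


items = [
    ActionItem("PM-1", "schema-compat", date(2024, 1, 1), date(2024, 1, 10)),
    ActionItem("PM-2", "schema-compat", date(2024, 2, 1), date(2024, 3, 15)),
    ActionItem("PM-2", "alert-tuning", date(2024, 2, 1), date(2024, 2, 5)),
    ActionItem("PM-3", "runbook-gap", date(2024, 3, 1), None),
]
print(completion_rate(items))   # 0.5
print(recurring_themes(items))  # ['schema-compat']
```

A theme that recurs, like `schema-compat` here, is the candidate for promotion from a one-off action item to policy work.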
Why this works
Orgs with strong tooling and weak culture see MTTR stuck at 30+ minutes. Orgs with mediocre tooling and strong culture beat them on MTTR by 2–3x. The chapter exists to make the tooling table stakes so the cultural mechanisms have something to land on. Cultural fixes are harder to install than tools — they require management commitment and patience — but they compound forever. Tool upgrades depreciate; culture compounds.
| Metric | Target or typical result |
|---|---|
| MTTR improvement (funnel + culture) | 50–80% reduction |
| Incident count reduction (action-item flywheel) | ~30% over 12 months |
| Postmortem completion target (SEV1/2) | 100% within 48 h |
| Action-item completion target | ≥ 80% within 30 days |
| Game day cadence (mature org) | Monthly minimum per region |
| Runbook coverage target | Every alert has a named owner |
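The runbook coverage target (every alert links to a runbook, has a named owner, and is reviewed quarterly) lends itself to an automated audit. This is a rough sketch, assuming alert rules can be exported with these fields. The `AlertRule` type, the `audit` function, and the 92-day approximation of a quarter are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AlertRule:
    name: str
    runbook_url: Optional[str]
    owner: Optional[str]
    last_reviewed: Optional[date]


def audit(rules, today, max_age_days=92):  # 92 days approximates one quarter
    """Return {alert name: [problems]} for rules that fail the runbook policy."""
    findings = {}
    for rule in rules:
        problems = []
        if not rule.runbook_url:
            problems.append("no runbook")
        if not rule.owner:
            problems.append("no named owner")
        if (rule.last_reviewed is None
                or (today - rule.last_reviewed).days > max_age_days):
            problems.append("review overdue")
        if problems:
            findings[rule.name] = problems
    return findings


rules = [
    AlertRule("checkout-burn", "https://wiki.example.com/rb/checkout", "alice",
              date(2024, 4, 15)),
    AlertRule("search-latency", None, None, None),
    AlertRule("auth-errors", "https://wiki.example.com/rb/auth", "bob",
              date(2023, 12, 1)),
]
for name, problems in audit(rules, today=date(2024, 6, 1)).items():
    print(name, problems)
```

Run on a schedule or in CI, a check like this turns the coverage target into something that fails visibly instead of decaying quietly.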
A team's MTTR has been stuck at 25 minutes for a year despite multiple tool upgrades. What is the most likely missing piece?
The same SEV-1 incident has fired four times in three months. Each time MTTR is 40–50 minutes. What does this pattern indicate?
1. What is the blameless postmortem and why does it matter for MTTR over time?
2. What must an SLO burn-rate alert payload contain for the funnel to work in under three minutes?
3. Name the five cultural mechanisms and state what breaks if each one is absent.
The full incident loop runs from T+0 (alert fires) to T+1 week (action items complete and root cause prevented), with the funnel-driven diagnosis completing in under three minutes when deeplinks are embedded in the alert payload. Five cultural mechanisms make the loop compound: a signed error budget policy that actually freezes deploys, blameless postmortems that convert incidents into tracked action items, runbooks on every alert owned by a named engineer, monthly game days that maintain funnel-discipline muscle memory, and quarterly cost reviews that catch cardinality leaks before they become budget crises. The action-item flywheel is the compounding asset: each postmortem’s items make the next incident class either impossible or fast to diagnose. Teams with strong tooling and weak culture plateau at 30-minute MTTR; teams with mediocre tooling and strong culture beat them by 2–3x. Culture is harder to install than a new dashboard, but unlike dashboards it compounds forever.