Observability

The incident loop: from pager to postmortem to prevention

Crux: The full production incident loop, from T+0 to T+1 week, plus the cultural mechanisms (blameless postmortem, runbooks, game days, error budget policy) that compound MTTR improvement over time.

Two teams have identical observability tooling. Team A’s MTTR is 45 minutes. Team B’s is 8 minutes. The difference is not dashboards, not vendor, not headcount. Team B has a blameless postmortem culture, a runbook on every alert, a signed error budget policy, and monthly game days. Tools collect data; culture decides what to do with it.

The full incident loop, end to end

A production incident resolved correctly looks like this:

T+0: SLO burn-rate alert fires, paging the on-call. The alert payload contains: service name, SLO id, current burn rate, time window, and four deeplinks — RED dashboard, trace view filtered to the burn window, profile view filtered to the burn window, runbook.

T+30 s: On-call acks via PagerDuty. The first deeplink (RED) auto-opens.

T+1 min: On-call reads RED’s three panels and identifies which of Rate / Errors / Duration moved and the shape (spike vs drift vs plateau).

T+1 min 30 s: On-call clicks the trace deeplink, sees 5–10 representative slow or errored traces, identifies which service and which span has the bulk of the latency.

T+2 min: On-call clicks the profile deeplink (pre-filtered by the trace ID from the previous step), sees the flame graph, and identifies the widest leaf frame: the function that consumed the time.

T+2 min 30 s: git blame on the function reveals commit, author, date. Cross-reference with the deploy timeline; cause confirmed.

T+3 min: Rollback initiated or hotfix drafted.

T+5–10 min: Burn rate returns to baseline. Alert clears.

T+1 h: Blameless postmortem document created with timeline and root cause. Action items filed.

T+1 day: Action items begin work. Runbook updated with the new pattern.

T+1 week: Action items complete. The next incident of this class is prevented.

The loop is reproducible. It gets faster with practice. It does not require heroics.
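
The T+0 payload is what makes the rest of the timeline possible, so it helps to see one spelled out. The following is a minimal sketch, assuming a hypothetical observability host (`obs.example.com`) and invented URL paths and field names; a real alerting tool will have its own payload schema.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class BurnAlert:
    """Hypothetical shape of a fired SLO burn-rate alert."""
    service: str
    slo_id: str
    burn_rate: float
    window_start: str  # ISO 8601 start of the burn window
    window_end: str    # ISO 8601 end of the burn window


def build_payload(alert: BurnAlert, base_url: str = "https://obs.example.com") -> dict:
    """Build a page payload carrying the four deeplinks the funnel relies on.

    The host and URL paths are placeholders, not a real vendor API.
    """
    # Every view is pre-filtered to the same service and burn window.
    query = urlencode({
        "service": alert.service,
        "from": alert.window_start,
        "to": alert.window_end,
    })
    return {
        "service": alert.service,
        "slo_id": alert.slo_id,
        "burn_rate": alert.burn_rate,
        "window": {"from": alert.window_start, "to": alert.window_end},
        "links": {
            "red_dashboard": f"{base_url}/dashboards/red?{query}",
            "traces": f"{base_url}/traces?{query}",
            "profiles": f"{base_url}/profiles?{query}",
            "runbook": f"{base_url}/runbooks/{alert.slo_id}",
        },
    }
```

The design point is that the on-call never types a query: each link already carries the service and the burn window, so every click narrows the search instead of restarting it.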

Phase    | Time                | Action
Detect   | T+0                 | SLO burn alert fires, on-call paged
Diagnose | T+0 to T+3 min      | Follow funnel: RED → trace → profile → git blame
Resolve  | T+3 to T+10 min     | Rollback or hotfix; watch burn rate return
Learn    | T+1 h               | Blameless postmortem, action items filed
Prevent  | T+1 day to T+1 week | Action items complete; runbook updated
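
Both the Detect and Resolve phases hinge on the burn rate, so a worked definition may help. A burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows. The sketch below assumes that definition; the function names and the paging threshold are illustrative and should be replaced by whatever the team's alerting policy specifies.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being spent in a window.

    1.0 means the budget is consumed exactly at the sustainable pace;
    higher values mean the budget will run out before the SLO period ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate


def should_page(rate: float, threshold: float = 10.0) -> bool:
    """Page when the burn rate meets the threshold (10.0 is an arbitrary placeholder)."""
    return rate >= threshold
```

For example, a 99.9% SLO with 50 failures in 1,000 requests gives an error rate of 5% against an allowed 0.1%, a burn rate of roughly 50. Watching this number fall back toward 1.0 or below is what "burn rate returns to baseline" means in the Resolve phase.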

The five cultural mechanisms

Each technical piece in this unit only pays off when the team has the following in place.

1. Signed error budget policy. A written agreement — signed at director level — that freezes non-critical deploys when the error budget is exhausted. Without it, engineers ship anyway “just this once” and the SLO becomes a metric no one acts on. The policy is what makes the SLO a real contract between engineering and the business.

2. Blameless postmortem culture. Every SEV-1 and SEV-2 incident produces a postmortem within 24–48 hours. The document records: timeline, root cause (system failure, not personal failure), and concrete action items. Action items are tracked and completed like product work. Without this, the same incident recurs. With it, each incident makes the next class of incident either impossible or fast to diagnose.

3. Runbooks on every alert. Every alert links to a runbook owned by a named engineer and reviewed quarterly. The runbook contains: what the alert means, what the on-call should check first, what the likely causes are, and how to fix each. An on-call paged at 3 am who opens a good runbook for a recurring incident resolves it in minutes. An on-call with no runbook re-investigates from scratch every time.

4. Game days. Scheduled exercises where engineering injects a realistic fault (kill a pod, slow a downstream dependency, take down a region) and observes the on-call response: Did the funnel get followed? Did the runbook help? Did the alert fire fast enough? Each game day produces runbook updates and dashboard improvements. Teams that run monthly game days build muscle memory that converts 3 am incidents into 10-minute resolutions.

5. Cost reviews. Observability spend is audited quarterly the same way infra spend is audited. Each team sees its own signal volume, cardinality, and cost. Teams that leak budget get engineering attention before they become the next Datadog 2021 story ($680k → $2M in a week from one misconfigured metric).
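
The error budget policy in item 1 only bites when it is enforced mechanically, for instance as a gate in the deploy pipeline. The following is a minimal sketch of such a gate, assuming event counts are available for the current SLO window; the type and function names are hypothetical, and a real policy would define "critical" and the freeze conditions in writing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    """Hypothetical snapshot of one SLO over its current window."""
    slo_target: float   # e.g. 0.999
    total_events: int   # all events seen in the window so far
    bad_events: int     # events that violated the SLO


def budget_remaining(status: BudgetStatus) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    allowed_bad = (1.0 - status.slo_target) * status.total_events
    if allowed_bad == 0:
        return 0.0  # no traffic or a 100% target: treat the budget as empty
    return 1.0 - status.bad_events / allowed_bad


def deploy_allowed(status: BudgetStatus, critical: bool) -> bool:
    """Freeze non-critical deploys once the budget is spent.

    Critical deploys (for example the fix for the incident itself)
    always pass, matching a policy that freezes only non-critical work.
    """
    if critical:
        return True
    return budget_remaining(status) > 0.0
```

Putting the check in the pipeline removes the "just this once" conversation: the decision was made when the policy was signed, not at deploy time.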

The action-item flywheel

The action items from each postmortem, not the incident itself, are the org’s most valuable reliability asset. The pattern over 12 months:

  • Action items that recur across postmortems become higher-priority policy work (“we keep deploying schema changes without backwards compat” → “backwards compat is now required in CI”).
  • Pattern detection across postmortems (“60% of incidents come from one team’s deploy pipeline”) guides architectural investment.
  • Action-item completion rate becomes a team-health metric — tracked at the VP level, reviewed monthly.

An org that runs this flywheel for a year sees MTTR more than halved (45 → 20 min), incident count down 30%, observability cost flat or down despite 2x traffic growth, and team satisfaction up (fewer 3 am pages).
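
The pattern detection described above does not need special tooling; a tagged list of postmortems is enough. Here is a minimal sketch, assuming each postmortem record carries an owning team and a free-form cause tag. Both fields and all names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Postmortem:
    """Hypothetical minimal postmortem record."""
    incident_id: str
    owning_team: str
    cause_tag: str  # e.g. "schema-compat", "config-drift"


def recurring_causes(postmortems: list[Postmortem], min_count: int = 2) -> list[tuple[str, int]]:
    """Cause tags seen at least min_count times, most frequent first."""
    counts = Counter(pm.cause_tag for pm in postmortems)
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


def team_share(postmortems: list[Postmortem]) -> dict[str, float]:
    """Fraction of incidents attributed to each team's systems."""
    if not postmortems:
        return {}
    counts = Counter(pm.owning_team for pm in postmortems)
    total = len(postmortems)
    return {team: n / total for team, n in counts.items()}
```

A recurring cause tag is the signal to promote an action item into policy work, and a lopsided team share points at a pipeline or architecture worth investing in rather than at the people on that team.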

Why this works

Orgs with strong tooling and weak culture see MTTR stuck at 30+ minutes. Orgs with mediocre tooling and strong culture beat them on MTTR by 2–3x. The chapter exists to make the tooling table stakes so the cultural mechanisms have something to land on. Cultural fixes are harder to install than tools — they require management commitment and patience — but they compound forever. Tool upgrades depreciate; culture compounds.

What a mature incident culture looks like after 12 months
MTTR improvement (funnel + culture): 50–80% reduction
Incident count reduction (action-item flywheel): ~30% over 12 months
Postmortem completion target (SEV1/2): 100% within 48 h
Action-item completion target: ≥ 80% within 30 days
Game day cadence (mature org): Monthly minimum per region
Runbook coverage target: Every alert has a named owner
Quiz

A team's MTTR has been stuck at 25 minutes for a year despite multiple tool upgrades. What is the most likely missing piece?

Quiz

The same SEV-1 incident has fired four times in three months. Each time MTTR is 40–50 minutes. What does this pattern indicate?

Recall before you leave
  1. What is the blameless postmortem and why does it matter for MTTR over time?
  2. What must an SLO burn-rate alert payload contain for the funnel to work in under three minutes?
  3. Name the five cultural mechanisms and state what breaks if each one is absent.
Recap

The full incident loop runs from T+0 (alert fires) to T+1 week (action items complete and root cause prevented), with the funnel-driven diagnosis completing in under three minutes when deeplinks are embedded in the alert payload. Five cultural mechanisms make the loop compound: a signed error budget policy that actually freezes deploys, blameless postmortems that convert incidents into tracked action items, runbooks on every alert owned by a named engineer, monthly game days that maintain funnel-discipline muscle memory, and quarterly cost reviews that catch cardinality leaks before they become budget crises. The action-item flywheel is the compounding asset: each postmortem’s items make the next incident class either impossible or fast to diagnose. Teams with strong tooling and weak culture plateau at 30-minute MTTR; teams with mediocre tooling and strong culture beat them by 2–3x. Culture is harder to install than a new dashboard, but unlike dashboards it compounds forever.


Trademarks belong to their respective owners. Editorial reference only.