Blameless postmortems: turning outages into systems you fix, not people you blame

Blame makes people hide incidents, so you lose the learning. A blameless postmortem treats failure as data: timeline, impact, contributing factors (not one root cause), and owned action items that actually ship.

A deploy takes down checkout for 40 minutes. In the retro, a director asks “who pushed it?” The engineer who shipped it goes quiet, the timeline gets sanitized, and the report concludes “human error — added a deploy checklist.” Six weeks later the same class of outage recurs, because the real story — a CI gate that silently skipped on a flaky test, a config flag with no staging coverage, an alert that paged the wrong team — never made it into the room. Blame did not make the system safer. It made the next incident quieter.

Why blameless is an engineering decision, not just kindness

By the end of this lesson you will know why the standard “who did it?” retro makes the next incident worse, and what to run instead. The case for blameless is not that people deserve to be spared discomfort. It is that blame is information-destroying. The moment an engineer believes the retro is hunting for someone to punish, they stop volunteering the details that matter: the half-understood workaround, the alert they muted last week, the gut feeling they ignored under deadline. Google’s SRE book states the premise plainly — a blameless postmortem “assumes that everyone involved in an incident had good intentions and did the right thing with the information they had.” Etsy’s John Allspaw frames the same idea as a just culture: you treat the incident “as a source of data, not something embarrassing to shy away from.”

The mechanism is simple and brutal. Punish the messenger and you train your best engineers to stay quiet about risk. The result shows up not as fewer incidents but as fewer reported incidents, until one you cannot hide takes down checkout. A senior engineer reads a suspiciously clean incident history as a smell, not a trophy: it usually means the organization has driven failure underground rather than out of the system.

The structure: timeline, impact, contributing factors, action items

A postmortem that earns its cost has four load-bearing sections. The timeline is a neutral, timestamped account — detection, escalation, mitigation, recovery — written in past tense with no adjectives about competence. The impact quantifies the blast radius: minutes of downtime, requests failed, customers affected, revenue or SLO budget burned. Vague impact (“some users saw errors”) is how an outage gets under-prioritized and the fix never gets staffed.
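To make "SLO budget burned" concrete, the arithmetic can be sketched in a few lines. This is a minimal illustration, not a standard formula from any tool: the 99.7% availability target and the 30-day window are assumptions chosen for the example, and the function name is hypothetical.

```python
def budget_burned_fraction(downtime_min: float, slo_target: float,
                           window_days: int = 30) -> float:
    """Fraction of the availability error budget consumed by one outage."""
    window_min = window_days * 24 * 60
    # The error budget is the downtime the SLO tolerates over the window.
    budget_min = (1.0 - slo_target) * window_min
    if budget_min <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return downtime_min / budget_min


# Assumed numbers: a 40-minute outage against a hypothetical 99.7% SLO.
fraction = budget_burned_fraction(downtime_min=40, slo_target=0.997)
print(f"{fraction:.0%} of the 30-day error budget burned")
```

Under those assumptions a single 40-minute outage consumes roughly 31% of the budget, which is the kind of number that gets a fix staffed where "some users saw errors" does not.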

When you write a postmortem, the third section is where blameless and naive practice diverge: contributing factors, plural, not “the root cause.” Google’s guidance is two to five systemic causes — a process gap, a tooling limitation, a missing test, an alert routed wrong, a documentation hole — each framed as a property of the system, not a person. The fourth section, action items, is the only part that changes the future. Each item must be specific, have a single named owner, and carry a due date; an action item without an owner is a wish.

Section	Blameless / useful	Blameful / theater
Timeline	Neutral timestamps: detection → mitigation → recovery	“Engineer X carelessly pushed at 14:02”
Impact	40 min down, 12k failed checkouts, 30% of SLO budget	“Some users were affected”
Cause	2–5 contributing factors across people, tools, process	One root cause: “human error”
Action items	Specific, single owner, due date, tracked to done	“Be more careful” / “added a checklist”
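One way to keep the four sections honest is to treat the postmortem as a small data structure and check a draft before it is published. The sketch below is only an illustration of that idea: the class names, field names, and the `problems` check are all hypothetical, and the sample action item is invented for the example.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional


@dataclass
class TimelineEvent:
    at: datetime
    what: str  # neutral description, no judgments about competence


@dataclass
class ActionItem:
    description: str
    owner: str = ""              # a single named person
    due: Optional[date] = None


@dataclass
class Postmortem:
    title: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    impact: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the reasons this draft is not ready to publish."""
        found = []
        if not self.timeline:
            found.append("timeline is empty")
        if not self.impact.strip():
            found.append("impact is empty")
        if len(self.contributing_factors) < 2:
            found.append("fewer than two contributing factors")
        if not self.action_items:
            found.append("no action items")
        for item in self.action_items:
            if not item.owner.strip():
                found.append(f"action item has no owner: {item.description}")
            if item.due is None:
                found.append(f"action item has no due date: {item.description}")
        return found


draft = Postmortem(
    title="Checkout outage",
    # Placeholder date; only the 14:02 time echoes the table above.
    timeline=[TimelineEvent(datetime(2000, 1, 1, 14, 2), "Deploy reached production")],
    impact="40 min down, 12k failed checkouts",
    contributing_factors=[
        "CI gate skipped on a flaky test",
        "Config flag had no staging coverage",
    ],
    # Hypothetical action item, deliberately missing an owner and a due date.
    action_items=[ActionItem("Make the CI gate fail when a test is skipped")],
)
for problem in draft.problems():
    print(problem)
```

The draft passes the timeline, impact, and factor checks but is flagged twice for its action item, which is the "wish" case: no owner and no due date.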

Why “5 whys” and single root cause are the wrong model

The folk method is “5 whys”: ask why five times and you reach the root cause. For complex systems this is actively misleading. Allspaw’s essay The Infinite Hows — drawing on Sidney Dekker and Nancy Leveson — argues that asking “why” repeatedly drags you toward a single linear chain, hindsight bias, and ultimately blame, because each “why” demands a justification and justifications point at people. Real outages are multi-causal: the deploy, the flaky test that let the gate skip, the config flag with no staging coverage, and the alert misroute all had to line up. No single one “caused” it; their interaction did.

Allspaw’s reframe is to ask “how” instead of “why” — the infinite hows. “How” questions surface the conditions that made an action reasonable at the time, the hidden expert knowledge people used, the efficiency-versus-thoroughness tradeoffs they navigated under pressure. The shift from “why did you do that?” to “how did doing that make sense given what you knew?” is the difference between an interrogation and an investigation. You cannot fix a person; you can fix the system that made the dangerous path the easy one.

▸Why this works

“Root cause” is grammatically singular, and that is the trap — it nudges everyone to stop at the first plausible factor and pin it on whoever was nearest the keyboard. Dekker calls this the old view: error as the cause of failure. The new view treats error as a symptom of deeper systemic conditions. Switching from “the root cause” to “contributing factors” is not pedantry; it changes where the room looks.

The senior tradeoff: retros cost time and only pay off if items ship

A thorough postmortem is expensive: hours of engineer time writing the timeline, a cross-team review meeting, follow-up tracking. That cost is only justified if the action items actually get done, and the industry data here is grim. Reported norms put action-item completion below 50%, with many teams completing under 40% of items within 90 days, and repeat-incident rates landing around 35–50%. When items are written and filed with zero follow-through, you have paid the full price of the retro and bought nothing: the same outage recurs, and now your engineers also believe postmortems are theater.
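If you want to know where your own team sits against those figures, the 90-day completion rate is easy to compute from whatever tracker holds the action items. The sketch below assumes a hypothetical `TrackedItem` record with a filed date and an optional closed date; the sample dates are invented, and for simplicity it counts items that are still open as not completed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class TrackedItem:
    filed: date
    closed: Optional[date] = None  # None means the item is still open


def completion_rate(items: list[TrackedItem], within_days: int = 90) -> float:
    """Share of action items closed within `within_days` of being filed."""
    if not items:
        return 0.0
    window = timedelta(days=within_days)
    on_time = sum(
        1 for item in items
        if item.closed is not None and item.closed - item.filed <= window
    )
    return on_time / len(items)


# Invented sample: five items filed on the same day.
filed = date(2024, 1, 10)
items = [
    TrackedItem(filed, closed=date(2024, 1, 20)),  # closed after 10 days
    TrackedItem(filed, closed=date(2024, 3, 1)),   # closed after 51 days
    TrackedItem(filed, closed=date(2024, 6, 1)),   # closed, but past 90 days
    TrackedItem(filed),                            # still open
    TrackedItem(filed),                            # still open
]
rate = completion_rate(items)
print(f"{rate:.0%} of action items shipped within 90 days")
```

Here two of five items landed inside the window, so the team in this made-up sample sits at 40%.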

Ask yourself: if you filed this postmortem today, what is the chance every action item ships within 90 days? Raising that number is the point of the senior move: ration the ceremony and protect the follow-through. Not every blip earns a full postmortem; teams set a severity trigger (an SLO breach, customer-visible downtime past N minutes, a sev1/sev2). Google’s SRE workbook recommends publishing fast (aim for under 48 hours, while memory is fresh) and tracking action items to closure like any other prioritized work, with 30/60/90-day follow-ups and a target completion rate well above 85%. The postmortem document is the cheap part; the expensive, valuable part is the work it commits you to.
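A severity trigger works best when it is written down as a rule rather than argued case by case. The sketch below shows one possible shape for such a rule; the 15-minute threshold, the severity labels that qualify, and all of the names are assumptions for illustration, and each team would set its own.

```python
from dataclasses import dataclass

# Hypothetical thresholds, standing in for the "N minutes" a team agrees on.
MAX_VISIBLE_DOWNTIME_MIN = 15
FULL_POSTMORTEM_SEVERITIES = {"sev1", "sev2"}


@dataclass
class Incident:
    severity: str
    customer_visible_downtime_min: float
    slo_breached: bool


def needs_full_postmortem(incident: Incident) -> bool:
    """True if any one of the agreed triggers fires."""
    return (
        incident.severity in FULL_POSTMORTEM_SEVERITIES
        or incident.slo_breached
        or incident.customer_visible_downtime_min > MAX_VISIBLE_DOWNTIME_MIN
    )


checkout_outage = Incident("sev3", customer_visible_downtime_min=40, slo_breached=False)
brief_blip = Incident("sev4", customer_visible_downtime_min=2, slo_breached=False)

print(needs_full_postmortem(checkout_outage))  # True: downtime exceeds the threshold
print(needs_full_postmortem(brief_blip))       # False: no trigger fires
```

Any single trigger is enough, so a mislabeled severity does not let a 40-minute customer-visible outage slip past without a review.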

Pick the best fit

A 40-minute checkout outage just resolved. How should the team run the retro?

Quiz

Why does a blame-driven incident review tend to make systems less safe over time?

Quiz

A retro concludes with one root cause: 'human error, added a checklist.' What is the senior critique?

Order the steps

Order the sections of a blameless postmortem as you build it:

1 Timeline: neutral, timestamped account of detection → mitigation → recovery
2 Impact: quantify blast radius — minutes down, requests failed, SLO budget burned
3 Contributing factors: 2–5 systemic causes across people, tools, and process
4 Action items: specific, single owner, due date
5 Follow-through: track each item to closure with 30/60/90-day checks

The order matters. During the incident: detect, mitigate (restore first, diagnose second), resolve, and only then hold the blameless postmortem. In the document itself, contributing factors replace a single root cause, and tracked action items are the only part that changes the system.

Recall before you leave

01 Explain to a teammate why 'who pushed the bad deploy?' is the wrong opening question, and what to ask instead.
02 Why is a postmortem with action items but no follow-through worse than no postmortem at all?

Recap

Blameless postmortems are an engineering decision, not a courtesy: blame destroys the information you need, because punished engineers learn to hide incidents and sanitize timelines, leaving you with fewer reported failures rather than fewer real ones. A useful postmortem has four load-bearing parts — a neutral timestamped timeline, a quantified impact, two to five contributing factors instead of a single root cause, and specific action items each with one owner and a due date. The “5 whys” and single-root-cause habit is the wrong model for complex systems, which fail multi-causally; Allspaw’s Infinite Hows reframes the inquiry from “why” (which points at people) to “how” (which surfaces the conditions that made the dangerous path the easy one). The senior tradeoff is that retros are expensive and only pay off when items actually ship, so you ration the ceremony with a severity trigger, publish fast, and track every action item to closure — because a postmortem filed with zero follow-through costs the full price and buys nothing while the same outage recurs. Now when you sit down after the next outage, the first question is not “who pushed it?” but “what did the system make easy that should have been hard?”

