Engineering Practice
Postmortems: multiple-choice review
Six questions that cut across the whole unit. Each one mirrors a judgement call you make in a real retro: the point is not to recite a definition but to tell fixing a system apart from blaming a person.
Confirm you can separate blame from systemic analysis: why blameless is an information decision, why complex failures are multi-causal, what makes an action item real, and where the five-whys habit fails.
A team has a remarkably clean incident history — almost no sev1 postmortems filed in a year. A new SRE lead reads this as a warning sign. Why?
A retro concludes: 'Root cause: engineer pushed an untested config. Action: added a deploy checklist.' What is the strongest senior critique?
Two proposed action items: (A) 'The team should be more careful when deploying.' (B) 'Add a staging smoke test that exercises the payment config path; owner Mara; due 2026-06-15.' Why is only B a real action item?
Allspaw's 'The Infinite Hows' argues for asking 'how' instead of 'why' in an investigation. What is the mechanism behind that preference?
A team writes thorough postmortems but ships under 40% of action items within 90 days; the same class of outage recurs. What does the unit say this combination means?
Your org wants a full postmortem for every production hiccup, including transient blips that self-resolve in seconds. What is the senior position on this policy?
The through-line of the unit is one stance: failure is data about the system, not evidence against a person. Five consequences follow. A clean incident history can signal suppressed reporting. A single human root cause hides the multi-causal reality. A real action item is specific, owned, and dated. Asking 'how' beats asking 'why' because it surfaces conditions instead of culprits. And the whole ceremony only pays off if you ration it with a severity trigger and track action items to a closure rate well above 85%.
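The "owned and dated" test and the 90-day closure figure can both be checked mechanically. The following is a minimal sketch, not a prescribed tool: the `ActionItem` fields, the function names, and the opened and closed dates are assumptions made for illustration. Only the two item descriptions, the owner, and the due date come from the questions above. Specificity still needs human judgement, so the code does not try to check it.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ActionItem:
    # Hypothetical record shape; adapt to whatever your tracker exports.
    description: str
    owner: Optional[str]
    due: Optional[date]
    opened: date
    closed: Optional[date] = None


def is_owned_and_dated(item: ActionItem) -> bool:
    """True when the item names an owner and carries a due date."""
    return bool(item.owner) and item.due is not None


def closure_rate(items: List[ActionItem], window_days: int = 90) -> float:
    """Fraction of items closed within window_days of being opened."""
    if not items:
        return 0.0
    window = timedelta(days=window_days)
    shipped = sum(
        1
        for item in items
        if item.closed is not None and item.closed - item.opened <= window
    )
    return shipped / len(items)


# Item A from the review: no owner, no date, never closed.
vague = ActionItem(
    description="The team should be more careful when deploying.",
    owner=None,
    due=None,
    opened=date(2026, 5, 1),  # assumed date
)

# Item B from the review: specific, owned, dated.
concrete = ActionItem(
    description="Add a staging smoke test that exercises the payment config path",
    owner="Mara",
    due=date(2026, 6, 15),
    opened=date(2026, 5, 1),   # assumed date
    closed=date(2026, 6, 10),  # assumed date
)

print(is_owned_and_dated(vague))       # False
print(is_owned_and_dated(concrete))    # True
print(closure_rate([vague, concrete])) # 0.5
```

Counting an unowned item as open forever is a deliberate choice in this sketch: an item nobody owns cannot ship, so it should drag the closure rate down rather than disappear from the denominator.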