Engineering Practice
Postmortems: write and review a blameless retro
Reading about blameless postmortems is not the same as writing one that survives a senior review. Take a real incident (or the seeded one below), write the full document, then run the same critique pass a staff engineer would — hunting for blame language, a collapsed root cause, and action items nobody can actually ship.
Turn the unit’s model into a repeatable artifact and review loop: build a neutral timeline, quantify impact, identify two to five contributing factors instead of one root cause, write owned and dated action items, and prove the document would change the future, not just document the past.
Author a complete blameless postmortem for one incident, then review it against a checklist that catches the three failure modes — blame language, single-root-cause collapse, and untrackable action items — and rewrite until it passes.
- A review checklist applied to your own document, with a pass/fail per row: timeline is neutral and timestamped; impact is quantified; there are 2–5 systemic contributing factors and no single human root cause; every action item has one owner, a date, and a definition of done; a follow-through plan exists.
- A short before/after note showing at least one sentence you rewrote to remove blame language or to break a single-root-cause framing, with the rewrite beside it.
- Every action item passes the trackable test: a stranger could read it and know who owns it, when it is due, and how to tell it is done. No 'be more careful'-style items remain.
- A one-paragraph rationale for the severity trigger you chose and why running a full retro for every blip would dilute follow-through on the incidents that matter.
- Turn your checklist into a reusable postmortem template (timeline / impact / contributing factors / action items / follow-through) and a one-page reviewer guide for catching the three failure modes.
- Add a lightweight action-item tracker (a sheet or issue label) and define the 30/60/90-day report that surfaces overdue items and computes the completion rate against an 85%+ target.
- Write the same incident twice — once blameful ('human error, added a checklist') and once blameless — and annotate exactly which information the blameful version destroys and which recurrence it fails to prevent.
- Facilitate a 30-minute mock review of your postmortem with a peer using only 'how' questions; capture which new contributing factors surfaced that your solo write-up missed.
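The 30/60/90-day report described in the list above can be prototyped in a few lines before you commit to a sheet or issue label. The sketch below is one possible shape, not a prescribed tool. The `ActionItem` fields, the `follow_through_report` function, and the sample items are all hypothetical. It also assumes a particular definition of completion rate: of the items due by a checkpoint, the share that were closed by that checkpoint.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ActionItem:
    # Hypothetical record shape; adapt the fields to your tracker.
    title: str
    owner: str
    due: date
    done_on: Optional[date] = None  # None while the item is still open


def follow_through_report(items, incident_date, today, target=0.85):
    """Summarise completion and overdue items at each 30/60/90-day checkpoint.

    Assumption: completion rate = items closed by the checkpoint divided by
    items due by the checkpoint. Checkpoints still in the future are skipped.
    """
    report = []
    for days in (30, 60, 90):
        checkpoint = incident_date + timedelta(days=days)
        if checkpoint > today:
            break  # this checkpoint has not happened yet
        due = [i for i in items if i.due <= checkpoint]
        overdue = [i.title for i in due
                   if i.done_on is None or i.done_on > checkpoint]
        rate = (len(due) - len(overdue)) / len(due) if due else 1.0
        report.append({
            "day": days,
            "completion_rate": rate,
            "meets_target": rate >= target,
            "overdue": overdue,
        })
    return report


# Illustrative data only; none of this comes from a real incident.
items = [
    ActionItem("Add config schema check", "dana", date(2024, 3, 20), date(2024, 3, 18)),
    ActionItem("Add canary stage", "lee", date(2024, 3, 29)),
    ActionItem("Page on queue depth", "sam", date(2024, 4, 25), date(2024, 4, 20)),
]
report = follow_through_report(items, incident_date=date(2024, 3, 1), today=date(2024, 5, 5))
for row in report:
    print(row)
```

Listing the overdue titles next to the rate is deliberate: a single percentage hides which item is stuck, and the overdue list is what a reviewer acts on.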
This is the artifact and review loop you will run after every incident that meets your severity trigger: build a neutral timeline, quantify the impact, name two to five systemic contributing factors instead of one human root cause, run an ‘infinite hows’ pass to surface conditions, and write action items that are specific, owned, and dated. Then review your own document for the three failure modes (blame language, single-root-cause collapse, untrackable items) and protect the follow-through with a severity trigger and 30/60/90-day tracking. Doing it once deliberately is what turns the production version, written under pressure at 2 a.m., into muscle memory.
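Parts of the self-review can be mechanised as a first pass before a human reads the document. The sketch below is a rough lint, assuming a hypothetical `ActionItem` shape and short, hand-picked phrase lists (`VAGUE_PATTERNS`, `BLAME_PATTERNS`) that you would tune to your own team's writing. It flags action items that fail the trackable test and sentences containing common blame phrasing. It cannot detect a single-root-cause collapse, and it does not replace the reviewer.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical phrase lists; extend them with wording from your own retros.
VAGUE_PATTERNS = [r"\bbe more careful\b", r"\bpay (more|closer) attention\b",
                  r"\bdouble[- ]check\b"]
BLAME_PATTERNS = [r"\bhuman error\b", r"\bshould have\b", r"\bfailed to\b",
                  r"\bneglected to\b", r"\bcareless(ly|ness)?\b"]


@dataclass
class ActionItem:
    # Hypothetical record shape for one action item.
    text: str
    owner: Optional[str] = None
    due: Optional[date] = None
    definition_of_done: Optional[str] = None


def trackable_problems(item):
    """Return the reasons an item fails the trackable test (empty list = pass)."""
    problems = []
    if not item.owner:
        problems.append("no owner")
    elif re.search(r",|/|&|\band\b", item.owner):
        problems.append("more than one owner")
    if item.due is None:
        problems.append("no due date")
    if not item.definition_of_done:
        problems.append("no definition of done")
    for pattern in VAGUE_PATTERNS:
        match = re.search(pattern, item.text, re.IGNORECASE)
        if match:
            problems.append(f"vague wording: '{match.group(0).lower()}'")
    return problems


def blame_hits(sentence):
    """Return the blame phrases found in a sentence (empty list = none found)."""
    hits = []
    for pattern in BLAME_PATTERNS:
        match = re.search(pattern, sentence, re.IGNORECASE)
        if match:
            hits.append(match.group(0).lower())
    return hits


# Illustrative examples only.
bad = ActionItem("Engineers should be more careful with config changes")
good = ActionItem(
    "Add schema validation to the config deploy pipeline",
    owner="dana",
    due=date(2024, 4, 15),
    definition_of_done="Invalid config is rejected in CI and a test covers it",
)
before = "The on-call engineer failed to notice the alert."
after = "The alert fired in a low-priority channel that does not page on-call."

print(trackable_problems(bad))
print(trackable_problems(good))
print(blame_hits(before), blame_hits(after))
```

A clean result here only means the obvious phrasing is gone. A sentence can pass the phrase check and still pin the incident on one person, so the 'how' questions in a live review remain the real test.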