Crux Read real incident timelines, log lines, action-item lists, and a bad postmortem draft, then pick the senior critique or the highest-leverage fix.
A postmortem reaches you as a set of artifacts: a timeline, a log line, an action-item list, a draft document. Read each one the way a senior reviewer does, and pick the move that turns the postmortem from theater into a record of a system that actually gets fixed.
Goal
Practise the review loop you run on every retro draft: scan the timeline for blame language, check the analysis for single-root-cause collapse, and judge whether each action item can actually be tracked to closure.
Artifact 1 — the timeline draft
14:02 Engineer J carelessly deploys release 4.18 straight to prod
14:05 Site goes down because J skipped the staging step
14:31 J finally notices the alert and starts rolling back
14:46 Service restored after J reverts the change
Quiz
What is wrong with this timeline as the basis for a blameless postmortem?
Heads-up Length is not the defect. A four-event timeline can be fine; the problem here is the adjectives. Neutral phrasing matters far more than entry count.
Heads-up More technical detail can help, but the blocking defect is the blame-loaded language. Even with the exact command, 'carelessly' and 'finally' would still poison the retro.
Heads-up A neutral 'release 4.18 was deployed to prod at 14:02' carries the same fact without the blame. 'Carelessly' and 'skipped' are judgements, and they teach the next engineer to sanitize their timeline.
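The scan for blame language can be partly mechanized as a first pass before human review. The sketch below is a minimal illustration, not a complete linter: the word list `BLAME_WORDS` and the function name `flag_blame_language` are assumptions made for this example, and a real team would tune the list to its own writing habits.

```python
import re

# Hypothetical word list for illustration; judgement words vary by team.
BLAME_WORDS = {"carelessly", "finally", "skipped", "forgot", "failed to", "should have"}


def flag_blame_language(timeline_lines):
    """Return (line, matched_words) pairs for entries that contain judgement words."""
    findings = []
    for line in timeline_lines:
        lowered = line.lower()
        hits = sorted(
            word
            for word in BLAME_WORDS
            if re.search(r"\b" + re.escape(word) + r"\b", lowered)
        )
        if hits:
            findings.append((line, hits))
    return findings


timeline = [
    "14:02 Engineer J carelessly deploys release 4.18 straight to prod",
    "14:05 Site goes down because J skipped the staging step",
    "14:31 J finally notices the alert and starts rolling back",
    "14:46 Service restored after J reverts the change",
    "14:02 Release 4.18 was deployed to prod",  # neutral rewrite of the first entry
]

findings = flag_blame_language(timeline)
for line, hits in findings:
    print(f"{hits}: {line}")
```

A word list only catches the obvious adjectives. It will not notice that every entry names J as the actor, so the reviewer still has to read the timeline for framing.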
Artifact 2 — the analysis section
WHY did checkout fail? The new release crashed on startup.
WHY did it crash? A required env var was unset in prod.
WHY was it unset? The engineer forgot to add it.
WHY did they forget? They were rushing before a meeting.
ROOT CAUSE: human error (engineer rushed).
FIX: tell the team not to rush.
Quiz
A five-whys chain landed on 'human error, tell the team not to rush.' What is the senior reframe?
Heads-up More whys deepen the same linear chain and push harder toward the person. The Infinite Hows critique is that the why-chain itself is the problem; you switch to 'how' to surface multiple conditions, not drill for one deeper cause.
Heads-up 'Do not rush' is unassignable, untrackable, and ignores that the system let an unvalidated config reach prod. It is the canonical non-action item the unit warns against.
Heads-up That is the exact blame move that destroys reporting. The engineer forgetting is a symptom; the system that deployed an unset required variable with no guard is the fixable cause.
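To make "no guard" concrete, here is a minimal sketch of a required-config check. The variable names in `REQUIRED_ENV_VARS`, the `MissingConfigError` class, and both function names are hypothetical; the incident write-up does not say which variable was unset or how the service loads its config.

```python
import os

# Hypothetical variable names, for illustration only.
REQUIRED_ENV_VARS = ("PAYMENTS_API_URL", "PAYMENTS_API_KEY")


class MissingConfigError(RuntimeError):
    """Raised when required configuration is absent."""


def missing_required(env, required=REQUIRED_ENV_VARS):
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def validate_config(env=None, required=REQUIRED_ENV_VARS):
    """Fail fast, naming every missing variable, instead of crashing later."""
    env = os.environ if env is None else env
    missing = missing_required(env, required)
    if missing:
        raise MissingConfigError("missing required config: " + ", ".join(missing))


# Example: one of the two required variables is absent.
print(missing_required({"PAYMENTS_API_URL": "https://example.invalid"}))
```

Run at startup, a check like this turns an opaque crash into a named failure. Run as a pre-deploy step against the target environment's config, the same function can stop the release before it reaches prod, which is the systemic condition the why-chain never surfaced.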
Artifact 3 — the action-item list
AI-1  Improve deploy reliability  owner: the team
AI-2  Add env-var validation at service startup  owner: Priya  due: 2026-06-10
AI-3  Be more careful with prod config  owner: -  due: -
AI-4  Add a staging gate that blocks deploys on missing required config; alert #payments on fail  owner: Sam  due: 2026-06-20
Quiz
Triaging this list before closing the retro, which items are real and which must be rewritten or dropped?
Heads-up Intent is not an action item. AI-1 is owned by 'the team' (nobody) and AI-3 has no owner, no date, and no definition of done — both are wishes that cannot be tracked to closure.
Heads-up Backwards. AI-1 is the vaguest line in the list. The narrow, owned, dated items (AI-2, AI-4) are exactly the ones that change the future.
Heads-up AI-4 is the strongest item — it adds a systemic guard with an owner and a date. Deleting it to shorten the list would remove the one change most likely to prevent recurrence.
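The owner and date checks in this triage are mechanical enough to sketch in code. The example below assumes a hypothetical `ActionItem` record and a hypothetical `NON_OWNERS` set of placeholder values; it does not try to judge whether an item is specific, which still takes a human reader.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    id: str
    text: str
    owner: Optional[str]
    due: Optional[date]


# Hypothetical placeholders that fill the owner field but name nobody.
NON_OWNERS = {"", "-", "the team", "team", "everyone", "tbd"}


def triage(item):
    """Return the reasons an item cannot be tracked to closure (empty if none)."""
    problems = []
    owner = (item.owner or "").strip().lower()
    if owner in NON_OWNERS:
        problems.append("no single named owner")
    if item.due is None:
        problems.append("no due date")
    return problems


items = [
    ActionItem("AI-1", "Improve deploy reliability", "the team", None),
    ActionItem("AI-2", "Add env-var validation at service startup", "Priya", date(2026, 6, 10)),
    ActionItem("AI-3", "Be more careful with prod config", None, None),
    ActionItem(
        "AI-4",
        "Add a staging gate that blocks deploys on missing required config; alert #payments on fail",
        "Sam",
        date(2026, 6, 20),
    ),
]

for item in items:
    print(item.id, triage(item) or "trackable")
```

Passing this check is necessary, not sufficient: "Be more careful with prod config" would still be a wish even with an owner and a date attached, because nobody could say when it was done.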
Artifact 4 — the published postmortem summary
INCIDENT 412: Checkout outage
Severity: sev1
Duration: not recorded
Impact: some users had trouble checking out for a while
Root cause: bad deploy by on-call
Resolution: rolled back
Action items: none (issue resolved, no further action needed)
Quiz
This sev1 summary is about to be filed and closed. What is the most serious problem with publishing it as-is?
Heads-up Polish is irrelevant. The substantive failures are unquantified impact, a blame-shaped single root cause, and no action items — a longer document with the same content would be just as useless.
Heads-up Recording sev1 is correct and fast publishing is encouraged. The real defects are the missing quantified impact and the absence of any tracked action items.
Heads-up 'Resolved' means service is back, not that the system is fixed. A sev1 with no contributing factors and no action items is the textbook setup for recurrence the unit warns against.
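One way to keep a summary like this from being filed is a pre-publish check that lists what is missing. The sketch below is an assumption-heavy illustration: the dictionary keys (`duration_minutes`, `users_affected`, `contributing_factors`, `action_items`) and the function name `publish_blockers` are invented for this example and do not come from any particular incident tool.

```python
def publish_blockers(summary):
    """Return the reasons a postmortem summary should not be published yet."""
    blockers = []
    if summary.get("duration_minutes") is None:
        blockers.append("duration not recorded")
    if summary.get("users_affected") is None:
        blockers.append("impact not quantified")
    if len(summary.get("contributing_factors", [])) < 2:
        blockers.append("single root cause, no contributing factors")
    if not summary.get("action_items"):
        blockers.append("no tracked action items")
    return blockers


# Incident 412 as drafted, expressed with the hypothetical keys above.
incident_412 = {
    "severity": "sev1",
    "duration_minutes": None,       # "not recorded"
    "users_affected": None,         # "some users ... for a while"
    "contributing_factors": ["bad deploy by on-call"],
    "action_items": [],
}

for blocker in publish_blockers(incident_412):
    print("BLOCKED:", blocker)
```

The check cannot tell whether "bad deploy by on-call" is blame-shaped; it only notices that a single factor is listed. Reading the root-cause line for blame remains the reviewer's job.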
Recap
Every postmortem artifact is reviewed the same way. A timeline must be neutral and timestamped, not editorialized with 'carelessly' and 'finally'. A five-whys chain that lands on 'human error' is the cue to switch from 'why' to 'how' and surface systemic conditions. An action item is only real if it is specific, singly owned, and dated. A published sev1 with vague impact and zero action items is theater that sets up recurrence. Read for blame language, single-root-cause collapse, and untrackable items: those are the three defects that turn a retro into box-checking.