Engineering Practice

Postmortems: artifact reading and critique

Crux: Read real incident timelines, log lines, action-item lists, and a bad postmortem draft, then pick the senior critique or the highest-leverage fix.

A postmortem is read as an artifact — a timeline, a log line, an action-item list, a draft document. Read each one the way a senior reviewer does, and pick the move that turns it from theater into a system that gets fixed.

Goal

Practise the review loop you run on every retro draft: scan the timeline for blame language, check the analysis for single-root-cause collapse, and judge whether each action item can actually be tracked to closure.
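The first pass of that loop, scanning for blame language, can be partly mechanised. The sketch below is one possible approach, not a standard tool: `BLAME_WORDS`, `flag_blame_language`, and the sample timeline are hypothetical names and invented data chosen for illustration. A keyword list will miss blame phrased in other ways, so treat it as a prompt for a human read rather than a replacement for one.

```python
import re

# Hypothetical word list; extend it with the judgment words your own team uses.
BLAME_WORDS = ("carelessly", "finally", "forgot", "skipped", "failed to", "should have")


def flag_blame_language(timeline_lines):
    """Return (line, matched_words) pairs for entries that editorialize."""
    findings = []
    for line in timeline_lines:
        lowered = line.lower()
        hits = [
            word
            for word in BLAME_WORDS
            if re.search(r"\b" + re.escape(word) + r"\b", lowered)
        ]
        if hits:
            findings.append((line, hits))
    return findings


# Invented sample entries, not taken from a real incident.
sample = [
    "09:00  Release deployed to prod",
    "09:29  Engineer finally notices the alert",
]

for line, hits in flag_blame_language(sample):
    print(f"{line!r} flagged for: {', '.join(hits)}")
```

A flagged line is a candidate for rewriting as a neutral, timestamped observation: what happened and when, without adverbs that assign fault.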

Artifact 1 — the timeline draft

14:02  Engineer J carelessly deploys release 4.18 straight to prod
14:05  Site goes down because J skipped the staging step
14:31  J finally notices the alert and starts rolling back
14:46  Service restored after J reverts the change
Quiz

What is wrong with this timeline as the basis for a blameless postmortem?

Artifact 2 — the analysis section

WHY did checkout fail?      The new release crashed on startup.
WHY did it crash?           A required env var was unset in prod.
WHY was it unset?           The engineer forgot to add it.
WHY did they forget?        They were rushing before a meeting.
ROOT CAUSE: human error (engineer rushed). FIX: tell the team not to rush.
Quiz

A five-whys chain landed on 'human error, tell the team not to rush.' What is the senior reframe?

Artifact 3 — the action-item list

AI-1  Improve deploy reliability                      owner: the team
AI-2  Add env-var validation at service startup       owner: Priya   due: 2026-06-10
AI-3  Be more careful with prod config                owner: -       due: -
AI-4  Add a staging gate that blocks deploys on
      missing required config; alert #payments on fail owner: Sam     due: 2026-06-20
Quiz

Triaging this list before closing the retro, which items are real and which must be rewritten or dropped?

Artifact 4 — the published postmortem summary

INCIDENT 412 — Checkout outage
Severity: sev1     Duration: not recorded
Impact: some users had trouble checking out for a while
Root cause: bad deploy by on-call
Resolution: rolled back
Action items: none — issue resolved, no further action needed
Quiz

This sev1 summary is about to be filed and closed. What is the most serious problem with publishing it as-is?
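A simple completeness check can catch gaps in a summary like this one before it is filed. The following is a minimal sketch under stated assumptions: the `Summary` fields, the `VAGUE_IMPACT` phrase list, and the rule that a sev1 must carry at least one action item are illustrative choices, not a policy taken from any particular incident process.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical phrases that signal an unquantified impact statement.
VAGUE_IMPACT = ("some users", "for a while", "briefly", "a few")


@dataclass
class Summary:
    severity: str
    duration_minutes: Optional[int]  # None means "not recorded"
    impact: str
    action_items: List[str] = field(default_factory=list)


def filing_problems(summary):
    """List the reasons a summary is not ready to be filed and closed."""
    problems = []
    if summary.duration_minutes is None:
        problems.append("duration not recorded")
    impact = summary.impact.lower()
    has_number = any(ch.isdigit() for ch in impact)
    if not has_number or any(phrase in impact for phrase in VAGUE_IMPACT):
        problems.append("impact is not quantified")
    if summary.severity.lower() == "sev1" and not summary.action_items:
        problems.append("sev1 has no action items")
    return problems


draft = Summary(
    severity="sev1",
    duration_minutes=None,
    impact="some users had trouble checking out for a while",
)
for problem in filing_problems(draft):
    print("BLOCKED:", problem)
```

The check only tests for presence and rough shape. Whether the stated impact and the listed action items are actually adequate still needs a reviewer's judgment.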

Recap

Every postmortem artifact is reviewed the same way: a timeline must be neutral and timestamped, not editorialized with ‘carelessly’ and ‘finally’; a five-whys chain that lands on ‘human error’ is the cue to switch from ‘why’ to ‘how’ and surface systemic conditions; an action item is only real if it is specific, singly owned, and dated; and a published sev1 with vague impact and zero action items is theater that guarantees recurrence. Read for blame language, single-root-cause collapse, and untrackable items — those are the three defects that turn a retro into box-checking.
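The "specific, singly owned, and dated" test for action items is mechanical enough to sketch in code. The example below applies it to the four items from Artifact 3. The `ActionItem` type, the `VAGUE_OPENERS` and `GROUP_OWNERS` lists, and the `triage` function are hypothetical and deliberately crude: a real review would judge specificity by reading the item, not by matching its first words.

```python
from datetime import date
from typing import NamedTuple, Optional


class ActionItem(NamedTuple):
    item_id: str
    text: str
    owner: Optional[str]
    due: Optional[date]


# Hypothetical heuristics for illustration only.
VAGUE_OPENERS = ("improve", "be more careful", "be careful")
GROUP_OWNERS = ("the team", "team", "everyone", "-", "")


def triage(item):
    """Return the reasons an action item cannot be tracked to closure."""
    problems = []
    if item.text.lower().startswith(VAGUE_OPENERS):
        problems.append("not specific")
    if item.owner is None or item.owner.strip().lower() in GROUP_OWNERS:
        problems.append("no single owner")
    if item.due is None:
        problems.append("no due date")
    return problems


items = [
    ActionItem("AI-1", "Improve deploy reliability", "the team", None),
    ActionItem("AI-2", "Add env-var validation at service startup", "Priya", date(2026, 6, 10)),
    ActionItem("AI-3", "Be more careful with prod config", None, None),
    ActionItem("AI-4", "Add a staging gate that blocks deploys on missing required config", "Sam", date(2026, 6, 20)),
]

for item in items:
    problems = triage(item)
    status = "trackable" if not problems else "rewrite or drop: " + ", ".join(problems)
    print(f"{item.item_id}  {status}")
```

Under these heuristics, AI-2 and AI-4 pass, while AI-1 and AI-3 fail all three checks and must be rewritten or dropped.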

