Engineering Practice
On-call: multiple-choice review
Six questions that cut across the whole unit. None asks you to recite a definition; each mirrors a decision a senior engineer makes about what should fire a pager at 4 a.m. and what must not.
Confirm you can connect the unit’s spine: actionable symptom/SLO alerts on the pager, everything else demoted, load capped, runbooks linked, and the rotation measured by % actionable.
A disk-usage threshold pages on-call 40 times a month and self-resolves nearly every time. Why is this more dangerous than having no alert at all?
You must decide what pages a human. Which condition belongs on the pager, and what is the principle?
Google SRE caps a shift at about two incidents and operational work at 50% of an SRE's time. What breaks if you ignore the caps?
Which metric most directly signals a rotation degrading toward burnout?
A page fires but the primary responder is stuck and hasn't acknowledged. What is the right structural mechanism, and what is the runbook's role here?
A change is proposed to reduce noise: snooze the noisiest alerts and add Alertmanager group_by, inhibit_rules, and a 30s group_wait. What does this buy you, and what does it not?
The through-line of the unit is one rule applied at every layer: a page must be actionable or it is deleted. Page on symptoms and SLO burn rate, not on causes like CPU or disk; demote the rest to tickets and dashboards. Cap load (≈2 incidents/shift, ≤50% ops time) so prevention work survives. Link a runbook to every page, escalate on a timer when a responder is stuck, and use grouping/inhibition to collapse storms, but remember that routing only de-duplicates notifications; it does not make a non-actionable alert actionable. Measure MTTA, MTTR, and page volume, and steer by the north star: % actionable.
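The four measurements named in the summary can be derived from a simple log of pages. The following is a minimal Python sketch, not a reference implementation. The `Page` record, its field names, and the `rotation_metrics` function are hypothetical, and it assumes that MTTA is measured from fire to acknowledgement, that MTTR is measured from fire to resolution, and that the `actionable` flag is set by the responder in a post-shift review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Page:
    """One page from the rotation log (hypothetical schema)."""
    fired_at: datetime
    acked_at: Optional[datetime]      # None if nobody acknowledged it
    resolved_at: Optional[datetime]   # None if still open
    actionable: bool                  # assumed: labeled by the responder afterwards


def rotation_metrics(pages: list[Page]) -> dict:
    """Return page volume, MTTA, MTTR (both in minutes), and % actionable."""
    acked = [p for p in pages if p.acked_at is not None]
    resolved = [p for p in pages if p.resolved_at is not None]

    def mean_minutes(deltas: list[timedelta]) -> Optional[float]:
        # No data points means the metric is undefined, not zero.
        if not deltas:
            return None
        return mean(d.total_seconds() for d in deltas) / 60

    volume = len(pages)
    actionable = sum(1 for p in pages if p.actionable)
    return {
        "page_volume": volume,
        "mtta_minutes": mean_minutes([p.acked_at - p.fired_at for p in acked]),
        "mttr_minutes": mean_minutes([p.resolved_at - p.fired_at for p in resolved]),
        "pct_actionable": 100 * actionable / volume if volume else None,
    }


# Illustrative data only: four pages, one of which self-resolved unacknowledged.
t0 = datetime(2024, 1, 1, 4, 0)
m = lambda n: timedelta(minutes=n)
sample = [
    Page(t0, t0 + m(2), t0 + m(30), actionable=True),
    Page(t0 + m(60), t0 + m(64), t0 + m(70), actionable=False),
    Page(t0 + m(120), None, t0 + m(125), actionable=False),
    Page(t0 + m(180), t0 + m(186), t0 + m(200), actionable=True),
]
print(rotation_metrics(sample))
```

Unacknowledged pages are excluded from MTTA rather than counted as zero, so a flapping alert that self-resolves cannot flatter the average; it still lowers % actionable, which is why that figure is the one to steer by.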