Engineering Practice ENG · 07 · 07

On-call: multiple-choice review

Multiple-choice synthesis across the on-call unit — symptom vs cause alerting, burn-rate pages, alert fatigue, load caps, runbooks, escalation, and the metrics that keep a rotation honest.

ENG Senior ◷ 13 min

Level

FoundationsJuniorMiddleSenior

Six questions that cut across the whole unit. None is a definition to recite — each mirrors a decision a senior makes about what fires a pager at 4 a.m. and what must not.

Goal

Confirm you can connect the unit’s spine: actionable symptom/SLO alerts on the pager, everything else demoted, load capped, runbooks linked, and the rotation measured by % actionable.

Quiz

A disk-usage threshold pages on-call 40 times a month and self-resolves nearly every time. Why is this more dangerous than having no alert at all?

Quiz

You must decide what pages a human. Which condition belongs on the pager, and what is the principle?

Quiz

Google SRE caps a shift at about two incidents and operational work at 50% of an SRE's time. What breaks if you ignore the caps?

Quiz

Which metric most directly signals a rotation degrading toward burnout?

Quiz

A page fires but the primary responder is stuck and hasn't acknowledged. What is the right structural mechanism, and what is the runbook's role here?

Quiz

A change is proposed to reduce noise: snooze the noisiest alerts and add Alertmanager group_by, inhibit_rules, and a 30s group_wait. What does this buy you, and what does it not?

Recap

The through-line of the unit is one rule applied at every layer: a page must be actionable or it is deleted. Page on symptoms and SLO burn rate, not causes like CPU or disk; demote the rest to tickets and dashboards. Cap load (≈2 incidents/shift, ≤50% ops time) so prevention work survives. Link a runbook to every page, escalate on a timer when a responder is stuck, and use grouping/inhibition to collapse storms — but remember routing only de-duplicates. Measure MTTA, MTTR, and page volume, and steer by the north star: % actionable. Now when you see a rising page-volume trend alongside a falling % actionable, you know exactly what to do first: audit and delete, not add more alerts.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.