Engineering Practice
On-call: rebuild a noisy rotation
Reading that “every page must be actionable” is not the same as making a real rotation trustworthy. Take a noisy alert set, audit it, replace cause alerts with SLO burn-rate pages, wire runbooks and escalation, and prove the pager got quieter — with numbers, not vibes.
Turn the unit’s one rule into an operating system for a rotation: classify every alert by actionability, page only on symptoms and burn rate, demote the rest, give each remaining page a runbook and a timed escalation path, and measure the change in page volume, % actionable, and MTTA/MTTR.
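The "page only on symptoms and burn rate" step can be made concrete with a small multiwindow check. The sketch below is illustrative, not a prescribed implementation: the function names are hypothetical, and the 99.9% target and 14.4x threshold are assumed defaults (14.4x sustained for one hour spends about 2% of a 30-day budget). In practice the two error ratios would come from queries over a long window and a short window, for example 1 h and 5 m.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are being spent.

    A burn rate of 1.0 uses up the whole error budget exactly at the end of
    the SLO period; higher values exhaust it proportionally sooner.
    """
    budget = 1.0 - slo_target  # allowed fraction of bad requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float = 0.999,   # assumed SLO, not taken from any real service
    threshold: float = 14.4,     # assumed fast-burn threshold
) -> bool:
    """Page only when BOTH windows burn faster than the threshold.

    The long window shows the burn is large enough to threaten the budget;
    the short window shows it is still happening right now.
    """
    long_burn = burn_rate(long_window_error_ratio, slo_target)
    short_burn = burn_rate(short_window_error_ratio, slo_target)
    return long_burn >= threshold and short_burn >= threshold
```

Requiring both windows is what keeps a brief flap silent: a short spike lifts the short-window ratio but barely moves the long one, and once an incident is mitigated the short window drops and the page stops firing.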
Take a service with a noisy alert set (your own, or build a small HTTP service plus a Prometheus/Alertmanager stack with deliberately cause-based alerts) and rebuild its on-call so that every page is actionable, proving the rotation got quieter and faster with before/after metrics.
- A before/after table: page volume per shift, % actionable, false-positive rate, MTTA, and MTTR — measured or drilled, not estimated.
- Every alert still on the pager is symptom/SLO-based and links to a runbook; a written rationale lists each demoted or deleted alert and why it was not actionable.
- The burn-rate rule demonstrably pages on a real fast burn during the drill and stays silent for a brief flap that does not threaten the budget.
- The escalation path fires on the timer when the primary does not ack, and the drill MTTR comes in under the rotation's target (aim for under 1 hour, the DORA elite threshold for time to restore service).
- Add a toil log for one rotation: record every manual, repetitive mitigation, then automate the most frequent one (for example, auto-rollback on burn or auto-scale on saturation) and show that the page that used to require a human is now self-healing or demoted.
- Wire a follow-the-sun handoff between two time zones with a written handoff checklist, so no routine page lands at 3 a.m. local for anyone.
- Build an alert-quality CI gate: every new alert rule must declare a runbook link and a symptom/SLO justification, and the build fails if a rule tagged severity: page is missing either.
- Instrument a weekly on-call review that automatically surfaces the alerts with the lowest % actionable from the page log and proposes the next deletion candidate, closing the postmortem-to-deletion feedback loop.
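The alert-quality CI gate from the list above might look something like the following sketch. It assumes the rules have already been parsed into dictionaries shaped like Prometheus alerting rules (`alert`, `labels`, `annotations`). `runbook_url` is a common annotation convention, while `slo_justification` is a hypothetical annotation name chosen for this example; `lint_rules` and `gate` are likewise illustrative names.

```python
import sys

# Annotations every paging rule must carry.
# "slo_justification" is a hypothetical name invented for this sketch.
REQUIRED_ANNOTATIONS = ("runbook_url", "slo_justification")


def lint_rules(rules):
    """Return one error string per missing annotation on 'severity: page' rules."""
    errors = []
    for rule in rules:
        labels = rule.get("labels") or {}
        if labels.get("severity") != "page":
            continue  # only rules that wake a human are gated
        annotations = rule.get("annotations") or {}
        for key in REQUIRED_ANNOTATIONS:
            value = annotations.get(key) or ""
            if not str(value).strip():
                name = rule.get("alert", "<unnamed>")
                errors.append(f"{name}: missing annotation '{key}'")
    return errors


def gate(rules) -> int:
    """Print violations to stderr and return an exit code (non-zero fails the build)."""
    errors = lint_rules(rules)
    for err in errors:
        print(err, file=sys.stderr)
    return 1 if errors else 0
```

In a pipeline, the build step would load the rule files with whatever parser the project already uses and finish with `sys.exit(gate(rules))`, so a paging rule without a runbook or justification never merges.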
This is the loop you will run on every real rotation: define the SLO and budget, audit every alert for actionability, page only on symptoms and burn rate, demote or delete the rest, give each surviving page a runbook and a timed escalation, then prove with before/after numbers that page volume fell, % actionable rose, and MTTA/MTTR held under target. Reduce toil by automating the repeated mitigations. Do it once on a real or toy stack and the production version becomes muscle memory — a pager people can still trust at 4 a.m.
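The before/after numbers can be computed from a plain page log. The following is a minimal sketch under stated assumptions: the `Page` record and `summarize` function are hypothetical, each page carries its fire, ack, and resolve timestamps plus a responder-assigned `actionable` flag, and MTTR is measured from the moment the page fired. Run it once on the old log and once on the new one to fill in the comparison.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Page:
    fired_at: datetime
    acked_at: datetime
    resolved_at: datetime
    actionable: bool  # did the responder actually have to do something?


def summarize(pages, shifts):
    """Summarize a page log covering `shifts` on-call shifts.

    Returns pages per shift, percent actionable, and MTTA/MTTR in minutes.
    """
    if shifts <= 0:
        raise ValueError("shifts must be positive")
    if not pages:
        return {"pages_per_shift": 0.0, "pct_actionable": None,
                "mtta_min": None, "mttr_min": None}

    def minutes(start, end):
        return (end - start).total_seconds() / 60

    return {
        "pages_per_shift": len(pages) / shifts,
        "pct_actionable": 100.0 * sum(p.actionable for p in pages) / len(pages),
        "mtta_min": mean(minutes(p.fired_at, p.acked_at) for p in pages),
        "mttr_min": mean(minutes(p.fired_at, p.resolved_at) for p in pages),
    }
```

Grouping the same log by alert name and sorting by `pct_actionable` would also give the weekly review its next deletion candidate.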