Engineering Practice ENG · 07 · 10

On-call: rebuild a noisy rotation

Hands-on project — audit a noisy alert set, replace cause alerts with SLO burn-rate pages, write runbooks and escalation, and prove the rotation got quieter with before/after metrics.

Level: Senior · Time: 240 min

Reading that “every page must be actionable” is not the same as making a real rotation trustworthy. Take a noisy alert set, audit it, replace cause alerts with SLO burn-rate pages, wire runbooks and escalation, and prove the pager got quieter — with numbers, not vibes.

Goal

Turn the unit’s one rule, that every page must be actionable, into an operating model for a rotation: classify every alert by actionability, page only on symptoms and burn rate, demote the rest, give each remaining page a runbook and a timed escalation path, and measure the change in page volume, % actionable, and MTTA/MTTR.
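The measurement half of the goal can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the Page record and its field names are hypothetical, and it assumes you can export a page log with fire, ack, and resolve timestamps plus a responder-supplied "actionable" flag.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Page:
    """One row of a hypothetical page log (field names are assumptions)."""
    fired_at: datetime
    acked_at: Optional[datetime]     # None if nobody acknowledged the page
    resolved_at: Optional[datetime]  # None if the page was never resolved
    actionable: bool                 # did the responder have to do something?


def rotation_metrics(pages: list[Page], shifts: int) -> dict[str, float]:
    """Summarise a page log into page volume, % actionable, MTTA and MTTR."""
    if not pages or shifts <= 0:
        raise ValueError("need at least one page and one shift")
    # Time to acknowledge / resolve, in minutes, for pages that got that far.
    ack_min = [(p.acked_at - p.fired_at).total_seconds() / 60
               for p in pages if p.acked_at is not None]
    res_min = [(p.resolved_at - p.fired_at).total_seconds() / 60
               for p in pages if p.resolved_at is not None]
    return {
        "pages_per_shift": len(pages) / shifts,
        "pct_actionable": 100 * sum(p.actionable for p in pages) / len(pages),
        "mtta_min": mean(ack_min) if ack_min else float("nan"),
        "mttr_min": mean(res_min) if res_min else float("nan"),
    }


# Made-up example data: two pages in one shift, one of them actionable.
t0 = datetime(2024, 1, 1, 3, 0)
log = [
    Page(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30), True),
    Page(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=10), False),
]
metrics = rotation_metrics(log, shifts=1)
```

Running the same function over the log from before and after the rebuild gives two comparable rows for the before/after table.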

Project

Objective

Take a service with a noisy alert set (your own, or build a small HTTP service plus a Prometheus/Alertmanager stack with deliberately cause-based alerts) and rebuild its on-call so that every page is actionable, proving the rotation got quieter and faster with before/after metrics.

Requirements

Acceptance criteria

A before/after table: page volume per shift, % actionable, false-positive rate, MTTA, and MTTR — measured or drilled, not estimated.
Every alert still on the pager is symptom/SLO-based and links a runbook; a written rationale lists each demoted or deleted alert and why it was not actionable.
The burn-rate rule demonstrably pages on a real fast burn during the drill and stays silent for a brief flap that does not threaten the budget.
The escalation path fires on the timer when the primary does not ack, and the drill MTTR comes in under the rotation's target (aim for under 1 hour, the DORA elite-performer benchmark for time to restore service).
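The burn-rate criterion above can be made concrete with a small sketch. It assumes a multi-window rule: page only when both a long window and a short window are burning budget faster than a threshold. The default SLO of 99.9% and the threshold of 14.4 (2% of a 30-day budget spent in one hour) are a commonly used starting point rather than values taken from this lesson, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if BOTH windows exceed the burn-rate threshold.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


# Sustained 2% errors in both windows: a real fast burn, so page.
fast_burn = should_page(long_window_errors=0.02, short_window_errors=0.02)

# One minute at 20% errors: hot in the 5m window (4%), but only about
# 0.33% over the hour, so the budget is not threatened and nobody is paged.
brief_flap = should_page(long_window_errors=0.2 / 60, short_window_errors=0.04)
```

The two-window shape is what keeps the rule quiet for a flap: the short window alone would fire, but the long window shows the hour's error ratio is still far below the threshold. The same condition would normally live in the alerting rules themselves; the Python here only shows the arithmetic to check during the drill.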

Senior stretch

Add a toil log for one rotation: record every manual, repetitive mitigation, then automate the top one (for example, auto-rollback on burn or auto-scale on saturation) and show that the page that used to require a human is now self-healing or demoted.
Wire a follow-the-sun handoff between two time zones with a written handoff checklist, so no routine page lands at 3 a.m. local for anyone.
Build an alert-quality CI gate: every new alert rule must declare a runbook link and a symptom/SLO justification, and the build fails if a rule tagged severity: page is missing either.
Instrument a weekly on-call review that auto-surfaces the alerts with the lowest % actionable from the page log and proposes the next deletion candidate, closing the postmortem-to-deletion feedback loop.
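The alert-quality CI gate from the stretch list might look roughly like the sketch below. It assumes the rule files have already been parsed into dictionaries with the usual alert, labels, and annotations keys. The annotation names runbook_url and slo_justification are illustrative choices, not a standard this lesson prescribes, and the runbook URL is a placeholder.

```python
# Annotation names are assumptions; pick whatever your team standardises on.
REQUIRED_ANNOTATIONS = ("runbook_url", "slo_justification")


def lint_rules(rules: list[dict]) -> list[str]:
    """Return one message per paging rule that lacks a required annotation."""
    problems = []
    for rule in rules:
        # Only rules that can wake a human are held to the gate.
        if rule.get("labels", {}).get("severity") != "page":
            continue
        annotations = rule.get("annotations", {})
        for key in REQUIRED_ANNOTATIONS:
            if not str(annotations.get(key, "")).strip():
                name = rule.get("alert", "<unnamed>")
                problems.append(f"{name}: missing {key}")
    return problems


# Made-up rules: one compliant page, one bare page, one non-paging ticket.
example_rules = [
    {"alert": "HighErrorBudgetBurn",
     "labels": {"severity": "page"},
     "annotations": {"runbook_url": "https://runbooks.example/burn",
                     "slo_justification": "burns the availability SLO"}},
    {"alert": "DiskAt80Percent",
     "labels": {"severity": "page"},
     "annotations": {}},
    {"alert": "CpuHigh",
     "labels": {"severity": "ticket"},
     "annotations": {}},
]
problems = lint_rules(example_rules)
```

In a CI job, the wrapper script would print each entry in problems and exit non-zero when the list is not empty, which is what fails the build.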

Recap

This is the loop you will run on every real rotation: define the SLO and budget, audit every alert for actionability, page only on symptoms and burn rate, demote or delete the rest, give each surviving page a runbook and a timed escalation, then prove with before/after numbers that page volume fell, % actionable rose, and MTTA/MTTR held under target. Reduce toil by automating the repeated mitigations. Do it once on a real or toy stack and the production version becomes muscle memory — a pager people can still trust at 4 a.m.
