
Engineering Practice

On-call: rebuild a noisy rotation

Crux: a hands-on project. Audit a noisy alert set, replace cause alerts with SLO burn-rate pages, write runbooks and escalation, and prove the rotation got quieter with before/after metrics.
Estimated time: 240 min

Reading that “every page must be actionable” is not the same as making a real rotation trustworthy. Take a noisy alert set, audit it, replace cause alerts with SLO burn-rate pages, wire runbooks and escalation, and prove the pager got quieter — with numbers, not vibes.

Goal

Turn the unit’s one rule into an operating system for a rotation: classify every alert by actionability, page only on symptoms and burn rate, demote the rest, give each remaining page a runbook and a timed escalation path, and measure the change in page volume, % actionable, and MTTA/MTTR.
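To make "page only on burn rate" concrete, here is a minimal sketch of a two-window burn-rate check. It assumes a 99.9% availability SLO over 30 days and the commonly cited fast-burn threshold of 14.4 (roughly 2% of the monthly budget spent in one hour). The names `Window` and `should_page` are hypothetical and exist only for illustration. Your own SLO, windows, and thresholds will differ.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999            # assumed objective: 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO_TARGET
FAST_BURN_THRESHOLD = 14.4    # assumed: ~2% of a 30-day budget burned in 1 hour


@dataclass
class Window:
    """Error and request counts observed over one lookback window."""
    errors: int
    requests: int

    def burn_rate(self) -> float:
        # Burn rate = observed error ratio / error ratio the SLO allows.
        if self.requests == 0:
            return 0.0
        return (self.errors / self.requests) / ERROR_BUDGET


def should_page(long_window: Window, short_window: Window) -> bool:
    # Page only when BOTH windows burn fast: the long window shows the
    # budget is really at risk, the short window shows it is still happening.
    return (long_window.burn_rate() >= FAST_BURN_THRESHOLD
            and short_window.burn_rate() >= FAST_BURN_THRESHOLD)


# Sustained outage: 2% errors over the last hour, still failing right now.
outage = should_page(Window(2000, 100_000), Window(200, 8_000))
# Brief flap: the last 5 minutes look bad, but the hour barely moved.
flap = should_page(Window(200, 100_000), Window(200, 8_000))
print(outage, flap)  # True False
```

Requiring both windows is what keeps the pager silent for a short flap while still firing quickly on a burn that threatens the budget.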

Project
Objective

Take a service with a noisy alert set (your own, or build a small HTTP service plus a Prometheus/Alertmanager stack with deliberately cause-based alerts) and rebuild its on-call so that every page is actionable, proving the rotation got quieter and faster with before/after metrics.
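One way to start the audit is to score each alert from your page history. The sketch below is an assumption-heavy illustration, not a prescribed method: the `AlertStats` fields, the 80% threshold, and the three outcomes (`page`, `ticket`, `delete`) are all hypothetical choices you should adapt to your rotation.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Audit record for one alert over the review period (hypothetical shape)."""
    name: str
    fired: int           # how many times the alert paged
    acted_on: int        # how many of those pages led to a human action
    symptom_based: bool  # True if it measures user-visible impact, not a cause


def actionable_ratio(alert: AlertStats) -> float:
    return alert.acted_on / alert.fired if alert.fired else 0.0


def classify(alert: AlertStats, page_threshold: float = 0.8) -> str:
    # Fired repeatedly and nobody ever did anything: deletion candidate.
    if alert.fired > 0 and alert.acted_on == 0:
        return "delete"
    # Symptom-based and almost always acted on: it earns its place on the pager.
    if alert.symptom_based and actionable_ratio(alert) >= page_threshold:
        return "page"
    # Everything else is demoted to a ticket or dashboard.
    return "ticket"


audit = [
    AlertStats("HighCPU", fired=40, acted_on=0, symptom_based=False),
    AlertStats("DiskAt80Percent", fired=12, acted_on=3, symptom_based=False),
    AlertStats("CheckoutErrorBudgetBurn", fired=5, acted_on=5, symptom_based=True),
]
for alert in audit:
    print(f"{alert.name}: {classify(alert)}")
```

The printed decisions double as the first draft of the written rationale: each demoted or deleted alert already carries the numbers that justify the call.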

Requirements
Acceptance criteria
  • A before/after table: page volume per shift, % actionable, false-positive rate, MTTA, and MTTR — measured or drilled, not estimated.
  • Every alert still on the pager is symptom/SLO-based and links a runbook; a written rationale lists each demoted or deleted alert and why it was not actionable.
  • The burn-rate rule demonstrably pages on a real fast burn during the drill and stays silent for a brief flap that does not threaten the budget.
  • The escalation path fires on the timer when the primary does not ack, and the drill MTTR comes in under the rotation's target (aim for under 1 hour, the DORA elite-tier benchmark for time to restore service).
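The before/after table can be computed straight from a page log rather than estimated. The following is a minimal sketch under stated assumptions: the `Page` record and its fields are hypothetical, MTTA is measured from fire to ack, MTTR from fire to resolve, and `actionable` and `false_positive` are labels you assign during review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


@dataclass
class Page:
    """One entry in the page log (hypothetical shape)."""
    fired_at: datetime
    acked_at: datetime
    resolved_at: datetime
    actionable: bool      # a human had something useful to do
    false_positive: bool  # nothing was actually wrong


def summarize(pages: List[Page], shifts: int) -> Dict[str, float]:
    """Return the five before/after metrics for one measurement period."""
    if not pages or shifts <= 0:
        raise ValueError("need at least one page and one shift")
    n = len(pages)
    return {
        "pages_per_shift": n / shifts,
        "pct_actionable": 100 * sum(p.actionable for p in pages) / n,
        "false_positive_rate": 100 * sum(p.false_positive for p in pages) / n,
        # Mean time to acknowledge, in minutes.
        "mtta_min": mean((p.acked_at - p.fired_at).total_seconds() for p in pages) / 60,
        # Mean time to resolve, in minutes, measured from when the page fired.
        "mttr_min": mean((p.resolved_at - p.fired_at).total_seconds() for p in pages) / 60,
    }


t0 = datetime(2024, 1, 1, 0, 0)  # arbitrary example timestamp
example_log = [
    Page(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=30), True, False),
    Page(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=90), False, True),
]
for metric, value in summarize(example_log, shifts=2).items():
    print(f"{metric}: {value:.1f}")
```

Run it once on the log from the noisy period and once on the log from the rebuilt rotation (or the drill), and place the two result sets side by side.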
Senior stretch
  • Add a toil log for one rotation: record every manual, repetitive mitigation, then automate the top one (auto-rollback on burn, auto-scale on saturation) and show the page that used to require a human is now self-healing or demoted.
  • Wire a follow-the-sun handoff between two time zones with a written handoff checklist, so no routine page lands at 3 a.m. local for anyone.
  • Build an alert-quality CI gate: every new alert rule must declare a runbook link and a symptom/SLO justification, and the build fails if a rule is tagged severity: page without one.
  • Instrument a weekly on-call review that auto-surfaces the alerts with the lowest % actionable from the page log and proposes the next deletion candidate, closing the postmortem-to-deletion feedback loop.
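The alert-quality CI gate from the stretch list could look roughly like this. The sketch assumes the rule file has already been parsed into a dict with the usual Prometheus `groups` / `rules` layout. `runbook_url` is a widely used annotation convention, while `slo_justification` is a hypothetical annotation name invented here. Parsing the YAML and wiring the exit code into your build are left out.

```python
from typing import List

# Assumed annotation names; "slo_justification" is hypothetical.
REQUIRED_ANNOTATIONS = ("runbook_url", "slo_justification")


def lint_rules(rule_file: dict) -> List[str]:
    """Return one violation message per missing annotation on a paging rule."""
    violations = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules are not alerts
            if rule.get("labels", {}).get("severity") != "page":
                continue  # the gate only blocks rules that can wake someone up
            annotations = rule.get("annotations", {})
            for key in REQUIRED_ANNOTATIONS:
                if not str(annotations.get(key, "")).strip():
                    violations.append(f"{rule['alert']}: missing annotation '{key}'")
    return violations


def main(rule_file: dict) -> int:
    """Print violations and return a process exit code (non-zero fails the build)."""
    violations = lint_rules(rule_file)
    for line in violations:
        print(line)
    return 1 if violations else 0
```

In a CI job you would load the rule file with your YAML parser of choice and call `sys.exit(main(parsed_rules))`, so a paging rule without a runbook or justification fails the build.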
Recap

This is the loop you will run on every real rotation: define the SLO and budget, audit every alert for actionability, page only on symptoms and burn rate, demote or delete the rest, give each surviving page a runbook and a timed escalation, then prove with before/after numbers that page volume fell, % actionable rose, and MTTA/MTTR held under target. Reduce toil by automating the repeated mitigations. Do it once on a real or toy stack and the production version becomes muscle memory — a pager people can still trust at 4 a.m.

