Engineering Practice
On-call and incident response: every page must be actionable
3:47 a.m. The pager fires: DiskUsage > 80% on cache-node-7. The responder, six pages deep into a quiet-but-relentless night, swipes it away half-asleep — the disk alert has cried wolf forty times this month and self-resolved every time. Except tonight it didn’t. The node filled, the cache evicted hard, the database took the full read load and fell over at 4:10. The page that mattered was indistinguishable from the forty that didn’t. That is not a monitoring gap. That is alert fatigue, and it is the failure mode that defines on-call.
The rotation: on-call is a load you budget, not a hero shift
A rotation is a schedule of who carries the pager. The naive version — “whoever’s free” or one heroic owner — collapses: the hero burns out, leaves, and takes the only knowledge with them. A sustainable rotation spreads load across enough engineers that any single shift is survivable, and it treats responder time as a finite, expensive resource.
Google SRE codifies this with hard numbers. SREs spend at most 50% of their time on operational (“ops”) work (on-call, tickets, manual toil) so that at least half goes to engineering that makes the next quarter quieter. Within that ops half, on-call itself is capped at 25% of an SRE's total time. And per 12-hour shift, the target is a maximum of two incidents, because handling one incident end-to-end (triage, mitigation, root-cause analysis, postmortem, follow-up fixes) takes about six hours of real work. Two incidents already fill the shift; a third means corners get cut and the postmortem never gets written.
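The two-incident cap follows from simple arithmetic, which can be sketched as below. This is a minimal illustration, assuming a 12-hour on-call shift; the function and constant names are hypothetical.

```python
HOURS_PER_INCIDENT = 6   # from the text: end-to-end handling of one incident
SHIFT_HOURS = 12         # assumption: one on-call shift lasts 12 hours


def max_incidents_per_shift(shift_hours: float = SHIFT_HOURS,
                            hours_per_incident: float = HOURS_PER_INCIDENT) -> int:
    """How many incidents can be handled end-to-end within one shift."""
    return int(shift_hours // hours_per_incident)


def shift_is_overloaded(incident_count: int) -> bool:
    """True when a shift saw more incidents than can be fully handled."""
    return incident_count > max_incidents_per_shift()
```

A shift that reports three incidents is flagged as overloaded, which is the signal to look at alert quality or staffing rather than to ask the responder to work faster.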
Follow-the-sun is the structural fix for the worst part: nobody should be paged at 3 a.m. as a routine. You hand the pager across time zones — a team in EMEA covers their daylight, hands to the Americas, who hand to APAC — so every responder is awake and alert when they hold it. The cost is coordination overhead and clean handoffs; the payoff is that “on-call” stops meaning “destroyed sleep.”
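A follow-the-sun schedule is, at its core, a lookup from the current UTC hour to the region holding the pager. The sketch below assumes three equal eight-hour windows; the boundaries and the `HANDOFFS` table are hypothetical and would be tuned to each team's actual daylight hours.

```python
from datetime import datetime, timezone

# Hypothetical handoff table: (start_hour_utc, end_hour_utc, region).
HANDOFFS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "Americas"),
]


def region_on_call(now: datetime) -> str:
    """Return the region that holds the pager at the given moment."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in HANDOFFS:
        if start <= hour < end:
            return region
    raise ValueError(f"no region covers hour {hour}")
```

With this table, a page at 03:47 UTC goes to a team that is in its working day rather than to someone asleep.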
Alerting philosophy: page on symptoms, not causes
This is the single highest-leverage decision in on-call, and most noisy rotations get it backwards. Two ways to decide what fires a page:
- Cause-based: CPU > 80%, disk > 80%, connection pool 90% full, replica lag rising. These describe internal mechanics. The problem: high CPU is often fine, frequently self-corrects, and rarely means a user is harmed. Cause alerts fire constantly and most are non-events.
- Symptom-based / SLO-based: p99 latency > 500ms for 5 minutes, error rate burning the monthly budget at 6×. These describe what a user actually experiences and what your SLO promised. They fire only when something real is breaking.
Google’s rule is blunt: spend the effort catching symptoms, and only alert on causes that are definite and imminent. The four golden signals — latency, traffic, errors, saturation — together catch nearly every meaningful failure as a symptom. The mechanism that turns an SLO into a page is burn rate: how fast you’re consuming the error budget relative to the SLO. A multi-window, multi-burn-rate alert pages on a fast burn (say 6× over a short window confirmed against a longer one) and stays silent on a slow drip — so urgency matches the actual threat to your reliability promise, not a raw threshold.
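The burn-rate mechanism reduces to a small amount of arithmetic. Here is a minimal sketch, assuming an illustrative 99.9% SLO over a 30-day period (neither figure comes from the text) and hypothetical function names. The caller supplies the error ratio measured over a short and a long window; choosing those window lengths is left open here.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.

    A burn rate of 1.0 spends the whole error budget in exactly one SLO period.
    """
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(short_window_error_ratio: float,
                long_window_error_ratio: float,
                slo_target: float,
                threshold: float = 6.0) -> bool:
    """Page only when both windows show a burn at or above the threshold."""
    return (burn_rate(short_window_error_ratio, slo_target) >= threshold
            and burn_rate(long_window_error_ratio, slo_target) >= threshold)


def days_until_budget_exhausted(rate: float, slo_period_days: float = 30.0) -> float:
    """Days until the budget is gone if the current burn rate holds."""
    return slo_period_days / rate
```

Requiring both windows to agree is what keeps a brief spike from paging: the short window shows the burn is happening now, the long window shows it has lasted. At a sustained 6× burn, a 30-day budget lasts five days.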
| Alert | Type | Page a human? |
|---|---|---|
| CPU > 80% | Cause | No: often fine, self-corrects; dashboard, not pager |
| disk > 80% | Cause | No (alone): ticket, or page only if trend hits full in < 1h |
| p99 latency > 500ms / 5m | Symptom | Yes: users feel it now; SLO at risk |
| Error budget burning 6× | Symptom (SLO) | Yes: budget gone in days at this rate |
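The disk row's "page only if trend hits full in < 1h" condition is what turns a raw threshold into a definite, imminent cause. One way to sketch it is a linear extrapolation between two usage samples. The function names are hypothetical, and real predictors are usually fitted over more than two points.

```python
from typing import Optional


def hours_until_full(usage_then: float, usage_now: float,
                     hours_elapsed: float) -> Optional[float]:
    """Linear estimate of hours until the disk is full.

    Usage values are fractions of capacity (0.0 to 1.0). Returns None when
    usage is flat or shrinking, since the disk is not trending toward full.
    """
    growth_per_hour = (usage_now - usage_then) / hours_elapsed
    if growth_per_hour <= 0:
        return None
    return (1.0 - usage_now) / growth_per_hour


def disk_alert_action(usage_then: float, usage_now: float, hours_elapsed: float,
                      page_within_hours: float = 1.0) -> str:
    """Return 'none', 'ticket', or 'page' for a disk-usage observation."""
    if usage_now < 0.80:
        return "none"
    eta = hours_until_full(usage_then, usage_now, hours_elapsed)
    if eta is not None and eta < page_within_hours:
        return "page"
    return "ticket"
```

A disk sitting at 85% with no growth produces a ticket, while one that climbed from 70% to 90% in half an hour produces a page.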
The central failure mode: alert fatigue
Here is the trap, with numbers. Industry false-positive rates run 60–80% — most pages don’t need a human. The median on-call engineer absorbs about 42 pages per week; once the false-positive ratio crosses ~60%, responders start pattern-matching (“disk alert again, ignore”) which adds 2–5 minutes of MTTA per page even when they do engage. The human cost is brutal: 62% report weekly sleep disruption and 41% of on-call engineers have considered leaving over alert load. Pager burnout is an attrition driver, and attrition takes operational knowledge out the door with it.
The mechanism is simple and lethal: too many low-value pages → responders desensitize → they reflexively dismiss → the one real incident arrives wearing the same costume as the noise and gets dismissed too. You cannot fix this with a better responder or more discipline. You fix it by deleting alerts. The hard rule a senior enforces: if a page is not actionable — if the response is “watch it” or “it’ll self-resolve” — it is not an alert, it’s a notification, and it does not go to the pager. Demote it to a ticket or a dashboard. The goal isn’t more coverage; it’s a pager you can still trust at 4 a.m.
Why this works
“Just snooze the noisy ones” feels like the cheap fix, and it’s how coverage quietly dies. The right tools shrink noise without losing signal: Alertmanager group_by collapses a storm of related alerts into one notification, inhibit_rules mutes the warning when the matching critical is already firing, and a group_wait of 30s lets a flap self-resolve before it ever pages. But routing config only de-duplicates — it can’t make a fundamentally non-actionable alert worth waking for. That decision happens upstream, when you decide the alert deserves a pager at all.
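To make the two de-duplication ideas concrete, here is a plain-Python sketch of their behavior. It is not Alertmanager's implementation or its configuration syntax. The alert dictionaries, the `severity` label values, and the function names are assumptions for illustration.

```python
from collections import defaultdict


def label_key(alert: dict, labels: list) -> tuple:
    """Tuple of the alert's values for the given label names."""
    return tuple(alert["labels"].get(name, "") for name in labels)


def inhibit(alerts: list, equal: list) -> list:
    """Drop warnings when a critical alert with the same `equal` labels fires."""
    critical_keys = {
        label_key(a, equal)
        for a in alerts
        if a["labels"].get("severity") == "critical"
    }
    return [
        a for a in alerts
        if not (a["labels"].get("severity") == "warning"
                and label_key(a, equal) in critical_keys)
    ]


def group_alerts(alerts: list, group_by: list) -> dict:
    """Collapse alerts sharing the `group_by` labels into one notification each."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[label_key(alert, group_by)].append(alert)
    return dict(groups)
```

Neither function asks whether an alert is actionable. They only reduce how many notifications a given set of alerts produces, which is why the decision to page has to be made earlier.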
Runbooks, escalation, and the metrics that keep you honest
Every page that reaches a human should link a runbook: the diagnostic checklist and known mitigations for this alert. A runbook turns a 3 a.m. cold-start (“what even is cache-node-7?”) into a procedure, which is what actually lowers MTTR for the median responder who didn’t write the service. If an alert has no runbook, that’s a signal it isn’t ready to page on.
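The "no runbook, not ready to page" rule can be enforced mechanically, for example as a check that runs before alert definitions are deployed. The sketch below assumes alert rules are available as dictionaries with `alert`, `labels`, and `annotations` keys, that paging alerts carry a `severity: page` label, and that the runbook link lives in a `runbook_url` annotation. All of these are conventions chosen for illustration.

```python
from typing import Dict, List


def alerts_missing_runbook(rules: List[Dict]) -> List[str]:
    """Names of paging alerts that do not link a runbook."""
    missing = []
    for rule in rules:
        pages_a_human = rule.get("labels", {}).get("severity") == "page"
        runbook = rule.get("annotations", {}).get("runbook_url", "").strip()
        if pages_a_human and not runbook:
            missing.append(rule["alert"])
    return missing
```

Failing a review or a CI job when this list is non-empty keeps runbook coverage from eroding as new alerts are added.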
When the primary responder can’t resolve it, the escalation path routes onward — primary → secondary → service owner / incident commander — on a timer, so a stuck page doesn’t sit silent. And you keep the whole system honest with metrics:
- MTTA (mean time to acknowledge) — industry median 8–15 min; rising MTTA is the early signature of fatigue or bad routing.
- MTTR (mean time to resolve) — DORA 2024 elite is under 1 hour; low performers exceed a week.
- Page volume per shift — trending up means the rotation is degrading; this is what the ≤2-incidents target watches.
- % actionable — the north star. Every page that wasn’t actionable is a candidate for deletion. Push this toward 100% and MTTA, MTTR, and retention all follow.
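These metrics are straightforward to compute from a page log. Here is a minimal sketch, assuming each page records when it fired, was acknowledged, and was resolved, plus a responder-assigned `actionable` flag. The `Page` record and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Page:
    fired_at: datetime
    acked_at: datetime
    resolved_at: datetime
    actionable: bool  # judged by the responder after the fact


def mtta_minutes(pages: List[Page]) -> float:
    """Mean time from page firing to acknowledgement, in minutes."""
    return mean((p.acked_at - p.fired_at).total_seconds() for p in pages) / 60


def mttr_minutes(pages: List[Page]) -> float:
    """Mean time from page firing to resolution, in minutes."""
    return mean((p.resolved_at - p.fired_at).total_seconds() for p in pages) / 60


def percent_actionable(pages: List[Page]) -> float:
    """Share of pages that required a human to act."""
    return 100.0 * sum(p.actionable for p in pages) / len(pages)
```

Tracking these per shift or per week, rather than as one lifetime average, is what makes a rising trend visible.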
Your service has 5% of users seeing 2s+ load times intermittently, but no SLO is breached yet and you can't reliably reproduce it. How do you wire this into on-call?
A disk-usage > 80% alert pages the on-call 40 times a month and self-resolves nearly every time. What's the senior move?
Which metric most directly tells you the rotation is degrading toward burnout?
Order the lifecycle of a well-run page, from definition to closure:
1. Decide the alert is actionable and symptom/SLO-based before it ever reaches the pager
2. Page fires; Alertmanager groups + inhibits so the responder gets one signal, not a storm
3. Responder acknowledges (MTTA) and opens the linked runbook
4. Mitigate to restore service (MTTR); escalate on a timer if stuck
5. Write the postmortem and delete or tune any alert that fired without being actionable
1. Explain to a teammate why a noisy 'disk > 80%' alert is more dangerous than having no alert at all.
2. Why does Google SRE cap on-call at two incidents per shift and operational work at 50% of an SRE's time, and what breaks if you ignore those caps?
On-call succeeds or fails on one rule: every alert that wakes a human must be actionable, or it must be deleted. Page on symptoms and SLO burn rate — what a user actually feels and what your reliability promise covers — not on causes like CPU or disk that fire on non-events and mostly self-resolve. The failure mode that defines the discipline is alert fatigue: with 60–80% false positives and a median 42 pages a week, responders desensitize and dismiss the one real incident along with the noise, while 41% consider quitting over the load. Defend against it structurally — delete non-actionable alerts, use grouping and inhibition to collapse storms, cap load at roughly two incidents per shift and 50% ops time, and use follow-the-sun so nobody is routinely paged at 3 a.m. Every page links a runbook so the median responder can act, escalation routes onward on a timer, and you keep the system honest with MTTA, MTTR, page volume per shift, and the north-star metric, % actionable. Push that toward 100% and a trustworthy pager, healthier responders, and faster recovery all follow.