Engineering Practice ENG · 07 · 01

On-call and incident response: every page must be actionable

On-call lives or dies on one rule: every alert is actionable or it is deleted. Symptom-based, SLO-driven pages catch real harm; cause-based noise breeds alert fatigue — and a fatigued responder misses the one page that mattered.

3:47 a.m. The pager fires: DiskUsage > 80% on cache-node-7. The responder, six pages deep into a quiet-but-relentless night, swipes it away half-asleep — the disk alert has cried wolf forty times this month and self-resolved every time. Except tonight it didn’t. The node filled, the cache evicted hard, the database took the full read load and fell over at 4:10. The page that mattered was indistinguishable from the forty that didn’t. That is not a monitoring gap. That is alert fatigue, and it is the failure mode that defines on-call.

The rotation: on-call is a load you budget, not a hero shift

Ask yourself: when the one engineer who knows the service leaves, does the rotation survive the next outage — or does it collapse? A rotation is a schedule of who carries the pager. The naive version — “whoever’s free” or one heroic owner — collapses: the hero burns out, leaves, and takes the only knowledge with them. A sustainable rotation spreads load across enough engineers that any single shift is survivable, and it treats responder time as a finite, expensive resource.

Google SRE codifies this with hard numbers. SREs spend at most 50% of their time on operational (“ops”) work (on-call, tickets, manual toil) so that the other half goes to engineering that makes the next quarter quieter. Within that ops half, on-call itself is capped at 25% of an engineer’s total time. And per shift, the target is a maximum of two incidents, because handling one incident end to end (triage, mitigation, root-cause analysis, postmortem, follow-up fixes) takes about six hours of real work. Two incidents already add up to twelve hours; a third means corners get cut and the postmortem never gets written.
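
The arithmetic behind the two-incident cap is easy to check. A minimal sketch, assuming a 12-hour shift (the text gives the six-hour figure but not the shift length) and hypothetical function names:

```python
# Hypothetical helpers; the shift length is an assumption, not from the lesson.
HOURS_PER_INCIDENT = 6   # from the text: triage through follow-up fixes
SHIFT_HOURS = 12         # assumption: one on-call shift lasts 12 hours


def max_incidents_per_shift(shift_hours: float = SHIFT_HOURS,
                            hours_per_incident: float = HOURS_PER_INCIDENT) -> int:
    """Number of incidents that can be handled end to end within one shift."""
    return int(shift_hours // hours_per_incident)


def is_overloaded(incidents_this_shift: int) -> bool:
    """True when the shift saw more incidents than can be handled properly."""
    return incidents_this_shift > max_incidents_per_shift()


print(max_incidents_per_shift())  # 2
print(is_overloaded(3))           # True
```

A shift that keeps returning True here is a signal to shrink the alert set or grow the rotation, not to ask the responder to work faster.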

Follow-the-sun is the structural fix for the worst part: nobody should be paged at 3 a.m. as a routine. You hand the pager across time zones — a team in EMEA covers their daylight, hands to the Americas, who hand to APAC — so every responder is awake and alert when they hold it. The cost is coordination overhead and clean handoffs; the payoff is that “on-call” stops meaning “destroyed sleep.”
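
The handoff described above can be sketched as a lookup from the current UTC hour to the region holding the pager. The coverage windows below are assumptions for illustration; real windows depend on where the teams sit.

```python
from datetime import datetime, timezone

# Assumed UTC coverage windows: (region, start hour inclusive, end hour exclusive).
COVERAGE = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("Americas", 16, 24),
]


def region_on_call(now: datetime) -> str:
    """Return the region holding the pager; `now` must be timezone-aware."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour = now.astimezone(timezone.utc).hour
    for region, start, end in COVERAGE:
        if start <= hour < end:
            return region
    raise ValueError("coverage windows leave a gap")


# The 3:47 a.m. UTC page lands on a team that is awake.
print(region_on_call(datetime(2024, 1, 1, 3, 47, tzinfo=timezone.utc)))  # APAC
```

The final `raise` is deliberate: a gap in the windows is a scheduling bug that should fail loudly rather than leave the pager unowned.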

Alerting philosophy: page on symptoms, not causes

This is the single highest-leverage decision in on-call, and most noisy rotations get it backwards. Two ways to decide what fires a page:

Cause-based: CPU > 80%, disk > 80%, connection pool 90% full, replica lag rising. These describe internal mechanics. The problem: high CPU is often fine, frequently self-corrects, and rarely means a user is harmed. Cause alerts fire constantly and most are non-events.
Symptom-based / SLO-based: p99 latency > 500ms for 5 minutes, error rate burning the monthly budget at 6×. These describe what a user actually experiences and what your SLO promised. They fire only when something real is breaking.

Google’s rule is blunt: spend the effort catching symptoms, and only alert on causes that are definite and imminent. The four golden signals — latency, traffic, errors, saturation — together catch nearly every meaningful failure as a symptom. The mechanism that turns an SLO into a page is burn rate: how fast you’re consuming the error budget relative to the SLO. A multi-window, multi-burn-rate alert pages on a fast burn (say 6× over a short window confirmed against a longer one) and stays silent on a slow drip — so urgency matches the actual threat to your reliability promise, not a raw threshold.
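
Burn rate is just a ratio, so a short sketch makes it concrete. This assumes a 30-day SLO period and a single 6× paging threshold checked on two windows; the function names are hypothetical, and real setups usually pair several thresholds with different window lengths.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    error_budget = 1.0 - slo_target   # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_ratio / error_budget


def days_until_budget_exhausted(rate: float, slo_period_days: float = 30) -> float:
    """At a constant burn rate, how long a full budget lasts."""
    return slo_period_days / rate


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 6.0) -> bool:
    """Page only when both the long and the short window burn too fast."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


print(days_until_budget_exhausted(6.0))       # 5.0
print(should_page(0.008, 0.009, 0.999))       # True: sustained fast burn
print(should_page(0.008, 0.0005, 0.999))      # False: the burn already stopped
```

Requiring both windows is what keeps the alert quiet after a spike has ended: the long window still remembers the errors, but the short one shows the burn is over.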

Alert	Type	Page a human?
`CPU > 80%`	Cause	No — often fine, self-corrects; dashboard, not pager
`disk > 80%`	Cause	No (alone) — ticket, or page only if trend hits full in < 1h
`p99 latency > 500ms / 5m`	Symptom	Yes — users feel it now; SLO at risk
Error budget burning `6×`	Symptom (SLO)	Yes — budget gone in days at this rate
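
The disk row's "page only if trend hits full in < 1h" condition can be sketched as a linear extrapolation. This is a simplified two-point estimate with hypothetical names; PromQL's predict_linear does a similar job with a proper regression over the window.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hours since first sample, fraction of disk used)


def hours_until_full(samples: List[Sample]) -> Optional[float]:
    """Extrapolate from the first and last sample; None if usage is not growing."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    slope = (u1 - u0) / (t1 - t0)   # fraction of disk consumed per hour
    if slope <= 0:
        return None
    return (1.0 - u1) / slope


def disk_action(samples: List[Sample], page_within_hours: float = 1.0) -> str:
    """Page only on an imminent fill; a high but stable disk becomes a ticket."""
    eta = hours_until_full(samples)
    if eta is not None and eta < page_within_hours:
        return "page"
    if samples[-1][1] > 0.80:
        return "ticket"
    return "none"


print(disk_action([(0.0, 0.80), (0.5, 0.90)]))  # page: full in about half an hour
print(disk_action([(0.0, 0.81), (1.0, 0.82)]))  # ticket: above 80% but slow
```

The raw threshold still exists, but it only opens a ticket; the pager is reserved for the definite-and-imminent case.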

The central failure mode: alert fatigue

Here is the trap, with numbers. Industry false-positive rates run 60–80%, meaning most pages don’t need a human. The median on-call engineer absorbs about 42 pages per week; once the false-positive ratio crosses ~60%, responders start pattern-matching (“disk alert again, ignore”), which adds 2–5 minutes of MTTA per page even when they do engage. The human cost is brutal: 62% report weekly sleep disruption, and 41% of on-call engineers have considered leaving over alert load. Pager burnout is an attrition driver, and attrition takes operational knowledge out the door with it.

The mechanism is simple and lethal: too many low-value pages → responders desensitize → they reflexively dismiss → the one real incident arrives wearing the same costume as the noise and gets dismissed too. You cannot fix this with a better responder or more discipline. You fix it by deleting alerts. The hard rule a senior enforces: if a page is not actionable — if the response is “watch it” or “it’ll self-resolve” — it is not an alert, it’s a notification, and it does not go to the pager. Demote it to a ticket or a dashboard. The goal isn’t more coverage; it’s a pager you can still trust at 4 a.m.
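
One way to start deleting is to audit page history per alert. A minimal sketch, assuming each past page was labeled actionable or not after the fact; the 50% cutoff and the alert names are hypothetical and only set the order of the review.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def demotion_candidates(pages: Iterable[Tuple[str, bool]],
                        min_actionable: float = 0.5) -> List[Tuple[str, float, int]]:
    """Return (alert name, actionable ratio, page count), least actionable first.

    `pages` holds one (alert_name, was_actionable) pair per page that fired.
    """
    totals = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in pages:
        totals[name] += 1
        if was_actionable:
            actionable[name] += 1

    flagged = []
    for name, count in totals.items():
        ratio = actionable[name] / count
        if ratio < min_actionable:
            flagged.append((name, ratio, count))
    # Lowest ratio first; among equals, the noisiest alert first.
    flagged.sort(key=lambda row: (row[1], -row[2]))
    return flagged


history = [("DiskUsage", False)] * 40 + [("DiskUsage", True)]
history += [("ErrorBudgetBurn", True)] * 3
for name, ratio, count in demotion_candidates(history):
    print(f"{name}: {ratio:.0%} actionable over {count} pages")
```

On this sample history only DiskUsage is flagged, which is exactly the alert from the opening story.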

Why this works

“Just snooze the noisy ones” feels like the cheap fix, and it’s how coverage quietly dies. The right tools shrink noise without losing signal: Alertmanager’s group_by collapses a storm of related alerts into one notification, inhibit_rules mutes the warning when the matching critical is already firing, and a group_wait of 30s holds back the first notification for a new group so that related alerts arrive together and a flap that resolves inside that window never pages. But routing config only de-duplicates and batches; it can’t make a fundamentally non-actionable alert worth waking for. That decision happens upstream, when you decide the alert deserves a pager at all.
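
To see what grouping and inhibition buy you, here is a toy model of the two ideas. It is a simplified illustration, not Alertmanager's implementation; the label names and the choice of alertname plus cluster as the matching labels are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]


def notifications(alerts: List[Alert],
                  group_by: Tuple[str, ...] = ("alertname", "cluster")
                  ) -> Dict[Tuple[str, ...], List[Alert]]:
    """Apply inhibition, then grouping; each returned key is one notification."""
    # Inhibition: a firing critical mutes warnings with the same alertname and cluster.
    critical_keys = {(a["alertname"], a["cluster"])
                     for a in alerts if a["severity"] == "critical"}
    kept = [a for a in alerts
            if not (a["severity"] == "warning"
                    and (a["alertname"], a["cluster"]) in critical_keys)]

    # Grouping: alerts sharing the group_by labels collapse into one notification.
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in kept:
        groups[tuple(alert[label] for label in group_by)].append(alert)
    return dict(groups)


storm = [
    {"alertname": "HighLatency", "cluster": "eu", "severity": "critical", "instance": "a"},
    {"alertname": "HighLatency", "cluster": "eu", "severity": "warning", "instance": "a"},
    {"alertname": "HighLatency", "cluster": "eu", "severity": "critical", "instance": "b"},
    {"alertname": "HighLatency", "cluster": "us", "severity": "warning", "instance": "c"},
]
print(len(notifications(storm)))  # 2 notifications instead of 4 pages
```

Note what the model cannot do: if HighLatency were not worth waking for, it would still deliver one well-grouped, useless page.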

Runbooks, escalation, and the metrics that keep you honest

Every page that reaches a human should link a runbook: the diagnostic checklist and known mitigations for this alert. A runbook turns a 3 a.m. cold-start (“what even is cache-node-7?”) into a procedure, which is what actually lowers MTTR for the median responder who didn’t write the service. If an alert has no runbook, that’s a signal it isn’t ready to page on.
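
The "no runbook, not ready to page" rule is easy to enforce mechanically. A minimal sketch of a lint over alert definitions, assuming each rule is a dict with a severity field and a runbook_url annotation; that annotation name is a common convention, not something the lesson specifies.

```python
from typing import Dict, List


def missing_runbooks(rules: List[Dict]) -> List[str]:
    """Names of paging rules that do not link a runbook."""
    return [
        rule["name"]
        for rule in rules
        if rule.get("severity") == "page"
        and not rule.get("annotations", {}).get("runbook_url")
    ]


rules = [
    {"name": "HighLatency", "severity": "page",
     "annotations": {"runbook_url": "https://runbooks.example/high-latency"}},
    {"name": "ErrorBudgetBurn", "severity": "page", "annotations": {}},
    {"name": "DiskUsage", "severity": "ticket", "annotations": {}},
]
print(missing_runbooks(rules))  # ['ErrorBudgetBurn']
```

Run as a CI check, a non-empty result would block the alert from reaching the pager until someone writes the procedure.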

When the primary responder can’t resolve it, the escalation path routes onward — primary → secondary → service owner / incident commander — on a timer, so a stuck page doesn’t sit silent. And you keep the whole system honest with metrics:

MTTA (mean time to acknowledge) — industry median 8–15 min; rising MTTA is the early signature of fatigue or bad routing.
MTTR (mean time to resolve) — DORA 2024 elite is under 1 hour; low performers exceed a week.
Page volume per shift — trending up means the rotation is degrading; this is what the ≤2-incidents target watches.
% actionable — the north star. Every page that wasn’t actionable is a candidate for deletion. Push this toward 100% and MTTA, MTTR, and retention all follow.
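
The metrics above fall out of a simple page log. A minimal sketch, assuming each page records when it fired, was acknowledged, and was resolved (in minutes on a shared clock) plus an after-the-fact actionable flag; the field names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def rotation_metrics(pages: List[Dict]) -> Dict[str, float]:
    """Compute MTTA, MTTR (both in minutes) and % actionable for a set of pages."""
    if not pages:
        raise ValueError("no pages to summarize")
    return {
        "mtta_min": mean(p["acked"] - p["fired"] for p in pages),
        "mttr_min": mean(p["resolved"] - p["fired"] for p in pages),
        "pct_actionable": 100 * sum(1 for p in pages if p["actionable"]) / len(pages),
    }


shift = [
    {"fired": 0, "acked": 10, "resolved": 40, "actionable": True},
    {"fired": 100, "acked": 120, "resolved": 160, "actionable": False},
]
print(rotation_metrics(shift))
# {'mtta_min': 15, 'mttr_min': 50, 'pct_actionable': 50.0}
```

Page volume per shift is simply `len(pages)` for the shift; tracking all four per shift over time is what exposes the trend.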

Pick the best fit

Your service has 5% of users seeing 2s+ load times intermittently, but no SLO is breached yet and you can't reliably reproduce it. How do you wire this into on-call?

Quiz

A disk-usage > 80% alert pages the on-call 40 times a month and self-resolves nearly every time. What's the senior move?

Quiz

Which metric most directly tells you the rotation is degrading toward burnout?

Order the steps

Order the lifecycle of a well-run page, from definition to closure:

1 Decide the alert is actionable and symptom/SLO-based before it ever reaches the pager
2 Page fires; Alertmanager groups + inhibits so the responder gets one signal, not a storm
3 Responder acknowledges (MTTA) and opens the linked runbook
4 Mitigate to restore service (MTTR); escalate on a timer if stuck
5 Write the postmortem and delete or tune any alert that fired without being actionable

Every page must be actionable before it reaches the pager. Acknowledge promptly (the industry median MTTA is 8–15 min), follow the runbook, and escalate on a timer if stuck. After resolution, delete or demote any alert that fired without being actionable.
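
The "escalate on a timer" step can be sketched as a small policy table. The 15- and 30-minute thresholds are assumptions for illustration; the lesson names the chain but not the timings.

```python
from typing import List

# Assumed policy: (minutes after the page fired, who gets notified from then on).
ESCALATION_POLICY = [
    (0, "primary"),
    (15, "secondary"),
    (30, "service owner / incident commander"),
]


def who_to_page(minutes_since_page: float) -> List[str]:
    """Everyone who should have been notified for a page still open at this point."""
    return [role for after, role in ESCALATION_POLICY if minutes_since_page >= after]


print(who_to_page(5))   # ['primary']
print(who_to_page(20))  # ['primary', 'secondary']
```

Earlier tiers stay in the list on purpose: escalation adds people to a stuck page rather than taking it away from the primary.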

Recall before you leave

01 Explain to a teammate why a noisy 'disk > 80%' alert is more dangerous than having no alert at all.
02 Why does Google SRE cap on-call at two incidents per shift and operational work at 50% of an SRE's time, and what breaks if you ignore those caps?

Recap

On-call succeeds or fails on one rule: every alert that wakes a human must be actionable, or it must be deleted. Page on symptoms and SLO burn rate — what a user actually feels and what your reliability promise covers — not on causes like CPU or disk that fire on non-events and mostly self-resolve. The failure mode that defines the discipline is alert fatigue: with 60–80% false positives and a median 42 pages a week, responders desensitize and dismiss the one real incident along with the noise, while 41% consider quitting over the load. Defend against it structurally — delete non-actionable alerts, use grouping and inhibition to collapse storms, cap load at roughly two incidents per shift and 50% ops time, and use follow-the-sun so nobody is routinely paged at 3 a.m. Every page links a runbook so the median responder can act, escalation routes onward on a timer, and you keep the system honest with MTTA, MTTR, page volume per shift, and the north-star metric, % actionable. Push that toward 100% and a trustworthy pager, healthier responders, and faster recovery all follow. Now when you inherit a noisy rotation, you have the vocabulary to name every failure — and the first action to take is auditing what should never have reached the pager.
