
Engineering Practice

On-call and incident response: every page must be actionable

Crux: On-call lives or dies on one rule: every alert is actionable or it is deleted. Symptom-based, SLO-driven pages catch real harm; cause-based noise breeds alert fatigue, and a fatigued responder misses the one page that mattered.

3:47 a.m. The pager fires: DiskUsage > 80% on cache-node-7. The responder, six pages deep into a quiet-but-relentless night, swipes it away half-asleep — the disk alert has cried wolf forty times this month and self-resolved every time. Except tonight it didn’t. The node filled, the cache evicted hard, the database took the full read load and fell over at 4:10. The page that mattered was indistinguishable from the forty that didn’t. That is not a monitoring gap. That is alert fatigue, and it is the failure mode that defines on-call.

The rotation: on-call is a load you budget, not a hero shift

A rotation is a schedule of who carries the pager. The naive version — “whoever’s free” or one heroic owner — collapses: the hero burns out, leaves, and takes the only knowledge with them. A sustainable rotation spreads load across enough engineers that any single shift is survivable, and it treats responder time as a finite, expensive resource.

Google SRE codifies this with hard numbers. SREs spend at most 50% of their time on operational (“ops”) work (on-call, tickets, manual toil) so the other half goes to engineering that makes the next quarter quieter. On-call itself is capped at 25% of an SRE’s total time, which leaves the rest of the ops half for tickets and other toil. And per shift, the target is a maximum of two incidents, because handling one incident end-to-end (triage, mitigation, root-cause, postmortem, follow-up fixes) runs about six hours of real work. Two incidents already fill a 12-hour shift; a third means corners get cut and the postmortem never gets written.
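The arithmetic behind the two-incident cap can be sketched in a few lines. This is a minimal illustration, not a Google tool: the six hours per incident and the cap of two come from the text above, while the 12-hour shift length and the function name `shift_load` are assumptions made for the example.

```python
HOURS_PER_INCIDENT = 6        # from the text: triage through follow-up fixes
MAX_INCIDENTS_PER_SHIFT = 2   # from the text
SHIFT_HOURS = 12              # assumption: a 12-hour on-call shift


def shift_load(incidents: int) -> dict:
    """Estimate how much of a shift a given number of incidents consumes."""
    hours = incidents * HOURS_PER_INCIDENT
    return {
        "incident_hours": hours,
        "fraction_of_shift": hours / SHIFT_HOURS,
        "over_budget": incidents > MAX_INCIDENTS_PER_SHIFT,
    }


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, shift_load(n))
```

Under these assumptions a third incident pushes the estimated work past the length of the shift, which is the point at which follow-up work starts getting dropped.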

Follow-the-sun is the structural fix for the worst part: nobody should be paged at 3 a.m. as a routine. You hand the pager across time zones — a team in EMEA covers their daylight, hands to the Americas, who hand to APAC — so every responder is awake and alert when they hold it. The cost is coordination overhead and clean handoffs; the payoff is that “on-call” stops meaning “destroyed sleep.”
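A follow-the-sun schedule is, at its simplest, a lookup from the current UTC hour to the region holding the pager. The sketch below assumes three equal eight-hour blocks; the handoff hours and the `HANDOFFS` table are hypothetical, and a real schedule would follow each team's actual working hours.

```python
# Hypothetical handoff table: (start hour in UTC, region that takes the pager).
# The hours are illustrative, not a recommendation.
HANDOFFS = [(0, "APAC"), (8, "EMEA"), (16, "Americas")]


def region_on_call(hour_utc: int) -> str:
    """Return the region holding the pager at the given UTC hour (0-23)."""
    if not 0 <= hour_utc < 24:
        raise ValueError("hour_utc must be in the range 0-23")
    current = HANDOFFS[0][1]
    for start, region in HANDOFFS:
        if hour_utc >= start:
            current = region
    return current


if __name__ == "__main__":
    for hour in (3, 8, 15, 16, 23):
        print(hour, region_on_call(hour))
```

The handoff itself (open incidents, context, pending follow-ups) is the expensive part and is not modeled here.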

Alerting philosophy: page on symptoms, not causes

This is the single highest-leverage decision in on-call, and most noisy rotations get it backwards. Two ways to decide what fires a page:

  • Cause-based: CPU > 80%, disk > 80%, connection pool 90% full, replica lag rising. These describe internal mechanics. The problem: high CPU is often fine, frequently self-corrects, and rarely means a user is harmed. Cause alerts fire constantly and most are non-events.
  • Symptom-based / SLO-based: p99 latency > 500ms for 5 minutes, error rate burning the monthly budget at 6×. These describe what a user actually experiences and what your SLO promised. They fire only when something real is breaking.

Google’s rule is blunt: spend the effort catching symptoms, and only alert on causes that are definite and imminent. The four golden signals (latency, traffic, errors, saturation) together catch nearly every meaningful failure as a symptom. The mechanism that turns an SLO into a page is burn rate: how fast you’re consuming the error budget relative to the pace that would spend exactly the whole budget over the SLO period, so a burn rate of 1 uses the budget up right at the end of the period and 6× uses it up in a sixth of that time. A multi-window, multi-burn-rate alert pages on a fast burn only when a long window and a short window both exceed the threshold, and stays silent on a slow drip, so urgency matches the actual threat to your reliability promise, not a raw threshold.
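Burn rate is easy to compute by hand, which helps when sanity-checking an alert rule. The sketch below assumes a request-based SLO and a 30-day budget period; the function names and the sample error ratios are illustrative, and the paging threshold is passed in as a parameter since it depends on the SLO policy you adopt.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO)."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / budget


def days_to_exhaust(rate: float, period_days: float = 30.0) -> float:
    """Days until the budget is gone if the current burn rate holds."""
    return float("inf") if rate <= 0 else period_days / rate


def should_page(long_err: float, short_err: float, slo: float, threshold: float) -> bool:
    """Multi-window check: page only if both windows burn at or above threshold."""
    return (burn_rate(long_err, slo) >= threshold
            and burn_rate(short_err, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.999  # assumed SLO for the example
    print(should_page(0.008, 0.02, slo, threshold=6))   # fast burn in both windows
    print(should_page(0.002, 0.002, slo, threshold=6))  # slow drip
    print(should_page(0.001, 0.05, slo, threshold=6))   # brief spike only
```

Requiring the short window as well means the page stops firing soon after the burn stops, instead of lingering until the long window drains.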

Alert | Type | Page a human?
CPU > 80% | Cause | No: often fine, self-corrects; dashboard, not pager
disk > 80% | Cause | No (alone): ticket, or page only if trend hits full in < 1h
p99 latency > 500ms / 5m | Symptom | Yes: users feel it now; SLO at risk
Error budget burning | Symptom (SLO) | Yes: budget gone in days at this rate
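The disk row is the one cause-based alert that can earn a page, and only when the trend says the disk will fill within the hour. A minimal sketch of that decision, assuming a straight-line extrapolation from two samples (the function names and the linear model are assumptions for illustration; real predictors usually fit over more points):

```python
from typing import Optional


def hours_until_full(used_pct_then: float, used_pct_now: float,
                     hours_between: float) -> Optional[float]:
    """Linear extrapolation of disk usage; None if usage is flat or falling."""
    rate = (used_pct_now - used_pct_then) / hours_between  # percent per hour
    if rate <= 0:
        return None
    return (100.0 - used_pct_now) / rate


def disk_action(used_pct_then: float, used_pct_now: float,
                hours_between: float) -> str:
    """Page only when full is imminent; otherwise ticket or nothing."""
    eta = hours_until_full(used_pct_then, used_pct_now, hours_between)
    if eta is not None and eta < 1.0:
        return "page"
    if used_pct_now > 80.0:
        return "ticket"
    return "none"


if __name__ == "__main__":
    print(disk_action(80, 90, 0.5))   # climbing 20 points per hour
    print(disk_action(81, 82, 24))    # above 80% but barely moving
    print(disk_action(50, 50, 1))     # flat and low
```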

The central failure mode: alert fatigue

Here is the trap, with numbers. Industry false-positive rates run 60–80% — most pages don’t need a human. The median on-call engineer absorbs about 42 pages per week; once the false-positive ratio crosses ~60%, responders start pattern-matching (“disk alert again, ignore”) which adds 2–5 minutes of MTTA per page even when they do engage. The human cost is brutal: 62% report weekly sleep disruption and 41% of on-call engineers have considered leaving over alert load. Pager burnout is an attrition driver, and attrition takes operational knowledge out the door with it.

The mechanism is simple and lethal: too many low-value pages → responders desensitize → they reflexively dismiss → the one real incident arrives wearing the same costume as the noise and gets dismissed too. You cannot fix this with a better responder or more discipline. You fix it by deleting alerts. The hard rule a senior enforces: if a page is not actionable — if the response is “watch it” or “it’ll self-resolve” — it is not an alert, it’s a notification, and it does not go to the pager. Demote it to a ticket or a dashboard. The goal isn’t more coverage; it’s a pager you can still trust at 4 a.m.

Why this works

“Just snooze the noisy ones” feels like the cheap fix, and it’s how coverage quietly dies. The right tools shrink noise without losing signal: Alertmanager group_by collapses a storm of related alerts into one notification, inhibit_rules mutes the warning when the matching critical is already firing, and a group_wait of 30s lets a flap self-resolve before it ever pages. But routing config only de-duplicates — it can’t make a fundamentally non-actionable alert worth waking for. That decision happens upstream, when you decide the alert deserves a pager at all.
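To make grouping and inhibition concrete, here is a toy model of the two ideas in plain Python. It is not Alertmanager's implementation or configuration syntax: the alert dictionaries, label names, and the rule "a critical mutes warnings with the same alertname and cluster" are all assumptions chosen for the example.

```python
from collections import defaultdict


def inhibit(alerts):
    """Drop warnings when a critical with the same alertname and cluster fires."""
    critical = {(a["alertname"], a["cluster"])
                for a in alerts if a["severity"] == "critical"}
    return [a for a in alerts
            if not (a["severity"] == "warning"
                    and (a["alertname"], a["cluster"]) in critical)]


def group(alerts, group_by=("alertname", "cluster")):
    """Bucket alerts by the chosen labels; each bucket becomes one notification."""
    groups = defaultdict(list)
    for a in alerts:
        groups[tuple(a[k] for k in group_by)].append(a)
    return dict(groups)


if __name__ == "__main__":
    storm = [
        {"alertname": "HighLatency", "cluster": "eu", "instance": "a", "severity": "critical"},
        {"alertname": "HighLatency", "cluster": "eu", "instance": "b", "severity": "critical"},
        {"alertname": "HighLatency", "cluster": "eu", "instance": "c", "severity": "warning"},
        {"alertname": "DiskUsage", "cluster": "eu", "instance": "a", "severity": "warning"},
    ]
    notifications = group(inhibit(storm))
    print(len(storm), "alerts ->", len(notifications), "notifications")
```

Note what the model cannot do: the `DiskUsage` warning still comes through as its own notification. Whether it deserves the pager is a decision no amount of grouping makes for you.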

Runbooks, escalation, and the metrics that keep you honest

Every page that reaches a human should link a runbook: the diagnostic checklist and known mitigations for this alert. A runbook turns a 3 a.m. cold-start (“what even is cache-node-7?”) into a procedure, which is what actually lowers MTTR for the median responder who didn’t write the service. If an alert has no runbook, that’s a signal it isn’t ready to page on.
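One way to enforce "no runbook, no page" is a small gate run against alert definitions before they are allowed to route to the pager. The sketch below is hypothetical throughout: the `actionable` and `runbook_url` fields, the function names, and the example URL are assumptions about how alerts might be described, not a feature of any particular tool.

```python
def paging_blockers(alert: dict) -> list:
    """Return the reasons this alert is not ready to wake a human."""
    blockers = []
    if not alert.get("actionable", False):
        blockers.append("no concrete action for the responder")
    if not alert.get("runbook_url"):
        blockers.append("no runbook linked")
    return blockers


def destination(alert: dict) -> str:
    """Route to the pager only when nothing blocks it; otherwise file a ticket."""
    return "pager" if not paging_blockers(alert) else "ticket"


if __name__ == "__main__":
    ready = {"name": "CacheNodeFull", "actionable": True,
             "runbook_url": "https://runbooks.example/cache-node-full"}
    noisy = {"name": "DiskUsage80", "actionable": False}
    print(destination(ready), paging_blockers(ready))
    print(destination(noisy), paging_blockers(noisy))
```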

When the primary responder can’t resolve it, the escalation path routes onward — primary → secondary → service owner / incident commander — on a timer, so a stuck page doesn’t sit silent. And you keep the whole system honest with metrics:

  • MTTA (mean time to acknowledge) — industry median 8–15 min; rising MTTA is the early signature of fatigue or bad routing.
  • MTTR (mean time to resolve) — DORA 2024 elite is under 1 hour; low performers exceed a week.
  • Page volume per shift — trending up means the rotation is degrading; this is what the ≤2-incidents target watches.
  • % actionable — the north star. Every page that wasn’t actionable is a candidate for deletion. Push this toward 100% and MTTA, MTTR, and retention all follow.
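These metrics fall out of a simple log of pages. The sketch below assumes each page records when it fired, when it was acknowledged, when it was resolved, and whether the responder judged it actionable; the `Page` record and its field names are hypothetical, and times are minutes from an arbitrary origin.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Page:
    fired_min: float      # when the page fired
    acked_min: float      # when a human acknowledged it
    resolved_min: float   # when service was restored
    actionable: bool      # did the responder need to do anything?


def summarize(pages):
    """Compute MTTA, MTTR, volume, and % actionable for a non-empty page log."""
    return {
        "page_count": len(pages),
        "mtta_min": mean(p.acked_min - p.fired_min for p in pages),
        "mttr_min": mean(p.resolved_min - p.fired_min for p in pages),
        "pct_actionable": 100.0 * sum(p.actionable for p in pages) / len(pages),
    }


if __name__ == "__main__":
    shift = [
        Page(fired_min=0, acked_min=10, resolved_min=40, actionable=True),
        Page(fired_min=0, acked_min=20, resolved_min=80, actionable=False),
    ]
    print(summarize(shift))
```

Every page counted as not actionable in a summary like this is a concrete candidate for the next round of alert deletion.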
Pick the best fit

Your service has 5% of users seeing 2s+ load times intermittently, but no SLO is breached yet and you can't reliably reproduce it. How do you wire this into on-call?

Quiz

A disk-usage > 80% alert pages the on-call 40 times a month and self-resolves nearly every time. What's the senior move?

Quiz

Which metric most directly tells you the rotation is degrading toward burnout?

Order the steps

Order the lifecycle of a well-run page, from definition to closure:

  1. Decide the alert is actionable and symptom/SLO-based before it ever reaches the pager
  2. Page fires; Alertmanager groups + inhibits so the responder gets one signal, not a storm
  3. Responder acknowledges (MTTA) and opens the linked runbook
  4. Mitigate to restore service (MTTR); escalate on a timer if stuck
  5. Write the postmortem and delete or tune any alert that fired without being actionable
Recall before you leave
  1. Explain to a teammate why a noisy 'disk > 80%' alert is more dangerous than having no alert at all.
  2. Why does Google SRE cap on-call at two incidents per shift and operational work at 50% of an SRE's time, and what breaks if you ignore those caps?
Recap

On-call succeeds or fails on one rule: every alert that wakes a human must be actionable, or it must be deleted. Page on symptoms and SLO burn rate — what a user actually feels and what your reliability promise covers — not on causes like CPU or disk that fire on non-events and mostly self-resolve. The failure mode that defines the discipline is alert fatigue: with 60–80% false positives and a median 42 pages a week, responders desensitize and dismiss the one real incident along with the noise, while 41% consider quitting over the load. Defend against it structurally — delete non-actionable alerts, use grouping and inhibition to collapse storms, cap load at roughly two incidents per shift and 50% ops time, and use follow-the-sun so nobody is routinely paged at 3 a.m. Every page links a runbook so the median responder can act, escalation routes onward on a timer, and you keep the system honest with MTTA, MTTR, page volume per shift, and the north-star metric, % actionable. Push that toward 100% and a trustworthy pager, healthier responders, and faster recovery all follow.

