Engineering Practice
On-call: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the on-call discipline stick when you are actually holding the pager.
Reconstruct the unit’s core mechanisms — why symptoms beat causes, what burn-rate alerting does, the anatomy of alert fatigue, the load caps, the page lifecycle, and how toil reduction keeps a rotation sustainable — without looking back.
- 01. Why is symptom-based alerting higher-leverage than cause-based alerting, and what is the one exception?
- 02. Explain burn-rate alerting and why multi-window, multi-burn-rate beats a raw error-rate threshold.
- 03. Walk through the mechanism of alert fatigue and why you can't fix it by telling the responder to be more careful.
- 04. State Google SRE's on-call load caps and the reasoning that makes each cap a reliability lever, not just a humane one.
- 05. Describe the lifecycle of a well-run page from definition to closure, and where each on-call mechanism plugs in.
- 06. What is toil, why does reducing it matter for on-call specifically, and how do the metrics MTTA, MTTR, page volume, and % actionable steer that reduction?
If you could reconstruct each answer from memory, you hold the unit’s spine: page on symptoms and SLO burn rate, not causes; burn-rate alerting matches urgency to real budget threat; alert fatigue is a structural failure cured by deletion, not discipline; the load caps (at most about 2 incidents per shift, ≤50% of time on ops work, ≤25% of time on-call) protect the engineering that keeps the rotation quiet; every page runs a defined lifecycle with a runbook and timed escalation; and toil reduction, steered by MTTA, MTTR, page volume, and % actionable, is what keeps it all sustainable.
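To check your burn-rate answer against something concrete, here is a minimal Python sketch of a multi-window, multi-burn-rate decision. It is an illustration under stated assumptions, not a production alerting rule: the names `BurnRule`, `burn_rate`, and `decide` are hypothetical, and the thresholds (14.4 over 1 h / 5 min and 6 over 6 h / 30 min to page, 1 over 3 d / 6 h to ticket) are the example values the Google SRE Workbook gives for a 99.9% SLO over 30 days, not numbers this unit specifies.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnRule:
    """One alerting rule: both windows must exceed the burn-rate threshold."""
    long_window_min: int
    short_window_min: int
    threshold: float  # multiple of the "exactly on budget" burn rate
    action: str       # "page" or "ticket"


# ASSUMPTION: example thresholds from the Google SRE Workbook
# (99.9% SLO, 30-day window); tune them for your own SLO.
RULES = (
    BurnRule(long_window_min=60, short_window_min=5, threshold=14.4, action="page"),
    BurnRule(long_window_min=360, short_window_min=30, threshold=6.0, action="page"),
    BurnRule(long_window_min=4320, short_window_min=360, threshold=1.0, action="ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def decide(
    error_ratio_by_window_min: Mapping[int, float],
    slo_target: float,
    rules=RULES,
) -> Optional[str]:
    """Return the most urgent action whose long AND short windows both burn too fast."""
    fired = set()
    for rule in rules:
        long_burn = burn_rate(error_ratio_by_window_min[rule.long_window_min], slo_target)
        short_burn = burn_rate(error_ratio_by_window_min[rule.short_window_min], slo_target)
        # The long window shows the budget threat is real; the short window
        # shows it is still happening, so the alert resets soon after recovery.
        if min(long_burn, short_burn) >= rule.threshold:
            fired.add(rule.action)
    if "page" in fired:
        return "page"
    if "ticket" in fired:
        return "ticket"
    return None


# Hypothetical measurements: error ratio observed over each window (minutes).
sustained = {5: 0.02, 30: 0.02, 60: 0.02, 360: 0.02, 4320: 0.02}
brief_spike = {5: 0.02, 30: 0.004, 60: 0.002, 360: 0.0005, 4320: 0.0001}
slow_leak = {5: 0.002, 30: 0.002, 60: 0.002, 360: 0.002, 4320: 0.002}

for name, ratios in [("sustained", sustained), ("brief_spike", brief_spike), ("slow_leak", slow_leak)]:
    print(name, decide(ratios, slo_target=0.999))
```

The three sample inputs show the contrast with a raw error-rate threshold. A 2% error ratio sustained across every window pages. The same 2% seen only in the last five minutes does not page, because the long windows show little budget consumed. A steady 0.2% that a raw threshold would likely ignore still opens a ticket, because it is spending budget at roughly twice the sustainable rate.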