What is being on-call?
On-call means being the person responsible for responding when something breaks outside business hours. Done well, it is a fair rotation with actionable alerts and clear runbooks. Done badly, it is a pager that screams all night at things nobody can fix.
Why it matters
On-call is the human side of reliability, and it is where good observability pays off or fails. Engineers who design humane on-call — useful alerts, written runbooks, blameless follow-up — keep teams healthy and systems reliable. Burnout from bad on-call drives people out of the field.
What to learn
- Actionable alerts versus noise
- Severity levels and what each demands
- Escalation paths and rotations
- Runbooks: what to check and what to do
- Acknowledging, mitigating, then fixing
- Alert fatigue and how to fight it
- Following up so the same page does not recur
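To make severity levels and escalation paths concrete, here is a minimal sketch in Python. The three-level scheme, the acknowledgement deadlines, and the role names are all illustrative assumptions, not a standard; real teams define their own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """What one severity level demands (all values are illustrative)."""
    pages_a_human: bool
    ack_deadline_minutes: Optional[int]  # None means nobody is paged


# Hypothetical three-level scheme; names and numbers are assumptions.
POLICIES = {
    "sev1": SeverityPolicy(pages_a_human=True, ack_deadline_minutes=5),
    "sev2": SeverityPolicy(pages_a_human=True, ack_deadline_minutes=15),
    "sev3": SeverityPolicy(pages_a_human=False, ack_deadline_minutes=None),
}

# Escalation path: each step is tried if the previous one has not acknowledged.
ESCALATION_PATH = ["primary", "secondary", "engineering manager"]


def who_to_page(severity: str, minutes_unacknowledged: int) -> Optional[str]:
    """Return who should be paged now, or None if the alert only opens a ticket."""
    policy = POLICIES[severity]
    if not policy.pages_a_human:
        return None
    # Move one step along the path for every missed acknowledgement deadline,
    # stopping at the last person on the path.
    step = minutes_unacknowledged // policy.ack_deadline_minutes
    return ESCALATION_PATH[min(step, len(ESCALATION_PATH) - 1)]


print(who_to_page("sev1", 0))   # primary
print(who_to_page("sev1", 7))   # secondary
print(who_to_page("sev3", 30))  # None
```

Writing the policy down as data keeps the question "who gets woken, and when" reviewable by the whole team instead of living in one person's head.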
Common pitfall
The common mistake is alerting on causes instead of symptoms: the pager fires for high CPU that users never notice, while a real outage slips through. Alert on user-facing symptoms (errors, latency, unavailability) that genuinely need a human now. Every alert that is not actionable trains people to ignore the pager.
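The difference between a cause-based and a symptom-based alert can be sketched in a few lines of Python. This is an illustration under stated assumptions: the WindowStats shape, the thresholds, and the function name are hypothetical, and a real system would normally express the same rule in its monitoring tool rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request statistics for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    p99_latency_ms: float
    cpu_percent: float  # a cause-level signal: recorded, but never paged on


# Illustrative thresholds; real values come from the service's own targets.
MAX_ERROR_RATE = 0.02        # page if more than 2% of requests fail
MAX_P99_LATENCY_MS = 1500.0  # page if p99 latency exceeds 1.5 seconds


def should_page(stats: WindowStats) -> list[str]:
    """Return the user-facing reasons to page; an empty list means stay quiet."""
    reasons: list[str] = []
    if stats.total_requests == 0:
        # No traffic to judge here; a separate availability probe
        # would have to cover the "nothing is reaching us" case.
        return reasons
    error_rate = stats.failed_requests / stats.total_requests
    if error_rate > MAX_ERROR_RATE:
        reasons.append(f"error rate {error_rate:.1%} is above {MAX_ERROR_RATE:.0%}")
    if stats.p99_latency_ms > MAX_P99_LATENCY_MS:
        reasons.append(f"p99 latency {stats.p99_latency_ms:.0f} ms is above the limit")
    return reasons


# High CPU, but users are fine: no page.
busy_but_healthy = WindowStats(5000, 10, 400.0, cpu_percent=95.0)
# Low CPU, but 12% of requests fail: page.
quiet_but_broken = WindowStats(5000, 600, 400.0, cpu_percent=20.0)

print(should_page(busy_but_healthy))
print(should_page(quiet_but_broken))
```

Note that cpu_percent never appears in the paging decision. It stays useful as a dashboard signal for diagnosing a page, which is where cause-level metrics belong.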
Resources
Primary (free):
- Google SRE — Being on-call · docs
- Google SRE — Practical alerting · docs
- PagerDuty — Incident response · docs
Practice
Take an existing alert and judge it: does it fire on a user-facing symptom, is it actionable, and is there a runbook? Rewrite one cause-based alert as a symptom-based one, and write a short runbook for it. Done when the alert would only wake someone for something worth being woken for.
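The three questions in this exercise can be written down as a small checklist. The sketch below is one hypothetical way to do that in Python; the Alert fields, the example alerts, and the runbook path are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    signal: str                       # what the alert measures
    user_facing: bool                 # does it fire on something users experience?
    responder_action: Optional[str]   # first thing a responder can do, if anything
    runbook: Optional[str]            # where the runbook lives, if one exists


def audit(alert: Alert) -> list[str]:
    """Return the problems found; an empty list means the alert passes."""
    problems: list[str] = []
    if not alert.user_facing:
        problems.append("fires on a cause, not a user-facing symptom")
    if not alert.responder_action:
        problems.append("not actionable: nothing for the responder to do")
    if not alert.runbook:
        problems.append("no runbook")
    return problems


# A cause-based alert, and a symptom-based rewrite of it (both hypothetical).
before = Alert("HighCPU", "CPU above 90%", False, None, None)
after = Alert(
    "CheckoutErrors",
    "checkout error rate above 2% for 5 minutes",
    True,
    "roll back the most recent deploy",
    "runbooks/checkout-errors",  # hypothetical location
)

print(audit(before))
print(audit(after))
```

The rewritten alert passes only because someone has answered each question in writing. That is the real work of the exercise; the code just records the answers.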
Outcomes
- Distinguish actionable alerts from noise.
- Alert on user-facing symptoms, not raw causes.
- Write a runbook that guides a responder.
- Design a rotation that does not burn people out.