On-call fundamentals · DevOps · Code with Animation

What is being on-call?

On-call means being the designated person who responds when something breaks, often outside business hours. Done well, it is a fair rotation with actionable alerts and clear runbooks. Done badly, it is a pager that screams all night at things nobody can fix.

Why it matters

On-call is the human side of reliability, and it is where good observability pays off or fails. Engineers who design humane on-call — useful alerts, written runbooks, blameless follow-up — keep teams healthy and systems reliable. Burnout from bad on-call drives people out of the field.

What to learn

Actionable alerts versus noise
Severity levels and what each demands
Escalation paths and rotations
Runbooks: what to check and what to do
Acknowledging, mitigating, then fixing
Alert fatigue and how to fight it
Following up so the same page does not recur
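
Rotations and escalation paths are easier to reason about once they are written down as data. The sketch below is a minimal illustration, not a real paging tool: the weekly round-robin rule, the escalation delays, and every name in it (ESCALATION_STEPS, on_call_for, paged_so_far, the engineer names) are assumptions made for the example.

```python
from datetime import date
from typing import List, Tuple

# Hypothetical escalation chain: (minutes without acknowledgement, who is paged).
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "primary"),
    (15, "secondary"),
    (30, "engineering manager"),
]


def on_call_for(day: date, engineers: List[str], rotation_start: date) -> str:
    """Weekly round-robin: each engineer holds the pager for seven days."""
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


def paged_so_far(minutes_unacknowledged: int) -> List[str]:
    """Everyone who has been paged by now if nobody has acknowledged."""
    return [who for delay, who in ESCALATION_STEPS if minutes_unacknowledged >= delay]


team = ["ana", "ben", "chen"]
start = date(2024, 1, 1)

print(on_call_for(date(2024, 1, 10), team, start))  # ben
print(paged_so_far(20))  # ['primary', 'secondary']
```

A fixed weekly hand-off is only one possible design. The point is that the rotation is predictable and the escalation path has a clear next person, so a missed page never goes unanswered.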

Common pitfall

The classic mistake is alerting on causes instead of symptoms: the pager fires for high CPU that users never notice while a real outage slips through. Alert on user-facing symptoms (errors, latency, unavailability) that genuinely need a human now. Every alert that is not actionable trains people to ignore the pager.
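
To make the difference concrete, here is a minimal sketch of a paging decision that looks only at user-facing symptoms. The WindowStats shape, the should_page function, and both thresholds are assumptions for illustration; real thresholds should come from your own service-level objectives.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Measurements for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    p99_latency_ms: float
    cpu_percent: float  # cause-level signal, kept only for contrast


# Illustrative thresholds, not recommendations.
MAX_ERROR_RATIO = 0.02
MAX_P99_LATENCY_MS = 1500.0


def should_page(stats: WindowStats) -> bool:
    """Page only when users are affected: too many errors or slow responses."""
    if stats.total_requests == 0:
        # This sketch ignores empty windows; a real check might treat
        # missing traffic as its own unavailability symptom.
        return False
    error_ratio = stats.failed_requests / stats.total_requests
    return error_ratio > MAX_ERROR_RATIO or stats.p99_latency_ms > MAX_P99_LATENCY_MS


busy_but_healthy = WindowStats(10_000, 12, 420.0, cpu_percent=96.0)
quiet_but_broken = WindowStats(10_000, 900, 380.0, cpu_percent=35.0)

print(should_page(busy_but_healthy))  # False: CPU is high, users are fine
print(should_page(quiet_but_broken))  # True: CPU is low, 9% of requests fail
```

Note that cpu_percent never appears in the decision. CPU still belongs on a dashboard for diagnosis, but it does not decide whether someone gets woken up.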

Practice

Take an existing alert and judge it: does it fire on a user-facing symptom, is it actionable, and is there a runbook? Rewrite one cause-based alert as a symptom-based one, and write a short runbook for it. You are done when the alert would only wake someone for something worth waking up for.
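
The three questions in this exercise can be turned into a small checklist. The sketch below is one way to do it, assuming a hypothetical Alert record; the field names, the alert names, and the runbook path are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    name: str
    user_facing_symptom: bool
    actionable: bool
    runbook: Optional[str] = None  # link or path; None when no runbook exists


def audit(alert: Alert) -> List[str]:
    """Return the reasons this alert should not page anyone, if any."""
    problems = []
    if not alert.user_facing_symptom:
        problems.append("fires on a cause, not a user-facing symptom")
    if not alert.actionable:
        problems.append("gives the responder nothing to do")
    if not alert.runbook:
        problems.append("has no runbook")
    return problems


before = Alert("HighCPU", user_facing_symptom=False, actionable=False)
after = Alert(
    "CheckoutErrorRatioHigh",
    user_facing_symptom=True,
    actionable=True,
    runbook="runbooks/checkout-errors.txt",  # hypothetical path
)

print(audit(before))  # three problems
print(audit(after))   # []
```

An empty list means the alert passes all three questions from the exercise. It does not prove the alert is well tuned, only that it is worth the next round of review.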

Outcomes

Distinguish actionable alerts from noise.
Alert on user-facing symptoms, not raw causes.
Write a runbook that guides a responder.
Design a rotation that does not burn people out.