Why on-call burns people out
On-call is the most-cited reason engineers leave operationally heavy roles. The complaints are remarkably consistent across companies and seniority levels:
- Alerts that aren't actionable.
- Alerts during the night for things that could have waited until morning.
- Same alert firing repeatedly without anyone fixing the root cause.
- No handoff — you inherit a full alerts queue from the previous on-call.
- No recovery time after a brutal week.
- No pay differential despite being on-call effectively 24/7.
- Single-person rotations where you can never truly disconnect.
The fix isn't "on-call is just hard, deal with it." Each of these is a system design problem with a solution.
Picking a rotation cadence
The standard options:
Daily rotation
Sounds gentle — one day at a time. In practice, exhausting. You're always either preparing for on-call, on-call, or recovering. Context never builds because every incident handoff is to someone else within hours.
Weekly rotation
The pragmatic default. One week is long enough to build context (you remember the alert that fired Monday when it fires again Friday) and short enough to recover. Handoff happens once a week with shared context.
Bi-weekly or monthly
Common at larger companies but tends to destroy off-week context. By week 3 of "off-call" you've forgotten what the production system looks like. When you come back on-call, you're relearning.
Recommendation: weekly. It's the standard for a reason.
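The arithmetic is easy to sanity-check in a few lines. Below is a minimal sketch (the roster names are made up) that prints a weekly primary/secondary schedule; with six people, each engineer takes primary roughly every six weeks.

```python
from datetime import date, timedelta

# Hypothetical roster; in practice, pull this from wherever you track the team.
ROSTER = ["alice", "bob", "carol", "dave", "erin", "frank"]

def weekly_rotation(roster, start_monday, weeks):
    """Yield (week_start, primary, secondary) for a simple weekly rotation.

    The secondary is just the next person in the roster, so each engineer
    does one primary week and one secondary week per full cycle.
    """
    for i in range(weeks):
        week_start = start_monday + timedelta(weeks=i)
        yield week_start, roster[i % len(roster)], roster[(i + 1) % len(roster)]

if __name__ == "__main__":
    for week_start, primary, secondary in weekly_rotation(ROSTER, date(2024, 1, 1), 8):
        print(f"{week_start}: primary={primary}, secondary={secondary}")
```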
The handoff ritual
The transition from one week's on-call to the next is where context lives or dies. A good handoff is a 15–30 minute structured conversation, not a Slack message saying "you're up."
What to cover:
- What incidents fired this week and how they were resolved.
- Open issues or known weirdness still in flight.
- Anything currently degraded or under elevated risk.
- Scheduled maintenance, deploys, or external events in the coming week.
- Any alert thresholds that were temporarily silenced and should be reviewed.
Document the handoff in writing (a simple shared doc or PR works). The written record also helps whoever takes over the week after next, who wasn't there for the verbal handoff.
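If you want that written record to have a consistent shape, a short script can stamp out the skeleton. This is just a sketch: the section headings mirror the checklist above, and the names are placeholders.

```python
from datetime import date

# Sections mirror the handoff checklist above.
HANDOFF_SECTIONS = [
    "Incidents this week and how they were resolved",
    "Open issues or known weirdness still in flight",
    "Anything currently degraded or under elevated risk",
    "Scheduled maintenance, deploys, or external events next week",
    "Temporarily silenced alert thresholds to review",
]

def handoff_doc(outgoing: str, incoming: str, day: date) -> str:
    """Return a markdown skeleton for the written handoff."""
    lines = [f"# On-call handoff {day.isoformat()}: {outgoing} -> {incoming}", ""]
    for section in HANDOFF_SECTIONS:
        lines += [f"## {section}", "- (fill in)", ""]
    return "\n".join(lines)

if __name__ == "__main__":
    print(handoff_doc("alice", "bob", date.today()))
```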
Escalation policy design
Two questions determine the entire escalation policy:
- If the primary doesn't ack within X minutes, who gets it next?
- What's the maximum chain length before someone definitely answers?
A reasonable default for a small team:
- Page the primary on-call. Wait 5 minutes.
- If unack'd, page the secondary on-call. Wait 5 minutes.
- If unack'd, page the entire team Slack channel + tech lead.
- If unack'd after 15 more minutes, page the CTO/founder.
The 5-minute first delay is important: it gives the primary time to actually look at their phone, walk to a computer, and ack. Any shorter and you're paging the secondary while the primary is still putting on pants.
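It also helps to sanity-check the worst-case time until someone senior gets paged by writing the chain down as data and summing the delays. Here's a sketch of the small-team default above; the target names are placeholders for whatever your paging tool actually routes to.

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    target: str        # who gets paged at this step
    wait_minutes: int  # how long to wait for an ack before escalating further

# The small-team default described above, expressed as data.
ESCALATION_POLICY = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary on-call", 5),
    EscalationStep("team Slack channel + tech lead", 15),
    EscalationStep("CTO/founder", 0),  # last resort; nothing further to escalate to
]

def minutes_until_step(policy, step_index):
    """Worst-case minutes from the first page until this step fires (no acks at all)."""
    return sum(step.wait_minutes for step in policy[:step_index])

if __name__ == "__main__":
    for i, step in enumerate(ESCALATION_POLICY):
        print(f"t+{minutes_until_step(ESCALATION_POLICY, i):2d} min: page {step.target}")
```

With these numbers, a completely unacknowledged page reaches the CTO/founder 25 minutes in.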
Document the escalation policy publicly in your wiki. Surprise escalations breed resentment.
The pay question
This is contentious but the data is clear: explicit compensation for on-call duty significantly improves retention.
Options:
- Weekly on-call stipend. $100–500 per week of on-call. Paid regardless of incident count.
- Per-incident pay. 1.5x or 2x normal hourly rate for time worked outside business hours.
- Comp time. A day off for a "rough" on-call week.
- Equity adjustment. Higher equity for roles with on-call responsibility.
The best pattern combines an explicit stipend (signals "we value this time") with comp time after rough weeks (signals "we know it sucked"). The total dollar amount matters less than the explicit acknowledgement.
Recovery time after rough weeks
If the on-call week had real overnight incidents, taking the next day or two off shouldn't require justification. Build this into policy:
- Any incident that involved > 1 hour of overnight work: comp time the following day.
- A week with multiple overnight incidents: half-day off the following week, no questions asked.
- Any incident that pulled someone off vacation: their next on-call rotation gets swapped to someone else as compensation.
The rule of thumb: people should not be net worse off after a tough on-call week. If they are, you're creating a system that selects for "people who tolerate bad treatment."
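These rules are mechanical enough to write down as a check, which helps keep "no questions asked" honest. A rough sketch; the Incident fields and the notion of "overnight" are assumptions to adapt to however you record incidents.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    overnight_hours: float       # hours worked outside normal waking hours
    interrupted_vacation: bool   # did this page pull someone off vacation?

def recovery_owed(week_incidents):
    """Apply the recovery rules above to one on-call week's incidents."""
    owed = []
    if any(i.overnight_hours > 1 for i in week_incidents):
        owed.append("comp time the following day")
    if sum(1 for i in week_incidents if i.overnight_hours > 0) >= 2:
        owed.append("half-day off the following week")
    if any(i.interrupted_vacation for i in week_incidents):
        owed.append("swap their next on-call rotation")
    return owed

if __name__ == "__main__":
    week = [Incident(overnight_hours=2.5, interrupted_vacation=False),
            Incident(overnight_hours=0.5, interrupted_vacation=False)]
    print(recovery_owed(week))  # ['comp time the following day', 'half-day off the following week']
```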
The weekly alert review
The single most effective practice for keeping on-call sustainable: a 30-minute weekly review of every alert that fired.
For each alert:
- Was it a real incident?
- If yes: what's the root cause? What's the fix to prevent recurrence?
- If no: why did it fire? Tune the threshold, add multi-region confirmation, or delete the alert entirely.
- Did it fire at a reasonable time? If it fired at 3 AM for something that wasn't time-sensitive, downgrade its priority.
- Was the runbook helpful? Update it.
This is mostly about preventing the slow drift toward alert noise. Without a forcing function, alerts only ever get added — never tuned or removed.
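The review goes faster if the week's alerts arrive pre-grouped. A minimal sketch, assuming you can export the week's alerts to a CSV with name and fired_at columns (the file path and column names here are made up); it counts firings per alert and flags the overnight ones.

```python
import csv
from collections import Counter

def summarize_alerts(path):
    """Group a week's alerts by name and flag overnight firings (00:00-06:00)."""
    by_name = Counter()
    overnight = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            by_name[row["name"]] += 1
            hour = int(row["fired_at"][11:13])  # ISO timestamp: "YYYY-MM-DDTHH:MM:SS"
            if hour < 6:
                overnight[row["name"]] += 1

    for name, count in by_name.most_common():
        flag = f"  ({overnight[name]} overnight)" if overnight[name] else ""
        print(f"{count:3d}x {name}{flag}")

if __name__ == "__main__":
    summarize_alerts("alerts_last_week.csv")  # hypothetical export from your alerting tool
```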
Team-size realities
Different team sizes have different on-call realities, and pretending otherwise is dishonest:
1–2 people
You're always on. There's no rotation. Mitigate by minimizing alert volume aggressively and being transparent in hiring that on-call is part of the job. Don't pretend otherwise.
3–5 people
Real rotation is possible but rough. One week on, two or three weeks off. Pay explicitly. Have an off-rotation backup ready to swap.
6–10 people
The sustainable zone for most teams. Weekly primary + secondary, with each person coming up in the rotation roughly every six weeks. Manageable.
10+ people
You can split into product or service teams with separate rotations. Avoid having one team "cover everything" — on-callers can't reasonably know all the systems.
The larger insight: on-call isn't something you "scale through" by adding people. You scale through reducing alert volume, improving runbooks, automating recovery, and treating the on-call role with the respect (and pay) it deserves.