On-Call That Doesn't Burn Out Your Engineers

A senior engineer on your team just resigned. In the exit interview, they said "I haven't slept through a night in three months." You looked at the on-call schedule. They were on it 50% of the time, because nobody else knew the system.

This is one of the most common preventable causes of senior engineer attrition. And it's almost always solvable.

What on-call actually costs

On-call isn't free time. Every week of primary on-call costs:

~10 hours of attention even with zero pages (carrying the laptop, watching alerts)
1-3 nights of disrupted sleep on average
Inability to plan personal life (no concerts, no dinners that can't be cancelled)
Stress that lingers for days after rotation ends

If you compensate for this only with "comp days" or extra PTO, you've just made on-call a net loss for the engineer: they are working extra to recover from working extra.

The real cost is closer to 1.5x salary for the on-call hours. If you have no on-call pay and a senior engineer is on call 25% of the time, the unpaid 0.5x premium on that quarter of their time means you're effectively underpaying them by about 12% (0.25 × 0.5 = 12.5%).
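As a rough sketch of that arithmetic: the function below assumes on-call time is worth 1.5x normal pay but is paid at 1x, so the shortfall is the 0.5x premium on the on-call share of time. The function name and parameters are illustrative, not a standard formula.

```python
def oncall_underpayment(oncall_fraction: float, oncall_multiplier: float = 1.5) -> float:
    """Fraction of salary left unpaid when on-call time earns no premium.

    Assumes on-call time is worth `oncall_multiplier` times normal pay but is
    paid at 1x, so the unpaid part is (multiplier - 1) on that share of time.
    """
    if not 0.0 <= oncall_fraction <= 1.0:
        raise ValueError("oncall_fraction must be between 0 and 1")
    return oncall_fraction * (oncall_multiplier - 1.0)


# A senior engineer on call 25% of the time, with no on-call pay:
print(f"{oncall_underpayment(0.25):.1%}")  # 12.5%
```

Plug in your own rotation: an engineer on call half the time, like the one in the opening example, comes out at a 25% shortfall under the same assumption.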

The math of bad rotations

If only 2 engineers know how to handle prod incidents, your rotation is 1-week-on, 1-week-off. Both burn out within 6 months.

If 4 engineers can handle it, rotation is 1-on, 3-off. Tolerable.

If 8 engineers can handle it, rotation is 1-on, 7-off. Sustainable indefinitely.

The threshold for "sustainable on-call" is at least 6 people in the rotation. Below that, your on-call program is a slow-motion attrition pipeline.
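To see how quickly the load drops as the rotation grows, here is a minimal sketch. It assumes an even 1-week rotation over a 52-week year and uses the 6-person threshold from above; the "at risk" label is this sketch's wording, not a formal category.

```python
SUSTAINABLE_MIN = 6  # minimum rotation size, taken from the threshold above


def weeks_on_call_per_year(engineers: int, weeks_per_year: int = 52) -> float:
    """Primary on-call weeks per engineer per year in an even 1-week rotation."""
    if engineers < 1:
        raise ValueError("need at least one engineer")
    return weeks_per_year / engineers


for n in (2, 4, 6, 8):
    weeks = weeks_on_call_per_year(n)
    status = "sustainable" if n >= SUSTAINABLE_MIN else "at risk"
    print(f"{n} engineers: 1 on, {n - 1} off, {weeks:.1f} weeks/year ({status})")
```

With 2 engineers each person carries 26 weeks a year; with 8 it is 6.5.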

Why you don't have 6 people

The two reasons:

1. Not enough engineers. Real constraint at small companies. Solve by reducing alert volume aggressively (see below) so on-call is mostly unbothered.

2. Knowledge concentration. Three people understand the system. The other five don't trust themselves to fix it at 2am.

Knowledge concentration is fixable. Fixing it is the real work of an on-call program: every incident becomes a runbook, every runbook gets exercised in a non-emergency.

The runbook test

For every alert your team has, ask: "If a new engineer got paged for this at 2am tonight, with no Slack help, could they resolve it?"

If yes — the alert has a good runbook.

If no — the alert isn't safe to delegate. Either fix the runbook or remove the alert.

This is a hard exercise. Most alerts fail it. That's the work.

The other half: kill alerts that don't matter

The fastest way to make on-call sustainable is to page less.

Audit your alerts. For each one:

Did it page someone in the last 30 days? If so, did the fix need a human, or could the system have recovered on its own? If it could have, automate the recovery.
Has it not paged in the last 90 days? Delete it. It isn't catching anything real.
Did it page but nobody took action? Lower its severity. Page = action required. Slack = informational. Email = trends.

Apply this quarterly. Alert volume drops 60-80% on the first pass. The remaining alerts are real.
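The audit questions can be sketched as a small classifier. Everything here is an assumption for illustration: the field names, the alert names, and the order in which the rules are checked. Pull the real numbers from whatever your paging tool exports.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    DELETE = "delete the alert"
    DEMOTE = "lower severity to Slack or email"
    AUTOMATE = "automate the recovery"
    KEEP = "keep paging a human"


@dataclass
class AlertStats:
    # Hypothetical fields; map them to your own paging data.
    name: str
    pages_last_30_days: int
    pages_last_90_days: int
    pages_with_action: int  # pages in the 90-day window where someone did something
    auto_recoverable: bool  # could a script have done what the human did?


def audit(alert: AlertStats) -> Verdict:
    if alert.pages_last_90_days == 0:
        return Verdict.DELETE    # silent for 90 days
    if alert.pages_with_action == 0:
        return Verdict.DEMOTE    # paged, but nobody needed to act
    if alert.pages_last_30_days > 0 and alert.auto_recoverable:
        return Verdict.AUTOMATE  # a human acted, but a script could have
    return Verdict.KEEP


alerts = [
    AlertStats("disk-usage-warning", 0, 0, 0, False),
    AlertStats("cpu-spike", 6, 14, 0, False),
    AlertStats("worker-stuck", 3, 9, 9, True),
    AlertStats("checkout-errors", 1, 2, 2, False),
]
for alert in alerts:
    print(f"{alert.name}: {audit(alert).value}")
```

Only the last alert in this made-up list survives as a page, which is roughly the shape a first audit pass tends to have.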

The structure that works

Primary — first responder, ack within 5 min, attempts to fix.

Secondary — backup if primary doesn't ack within 15 min, or if primary needs help.

Manager escalation — if primary + secondary can't resolve in 30 min, page the manager. Their job is not to fix it but to coordinate (wake up the right specialist, communicate to stakeholders).
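As a sketch, the escalation ladder for an unresolved incident fits in one function. The thresholds come from the roles above; the function name, the `help_requested` flag, and the assumption that the secondary is always pulled in before the manager are this example's own.

```python
SECONDARY_AFTER_MIN = 15  # primary has not acked by now: page the secondary
MANAGER_AFTER_MIN = 30    # still unresolved by now: page the manager


def paged_roles(minutes_elapsed: int, primary_acked: bool,
                help_requested: bool = False) -> list[str]:
    """Roles that should have been paged so far for an unresolved incident."""
    if minutes_elapsed >= MANAGER_AFTER_MIN:
        # Assumption: by the time the manager is paged, the secondary is involved too.
        return ["primary", "secondary", "manager"]
    if help_requested or (not primary_acked and minutes_elapsed >= SECONDARY_AFTER_MIN):
        return ["primary", "secondary"]
    return ["primary"]


print(paged_roles(3, primary_acked=False))   # ['primary']
print(paged_roles(15, primary_acked=False))  # ['primary', 'secondary']
print(paged_roles(30, primary_acked=True))   # ['primary', 'secondary', 'manager']
```

Most paging tools let you configure these thresholds directly as an escalation policy; the point of writing them down is that everyone on the rotation knows the numbers.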

Rotations should be 1 week long, Wednesday to Wednesday (not Monday to Monday; a midweek handoff gives a buffer after weekend chaos). Primary and secondary should be in different time zones if possible.
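A Wednesday-to-Wednesday schedule is easy to generate. This is a minimal sketch: the names are placeholders, and making next week's primary this week's secondary is an assumption of the example, not a rule. It also ignores time zones, vacations, and swaps.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() counts Monday as 0


def next_wednesday(start: date) -> date:
    """First Wednesday on or after `start`."""
    return start + timedelta(days=(WEDNESDAY - start.weekday()) % 7)


def build_rotation(engineers: list[str], start: date, weeks: int):
    """Return (handoff_date, primary, secondary) for each week of the rotation."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    first_handoff = next_wednesday(start)
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Assumption: the secondary is whoever takes primary the following week.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((first_handoff + timedelta(weeks=week), primary, secondary))
    return schedule


for handoff, primary, secondary in build_rotation(["ana", "ben", "chen"], date(2024, 1, 1), 4):
    print(f"{handoff:%a %Y-%m-%d}: primary={primary}, secondary={secondary}")
```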

On-call compensation

Pay it. Either money or time, but pay it. The signal it sends matters more than the amount.

Common patterns:

Flat stipend: $200-500 per week of primary, half that for secondary
Comp days: 1 day off per week of on-call, used within 30 days
Volunteer-only with bonus: opt-in rotation with significant comp ($1k+/week)

Whatever you pick, make it explicit. Engineers should know what they're trading.

What to do during incidents

The single rule: one driver, one scribe, one comms.

Driver: types the commands. Fixes the system.
Scribe: writes a running timeline in #incidents — what's been tried, what's the current hypothesis.
Comms: keeps stakeholders updated, fields questions, shields the driver from "any update?" pings.

Without role separation, the on-call engineer does all three badly. Incidents stretch from 30 min to 3 hours.

The post-mortem rule

Every incident over 30 min: written post-mortem within 5 business days.
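If you want to enforce the deadline in tooling, the rule is simple to compute. A minimal sketch, assuming a Monday-to-Friday work week and ignoring public holidays; both function names are hypothetical.

```python
from datetime import date, timedelta


def needs_postmortem(duration_minutes: int) -> bool:
    """Incidents longer than 30 minutes get a written post-mortem."""
    return duration_minutes > 30


def postmortem_due(incident_day: date, business_days: int = 5) -> date:
    """Date that falls `business_days` working days after the incident."""
    due = incident_day
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday=0 .. Friday=4; weekends do not count
            remaining -= 1
    return due


# An incident on Wednesday 2024-01-03 is due the following Wednesday.
print(postmortem_due(date(2024, 1, 3)))  # 2024-01-10
```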

Focus areas:

What happened (timeline)
Why it happened (root cause)
How we knew (how detection worked or failed)
How we prevent it from happening again
What we learned about our system

No blame. Find and fix broken systems, not people. If your post-mortems blame people, your engineers will hide problems.

The takeaway

On-call is a tax on your best engineers. If you don't pay attention to rotation design, alert quality, and knowledge distribution, that tax compounds into burnout and attrition. Spend 1 day per quarter auditing alerts and runbooks, and you'll keep your senior engineers an extra year each.