Oncall Burnout Is a Design Failure

Paging fatigue isn't a staffing problem. It's a design problem. Systems that generate noise do so because they weren't designed for operability.

When an oncall rotation is described as "brutal," the usual response is organizational: hire more engineers to spread the load, rotate more people through to reduce individual burden, invest in better runbooks, schedule regular postmortems. These are sensible interventions. They are also mostly wrong about the root cause.

Brutal oncall is usually not a staffing problem. It is a signal that the system itself is poorly designed for operation. The alerts are noisy because the systems weren't built to produce clean signals. The runbooks are long because the failure modes are complex. The incidents are frequent because the architecture has not been shaped by the operational cost of its design choices.

You can hire your way to a manageable rotation. You cannot hire your way to a quiet one.

What noisy alerts actually indicate

Alert noise has a specific meaning. An alert fires when a configured threshold is breached. Noise means alerts fire frequently without requiring meaningful action: the alert resolves on its own, the required action is trivial and repeatable, or the alert is simply wrong and gets acknowledged and closed without investigation.

Each of these cases is a design failure of a different kind.

Self-resolving alerts indicate that the threshold sits inside the system's normal variance. The metric routinely exceeds the threshold during normal operation; the alert fires; the system returns to normal; the engineer acknowledges and moves on. This looks like a threshold calibration problem, but it often runs deeper: high normal variance is itself an architectural property. Services that spike and recover on every traffic burst are operating in a mode that makes threshold alerting inherently noisy. Smoothing the variance (through better load balancing, more predictable resource allocation, or caching) reduces alert noise more reliably than tuning the threshold.
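One way to check whether a threshold sits inside normal variance is to replay it against samples from a period with no incidents. The sketch below is a minimal illustration, not a monitoring integration: the sample values, the 85% threshold, and both function names are assumptions made up for the example.

```python
import statistics


def breach_fraction(samples, threshold):
    """Fraction of baseline samples that exceed the alert threshold."""
    if not samples:
        raise ValueError("need at least one baseline sample")
    return sum(1 for s in samples if s > threshold) / len(samples)


def coefficient_of_variation(samples):
    """Standard deviation relative to the mean: a rough measure of burstiness."""
    mean = statistics.fmean(samples)
    return statistics.pstdev(samples) / mean if mean else float("inf")


# Hypothetical CPU-utilization samples (percent) from a week with no incidents.
baseline_cpu = [42, 45, 88, 41, 47, 91, 44, 43, 86, 46]
THRESHOLD = 85  # hypothetical paging threshold

# Three of the ten healthy samples are above the threshold.
print(breach_fraction(baseline_cpu, THRESHOLD))  # 0.3
print(coefficient_of_variation(baseline_cpu))
```

If a meaningful share of healthy samples already breaches the threshold, raising the threshold only hides the burstiness. The coefficient of variation is the number to drive down through design changes.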

Trivially-actioned alerts indicate that the response has been identified, is repeatable, and could be automated. If the right response to an alert is always "run this script" or "restart this service," the alert is handing a human work that a machine could do. These are the easiest category to address and often the last to get fixed, because fixing them requires prioritizing automation over features, a trade-off that doesn't get made in most planning cycles.
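A rough sketch of what that automation can look like: a registry that maps an alert to its known fix and pages a human only when the fix is missing or fails. Everything here is hypothetical, including the alert name `worker_stuck`, the `remediation` decorator, and the placeholder restart function; a real version would call whatever your orchestration tooling provides.

```python
from typing import Callable, Dict

# Hypothetical registry: alert name -> remediation that returns True on success.
REMEDIATIONS: Dict[str, Callable[[], bool]] = {}


def remediation(alert_name: str):
    """Register a function as the automated response to an alert."""
    def register(func: Callable[[], bool]) -> Callable[[], bool]:
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("worker_stuck")
def restart_worker() -> bool:
    # Placeholder for the real restart command; always reports success here.
    return True


def handle_alert(alert_name: str) -> str:
    """Try the known fix first; page a human only if it is missing or fails."""
    fix = REMEDIATIONS.get(alert_name)
    if fix is None:
        return "page"
    try:
        return "auto_resolved" if fix() else "page"
    except Exception:
        # A crashing remediation needs a human just as much as a failed one.
        return "page"


print(handle_alert("worker_stuck"))   # auto_resolved
print(handle_alert("disk_filling"))   # page
```

The human stays in the loop for the cases that need judgment, and the "run this script" cases stop consuming a page.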

Wrongly-fired alerts indicate that the alert condition is not actually correlated with user-visible impact. The classic case: CPU usage on a background worker spikes, alert fires, nothing is wrong for users, engineer checks, closes. The CPU spike was expected behavior for the task the worker was doing. The alert was written before anyone understood the normal operating range of the service. These accumulate over time as system behavior evolves and alert definitions do not.
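To find these alerts, compare each rule's firings against the periods when users were actually affected. The sketch below assumes you can export two things: a list of firings and a list of user-impact windows. The rule names, the timestamps (plain minutes for brevity), and the function names are all invented for illustration.

```python
from collections import defaultdict


def overlaps(fired_at, windows):
    """True if the firing time falls inside any user-impact window."""
    return any(start <= fired_at <= end for start, end in windows)


def precision_by_rule(firings, impact_windows):
    """Share of each rule's firings that coincided with user-visible impact.

    firings: list of (rule_name, fired_at)
    impact_windows: list of (start, end), same time unit as fired_at
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for rule, fired_at in firings:
        total[rule] += 1
        if overlaps(fired_at, impact_windows):
            hits[rule] += 1
    return {rule: hits[rule] / total[rule] for rule in total}


# Hypothetical export: times are minutes since the start of the review period.
firings = [
    ("worker_cpu_high", 10),
    ("worker_cpu_high", 200),
    ("worker_cpu_high", 410),
    ("checkout_errors", 205),
]
impact_windows = [(195, 230)]

for rule, precision in precision_by_rule(firings, impact_windows).items():
    print(f"{rule}: {precision:.2f}")
```

A rule whose precision stays low across several review periods is a candidate for demotion from page to dashboard.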

The architecture of quiet systems

The difference between a system that generates a page a week and one that generates ten pages a night is largely a function of architectural decisions made long before any alert was written.

Systems designed for operability have a small number of carefully chosen health signals that represent genuine user impact. Response latency at the 95th percentile. Error rate on core user flows. Queue depth for jobs that have SLA implications. These signals are coarse on purpose: they fire when something users would notice is happening. The oncall engineer who receives such an alert knows it requires immediate attention, because the system was designed to only raise that flag when something real is happening.
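As a minimal sketch of two such signals, here is a nearest-rank 95th-percentile latency and an error rate computed over a window of requests. The `Request` shape and the paging limits (800 ms, 2% errors) are assumptions chosen for the example, not recommendations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def health_signals(requests: List[Request]) -> Dict[str, float]:
    """Coarse, user-facing signals for one window of requests."""
    return {
        "p95_latency_ms": p95([r.latency_ms for r in requests]),
        "error_rate": sum(1 for r in requests if not r.ok) / len(requests),
    }


def should_page(signals: Dict[str, float],
                max_p95_ms: float = 800.0,        # hypothetical limit
                max_error_rate: float = 0.02) -> bool:  # hypothetical limit
    return (signals["p95_latency_ms"] > max_p95_ms
            or signals["error_rate"] > max_error_rate)


window = [Request(120.0, True)] * 9 + [Request(150.0, False)]
signals = health_signals(window)
print(signals, should_page(signals))
```

Nothing about a single host or process appears in the paging decision. That detail belongs in the dashboards an engineer opens after the page.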

Systems not designed for operability accumulate alerts written by engineers who added monitoring as they wrote each feature. That is the right time to add monitoring, but without system-level oversight it produces an alert suite where every service monitors its own internals, every metric has a threshold, and an oncall shift becomes a triage session across fifty distinct things that may or may not matter.

The architectural intervention is to distinguish between signals and diagnostics. Signals page. Diagnostics don't page; they're available in a dashboard for investigation once a signal fires. The separation is not about ignoring problems; it's about ensuring that every page requires a human decision. If a page can be resolved by following a checklist without any judgment, it should not be a page. If 20% of a page's firings involve no user impact, it should not be a page. Pages are expensive cognitive interrupts. Reserve them for moments that actually require a human.
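Those two rules can be applied as a periodic audit of existing pages. The sketch below assumes you record, per firing, whether users were affected and whether the responder only followed a checklist. The `PageStats` fields and the 90% checklist cutoff are hypothetical; the 20% no-impact cutoff mirrors the rule of thumb above.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    name: str
    firings: int
    no_user_impact: int   # firings where users saw nothing wrong
    checklist_only: int   # firings resolved by following steps, no judgment


def classify(stats: PageStats,
             max_no_impact: float = 0.2,          # from the rule of thumb above
             max_checklist: float = 0.9) -> str:  # hypothetical cutoff
    """Decide whether an alert should keep paging."""
    if stats.firings == 0:
        return "signal"  # no history yet, so leave it alone
    if stats.checklist_only / stats.firings >= max_checklist:
        return "automate"
    if stats.no_user_impact / stats.firings >= max_no_impact:
        return "diagnostic"
    return "signal"


history = [
    PageStats("restart_worker", firings=10, no_user_impact=0, checklist_only=10),
    PageStats("worker_cpu_high", firings=10, no_user_impact=3, checklist_only=0),
    PageStats("checkout_errors", firings=10, no_user_impact=1, checklist_only=2),
]
for stats in history:
    print(stats.name, "->", classify(stats))
```

Only the alerts that come back as `signal` keep the right to wake someone up.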

Runbook hygiene is a system property, not a documentation task

A runbook exists because a failure mode is complex enough that the response is not obvious. The length and complexity of a runbook are therefore a direct measure of the operational complexity of the corresponding failure mode.

When runbooks get long, the standard intervention is to improve the runbooks: more detail, clearer steps, better formatting. This is sometimes useful. It never addresses why the failure mode is complex in the first place.

A runbook that says "check if service A is running; if not, check whether dependency B is healthy; if B is unhealthy, check configuration C, but only if the region is us-east-1 because us-west-2 uses a different configuration path" is documenting complexity in the system that should be reduced, not documented. Every branch in the runbook is a case that the system handles inconsistently across environments or over time. Making the runbook thorough makes the complexity more manageable; simplifying the system makes it less likely the runbook is needed.
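If runbook complexity measures system complexity, it can be counted. The sketch below models the example runbook as a small decision tree and counts its decision points and distinct paths. The `Step` structure is an assumption for illustration, and the leaf actions the runbook text doesn't spell out are marked as unspecified.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    text: str
    branches: List["Step"] = field(default_factory=list)  # empty list = leaf


def decision_points(step: Step) -> int:
    """Number of steps where the responder must choose between outcomes."""
    own = 1 if len(step.branches) > 1 else 0
    return own + sum(decision_points(b) for b in step.branches)


def paths(step: Step) -> int:
    """Number of distinct routes from the first step to a terminal action."""
    if not step.branches:
        return 1
    return sum(paths(b) for b in step.branches)


runbook = Step("Is service A running?", [
    Step("Yes: (not specified in the runbook)"),
    Step("No: is dependency B healthy?", [
        Step("Yes: (not specified in the runbook)"),
        Step("No: which region?", [
            Step("us-east-1: check configuration C"),
            Step("us-west-2: different configuration path"),
        ]),
    ]),
])

print(decision_points(runbook), paths(runbook))  # 3 4
```

Tracked over time, a rising count is a request for engineering work, and a falling one is evidence that the system is getting simpler to operate.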

The healthiest oncall programs treat long runbooks as engineering work requests: this runbook exists because the system behaves in a way that requires human reasoning to navigate, and making the system simpler to operate is an engineering priority, not a nice-to-have.

Who should feel the oncall pain

There is a structural intervention that is underused because it's uncomfortable: the engineers who make architecture decisions should be on the oncall rotation for the systems they design.

Not forever. Not as a punishment. As a calibration mechanism.

An engineer who decides to skip circuit breakers on a critical dependency to meet a deadline will weigh that trade-off differently after they've been paged at 3am because the dependency went down and the cascade took out the whole service. An engineer who knows they will be on rotation for a system is an engineer who designs with operational costs in mind.

This is not a novel observation. Teams that practice this consistently report quieter rotations over time, because the oncall feedback loop gets integrated into design decisions rather than separated from them. The distance between "who builds it" and "who operates it" is one of the most reliable predictors of operational quality, and closing that distance is an organizational choice.

The metric no one tracks

Most engineering organizations track mean time to resolution for incidents. Fewer track total interrupt load per engineer per week — the aggregate number of pages, acknowledgments, and context switches an oncall engineer absorbs, whether or not those interrupts result in formal incidents.

This matters because oncall burnout is not primarily about major incidents. It's about the cumulative load of low-stakes interrupts that consume attention, fragment deep work, and gradually make the rotation something people dread rather than own. Teams that only track incidents undercount the true load by a factor that varies by system but is often large.

Tracking interrupt load makes the design problem visible in a way that incident counts don't. A team that pages fifteen times a week for trivial issues that resolve in two minutes each spends only thirty minutes on the fixes themselves. Add the lost focus around each page (assume roughly ten minutes to get back into deep work) and the total is about three hours of engineering attention spent on noise. That number, visible and tracked, creates pressure to design it away. Without the number, it's just "oncall is kind of annoying," which is survivable in the short term and corrosive over a year.
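A minimal sketch of the metric, assuming you can export a page log with the responder, the time of the page, and the hands-on minutes. The record shape, the engineer name, and the ten-minute context-switch cost are assumptions; tune the last one to whatever your team finds credible.

```python
from collections import defaultdict
from datetime import datetime


def weekly_interrupt_load(pages, context_switch_min=10.0):
    """Aggregate pages per engineer per ISO week.

    pages: iterable of (engineer, fired_at: datetime, hands_on_minutes)
    context_switch_min: assumed refocus cost added to every page
    """
    load = defaultdict(lambda: {"pages": 0, "minutes": 0.0})
    for engineer, fired_at, hands_on in pages:
        iso = fired_at.isocalendar()
        key = (engineer, iso[0], iso[1])  # (engineer, ISO year, ISO week)
        load[key]["pages"] += 1
        load[key]["minutes"] += hands_on + context_switch_min
    return dict(load)


# Fifteen two-minute pages spread across one week (Mon 4 to Sun 10 March 2024).
pages = [("alex", datetime(2024, 3, 4 + i % 7, 3, 0), 2.0) for i in range(15)]

for (engineer, year, week), totals in weekly_interrupt_load(pages).items():
    print(engineer, year, week, totals["pages"], totals["minutes"] / 60)
```

With the assumed ten-minute refocus cost, fifteen two-minute pages come to 180 minutes. The hands-on time alone would be thirty minutes, which is why incident-only tracking misses most of the load.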

Quiet oncall is an engineering achievement, not a lucky streak. It's the result of designing systems that fail cleanly, alert on what matters, and recover predictably. Building that takes longer than building systems that just work when nothing goes wrong. The cost of not building it shows up in your rotation schedule.
