The Incident Response Playbook That Actually Works at 2am
A playbook is only useful if a stressed engineer can follow it half-asleep. Here's the structure that survives real incidents.
It's 2:14am. You get paged. The alert says "API error rate >10%". You open the runbook. It's 6,000 words of context. You give up and start poking at the system.
This is the failure mode of most incident response documentation. It's written for engineers who already understand the system, in a state of focused calm. The actual reader is someone who just woke up and has 90 seconds of attention before they have to act.
A useful playbook is structured for that reader. Here's the format that works.
What a 2am playbook looks like
Three sections, in this order:
- Stop the bleeding. What command/button do I run RIGHT NOW to reduce damage?
- Diagnose. Where do I look to figure out what's happening?
- Fix. Common root causes and their fixes.
Each section is short. Bullets, not paragraphs. Specific commands, not "investigate."
Example for "API error rate >10%":
## Stop the bleeding
- Check #incidents — is someone already on it?
- If error rate is database-related (DB CPU >80% in Grafana):
→ Run: `kubectl scale deploy worker --replicas=0` to drop background load
- If error rate is upstream-dependency-related:
→ Trip kill switch: `flag set kill_switch_<dep> on`
- Page secondary if not resolved in 10 min
## Diagnose
- Grafana → API service dashboard: which endpoint? Which error code?
- Sentry → recent error groups: any single error spiking?
- Datadog logs: `service:api status:5xx | count by error_code`
- Was there a deploy in the last hour? `kubectl rollout history deploy/api`
## Common causes
| Symptom | Cause | Fix |
|---------|-------|-----|
| 5xx + DB CPU 100% | Slow query | Find query in pg_stat_activity, kill it |
| 5xx + DB CPU normal, single endpoint | Upstream API down | Trip kill switch, queue requests |
| 5xx everywhere, just deployed | Bad deploy | `kubectl rollout undo deploy/api` |
| 4xx, specifically 429 | Rate limit | Check upstream rate limits, page their on-call |
That's a complete playbook. ~50 lines. A new engineer can execute it at 2am.
What's missing from this playbook (intentionally)
- Background on what the API does
- History of how the system evolved
- Discussion of design trade-offs
- The phrase "investigate the root cause"
All of these are useful, just not at 2am. They go in a separate doc — the "system overview" — that you read in calm hours.
The "stop the bleeding" rule
The first section is the most important and most often missed. It answers: what's the action that buys me time?
Examples of stop-the-bleeding actions:
- Roll back the last deploy
- Trip a kill switch
- Scale down workers (reduce DB load)
- Drain traffic from a bad node
- Fail over to the standby region
- Rate-limit problematic users
These are reversible, fast, and low-risk. They don't fix the problem. They prevent it from getting worse.
If your playbook starts with "investigate," you've skipped this. Engineers will spend 30 minutes diagnosing while customers continue to be affected.
Make it greppable
Your playbooks should be in version control, in markdown, in the same repo as the system they describe.
Why:
- `git grep "kill_switch"` works
- They're updated next to the code that produced the alert
- Pull requests can require playbook updates for new alerts
Avoid:
- Confluence (untested, hard to grep, becomes stale fast)
- Slack pinned messages (lost in time)
- Engineer's personal notes (knowledge concentration)
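One way to enforce "new alerts need a playbook" in a pull request check is a small script that compares alert names against playbook files. This is a minimal sketch, assuming each playbook is named after its alert (`<alert-name>.md`); the function name `missing_playbooks` and the alert name `db-cpu-high` are hypothetical, and how you collect the two lists depends on where your alert definitions live.

```python
from pathlib import PurePosixPath


def missing_playbooks(alert_names, playbook_paths):
    """Return the alert names that have no matching <alert-name>.md playbook.

    Assumption: playbooks are named after the alert they cover.
    """
    # "runbooks/api-error-rate.md" -> "api-error-rate"
    documented = {PurePosixPath(path).stem for path in playbook_paths}
    return sorted(name for name in alert_names if name not in documented)


if __name__ == "__main__":
    # Hypothetical inputs; in CI these would come from your alert config
    # and from the list of files in the repo.
    alerts = ["api-error-rate", "db-cpu-high"]
    playbooks = ["runbooks/api-error-rate.md"]
    for name in missing_playbooks(alerts, playbooks):
        print(f"alert '{name}' has no playbook")
```

Failing the build when the returned list is non-empty keeps the rule from depending on reviewer memory.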
Connect alerts to playbooks
Every alert message should link to its playbook. Example PagerDuty payload:
Alert: API error rate >10%
Service: api
Runbook: https://github.com/yourcompany/runbooks/blob/main/api-error-rate.md
Dashboard: https://grafana.example.com/d/api-overview
Click the link, you're at the playbook. Don't make the on-call engineer guess where it is.
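If alert messages are generated by code, the runbook link can be made mandatory rather than optional. The sketch below is illustrative only: `format_alert` and `RUNBOOK_BASE` are hypothetical names, the base URL is taken from the example payload above, and it assumes one markdown file per alert.

```python
# Base URL from the example payload; replace with your own repo.
RUNBOOK_BASE = "https://github.com/yourcompany/runbooks/blob/main"


def format_alert(title, service, runbook_slug, dashboard_url):
    """Build the alert text, refusing to produce one without a runbook."""
    if not runbook_slug:
        raise ValueError(f"alert {title!r} has no runbook; write one first")
    return "\n".join([
        f"Alert: {title}",
        f"Service: {service}",
        f"Runbook: {RUNBOOK_BASE}/{runbook_slug}.md",
        f"Dashboard: {dashboard_url}",
    ])


if __name__ == "__main__":
    print(format_alert(
        "API error rate >10%",
        "api",
        "api-error-rate",
        "https://grafana.example.com/d/api-overview",
    ))
```

Raising on a missing slug means an alert without a playbook fails when it is defined, not when someone is paged.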
The drill
Playbooks rot. Systems change. The fix that worked 6 months ago doesn't work now.
Run a quarterly chaos drill: pick a playbook, simulate the alert in staging or a tabletop exercise, follow the playbook step by step. Note where it breaks. Update.
Don't do this once per year and forget. Calendar it: first Thursday of the quarter, 1 hour, rotate which playbook you test.
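The schedule is simple enough to compute rather than remember. Here is a rough sketch, assuming calendar quarters starting in January, April, July, and October; both function names and the rotation rule (one step through the list per quarter) are my own illustration, not a standard.

```python
from datetime import date, timedelta


def first_thursday_of_quarter(year, quarter):
    """Return the date of the first Thursday in the given quarter (1-4)."""
    first_month = 3 * (quarter - 1) + 1      # 1, 4, 7, 10
    first_day = date(year, first_month, 1)
    # date.weekday(): Monday is 0, Thursday is 3
    days_until_thursday = (3 - first_day.weekday()) % 7
    return first_day + timedelta(days=days_until_thursday)


def playbook_for_drill(playbooks, year, quarter):
    """Pick the playbook to test, advancing one position each quarter."""
    index = (year * 4 + (quarter - 1)) % len(playbooks)
    return playbooks[index]


if __name__ == "__main__":
    print(first_thursday_of_quarter(2024, 4))
```

Because the rotation depends only on the year and quarter, everyone on the team gets the same answer without a shared spreadsheet.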
Post-incident: update the playbook
After every incident, the engineer who fixed it should ask: "Does the playbook handle this?"
If yes — note that the playbook worked. If no — add the case. New row in the "common causes" table. New stop-the-bleeding action.
The playbook should be a living artifact. If it's the same after 50 incidents, either you have very predictable incidents (unlikely) or nobody's updating it (likely).
Playbook anti-patterns
The wall of text. "Background: This API was created in 2022 to handle... Architecture: It uses... Design rationale: We chose..." Useful for new hires. Useless at 2am. Move to a separate "system overview" doc.
Vague instructions. "Investigate the database." Investigate how? Which database? With what tool? Be specific.
Outdated commands. `kubectl exec -it api-pod-...` from when pods had predictable names. Always use `kubectl exec deploy/api` or similar.
Doesn't say when to escalate. Make escalation criteria explicit: "if not resolved in 30 min, page manager."
Doesn't say when to stop. Give the reader an explicit exit: "If you've tried these and nothing works, the situation is unusual — page the senior on-call and start a war room in #inc-."
The format that scales
After running this format across many teams, this is the pattern that works:
- One playbook per alert (not one per service)
- Stop-the-bleeding section ≤5 actions, each one command
- Diagnose section ≤5 places to look
- Fix table with 3-7 common causes
- Total length ≤2 pages
- Last updated date at the top
- Link to the alert that triggers it
If your playbook doesn't fit this, it's probably trying to do too much.
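Some of these limits can be checked mechanically. The following is a minimal sketch of a playbook linter, assuming playbooks use the `## Stop the bleeding`, `## Diagnose`, and `## Common causes` headings from the example earlier, top-level `- ` bullets, and a pipe table for causes. The function names and the exact "Last updated" check are hypothetical.

```python
REQUIRED_SECTIONS = ["Stop the bleeding", "Diagnose", "Common causes"]
MAX_BULLETS = 5
MIN_CAUSES, MAX_CAUSES = 3, 7


def split_sections(markdown):
    """Map each '## ' heading to the list of lines beneath it."""
    sections, current = {}, None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def lint_playbook(markdown):
    """Return a list of problems; an empty list means the format fits."""
    problems = []
    sections = split_sections(markdown)
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            problems.append(f"missing section: {name}")
    for name in ("Stop the bleeding", "Diagnose"):
        bullets = [l for l in sections.get(name, []) if l.startswith("- ")]
        if len(bullets) > MAX_BULLETS:
            problems.append(f"{name}: {len(bullets)} bullets, max {MAX_BULLETS}")
    if "Common causes" in sections:
        rows = [l for l in sections["Common causes"] if l.startswith("|")]
        causes = max(len(rows) - 2, 0)  # skip header and separator rows
        if not MIN_CAUSES <= causes <= MAX_CAUSES:
            problems.append(f"Common causes: {causes} rows, want 3-7")
    if "last updated" not in markdown.lower():
        problems.append("missing 'Last updated' date")
    return problems
```

A linter like this only checks shape. Whether the commands still work is what the quarterly drill is for.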
The takeaway
Incident response documentation fails because it's optimized for the writer, not the 2am reader. Structure it as: stop the bleeding, diagnose, fix. Be specific. Connect alerts to playbooks. Update after every incident. Your team will resolve incidents faster and burn out less.