How to Write an SRE Runbook in English
Runbook structure, clear imperative instructions, decision trees, and troubleshooting language — a practical guide to writing SRE runbooks in English.
A runbook is a documented procedure that tells an on-call engineer exactly what to do when a specific incident or operational scenario occurs. A good runbook saves time, reduces errors, and lowers the stress of responding to incidents at 3am. A poor runbook — one that is vague, outdated, or hard to follow — can make incidents worse.
For non-native English speakers writing runbooks, the challenge is producing clear, unambiguous instructions that a stressed engineer can follow at speed. This guide covers the structure, language, and style you need.
The Purpose of a Runbook
A runbook is not a design document or a tutorial. It is a procedural reference — like a pilot’s checklist. It should be:
- Specific — written for one scenario, not a general guide
- Actionable — every step tells the reader what to do, not what to know
- Current — runbooks that are out of date are dangerous
- Scannable — use numbered steps, not paragraphs
Standard Runbook Structure
1. Title and Metadata
Title: Payment Service High Latency Response Procedure
Service: payments-api | Owner: Payments Team | Last updated: 2026-06-14
Alert: PaymentsAPILatencyHigh (P99 latency > 2 seconds for 5 minutes)
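Keeping this metadata in a machine-readable form makes it possible to flag runbooks that have not been reviewed recently. The following is a minimal sketch, not a prescribed format: the `RunbookMetadata` class, the `is_stale` helper, and the 180-day review window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RunbookMetadata:
    """Header fields from the example above (hypothetical structure)."""
    title: str
    service: str
    owner: str
    last_updated: date
    alert: str


def is_stale(meta: RunbookMetadata, today: date, max_age_days: int = 180) -> bool:
    """Return True when the runbook is older than max_age_days.

    The 180-day default is an assumed review window, not a standard.
    """
    return (today - meta.last_updated).days > max_age_days


payments_latency = RunbookMetadata(
    title="Payment Service High Latency Response Procedure",
    service="payments-api",
    owner="Payments Team",
    last_updated=date(2026, 6, 14),
    alert="PaymentsAPILatencyHigh",
)
```

A scheduled job could call `is_stale` for every runbook and open a review ticket for the owner when it returns `True`.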
2. Overview
One or two sentences describing the scenario and the typical cause.
“This runbook covers the response procedure for elevated latency in the payments API. Common causes include database connection pool exhaustion, downstream timeout storms, and traffic spikes exceeding autoscaling capacity.”
3. Severity and Escalation Path
“Severity: P1 (revenue impact). Escalate to the Payments team lead if not resolved within 30 minutes. Notify the Head of Engineering if customer-facing impact exceeds 15 minutes.”
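An escalation rule written this precisely can be checked mechanically. The sketch below encodes the two thresholds from the example; the function name `escalation_targets` and its parameters are hypothetical, and it assumes "not resolved within 30 minutes" means escalation starts at the 30-minute mark.

```python
def escalation_targets(minutes_unresolved: float,
                       minutes_customer_impact: float) -> list[str]:
    """Return who should be contacted, based on the example thresholds."""
    targets = []
    # "Escalate to the Payments team lead if not resolved within 30 minutes."
    if minutes_unresolved >= 30:
        targets.append("Payments team lead")
    # "Notify the Head of Engineering if customer-facing impact exceeds 15 minutes."
    if minutes_customer_impact > 15:
        targets.append("Head of Engineering")
    return targets
```

If a rule cannot be expressed this simply, that is often a sign that the wording in the runbook is still ambiguous.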
4. Prerequisites
“Before following this runbook, ensure you have:
- Access to the production AWS console
- Access to the Datadog dashboard (link below)
- The payments-api channel open in Slack”
5. Diagnostic Steps
This section tells the on-call engineer how to understand what is happening.
6. Resolution Steps
This section tells the on-call engineer how to fix it.
7. Post-Incident Actions
What to do after the incident is resolved.
Writing Diagnostic Steps
Use clear, numbered imperative sentences. Each step should have one action.
Good Diagnostic Language
“1. Open the Payments API Datadog dashboard: [link]”
“2. Check the P99 latency graph for the past 30 minutes. Note whether the increase is gradual or sudden: a gradual increase usually indicates resource exhaustion; a sudden spike usually indicates a traffic event or a deployment.”
“3. Check the database connection pool metrics: navigate to RDS → payments-db → Performance Insights.”
“4. Check for recent deployments: run gh run list --workflow=deploy --repo=payments-api --limit=5.”
“5. Check for upstream traffic anomalies in the API gateway logs. Filter by service=payments for the affected time window.”
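Step 2 above asks the engineer to judge whether the latency rise is gradual or sudden. That judgment can be made less subjective with a simple rule. The sketch below is one possible heuristic, not a standard method: it calls a rise "sudden" when a single interval accounts for at least half of the total increase. The function name and the 0.5 cut-off are assumptions.

```python
def classify_latency_rise(samples: list[float], sudden_share: float = 0.5) -> str:
    """Classify a series of P99 readings (in seconds, oldest first).

    Returns "sudden", "gradual", or "none". The sudden_share cut-off
    is an assumed value chosen for illustration.
    """
    if len(samples) < 2:
        return "none"
    total_rise = max(samples) - samples[0]
    if total_rise <= 0:
        return "none"
    # Largest increase between two consecutive readings.
    largest_step = max(later - earlier
                       for earlier, later in zip(samples, samples[1:]))
    if largest_step / total_rise >= sudden_share:
        return "sudden"
    return "gradual"
```

Stating the rule in the runbook itself ("sudden means most of the increase happened between two data points") gives the reader the same benefit without any tooling.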
Decision Trees
Use explicit conditional logic when the next step depends on what the engineer finds. List the conditions in the order the engineer should check them, so that only one branch applies:
“If the connection pool is saturated (active connections > 90%):
→ Proceed to Section A: Database Connection Pool Exhaustion”
“If a recent deployment correlates with the latency spike:
→ Proceed to Section B: Rollback Procedure”
“If no deployment or connection issue is found:
→ Proceed to Section C: Traffic Spike Investigation”
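A decision tree that is written clearly can be translated directly into code, which is a useful test of whether the branches are unambiguous. The sketch below mirrors the three branches above. The function name `next_section` is hypothetical, and checking the connection pool before the deployment is an assumption based on the order in which the branches are listed.

```python
def next_section(pool_utilization: float, deployment_correlates: bool) -> str:
    """Return the runbook section to follow.

    pool_utilization is the share of active connections (0.0 to 1.0).
    Branches are evaluated in the order they appear in the runbook.
    """
    if pool_utilization > 0.90:
        return "Section A: Database Connection Pool Exhaustion"
    if deployment_correlates:
        return "Section B: Rollback Procedure"
    return "Section C: Traffic Spike Investigation"
```

Note what happens when the pool is saturated and a deployment also correlates with the spike: the code must pick one branch, and so must the reader. The runbook should make that choice explicit.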
Writing Resolution Steps
Resolution steps must be even more precise than diagnostic steps. The engineer is now taking action that could affect production.
Use the Imperative Mood
Write every action as a direct command. Do not use passive voice or hedge:
Weak (passive/hedged):
“The deployment should probably be rolled back if the issue is confirmed to be related to the latest release.”
Strong (imperative):
“If the issue is confirmed to be deployment-related, roll back immediately using the following command:”
Include Exact Commands
“Run the following command to restart the payment worker pods:”
kubectl rollout restart deployment/payments-worker -n production
“To roll back to the previous deployment:”
kubectl rollout undo deployment/payments-api -n production
“Verify the rollout is complete:”
kubectl rollout status deployment/payments-api -n production
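Some teams wrap exact commands like these in a small script so that they can be reviewed, printed, and rehearsed before anyone runs them in production. The sketch below shows one way to do that for the rollback and verification commands; the `run_steps` helper and its dry-run default are assumptions, not part of any existing tool.

```python
import subprocess
from typing import Callable, Sequence

# The rollback and verification commands from the runbook, as argument lists.
ROLLBACK_STEPS: list[list[str]] = [
    ["kubectl", "rollout", "undo", "deployment/payments-api", "-n", "production"],
    ["kubectl", "rollout", "status", "deployment/payments-api", "-n", "production"],
]


def run_steps(steps: Sequence[Sequence[str]],
              dry_run: bool = True,
              runner: Callable = subprocess.run) -> list[str]:
    """Print each command in order; execute it only when dry_run is False.

    With check=True, subprocess.run raises CalledProcessError on a
    non-zero exit code, so later steps do not run after a failure.
    """
    executed = []
    for argv in steps:
        line = " ".join(argv)
        print(("DRY RUN: " if dry_run else "RUNNING: ") + line)
        if not dry_run:
            runner(list(argv), check=True)
        executed.append(line)
    return executed
```

Calling `run_steps(ROLLBACK_STEPS)` only prints the commands, which makes it safe to use during a runbook review. Passing `dry_run=False` executes them and requires `kubectl` to be installed and configured.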
Include Expected Outcomes
Tell the engineer what success looks like:
“The P99 latency should begin dropping within two to three minutes of the restart. If latency does not improve within five minutes, proceed to escalation.”
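An expected outcome with a time limit can also be expressed as a polling loop. The sketch below waits up to five minutes for latency to recover. It is written under several assumptions: `read_p99_seconds` is a hypothetical callable that returns the current P99 latency from your monitoring system, and "improved" is interpreted as "back at or below the 2-second alert threshold", which is stricter than the wording in the example.

```python
import time
from typing import Callable


def wait_for_recovery(read_p99_seconds: Callable[[], float],
                      threshold_seconds: float = 2.0,
                      timeout_seconds: float = 300.0,
                      poll_seconds: float = 15.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Return True if P99 latency recovers before the timeout, else False.

    A False result corresponds to "proceed to escalation" in the runbook.
    """
    deadline = clock() + timeout_seconds
    while True:
        if read_p99_seconds() <= threshold_seconds:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the function can be tested without waiting; in normal use the defaults are sufficient.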
Post-Incident Actions
“1. Post an incident summary in the #incidents Slack channel using the template pinned to the channel.”
“2. Update the incident ticket in PagerDuty with root cause and resolution.”
“3. If the runbook needs updating based on this incident, open a PR with the changes and assign it to the team lead for review.”
“4. Schedule a postmortem if the incident lasted more than 30 minutes or had customer impact.”
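The postmortem rule in step 4 is precise enough to encode, which shows why exact thresholds are better than phrases such as "for long incidents". This is a minimal sketch with a hypothetical function name. Step 3 is left out because it depends on the engineer's judgment.

```python
def post_incident_actions(duration_minutes: float,
                          customer_impact: bool) -> list[str]:
    """Return the follow-up actions that apply to a resolved incident."""
    actions = [
        "Post an incident summary in #incidents",
        "Update the incident ticket in PagerDuty with root cause and resolution",
    ]
    # "Schedule a postmortem if the incident lasted more than 30 minutes
    # or had customer impact."
    if duration_minutes > 30 or customer_impact:
        actions.append("Schedule a postmortem")
    return actions
```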
Common Runbook Writing Mistakes
Using passive voice for instructions. “The service should be restarted” is ambiguous. “Restart the service” is not.
Missing the expected outcome. Engineers under stress need to know whether what they just did worked. Always state what success looks like.
Outdated links and commands. A runbook with a dead link to the dashboard or a deprecated CLI command is worse than no runbook. Review runbooks after every relevant system change.
Too much background. Runbooks are not tutorials. Keep explanation to the minimum needed to make a decision. If more context is needed, link to a separate document.
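The first mistake in this list, passive or hedged instructions, can be partly caught with a simple text check before a runbook is merged. The sketch below is a rough heuristic, not a grammar checker: the word list and the regular expression are assumptions, and it will report some false positives (for example, conditions such as "if the issue is confirmed"), so treat its output as a prompt for review.

```python
import re

# Hedging words that weaken an instruction (an assumed, incomplete list).
HEDGES = ("should", "probably", "might", "maybe", "perhaps", "could", "possibly")

# A form of "to be" followed by a word ending in -ed or -en,
# for example "be restarted" or "is taken".
PASSIVE = re.compile(
    r"\b(?:is|are|be|been|being|was|were)\s+(?:\w+ly\s+)?\w+(?:ed|en)\b",
    re.IGNORECASE,
)


def weak_phrases(step: str) -> list[str]:
    """Return hedging words and likely passive constructions found in a step."""
    found = [word for word in HEDGES
             if re.search(rf"\b{word}\b", step, re.IGNORECASE)]
    found += [match.group(0) for match in PASSIVE.finditer(step)]
    return found
```

For the two examples above, `weak_phrases("The service should be restarted")` reports both "should" and "be restarted", while `weak_phrases("Restart the service")` reports nothing.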
A well-written runbook is one of the most valuable documents a team can have. The best time to write one is during a quiet period, immediately after reviewing a recent incident. The language should be clear enough that a tired engineer, unfamiliar with the system, can follow it successfully at midnight. That is the standard to write to.