How to Write Technical Runbooks in English
Learn the English vocabulary and writing patterns for clear, professional technical runbooks used in SRE and operations teams.
A runbook is only as good as its clarity. In a high-pressure incident, an engineer following a poorly written runbook — or worse, one written in ambiguous English — can make things worse instead of better. This post covers the vocabulary for runbook content, the English writing patterns that make runbooks unambiguous and actionable, and the most common mistakes that make runbooks fail in practice.
Key Vocabulary
Runbook
A documented set of procedures for operating, troubleshooting, or recovering a system. Runbooks are written in advance so that anyone on the team, including someone unfamiliar with the system, can follow them under pressure.
Example: “There’s a runbook for database failover in Confluence. Follow it step by step rather than improvising.”
Preconditions
The requirements that must be true before a procedure is started. Checking preconditions prevents you from running steps that will fail or cause harm because the environment isn’t in the expected state.
Example: “Preconditions: confirm that the replica database is fully synced before you begin the failover procedure.”
Expected outcomes
What the system should look like after each major step, used to verify that the step succeeded before continuing. Without expected outcomes, operators can’t tell whether a step worked or silently failed.
Example: “Expected outcome: the health check endpoint returns HTTP 200 within 30 seconds of the restart.”
Rollback steps
Procedures for undoing the changes made during an operation if something goes wrong. A runbook without rollback steps leaves engineers stranded if the primary procedure fails.
Example: “If the deployment fails after step 3, follow the rollback steps in section 4 to restore the previous version.”
Severity threshold
A defined criterion that determines whether an incident is escalated, what level of response it requires, or which runbook to use.
Example: “If error rate exceeds 5% for more than 2 minutes, this incident crosses the severity threshold for a P1 and the on-call lead must be paged.”
Escalation procedure
The documented steps for handing off an incident to a higher level of support: a senior engineer, an engineering manager, or an external vendor.
Example: “If you cannot resolve the issue within 30 minutes, follow the escalation procedure: notify the on-call lead and open a ticket with the database vendor.”
Decision tree
A branching structure in a runbook that guides the operator to different procedures based on what they observe. Essential for troubleshooting sections where the correct action depends on what the system is doing.
Example: “The decision tree in section 2 will guide you to the correct procedure depending on whether the error is a timeout or an authentication failure.”
Troubleshooting flowchart
A visual or structured representation of the decision tree, showing branching paths for different diagnostic findings. Often presented as a numbered list with conditional branches.
Example: “Follow the troubleshooting flowchart: if the health check fails, go to step 4a; if it passes but latency is high, go to step 4b.”
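A decision tree like the one in the flowchart example can also be written down as data, which makes each branch explicit and easy to review. The following is a minimal sketch, not a standard format: the observation names (`health_check_passes`, `latency_is_high`) and the fallback "step 5" are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Action:
    """A leaf: the runbook step the operator should go to."""
    step: str


@dataclass
class Question:
    """A branch: a yes/no observation with a next node for each answer."""
    key: str  # name of the observation (hypothetical)
    if_yes: Union["Question", Action]
    if_no: Union["Question", Action]


def next_step(node, observations):
    """Walk the tree using a dict of observation name -> bool."""
    while isinstance(node, Question):
        node = node.if_yes if observations[node.key] else node.if_no
    return node.step


# Mirrors the flowchart example: health check fails -> 4a,
# passes but latency is high -> 4b. "step 5" is an invented default.
TREE = Question(
    key="health_check_passes",
    if_yes=Question(
        key="latency_is_high",
        if_yes=Action("step 4b"),
        if_no=Action("step 5"),
    ),
    if_no=Action("step 4a"),
)

print(next_step(TREE, {"health_check_passes": False}))
print(next_step(TREE, {"health_check_passes": True, "latency_is_high": True}))
```

Writing the tree this way forces every question to have both a "yes" and a "no" branch, which is the same discipline a written flowchart needs.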
Common Phrases and Collocations
Imperative mood instructions
The standard writing style for runbook steps: use command verbs directly. Write “check,” not “you should check.”
Example: “Restart the service. Verify that it starts successfully by checking the health endpoint. If it fails to start, collect the logs before proceeding.”
Conditional instructions (“If X, then Y”)
Used for decision points in troubleshooting runbooks.
Example: “If the pod fails to start within 60 seconds, inspect the event log with kubectl describe pod. If you see an OOMKilled status, increase the memory limit and re-deploy.”
“Verify that…” / “Confirm that…”
Used to introduce expected outcome checks after a step.
Example: “Verify that the database connection count drops below 50 within 2 minutes of enabling the circuit breaker.”
“Do not proceed until…”
Strong language used when a step must not be skipped even under time pressure.
Example: “Do not proceed until the backup is confirmed complete. Check the backup status in the admin console.”
“Note:” and “Warning:”
Callout labels: “Note:” for contextual information and “Warning:” for critical cautions.
Example: “Warning: this step will cause approximately 30 seconds of downtime for the affected region. Confirm with the incident lead before executing.”
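The “Verify that… within 2 minutes” and “Do not proceed until…” patterns both describe the same mechanism: poll a condition until it holds or a time limit runs out. A rough sketch of that mechanism follows. The `connection_count` probe is hypothetical and hard-coded so the example runs on its own; a real probe would query the database.

```python
import time


def wait_until(check, timeout_s, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s elapses.

    Returns True if the check passed in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def connection_count():
    """Hypothetical probe; a real one would query the database."""
    return 42


# "Verify that the connection count drops below 50 within 2 minutes."
ok = wait_until(lambda: connection_count() < 50, timeout_s=120, interval_s=5)
if not ok:
    raise SystemExit("Do not proceed: connection count is still 50 or higher.")
print("Verified: connection count is below 50.")
```

The explicit time limit matters as much in the written step as in the code: “within 2 minutes” tells the operator when to stop waiting and start troubleshooting.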
Practical Sentences to Practice
- “Preconditions: the maintenance window is active, the load balancer has been updated to exclude this node, and you have SSH access to the host.”
- “If the service fails to start, do not proceed to step 5 — collect the service logs and escalate using the escalation procedure in Appendix A.”
- “Expected outcome: the queue depth drops below 100 messages within 5 minutes. If it does not, the consumer may not be processing correctly — follow the decision tree in section 3.”
- “This runbook covers database failover. For application-layer outages, refer to the application failover runbook.”
- “Warning: executing the rollback steps will discard any writes made after the last snapshot. Confirm with the data owner before proceeding.”
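The preconditions sentence above lists three requirements that must all hold before step 1. One way to picture this is a gate that reports every unmet precondition at once, instead of stopping at the first. This is only an illustrative sketch: the checks are hard-coded stand-ins, and a real version would query the scheduler, the load balancer, and the host.

```python
def failed_preconditions(preconditions):
    """Return the descriptions of all preconditions whose check is False.

    preconditions: list of (description, check) pairs, where check is a
    zero-argument callable returning a bool.
    """
    return [description for description, check in preconditions if not check()]


# Hypothetical checks, hard-coded so the sketch runs as-is.
PRECONDITIONS = [
    ("The maintenance window is active", lambda: True),
    ("The load balancer excludes this node", lambda: False),
    ("You have SSH access to the host", lambda: True),
]

failures = failed_preconditions(PRECONDITIONS)
if failures:
    print("Do not proceed. Unmet preconditions:")
    for description in failures:
        print(f"  - {description}")
else:
    print("All preconditions met. Continue to step 1.")
```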
Common Mistakes to Avoid
Writing in passive voice
Passive voice in runbooks creates ambiguity about who performs the action and whether it has already been done.
Instead of: “The service should be restarted.”
Say: “Restart the service using: systemctl restart app-service”
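Passive and hedged phrasing can be partly caught by a simple text check before a runbook is published. The sketch below is a heuristic only, assuming two illustrative regular expressions: it will miss many passive constructions and may flag legitimate sentences, so treat its output as a prompt for review rather than a verdict.

```python
import re

# Heuristic patterns only; they will miss some passive constructions
# and may flag legitimate sentences.
WEAK_PATTERNS = [
    (re.compile(r"\b(?:should|must|can|may|needs to|is to) be \w+", re.IGNORECASE),
     "passive voice: name the action with a command verb"),
    (re.compile(r"^\s*you (?:should|may want to|might)\b", re.IGNORECASE),
     "hedged instruction: start with the command verb"),
]


def lint_step(step):
    """Return a list of warnings for one runbook step."""
    return [message for pattern, message in WEAK_PATTERNS if pattern.search(step)]


for step in ["The service should be restarted.",
             "Restart the service using: systemctl restart app-service"]:
    print(step, "->", lint_step(step) or "ok")
```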
Missing expected outcomes after steps
Without expected outcomes, an operator can’t tell if a step worked. They may proceed on a broken system, making things worse.
Instead of: “3. Restart the database replica.”
Say: “3. Restart the database replica. Expected outcome: the replica connects to the primary within 90 seconds. Check replication lag with: SHOW REPLICA STATUS\G (SHOW SLAVE STATUS\G on MySQL versions before 8.0.22)”
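If your team generates runbooks from structured data, the “every step has an expected outcome” rule can be enforced when the step is created. Here is a minimal sketch under that assumption; the `RunbookStep` class and its field names are invented for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookStep:
    number: int
    instruction: str
    expected_outcome: str

    def __post_init__(self):
        # Refuse to build a step that has no way to verify success.
        if not self.expected_outcome.strip():
            raise ValueError(f"Step {self.number} has no expected outcome")

    def render(self):
        """Format the step the way it would appear in the runbook."""
        return (f"{self.number}. {self.instruction} "
                f"Expected outcome: {self.expected_outcome}")


step = RunbookStep(
    number=3,
    instruction="Restart the database replica.",
    expected_outcome="replica connects to primary within 90 seconds.",
)
print(step.render())
```

A step with a blank expected outcome raises `ValueError`, so the gap is caught while writing the runbook, not during an incident.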
Not including rollback steps
Runbooks that only describe the “happy path” are incomplete. Always ask: what do we do if step N fails?
Instead of: describing only the deployment steps
Add: “Rollback: if the deployment fails at step 4 or later, execute the rollback script at /scripts/rollback.sh with the previous version tag as an argument.”
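The question “what do we do if step N fails?” has a common general answer: undo the completed steps in reverse order. The sketch below shows that pattern with hypothetical steps that only record what happened. It is not the rollback script mentioned above, just an illustration of the idea of pairing every step with its undo action.

```python
def run_with_rollback(steps):
    """Run steps in order; on failure, undo completed steps in reverse.

    steps: list of (name, do, undo) tuples. do() and undo() are
    zero-argument callables; do() raises an exception on failure.
    Returns True if every step succeeded, False if a rollback ran.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as error:
            print(f"Step '{name}' failed: {error}. Rolling back.")
            for done_name, done_undo in reversed(completed):
                print(f"Undoing '{done_name}'")
                done_undo()
            return False
        completed.append((name, undo))
    return True


log = []


def fail():
    raise RuntimeError("health check did not pass")


# Hypothetical deployment steps that only record what happened.
steps = [
    ("drain node", lambda: log.append("drain"), lambda: log.append("undrain")),
    ("deploy new version", lambda: log.append("deploy"),
     lambda: log.append("redeploy previous")),
    ("verify health", fail, lambda: None),
]

print(run_with_rollback(steps), log)
```

Note that the step that failed is not undone here, only the steps that completed before it. A written runbook should say explicitly whether a partially applied step needs its own cleanup.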
Summary
Writing effective runbooks in English requires a specific vocabulary (preconditions, expected outcomes, rollback steps, escalation procedures, decision trees) and a specific writing style: imperative mood, conditional branching, and explicit verification steps. Runbooks are safety-critical documents: when an incident happens, ambiguous language costs time and causes mistakes. Clear, precise runbook English pays off every time your team faces an incident, which is exactly when there is no time to decode ambiguous instructions.