How to Write a Runbook in English
Learn the English structure and phrasing for writing an operational runbook, covering triggers, step-by-step remediation, and escalation paths.
A runbook written for the person who already understands the system is useless to the exhausted on-call engineer at 3 a.m. who doesn’t — this guide covers the structure that makes a runbook actually usable under pressure.
Key Vocabulary
Trigger condition — the specific, observable signal that means this runbook applies, such as a particular alert firing or a metric crossing a threshold, stated precisely enough that someone can confirm they’re in the right runbook before acting. “The trigger condition needs to be specific — ‘high latency’ isn’t enough, but ‘p99 latency alert fires for the checkout service for more than five minutes’ tells the on-call engineer exactly when this runbook applies.”
Step-by-step remediation — the ordered, concrete actions to take to resolve the issue, written as commands or clear instructions rather than general advice, assuming the reader may be unfamiliar with the system under pressure. “Don’t just write ‘restart the service’ as step-by-step remediation — include the actual command, which host to run it on, and what output confirms it worked, because at 3 a.m. nobody wants to be guessing at syntax.”
Verification step — an explicit check included after remediation actions confirming the issue is actually resolved, distinct from just completing the steps, since finishing a procedure doesn’t always mean the underlying problem is fixed. “Add a verification step after the restart — check that the error rate has actually dropped on the dashboard, not just that the restart command exited successfully, because the restart could succeed while the real problem persists.”
Escalation path — the explicit next step and contact if the runbook’s remediation doesn’t resolve the issue, including who to page and after how long, so the on-call engineer isn’t left improvising when the documented fix doesn’t work. “The runbook needs an escalation path — if the standard remediation doesn’t clear the alert within twenty minutes, it should say exactly who to page next, instead of leaving the on-call engineer to figure out who owns this system.”
Common Phrases
- “What’s the trigger condition — how do we know this runbook actually applies?”
- “Is the step-by-step remediation specific enough to follow without extra context?”
- “Does this include a verification step, or does it just assume the fix worked?”
- “What’s the escalation path if remediation doesn’t resolve it in time?”
- “Has this runbook actually been tested by someone unfamiliar with the system?”
Example Sentences
Defining the trigger condition: “Trigger condition: the OrderQueueBacklog alert fires, indicating more than 10,000 unprocessed messages in the orders queue for over five minutes. Confirm this specific alert before proceeding — a general queue-depth warning is a different, less urgent situation.”
Writing clear remediation steps:
“Step 1: SSH into the worker host with ssh worker-prod-3. Step 2: Run sudo systemctl restart order-worker. Step 3: Confirm the service is active with systemctl status order-worker — you should see ‘active (running)’ within 30 seconds.”
Writing the escalation path: “If the queue depth hasn’t started decreasing within 15 minutes of the restart, escalate to the platform on-call via PagerDuty rather than continuing to retry the same remediation — a persistent backlog after a clean restart usually indicates a deeper issue.”
Professional Tips
- State the trigger condition precisely enough that someone can verify they’re following the right runbook before taking any action — vague triggers cause people to apply the wrong fix.
- Write step-by-step remediation as literal, copyable commands with expected output, not general descriptions — assume the reader is under pressure and unfamiliar with the system.
- Always include a verification step distinct from “the command ran successfully” — confirm the actual underlying problem is resolved, not just that a procedure completed.
- Define a concrete escalation path with a specific time threshold and contact — leaving escalation implicit means it gets improvised badly during an actual incident.
Practice Exercise
- Write a trigger condition precise enough that a colleague could confirm it applies without asking you.
- Draft three remediation steps for a hypothetical service restart, including expected output.
- Write an escalation path clause specifying a time threshold and who to contact next.