Writing Runbook Documentation in English: Clear, Action-Ready Steps
Learn to write runbook documentation in English that works under pressure: imperative steps, precise verbs, decision points, and before/after rewrites for clarity.
A runbook is a step-by-step guide for handling an operational task or incident — restarting a service, failing over a database, clearing a stuck queue. It’s read by a stressed engineer at 3 a.m. who needs to act, not interpret. That means runbook English must be ruthlessly clear: short imperative steps, precise verbs, and explicit decision points. This guide shows you how.
The golden rule: write commands, not descriptions
Runbook steps are instructions, so use the imperative mood (the command form of the verb). Start each step with a strong action verb.
| Descriptive (weak) | Imperative (strong) |
|---|---|
| "The service should be restarted." | "Restart the service." |
| "You will need to check the logs." | "Check the logs for OOMKilled." |
| "It is recommended to scale up." | "Scale the deployment to 5 replicas." |
✅ “Cordon the node. Drain the node. Wait for pods to reschedule. Verify traffic has moved before proceeding.”
Every step starts with a verb the reader can act on. No “should,” no “it is recommended,” no passive voice.
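One way to catch descriptive phrasing before a runbook ships is a small lint pass. The sketch below is a minimal illustration, not a standard tool: the function name `lint_weak_phrasing` and the phrase list are assumptions, and the list deliberately matches “should be” rather than every “should”, so outcome statements such as “The pod should return to Running” pass.

```shell
#!/bin/sh
# Hypothetical helper: print runbook lines that describe an action
# instead of commanding it. The phrase list is an assumption; extend it
# to match the habits of your own team.
lint_weak_phrasing() {
    # -n prints line numbers, -i ignores case, -E enables alternation.
    # With no file arguments, grep reads the runbook from stdin.
    grep -n -i -E 'should be|you will need to|it is recommended|needs to be' "$@"
}
```

Called as `lint_weak_phrasing runbook.md` (a placeholder filename), it prints each offending line with its line number. It exits non-zero when nothing matches, which is how `grep` reports “no matches”.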
Use precise, unambiguous verbs
Operational verbs have specific meanings. Choosing the right one prevents mistakes.
| Verb | Means | Don’t confuse with |
|---|---|---|
| Restart | Stop then start | Reload (re-read config without stopping) |
| Drain | Move work off gracefully | Kill (terminate abruptly) |
| Fail over | Switch to standby | Fail back (switch back) |
| Roll back | Revert to previous version | Roll out (deploy forward) |
| Throttle | Slow down | Stop |
| Purge | Delete permanently | Clear (may be reversible) |
“Drain the queue (don’t purge it; we need those messages). Then fail over to the replica.”
The parenthetical warning prevents a destructive mistake. Always flag verbs that destroy data.
Make every step verifiable
After an action, tell the reader how to confirm it worked. A step without verification leaves them guessing.
1. Restart the `api` pods:

       kubectl rollout restart deployment/api

2. Verify all pods are `Running`:

       kubectl get pods -l app=api

   ✅ Expected: all pods `Running`, none `CrashLoopBackOff`.
The phrase “Expected:” followed by the success condition is a runbook superpower. The reader knows exactly what “done” looks like.
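An “Expected:” condition can also be checked mechanically. The sketch below assumes the pod listing is piped in with `--no-headers`, so the third column is STATUS. The function name `check_all_running` and its messages are illustrative. A `Running` status does not by itself prove a pod is ready to serve traffic, so treat this as a first check only.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and succeed only if every pod's STATUS column (field 3) is "Running".
check_all_running() {
    awk '
        # Report every pod whose status is anything other than Running.
        $3 != "Running" { bad++; print "NOT OK: " $1 " is " $3 }
        END {
            # An empty listing is a failure too: no pods is not "all Running".
            if (NR == 0) { print "NOT OK: no pods found"; exit 1 }
            if (bad > 0) exit 1
            print "OK: all " NR " pods Running"
        }
    '
}

# Usage against a live cluster (not run here):
#   kubectl get pods -l app=api --no-headers | check_all_running
```

The exit status mirrors the runbook's success condition, so a wrapper script can stop before the next step when the check fails.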
Handle decision points explicitly
Real operations branch: “if X, do this; if not, do that.” Write branches as clear conditionals, not buried prose.
3. Check the replica lag:

       SELECT now() - pg_last_xact_replay_timestamp();

   - **If lag is under 5 seconds** → proceed to step 4.
   - **If lag is 5 seconds or more** → stop. Page the DBA. **Do not fail over.**
Formatting branches as bullet points with bold conditions makes them scannable under pressure. Use “If… then…” structure and put the dangerous branch’s warning in bold.
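The same branch can be expressed as a small script, which makes the boundary cases explicit. This is a sketch under stated assumptions: the function name `lag_decision` and its messages are hypothetical, the lag is passed in as a number of seconds, and a lag of exactly 5 seconds or a non-numeric value (for example an empty result) is treated as the unsafe branch.

```shell
#!/bin/sh
# Hypothetical helper: map a replica lag in seconds to the runbook branch.
# The 5-second threshold comes from the example step above.
lag_decision() {
    # awk handles fractional seconds, which POSIX sh arithmetic cannot.
    awk -v lag="$1" 'BEGIN {
        # Fail safe: anything that is not a plain number counts as unsafe.
        if (lag !~ /^[0-9]+(\.[0-9]+)?$/) {
            print "STOP: lag value is not a number; page the DBA"
            exit 1
        }
        if (lag + 0 < 5) {
            print "PROCEED: lag " lag "s is under 5s; continue to step 4"
            exit 0
        }
        print "STOP: lag " lag "s is 5s or more; page the DBA, do not fail over"
        exit 1
    }'
}

# Possible usage, assuming psql connection settings come from the environment:
#   lag_decision "$(psql -At -c 'SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());')"
```

Defaulting to STOP on unexpected input mirrors the prose rule: the dangerous action needs positive evidence that it is safe.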
Write warnings that actually stop people
Generic warnings get ignored. Specific, consequence-stating warnings work.
| Weak warning | Strong warning |
|---|---|
| "Be careful here." | "⚠️ This command drops the table. It is irreversible. Confirm you have a snapshot first." |
| "This is important." | "⚠️ Do not run this on the primary — only on a replica." |
“⚠️ STOP. The next step deletes production data. Confirm the timestamp matches the incident window before running it.”
State what happens and why it matters. Use ⚠️ and bold sparingly so they retain force.
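When a destructive step is wrapped in a script, the warning can be enforced as well as displayed. The guard below is a minimal sketch: `confirm_destructive` is a hypothetical name, and the table name in the usage comment is a placeholder. It prints the consequence and proceeds only if the operator types the exact target name.

```shell
#!/bin/sh
# Hypothetical guard: state the consequence, then require the operator to
# type the exact target name before a destructive command may run.
confirm_destructive() {
    target=$1
    consequence=$2
    # Prompts go to stderr so stdout stays clean for piping.
    printf 'WARNING: %s\n' "$consequence" >&2
    printf 'Type "%s" to continue: ' "$target" >&2
    # On end-of-input, read fails and the answer stays empty or partial.
    IFS= read -r answer || true
    if [ "$answer" = "$target" ]; then
        return 0
    fi
    echo "Aborted: confirmation did not match." >&2
    return 1
}

# Usage (table name and SQL are placeholders, not run here):
#   confirm_destructive "orders_2023" \
#       "This drops the table orders_2023. It is irreversible." \
#       && psql -c 'DROP TABLE orders_2023;'
```

Requiring the target name rather than a plain “yes” forces the reader to look at what is about to be destroyed.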
Before and after: a full rewrite
Before (a paragraph nobody can follow at 3 a.m.):
“When the queue is stuck, you should probably look at the consumer and maybe restart it if it seems unhealthy, and you might need to check if messages are being processed, and be careful not to lose data because that would be bad.”
After (scannable, imperative, verifiable):
Runbook: Stuck message queue
1. Check consumer health:

       kubectl get pods -l app=consumer

   - **If pods are `CrashLoopBackOff`** → continue to step 2.
   - **If pods are `Running`** → the issue is downstream; go to “Downstream checks”.
2. Inspect the logs for the failure reason:

       kubectl logs -l app=consumer --tail=100

3. Restart the consumer:

       kubectl rollout restart deployment/consumer

   ✅ Expected: pods return to `Running` within 60 seconds.

   ⚠️ Do not purge the queue. Messages must be preserved for replay.
4. Verify messages are draining:

       rabbitmqctl list_queues name messages

   ✅ Expected: the `messages` count is decreasing.
Style rules for runbooks
- One action per step. Don’t combine “restart and verify and scale” into one line.
- Number sequential steps; bullet alternatives. Order matters in steps; it doesn’t among branches.
- Put commands in code blocks, never inline in a sentence where spacing is ambiguous.
- Write the success condition (“Expected:”) after risky steps.
- Avoid pronouns. “Restart it” — restart what? Name the object every time.
- Avoid time-relative words. “Recently,” “the new one,” “the latest fix” rot fast. Use names and versions.
Language tips for non-native writers
- Imperative ≠ rude. “Restart the service” sounds like an order in many languages but is the correct, neutral form in technical English.
- Avoid “please” in steps. Runbooks aren’t requests; they’re instructions. “Please restart the service” weakens it.
- Use “should” only for expected outcomes, not actions: “The pod should return to Running” (outcome) vs “Restart the pod” (action).
- Spell out abbreviations once. “Failover (FO)” the first time, then “FO” — but only if you use it repeatedly.
Key takeaways
- Write steps in the imperative mood, starting with a strong action verb.
- Choose verbs precisely: drain ≠ purge, roll back ≠ roll out.
- Add “Expected:” success conditions so readers know when a step worked.
- Format decision points as bold if/then branches.
- Make warnings specific and consequence-stating, with ⚠️ used sparingly.
A great runbook turns a panicked 3 a.m. into a calm checklist. Write every step as if the reader is exhausted, scared, and reading it for the first time — because one day, they will be.