DevOps Runbook Writing
Runbook purpose, step completeness, headers, decision branches, and escalation
Runbook writing essentials
- Purpose: step-by-step procedures for known incidents — executable by any on-call engineer at 3am
- Step formula: exact command + expected healthy output + decision branch for each outcome
- Header: trigger alert + severity + ETA + escalation contact if runbook fails
- Decision branches: "if X → go to step N; if Y → escalate to @person"
- Runbooks handle the top 80%; escalation IS the answer for unknown failure modes
Question 0 of 5
What is the primary purpose of a DevOps runbook?
Step-by-step procedures for known incidents and routine tasks — executable by any on-call engineer. Runbook vs. other docs:
- Runbook: "Here is what to do when X happens" — operational, action-oriented, present-tense
- Architecture doc: "Here is how the system is designed" — educational, explanatory
- Post-mortem: "Here is what happened and why" — retrospective
Which runbook step is written most effectively?
Exact command + expected output + decision branch for each outcome. Runbook step formula:
- What to check: "Check database connection"
- Exact command:
psql -U app_user -h db.internal -d production -c "SELECT 1;" - Healthy output: "if it returns '1 row', the connection is healthy"
- Unhealthy path: "if it returns 'Connection refused', proceed to Step 4"
What should appear at the top of every runbook?
Trigger + severity + ETA + escalation — the on-call engineer reads this first. Runbook header template:
- Service: payment-service
- Triggers: Alert "PaymentServiceHighLatency" (p99 > 2s for 5 min), or alert "PaymentServiceDown"
- Severity: SEV-1 (customer-facing, revenue impact)
- Estimated resolution time: 15–30 minutes for known causes; escalate if not resolved in 30 min
- Escalation: Page @payments-oncall-lead, then @cto if unresolved in 45 min
A runbook step says: "Scale up the service." Why is this insufficient for a production runbook?
Missing: exact command + from→to target + verification + side effects. Complete scaling step:
- ❌ "Scale up the service."
- ✅ "Scale the payment-service from 3 to 6 instances:
kubectl scale deployment payment-service --replicas=6 -n production
Verify:kubectl get pods -n production -l app=payment-service— wait until all 6 pods show STATUS: Running (typically 2–3 minutes).
Cost impact: scaling from 3→6 increases AWS costs by ~$180/day. Notify #on-call-costs Slack channel.
Undo: scale back to 3 after incident resolves."
What is the best way to handle a situation in a runbook where the engineer reaches a step that does not resolve the issue?
Decision branches at each step + explicit escalation path. Runbook decision structure:
- Step 3: Restart the payment-service pod
→ If latency returns to normal → ✅ Resolved. File incident report in PagerDuty.
→ If latency is still elevated → Go to Step 4. - Step 4: Check downstream dependencies (Stripe API, database)
→ If Stripe API is degraded → Go to "Stripe Outage Runbook" (link)
→ If database shows replication lag > 30s → Go to Step 5
→ If neither — escalate: page @payments-oncall-lead with current metrics