English for DevOps: Runbooks, Post-Mortems, and Incident Calls

The specific English vocabulary and phrases DevOps engineers need for on-call incidents, writing runbooks, conducting post-mortems, and daily operations communication. Templates and real examples.

DevOps engineers communicate under pressure: during live incidents, in post-mortem reports read by leadership, and in runbooks used by colleagues at 3am. The English used in these documents and conversations is specific, precise, and professional — and it differs significantly from general business English.

This guide covers the vocabulary, phrases, and document templates for the four core DevOps communication scenarios: incident calls, runbooks, post-mortems, and daily operations handoffs.


Part 1: Incident Call Language

When production is down, communication must be precise and fast. Vague language wastes time and creates confusion.

Core Vocabulary for Incidents

| Term | Meaning in incident context |
|------|-----------------------------|
| incident | An unplanned interruption to a service or degradation of quality |
| P1 / SEV1 | Severity level — P1 is the most critical (complete outage affecting all users) |
| blast radius | How many users or systems are affected |
| impact | The measurable effect: users affected, revenue loss, SLA breach |
| mitigation | Action taken to reduce impact while the root cause is investigated |
| workaround | A temporary fix that reduces impact but doesn’t address the root cause |
| root cause | The fundamental reason the incident occurred |
| rollback | Reverting to a previous known-good version |
| MTTR | Mean Time To Recovery — how long incidents typically take to resolve |
| on-call | The engineer currently responsible for responding to alerts |
| bridge | An incident call (voice/video channel where responders coordinate) |
| paging | Automatically alerting the on-call engineer via PagerDuty / OpsGenie |
| runbook | Step-by-step guide for handling a known problem type |
| escalation | Bringing in additional people or management due to severity or stalled progress |
| all-clear | Confirmation that the incident is resolved and service is restored |

Phrases for the Incident Bridge

Opening the incident call:

“This is [name], I’m the incident commander. The issue is [brief description]. Our current impact is [users / services affected]. Let’s go around — who’s on this call?”

Describing the problem:

- “We’re seeing elevated error rates on the payment service — about 30% of checkout requests are failing with a 503.”
- “The deployment at 14:32 UTC appears to have triggered this — we’re investigating whether it’s the cause.”
- “We’ve confirmed the database is healthy — the issue appears to be in the application layer.”

Requesting actions:

- “[Name], can you check the application logs for the 14:30–14:45 window?”
- “Let’s roll back the 14:32 deployment while we investigate. [Name], can you initiate that?”
- “We need to increase the connection pool limit. [Name], do you have access to modify that config?”

Providing status updates:

- “Update: we’ve identified the root cause — a memory leak in the new connection handling code. Rollback is in progress.”
- “The rollback is complete. Error rates are returning to normal. Monitoring for 10 minutes before declaring resolved.”

Closing the incident:

“Error rates are back to baseline. We’re declaring this resolved as of 15:12 UTC. Impact duration was approximately 40 minutes. Post-mortem will be scheduled within 24 hours. Thanks everyone.”


Part 2: Runbook Writing

A runbook is a documented procedure for handling a specific operational scenario. It is read by someone under stress, possibly at 3am. Every word must earn its place.

Runbook Writing Principles

  1. Be explicit — assume zero context. Don’t say “restart the service”; say “run systemctl restart api-service on the app server.”
  2. Include expected outputs — what should the operator see if the step succeeds?
  3. Include failure paths — what if the step fails?
  4. Use imperative verbs — Run, Check, Verify, Navigate, Execute, Copy
  5. Avoid jargon without explanation — not everyone shares your exact context

Runbook Template

# Runbook: [Descriptive Title of the Problem]

**Service:** [service name]
**Last updated:** 2026-03-01
**Owner:** [team name]
**Severity:** P1 / P2 / P3
**Estimated resolution time:** 10–20 minutes

---

## Symptoms

What does this look like when it's happening?

- Alert: [exact alert name from PagerDuty / Grafana]
- Users report: [description of user-facing impact]
- You might also see: [related symptoms in logs or dashboards]

---

## Diagnosis

### Step 1: Verify the alert is genuine

Run the following command to check current error rates:

```bash
kubectl logs -l app=api-service --since=5m | grep ERROR | wc -l
```

**If output > 50**: Service is actively failing. Continue to Step 2.  
**If output < 10**: Alert may be a false positive. Check [dashboard link] and 
continue monitoring.

### Step 2: Identify the failing component

Navigate to [Grafana dashboard link] and check:
- `api_response_errors_total` — is the spike limited to one endpoint?
- `db_connection_pool_usage` — is the pool exhausted?
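
If the Grafana dashboard is unreachable, a rough per-endpoint error breakdown can be pulled straight from the logs. This is a sketch only; it assumes the service emits JSON log lines with a `path` field, which may not match your log format:

```bash
# Count recent errors grouped by request path.
# Assumes JSON log lines containing a "path" field; non-JSON lines are skipped.
kubectl logs -l app=api-service --since=5m \
  | grep ERROR \
  | jq -R -r 'fromjson? | .path // empty' \
  | sort | uniq -c | sort -rn | head
```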

---

## Resolution

### Option A: Database connection pool exhausted

1. Navigate to [AWS Console → RDS → Parameter Groups]
2. Increase `max_connections` from 100 to 150
3. Apply immediately (no reboot required for this parameter)
4. Verify in the database: `SELECT count(*) FROM pg_stat_activity;`
   — should drop below 80 connections within 2 minutes

**Expected outcome:** Error rate returns to baseline within 3 minutes.
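
Rather than rerunning the verification query by hand, you can watch the count recover. A minimal sketch; the host, user, and database names below are placeholders:

```bash
# Re-check the active connection count every 10 seconds (Ctrl-C to stop once it drops below 80).
watch -n 10 "psql -h db-primary.internal -U ops_readonly -d appdb -c 'SELECT count(*) FROM pg_stat_activity;'"
```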

### Option B: Memory leak — restart required

1. Scale down the deployment to 0 replicas:
   `kubectl scale deployment api-service --replicas=0`
2. Wait 30 seconds for connections to drain
3. Scale back up: `kubectl scale deployment api-service --replicas=3`
4. Verify pods are running: `kubectl get pods -l app=api-service`

**Expected outcome:** All 3 pods show STATUS "Running" within 2 minutes.
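
As an alternative to polling `kubectl get pods`, a blocking check can confirm the restart finished; this sketch assumes the same deployment name as above:

```bash
# Block until every replica of the deployment reports ready, or fail after 2 minutes.
kubectl rollout status deployment/api-service --timeout=120s
```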

---

## Escalation

If neither option resolves the issue within 10 minutes:

- Page the backend team lead: [PagerDuty escalation policy]
- Join the #incidents Slack channel and post current status
- Create an incident in [incident management tool] and link this runbook

---

## Related runbooks

- [Database failover procedure]
- [Rollback procedure for app deployments]

## Post-incident actions

After resolving:
1. Update the MTTR metric in the [incident tracking spreadsheet]
2. Schedule a post-mortem if severity was P1 or P2
3. Consider whether this runbook needs updating

Part 3: Post-Mortem Writing

A post-mortem (also called an incident review or retrospective) is a written analysis of what happened during an incident, why, and how to prevent recurrence. Blameless post-mortems focus on systems and processes, not individual mistakes.

Blameless Language

The language in a post-mortem determines whether people feel psychologically safe to be honest.

| ❌ Blame language | ✅ Blameless language |
|-------------------|------------------------|
| “John pushed without running tests” | “A code change was merged without automated test coverage” |
| “The team failed to monitor the alert” | “The alert threshold was misconfigured and did not fire” |
| “Sarah caused the outage” | “A configuration change introduced a regression” |
| “We should have known better” | “The system did not provide adequate feedback to detect this before deployment” |

Post-Mortem Template

# Post-Mortem: [Service] [Type of Incident] — [Date]

**Status:** Draft / Final
**Severity:** P1 / P2
**Duration:** [start time UTC] → [end time UTC] ([N] minutes)
**Impact:** [number of users affected] / [% of traffic affected] / [revenue impact if known]
**Author:** [name]
**Reviewers:** [team lead, other relevant teams]

---

## Summary

A 2–3 sentence summary of what happened, the impact, and the resolution.
Written for an audience that includes non-technical leadership.

> *"On March 15 at 14:32 UTC, a deployment of the payment service introduced 
> a connection pool misconfiguration that caused 32% of payment requests to 
> fail. The incident lasted 41 minutes. Affected users were unable to complete 
> checkout; estimated revenue impact was $18,000. The issue was resolved by 
> rolling back to the previous deployment."*

---

## Timeline

All times in UTC.

| Time  | Event |
|-------|-------|
| 14:32 | Deployment of version 2.4.1 completed |
| 14:35 | Monitoring alert fired: payment error rate > 5% |
| 14:38 | On-call engineer [name] acknowledged the alert |
| 14:42 | Incident bridge opened; impact confirmed at ~30% error rate |
| 14:55 | Root cause identified: connection pool limit set to 10 (was 100) |
| 15:01 | Rollback initiated |
| 15:08 | Rollback complete; error rates returning to baseline |
| 15:13 | Service confirmed healthy; incident declared resolved |

---

## Root Cause

Describe the technical root cause and the contributing factors.

> *"A configuration change in the Helm chart for version 2.4.1 incorrectly 
> set the database connection pool limit to 10 instead of 100. Under 
> production load, the pool was exhausted within 3 minutes of deployment, 
> causing new requests to fail immediately with a connection timeout."*

---

## Contributing Factors

What conditions allowed this to happen?

- The misconfiguration was not caught in code review because the connection 
  pool limit is defined in a values.yaml file that is not typically reviewed.
- The staging environment uses a separate lower-traffic configuration that 
  masked the issue — the pool limit of 10 was sufficient for staging load 
  but not production load.
- No automated validation exists for connection pool parameters.

---

## What Went Well

Recognise things that worked correctly or helped recovery.

- The monitoring alert fired within 3 minutes of the issue occurring.
- The on-call engineer had the runbook for this failure mode and followed 
  it correctly.
- The rollback procedure completed successfully in under 8 minutes.

---

## What Went Poorly

Honest assessment of what should have prevented or shortened this.

- The staging configuration differs from production in ways that mask 
  configuration errors.
- The release process does not include a validation step for infrastructure 
  configuration parameters.
- Time-to-detect (3 minutes) was acceptable; time-to-acknowledge (3 minutes) 
  was acceptable; but time-to-identify root cause (17 minutes) was longer 
  than expected due to initially investigating application code rather than 
  configuration.

---

## Action Items

| Action | Owner | Due date | Priority |
|--------|-------|----------|----------|
| Add automated validation for critical config parameters in CI (see sketch below) | [name] | 2026-04-01 | P1 |
| Align staging connection pool to match production defaults | [name] | 2026-03-22 | P1 |
| Add runbook for connection pool exhaustion detection | [name] | 2026-03-25 | P2 |
| Review all Helm chart values annotated with CHANGEME | [team] | 2026-04-15 | P2 |
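
The first action item lends itself to a concrete check in the pipeline. The script below is a sketch, not an implementation: the values file path, key names, and minimum value are placeholders, and it assumes the pool limit lives in the service's Helm values.yaml with PyYAML available in the CI image.

```bash
#!/usr/bin/env bash
# CI guard: fail the build if the configured connection pool limit is below a safe floor.
set -euo pipefail

VALUES_FILE="charts/api-service/values.yaml"   # placeholder path
MIN_POOL=50                                    # placeholder floor

POOL_LIMIT=$(python3 -c "
import yaml
values = yaml.safe_load(open('${VALUES_FILE}'))
print(values['database']['connectionPool']['max'])
")

if [ "${POOL_LIMIT}" -lt "${MIN_POOL}" ]; then
  echo "ERROR: connection pool limit is ${POOL_LIMIT}; expected at least ${MIN_POOL}" >&2
  exit 1
fi

echo "Connection pool limit ${POOL_LIMIT} OK"
```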

Part 4: Handoff and Operations Communication

Daily Handoff Message (On-Call Rotation)

When handing off on-call responsibility between shifts or time zones:

**On-Call Handoff — [Date] [Time] UTC**

Handing to: [name]
Current status: ✅ All services healthy / ⚠️ See below

**Active issues:**
- Slightly elevated latency on the search service (P3, tracking in #ops)
  — not alerting but worth monitoring. Runbook: [link]

**Completed during shift:**
- Resolved P2 disk space alert on db-replica-02: expanded volume to 500GB
- Deployed version 3.2.1 to production at 10:45 UTC — healthy

**Upcoming:**
- Scheduled maintenance window tonight 23:00–01:00 UTC for database patching
  — [name] from DB team will lead; your role is to monitor application health

**Notes:**
- The staging environment is intentionally down until tomorrow
- PagerDuty on-call schedule updated: you're on until Friday 09:00 UTC

Escalation Message (Slack/Teams)

When escalating an ongoing incident to a wider audience:

:red_circle: **[P1 INCIDENT] Payment Service Degradation**

**Status:** Active (41 min)
**Impact:** ~30% of checkout requests failing
**Current action:** Rollback in progress (ETA: 5 min)

Responders: @on-call-engineer, @backend-team-lead
Bridge: [Zoom link]
Incident doc: [Confluence/Notion link]

Will update every 15 minutes or on status change.

Vocabulary Quick Reference

Severity levels: P1 (critical outage) → P2 (significant degradation) → P3 (minor issue) → P4 (cosmetic / low impact)

Timeline verbs: triggered, spiked, degraded, recovered, rolled back, confirmed, escalated, resolved, declared

Post-mortem verbs: identified, contributed to, introduced, mitigated, prevented, revealed, exposed, masked

Runbook verbs: Run, Execute, Navigate, Verify, Check, Apply, Monitor, Escalate, Confirm