🚨 Incident Response
5 exercise sets. The English you need when production is down: fast, clear, blameless communication under pressure.
- Intermediate
Communicating During an Outage
Write clear status updates, escalations, and Slack/Teams messages while an incident is in progress. Stay calm and precise under pressure.
- Intermediate
Building the Incident Timeline
Document what happened and when. Practice the past-tense narrative: "At 14:32 UTC, the deployment pipeline triggered…"
- Advanced
Writing Post-Mortems
Write blameless post-mortem reports: timeline, root cause analysis, impact statement, and action items in professional English.
- Advanced
War Room & Bridge Call Language
Run and participate in incident bridge calls. Vocabulary for assigning owners, calling rollbacks, and declaring "all clear".
- Beginner
On-Call Vocabulary
SLA, SLO, MTTR, MTTD, escalation policy, severity levels — master the vocabulary before your first on-call shift.
Useful language for incident response
Status updates
- "We are currently investigating an issue affecting…"
- "The root cause has been identified as…"
- "A fix has been deployed to production and the situation is being monitored."
- "All systems are operational. Incident resolved at 16:47 UTC."
War room phrases
- "Who owns the database layer right now?"
- "Let's roll back the last deployment."
- "Blast radius — how many users are affected?"
- "Call the all-clear when monitoring is green."
Post-mortem language
- "The contributing factors were…"
- "Action item: add alerting for X by [owner] by [date]."
- "This incident was a result of cascading failures…"
- "No single point of failure caused this; rather…"
Frequently Asked Questions
What English vocabulary is essential for incident response?
Core incident vocabulary: incident (unplanned service disruption), degradation (partial performance reduction), outage (complete service unavailability), SEV1/P0 (most critical severity), detection time (when incident was first noticed), resolution time (when service was restored), MTTR (Mean Time to Recovery), incident commander (person leading the response), war room / bridge call (emergency synchronous communication), runbook (step-by-step response guide), rollback (revert to previous stable state).
How do I communicate during a live production incident in English?
Incident communication patterns: "I'm declaring an incident: SEV1, payment service is down", "I'm taking the incident commander role", "Current status: we're investigating the root cause; detection was at 14:32 UTC", "Confirmed: the deployment at 14:28 is correlated with the outage, rolling back now", "Update: rollback complete, monitoring for recovery", "Resolved: service restored at 15:02 UTC, total impact 30 minutes". Keep updates frequent (every 10-15 minutes for SEV1), factual, and concise.
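The times quoted in these example messages can be checked with a little arithmetic. The sketch below is a minimal illustration, not a prescribed tool: the helper names (`utc`, `minutes_between`, `resolution_message`) and the placeholder date are assumptions, and it assumes the incident began with the 14:28 deployment, which the example messages suggest but do not state.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM clock time as UTC on a placeholder date (assumed)."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


def resolution_message(resolved_at: datetime, impact_minutes: int) -> str:
    """Format the closing update in the style of the example above."""
    return (
        f"Resolved: service restored at {resolved_at:%H:%M} UTC, "
        f"total impact {impact_minutes} minutes"
    )


deployed = utc("14:28")   # assumed start of the incident
detected = utc("14:32")
resolved = utc("15:02")

detection_gap = minutes_between(deployed, detected)  # start to detection
impact = minutes_between(detected, resolved)         # detection to recovery
message = resolution_message(resolved, impact)
print(message)
```

Measured from detection, the impact matches the 30 minutes in the example. Measured from the deployment it would be 34 minutes, so a written update should say which starting point it uses.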
What does 'blameless postmortem' mean and why is it important?
A blameless postmortem analyses what went wrong without assigning personal blame, focusing on system and process failures rather than individual errors. The practice was popularised in software operations by SRE culture, notably at Google. Blameless language: "The monitoring alert threshold was set too high, allowing the issue to go undetected for 20 minutes" (not "Alex missed the alert"). Blameless postmortems create the psychological safety people need to report incidents honestly and learn from them, rather than concealing errors to avoid punishment.
What is an incident severity level (SEV/P level)?
Severity levels define response urgency: SEV1/P0 (critical — complete outage, all-hands response, immediate action), SEV2/P1 (major — significant degradation, urgent response needed), SEV3/P2 (moderate — partial impact, response during business hours), SEV4/P3 (minor — minimal impact, planned fix in next sprint). Each level has defined response SLAs (e.g., "P0: respond within 5 minutes, update every 15 minutes"). Using correct severity vocabulary sets appropriate expectations for all stakeholders.
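One way to keep the severity vocabulary consistent across a team is to write the levels down as data. The following is a rough sketch under assumptions: the `SeverityPolicy` class and `describe` function are hypothetical names, and only the P0 targets (respond within 5 minutes, update every 15 minutes) come from the example above. The other levels are left without numbers because each organisation defines its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    meaning: str
    respond_within_min: Optional[int] = None  # None: not specified here
    update_every_min: Optional[int] = None


POLICIES = {
    "SEV1": SeverityPolicy("SEV1/P0", "critical: complete outage", 5, 15),
    "SEV2": SeverityPolicy("SEV2/P1", "major: significant degradation"),
    "SEV3": SeverityPolicy("SEV3/P2", "moderate: partial impact"),
    "SEV4": SeverityPolicy("SEV4/P3", "minor: minimal impact"),
}


def describe(sev: str) -> str:
    """Return a one-line summary of the response expectations for a level."""
    policy = POLICIES[sev]
    if policy.respond_within_min is None or policy.update_every_min is None:
        return f"{policy.label} ({policy.meaning}): response targets not defined here"
    return (
        f"{policy.label}: respond within {policy.respond_within_min} minutes, "
        f"update every {policy.update_every_min} minutes"
    )


print(describe("SEV1"))
print(describe("SEV3"))
```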
How do I write a status page update during an incident?
Status page writing: (1) Describe the customer impact rather than the technical cause: "Payment processing is currently unavailable", not "TCP connection pool exhausted"; (2) Use the present continuous for ongoing work: "We are investigating"; (3) Use the present perfect or past simple once the incident is resolved: "The issue has been resolved"; (4) Include affected services and estimated impact; (5) Commit to an update frequency: "We will post an update in 30 minutes"; (6) Avoid speculation: "We are investigating", not "We think it might be X". Customers value transparency and frequency over technical detail.
What is a runbook and how is it used in incidents?
A runbook is a step-by-step guide for responding to a specific incident type: "If alert X fires: (1) Check Y dashboard; (2) Run command Z; (3) If output shows A, do B; (4) If unresolved, escalate to [team]." During an incident, follow the runbook — don't improvise unless the runbook is clearly wrong. Runbooks are updated after incidents when new failure modes are discovered. Good runbooks reduce MTTR dramatically because responders don't have to think from scratch under pressure.
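The four-step pattern quoted above can be read as a small decision procedure. The sketch below only illustrates that control flow: `follow_runbook` and its two callback parameters are hypothetical, and X, Y, Z, A, B and [team] stay as the placeholders used in the example rather than real alerts or commands.

```python
from typing import Callable, List


def follow_runbook(
    run_command_z: Callable[[], str],
    do_b: Callable[[], bool],
) -> List[str]:
    """Walk the four example steps and return a log of what was done.

    run_command_z returns the command output; do_b applies the fix and
    returns True if the issue is resolved afterwards.
    """
    log = ["Checked Y dashboard"]            # step 1: manual check, logged only
    output = run_command_z()                 # step 2
    log.append(f"Ran command Z, output: {output}")

    resolved = False
    if output == "A":                        # step 3: the condition in the runbook
        resolved = do_b()
        log.append("Output showed A, did B")

    if not resolved:                         # step 4: nothing left to try
        log.append("Unresolved, escalating to [team]")
    return log


# The fix works, so the log ends without an escalation.
print(follow_runbook(lambda: "A", lambda: True))
# The output is unexpected, so the responder escalates instead of improvising.
print(follow_runbook(lambda: "something else", lambda: True))
```

Returning a log rather than printing as it goes means the same record can be pasted into the incident timeline afterwards.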
How do I escalate an incident professionally?
Escalation communication: "I've been investigating this for 20 minutes — I need to escalate to [team lead]", "I've exhausted the runbook steps — the issue persists — escalating to [service owner]", "This is beyond my area — looping in [expert] who owns [component]". Include in escalation: current status, what you've already tried, what you need from the escalation target, and the urgency level. Never escalate with "something's broken" — escalate with specific, actionable context.
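The four pieces of context listed above (current status, what you have tried, what you need, urgency) work well as a checklist. As a minimal sketch, the hypothetical `Escalation` class below refuses to produce a message if any of them is missing. The field names and the sentence layout are assumptions, not a standard format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Escalation:
    target: str
    current_status: str
    already_tried: List[str]
    need: str
    urgency: str

    def message(self) -> str:
        """Build the escalation text, rejecting vague escalations."""
        if not (self.current_status and self.already_tried
                and self.need and self.urgency):
            raise ValueError(
                "An escalation needs a status, steps tried, a request, and urgency"
            )
        tried = "; ".join(self.already_tried)
        return (
            f"Escalating to {self.target} ({self.urgency}). "
            f"Status: {self.current_status}. "
            f"Tried: {tried}. "
            f"Need: {self.need}."
        )


escalation = Escalation(
    target="[service owner]",
    current_status="issue persists after 20 minutes of investigation",
    already_tried=["all runbook steps"],
    need="help diagnosing [component]",
    urgency="SEV1",
)
print(escalation.message())
```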
What is the difference between an incident and a problem in ITIL?
In ITIL: Incident = any unplanned interruption or quality reduction (immediate focus: restore service). Problem = underlying cause of incidents (focus: root cause analysis, permanent fix). An incident is reactive (service is down now); a problem is proactive (what prevents this from happening again?). In practice: during the incident, you restore service (incident management); the post-incident review creates a problem record to track the permanent fix. Many organisations use these terms loosely, but understanding the distinction is useful in formal IT service environments.
What vocabulary is used in post-incident reviews?
Post-incident review (PIR) vocabulary: timeline (chronological sequence of events), detection gap (time between incident start and detection), contributing factors (conditions that made the incident possible), corrective actions (fixes to prevent recurrence), mitigations (temporary measures while a permanent fix is implemented), action items (specific tasks with owners and deadlines), regression (a previously fixed issue returning), human error (in blameless cultures, usually treated as a symptom of a system design failure). Write in neutral, factual, past-tense language.
How do I notify customers about a service incident?
Customer incident notification: (1) Notify early, even without full details: "We are investigating reports of [issue]"; (2) Use plain language with no technical jargon; (3) State the impact clearly: "Users may be unable to [action]"; (4) Give a timeline: "We expect to resolve this by [time]" or "We will update in 30 minutes"; (5) Apologise without over-apologising: "We apologise for this disruption and are working urgently to restore service"; (6) Send a final resolution notice with a root cause analysis (RCA) summary. Transparency builds trust even when things go wrong.