Question 1

What English vocabulary is essential for incident response?

Accepted Answer

Core incident vocabulary: incident (unplanned service disruption), degradation (partial performance reduction), outage (complete service unavailability), SEV1/P0 (most critical severity), detection time (when the incident was first noticed), resolution time (when service was restored), MTTR (Mean Time to Recovery: the average time taken to restore service after an incident), incident commander (person leading the response), war room / bridge call (emergency synchronous communication), runbook (step-by-step response guide), rollback (revert to the previous stable state).
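As a rough illustration of how detection time, resolution time, and MTTR relate, the sketch below averages recovery durations over a few made-up incidents. The incident data and the function name `mean_time_to_recovery` are hypothetical. Recovery is measured here from detection to resolution; some teams measure from the start of impact instead.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (detected_at, resolved_at) pairs, all in UTC.
INCIDENTS = [
    (datetime(2024, 1, 5, 14, 32), datetime(2024, 1, 5, 15, 2)),     # 30 min
    (datetime(2024, 2, 11, 9, 0), datetime(2024, 2, 11, 10, 30)),    # 90 min
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 20, 23, 15)),  # 60 min
]


def mean_time_to_recovery(incidents):
    """Return the average of (resolved - detected) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


mttr = mean_time_to_recovery(INCIDENTS)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```

In a review you might then say: 'Our MTTR for the quarter was 60 minutes.'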

Question 2

How do I communicate during a live production incident in English?

Accepted Answer

Incident communication patterns: 'I'm declaring an incident: SEV1, payment service is down', 'I'm taking the incident commander role', 'Current status: we're investigating the root cause; detection was at 14:32 UTC', 'Confirmed: the deployment at 14:28 is correlated with the outage, rolling back now', 'Resolved: service restored at 15:02 UTC, total impact 34 minutes (14:28 to 15:02)'. Keep updates frequent (every 10-15 min for SEV1), factual, and concise.
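The 'total impact' figure in a resolution message depends on which start time you count from. The sketch below uses the times from the example phrases: counted from the 14:28 deployment the impact is 34 minutes, counted from detection at 14:32 it is 30 minutes. The function name `resolved_update` and the calendar date are placeholders for illustration only.

```python
from datetime import datetime, timezone


def resolved_update(service, impact_start, resolved_at):
    """Format a 'Resolved' update with the total impact in whole minutes."""
    minutes = int((resolved_at - impact_start).total_seconds() // 60)
    return (
        f"Resolved: {service} restored at {resolved_at:%H:%M} UTC, "
        f"total impact {minutes} minutes"
    )


# Clock times come from the example phrases; the date is a placeholder.
deployed = datetime(2024, 1, 5, 14, 28, tzinfo=timezone.utc)
detected = datetime(2024, 1, 5, 14, 32, tzinfo=timezone.utc)
resolved = datetime(2024, 1, 5, 15, 2, tzinfo=timezone.utc)

print(resolved_update("payment service", deployed, resolved))  # from deployment
print(resolved_update("payment service", detected, resolved))  # from detection
```

Whichever start point your team uses, say so in the update so that readers can interpret the number.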

Question 3

What does 'blameless postmortem' mean and why is it important?

Accepted Answer

A blameless postmortem analyses what went wrong without assigning personal blame, focusing on system and process failures rather than individual errors. The practice was popularised in software operations by teams such as Etsy and Google's SRE organisation. Blameless language: 'The monitoring alert threshold was set too high, allowing the issue to go undetected for 20 minutes' (not 'Alex missed the alert'). Blameless postmortems create the psychological safety people need to report incidents honestly and learn from them.

Question 4

What is an incident severity level (SEV/P level)?

Accepted Answer

Severity levels define response urgency: SEV1/P0 (critical: complete outage, all-hands response, immediate action), SEV2/P1 (major: significant degradation, urgent response needed), SEV3/P2 (moderate: partial impact, response during business hours), SEV4/P3 (minor: minimal impact, planned fix in next sprint). Exact names and thresholds vary between organisations, but each level should have a defined response SLA. Using correct severity vocabulary sets appropriate expectations for all stakeholders.
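One way to keep severity vocabulary consistent is to store the definitions in a single lookup table and build announcements from it. This is a minimal sketch: the `Severity` class and the `declare` helper are hypothetical names, and the definitions simply restate the four levels listed in the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    label: str        # SEV-style name
    priority: str     # equivalent P-style name
    description: str  # what the level means
    response: str     # expected response


# Definitions restated from the answer; real organisations tune these.
SEVERITIES = {
    "SEV1": Severity("SEV1", "P0", "critical: complete outage",
                     "all-hands response, immediate action"),
    "SEV2": Severity("SEV2", "P1", "major: significant degradation",
                     "urgent response needed"),
    "SEV3": Severity("SEV3", "P2", "moderate: partial impact",
                     "response during business hours"),
    "SEV4": Severity("SEV4", "P3", "minor: minimal impact",
                     "planned fix in next sprint"),
}


def declare(sev, summary):
    """Build a declaration sentence for the given severity label."""
    level = SEVERITIES[sev]
    return (
        f"I'm declaring an incident: {level.label} ({level.priority}), "
        f"{summary}. Expected response: {level.response}."
    )


print(declare("SEV1", "payment service is down"))
```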

Question 5

How do I write a status page update during an incident?

Accepted Answer

Status page writing: (1) Describe customer impact in plain language rather than internal technical detail: 'Payment processing is currently unavailable', not 'TCP connection pool exhausted'; (2) Use the present continuous for ongoing work: 'We are investigating'; (3) Use the present perfect to announce resolution: 'The issue has been resolved'; (4) Include affected services and estimated impact; (5) Commit to an update frequency; (6) Avoid speculation. Customers value transparency and frequency over technical detail.
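The rules above can be sketched as a small set of message templates. The state names (`investigating`, `identified`, `resolved`), the wording of each template, and the function `status_update` are assumptions for illustration, not the format of any particular status page product.

```python
# Hypothetical templates: continuous tense while ongoing, perfect when resolved.
TEMPLATES = {
    "investigating": "We are investigating an issue affecting {service}. {impact}",
    "identified": (
        "We have identified the cause of the issue affecting {service} "
        "and are working on a fix. {impact}"
    ),
    "resolved": "The issue affecting {service} has been resolved.",
}


def status_update(state, service, impact="", next_update_minutes=None):
    """Build a customer-facing update for the given incident state."""
    text = TEMPLATES[state].format(service=service, impact=impact).strip()
    # Commit to an update frequency while the incident is still open.
    if state != "resolved" and next_update_minutes is not None:
        text += f" We will post another update within {next_update_minutes} minutes."
    return text


print(status_update(
    "investigating",
    "payment processing",
    impact="Payments are currently unavailable.",
    next_update_minutes=15,
))
print(status_update("resolved", "payment processing"))
```

Keeping the impact sentence as a separate argument makes it harder to slip internal detail such as 'TCP connection pool exhausted' into the customer-facing text.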

Question 6

What is a runbook and how is it used in incidents?

Accepted Answer

A runbook is a step-by-step guide for responding to a specific incident type. During an incident, follow the runbook — don't improvise unless the runbook is clearly wrong. Runbooks are updated after incidents when new failure modes are discovered. Good runbooks reduce MTTR dramatically because responders don't have to think from scratch under pressure.

Question 7

How do I escalate an incident professionally?

Accepted Answer

Escalation communication: 'I've been investigating this for 20 minutes — I need to escalate to [team lead]', 'I've exhausted the runbook steps — the issue persists — escalating to [service owner]'. Include in escalation: current status, what you've already tried, what you need from the escalation target, and the urgency level. Never escalate with 'something's broken' — escalate with specific, actionable context.
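The four elements of a good escalation (current status, what you've tried, what you need, urgency) can be treated as required fields. The sketch below is one hypothetical way to enforce that; the `Escalation` class, its field names, and the sample values are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Escalation:
    target: str       # who you are escalating to
    severity: str     # urgency level, e.g. "SEV1"
    status: str       # current status of the incident
    tried: List[str]  # steps already attempted
    ask: str          # what you need from the target

    def message(self):
        """Return the escalation text, refusing vague escalations."""
        if not (self.status and self.tried and self.ask):
            raise ValueError(
                "escalation needs a status, steps tried, and a specific ask"
            )
        return (
            f"Escalating to {self.target} ({self.severity}). "
            f"Current status: {self.status}. "
            f"Already tried: {'; '.join(self.tried)}. "
            f"What I need: {self.ask}."
        )


esc = Escalation(
    target="the service owner",
    severity="SEV1",
    status="payment service is still down after 20 minutes",
    tried=["restarted the service", "completed all runbook steps"],
    ask="help identifying the failing dependency",
)
print(esc.message())
```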

Question 8

What is the difference between an incident and a problem in ITIL?

Accepted Answer

In ITIL: Incident = an unplanned interruption to a service or a reduction in its quality (immediate focus: restore service). Problem = the underlying cause, or potential cause, of one or more incidents (focus: root cause analysis, permanent fix). Incident management is reactive (service is down now); problem management looks further ahead (what prevents this from happening again?). During the incident you restore service; the post-incident review then typically opens a problem record to track the permanent fix.

Question 9

What vocabulary is used in post-incident reviews?

Accepted Answer

PIR (post-incident review) vocabulary: timeline (chronological sequence of events), detection gap (time between incident start and detection), contributing factors (conditions that made the incident possible), corrective actions (fixes to prevent recurrence), mitigations (temporary measures while a permanent fix is implemented), action items (specific tasks with owners and deadlines), regression (a previously fixed issue returning). Write in neutral, factual, past-tense language.
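Because an action item is only useful with an owner and a deadline, a review can end with a quick check that none are missing. This is a minimal sketch; the `ActionItem` class, the `incomplete_items` helper, and the sample tasks are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    task: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_items(items):
    """Return the tasks that lack an owner or a deadline."""
    return [item.task for item in items if not item.owner or item.due is None]


# Hypothetical action items from a review.
items = [
    ActionItem("Lower the alert threshold", owner="platform team", due=date(2024, 1, 19)),
    ActionItem("Add a rollback step to the runbook", owner="on-call lead"),
    ActionItem("Review deployment checks"),
]

for task in incomplete_items(items):
    print(f"Needs an owner and a deadline before the review closes: {task}")
```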

Question 10

How do I notify customers about a service incident?

Accepted Answer

Customer incident notification: (1) Notify early even without full details: 'We are investigating reports of [issue]'; (2) Use plain language, with no technical jargon; (3) State the impact clearly; (4) Say when the next update will come, and give an estimated resolution time only if you are confident in it; (5) Apologise without over-apologising: 'We apologise for this disruption and are working urgently to restore service'; (6) Send a final resolution notice with a root cause analysis (RCA) summary. Transparency builds trust even when things go wrong.
