Incident Communication
6 exercises — write P1 initial alerts, status updates with cadence, all-clear messages, root cause statements, contributing factors, and postmortem action items.
Incident communication templates
- P1 opener: SEVERITY / timestamp UTC / what is broken / affected users / IC / bridge / ETA for update
- Status update: timestamp / "UPDATE" / what was ruled out / current hypothesis / next update time
- All-clear: timestamp / "RESOLVED" / restored time / root cause / fix / duration / affected count / postmortem link
- Root cause: systemic condition (not a person) — what process or check was absent
- Contributing factors: conditions that amplified severity or delayed detection — each gets an action item
- Action item: specific / owner / due date / ticket number
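As a rough illustration of the status update and all-clear templates above, the sketch below assembles both messages from their fields and derives the next update time and the incident duration. The function names, the field labels, and the 30-minute default cadence are assumptions made for this example; they are not prescribed by the exercises.

```python
from datetime import datetime, timedelta, timezone


def _utc(dt: datetime, fmt: str = "%Y-%m-%d %H:%M UTC") -> str:
    """Render a timezone-aware datetime in UTC."""
    return dt.astimezone(timezone.utc).strftime(fmt)


def format_status_update(posted, ruled_out, hypothesis, cadence_minutes=30):
    """Status update: timestamp / UPDATE / ruled out / hypothesis / next update."""
    # The cadence (30 min by default) is an assumed value, not a rule.
    next_update = posted + timedelta(minutes=cadence_minutes)
    return "\n".join([
        f"{_utc(posted)} UPDATE",
        f"Ruled out: {'; '.join(ruled_out)}",
        f"Current hypothesis: {hypothesis}",
        f"Next update: {_utc(next_update, '%H:%M UTC')}",
    ])


def format_all_clear(posted, started, restored, root_cause, fix,
                     affected_count, postmortem_link):
    """All-clear: timestamp / RESOLVED / restored time / cause / fix / duration."""
    # Duration is measured from incident start to service restoration.
    duration_min = int((restored - started).total_seconds() // 60)
    return "\n".join([
        f"{_utc(posted)} RESOLVED",
        f"Restored at: {_utc(restored, '%H:%M UTC')}",
        f"Root cause: {root_cause}",
        f"Fix: {fix}",
        f"Duration: {duration_min} min",
        f"Affected: {affected_count} users",
        f"Postmortem: {postmortem_link}",
    ])
```

Both functions expect timezone-aware datetimes. Computing the next update time from the posting time keeps the promised cadence visible in every message.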
A P1 incident has just been declared. Which is the most effective initial alert message to post in the incident Slack channel?
Answer: Option B, a structured P1 opener with all required components.
P1 initial alert message structure:
[SEVERITY] INCIDENT - [TIMESTAMP UTC]
[WHAT IS BROKEN]: [specific service/endpoint]
[OBSERVED BEHAVIOR]: [exact error, status code, behavior]
Affected: [who is affected + scale]
Impact: [revenue, data, business consequence]
IC: [Incident Commander name/handle]
Bridge: [channel link or call link]
ETA for next update: [time]
Each component explained:
• Timestamp (UTC): establishes the incident timeline; always use UTC for global teams
• What is broken: name the specific service, e.g. "Payment service / EU checkout", not "the site"
• Observed behavior: state the exact symptom, e.g. "HTTP 500 for all EU checkout requests"; this prevents teams from investigating different symptoms
• Affected users + scale: "~2,000 users/min" — quantified; enables triage prioritisation
• Impact: "Revenue impact: ongoing" — business consequence, not technical description
• IC: who owns the incident; single point of coordination
• Bridge: where to convene — prevents fractured communication across multiple channels
• ETA for update: sets expectations; prevents the channel from flooding with "any update?" messages
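One way to make sure no component is dropped under pressure is to render the opener from a small record that refuses to produce a message with empty fields. The sketch below is a minimal illustration of that idea; the `P1Alert` class, its field names, and the output labels are hypothetical, and the IC handle and bridge value in the usage lines are placeholders.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone


@dataclass
class P1Alert:
    """Hypothetical record holding every component of the P1 opener."""
    severity: str
    declared_at: datetime  # must be timezone-aware
    what_is_broken: str
    observed_behavior: str
    affected: str
    impact: str
    ic: str
    bridge: str
    next_update: str

    def render(self) -> str:
        # Refuse to render if any component is empty.
        missing = [f.name for f in fields(self) if not getattr(self, f.name)]
        if missing:
            raise ValueError(f"P1 alert is missing: {', '.join(missing)}")
        stamp = self.declared_at.astimezone(timezone.utc).strftime(
            "%Y-%m-%d %H:%M UTC"
        )
        return "\n".join([
            f"{self.severity} INCIDENT - {stamp}",
            f"What is broken: {self.what_is_broken}",
            f"Observed behavior: {self.observed_behavior}",
            f"Affected: {self.affected}",
            f"Impact: {self.impact}",
            f"IC: {self.ic}",
            f"Bridge: {self.bridge}",
            f"ETA for next update: {self.next_update}",
        ])


# Usage with the sample values from the component list; the date,
# IC handle, and bridge are placeholders invented for this example.
alert = P1Alert(
    severity="P1",
    declared_at=datetime(2024, 1, 15, 14, 5, tzinfo=timezone.utc),
    what_is_broken="Payment service / EU checkout",
    observed_behavior="HTTP 500 for all EU checkout requests",
    affected="~2,000 users/min",
    impact="Revenue impact ongoing",
    ic="@ic-handle",
    bridge="#incident-bridge",
    next_update="14:35 UTC",
)
print(alert.render())
```

Raising an error on an empty field is a deliberate choice in this sketch: a P1 opener without an IC or a bridge tends to cause exactly the fractured communication the template is meant to prevent.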