Incident communication rules
  • Declare early — it's easier to downgrade a P1 than to under-respond to one
  • Update on a cadence — every 20–30 min for P1, even if there's nothing new to say
  • Separate mitigation from fix — mitigation reduces impact; a fix removes the root cause
  • Post-mortem = blameless — focus on systems and processes, never individuals

Declaring & Opening an Incident

  • We're currently experiencing [issue] affecting [service / users].
    First public status update — factual, no speculation
  • There is an ongoing issue with [service]. We are investigating.
    Use on the status page before root cause is known
  • This is a P1 / SEV-1 — I'm declaring an incident. Joining the war room now.
    Formal declaration — use your company's severity language
  • I'm spinning up an incident channel — joining #incident-[date] now.
    Centralize communication immediately
  • We need an incident commander — can someone volunteer / [Name], can you take IC?
    Assign roles early: IC, scribe, comms lead
  • Symptoms so far: [X]. I'm not sure of the root cause yet.
    Share what you know, be honest about what you don't

Status Updates During an Incident

  • Update at [HH:MM UTC]: [status]. Root cause is still under investigation.
    Regular cadence — every 20–30 min for P1
  • We believe the root cause is [X]. Still confirming.
    Tentative root cause — hedged language is appropriate here
  • We've applied a mitigation: [action]. Monitoring the impact now.
    Distinguish mitigation (reduces impact) from fix (resolves root cause)
  • Error rate has dropped from [X]% to [Y]% following the rollback.
    Quantify improvement — builds confidence
  • The issue appears to be contained — no new alerts in the past [X] minutes.
    Cautious all-clear — don't declare resolved too early
  • We are rolling back [service / deployment] to [version]. ETA to complete: [X] minutes.
    Rollback update with time estimate

Resolving & Closing an Incident

  • The issue has been resolved as of [HH:MM UTC].
    Resolution statement — always include a timestamp
  • Normal service has been restored. All metrics are back within SLO.
    Confirm the service is healthy, not just "fixed"
  • Root cause: [brief explanation]. Fix: [action taken].
    Minimal root cause for the close message
  • Impact: approximately [X] users / [Y] requests affected over [duration].
    Quantify impact for the record
  • A post-mortem will be conducted this week. I'll share a draft by [date].
    Commit to the post-mortem timeline immediately
  • Thank you everyone who helped debug and respond — great teamwork.
    Close on a human note — blameless culture includes credit

Post-Mortem Writing Phrases

  • Root cause: [specific technical cause] led to [failure].
    Be precise — avoid "human error" as a root cause
  • Contributing factors: [factor A], [factor B], [factor C].
    Multiple causes — incidents are usually a chain of failures
  • Timeline: [HH:MM] — [event]; [HH:MM] — [event]; …
    Chronological timeline — basis for all analysis
  • Impact: [X users] impacted, [Y% error rate] for [Z minutes].
    Quantify always
  • What went well: [monitoring alerted quickly / rollback was fast / team coordinated well].
    Blameless means also recognizing what worked
  • Action items: [item] — owner: [Name] — due: [date]
    Every action item needs an owner and a deadline
  • To prevent recurrence, we will [specific system / process change].
    Prevention is the goal — not punishment