Writing the Initial Incident Notification
Notification elements, notification template, uncertainty language, channel strategy, and declaration timing
Initial notification essentials
- Formula: time of detection + what is affected + observable impact + who is responding + next update time
- Template: INCIDENT OPENED [time UTC] — SEV — what — scope — IC — responders — status channel — next update
- Uncertainty: "root cause: unknown — current hypothesis: [X]" — state what you don't know + your working theory
- Channels: incident channel (coordination) + PagerDuty (on-call) + team channel (awareness) + status page (customers)
- Declare on symptom, not root cause — waiting delays response and communication
What must an initial incident notification contain to be actionable?
Initial notification formula: when + what + impact + who is responding + next update. Why each element matters:
- Time of detection: "At 14:35 UTC, our monitoring detected..." — anchors the timeline; on-call engineers in other time zones can correlate with their own logs
- What is affected: "elevated error rates in the checkout service" — specific system, not "some issues"
- Current observable impact: "approximately 15% of checkout attempts are failing" — the customer-visible symptom
- Who is responding: "I am the incident commander; @backend-oncall is investigating" — prevents duplication and shows coverage
- Next update: "by 15:00 UTC" — tells everyone when to expect more information
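One way to make the formula checkable is to treat the five elements as required fields and flag any that are still empty before the message goes out. The sketch below is only an illustration: the `NotificationDraft` class, its field names, and `missing_elements` are hypothetical and do not belong to any incident tooling.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class NotificationDraft:
    """Hypothetical container for the five elements of the formula."""
    detected_at: Optional[str] = None   # time of detection, e.g. "14:35 UTC"
    affected: Optional[str] = None      # what is affected
    impact: Optional[str] = None        # current observable impact
    responders: Optional[str] = None    # who is responding
    next_update: Optional[str] = None   # when the next update is due


def missing_elements(draft: NotificationDraft) -> list[str]:
    """Return the names of formula elements that are unset or blank."""
    return [
        f.name
        for f in fields(draft)
        if not (getattr(draft, f.name) or "").strip()
    ]


draft = NotificationDraft(
    detected_at="14:35 UTC",
    affected="checkout service",
    impact="approximately 15% of checkout attempts are failing",
)
print(missing_elements(draft))  # ['responders', 'next_update']
```

A draft that returns an empty list has every element of the formula; whether each element is specific enough still needs a human read.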
Which initial incident notification is written most effectively?
Incident notification template: INCIDENT OPENED + time + severity + what + scope + IC + responders + status channel + next update. Notification anatomy:
- "INCIDENT OPENED 14:35 UTC": all-caps label + precise time; clearly marks this as an incident notification rather than routine channel chatter
- "SEV-1": severity level — tells responders how to prioritise (drop everything vs. monitor)
- "Checkout service degraded": one-line description of the problem
- "~15% of checkout requests failing with 500 errors": specific metric + error code
- "started ~14:30 UTC": approximate start time; lets responders search logs from the minutes before the incident was detected
- IC + on-call: @named individuals — no ambiguity about who is in charge
- Status channel: "#incident-2026-05-24" — all follow-up discussion goes here, not scattered
- Next update: "15:00 UTC" — prevents people from pinging the IC every 5 minutes
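The anatomy above can be sketched as a small renderer that always emits the fields in the same order. This is a minimal illustration rather than a standard format: the `Incident` class, the `render_notification` function, the line layout, and the `@ic-on-duty` handle are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record holding the template fields."""
    opened_at: str
    severity: str
    summary: str
    scope: str
    started_at: str
    commander: str
    responders: str
    status_channel: str
    next_update: str


def render_notification(i: Incident) -> str:
    """Render the fields in a fixed order so every notification reads the same."""
    lines = [
        f"INCIDENT OPENED {i.opened_at} | {i.severity}",
        f"What: {i.summary}",
        f"Scope: {i.scope} (started ~{i.started_at})",
        f"IC: {i.commander} | Responders: {i.responders}",
        f"Status channel: {i.status_channel}",
        f"Next update: {i.next_update}",
    ]
    return "\n".join(lines)


message = render_notification(
    Incident(
        opened_at="14:35 UTC",
        severity="SEV-1",
        summary="Checkout service degraded",
        scope="~15% of checkout requests failing with 500 errors",
        started_at="14:30 UTC",
        commander="@ic-on-duty",  # placeholder handle, not a real person
        responders="@backend-oncall",
        status_channel="#incident-2026-05-24",
        next_update="15:00 UTC",
    )
)
print(message)
```

Keeping the field order fixed means readers learn where to look for severity, scope, and the next update time without rereading the whole message.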
An incident notification says: "We are currently investigating the root cause." What is the correct uncertainty language for an initial notification when the root cause is unknown?
Uncertainty language: state what you don't know + what you're doing + your current hypothesis. Uncertainty communication formula:
- ❌ "We are investigating the root cause" on its own: everyone already assumes you are investigating, so this adds nothing
- ✅ "Root cause: unknown at this time": explicit admission, not evasion
- ✅ "We are investigating", placed after that admission: active language confirms the work is happening
- ✅ "Current hypothesis: checkout service or auth dependency": gives responders a focus area without overclaiming certainty
Why state a hypothesis:
- Other engineers on the call can confirm or disprove it immediately, which saves time
- It shows the investigation has direction, not that the team is thrashing
- Label it as "hypothesis" so it is not mistaken for a confirmed root cause
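The uncertainty wording can be generated from what is actually known, so that a hypothesis is never printed as a confirmed cause. The helper below is a hypothetical sketch; `root_cause_line` and its parameters are invented for illustration.

```python
from typing import Optional


def root_cause_line(
    confirmed: Optional[str] = None,
    hypothesis: Optional[str] = None,
) -> str:
    """Build the root-cause sentence for a notification.

    A confirmed cause is stated plainly. Otherwise the line admits the cause
    is unknown and, if a hypothesis exists, labels it explicitly as one.
    """
    if confirmed:
        return f"Root cause: {confirmed}"
    line = "Root cause: unknown at this time. We are investigating."
    if hypothesis:
        line += f" Current hypothesis: {hypothesis}"
    return line


print(root_cause_line(hypothesis="checkout service or auth dependency"))
print(root_cause_line())
```

Keeping "confirmed" and "hypothesis" as separate inputs makes it hard to promote a working theory to a root cause by accident.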
Which communication channel strategy is correct for an incident notification?
Multi-channel but structured: incident channel (coordination) + PagerDuty (on-call) + team channel (awareness) + status page (customers). Communication channel roles:
- #prod-incidents or #incident-[date]: all incident coordination and updates go here; keeps the incident discussion out of general channels
- PagerDuty: wakes up the right on-call engineers; provides escalation path if no one responds
- Team channel (#backend-team): brief "incident opened, see #incident-2026-05-24 for details" — awareness without duplication
- Status page: customer-facing; always update during a customer-impacting incident
- ❌ Posting in #general — creates noise for people not involved
- ❌ DMs to individual engineers — creates parallel conversations; responders don't see each other's work
- ❌ Waiting for PagerDuty before posting — the Slack notification creates awareness and enables self-selection by engineers who see the symptom
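The channel roles above can be expressed as an ordered fan-out plan. The sketch below only builds the plan as data and calls no real Slack, PagerDuty, or status page API; the `channel_plan` function and its tuple layout are assumptions made for the example.

```python
def channel_plan(
    incident_channel: str,
    team_channel: str,
    customer_impacting: bool,
) -> list[tuple[str, str, str]]:
    """Return (system, target, content) steps in the order they should be sent.

    The incident channel post comes first so awareness does not wait on paging.
    """
    plan = [
        ("slack", incident_channel, "full notification and all follow-up updates"),
        ("pagerduty", "on-call", "page the on-call responders"),
        ("slack", team_channel, f"incident opened, see {incident_channel} for details"),
    ]
    if customer_impacting:
        # The status page is customer-facing, so it is only added when customers are affected.
        plan.append(("status_page", "public", "customer-facing acknowledgment"))
    return plan


for system, target, content in channel_plan(
    "#incident-2026-05-24", "#backend-team", customer_impacting=True
):
    print(f"{system:12} {target:22} {content}")
```

Note that #general and direct messages never appear as targets, which mirrors the anti-patterns listed above.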
At what point should you declare a SEV-1 incident and send the initial notification — before or after you confirm the root cause?
Declare on symptom, not on root cause — waiting for confirmation delays response and communication. SEV-1 declaration triggers:
- Customer-facing service degradation above a threshold (e.g., error rate >5% for 5+ minutes)
- Complete service unavailability for any duration
- Data loss or data breach
- Security incident
Why declare early:
- Starting the clock early creates accountability; incident duration is measured from declaration, not from root cause identification
- Other engineers can start looking immediately; more eyes accelerate resolution
- Status page and executive notifications can go out promptly; customers and executives prefer early, accurate acknowledgment over late, complete information
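The symptom-based triggers can be sketched as a check over per-minute error-rate samples, using the example threshold above (error rate >5% for 5+ minutes) plus the complete-unavailability case. The `should_declare` function, its parameters, and the one-sample-per-minute assumption are hypothetical; real thresholds belong in your own severity policy. The check deliberately takes no root-cause input.

```python
def should_declare(
    error_rates_per_minute: list[float],
    threshold: float = 0.05,
    sustained_minutes: int = 5,
) -> bool:
    """Decide from symptoms alone whether to declare.

    Assumes one error-rate sample per minute, oldest first, as fractions (0.15 = 15%).
    """
    if not error_rates_per_minute:
        return False
    # Complete unavailability qualifies for any duration.
    if error_rates_per_minute[-1] >= 1.0:
        return True
    if len(error_rates_per_minute) < sustained_minutes:
        return False
    # Degradation qualifies once it has stayed above the threshold long enough.
    recent = error_rates_per_minute[-sustained_minutes:]
    return all(rate > threshold for rate in recent)


print(should_declare([0.01, 0.02, 0.15, 0.15, 0.14, 0.16, 0.15]))  # True
print(should_declare([0.15, 0.15, 0.01, 0.15, 0.15]))              # False
```

Data loss, data breach, and other security incidents are declared on discovery and are not modeled by this metric check.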