Reviewing & Improving Incident Communication Writing
False resolutions, template improvements, useful critique, detection gaps, and severity-linked communication rules
Incident communication retrospective essentials
- False resolutions: use Monitoring state before Resolved — define a stability window in the runbook
- Template improvements: add missing elements as required fields — templates beat guidelines under pressure
- Useful critique: specific communication + what was missing + what to say instead (with example)
- Detection gaps: quantify the delay + root cause of the gap + specific monitoring fix
- Severity-linked communication rules + incident commander (IC)/investigator separation solve the over/under-communication pattern
Reviewing past incident communications, which failure pattern is most damaging to customer trust?
- Customers see "Resolved" and restart their work, only to have it fail again. This is worse than an ongoing incident because they now have to diagnose for themselves whether the issue is really resolved
- The second "Investigating" notice after a "Resolved" notice erodes confidence that the team knows what it's doing
- Support tickets spike because customers don't trust the status page
- Use "Monitoring" state before "Resolved" — this explicitly signals "we think it's fixed but we're watching"
- Define a stability window: "we will declare Resolved after 30 minutes of normal error rates"
- Resolution criteria should be documented in the incident runbook, not ad-hoc
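The stability window above can be expressed as a small check that a status bot or runbook script might run before anyone flips the status page to "Resolved". This is a minimal sketch, not a definitive implementation: the 30-minute window comes from the example above, while the function name `status_after_fix`, the 1% "normal" error-rate threshold, and the sample format are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

STABILITY_WINDOW = timedelta(minutes=30)  # from the runbook example above
NORMAL_ERROR_RATE = 0.01                  # assumed threshold for "normal"


def status_after_fix(samples, fix_deployed_at, now):
    """Return "Monitoring" or "Resolved" for an incident with a fix deployed.

    samples: list of (timestamp, error_rate) tuples (hypothetical format).
    Any unhealthy sample after the fix restarts the stability window.
    """
    bad_times = [
        t for t, rate in samples
        if t >= fix_deployed_at and rate > NORMAL_ERROR_RATE
    ]
    stable_since = max(bad_times, default=fix_deployed_at)

    # Require at least one healthy reading, so missing data never counts as stable.
    has_healthy_data = any(
        t > stable_since and rate <= NORMAL_ERROR_RATE for t, rate in samples
    )
    if has_healthy_data and now - stable_since >= STABILITY_WINDOW:
        return "Resolved"
    return "Monitoring"


fix = datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc)
healthy = [(fix + timedelta(minutes=m), 0.002) for m in range(0, 31, 5)]
print(status_after_fix(healthy, fix, fix + timedelta(minutes=10)))
print(status_after_fix(healthy, fix, fix + timedelta(minutes=30)))
```

The point of encoding the rule is that "Resolved" becomes a consequence of the documented criteria rather than a judgment call made under pressure.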
An incident communication retrospective identifies: "The initial update was too vague — it didn't specify which users were affected." How should this improvement be implemented?
- During an incident, cognitive load is high — on-call engineers are investigating, managing stakeholders, and communicating simultaneously
- A guideline ("be more specific") is forgotten under pressure; a template field ("Affected users: [fill in]") is impossible to miss
- Required template fields also define the minimum viable notification — if a field can't be filled in, that itself is information ("scope unknown")
- After each incident, review the communications: what was missing, what was unclear, what caused confusion?
- Add the missing element to the template as a required field with an example
- Test the template in the next incident drill or tabletop exercise
- Bake the template into the runbook, the PagerDuty incident creation form, or the Slack incident bot
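One way to make template fields genuinely required is to render every field on every update and print "unknown" when a field is left empty, so a missing scope is visible rather than silently omitted. The sketch below assumes a hypothetical `IncidentUpdate` template. Only "Affected users" comes from the text above; the other field names are illustrative, and a real PagerDuty form or Slack bot would define its own.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentUpdate:
    # "Affected users" is the field from the retrospective finding above;
    # the remaining fields are hypothetical examples.
    affected_users: str = ""
    impact: str = ""
    current_hypothesis: str = ""
    next_update_at: str = ""


def render(update):
    """Render every template field; empty fields are shown as 'unknown'."""
    lines = []
    for f in fields(update):
        value = getattr(update, f.name).strip() or "unknown"
        label = f.name.replace("_", " ").capitalize()
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


print(render(IncidentUpdate(affected_users="EU customers using SSO")))
```

Because the renderer never drops a field, the writer cannot forget one: the worst case is an honest "unknown".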
Which critique of an incident communication is most useful for improving future communications?
- ❌ "Bad" — no information about what to improve
- ❌ "Wrong tone" — tone is subjective without a reference standard
- ✅ "The 15:15 UTC update" — specific timestamp anchors the critique to a real example
- ✅ "without specifying what was being investigated" — names exactly what was missing
- ✅ "future updates should include the current hypothesis + example" — gives the writer a concrete pattern to follow
An incident was declared too late — 45 minutes after the first customer reports appeared on social media. What is the correct way to document and improve this in the communication retrospective?
- What happened: "Detection delay: 45 minutes" — quantified
- Root cause of the detection gap: "Social media was our de-facto monitoring" — shows the monitoring gap specifically
- Specific fix: "add user-facing transaction success rate to dashboard + PagerDuty alert at 5% failure rate for 3 consecutive minutes" — concrete, implementable, testable
- A fix such as "watch the dashboards more closely" implies the issue was attention, not tooling, but engineers can't watch dashboards 24/7
- It creates blame without creating change
- The correct fix is automation: if social media caught the incident before monitoring did, the monitoring gap is the failure, not the engineer's alertness
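The alert condition in the specific fix above (5% failure rate for 3 consecutive minutes) is concrete enough to test. A minimal sketch of the rule follows, assuming per-minute success and failure counts are available. The function name `should_page` and the input format are hypothetical; in practice the rule would live in the monitoring system's own alert configuration rather than in application code.

```python
FAILURE_RATE_THRESHOLD = 0.05  # 5% failure rate, from the fix above
CONSECUTIVE_MINUTES = 3        # sustained for 3 consecutive minutes


def should_page(per_minute_counts):
    """Decide whether to page, given (successes, failures) per minute, oldest first.

    Pages only if each of the last CONSECUTIVE_MINUTES minutes had a
    user-facing transaction failure rate at or above the threshold.
    """
    recent = per_minute_counts[-CONSECUTIVE_MINUTES:]
    if len(recent) < CONSECUTIVE_MINUTES:
        return False
    for succeeded, failed in recent:
        total = succeeded + failed
        # A minute with no traffic is treated as "no evidence" (an assumption).
        if total == 0 or failed / total < FAILURE_RATE_THRESHOLD:
            return False
    return True


print(should_page([(99, 1), (90, 10), (94, 6), (93, 7)]))
print(should_page([(90, 10), (99, 1), (90, 10)]))
```

Requiring consecutive minutes trades a few minutes of detection latency for fewer pages on one-off blips, which is still far shorter than the 45-minute gap in the example.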
After reviewing 6 months of incident communications, a team finds they consistently over-communicate during minor incidents but under-communicate during severe ones. What is the root cause of this pattern?
- Minor incidents (the over-communicated ones): the on-call engineer understands the issue, isn't stressed, and writes thoughtful updates voluntarily
- Severe incidents (the under-communicated ones): the on-call engineer is investigating under pressure, managing stakeholders, and communicating simultaneously. Without explicit rules, communication quality degrades precisely when it matters most
- SEV-1: notify executives + update status page + post in #prod-incidents every 15–30 min + incident commander writes updates (not the investigator)
- SEV-2: notify team lead + update status page + post every 30 min
- SEV-3: monitor internally, no customer communication unless 30+ min duration
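Severity-linked rules are easiest to follow when they are stored as data that an incident bot can look up, rather than as prose someone has to remember. The sketch below encodes the three rules above under some assumptions: the `CommsRule` structure and its field names are hypothetical, and where the rules above do not specify a value (for example, who writes SEV-2 updates) the field is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CommsRule:
    notify: Tuple[str, ...]
    update_status_page: bool
    max_minutes_between_posts: Optional[int]  # None = no cadence specified
    updates_written_by: Optional[str]         # None = not specified above
    customer_comms_after_min: Optional[int]   # duration that triggers customer comms


RULES = {
    "SEV-1": CommsRule(("executives",), True, 30, "incident commander", None),
    "SEV-2": CommsRule(("team lead",), True, 30, None, None),
    "SEV-3": CommsRule((), False, None, None, 30),
}


def customer_comms_required(severity, duration_min):
    """Return True if customers should be updated for this incident."""
    rule = RULES[severity]
    if rule.update_status_page:
        return True
    threshold = rule.customer_comms_after_min
    return threshold is not None and duration_min >= threshold


print(customer_comms_required("SEV-1", 1))
print(customer_comms_required("SEV-3", 10))
print(customer_comms_required("SEV-3", 30))
```

Keeping the writer role in the rule itself makes the IC/investigator separation explicit for SEV-1: the bot can prompt the incident commander for updates and leave the investigator alone.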