Writing an Executive Summary During an Incident
Audience translation, update format, technical-to-plain-language, tone, and updates without resolution
Incident executive summary essentials
- Executive audience: business impact + customer scope + current status + next update — not technical root cause
- Update format: timestamp UTC + status label + impact with numbers + action taken + next update commitment
- Translation: name the component plainly + what happened + impact + confirm current state
- Tone: confident + action-oriented + specific commitment — not apologetic or hedged
- Update on schedule even without resolution — "not yet identified" is a valid status to communicate
An engineer writes this executive summary during an active incident: "The Redis cluster is experiencing memory pressure due to a hot key issue exacerbated by our LRU eviction policy misconfiguration, causing elevated cache miss rates and increased database load, which is degrading API response times for a subset of users." What is the main problem with this for a VP audience?
- What happened (plain language): "Our checkout service is currently slow for some customers"
- Customer impact: "Approximately 15% of customers in the EU region are experiencing checkout page load times above 10 seconds"
- Business impact: "We estimate 300–500 orders per hour are being abandoned as a result"
- Current status: "Our engineering team has identified the cause and is implementing a fix"
- Next update: "We will provide an update by 16:00 UTC"
Which executive summary update is written correctly for a production incident?
- Timestamp: "15:30 UTC" — unambiguous, enables correlation with other reports
- Status label: "Investigating" — one word that tells the executive where in the incident lifecycle you are
- Impact with numbers: "~18% of EU checkout users", "8–15 second load times", "estimated 400 orders/hour" — specific enough to assess severity and prioritise
- Action: "identified the likely cause, deploying a fix" — shows progress without overcommitting
- Next update commitment: "16:00 UTC or sooner if status changes" — removes the need for the executive to chase for updates
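The five parts above can be assembled mechanically. The following is a minimal Python sketch, assuming a hypothetical `format_update` helper and reusing the sample values from this section; the field order, the `|` separator, and the sample date are illustrative choices, not a prescribed template.

```python
from datetime import datetime, timezone


def format_update(sent_at, status, impact, action, next_update_at):
    """Assemble a one-line executive update from the five parts.

    Hypothetical helper for illustration; both datetimes must be timezone-aware.
    """
    if sent_at.tzinfo is None or next_update_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    if next_update_at <= sent_at:
        raise ValueError("next update must be after the current update")
    # Normalise to UTC so the timestamp is unambiguous for every reader.
    sent = sent_at.astimezone(timezone.utc).strftime("%H:%M UTC")
    nxt = next_update_at.astimezone(timezone.utc).strftime("%H:%M UTC")
    return (
        f"{sent} | {status} | Impact: {impact} | Action: {action} | "
        f"Next update: {nxt} or sooner if status changes"
    )


update = format_update(
    sent_at=datetime(2024, 1, 1, 15, 30, tzinfo=timezone.utc),  # sample date
    status="Investigating",
    impact="~18% of EU checkout users see 8-15 second load times",
    action="identified the likely cause, deploying a fix",
    next_update_at=datetime(2024, 1, 1, 16, 0, tzinfo=timezone.utc),
)
print(update)
```

Rejecting a next-update time that is not in the future is one way to make sure every update carries a real commitment rather than a stale one.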
How should you translate the technical phrase "our database primary has failed over to a secondary replica" into an executive summary?
- "Database primary" → "main database server"
- "Failover to secondary replica" → "automatically switched to a backup server"
- "RTO of 45 seconds" → "45-second transition during which some requests failed"
- "Secondary is serving traffic normally" → "The backup server is now running normally"
- Never use acronyms without expansion in executive communications (RTO, RPO, MTTR, and SLA are all meaningful to engineers; executives may not have internalised them)
- Plain verbs are better than technical nouns: "switched to" beats "failover to"
- Always confirm current state: "is now running normally" closes the loop so executives don't assume the incident is ongoing
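The translation rules above could be applied as a pre-send check on a draft. The following is a rough Python sketch, assuming a hypothetical `plain_language` helper, a small glossary built only from the phrases in this section, and the four acronyms named above; a real glossary would need to be maintained by the team.

```python
import re

# Hypothetical glossary built from the translations listed above.
GLOSSARY = {
    "database primary": "main database server",
    "secondary replica": "backup server",
}
# Acronyms named above as needing expansion for executives.
ACRONYMS = {"RTO", "RPO", "MTTR", "SLA"}


def plain_language(text):
    """Replace known jargon and report any listed acronyms left in the text."""
    for jargon, plain in GLOSSARY.items():
        text = re.sub(re.escape(jargon), plain, text, flags=re.IGNORECASE)
    leftover = sorted(a for a in ACRONYMS if re.search(rf"\b{a}\b", text))
    return text, leftover


rewritten, flagged = plain_language(
    "Our database primary switched to a secondary replica; RTO was 45 seconds."
)
print(rewritten)
print(flagged)
```

The sketch only flags acronyms rather than expanding them, because the right plain-language replacement depends on the sentence, as the "RTO of 45 seconds" example above shows.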
What tone should an executive summary during an incident use, and which sentence exemplifies it?
- ✅ Confident: "We have identified the cause" — not "we think we may have found a possible cause"
- ✅ Action-oriented: "our on-call team is implementing the fix" — active, specific
- ✅ Specific commitment: "by 16:30 UTC" — gives executives a concrete expectation to manage against
- ❌ Excessive apology: apologies come later in the post-mortem; during an incident, executives need information and confidence, not contrition
- ❌ Technical language: causes confusion and signals the communicator hasn't thought about the audience
- ❌ Hedged or vague language: "may possibly", "could potentially", "in certain regions" (name the region instead); this signals uncertainty and erodes confidence in the response team
An incident began at 14:35 UTC. An executive asks for a summary at 15:00 UTC. The engineering team does not yet know the root cause. What should the executive summary say?
- Executives who don't hear updates assume the situation is either resolved (false) or out of control (also false) — the uncertainty is worse than the facts
- "Root cause: not yet identified" is honest and signals the team is working, not hiding
- "Current focus: isolating the failure" shows progress even without a fix — the team is narrowing the problem
- "Next update: 15:30 UTC regardless of progress" — commits to the cadence and removes the need to chase
- P0/P1 incidents: update executives every 15–30 minutes
- P2 incidents: update executives every 30–60 minutes
- Each update should say: current status + impact + what's being done + next update time
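The cadence above can be turned into a simple deadline calculation. The following is a minimal Python sketch, assuming a hypothetical `next_update_due` helper that uses the upper bound of each range as the latest acceptable gap; the severity labels come from this section, and the sample date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Upper bound of each cadence range above, in minutes (latest acceptable gap).
MAX_GAP_MINUTES = {"P0": 30, "P1": 30, "P2": 60}


def next_update_due(last_update, severity):
    """Return the latest time the next executive update should go out."""
    return last_update + timedelta(minutes=MAX_GAP_MINUTES[severity])


last = datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc)  # sample date, 15:00 UTC
due = next_update_due(last, "P1")
print(due.strftime("%H:%M UTC"))
```

For the 15:00 UTC update in the example above, a P1 incident gives a deadline of 15:30 UTC, which matches the "Next update: 15:30 UTC regardless of progress" commitment.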