Blameless Post-Mortem Structure
Document structure, blameless language, impact quantification, and lessons learned
Post-mortem structure
- Summary — duration + affected users + root cause in one paragraph
- Impact — user count, feature, duration with UTC start and end times, SLA/financial impact
- Timeline — UTC-timestamped events in chronological order
- Root Cause Analysis — proximate, contributing, root cause
- Action Items — specific, owned, dated
- Lessons Learned — what worked, what didn't
What is the correct order of sections in a standard post-mortem document?
Summary → Impact → Timeline → RCA → Action Items → Lessons Learned is the standard structure. Why this order:
- Summary — one paragraph; readers who skip the rest get the essential facts
- Impact — who was affected, how many users, for how long, revenue/SLA implications
- Timeline — chronological events in UTC; the factual record
- Root Cause Analysis — proximate, contributing, and root causes
- Action Items — specific, owned, dated items to prevent recurrence
- Lessons Learned — what went well, what didn't, process insights
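The ordering above can be checked mechanically before a draft goes out for review. The following is a minimal sketch, assuming the draft's headings have already been extracted into a list of strings; `EXPECTED_SECTIONS` and `check_section_order` are hypothetical names chosen for illustration, not part of any standard tool.

```python
# Section order taken from the list above.
EXPECTED_SECTIONS = [
    "Summary",
    "Impact",
    "Timeline",
    "Root Cause Analysis",
    "Action Items",
    "Lessons Learned",
]


def check_section_order(headings):
    """Return a list of problems found in a draft's section headings.

    `headings` is the ordered list of heading titles in the draft.
    An empty result means every expected section is present and in order.
    """
    problems = []
    for section in EXPECTED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")

    # Compare the recognised headings, as written, against the expected order.
    present = [h for h in headings if h in EXPECTED_SECTIONS]
    expected_present = [s for s in EXPECTED_SECTIONS if s in headings]
    if present != expected_present:
        problems.append("sections out of order: " + " -> ".join(present))
    return problems


# Hypothetical draft: Impact and Timeline are swapped, Lessons Learned is absent.
draft = ["Summary", "Timeline", "Impact", "Root Cause Analysis", "Action Items"]
for problem in check_section_order(draft):
    print(problem)
```

Headings outside the expected set are ignored, so a team that adds extra sections (for example an appendix) would not trigger a false alarm.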
A post-mortem summary section says: "The outage lasted 2 hours 14 minutes and affected all paid users. The root cause was a misconfigured Redis timeout introduced in deployment v2.8.1." What is good about this summary?
Duration + affected segment + root cause are the three essential summary facts. Post-mortem summary checklist:
- ✅ Duration: "2 hours 14 minutes" — precise, calculated from timeline
- ✅ Affected users: "all paid users" — scoped, not vague "some users"
- ✅ Root cause (brief): "misconfigured Redis timeout in v2.8.1" — one sentence, no deep analysis yet
- (Optional) Detection method: "detected by monitoring alert at 14:32 UTC"
- (Optional) Resolution: "resolved by rolling back to v2.8.0"
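Because the duration in the summary should be calculated from the timeline rather than estimated, it helps to derive it from the first and last timestamps. Here is a minimal sketch using Python's standard `datetime` module; the function name `outage_duration` and the calendar date in the example are assumptions for illustration, and the 14:32 and 16:46 UTC times are the ones used in this lesson's examples.

```python
from datetime import datetime


def outage_duration(start_utc, end_utc):
    """Return the outage length as an 'H hours M minutes' string.

    Both arguments are ISO 8601 strings with an explicit UTC offset,
    for example '2024-01-15T14:32:00+00:00'.
    """
    start = datetime.fromisoformat(start_utc)
    end = datetime.fromisoformat(end_utc)
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("timestamps must carry a timezone offset")
    if end < start:
        raise ValueError("end precedes start")
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


# The date below is a placeholder; only the times come from the examples.
print(outage_duration("2024-01-15T14:32:00+00:00", "2024-01-15T16:46:00+00:00"))
```

Requiring an explicit offset keeps local-time entries from slipping into the calculation unnoticed.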
Which of these is an example of blameless language in a post-mortem?
"The deployment process does not require mandatory staging validation" — system failure, not human failure. Blameless language principles:
- ❌ "should have" — implies an individual failed to do their duty
- ❌ "someone forgot" — identifies individual error
- ❌ "human error" — a catch-all that prevents learning; what systemic factor enabled the error?
- ✅ "The process does not require..." — identifies a systemic gap that will exist regardless of who deploys
- ✅ "The configuration validation step was missing from the checklist" — process failure, fixable
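A simple phrase check can catch the flagged wording above before a draft is circulated. The sketch below is illustrative only: the pattern list covers just the three phrases named in this lesson, `find_blame_language` is a hypothetical helper, and a match is a prompt for a human to reword, not proof that a sentence assigns blame.

```python
import re

# Phrases the list above marks as blame-oriented, each with a rewording prompt.
BLAME_PATTERNS = {
    r"\bshould(?:\s+have|'ve)\b": "describe what the process did or did not require",
    r"\bsomeone\s+forgot\b": "name the missing step or safeguard instead",
    r"\bhuman\s+error\b": "state which systemic factor enabled the error",
}


def find_blame_language(text):
    """Return (line_number, matched_phrase, suggestion) tuples for flagged phrases."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for pattern, suggestion in BLAME_PATTERNS.items():
            for match in re.finditer(pattern, line, flags=re.IGNORECASE):
                findings.append((line_number, match.group(0), suggestion))
    return findings


# Hypothetical draft text: the first line is flagged, the second is not.
sample = (
    "The engineer should have tested the change in staging.\n"
    "The deployment process does not require mandatory staging validation."
)
for line_number, phrase, suggestion in find_blame_language(sample):
    print(f"line {line_number}: '{phrase}' -> {suggestion}")
```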
The "Lessons Learned" section of a post-mortem should contain:
What worked + what didn't + what to do differently is the Lessons Learned content. Lessons Learned structure:
- What went well: "The monitoring alert fired within 2 minutes of the first error. The on-call runbook was accurate and reduced MTTR."
- What didn't go well: "Staging did not have equivalent load, so the timeout issue didn't surface in testing."
- What we'll do differently: "Add load testing to the staging validation gate before each production deploy."
A post-mortem's "Impact" section states: "Some users were affected for a while." How should this be improved?
User count + feature + duration + SLA/financial impact makes the Impact section useful. Good Impact section example: "Approximately 1,200 paid subscribers (18% of active users) were unable to access the payments dashboard from 14:32 to 16:46 UTC, a total of 2 hours 14 minutes. During this window, all payment processing failed, resulting in an estimated $8,400 in delayed transactions. The outage exceeded the downtime allowed by our 99.9% monthly uptime SLA (about 43 minutes in a 30-day month) by roughly 91 minutes." Components:
- ✅ User count and percentage
- ✅ Affected feature
- ✅ Duration with exact times in UTC
- ✅ Business impact (transactions, SLA breach)
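The SLA component is easiest to state precisely when the downtime budget is computed rather than recalled. The following is a minimal sketch of that arithmetic, assuming a 30-day measurement window and no other downtime in the month; real SLAs may define the window or count partial outages differently, and both function names are hypothetical.

```python
def allowed_downtime_minutes(sla_percent, days_in_month=30):
    """Downtime budget in minutes for an uptime SLA over one month."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def sla_overrun_minutes(outage_minutes, sla_percent, days_in_month=30):
    """Minutes by which one outage exceeds the monthly budget (0 if within it)."""
    budget = allowed_downtime_minutes(sla_percent, days_in_month)
    return max(0.0, outage_minutes - budget)


# A 2 hour 14 minute outage is 134 minutes.
budget = allowed_downtime_minutes(99.9)
overrun = sla_overrun_minutes(134, 99.9)
print(f"budget: {budget:.1f} min, overrun: {overrun:.1f} min")
```

Under these assumptions a 99.9% monthly SLA allows about 43.2 minutes of downtime, so a 134-minute outage on its own exceeds the budget by about 90.8 minutes.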