5 exercises: impact statements, blameless timelines, Five Whys root-cause analysis, action items with owners, and lessons learned. The core vocabulary of SRE culture.
Blameless post-mortem structure
Impact: When, how long, how many users, revenue impact — with numbers
Timeline: Chronological events with UTC timestamps — system-level, not people-level
Root cause: Five Whys — systemic gap, not "a bug" or "human error"
Contributing factors: Everything that made the failure worse or harder to detect
Action items: Owner + due date + success criterion — no vague "we should"
Lessons learned: What went well, what to improve, key systemic insight
1 / 5
A post-mortem impact statement reads: "The payment system was down and lots of users were affected." A tech lead says it needs to be rewritten. What should a stronger version include?
A strong impact statement answers four questions with specific, measurable data:
1. When? "2026-04-05 from 14:32 to 16:15 UTC" — exact timestamps, always in UTC
2. How long? "103 minutes" — duration, not just start/end
3. Who was affected and how? "14,800 checkout attempts failed" — not "lots of users"
4. What was the business impact? "Estimated revenue impact: $87,000; 230 support tickets" — quantifies the cost
Why specificity matters. Post-mortems are used to:
• Prioritise which systems to improve first (high impact → high priority)
• Report to stakeholders who need numbers, not impressions
• Set a baseline for future incidents ("was this worse than last time?")
• Drive SLO and error budget conversations
Post-mortem impact vocabulary:
• "The outage lasted [N minutes/hours] from [start UTC] to [end UTC]."
• "Approximately [N] users / transactions / requests were affected."
• "Estimated revenue impact: [amount]."
• "[N] support tickets / customer complaints were received."
• "[Service X] was fully unavailable / degraded at [N]% reduced capacity."
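To make the mapping concrete, here is a minimal sketch that assembles an impact statement from structured incident data, using the numbers from the example above. The IncidentImpact class and its field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class IncidentImpact:
    # All timestamps in UTC, per post-mortem convention.
    start: datetime
    end: datetime
    failed_requests: int
    revenue_impact_usd: int
    support_tickets: int

    def statement(self) -> str:
        # Duration is derived, so "103 minutes" always matches the timestamps.
        minutes = int((self.end - self.start).total_seconds() // 60)
        return (
            f"The outage lasted {minutes} minutes, from "
            f"{self.start:%Y-%m-%d %H:%M} to {self.end:%H:%M} UTC. "
            f"Approximately {self.failed_requests:,} checkout attempts failed. "
            f"Estimated revenue impact: ${self.revenue_impact_usd:,}. "
            f"{self.support_tickets} support tickets were received."
        )

impact = IncidentImpact(
    start=datetime(2026, 4, 5, 14, 32, tzinfo=timezone.utc),
    end=datetime(2026, 4, 5, 16, 15, tzinfo=timezone.utc),
    failed_requests=14_800,
    revenue_impact_usd=87_000,
    support_tickets=230,
)
print(impact.statement())  # answers all four questions with numbers
```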
2 / 5
A post-mortem timeline entry reads: "14:47 — John noticed the error rate going up and eventually decided to check the database." A reviewer says this is not blameless enough. The improved version is _____.
Blameless post-mortems focus on systems, signals, and decisions — not on people and their shortcomings. This is one of the most important concepts in SRE culture (pioneered by Google).
A corrected version reads: "14:47 — Error rate at 12% (up from baseline 0.1%). On-call engineer initiated database health check." What makes it blameless:
• Names a system-level observable: "Error rate at 12% (up from baseline 0.1%)"
• Uses a role rather than a name: "On-call engineer" — the individual isn't blamed; the system's response is tracked
• Describes an action, not a judgment: "initiated database health check"
• Could not be added to a performance review as evidence of failure
Blameless writing vocabulary:
• "[Role/system] observed [metric] at [value] in [tool]."
• "The monitoring alert for [condition] did not fire." (system gap, not person gap)
• "The runbook did not cover this failure mode." (documentation gap)
• "At the time, the available information suggested [explanation] — leading to [decision]."
Blameful vs. blameless rewrites:
❌ "Bob deployed without testing." → ✅ "The deployment pipeline did not require a staging test pass."
❌ "Alice missed the alert." → ✅ "The alert was not configured for the affected service."
❌ "The team took too long." → ✅ "Detection time was 34 minutes; the alert was configured with a 15-minute delay."
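As an illustration, a hypothetical structure for timeline entries that makes blameless phrasing the default (TimelineEntry and its fields are invented for this sketch, not part of any incident tool):

```python
from dataclasses import dataclass

@dataclass
class TimelineEntry:
    time_utc: str     # "HH:MM", always UTC
    actor: str        # a role or system name, never an individual
    observation: str  # the system-level signal, with numbers
    action: str       # what was done, stated neutrally

    def render(self) -> str:
        # Signal first, then the role's response: systems, not people.
        return f"{self.time_utc} — {self.observation}. {self.actor} {self.action}."

entry = TimelineEntry(
    time_utc="14:47",
    actor="On-call engineer",
    observation="Error rate at 12% (up from baseline 0.1%)",
    action="initiated database health check",
)
print(entry.render())
# 14:47 — Error rate at 12% (up from baseline 0.1%). On-call engineer initiated database health check.
```

Because the structure has no field for judgments, entries like "eventually decided to" have nowhere to go.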
3 / 5
A post-mortem lists "Root Cause" as: "A bug in the payment service caused the outage." A senior SRE says this is incomplete. What is missing?
"A bug caused the outage" is never the root cause in a blameless post-mortem. It's the proximate cause — the immediate trigger. The root cause is the systemic condition that allowed the bug to reach production and cause an outage.
The Five Whys applied:
1. Why did the outage happen? → The payment service crashed.
2. Why did it crash? → A null pointer exception on a required field.
3. Why did code with a null pointer exception reach production? → Unit tests didn't cover nil input on that path.
4. Why didn't tests cover that path? → Test coverage was below 60% for that module.
5. Why was coverage that low? → The service was migrated from Python 2 three years ago and tests were never backfilled.
Root cause: "Insufficient test coverage on the payment service, a legacy of the Python 2 migration, allowed a nil-input regression to ship undetected."
Post-mortem root cause vocabulary:
• "The proximate cause was X. The root cause was Y."
• "Five Whys analysis revealed that the underlying systemic issue was…"
• "Contributing factors included: …"
• "This failure mode was possible because [systemic gap]."
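A small sketch (purely illustrative) of the Five Whys chain as data, which keeps the proximate/root distinction explicit:

```python
# Each entry answers "why?" about the one before it.
five_whys = [
    "The payment service crashed.",                               # proximate cause
    "A null pointer exception occurred on a required field.",
    "Unit tests did not cover nil input on that path.",
    "Test coverage was below 60% for that module.",
    "Tests were never backfilled after the Python 2 migration.",  # root cause
]

# The first answer is only the trigger; the last is the systemic condition.
print(f"The proximate cause was: {five_whys[0]}")
print(f"The root cause was: {five_whys[-1]}")
```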
4 / 5
The "Action Items" section of a post-mortem lists: "Fix the bug. Improve monitoring. Update documentation." A tech lead says these are unacceptable. Why?
Vague post-mortem action items are one of the most common reasons incidents repeat. "Fix the bug" assigns no one, sets no deadline, and has no success criterion.
What a complete action item requires:
• Owner: one named person or team — not "the team" or "DevOps"
• Due date: specific date, not "soon" or "next sprint"
• What exactly: specific, measurable change
• Success criterion: a measurable outcome that shows the change worked
• Priority: P1/P2 or equivalent — links to the incident's severity
Rewritten action items:
❌ "Fix the bug." → ✅ "Add nil-safety checks to payment processor inputs. Owner: Maria G. Due: 2026-04-12."
❌ "Improve monitoring." → ✅ "Add Datadog alert for payment service error rate > 1% sustained for 2 minutes. Owner: SRE team. Due: 2026-04-10."
❌ "Update documentation." → ✅ "Update payments runbook with nil-input failure mode and recovery steps. Owner: Alex T. Due: 2026-04-15."
Post-mortem action item vocabulary:
• "Action item: [specific task]. Owner: [name]. Priority: P[N]. Due: [date]. Success criterion: [measurable outcome]."
• "This is a preventive action." / "This is a detective action." / "This is a corrective action."
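A minimal sketch of how a team might enforce this structure in tooling. ActionItem, VAGUE_OWNERS, and the field names are assumptions for illustration, not any real tracker's API:

```python
from dataclasses import dataclass
from datetime import date

# Owner values a tech lead would reject as vague (illustrative list).
VAGUE_OWNERS = {"the team", "devops", "someone", "tbd"}

@dataclass
class ActionItem:
    task: str               # specific, measurable change
    owner: str              # one named person or team
    due: date               # a real date, not "next sprint"
    priority: str           # e.g. "P1", linked to incident severity
    success_criterion: str  # how we'll know it worked

    def __post_init__(self) -> None:
        # Reject the vague phrasings that let incidents repeat.
        if self.owner.lower() in VAGUE_OWNERS:
            raise ValueError(f"Owner must be a specific person or team, not {self.owner!r}")
        if not self.success_criterion:
            raise ValueError("Every action item needs a measurable success criterion")

item = ActionItem(
    task="Add nil-safety checks to payment processor inputs",
    owner="Maria G.",
    due=date(2026, 4, 12),
    priority="P1",
    success_criterion="Nil input on any payment path is rejected without a crash",
)
print(f"Action item: {item.task}. Owner: {item.owner}. "
      f"Priority: {item.priority}. Due: {item.due:%Y-%m-%d}. "
      f"Success criterion: {item.success_criterion}.")
```

"Fix the bug" cannot even be constructed here: the missing owner, date, and success criterion are required fields, not prose conventions.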
5 / 5
A post-mortem document ends with: "We apologise to our users for the inconvenience caused." A senior engineer objects to this in an internal post-mortem. Why?
Different document types serve different purposes and audiences, and their tone and content should match.
Internal post-mortem:
• Audience: engineers, SREs, tech leads
• Purpose: understand what happened, prevent recurrence
• Ending: lessons learned + action items with owners
• Tone: analytical, neutral, forward-looking
• No apologies, marketing language, or customer service tone

Customer-facing incident communication:
• Audience: users, customers
• Purpose: acknowledge impact, explain what happened in non-technical terms, restore confidence
• Includes: apology, impact acknowledgment, what was fixed, what prevents recurrence
• Tone: empathetic, professional, clear

Standard internal post-mortem closing sections:
• "Lessons Learned / What Went Well / What Could Be Improved" — three lists
• "Action Items" — the decisions made
• "Review date" — when this post-mortem will be reviewed for progress
Lessons Learned vocabulary:
• "What went well: Our on-call rotation detected the issue within [time]."
• "What could be improved: Our runbook did not cover this failure mode."
• "Key lesson: [systemic insight that applies beyond this incident]."