An SRE lead introduces blameless postmortems: "A blameless postmortem assumes that people don't make mistakes maliciously — they were trying their best given the information and tools available. When we find 'human error', we don't stop there. We ask: why did the system make it easy for a human to make this mistake? What safeguards were missing? The goal is to improve systems and processes — not to assign blame to individuals." What is the core principle of a blameless postmortem?
Blameless postmortem: an incident review practice that treats contributors as people who made reasonable decisions with incomplete information, and that focuses on system improvement.
Origins: Sidney Dekker's work on safety science, adapted by SRE teams at Google and popularised by Netflix, Etsy, and PagerDuty.
Core principles:
Assume good intent: people were doing their best.
Focus on systems, not people: why did the system allow this failure?
Psychological safety: participants must feel safe to share what they knew, thought, and did, without fear of punishment.
No single root cause: complex systems have multiple contributing factors.
Just culture vocabulary:
Just culture: holds people accountable for reckless behaviour, but not for honest mistakes in a broken system (Sidney Dekker, David Marx).
Restorative justice: focus on what the responder needs to learn, not on punishment.
Second victim: the engineer who made the error also needs support; they often suffer most.
Learning review: an alternative name for postmortem that emphasises the learning purpose.
In conversation: "We removed 'root cause' from our postmortem template and replaced it with 'contributing factors'. It was a subtle but important shift in how we think about incidents."
2 / 5
An incident commander facilitates a postmortem discussion: "I want us to avoid hindsight bias. When we review what the on-call engineer did at 2am with an unfamiliar alert, we need to put ourselves in their position at that moment — not evaluate decisions from the comfort of knowing what we know now. What information did they have? What did the system tell them? What seemed like the right decision at the time?" What is hindsight bias in incident analysis and why is it problematic?
Hindsight bias (knew-it-all-along effect): the tendency to believe, after learning an outcome, that we would have predicted it. In incident analysis: "It's obvious the cache would have failed under that load, so why didn't the engineer see that?" This is unfair because the outcome wasn't obvious before the fact.
Cognitive biases in incident analysis:
Hindsight bias: evaluating past decisions with knowledge that only became available afterwards.
Outcome bias: judging a decision by its result rather than its quality at the time. A good decision that had a bad outcome isn't a bad decision.
Counterfactual reasoning: "if they had done X, the incident wouldn't have happened." Unhelpful because we can't know for certain that the alternative would have worked.
Attribution error: attributing failures to individual character ("careless") rather than situational factors ("under pressure, with ambiguous runbooks, at 3am").
Analysis tools:
Local rationality: ask what made the responder's actions rational given their local context.
Timeline review: reconstruct events in sequence to understand the context at each decision point.
"What did they know? When did they know it?": the key questions for each decision.
In conversation: "When reviewing the 3am decision to restart the service, remember the engineer had two simultaneous alerts, the runbook was outdated, and they'd had no sleep. That context matters."
3 / 5
A reliability engineer presents contributing factors analysis: "Instead of asking 'what was the root cause?' we ask 'what were the contributing factors?' There's rarely one cause — there are multiple factors that aligned to make the incident possible. The Swiss cheese model: each slice of cheese has holes. A serious incident happens when the holes in all the slices align. Each factor is a hole in one slice — remove any one, and the incident doesn't happen." Why do advanced postmortems use contributing factors instead of a single root cause?
Why "contributing factors" over "root cause": "root cause" implies a single origin: fix it and the problem is solved. Complex systems have multiple interacting conditions. Stopping at "root cause" creates:
Blame pressure: the "cause" is often a person, and they get blamed.
Incomplete fixes: only one factor is addressed; the others remain.
Recurrence: a different triggering event reveals the other unaddressed factors.
Swiss Cheese Model (James Reason): safety barriers have holes (weaknesses). An accident occurs when the holes align. No single slice caused it.
Systemic factors:
Latent conditions: pre-existing weaknesses in the system (poor runbooks, missing monitoring).
Active failures: actions of people at the sharp end.
Organisational factors: staffing, tooling, training.
Contributing factors format: "The incident was possible because: 1) missing rate limiting on the endpoint, 2) no alerting on cache miss rate, 3) the on-call runbook didn't cover this failure mode, 4) load testing didn't include this traffic pattern."
CAPA vocabulary:
Corrective action: fixes the specific problem that occurred.
Preventive action: prevents a different problem of the same category.
Action item owner: the named individual responsible for completing an action by a due date.
In conversation: "We identified 7 contributing factors, and our action items map to each one. If we'd stopped at 'root cause = misconfiguration' we'd have fixed one thing and left 6 unchanged."
4 / 5
A platform team reviews their incident metrics: "We track MTTD, MTTI, MTTR, and MTTF. Detection time is when the monitoring alerted. Identification time is when we understood what was wrong. Resolution time is when the service was restored. Time to failure is how long the service runs between failures. For this incident: detection at T+3min, identification at T+22min, mitigation at T+35min, full resolution at T+2hr. The long gap between detection and identification is our biggest problem: our alerts aren't telling us enough." What does MTTD measure and why is minimising it important?
MTTD (Mean Time To Detect): the average time between when an incident begins and when it is detected by monitoring or users. If an incident starts at T+0 and the alert fires at T+5min, the detection time is 5 minutes.
Why minimise: every minute without detection is a minute of customer impact. A 30-minute MTTD means 30 minutes of degradation before any response begins.
Incident time vocabulary:
MTTD (Mean Time To Detect): improved by comprehensive alerting, synthetic monitoring, real user monitoring, and more sensitive alert thresholds.
MTTI (Mean Time To Identify): from detection to understanding what's wrong. Improved by better alert context, runbooks, tracing, and correlation.
MTTR (Mean Time To Recover/Restore): the time to restore service. Teams measure it either from incident start or from detection; the timeline below measures from incident start. One of the four DORA metrics.
MTTF (Mean Time To Failure): the average time the system runs before it fails. A higher value means a more stable system.
Mitigation time: when the immediate user impact was stopped (e.g., rollback, failover). It may differ from full resolution (e.g., a data backfill is still ongoing).
Customer notification time: when customers/stakeholders were informed.
SLA breach time: when the agreed availability SLA was violated.
Incident timeline format: "T+0: incident begins. T+5: alert fires (time to detect = 5min). T+20: issue identified (time to identify = 15min). T+35: mitigation complete. T+2h: full resolution (time to resolve = 2h from incident start)." MTTD, MTTI, and MTTR are the averages of these per-incident values across incidents.
In conversation: "Our MTTD is 8 minutes on average, but this incident took 40 minutes to detect because we had no alert on that specific error code. That's the gap we're fixing."
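Worked example: the timeline arithmetic above can be sketched in Python. This is a minimal illustration under stated assumptions, not a standard tool. The IncidentTimeline class, its field names, and the mean_minutes helper are hypothetical names chosen here, and the three-incident history is made-up sample data. Because teams differ on whether recovery time starts at incident start or at detection, the sketch computes both.

```python
from dataclasses import dataclass
from datetime import timedelta
from statistics import mean


@dataclass(frozen=True)
class IncidentTimeline:
    """Hypothetical record of one incident; every field is an offset from T+0."""
    detected: timedelta    # alert fired or users reported
    identified: timedelta  # responders understood what was wrong
    mitigated: timedelta   # immediate user impact stopped
    resolved: timedelta    # full resolution

    def time_to_detect(self) -> timedelta:
        # Incident start (T+0) to detection.
        return self.detected

    def time_to_identify(self) -> timedelta:
        # Detection to understanding, matching the MTTI definition.
        return self.identified - self.detected

    def time_to_resolve(self, from_detection: bool = False) -> timedelta:
        # Measured from incident start by default, or from detection on request.
        return self.resolved - self.detected if from_detection else self.resolved


def mean_minutes(durations) -> float:
    """Average a collection of timedeltas, in minutes (the 'M' in MTTD/MTTI/MTTR)."""
    return mean(d.total_seconds() / 60 for d in durations)


# The example timeline: T+5 alert, T+20 identified, T+35 mitigated, T+2h resolved.
example = IncidentTimeline(
    detected=timedelta(minutes=5),
    identified=timedelta(minutes=20),
    mitigated=timedelta(minutes=35),
    resolved=timedelta(hours=2),
)

# Made-up detection times for three incidents, used only to show the averaging.
history = [timedelta(minutes=5), timedelta(minutes=8), timedelta(minutes=11)]
mttd_minutes = mean_minutes(history)
```

For the example timeline, time_to_identify() gives 15 minutes, and time_to_resolve() gives 2 hours from incident start or 1 hour 55 minutes from detection. Whichever convention a team picks, it should state it next to the number.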
5 / 5
An engineering manager writes the CAPA for a major incident: "Our CAPA has two parts: corrective actions address the specific failure that occurred — in this case, adding rate limiting to the authentication endpoint. Preventive actions address the broader category — we're auditing all public-facing endpoints for rate limiting gaps, and adding a checklist item to our feature review process. Each action has an owner and a due date — tracked in our incident management system." What is the difference between a corrective action and a preventive action in CAPA?
CAPA (Corrective and Preventive Action): a formal quality management framework.
Corrective Action: addresses the specific defect or failure that caused the current incident. Reactive. Example: "Add rate limiting to /auth/login endpoint; this specific endpoint was the attack vector."
Preventive Action: addresses the category of failure to prevent similar incidents elsewhere. Proactive. Example: "Audit all public endpoints for rate limiting; add rate limiting check to pre-launch security checklist."
The distinction matters: corrective action alone produces a whack-a-mole pattern, where you fix one thing and a similar problem appears elsewhere. Preventive actions close the broader gap.
Action item quality:
Good: specific, measurable, has a named owner, has a due date. "Add CloudFront rate limiting rule to /auth/* by 2026-06-15, owned by [name]."
Bad: vague. "Improve security posture."
Postmortem action item tracking: use Jira, Linear, or incident management tools (PagerDuty, Rootly, FireHydrant) to track action items.
Scheduled postmortem review: review open action items 30 and 90 days post-incident. Action item completion rate is a reliability maturity metric.
In conversation: "A postmortem with 15 action items and no owners is just a document. Every action needs one owner, not a team, and a due date in the next 90 days."
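Worked example: the mechanical parts of action item quality (a named owner, a due date, a due date within 90 days) can be checked in code. The following is a minimal Python sketch, not a feature of any tracking tool named above. ActionItem, problems, and completion_rate are hypothetical names, and the owner "alice" and the incident date 2026-05-01 are invented for illustration. Whether a description is specific and measurable still needs human judgment.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical postmortem action item."""
    description: str
    kind: str                    # "corrective" or "preventive"
    owner: Optional[str] = None  # one named person, not a team
    due: Optional[date] = None
    done: bool = False


def problems(item: ActionItem, incident_date: date, max_days: int = 90) -> List[str]:
    """Return the mechanical quality problems found in one action item."""
    issues = []
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no due date")
    elif item.due > incident_date + timedelta(days=max_days):
        issues.append(f"due more than {max_days} days after the incident")
    return issues


def completion_rate(items: List[ActionItem]) -> float:
    """Fraction of action items marked done; 0.0 for an empty list."""
    return sum(item.done for item in items) / len(items) if items else 0.0


incident_date = date(2026, 5, 1)  # assumed incident date, for illustration only

good = ActionItem(
    description="Add CloudFront rate limiting rule to /auth/*",
    kind="corrective",
    owner="alice",               # hypothetical owner
    due=date(2026, 6, 15),
    done=True,
)
bad = ActionItem(description="Improve security posture", kind="preventive")

good_issues = problems(good, incident_date)
bad_issues = problems(bad, incident_date)
rate = completion_rate([good, bad])
```

Here the specific item passes with no problems, the vague item is flagged for a missing owner and a missing due date, and one of the two items is complete. Running a check like this at the 30-day and 90-day reviews is one way to keep the completion rate visible.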