Intermediate–Advanced 7 terms

Advanced Post-Incident Review

Blameless postmortem culture, contributing factors, timeline reconstruction, corrective vs. preventive actions, psychological safety, and SRE postmortem document structure.

  • Blameless Postmortem /ˈbleɪmləs ˈpəʊstmɔːtəm/

    A post-incident review conducted without assigning blame to individuals. The premise: engineers make mistakes in complex systems, and punishment creates incentives to hide failures. Instead, the review focuses on system and process improvements that prevent recurrence.

    "When the deploy script deleted production data, we ran a blameless postmortem. The engineer who ran the script was asked to describe what they saw and did — not accused. The review identified: the script lacked a dry-run mode, the staging environment didn't have production-realistic data sizes, and the runbook step 7 was ambiguous. Three process improvements, zero punishment."
  • Contributing Factor /kənˈtrɪbjʊtɪŋ ˈfæktər/

    A condition or event that contributed to an incident but is not the 'root cause' on its own. Modern incident analysis (for example, in SRE practice and resilience engineering) looks for multiple contributing factors rather than a single root cause, because incidents are rarely caused by one thing.

    "The API outage had five contributing factors: 1) a configuration change that disabled connection pooling, 2) a traffic spike coinciding with a marketing campaign, 3) no circuit breaker on the database connection, 4) monitoring alert threshold set too high (didn't page until 5 minutes after degradation started), 5) the runbook lacked a 'high connection count' playbook. All five contributed."
  • "5 Whys" Analysis /faɪv waɪz/

    A root cause analysis technique: ask 'why?' repeatedly (typically 5 times) to trace a problem from its symptom to an underlying system or process failure. Used in postmortems to move beyond proximate cause to actionable improvements.

    "Why did the service go down? Database connections were exhausted. Why? Connection pooling was not in effect. Why? The connection pool library wasn't configured correctly. Why? The default config wasn't suitable and the docs were unclear. Why? No runbook step validates connection pool settings before deploy. Fix: add a validation step to the deploy runbook. The 5 Whys moved us from 'someone made a mistake' to 'the deploy process allows misconfiguration.'"
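
    The validation step proposed in the example could look roughly like the sketch below. The config layout (`db_pool`, `enabled`, `max_size`, `acquire_timeout_seconds`) and the function name are hypothetical, chosen only for illustration; a real check would read whatever settings the team's connection pool library actually uses.

```python
from typing import Any


def validate_pool_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the check passes."""
    pool = config.get("db_pool")  # hypothetical config section name
    if not isinstance(pool, dict):
        return ["db_pool section is missing, so pooling is not configured"]

    problems: list[str] = []
    if not pool.get("enabled", False):
        problems.append("db_pool.enabled is false or unset")

    max_size = pool.get("max_size")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(max_size, bool) or not isinstance(max_size, int) or max_size < 1:
        problems.append("db_pool.max_size must be a positive integer")

    timeout = pool.get("acquire_timeout_seconds")
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)) or timeout <= 0:
        problems.append("db_pool.acquire_timeout_seconds must be greater than zero")

    return problems


# Example: a config change that disabled pooling and dropped the timeout
example = {"db_pool": {"enabled": False, "max_size": 20}}
for problem in validate_pool_config(example):
    print(f"BLOCKED: {problem}")
```

    Returning a list of problems instead of raising on the first one lets the deploy step report every misconfiguration in a single run.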
  • Corrective vs. Preventive Action /kəˈrɛktɪv prɪˈvɛntɪv ˈækʃən/

    Corrective action: fixes the immediate issue (restore service, fix the bug). Preventive action: addresses the contributing factors to prevent recurrence (add monitoring, improve process, add automated checks). Postmortems should produce both.

    "Corrective actions (done during incident): rolled back the deployment, manually restored connection pool config. Preventive actions (tracked as postmortem tickets): 1) Add connection pool health check to deployment validation (Owner: DevOps, Due: 2 weeks), 2) Add PagerDuty alert when connection count >80% (Owner: SRE, Due: 1 week), 3) Update deploy runbook section 4 (Owner: me, Due: 3 days)."
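
    Preventive actions only help if they are tracked to completion. The following is a minimal sketch of how the three tickets in the example might be represented and checked for overdue items. The `ActionItem` type, the `overdue` helper, and the postmortem date are all assumptions made for illustration; most teams would keep this in their ticketing system instead.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


# Hypothetical review date; the example does not state one.
postmortem_day = date(2024, 3, 4)

items = [
    ActionItem("Add connection pool health check to deployment validation",
               "DevOps", postmortem_day + timedelta(weeks=2)),
    ActionItem("Add alert when connection count exceeds 80%",
               "SRE", postmortem_day + timedelta(weeks=1)),
    ActionItem("Update deploy runbook section 4",
               "me", postmortem_day + timedelta(days=3)),
]

# Ten days after the review, with nothing marked done,
# the 1-week and 3-day items are late; the 2-week item is not.
late = overdue(items, postmortem_day + timedelta(days=10))
for item in late:
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due})")
```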
  • Incident Timeline /ˈɪnsɪdənt ˈtaɪmlaɪn/

    A chronological reconstruction of events during an incident: when alerts fired, what actions were taken, when each responder joined, and when the impact began and ended. Assembled from multiple sources (logs, Slack, monitoring). The foundation of a good postmortem.

    "Timeline reconstruction from five sources: 09:43 - first 5xx errors appear (logs). 09:47 - PagerDuty alert fires (PD history). 09:48 - on-call acknowledges, starts investigating (Slack). 09:52 - database alerts fire (monitoring). 10:04 - root cause identified (Slack). 10:09 - rollback initiated (deploy logs). 10:14 - service recovers (monitoring). Full impact: 31 minutes."
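
    Merging per-source events into one ordered timeline is mechanical and can be scripted. The sketch below uses the timestamps from the example; the calendar date, the tuple layout, and the `build_timeline` helper are assumptions for illustration. It also assumes impact runs from the first recorded event to the last, which holds for this example but not for every incident.

```python
from datetime import datetime

DAY = "2024-03-04"  # hypothetical date; the example gives only times of day

# (time, source, event) tuples as they might be collected from each source
log_events = [("09:43", "logs", "first 5xx errors appear")]
pagerduty_events = [("09:47", "PD history", "PagerDuty alert fires")]
slack_events = [
    ("09:48", "Slack", "on-call acknowledges, starts investigating"),
    ("10:04", "Slack", "root cause identified"),
]
monitoring_events = [
    ("09:52", "monitoring", "database alerts fire"),
    ("10:14", "monitoring", "service recovers"),
]
deploy_events = [("10:09", "deploy logs", "rollback initiated")]


def build_timeline(*sources):
    """Merge events from all sources and sort them chronologically."""
    merged = []
    for source in sources:
        for hhmm, origin, text in source:
            when = datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")
            merged.append((when, origin, text))
    return sorted(merged, key=lambda event: event[0])


timeline = build_timeline(log_events, pagerduty_events, slack_events,
                          monitoring_events, deploy_events)

for when, origin, text in timeline:
    print(f"{when:%H:%M}  {text} ({origin})")

# Impact window: first recorded event to last recorded event
impact_minutes = int((timeline[-1][0] - timeline[0][0]).total_seconds() // 60)
print(f"Full impact: {impact_minutes} minutes")
```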
  • Psychological Safety (Postmortem Context) /ˌsaɪkəˈlɒdʒɪkəl ˈseɪfti/

    The shared belief among team members that it is safe to speak up, admit mistakes, and share bad news without fear of punishment, embarrassment, or career consequences. Essential for blameless postmortems to surface what actually happened.

    "Our postmortem culture required psychological safety before it worked. Early postmortems were silent: engineers gave vague answers, afraid of consequences. We changed two things: the facilitator always started by thanking the responders for their work during the incident, and managers were explicitly excluded from postmortem reviews. Participation and honesty improved dramatically within three months."
  • Production Readiness Review (PRR) /prəˈdʌkʃən ˈrɛdinəs rɪˈvjuː/

    A pre-launch review by the SRE team to assess whether a service meets the reliability bar for production. Checks: a runbook exists, alerting is configured, load testing is done, graceful degradation is tested, capacity is planned, and a rollback procedure exists. A service that fails PRR doesn't get SRE support.

    "The new payment processor integration failed PRR on three points: no runbook for payment timeout scenarios, no alerting on payment success rate, and a load test that ran at only 10% of expected peak traffic. The dev team addressed all three in two weeks. PRR is not a bureaucratic gate; it's the engineering work we require to be confident the system won't page us at 3am."
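
    A PRR checklist can be expressed as data so that the outcome is reproducible. The sketch below encodes the three failures from the example plus one passing check. The `Check` type, the `review` function, the passing rollback check, and the requirement to load test at 100% of expected peak are assumptions for illustration, not a standard PRR format.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    passed: bool
    note: str = ""


def review(checks: list[Check]) -> tuple[bool, list[Check]]:
    """Return (approved, failures); approval requires every check to pass."""
    failures = [check for check in checks if not check.passed]
    return (not failures, failures)


# Assumed bar: the load test must reach the full expected peak.
REQUIRED_PEAK_FRACTION = 1.0
tested_peak_fraction = 0.10  # from the example: 10% of expected peak

checks = [
    Check("runbook covers payment timeout scenarios", False, "no runbook"),
    Check("alerting on payment success rate", False, "no alert configured"),
    Check("load tested at expected peak",
          tested_peak_fraction >= REQUIRED_PEAK_FRACTION,
          f"ran at {tested_peak_fraction:.0%} of expected peak"),
    Check("rollback procedure exists", True),  # hypothetical passing check
]

approved, failures = review(checks)
print("PRR approved" if approved else "PRR failed")
for check in failures:
    print(f"  - {check.name}: {check.note}")
```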