Intermediate Reading #postmortem #incidents #SRE

Reading Postmortems & Incident Reports

5 exercises on Kubernetes outage and database incident postmortems. Navigate timelines, identify root causes vs. contributing factors, and read action items like a senior engineer.

Reading postmortems effectively
  • Timeline first — calculate TTD (time-to-diagnose) and TTM (time-to-mitigate) from timestamps
  • Root cause vs. contributing factors — root cause is the direct trigger; contributing factors worsened impact or delayed recovery
  • Action items reveal the gaps — each action item is the inverse of a contributing factor
  • Blameless language — modern postmortems assign action items to teams, not individuals
  • Inference questions — use “What went well/poorly” sections to infer system behaviour
0 / 5 completed
1 / 5
Passage: Kubernetes Payment Outage Postmortem
Title: Postmortem: Payment Service Outage — 14 March 2024

Severity: SEV-1 | Duration: 47 minutes | Impact: 100% of checkout requests failed

Timeline (all times UTC):
  09:14 — Alerting fires: checkout error rate crosses 5% threshold.
  09:17 — On-call engineer pages payment-team lead.
  09:22 — Team identifies that payment-pod replicas have dropped from 12 to 0.
  09:29 — Root cause identified: a misconfigured PodDisruptionBudget (PDB) prevented
           the autoscaler from terminating old pods during the rolling update.
           New pods could not be scheduled because old pods refused to be evicted.
  09:41 — Temporary fix applied: PDB patched to allow evictions.
  09:44 — Pod replicas restored to 12; checkout error rate returns to baseline.
  10:01 — Full postmortem call begins.

Root Cause:
  The Kubernetes PodDisruptionBudget for the payment service set minAvailable: 12,
  equal to the maximum replica count. When the rolling update attempted to terminate
  one old pod, the PDB blocked it because doing so would drop below minAvailable.
  The update stalled; eventually all new pods entered Pending state, and the old
  pods — unable to be terminated — began to fail health checks independently.

Contributing Factors:
  1. No staging environment for PDB changes — the misconfiguration was not caught
     before reaching production.
  2. Alert threshold (5% error rate) was too conservative; by the time it fired,
     impact was already total.
  3. No runbook for "rolling update stalled" scenario — diagnosis took 7 minutes
     longer than it should have.

Action Items:
  [ ] Enforce PDB review in the deployment checklist (Owner: DevOps, Due: 2024-03-28)
  [ ] Add "pods not ready" alert that fires earlier than error-rate alert (Owner: SRE team)
  [ ] Write rolling-update stall runbook (Owner: Platform team, Due: 2024-04-01)
According to the timeline, how long did it take the team to identify the root cause after the alert fired?