Reading Postmortems & Incident Reports — English for Developers

1 / 5

Passage: Kubernetes Payment Outage Postmortem

Title: Postmortem: Payment Service Outage — 14 March 2024

Severity: SEV-1 | Duration: 47 minutes | Impact: 100% of checkout requests failed

Timeline (all times UTC):
09:14 — Alerting fires: checkout error rate crosses 5% threshold.
09:17 — On-call engineer pages payment-team lead.
09:22 — Team identifies that payment-pod replicas have dropped from 12 to 0.
09:29 — Root cause identified: a misconfigured PodDisruptionBudget (PDB) prevented
the autoscaler from terminating old pods during the rolling update.
New pods could not be scheduled because old pods refused to be evicted.
09:41 — Temporary fix applied: PDB patched to allow evictions.
09:44 — Pod replicas restored to 12; checkout error rate returns to baseline.
10:01 — Full postmortem call begins.

Root Cause:
The Kubernetes PodDisruptionBudget for the payment service set minAvailable: 12,
equal to the maximum replica count. When the rolling update attempted to terminate
one old pod, the PDB blocked it because doing so would drop below minAvailable.
The update stalled; eventually all new pods entered Pending state, and the old
pods — unable to be terminated — began to fail health checks independently.

Contributing Factors:
1. No staging environment for PDB changes — the misconfiguration was not caught
before reaching production.
2. Alert threshold (5% error rate) was too conservative; by the time it fired,
impact was already total.
3. No runbook for "rolling update stalled" scenario — diagnosis took 7 minutes
longer than it should have.

Action Items:
[ ] Enforce PDB review in the deployment checklist (Owner: DevOps, Due: 2024-03-28)
[ ] Add "pods not ready" alert that fires earlier than error-rate alert (Owner: SRE team)
[ ] Write rolling-update stall runbook (Owner: Platform team, Due: 2024-04-01)

According to the timeline, how long did it take the team to identify the root cause after the alert fired?

2 / 5

Passage: Kubernetes Payment Outage Postmortem

Title: Postmortem: Payment Service Outage — 14 March 2024

Severity: SEV-1 | Duration: 47 minutes | Impact: 100% of checkout requests failed

Timeline (all times UTC):
09:14 — Alerting fires: checkout error rate crosses 5% threshold.
09:17 — On-call engineer pages payment-team lead.
09:22 — Team identifies that payment-pod replicas have dropped from 12 to 0.
09:29 — Root cause identified: a misconfigured PodDisruptionBudget (PDB) prevented
the autoscaler from terminating old pods during the rolling update.
New pods could not be scheduled because old pods refused to be evicted.
09:41 — Temporary fix applied: PDB patched to allow evictions.
09:44 — Pod replicas restored to 12; checkout error rate returns to baseline.
10:01 — Full postmortem call begins.

Root Cause:
The Kubernetes PodDisruptionBudget for the payment service set minAvailable: 12,
equal to the maximum replica count. When the rolling update attempted to terminate
one old pod, the PDB blocked it because doing so would drop below minAvailable.
The update stalled; eventually all new pods entered Pending state, and the old
pods — unable to be terminated — began to fail health checks independently.

Contributing Factors:
1. No staging environment for PDB changes — the misconfiguration was not caught
before reaching production.
2. Alert threshold (5% error rate) was too conservative; by the time it fired,
impact was already total.
3. No runbook for "rolling update stalled" scenario — diagnosis took 7 minutes
longer than it should have.

Action Items:
[ ] Enforce PDB review in the deployment checklist (Owner: DevOps, Due: 2024-03-28)
[ ] Add "pods not ready" alert that fires earlier than error-rate alert (Owner: SRE team)
[ ] Write rolling-update stall runbook (Owner: Platform team, Due: 2024-04-01)

What was the technical root cause of the payment service outage?

3 / 5

Passage: Kubernetes Payment Outage Postmortem

Title: Postmortem: Payment Service Outage — 14 March 2024

Severity: SEV-1 | Duration: 47 minutes | Impact: 100% of checkout requests failed

Timeline (all times UTC):
09:14 — Alerting fires: checkout error rate crosses 5% threshold.
09:17 — On-call engineer pages payment-team lead.
09:22 — Team identifies that payment-pod replicas have dropped from 12 to 0.
09:29 — Root cause identified: a misconfigured PodDisruptionBudget (PDB) prevented
the autoscaler from terminating old pods during the rolling update.
New pods could not be scheduled because old pods refused to be evicted.
09:41 — Temporary fix applied: PDB patched to allow evictions.
09:44 — Pod replicas restored to 12; checkout error rate returns to baseline.
10:01 — Full postmortem call begins.

Root Cause:
The Kubernetes PodDisruptionBudget for the payment service set minAvailable: 12,
equal to the maximum replica count. When the rolling update attempted to terminate
one old pod, the PDB blocked it because doing so would drop below minAvailable.
The update stalled; eventually all new pods entered Pending state, and the old
pods — unable to be terminated — began to fail health checks independently.

Contributing Factors:
1. No staging environment for PDB changes — the misconfiguration was not caught
before reaching production.
2. Alert threshold (5% error rate) was too conservative; by the time it fired,
impact was already total.
3. No runbook for "rolling update stalled" scenario — diagnosis took 7 minutes
longer than it should have.

Action Items:
[ ] Enforce PDB review in the deployment checklist (Owner: DevOps, Due: 2024-03-28)
[ ] Add "pods not ready" alert that fires earlier than error-rate alert (Owner: SRE team)
[ ] Write rolling-update stall runbook (Owner: Platform team, Due: 2024-04-01)

The postmortem lists "Contributing Factors." What distinguishes a contributing factor from the root cause in this document?

4 / 5

Passage: DB Connection Pool Postmortem

Title: Postmortem: Database Connection Pool Exhaustion — 22 January 2024

Severity: SEV-2 | Duration: 23 minutes | Impact: ~30% of API requests returned 503

Root Cause:
  A new background job introduced in the v2.14.0 release opened a database connection
  per record processed and did not close connections on completion. Under the default
  daily batch of 8,000 records, the connection pool (max: 100) was exhausted within
  4 minutes of the batch starting.

What went well:
  - Circuit breaker on the API gateway degraded gracefully, preventing a full outage.
  - The on-call engineer identified the new deployment as a likely culprit within
    5 minutes by correlating the alert timestamp with the deploy log.

What went poorly:
  - Code review did not catch the missing connection.close() call.
  - Load testing was conducted with a batch of only 50 records, which did not
    expose the issue at production scale.

Action Items:
  [ ] Add database connection leak detector to CI pipeline (Owner: Backend team)
  [ ] Update load testing to use production-scale batch sizes (Owner: QA team)
  [ ] Add connection pool saturation alert at 80% threshold (Owner: SRE team)

Under "What went well," the postmortem mentions a circuit breaker. What can you infer the circuit breaker did during this incident?

5 / 5

Passage: DB Connection Pool Postmortem

Title: Postmortem: Database Connection Pool Exhaustion — 22 January 2024

Severity: SEV-2 | Duration: 23 minutes | Impact: ~30% of API requests returned 503

Root Cause:
  A new background job introduced in the v2.14.0 release opened a database connection
  per record processed and did not close connections on completion. Under the default
  daily batch of 8,000 records, the connection pool (max: 100) was exhausted within
  4 minutes of the batch starting.

What went well:
  - Circuit breaker on the API gateway degraded gracefully, preventing a full outage.
  - The on-call engineer identified the new deployment as a likely culprit within
    5 minutes by correlating the alert timestamp with the deploy log.

What went poorly:
  - Code review did not catch the missing connection.close() call.
  - Load testing was conducted with a batch of only 50 records, which did not
    expose the issue at production scale.

Action Items:
  [ ] Add database connection leak detector to CI pipeline (Owner: Backend team)
  [ ] Update load testing to use production-scale batch sizes (Owner: QA team)
  [ ] Add connection pool saturation alert at 80% threshold (Owner: SRE team)

Why did load testing fail to detect the connection leak before the v2.14.0 release?