5 exercises on Kubernetes outage and database incident postmortems. Navigate timelines, identify root causes vs. contributing factors, and read action items like a senior engineer.
Reading postmortems effectively
Timeline first — calculate TTD (time-to-diagnose) and TTM (time-to-mitigate) from timestamps
Root cause vs. contributing factors — root cause is the direct trigger; contributing factors worsened impact or delayed recovery
Action items reveal the gaps — each action item is the inverse of a contributing factor
Blameless language — modern postmortems assign action items to teams, not individuals
Inference questions — use “What went well/poorly” sections to infer system behaviour
0 / 5 completed
1 / 5
Passage: Kubernetes Payment Outage Postmortem
Title: Postmortem: Payment Service Outage — 14 March 2024
Severity: SEV-1 | Duration: 47 minutes | Impact: 100% of checkout requests failed
Timeline (all times UTC):
09:14 — Alerting fires: checkout error rate crosses 5% threshold.
09:17 — On-call engineer pages payment-team lead.
09:22 — Team identifies that payment-pod replicas have dropped from 12 to 0.
09:29 — Root cause identified: a misconfigured PodDisruptionBudget (PDB) prevented
the autoscaler from terminating old pods during the rolling update.
New pods could not be scheduled because old pods refused to be evicted.
09:41 — Temporary fix applied: PDB patched to allow evictions.
09:44 — Pod replicas restored to 12; checkout error rate returns to baseline.
10:01 — Full postmortem call begins.
Root Cause:
The Kubernetes PodDisruptionBudget for the payment service set minAvailable: 12,
equal to the maximum replica count. When the rolling update attempted to terminate
one old pod, the PDB blocked it because doing so would drop below minAvailable.
The update stalled; eventually all new pods entered Pending state, and the old
pods — unable to be terminated — began to fail health checks independently.
Contributing Factors:
1. No staging environment for PDB changes — the misconfiguration was not caught
before reaching production.
2. Alert threshold (5% error rate) was too conservative; by the time it fired,
impact was already total.
3. No runbook for "rolling update stalled" scenario — diagnosis took 7 minutes
longer than it should have.
Action Items:
[ ] Enforce PDB review in the deployment checklist (Owner: DevOps, Due: 2024-03-28)
[ ] Add "pods not ready" alert that fires earlier than error-rate alert (Owner: SRE team)
[ ] Write rolling-update stall runbook (Owner: Platform team, Due: 2024-04-01)
According to the timeline, how long did it take the team to identify the root cause after the alert fired?
15 minutes (09:14 to 09:29).
The alert fired at 09:14 and the root cause was identified at 09:29 — a difference of 15 minutes. The postmortem notes that "diagnosis took 7 minutes longer than it should have" due to the absence of a runbook, implying the optimal time would have been around 8 minutes.
Reading postmortem timelines:
Subtract the first timestamp from the discovery timestamp — that is the time-to-diagnose (TTD).
Subtract the discovery timestamp from the resolution timestamp — that is the time-to-mitigate (TTM).
The total outage duration (47 minutes) spans from the first customer impact, which may pre-date the alert.
Common trap: The 47-minute figure is the total outage duration, not the diagnosis time. The 7-minute figure is how much longer diagnosis took than expected — not the total diagnosis time.
2 / 5
Passage: Kubernetes Payment Outage Postmortem
Title: Postmortem: Payment Service Outage — 14 March 2024
Severity: SEV-1 | Duration: 47 minutes | Impact: 100% of checkout requests failed
Timeline (all times UTC):
09:14 — Alerting fires: checkout error rate crosses 5% threshold.
09:17 — On-call engineer pages payment-team lead.
09:22 — Team identifies that payment-pod replicas have dropped from 12 to 0.
09:29 — Root cause identified: a misconfigured PodDisruptionBudget (PDB) prevented
the autoscaler from terminating old pods during the rolling update.
New pods could not be scheduled because old pods refused to be evicted.
09:41 — Temporary fix applied: PDB patched to allow evictions.
09:44 — Pod replicas restored to 12; checkout error rate returns to baseline.
10:01 — Full postmortem call begins.
Root Cause:
The Kubernetes PodDisruptionBudget for the payment service set minAvailable: 12,
equal to the maximum replica count. When the rolling update attempted to terminate
one old pod, the PDB blocked it because doing so would drop below minAvailable.
The update stalled; eventually all new pods entered Pending state, and the old
pods — unable to be terminated — began to fail health checks independently.
Contributing Factors:
1. No staging environment for PDB changes — the misconfiguration was not caught
before reaching production.
2. Alert threshold (5% error rate) was too conservative; by the time it fired,
impact was already total.
3. No runbook for "rolling update stalled" scenario — diagnosis took 7 minutes
longer than it should have.
Action Items:
[ ] Enforce PDB review in the deployment checklist (Owner: DevOps, Due: 2024-03-28)
[ ] Add "pods not ready" alert that fires earlier than error-rate alert (Owner: SRE team)
[ ] Write rolling-update stall runbook (Owner: Platform team, Due: 2024-04-01)
What was the technical root cause of the payment service outage?
PodDisruptionBudget with minAvailable = max replicas.
The postmortem states: "The Kubernetes PodDisruptionBudget... set minAvailable: 12, equal to the maximum replica count. When the rolling update attempted to terminate one old pod, the PDB blocked it."
Why this is a classic misconfiguration: A PodDisruptionBudget (PDB) is a Kubernetes resource that limits how many pods of a deployment can be simultaneously unavailable. Setting minAvailable equal to your total replica count means zero pods can ever be disrupted — including during routine rolling updates. The fix is to set minAvailable to at most replicas - 1, so at least one pod can be terminated at a time.
Why the other options fail:
Option A inverts causality — the autoscaler could not terminate pods because the PDB blocked it.
Option C describes a downstream effect, not the root cause.
Option D is not mentioned anywhere in the passage.
3 / 5
Passage: Kubernetes Payment Outage Postmortem
Title: Postmortem: Payment Service Outage — 14 March 2024
Severity: SEV-1 | Duration: 47 minutes | Impact: 100% of checkout requests failed
Timeline (all times UTC):
09:14 — Alerting fires: checkout error rate crosses 5% threshold.
09:17 — On-call engineer pages payment-team lead.
09:22 — Team identifies that payment-pod replicas have dropped from 12 to 0.
09:29 — Root cause identified: a misconfigured PodDisruptionBudget (PDB) prevented
the autoscaler from terminating old pods during the rolling update.
New pods could not be scheduled because old pods refused to be evicted.
09:41 — Temporary fix applied: PDB patched to allow evictions.
09:44 — Pod replicas restored to 12; checkout error rate returns to baseline.
10:01 — Full postmortem call begins.
Root Cause:
The Kubernetes PodDisruptionBudget for the payment service set minAvailable: 12,
equal to the maximum replica count. When the rolling update attempted to terminate
one old pod, the PDB blocked it because doing so would drop below minAvailable.
The update stalled; eventually all new pods entered Pending state, and the old
pods — unable to be terminated — began to fail health checks independently.
Contributing Factors:
1. No staging environment for PDB changes — the misconfiguration was not caught
before reaching production.
2. Alert threshold (5% error rate) was too conservative; by the time it fired,
impact was already total.
3. No runbook for "rolling update stalled" scenario — diagnosis took 7 minutes
longer than it should have.
Action Items:
[ ] Enforce PDB review in the deployment checklist (Owner: DevOps, Due: 2024-03-28)
[ ] Add "pods not ready" alert that fires earlier than error-rate alert (Owner: SRE team)
[ ] Write rolling-update stall runbook (Owner: Platform team, Due: 2024-04-01)
The postmortem lists "Contributing Factors." What distinguishes a contributing factor from the root cause in this document?
Contributing factors worsened impact or delayed recovery without being the direct trigger.
In this postmortem, the root cause is the specific technical defect (misconfigured PDB) that triggered the outage. The contributing factors are systemic conditions that made it worse or harder to resolve:
No staging for PDB changes → misconfiguration reached production unchecked.
Conservative alert threshold → impact was total before engineers were paged.
No runbook for the scenario → diagnosis took 7 extra minutes.
None of the contributing factors caused the outage — the PDB misconfiguration did. But without them, the outage might have been prevented or recovered from faster.
Blameless postmortems: Modern postmortems avoid blaming individuals. Contributing factors and action items focus on systems and processes, not people. This is reflected in the action items being assigned to teams, not individuals.
4 / 5
Passage: DB Connection Pool Postmortem
Title: Postmortem: Database Connection Pool Exhaustion — 22 January 2024
Severity: SEV-2 | Duration: 23 minutes | Impact: ~30% of API requests returned 503
Root Cause:
A new background job introduced in the v2.14.0 release opened a database connection
per record processed and did not close connections on completion. Under the default
daily batch of 8,000 records, the connection pool (max: 100) was exhausted within
4 minutes of the batch starting.
What went well:
- Circuit breaker on the API gateway degraded gracefully, preventing a full outage.
- The on-call engineer identified the new deployment as a likely culprit within
5 minutes by correlating the alert timestamp with the deploy log.
What went poorly:
- Code review did not catch the missing connection.close() call.
- Load testing was conducted with a batch of only 50 records, which did not
expose the issue at production scale.
Action Items:
[ ] Add database connection leak detector to CI pipeline (Owner: Backend team)
[ ] Update load testing to use production-scale batch sizes (Owner: QA team)
[ ] Add connection pool saturation alert at 80% threshold (Owner: SRE team)
Under "What went well," the postmortem mentions a circuit breaker. What can you infer the circuit breaker did during this incident?
The circuit breaker degraded gracefully — serving some errors rather than total failure.
The postmortem says: "Circuit breaker on the API gateway degraded gracefully, preventing a full outage." The impact was "~30% of API requests returned 503" — not 100%. A circuit breaker in this context is a resilience pattern that "trips" when a downstream dependency (the database) is unhealthy, returning fast error responses (503 Service Unavailable) instead of queuing requests and timing out.
Circuit breaker pattern:
Closed — normal operation; requests flow through.
Open — dependency is unhealthy; requests are short-circuited to an error response immediately.
Half-open — test requests are allowed through to see if the dependency has recovered.
By returning 503s quickly rather than queuing all requests, the circuit breaker protected the other 70% of API traffic from being affected.
Options A and D describe actions not mentioned in the postmortem. Option B describes the database itself, not an API gateway pattern.
5 / 5
Passage: DB Connection Pool Postmortem
Title: Postmortem: Database Connection Pool Exhaustion — 22 January 2024
Severity: SEV-2 | Duration: 23 minutes | Impact: ~30% of API requests returned 503
Root Cause:
A new background job introduced in the v2.14.0 release opened a database connection
per record processed and did not close connections on completion. Under the default
daily batch of 8,000 records, the connection pool (max: 100) was exhausted within
4 minutes of the batch starting.
What went well:
- Circuit breaker on the API gateway degraded gracefully, preventing a full outage.
- The on-call engineer identified the new deployment as a likely culprit within
5 minutes by correlating the alert timestamp with the deploy log.
What went poorly:
- Code review did not catch the missing connection.close() call.
- Load testing was conducted with a batch of only 50 records, which did not
expose the issue at production scale.
Action Items:
[ ] Add database connection leak detector to CI pipeline (Owner: Backend team)
[ ] Update load testing to use production-scale batch sizes (Owner: QA team)
[ ] Add connection pool saturation alert at 80% threshold (Owner: SRE team)
Why did load testing fail to detect the connection leak before the v2.14.0 release?
The batch size used in testing (50 records) was far too small to exhaust the pool (max: 100 connections).
The postmortem states explicitly: "Load testing was conducted with a batch of only 50 records, which did not expose the issue at production scale." In production, the batch was 8,000 records — each opening a connection without closing it. At 50 records, the pool would absorb the leaked connections without becoming exhausted.
Why this is a common failure mode: Load testing at non-representative scale is one of the most frequent reasons pre-production testing misses production bugs. A key action item in this postmortem is: "Update load testing to use production-scale batch sizes."
Reading action items for root cause signals: Action items in a postmortem are often the inverse of the contributing factors — they tell you exactly what was wrong. "Update load testing to use production-scale batch sizes" directly confirms that the test used an unrepresentative scale.