🔍 Reading: Incident Postmortem
3 exercises — read a realistic incident postmortem blog post and answer questions about root cause analysis language, blameless culture vocabulary, scope statements, and action item accountability.
Postmortem reading essentials
- Root cause → the underlying condition, not the proximate trigger
- N+1 query → fetches N items then queries for each individually — catastrophic under load
- Blast radius → explicit scope: what broke, what didn't, who was affected
- Blameless → focus on system gaps, not individual fault
- Action items → must have an Owner + Due date, or they will not be completed
0 / 3 completed
1 / 3
Payment Processing Outage — Incident Postmortem
{ex.passage} The postmortem identifies an "N+1 query" as a contributing factor. Based on the postmortem's language, what does this term mean in context?
N+1 queries — one of the most common production performance anti-patterns
An N+1 query problem occurs when code fetches a list of N items, then executes one additional query for each item — resulting in N+1 total queries instead of 1 or 2 efficient queries.
Classic example:
Why it was invisible under normal traffic: The postmortem says "Under normal traffic, this caused no observable impact." At, say, 50 requests/second, 101 queries × 50 req/s = 5,050 queries/second — manageable. At 12× load: 60,600 queries/second — far beyond the connection pool capacity.
The "amplified by the marketing email" pattern: This is a classic incident type: a latent bug that only manifests under abnormal load. The bug existed for 38 minutes (13:45 to 14:23) before it caused any visible impact. This is why load testing and query count monitoring matter — you can't always reproduce production traffic patterns in development.
The fix: "Configuration rolled back" — the immediate mitigation was reverting the deployment. The action item to "implement query count alerting to catch N+1 regressions in staging" addresses the detection gap — ensuring future N+1s are caught before reaching production.
Vocabulary: "query amplification" is a broader term for situations where a small increase in load causes a disproportionately large increase in database operations.
An N+1 query problem occurs when code fetches a list of N items, then executes one additional query for each item — resulting in N+1 total queries instead of 1 or 2 efficient queries.
Classic example:
// Load 100 orders (1 query)
const orders = await db.query("SELECT * FROM orders LIMIT 100");
// For each order, load the user separately (100 more queries!)
for (const order of orders) {
order.user = await db.query(
"SELECT * FROM users WHERE id = ?", [order.user_id]
);
}
// Total: 101 queries instead of 1 JOINWhy it was invisible under normal traffic: The postmortem says "Under normal traffic, this caused no observable impact." At, say, 50 requests/second, 101 queries × 50 req/s = 5,050 queries/second — manageable. At 12× load: 60,600 queries/second — far beyond the connection pool capacity.
The "amplified by the marketing email" pattern: This is a classic incident type: a latent bug that only manifests under abnormal load. The bug existed for 38 minutes (13:45 to 14:23) before it caused any visible impact. This is why load testing and query count monitoring matter — you can't always reproduce production traffic patterns in development.
The fix: "Configuration rolled back" — the immediate mitigation was reverting the deployment. The action item to "implement query count alerting to catch N+1 regressions in staging" addresses the detection gap — ensuring future N+1s are caught before reaching production.
Vocabulary: "query amplification" is a broader term for situations where a small increase in load causes a disproportionately large increase in database operations.