How to Write a Clear Incident Post-Mortem

A practical guide to writing effective incident post-mortems in English — structure, language, blameless framing, and phrases that communicate clearly.

A post-mortem (also called an incident report or retrospective) is one of the most important documents an engineering team produces. Done well, it captures what happened, why it happened, and what will prevent it from happening again. Done poorly, it assigns blame, buries the real causes, and teaches nothing.

This guide covers the structure, language, and specific phrases that make post-mortems clear, honest, and useful.


The Blameless Principle in English

Before writing a single word, internalise this: a good post-mortem focuses on systems and processes, not on individuals. This is called blameless post-mortem culture.

In English, this means:

Blame language (avoid) | Blameless language (use)
“The engineer accidentally deleted the database.” | “The production database was deleted during a manual migration step.”
“DevOps failed to set up monitoring.” | “Monitoring was not configured for this service at the time of the incident.”
“The developer didn’t test this properly.” | “The change was merged without integration tests covering this code path.”

The passive voice is your friend in post-mortems. It describes what happened without pointing at who did it.


Standard Post-Mortem Structure

1. Incident Summary

A 2–4 sentence overview of what happened, when, and how severe it was:

“On 11 June 2026, the user authentication service was unavailable for approximately 47 minutes, affecting all users attempting to log in. The incident began at 14:23 UTC and was resolved at 15:10 UTC. Approximately 12,000 users were impacted.”

Keep it factual. No interpretation here — just the headline facts.

2. Timeline

A chronological log of events. Use the past simple for completed events and the past continuous for anything that was still in progress at that point:

“14:23 - Automated alerts fired for elevated error rates on the auth service.”
“14:31 - On-call engineer acknowledged the alert and began investigating.”
“14:45 - Root cause identified: a misconfigured environment variable in the latest deployment.”
“15:07 - Fix deployed to production.”
“15:10 - Error rates returned to normal. Incident resolved.”
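
The times in the timeline should agree with the duration quoted in the summary. A minimal Python sketch of that cross-check, using the example times above. The function name incident_duration_minutes is hypothetical, and the sketch assumes both times fall on the same UTC day:

```python
from datetime import datetime


def incident_duration_minutes(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta.total_seconds() < 0:
        # Assumption: incidents crossing midnight are out of scope here.
        raise ValueError("end time is earlier than start time")
    return int(delta.total_seconds() // 60)


# Times from the example timeline: alerts fired at 14:23, resolved at 15:10.
duration = incident_duration_minutes("14:23", "15:10")
print(duration)  # 47, matching "approximately 47 minutes" in the summary
```

The same function gives 8 minutes for 14:23 to 14:31, which is the kind of figure that can be quoted for response time.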

3. Root Cause Analysis

This is the most important section. Explain why the incident happened, not just what happened:

“The root cause was a missing environment variable in the production configuration. This variable had been added to the staging environment but was not included in the production deployment checklist.”

“The underlying cause was the absence of integration tests covering the interaction between the cache layer and the database. The bug was present in the code but not detectable through unit tests alone.”

Use the phrase “root cause” for the primary technical reason, and “contributing factors” for secondary issues:

“Contributing factors included: lack of feature flags for this release, insufficient monitoring on the new service, and time pressure from the quarterly deadline.”

4. Impact Assessment

“User-facing impact: login was unavailable for 47 minutes, affecting approximately 12,000 users in the EU region. No data was lost. No payment processing was affected.”

Be precise about scope: how many users, which regions, which features, and what data (if any) was affected.

5. What Went Well

Blameless post-mortems also capture what worked:

“The on-call rotation responded within 8 minutes of the initial alert. Incident communication was clear and timely — the status page was updated within 12 minutes. The rollback procedure worked as expected.”

6. Action Items

Each action item needs an owner and a deadline:

“Action: Add production environment variable validation to the deployment checklist. Owner: [Platform team]. Due: 20 June 2026.”

“Action: Create integration tests for the cache-database interaction. Owner: [Backend team]. Due: 30 June 2026.”

“Action: Add alerting for the auth service error rate threshold. Owner: [SRE team]. Due: 18 June 2026.”
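
If action items are tracked as structured data, the owner-and-deadline rule can be checked automatically before the post-mortem is published. The following is only a sketch: the ActionItem class and the missing_fields helper are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None


def missing_fields(item: ActionItem) -> list[str]:
    """List which of the required fields (owner, due) are absent."""
    missing = []
    if not item.owner or not item.owner.strip():
        missing.append("owner")
    if item.due is None:
        missing.append("due")
    return missing


items = [
    ActionItem(
        "Add production environment variable validation to the deployment checklist.",
        owner="Platform team",
        due=date(2026, 6, 20),
    ),
    # Deliberately incomplete, to show what the check reports.
    ActionItem(
        "Create integration tests for the cache-database interaction.",
        owner="Backend team",
    ),
]

for item in items:
    problems = missing_fields(item)
    if problems:
        print(f"Incomplete: {item.action} (missing: {', '.join(problems)})")
```

Here the first item passes and the second is reported as missing a due date.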


Useful Phrases for Post-Mortems

Describing the timeline:

“At approximately…”
“Shortly after…”
“Within minutes of…”
“The situation escalated when…”

Describing cause:

“The incident was triggered by…”
“The root cause has been identified as…”
“This was compounded by…”
“A contributing factor was…”

Describing resolution:

“The issue was resolved by reverting the deployment.”
“Normal service was restored at…”
“Monitoring confirmed recovery at…”

Describing future prevention:

“To prevent recurrence, we will…”
“Going forward, the team will…”
“We have identified the following gaps in our process…”


Tone and Length

A good post-mortem is:

  • Factual, not emotional. Stick to what happened and what will change.
  • Concise but complete. Cover every section, but don’t pad it. 600–1,000 words is typical.
  • Actionable. Every identified problem should have a corresponding action item.
  • Timely. Write it within 48–72 hours of the incident, while memory is fresh.

Avoid these phrases in post-mortems:

  • “Unfortunately” (adds emotion without adding facts)
  • “Luckily” (minimises the severity)
  • “Human error” (too vague and often blame-shifting in disguise)
  • “Never again” (overcommitment without a plan)
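
A draft can be scanned for these phrases before review. The sketch below is one simple way to do that in Python; the find_discouraged function and the sample draft sentence are made up for illustration, and a match only flags a phrase for a human to reconsider.

```python
import re

# The four phrases listed above.
DISCOURAGED = ("unfortunately", "luckily", "human error", "never again")


def find_discouraged(text: str) -> list[str]:
    """Return the discouraged phrases that occur in text, in list order."""
    hits = []
    for phrase in DISCOURAGED:
        # Word boundaries avoid matches inside longer words; case is ignored.
        if re.search(rf"\b{re.escape(phrase)}\b", text, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


draft = "Unfortunately, the outage was caused by human error."
for phrase in find_discouraged(draft):
    print(f"Consider rewording: '{phrase}'")
```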

Writing clear post-mortems is a sign of engineering maturity. They show that your team treats failures as learning opportunities, and that you communicate about problems with the same rigour you apply to writing code.