How to Write a Database Incident Report in English

A complete guide for DBAs and data engineers: how to write a post-incident report (PIR) for database outages — structure, vocabulary, templates, and professional phrases.

A database incident report (also called a post-incident review or postmortem) is a written account of what went wrong, why it happened, and how to prevent it from recurring. Writing a clear, professional incident report is one of the most important communication skills for a DBA or data engineer. This guide walks through the full structure with vocabulary, templates, and ready-to-use phrases.


What Is a Database Incident Report?

A database incident report (also: post-incident review (PIR), postmortem, or root cause analysis (RCA)) is a document written after a database outage, performance degradation, or data integrity event.

Its goals are:

  1. Transparency — inform stakeholders about what happened
  2. Accountability — document the timeline and who did what
  3. Learning — identify the root cause
  4. Prevention — define action items to prevent recurrence

“We owe the stakeholders a PIR within 48 hours of this outage. The report should be factual, blameless, and include specific action items.”

Key principle: blameless culture. Good incident reports focus on systems, processes, and conditions — not on blaming individuals.


Incident Report Structure

Section 1: Incident Summary

Start with a high-level summary that anyone can read in 30 seconds.

Template:

## Incident Summary

**Incident ID**: INC-2026-047
**Date**: 2026-04-14
**Duration**: 2h 15m (09:43 UTC – 11:58 UTC)
**Severity**: P1 (Critical)
**Impact**: Orders database replica lag exceeded 4 hours; reporting dashboards
            displayed stale data for ~3.5 hours; 127 business users affected
**Status**: Resolved

Vocabulary:

Severity — the priority/impact classification of an incident. Common levels:

  • P1 / Critical — complete outage, major data loss, revenue impact
  • P2 / High — significant degradation or partial outage
  • P3 / Medium — degraded performance, minor feature unavailable
  • P4 / Low — cosmetic or minimal impact

Duration — total elapsed time from detection to resolution.

Impact — who was affected and how. Always quantify where possible: number of users, transactions, error rate, revenue.
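If you generate the summary block from monitoring data, the duration field can be computed rather than typed by hand. A minimal Python sketch, using the example incident's timestamps (the function name and timestamp format are illustrative, not part of any standard tooling):

```python
from datetime import datetime, timezone

def incident_duration(detected: str, resolved: str) -> str:
    """Format elapsed time between detection and resolution as 'Xh Ym'."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(detected, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(resolved, fmt).replace(tzinfo=timezone.utc)
    minutes = int((end - start).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60}m"

print(incident_duration("2026-04-14 09:43", "2026-04-14 11:58"))  # → 2h 15m
```

Computing the value from the same timestamps used in the timeline keeps the summary and the timeline consistent.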


Section 2: Timeline

A precise chronological account of the incident. Use UTC timestamps.

Template:

## Timeline (all times UTC)

| Time  | Event |
|-------|-------|
| 09:43 | Monitoring alert fires: replica lag > 60 seconds (threshold: 30s) |
| 09:47 | On-call DBA acknowledges alert |
| 09:52 | Investigation begins; replica I/O write latency spike identified |
| 10:05 | Root cause identified: a long-running analytical query blocked replication |
| 10:12 | Long-running query terminated |
| 10:30 | Replica lag begins decreasing |
| 11:58 | Replica lag returns to < 5 seconds; incident resolved |
| 13:00 | Stakeholder update sent |
| 14:30 | Incident report draft completed |

Language tips for the timeline:

  • Use passive or active voice consistently: “Alert triggered” or “Monitoring fired alert”
  • Be specific about what was detected, identified, implemented, and resolved — these are different events
  • Include communication events: “Stakeholders notified”, “Incident bridge opened”
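When incident tooling exports events as structured data, the timeline table can be generated so rows always appear in chronological order. A minimal sketch; the helper name and input shape are assumptions for illustration:

```python
def timeline_table(events: list[tuple[str, str]]) -> str:
    """Render (UTC time, event) pairs as a markdown table, sorted chronologically."""
    rows = ["| Time  | Event |", "|-------|-------|"]
    for time, event in sorted(events):  # HH:MM strings sort chronologically
        rows.append(f"| {time} | {event} |")
    return "\n".join(rows)

print(timeline_table([
    ("09:47", "On-call DBA acknowledges alert"),
    ("09:43", "Monitoring alert fires: replica lag > 60 seconds"),
]))
```

Sorting on entry means events added later (a forgotten stakeholder update, for example) slot into the right place automatically.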

Section 3: Root Cause Analysis

This is the analytical core of the report. Explain what caused the incident and why.

Root Cause Analysis methods:

5 Whys — ask “Why?” repeatedly until you reach the systemic root cause:

The orders database replica was delayed →
  Why? A long-running analytical query held a table lock →
    Why? The query ran on the primary instead of the read replica →
      Why? The ETL job was misconfigured to use the primary connection string →
        Why? The ETL job configuration wasn't reviewed during the recent
             infrastructure migration
Root cause: Missing configuration review checklist for infrastructure migrations
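The 5 Whys chain above is easy to keep as structured data alongside the report, with the convention that the last answer in the chain is the root cause. A small illustrative sketch:

```python
def five_whys(problem: str, answers: list[str]) -> str:
    """Print a 5 Whys chain; by convention the final answer is the root cause."""
    print(problem)
    for depth, answer in enumerate(answers, start=1):
        print(f"{'  ' * depth}Why? {answer}")
    return answers[-1]

root_cause = five_whys(
    "The orders database replica was delayed",
    [
        "A long-running analytical query held a table lock",
        "The query ran on the primary instead of the read replica",
        "The ETL job was misconfigured to use the primary connection string",
        "The ETL job configuration wasn't reviewed during the migration",
    ],
)
```

Storing the chain as data also makes it simple to check that the analysis went deeper than one or two "whys".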

Template:

## Root Cause Analysis

**Immediate cause**: A long-running analytical query acquired a table lock on the
primary database, blocking replication for 2+ hours.

**Contributing factors**:
1. The ETL pipeline was misconfigured to connect to the primary instead of the
   read replica after last month's infrastructure migration.
2. There was no alerting on ETL connection endpoints — the misconfiguration
   was not detected for 3 weeks.
3. The replica lag alert threshold (30s) was too high to catch the problem early.

**Root cause**: The infrastructure migration runbook did not include a step to
validate ETL connection configurations after a primary/replica endpoint change.

Section 4: Impact Assessment

Quantify the business impact as precisely as possible.

Template:

## Impact Assessment

**Data integrity**: No data was lost or corrupted. The replica contained stale
                   data for 3.5 hours; no mutations were made during this window.

**User impact**: 127 users of the reporting dashboard received data that was
               up to 4 hours out of date.

**Business impact**: Three scheduled order fulfilment reports were generated
                    with stale data. Manual re-generation was required post-recovery.
                    Estimated additional engineering time: 3 hours.

**Customer impact**: No external customer impact detected. Internal operations
                    teams were affected.

**Revenue impact**: None identified directly; estimated indirect operational cost:
                   €2,400 in engineer time.

Section 5: What Went Well

Include what worked — the systems, processes, and people that limited the impact. This builds a positive learning culture.

Template:

## What Went Well

- Monitoring detected the replica lag within 4 minutes of onset
- On-call DBA responded within 4 minutes of the alert
- Incident communication was clear — stakeholders received updates at
  T+20min, T+60min, and T+2h
- The read replica correctly isolated the impact — the primary was unaffected
  and write operations continued normally throughout

Section 6: What Could Be Improved

Honest assessment of gaps in process, tooling, or knowledge.

Template:

## What Could Be Improved

- ETL connection configuration was not validated after migration
- No alert existed for ETL job connection endpoint changes
- The replica lag alert threshold (30s) was too high to provide early warning
  before business impact — we were alerted after impact, not before
- The on-call runbook did not include steps for diagnosing replica lag due to
  lock contention specifically

Section 7: Action Items

Every incident report must end with specific, owned, time-bound action items.

Template:

## Action Items

| ID | Action | Owner | Priority | Due Date |
|----|--------|-------|----------|----------|
| AI-1 | Add ETL connection endpoint validation to migration runbook | Alex Chen | High | 2026-04-21 |
| AI-2 | Create alert for ETL job connection endpoint misconfigurations | Maria Lopez | High | 2026-04-21 |
| AI-3 | Reduce replica lag alert threshold from 30s to 10s for P1 escalation | Alex Chen | Medium | 2026-04-28 |
| AI-4 | Add lock contention section to on-call replica lag runbook | Alex Chen | Medium | 2026-05-05 |
| AI-5 | Review all ETL jobs for connection string correctness post-migration | Maria Lopez | High | 2026-04-18 |

Language for action items:

  • Use imperative form: “Add…”, “Create…”, “Reduce…”, “Review…”
  • Assign one owner per item — not a team
  • Set a specific due date — not “soon” or “next quarter”
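These three rules are mechanical enough to lint automatically before a report is published. A hedged sketch, where the field names and the list of accepted imperative verbs are assumptions for illustration:

```python
import re
from datetime import date

def validate_action_item(item: dict) -> list[str]:
    """Return a list of problems with an action item; an empty list means it passes."""
    problems = []
    # One named owner, not a team
    if not item.get("owner") or " team" in item["owner"].lower():
        problems.append("assign a single named owner, not a team")
    # A concrete ISO due date, not "soon" or "next quarter"
    try:
        date.fromisoformat(item.get("due", ""))
    except ValueError:
        problems.append("set a specific due date (YYYY-MM-DD)")
    # Imperative form (illustrative verb list)
    if not re.match(r"^(Add|Create|Reduce|Review|Fix|Update|Remove)\b",
                    item.get("action", "")):
        problems.append("start the action with an imperative verb")
    return problems

print(validate_action_item(
    {"action": "Add lock contention section to runbook",
     "owner": "DBA team", "due": "soon"}
))  # flags the team owner and the vague due date
```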

Useful Phrases

For the summary:

  • “This incident resulted in [X] minutes of degraded service for [Y] users.”
  • “No data loss or corruption occurred.”

For the root cause section:

  • “The immediate cause was… however, the root cause was…”
  • “This condition went undetected because…”
  • “A contributing factor was the absence of…”

For the lessons section:

  • “This incident revealed a gap in our…”
  • “We had not anticipated that…”
  • “The alert fired after the impact had already begun — we need earlier detection.”

Practice

Deepen your DBA communication vocabulary with the Database Administration exercise set and the DBA learning path.