What does "rollback" mean?

A rollback reverts software or infrastructure to a previously working version. It's a fast mitigation strategy when a bad deploy causes an incident.

What is "connection pool exhaustion"?

Connection pool exhaustion happens when every connection in the database pool is in use, so new requests have to wait for one to be released. If none frees up in time, the waiting requests time out and fail.
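To make the failure mode concrete, here is a minimal sketch of a fixed-size pool built on Python's standard queue module. The ConnectionPool class, the pool size of 2, and the "conn-N" strings are all illustrative assumptions, not a real database driver.

```python
import queue


class ConnectionPool:
    """Toy fixed-size pool; strings stand in for real database connections."""

    def __init__(self, size: int) -> None:
        self._free: queue.Queue = queue.Queue(maxsize=size)
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout: float) -> str:
        # Blocks until a connection is free; raises queue.Empty after `timeout` seconds.
        return self._free.get(timeout=timeout)

    def release(self, conn: str) -> None:
        self._free.put(conn)


pool = ConnectionPool(size=2)
held = [pool.acquire(timeout=0.1), pool.acquire(timeout=0.1)]  # pool is now fully in use

try:
    pool.acquire(timeout=0.1)  # a third request has to wait for a free connection
    outcome = "acquired"
except queue.Empty:
    outcome = "timed out"  # nothing was released in time, so the request fails

pool.release(held.pop())  # returning a connection ends the exhaustion
recovered = pool.acquire(timeout=0.1)
```

In this sketch the third request times out because both connections are still held, and the next request succeeds as soon as one is released. Real pools usually expose a similar wait timeout as a setting.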

How often should you send status updates during an incident?

Every 15–30 minutes during an active P1/P2 incident, even if there's nothing new to report. Stakeholders tend to read silence as a loss of control.

What does "MTTR" stand for?

MTTR stands for Mean Time To Recovery (or Repair): the average time from incident declaration to resolution. It is a key SRE metric.
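As a worked example, the average can be computed from a list of declaration and resolution timestamps. The three incidents below are invented sample data, and mean_time_to_recovery is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (declared_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 5, 1, 14, 0), datetime(2024, 5, 1, 14, 52)),     # 52 minutes
    (datetime(2024, 5, 9, 9, 0), datetime(2024, 5, 9, 9, 38)),       # 38 minutes
    (datetime(2024, 5, 20, 22, 10), datetime(2024, 5, 20, 23, 40)),  # 90 minutes
]


def mean_time_to_recovery(log):
    """Average the declaration-to-resolution durations."""
    total = sum((resolved - declared for declared, resolved in log), timedelta())
    return total / len(log)


mttr = mean_time_to_recovery(incidents)
mttr_minutes = mttr.total_seconds() / 60  # 60.0 for the sample data above
```

Teams differ on where the clock starts (detection, declaration, or first customer impact), so it helps to state which definition a reported MTTR uses.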

What is a "runbook"?

A runbook is a documented set of procedures for handling a known operational scenario, such as restarting a service, rotating credentials, or responding to an alert.

What does "on-call rotation" mean?

An on-call rotation is a schedule where engineers take turns being responsible for responding to alerts and incidents outside business hours.

What is "error budget" in SRE?

An error budget is the allowable amount of downtime or errors defined by your SLO. If you're within budget, you can ship freely. If you've spent it, you focus on reliability.
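The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability SLO of 99.9% measured over a 30-day window. Both figures, the function name, and the "spent" value are illustrative assumptions.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Downtime allowed by an availability SLO over a window, in minutes."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


budget = error_budget_minutes(slo=0.999, window_days=30)  # about 43.2 minutes
spent = 31.0                                              # hypothetical downtime so far
remaining = budget - spent                                # about 12.2 minutes left
can_ship_freely = remaining > 0
```

With these numbers roughly 12 minutes of budget remain, so the team is still within budget. Once remaining reaches zero, the focus shifts to reliability work.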

English Phrases for Incident Response — IT Professional's Guide

Declaring and Coordinating

We're seeing elevated error rates on [service] — declaring a P[1/2] incident.

Clear declaration with severity level

"We're seeing elevated error rates on the payment service — declaring a P1 incident."

I'm taking incident command — [name], can you take notes?

Assigning the IC and scribe roles

"I'm taking incident command — Sarah, can you take notes in the incident doc?"

What's the blast radius?

Asking how many users or systems are affected

"What's the blast radius right now — are all users affected or just a subset?"

Don't make changes without announcing in the incident channel first.

Preventing uncoordinated changes during an incident

"Important: don't make changes without announcing in #incidents first — we need to track what's been tried."

Status Updates and Resolution

[Time] update: [impact]. We're investigating [hypothesis].

Timed status update template

"14:32 update: ~20% of checkout requests failing. We're investigating a potential DB connection pool exhaustion."

We've identified the root cause: [cause].

Root cause announcement

"We've identified the root cause: a bad deploy at 14:15 introduced a null dereference in the order service."

Rolling back to the previous version — ETA 5 minutes.

Announcing a rollback with timeline

"Rolling back to the previous version — ETA 5 minutes, monitoring for recovery."

We're monitoring — error rates are dropping. Will confirm resolution shortly.

Transitioning from mitigation to resolution

"We're monitoring — error rates are dropping from 18% to 4%. Will confirm full resolution shortly."

Incident resolved at [time]. A postmortem will follow within 48 hours.

All-clear announcement with postmortem commitment

"Incident resolved at 15:07. A blameless postmortem will follow within 48 hours."

Phrases to Avoid

These common phrasings undermine your professionalism. Here are better alternatives.

Avoid: "I think maybe it could be the database."

Better: "Current hypothesis: DB connection pool exhaustion — [metric] supports this."

Vague guesses in an incident waste time. State a hypothesis with supporting evidence.

Avoid: "Whose fault is this?"

Better: "What changed in the last hour that could explain this?"

Blame-seeking during an incident delays resolution and damages team culture. Focus on the timeline of changes.

Avoid: "It should be fine now."

Better: "Monitoring metrics — will confirm resolution in 10 minutes when we see sustained recovery."

"Should be fine" sets false expectations. Commit to a monitoring window before declaring resolution.

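The idea of a monitoring window can be sketched as a simple check: only call the incident resolved once the error rate has stayed under a threshold for a fixed number of consecutive samples. The 1% threshold, the 10-sample window, and the sample data are assumptions chosen for illustration.

```python
def sustained_recovery(error_rates, threshold, window):
    """True if the most recent `window` samples are all at or below `threshold`."""
    if len(error_rates) < window:
        return False  # not enough data yet to call it
    return all(rate <= threshold for rate in error_rates[-window:])


# Hypothetical per-minute error rates (percent) after a rollback.
samples = [18.0, 11.0, 4.0, 0.9, 0.6, 0.5, 0.4, 0.5, 0.4, 0.3, 0.4, 0.3, 0.3]

# Five minutes in: too few samples to confirm anything.
early = sustained_recovery(samples[:5], threshold=1.0, window=10)

# Thirteen minutes in: the last 10 samples are all at or below 1.0.
later = sustained_recovery(samples, threshold=1.0, window=10)
```

Here early is False and later is True, which matches the advice to wait out the full window before announcing resolution.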

Frequently Asked Questions

What is a P1 vs P2 incident?

Priority levels indicate severity. P1 (Priority 1) typically means a critical outage affecting all or many users. P2 is a significant degradation but with a workaround or partial impact. Teams define exact thresholds in their incident runbooks.

What is an incident commander?

The incident commander (IC) is the person coordinating the response — assigning roles, making decisions, and managing communication. They focus on coordination, not necessarily on fixing the issue themselves.

What is a blameless postmortem?

A blameless postmortem analyses what went wrong without assigning personal blame. It focuses on system and process failures, producing action items to prevent recurrence.
