5 exercises — the vocabulary every SRE, DevOps, and backend engineer needs to respond to and communicate about production incidents: blast radius, postmortems, escalation, war rooms, and runbooks.
Core incident response vocabulary clusters
Impact terms: blast radius, scope of impact, SEV-1/2/3, affected users, degraded service
Process terms: triage, contain, investigate, mitigate, resolve, post-incident review
Roles: incident commander (IC), on-call engineer, communication lead, scribe
Metrics: MTTA (mean time to acknowledge), MTTR (mean time to resolve), error rate, SLA breach
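To make the two time-based metrics concrete, here is a minimal sketch of how MTTA and MTTR fall out of alert, acknowledgement, and resolution timestamps. The incident records are hypothetical examples, not data from the exercises.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when the alert fired, when a responder
# acknowledged it, and when the incident was resolved.
incidents = [
    {"alerted": datetime(2024, 3, 15, 9, 0),  "acked": datetime(2024, 3, 15, 9, 4),  "resolved": datetime(2024, 3, 15, 9, 42)},
    {"alerted": datetime(2024, 3, 18, 2, 30), "acked": datetime(2024, 3, 18, 2, 41), "resolved": datetime(2024, 3, 18, 4, 5)},
]

def mean_minutes(deltas: list[timedelta]) -> float:
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mtta = mean_minutes([i["acked"] - i["alerted"] for i in incidents])
mttr = mean_minutes([i["resolved"] - i["alerted"] for i in incidents])
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 7.5 min, MTTR: 68.5 min
```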
1 / 5
The incident commander says on a call: "We've identified the blast radius — it's only the payments service, the rest of the platform is operating normally. Let's contain it before we investigate root cause." What does blast radius mean in incident response?
Blast radius is borrowed from military/explosive terminology and means the scope of impact of a failure — which systems, services, or users are affected. Minimising blast radius is a core reliability engineering principle: design systems so a single failure can't cascade across your entire platform.

Common blast radius limitation techniques:
Bulkheads — isolate services so one failure doesn't exhaust shared resources.
Circuit breakers — stop calls to a failing service so failures don't propagate.
Cell-based architecture — route users to independent cells so an issue in Cell A doesn't affect Cell B.
Feature flags — toggle features off for subsets of users without a full deployment rollback.

In incident response: before fixing, you first contain (limit blast radius), then investigate (root cause), then fix, then learn (postmortem).
In conversation: "The canary deployment caught the bug early — blast radius was under 1% of traffic before we rolled back."
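As an illustration of one of these techniques, here is a minimal circuit-breaker sketch. The thresholds and the downstream call (`call_payments_service`) are hypothetical assumptions, not taken from any specific library.

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures so callers fail fast
    instead of piling more load onto a struggling dependency."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: skipping call to failing service")
            # Half-open: allow one trial call; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Usage (hypothetical downstream call):
# breaker = CircuitBreaker()
# breaker.call(call_payments_service, order_id=42)
```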
2 / 5
A postmortem document states: "After the incident was resolved, the team conducted a blameless postmortem, identifying four contributing factors and six action items to prevent recurrence." What does blameless postmortem mean?
A blameless postmortem (also called a learning review or post-incident review) is a structured analysis of what went wrong, designed to improve systems and processes rather than punish individuals. The philosophy: engineers operating in complex systems make rational decisions based on the information they had at the time. When things go wrong, the system — not the individual — usually created the conditions for failure. Pioneered by Google SRE and the DevOps movement.

Blameless postmortem structure:
Timeline — what happened and when (constructed from logs, alerts, chat history).
Impact — user/revenue/SLA impact.
Contributing factors — usually 3–5 systemic causes (not a single person's mistake).
What went well — detection speed, communication, mitigation effectiveness.
Action items — specific, assigned, time-bound improvements.

A blame culture leads to hiding mistakes and under-reporting incidents; a blameless culture promotes transparency and organisational learning.
In conversation: "The postmortem was blameless — we focused on why the monitoring didn't alert us earlier, not on who deployed the change."
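For concreteness, here is a minimal sketch of a postmortem skeleton following the structure above, written as a template an incident tooling script might use to seed a new doc. The headings, fields, and function name are illustrative, not any particular company's standard.

```python
# Minimal sketch: seed a new postmortem doc with the structure described above.
POSTMORTEM_TEMPLATE = """\
Postmortem: {title}
Severity: {severity}    Date: {date}    Status: draft

Timeline
(reconstruct from logs, alerts, and the incident channel)

Impact
(users affected, revenue/SLA impact, duration)

Contributing factors
(systemic causes, not individuals)

What went well
(detection, communication, mitigation)

Action items
(specific, assigned, time-bound; one owner and due date per item)
"""

def new_postmortem(title: str, severity: str, date: str) -> str:
    """Return a fresh postmortem doc pre-filled with the incident basics."""
    return POSTMORTEM_TEMPLATE.format(title=title, severity=severity, date=date)

print(new_postmortem("Payments latency spike", "SEV-2", "2024-03-15"))
```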
3 / 5
An SRE explains their on-call setup: "We use an escalation policy — if the primary on-call doesn't acknowledge the alert within five minutes, it automatically pages the secondary, and then the engineering manager." What is an escalation policy?
An escalation policy defines who gets paged, in what order, and after how long if an earlier responder doesn't acknowledge an alert. It ensures every alert gets a human response even if the primary on-call is asleep, unavailable, or overwhelmed.

Escalation policy components:
Acknowledgement timeout — how long before escalating (typically 5–15 min).
Escalation chain — primary → secondary → team lead → engineering manager.
Repeat interval — how often to re-page if unacknowledged.
Override schedule — holiday and weekend coverage rules.

Common tools: PagerDuty, Opsgenie, VictorOps (now Splunk On-Call).

Related vocabulary:
On-call rotation — the schedule defining who is primary on-call and when.
Acknowledgement (ACK) — the responder confirms they've seen and are handling the alert.
MTTA — Mean Time To Acknowledge.
MTTR — Mean Time To Resolve.
Runbook — step-by-step instructions for responding to a specific alert type.

In conversation: "We reduced MTTA from 12 minutes to 3 minutes after rewriting our escalation policy to reduce the acknowledgement window."
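A minimal sketch of the escalation logic described above. The `page` and `acknowledged` functions are hypothetical stand-ins for a real alerting tool such as PagerDuty or Opsgenie, which you would normally configure rather than hand-roll.

```python
import time

# Hypothetical escalation chain and acknowledgement timeout.
ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall", "team-lead", "eng-manager"]
ACK_TIMEOUT_SECONDS = 5 * 60

def page(responder: str, alert: str) -> None:
    print(f"Paging {responder} for alert: {alert}")  # stand-in for a real paging API

def acknowledged(alert: str) -> bool:
    return False  # stand-in: would ask the alerting system whether someone ACKed

def escalate(alert: str) -> None:
    """Page each responder in order until someone acknowledges the alert."""
    for responder in ESCALATION_CHAIN:
        page(responder, alert)
        deadline = time.monotonic() + ACK_TIMEOUT_SECONDS
        while time.monotonic() < deadline:
            if acknowledged(alert):
                return
            time.sleep(10)
    print(f"Escalation chain exhausted for alert: {alert}")

# escalate("database CPU > 90% for 5 minutes")
```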
4 / 5
During a live incident, the incident commander says: "Let's declare this a SEV-1. I'm setting up a war room — all non-essential responders please leave the channel. We need clean communication." What is a war room in incident response?
A war room is a dedicated, time-boxed communication space — a Slack channel, Zoom call, or physical room — where incident responders focus exclusively on resolving a major outage. Non-essential people are excluded to reduce noise and keep communication clear.

Incident communication best practices:
Dedicated channel — create a new incident channel (e.g., #inc-20240315-payments) to separate incident comms from general chat.
Incident commander (IC) — one person owns the incident, delegates tasks, and decides the resolution strategy.
Communication lead — a separate person drafts status page updates and internal stakeholder comms so the IC can focus on resolution.
Scribe — records timeline, decisions, and action items for the postmortem.
Status page — external-facing updates for affected customers (e.g., via Atlassian Statuspage).

SEV levels:
SEV-1 — critical, major user impact, all hands.
SEV-2 — significant impact.
SEV-3 — degraded performance, non-critical.
SEV-4 — minor issue, no user impact.

In conversation: "The war room stayed open for 4 hours until we confirmed all affected transactions had recovered."
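A small sketch tying two of these practices together: deriving a dedicated incident channel name (in the #inc-YYYYMMDD-service style used above) and listing the roles to staff when the war room opens. The function and role list are illustrative assumptions.

```python
from datetime import date

# Roles to assign when a war room is opened; lower-severity incidents may not need all of them.
ROLES = ["incident commander", "communication lead", "scribe"]

def incident_channel(service: str, on: date | None = None) -> str:
    """Build a dedicated incident channel name like #inc-20240315-payments."""
    on = on or date.today()
    return f"#inc-{on:%Y%m%d}-{service}"

print(incident_channel("payments", date(2024, 3, 15)))  # -> #inc-20240315-payments
print("Roles to staff:", ", ".join(ROLES))
```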
5 / 5
An SRE writes in a runbook: "If the database CPU exceeds 90% for more than 5 minutes, execute a manual failover to the read replica. Document the exact time and promote this replica to primary." What is a runbook?
A runbook (also called an operations playbook or standard operating procedure / SOP) is a documented set of procedures for handling a specific operational scenario — typically a recurring incident type or scheduled operational task.

Runbook contents:
Alert trigger — what alert fires and what conditions caused it.
Diagnosis steps — how to confirm the issue and assess impact.
Resolution steps — numbered, specific actions to take.
Verification — how to confirm the issue is resolved.
Escalation — when to escalate, and to whom, if the runbook doesn't resolve the issue.

Runbook types:
Break-fix runbooks — reactive, for handling specific alert types.
Operational runbooks — for scheduled tasks (deployments, database maintenance, key rotation).
Disaster recovery runbooks — for major failure scenarios.

Runbooks are key to reducing MTTR and enabling on-call engineers to handle incidents in systems they didn't originally build.
In conversation: "The on-call engineer followed the database failover runbook and had us back up in 8 minutes, even though they'd never handled this incident type before."
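A minimal sketch of a break-fix runbook entry mirroring the contents listed above, using the database failover scenario from the question. The alert name, steps, and rendering function are illustrative, not a real team's runbook.

```python
# Minimal sketch of a break-fix runbook entry, mirroring the structure above.
RUNBOOK = {
    "alert": "db-cpu-high",
    "trigger": "database CPU > 90% for more than 5 minutes",
    "diagnosis": [
        "Confirm CPU on the primary via the monitoring dashboard",
        "Check for long-running queries or a traffic spike",
    ],
    "resolution": [
        "Execute a manual failover to the read replica",
        "Record the exact failover time in the incident channel",
        "Promote the replica to primary",
    ],
    "verification": ["Confirm CPU is back below threshold and the error rate is normal"],
    "escalation": "Page the database team lead if failover does not resolve the issue",
}

def print_runbook(runbook: dict) -> None:
    """Render the runbook so an on-call engineer can follow it step by step."""
    print(f"Alert: {runbook['alert']} ({runbook['trigger']})")
    for section in ("diagnosis", "resolution", "verification"):
        print(f"\n{section.title()}:")
        for i, step in enumerate(runbook[section], 1):
            print(f"  {i}. {step}")
    print(f"\nEscalation: {runbook['escalation']}")

print_runbook(RUNBOOK)
```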