What is the incident commander role during a production incident?
Incident commander (IC): a process role, not a technical one. The IC does not investigate or fix; the IC coordinates. Responsibilities: declare the incident severity (SEV level); page additional responders as needed; run the communication bridge (Zoom/Slack); delegate explicitly ("you investigate the DB, you check dependencies, you update the status page"); time-box investigation efforts; make escalation decisions; track what has been tried; drive toward resolution; and hand off the IC role cleanly when needed. The role can rotate: the IC is whoever picks it up, not necessarily the most senior engineer. Separation of concerns: the IC manages process, the tech lead manages the technical investigation, and the communications manager updates stakeholders.
What do SEV levels (SEV1, SEV2, SEV3) indicate?
SEV levels: definitions vary by organization, but a typical scheme looks like this. SEV1: complete outage or data loss/corruption; all users affected. Immediate 24/7 response: IC paged, executives notified, customer communication required, all hands on deck. SEV2: a major feature unavailable or significant degradation affecting many users; core functionality impacted. Immediate 24/7 response. SEV3: a minor feature degraded, a workaround available, limited user impact. Response during business hours. SEV4: trivial issue, tracked in the backlog. Who can declare: the on-call engineer, a customer support escalation, or an automated alert. When in doubt, declare the higher severity; it is easier to downgrade later than to escalate late.
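The typical definitions above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Impact` fields, the `classify_severity` name, and the `in_doubt` flag are hypothetical, and a real organization would substitute its own criteria.

```python
from dataclasses import dataclass

# Ordered from most to least severe, matching the typical scheme above.
SEV_ORDER = ["SEV1", "SEV2", "SEV3", "SEV4"]


@dataclass(frozen=True)
class Impact:
    """Hypothetical impact summary filled in by whoever declares the incident."""
    complete_outage: bool = False
    data_loss_or_corruption: bool = False
    core_functionality_impacted: bool = False
    limited_user_impact: bool = False


def classify_severity(impact: Impact, in_doubt: bool = False) -> str:
    """Map an impact summary to a SEV level (assumed criteria, not a standard)."""
    if impact.complete_outage or impact.data_loss_or_corruption:
        level = 0  # SEV1
    elif impact.core_functionality_impacted:
        level = 1  # SEV2
    elif impact.limited_user_impact:
        level = 2  # SEV3
    else:
        level = 3  # SEV4: trivial, goes to the backlog
    # "When in doubt, declare higher": bump one level up, capped at SEV1.
    if in_doubt and level > 0:
        level -= 1
    return SEV_ORDER[level]
```

For example, `classify_severity(Impact(limited_user_impact=True), in_doubt=True)` returns `"SEV2"` rather than `"SEV3"`, reflecting the advice that downgrading later is cheaper than escalating late.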
What is a runbook and how does it help during incidents?
Runbook: a documented procedure for responding to a specific alert or failure. It reduces time-to-resolution and lets less experienced engineers respond effectively. Good runbooks are: Symptom-based: "Alert: high error rate on /api/orders" → the runbook for that specific alert. Diagnosable: "Check these metrics first, then check these logs". Actionable: "Run this command, check this output, if X do Y". Tested: verified during GameDays or drills. Maintained: updated after every incident that uses them. Link runbooks from alerts: every PagerDuty alert should link to the relevant runbook. Anti-pattern: a runbook that says "escalate to the payments team" without any diagnostic steps. Anyone on call should be able to do the initial diagnosis.
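The qualities listed above can be checked mechanically if runbooks are stored as structured data. The sketch below assumes a hypothetical `Runbook` record and `lint_runbook` helper; neither is part of PagerDuty or any other tool, and the field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Runbook:
    """Hypothetical structured runbook; field names are assumptions."""
    alert: str                                             # symptom it is linked from
    diagnostics: List[str] = field(default_factory=list)   # "check these metrics/logs"
    actions: List[str] = field(default_factory=list)       # "run this, if X do Y"
    escalation: Optional[str] = None                       # who to page if steps fail
    last_drill: Optional[str] = None                       # date last verified in a drill


def lint_runbook(rb: Runbook) -> List[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []
    if not rb.diagnostics:
        problems.append("no diagnostic steps")
    if not rb.actions:
        problems.append("no actionable steps")
    if rb.last_drill is None:
        problems.append("never tested in a drill")
    # The anti-pattern from the text: escalation as the only content.
    if rb.escalation and not rb.diagnostics and not rb.actions:
        problems.append("escalate-only runbook")
    return problems
```

A runbook such as `Runbook(alert="high error rate on /api/orders", escalation="payments team")` would be flagged as escalate-only, which is the anti-pattern described above.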
What is a postmortem (incident retrospective) and what makes it blameless?
Blameless postmortem: a practice popularized in software operations by Etsy and by the Google SRE book, drawing on "just culture" ideas from safety-critical industries. Principle: given the information and tools available at the time, engineers made reasonable decisions. The goal is to improve systems and processes, not to punish individuals. Postmortem structure: Summary: one-paragraph overview. Timeline: minute by minute, what was known and what actions were taken. Root causes (plural; the 5 Whys is one way to dig for them). Contributing factors: what made the incident worse or harder to detect. Action items: concrete, assigned, time-boxed improvements. Why blameless: if people fear blame, they hide information, and hidden information makes postmortems less accurate and repeat incidents more likely. The blameless framing is to ask "What would have to be true for a reasonable person to make this decision?"
What is an error budget and how does it relate to incident response?
Error budget: derived from the SLO. If your SLO is 99.9% availability (three nines), the error budget is the remaining 0.1%: about 8.76 hours of downtime per year (0.001 x 8,760 h), or about 43.8 minutes per month (8.76 h x 60 / 12). Every incident burns error budget. Uses: Deployment gate: if less than 10% of the month's budget remains, pause non-critical deployments and prioritize reliability over features. SRE/Dev negotiation tool: "we burned 80% of this month's budget on incidents. We need reliability work before new features." Toil reduction driver: incidents that consume budget motivate investment in automation. Error budget policy: a formal agreement defining actions at different burn levels, for example: burn rate above 2x the allowed rate → alert; budget exhausted → freeze deployments; repeated exhaustion → an OKR for reliability work against the SLO. Burn rate: how fast the error budget is being consumed relative to the rate that would spend it exactly over the SLO window (a burn rate of 1). A burn rate alert fires when that ratio crosses a chosen threshold.
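The arithmetic and the example policy thresholds above can be worked through in a short sketch. All function names here are hypothetical, the year is assumed to have 365 days, and the thresholds (10% remaining, 2x burn rate) are simply the ones quoted in the text rather than a recommendation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity


def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Allowed downtime in a window for an availability SLO such as 0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_minutes


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Ratio of observed to allowed error rate; 1.0 spends the budget exactly on pace."""
    return observed_error_ratio / (1.0 - slo)


def budget_remaining_fraction(downtime_minutes: float, slo: float,
                              window_minutes: float) -> float:
    """Fraction of the window's budget still unspent, floored at 0."""
    budget = error_budget_minutes(slo, window_minutes)
    return max(0.0, 1.0 - downtime_minutes / budget)


def policy_action(remaining_fraction: float, current_burn_rate: float) -> str:
    """Apply the example policy from the text, most severe condition first."""
    if remaining_fraction <= 0.0:
        return "freeze deployments"
    if remaining_fraction < 0.10:
        return "pause non-critical deployments"
    if current_burn_rate > 2.0:
        return "alert"
    return "normal"


if __name__ == "__main__":
    yearly = error_budget_minutes(0.999, MINUTES_PER_YEAR)
    print(f"yearly budget:  {yearly / 60:.2f} hours")
    print(f"monthly budget: {yearly / 12:.1f} minutes")
```

With a 99.9% SLO this reproduces the figures above: roughly 8.76 hours per year and 43.8 minutes per average month. An observed error ratio of 0.5% against that SLO gives a burn rate of about 5, which would trip the 2x alert in the example policy.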