Read 3 incident call transcripts — a payment outage, a database replication incident, and a post-incident handoff — then answer comprehension questions about the reasoning and decisions made under pressure.
How to follow an incident call in English
Who is the IC? The Incident Commander controls the call — identify their directives and decisions
Theory vs. fact: engineers distinguish confirmed facts from hypotheses — listen for "we've confirmed", "theory is", "likely"
Time pressure: incident calls are time-boxed — "five-minute clock", "in the next two minutes" signal urgency
Action items: every call ends with specific owners and deadlines — extract who does what by when
1 / 3
📄 Transcript
[Live incident bridge call — SEV-1. Payment service down. 11 minutes into the call.]
IC (Incident Commander): "Okay, let's get a status update from each team. Engineering lead — where are we?"
Lead: "We've confirmed the payment service is returning 503s across all regions. Started at 14:32 UTC. The service processes roughly 1,200 transactions per minute normally — all of that is failing. We've isolated it to the payment service; the API gateway and auth service are healthy."
IC: "Do we have a theory on root cause yet?"
Lead: "Two theories. Theory one: a configuration change was deployed at 14:28 — four minutes before the incident. It's a likely candidate. Theory two: the upstream payment provider had a reported incident starting at 14:25, which predates ours — so there could be a dependency failure. We're running diagnostics on both."
IC: "Can you isolate which one without waiting for the full diagnostic? We need to decide whether to roll back or wait."
Lead: "We can check the upstream provider's status page and run a direct health check against their endpoint. If their endpoint is returning errors, it's theory two and a rollback won't help. If it's healthy, we prioritise theory one and roll back the config. Can do in five minutes."
IC: "Do it. Comms lead — what's the customer-facing status?"
Comms: "Status page updated at 14:38 with 'investigating payment issues'. I need the engineering team's word on whether to escalate this to 'major outage' language — that triggers our enterprise customer email notification flow."
IC: "Hold on that escalation until we have a root cause theory confirmed. Engineering — five-minute clock starts now. Everyone else: silence on the channel unless you have new information."
What is the IC's decision-making priority at this point in the incident, and what specific diagnostic shortcut does the engineering lead propose?
The IC is running triage logic: before committing to an action (roll back vs. wait), get the five-minute diagnostic that distinguishes the two theories. The shortcut the engineering lead proposes is a direct health check against the upstream provider's endpoint: if it returns errors, a rollback won't help; if it's healthy, the config change becomes the prime suspect. Comms escalation is deliberately held until a root-cause theory is confirmed.
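A minimal sketch of that shortcut, for readers who want to see what the check might look like in code. The provider URL, the health endpoint, and the five-second timeout are illustrative assumptions, not details from the call:

```python
import requests  # assumes the 'requests' package is available

UPSTREAM_HEALTH_URL = "https://payments.example-provider.com/health"  # illustrative URL

def upstream_is_healthy(url=UPSTREAM_HEALTH_URL, timeout=5.0):
    """Return True if the upstream provider answers with a 2xx within the timeout."""
    try:
        response = requests.get(url, timeout=timeout)
        return 200 <= response.status_code < 300
    except requests.RequestException:
        # Connection errors and timeouts count as unhealthy.
        return False

if __name__ == "__main__":
    if upstream_is_healthy():
        print("Upstream healthy -> prioritise theory one: roll back the 14:28 config change.")
    else:
        print("Upstream failing -> theory two: dependency outage; a rollback won't help.")
```

If the probe fails, the outage is upstream (theory two) and rolling back the config change would not restore service; if it succeeds, the engineering lead prioritises theory one and rolls back.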
Incident management vocabulary from this call:
"bridge call" = a multi-party conference call used during incidents to coordinate response; named after telephone bridge technology
"SEV-1" = Severity 1 — the highest incident severity; typically means a critical service is completely down or a major business process is failing
"IC (Incident Commander)" = the single person responsible for coordinating the incident response; controls the call, assigns tasks, makes go/no-go decisions
"503" = HTTP 503 Service Unavailable — the server cannot handle the request, often due to being overloaded or down
"root cause" = the fundamental problem that caused the incident (vs. symptoms — the observable effects)
"roll back / rollback" = revert a recent change to a previous known-good state
"upstream dependency" = an external service that your service depends on; if it fails, your service may fail as a consequence
"status page" = a public-facing page that communicates service health and incident status to customers
"enterprise customer email notification flow" = an automated notification system triggered for paying enterprise customers during major incidents
Why "silence on the channel unless you have new information" matters: During incidents, bridge calls can devolve into noise. The IC explicitly controls the channel to prevent information overload and keep focus on the diagnostic task.
2 / 3
📄 Transcript
[Incident bridge call — SEV-2. Database replication lag causing read replica failures. 23 minutes into the call.]
DBA: "Okay, I have the diagnosis. The primary Postgres instance is at 94% disk. The WAL (write-ahead log) files are accumulating faster than they're being cleaned up. That's causing replication lag on the replicas — they're behind by 14 minutes now. The replicas are still serving reads, but they're stale."
IC: "What are our options?"
DBA: "Three options. One: manually trigger a WAL checkpoint and free up disk space — this can be done in about 3 minutes but it increases write load on the primary temporarily. Two: add a read replica with fresh volume and route traffic to it — 20 minutes to provision. Three: identify and kill the long-running transaction that's likely holding WAL files open — if one exists. I need to check."
IC: "Check now. How do I know if option three works before trying it?"
DBA: "I run a query against pg_stat_activity — if there's a transaction that's been open for more than 20 minutes, that's almost certainly the WAL blocker. I can see that in under a minute."
IC: "Do it. Customer impact right now?"
Eng: "We have read replicas serving our reporting and analytics endpoints. They're returning data that's 14 minutes old. No direct customer-facing write failures. But the analytics dashboard is showing stale metrics — customers who are actively monitoring their data will notice."
IC: "How long until the staleness is user-visible in a way that generates support tickets?"
Eng: "It already would be for real-time dashboard users. We probably have an hour before ticket volume spikes, based on our last replication lag incident."
IC: "Comms — draft a proactive status update: 'We are investigating a delay in real-time analytics data. Transaction data is not affected.' Send it in the next two minutes."
What decision framework does the IC use to manage the response, and what does "WAL" refer to in this context?
IC triage: run the cheapest diagnostic first (the pg_stat_activity check, under a minute) before committing to the 3-minute or 20-minute fix. The customer-impact assessment runs in parallel and is quantified (roughly an hour before ticket volume spikes). Comms are drafted proactively, before the spike. WAL is Postgres's write-ahead log: the files accumulating on the primary's disk and causing the replication lag.
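To make the "cheapest diagnostic first" step concrete, here is a hedged sketch of the kind of check the DBA describes. pg_stat_activity and its columns are standard Postgres; the connection-string variable and the 20-minute threshold are assumptions for illustration:

```python
import os
import psycopg2  # assumes psycopg2 is installed and PRIMARY_DSN is set in the environment

# Transactions open for more than 20 minutes: the pattern the DBA calls
# "almost certainly the WAL blocker".
LONG_RUNNING_SQL = """
    SELECT pid,
           usename,
           state,
           now() - xact_start AS xact_age,
           left(query, 80)    AS query_snippet
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > interval '20 minutes'
    ORDER BY xact_start;
"""

def find_long_running_transactions(dsn):
    """Return one row per transaction that has been open longer than 20 minutes."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(LONG_RUNNING_SQL)
            return cur.fetchall()

if __name__ == "__main__":
    for row in find_long_running_transactions(os.environ["PRIMARY_DSN"]):
        print(row)
```

A single transaction in the output that has been open far longer than expected is the signal the DBA is looking for before committing to option three.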
Database incident vocabulary:
"WAL (Write-Ahead Log)" = Postgres's durability mechanism: every change is written to the WAL before being applied to data files; used for crash recovery and streaming replication to replicas
"replication lag" = the delay between a write on the primary database and its availability on read replicas; in this case: 14 minutes
"WAL checkpoint" = a point at which all dirty pages (modified but not yet written to data files) are flushed to disk and WAL can be cleaned up
"pg_stat_activity" = a Postgres system view showing running queries and transactions, including their state and duration
"long-running transaction" = a transaction that's been open far longer than expected; these block WAL cleanup because the WAL can't be discarded while a transaction that might need it is still open
"read replica" = a copy of the primary database that serves read-only queries; reduces load on the primary and improves read scalability
"stale data" = data that is out-of-date due to replication lag; not wrong, just behind the primary
Incident command discipline shown: The IC never lets time go unquantified — "how long before tickets spike?" turns an abstract impact into a concrete deadline (1 hour), which drives the comms decision to act proactively now.
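For reference, the "14 minutes behind" figure can be measured directly on a replica. A minimal sketch, assuming psycopg2 and a replica connection string in an environment variable (both illustrative); pg_last_xact_replay_timestamp() is a standard Postgres function on standby servers:

```python
import os
import psycopg2  # assumes psycopg2 is installed and REPLICA_DSN is set in the environment

# On a standby, pg_last_xact_replay_timestamp() is the commit time of the last
# replayed transaction; the gap to now() approximates replication lag.
# (It over-reads if the primary has been idle, so treat it as an estimate.)
LAG_SQL = "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"

with psycopg2.connect(os.environ["REPLICA_DSN"]) as conn:
    with conn.cursor() as cur:
        cur.execute(LAG_SQL)
        print("replica is behind by:", cur.fetchone()[0])
```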
3 / 3
📄 Transcript
[Post-incident bridge call — SEV-1 resolved. Transition to post-mortem handoff. Call between IC, on-call engineer, and team leads.]
IC: "Alright, the service is healthy, we've been green for 35 minutes. I'm declaring the incident resolved at 17:42 UTC. Total duration: 3 hours 10 minutes. Let me do a handoff brief for the post-mortem.
Timeline: at 14:32 UTC the payment service began returning 503s. Root cause confirmed: the upstream payment provider had an API gateway outage. Our service had insufficient circuit breaker configuration — we had no fallback and no timeout, so we cascaded. We mitigated by routing 60% of traffic to the secondary provider at 15:48 UTC using a feature flag. Full recovery at 17:42.
Impact: 1,847 transactions failed. No data corruption. Customers were notified at 14:56 UTC — 24 minutes after incident start. Enterprise customers received individual emails at 15:05 UTC.
For the post-mortem, I'm flagging three areas: one, why we had no circuit breaker on the upstream dependency — this was known technical debt; two, why the detection-to-notification gap was 24 minutes — our alerting fired within 3 minutes but the comms workflow took 21; three, the feature flag for secondary provider routing worked, but we didn't have runbook documentation — the engineer who executed it improvised from memory.
The post-mortem draft is due in 48 hours. Owner is Priya. I'll send the incident summary email to stakeholders in the next 30 minutes."
What are the three improvement areas the IC identifies, and what does "cascaded" mean in the context of this incident?
Three post-mortem flags: missing circuit breaker on the upstream dependency (technical gap), the 21-minute comms delay (process gap), and no runbook for the secondary-provider routing (operational gap). Cascaded = the upstream provider's failure propagated into the payment service because there was no timeout or fallback to isolate it.
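Since the missing circuit breaker is the central technical flag, here is a minimal sketch of the pattern the vocabulary below defines. The thresholds, timings, and the secondary-provider fallback are illustrative assumptions, not the team's actual configuration:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency after repeated errors; serve a fallback instead."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before the breaker opens
        self.reset_timeout = reset_timeout          # seconds to wait before allowing a trial call
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed (normal operation)

    def call(self, func, fallback):
        # Open circuit: skip the failing dependency until the reset timeout has passed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0  # a success closes the circuit again
        return result


# Illustrative usage: charge via the primary provider, fall back to the secondary.
# charge_primary and charge_secondary are hypothetical functions, not the team's API.
breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30.0)
# result = breaker.call(charge_primary, charge_secondary)
```

With something like this in place, repeated failures from the primary provider would trip the breaker and requests would flow to the fallback (for example, the secondary provider) instead of cascading into 503s.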
Post-incident vocabulary:
"declaring the incident resolved" = the IC officially closes the incident; requires the service to be stable for a defined period (here: 35 minutes green)
"circuit breaker" = a software pattern (from electrical engineering) that stops calls to a failing dependency after threshold failures, returns a fallback, and periodically retries — prevents cascade failures
"cascaded / cascade failure" = a failure in one component propagating to dependent components; when service A fails and service B has no fallback, B also fails — the failure "cascades" downstream
"feature flag" = a configuration switch that enables/disables a feature without a code deployment; used here to route traffic to a secondary provider
"runbook" = documented step-by-step procedures for common operational tasks; the lack of a runbook here means the routing procedure was not documented — risk of human error or inability to execute without the specific engineer
"detection-to-notification gap" = time between the alert firing and customers being informed; 24 minutes total = 3 min alerting + 21 min comms workflow
"post-mortem" = a structured retrospective of an incident: timeline, root cause, impact, and action items — written within 48 hours typically
"technical debt" = known shortcomings that were deferred; "known technical debt" = the team knew the circuit breaker was missing and hadn't yet addressed it
Why the handoff brief structure matters: The IC separates what happened (confirmed timeline + root cause) from what to improve (flagged items). This prevents the post-mortem from re-litigating facts and focuses it on action items.