5 exercises on describing on-call rotations and incident timelines in professional IT English.
On-call vocabulary essentials
Primary on-call: first responder — gets paged first
Secondary on-call: escalation path if primary doesn't respond
Handoff: transfer of on-call responsibility between engineers
Paged: received an alert via PagerDuty, OpsGenie, or similar
1 / 5
You're the primary on-call engineer. An alert fires at 02:47 UTC. How do you describe this in the incident log?
Incident log entry — professional format
A professional incident log entry includes:
Exact UTC timestamps: 02:47 for the page, 02:49 for acknowledgement
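Time to acknowledge: 2 minutes, the gap between the page (02:47) and the acknowledgement (02:49). This interval feeds MTTA (mean time to acknowledge); MTTD measures the gap between the issue starting and its detection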
Initial assessment: what the alert showed, what service was affected
Why "around 3am" is insufficient: Incident timelines need precision. Post-mortems, SLA calculations, and on-call metrics all depend on accurate timestamps. "Around 3am" is impossible to use for MTTD or MTTR calculations.
Key incident timeline terms:
Paged: the moment the alert fired and the engineer was notified
Acknowledged: the engineer confirmed receipt of the page
Escalated: primary couldn't resolve; secondary or manager was engaged
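The acknowledgement gap in the log entry above can be computed directly from the two UTC timestamps. The following is a minimal sketch: the calendar date and the function name `time_to_acknowledge` are illustrative assumptions, not part of any paging tool's API.

```python
from datetime import datetime, timedelta, timezone


def time_to_acknowledge(paged_at: datetime, acknowledged_at: datetime) -> timedelta:
    """Return the gap between the page and the acknowledgement."""
    if acknowledged_at < paged_at:
        raise ValueError("acknowledgement cannot precede the page")
    return acknowledged_at - paged_at


# Hypothetical date; only the times 02:47 and 02:49 UTC come from the exercise.
paged = datetime(2024, 1, 15, 2, 47, tzinfo=timezone.utc)
acked = datetime(2024, 1, 15, 2, 49, tzinfo=timezone.utc)

gap = time_to_acknowledge(paged, acked)
print(f"Paged {paged:%H:%M} UTC, acknowledged {acked:%H:%M} UTC, gap {gap}")
```

Timezone-aware timestamps make the subtraction unambiguous, which is the same reason the log entry itself should state UTC explicitly.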
2 / 5
You're handing off on-call responsibility to a colleague at the end of your shift. Which handoff message is most professional?
On-call handoff — structured and complete
A professional handoff message gives the incoming engineer everything they need to be effective immediately:
Exact handoff time: 08:00 UTC — no ambiguity
Active issues: recent incidents and their current status
Pending items: the payment latency spike — resolved but root cause unknown
Open non-critical alerts: the Grafana alert that needs watching but isn't actionable yet
Resources: link to on-call doc
Contact: offer to follow up if needed
Why a sparse handoff is risky: The incoming engineer doesn't know what happened overnight. If the payment issue resurfaces and there are no handoff notes, they have no context and can lose 10–30 minutes reconstructing the history.
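One way to keep handoffs complete is to build the message from structured fields, so that nothing in the checklist above is forgotten. This is a sketch only: the `Handoff` class, the `render` function, and the example URL are hypothetical, and the wording loosely follows the handoff vocabulary in this exercise.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    time_utc: str                                          # exact handoff time, e.g. "08:00"
    active_items: list[str] = field(default_factory=list)  # incidents, pending items, alerts to watch
    doc_link: str = ""                                     # on-call doc URL (optional)


def render(handoff: Handoff) -> str:
    """Build a plain-text handoff message from the structured fields."""
    lines = [f"Handing off primary on-call at {handoff.time_utc} UTC."]
    if handoff.active_items:
        lines.append(f"Active items: {len(handoff.active_items)}. See notes below.")
        lines.extend(f"- {item}" for item in handoff.active_items)
    else:
        lines.append("All clear at handoff: no active incidents.")
    if handoff.doc_link:
        lines.append(f"On-call doc: {handoff.doc_link}")
    lines.append("Reach out if you need more context.")
    return "\n".join(lines)


# Items taken from the exercise; the URL is a placeholder, not a real resource.
message = render(Handoff(
    time_utc="08:00",
    active_items=[
        "Payment latency spike: resolved, root cause unknown",
        "Grafana alert: not actionable yet, needs watching",
    ],
    doc_link="https://example.com/on-call-doc",
))
print(message)
```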
Handoff vocabulary:
"Handing off primary on-call at [time] UTC."
"There are [N] active items — see notes below."
"All clear at handoff — no active incidents."
3 / 5
An incident post-mortem says: "MTTD: 14 minutes. MTTR: 2 hours 22 minutes." What do these metrics tell you?
MTTD and MTTR — distinct phases of incident duration
Incident lifecycle:
Issue starts (unknown to the team)
[MTTD: 14 minutes] Alert fires / issue detected
Team begins investigation and mitigation
[MTTR: 2h 22m from detection] Service restored
Total user-facing impact duration = MTTD + MTTR = 14 min + 2h 22m = 2h 36m
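The sum above can be checked with a few lines of Python. This sketch assumes, as the lifecycle does, that MTTR is measured from detection rather than from the start of the issue; the variable names are illustrative.

```python
from datetime import timedelta

mttd = timedelta(minutes=14)            # issue start -> detection
mttr = timedelta(hours=2, minutes=22)   # detection -> service restored

# Users are affected during both phases, so the durations add up.
total_impact = mttd + mttr
print(total_impact)  # 2:36:00
```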
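Why MTTD matters separately: A high MTTD indicates monitoring or alerting gaps, meaning the issue existed for a long time before anyone detected it. Because users are affected throughout the undetected period, reducing MTTD can cut total impact as much as reducing MTTR, and sometimes more.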
DORA thresholds for MTTR:
Tier      MTTR
Elite     Under 1 hour
High      1 hour to 1 day
Medium    1 day to 1 week
2 hours 22 minutes = High performer tier.
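The tier lookup can be expressed as a small function. This is a sketch under stated assumptions: the function name `mttr_tier` is hypothetical, the handling of exact boundary values is a choice made here, and the "Low" label for anything above one week is an assumption because the table above does not list that tier.

```python
from datetime import timedelta


def mttr_tier(mttr: timedelta) -> str:
    """Map an MTTR value to the tiers listed in the table above."""
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr <= timedelta(days=1):
        return "High"
    if mttr <= timedelta(weeks=1):
        return "Medium"
    return "Low"  # assumption: not listed in the table above


print(mttr_tier(timedelta(hours=2, minutes=22)))  # High
```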
Vocabulary:
"MTTD was [X] — the alert fired [X] minutes after the issue started."
"MTTR was [Y] — from detection to service restoration."
4 / 5
The on-call rotation cycles every week. An engineer says: "I'm on primary on-call from Monday 08:00 UTC to the following Monday 08:00 UTC." A new team member asks: "What happens if you don't respond to a page within 5 minutes?" What is the correct answer?
Escalation policy — how paging works in practice
Modern on-call systems (PagerDuty, OpsGenie, VictorOps) implement escalation policies:
Alert fires → Primary on-call is paged
If primary doesn't acknowledge in [N] minutes → Secondary on-call is paged
If secondary doesn't acknowledge → Engineering manager or escalation group is paged
Alert continues escalating until someone acknowledges
The point of escalation: Critical production issues cannot wait. If the primary is asleep with their phone on silent, the secondary ensures the incident isn't missed.
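The escalation steps above can be sketched as a small simulation. It is not how PagerDuty or any other tool is implemented: the function `escalate`, its parameters, and the 5-minute timeout are assumptions for illustration. As a simplification, the sketch stops when the chain is exhausted instead of repeating the policy.

```python
from typing import Optional


def escalate(
    chain: list[str],
    ack_delay_min: dict[str, Optional[int]],
    timeout_min: int,
) -> tuple[Optional[str], int]:
    """Walk the escalation chain; return (who acknowledged, minutes since the alert fired).

    ack_delay_min maps each responder to the minutes they take to acknowledge
    after being paged, or None if they never respond.
    """
    elapsed = 0
    for responder in chain:
        delay = ack_delay_min.get(responder)
        if delay is not None and delay <= timeout_min:
            return responder, elapsed + delay
        # No acknowledgement within the timeout: page the next level.
        elapsed += timeout_min
    return None, elapsed  # chain exhausted without an acknowledgement


who, minutes = escalate(
    chain=["primary", "secondary", "engineering manager"],
    ack_delay_min={"primary": None, "secondary": 3, "engineering manager": 1},
    timeout_min=5,  # assumed value for [N]
)
print(f"{who} acknowledged {minutes} minutes after the alert fired")
```

With a silent primary and a 5-minute timeout, the secondary is paged at minute 5 and acknowledges at minute 8, which shows how each unanswered level adds its full timeout to the response time.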
On-call responsibilities:
Acknowledge within the SLA window (typically 5–15 minutes)
Investigate and mitigate within the MTTR target
Escalate to secondary or specialists if needed
Document the incident timeline in the incident management system
Vocabulary:
"The escalation timeout is [N] minutes."
"The secondary on-call was paged after the primary didn't respond."
"I acknowledged the page at [time] and began investigation."
5 / 5
After resolving an incident, you write in the post-mortem: "Incident duration: 47 minutes (14:32 UTC – 15:19 UTC). Impact: ~12,000 users affected. SLA status: within budget (error budget consumed: 23%)." Which additional sentence best completes this summary?
Completing a post-mortem summary with error budget projection
The model sentence demonstrates SRE-mature thinking:
Connects the incident to the error budget: 23% consumed
Projects forward: how many similar incidents can the team absorb before breaching the SLO
Sets context for reliability investment decisions: at 23% per incident, about four such incidents would exhaust the budget, so the team knows to prioritise prevention
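The forward projection can be worked out from the 23% figure alone. The sketch below assumes every future incident costs the same share of the budget as this one, which real incidents rarely do; the function names are hypothetical.

```python
import math


def remaining_budget_pct(consumed_pct: float) -> float:
    """Share of the error budget still available, in percent."""
    return max(0.0, 100.0 - consumed_pct)


def similar_incidents_absorbable(remaining_pct: float, cost_pct: float) -> int:
    """How many more incidents of the same cost fit into the remaining budget."""
    if cost_pct <= 0:
        raise ValueError("incident cost must be positive")
    return math.floor(remaining_pct / cost_pct)


consumed = 23.0  # percent of the budget consumed by this incident (from the exercise)
remaining = remaining_budget_pct(consumed)
more = similar_incidents_absorbable(remaining, consumed)
print(f"Remaining error budget: {remaining:.0f}%; room for {more} more similar incidents")
```

Three more incidents of this size would bring consumption to 92%, and a further one would exceed the budget.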
Why "we were lucky" (A) is unprofessional: Post-mortems focus on systemic factors, not luck. Luck-based framing suggests the outcome was random rather than the result of system properties that can be improved.
Why "we need to be more careful" (D) is insufficient: "More careful" describes an individual behaviour, not a systemic change. Post-mortems should identify specific process, tooling, or monitoring improvements.
Error budget tracking vocabulary:
"This incident consumed [X%] of the monthly error budget."
"Remaining error budget: [X%] — [Y minutes] of allowed downtime."
"If the current incident rate continues, the budget will be exhausted by [date]."