On-Call Handoff in English: Language for SRE Shift Changes
English vocabulary and phrases for on-call shift handoffs: incident state, ongoing investigations, alerting threshold changes, known issues, and clear shift transition communication.
The on-call handoff is one of the highest-stakes communication moments in site reliability engineering. When one on-call engineer hands over to the next, incomplete or ambiguous communication can mean that a developing incident goes unnoticed, a known workaround is not applied, or the incoming engineer wastes critical time reconstructing context that the outgoing engineer already had. For non-native English speakers on international SRE teams, clear and precise handoff language is a safety-critical skill.
Key Vocabulary
Handoff: The formal transfer of on-call responsibility from one engineer to the next, including all relevant context about the current state of the systems and any active work.
“The handoff happens at 09:00 UTC. The outgoing on-call should have the handoff notes ready by 08:45.”
Incident state: A summary of any active or recently resolved incidents, covering their severity, current status, assigned owners, and next steps.
“Incident state: one active SEV-2 on the payments service. Root cause is identified, fix is in review.”
Ongoing investigation: Active work on a problem that has not yet been diagnosed or resolved. The incoming on-call needs to understand what is known, what has been tried, and what is still being looked into.
“There is an ongoing investigation into intermittent latency spikes on the search service. We have not identified the root cause. See the linked runbook for the diagnostics we’ve run so far.”
Alerting threshold: The value at which a monitoring alert fires. Changes to thresholds during a shift should be explicitly documented in the handoff so that the incoming engineer understands the current alert sensitivity.
“I temporarily raised the p99 latency alert threshold from 300ms to 500ms during the incident. It has not been reverted. Do not assume the current silence means all is normal.”
Known issue: A problem that has been identified but not yet resolved, often with a workaround in place and a ticket tracking the permanent fix.
“Known issue: the batch job occasionally fails on the third retry. The workaround is to manually re-trigger from the admin console. There is a ticket tracking the root cause fix.”
Noise: False-positive alerts that fire frequently without representing a real problem. This is important context for the incoming on-call, so that they do not spend time investigating non-issues.
“The disk-usage alert for the logging cluster has been noisy all week — it fires but auto-resolves. It is being investigated by the platform team. Do not page the team for this one.”
Escalation path: The sequence of contacts to reach if the on-call cannot resolve an issue, specifying who to call, in what order, and through which channel.
“If you cannot stabilise the payments service within 30 minutes, escalate to the on-call payments lead. Their contact details are in PagerDuty under ‘Payments Escalation’.”
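The alerting-threshold entry above asks for every threshold change made during a shift to be documented. One way to make that hard to forget is to record each change as it happens and generate the handoff sentence from the record. The following is a minimal sketch in Python; the class name ThresholdChange, its fields, and the second example entry are hypothetical illustrations, not part of any monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ThresholdChange:
    """One manual change to an alert threshold (hypothetical record format)."""
    alert: str
    original: str
    current: str
    reason: str
    reverted: bool = False


def unreverted_lines(changes: list[ThresholdChange]) -> list[str]:
    """Return one handoff sentence for each change that is still in effect."""
    return [
        f"{c.alert}: threshold changed from {c.original} to {c.current} "
        f"({c.reason}). It has not been reverted."
        for c in changes
        if not c.reverted
    ]


# Illustrative values only; the first entry mirrors the example sentence above.
changes = [
    ThresholdChange("p99 latency", "300ms", "500ms", "raised during the incident"),
    ThresholdChange("error rate", "1%", "2%", "raised during a deploy", reverted=True),
]
for line in unreverted_lines(changes):
    print(line)
```

Only the change that is still in effect is printed, so the incoming engineer sees exactly which alerts are currently less sensitive than usual.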
Useful Phrases
Opening a written handoff:
“Handoff notes for the shift starting 09:00 UTC 19 June 2026. Outgoing on-call: [name]. Incoming on-call: [name]. Overall system status: one active incident, one known issue, alerting is nominal except where noted below.”
Describing an active incident:
“Active incident: SEV-2 on the order processing service. Started at 06:12 UTC. Root cause: a misconfigured rate limit on the upstream payment provider API. A fix has been deployed to staging and is awaiting review. ETA for production deployment: approximately 45 minutes. I will remain available on Slack during the review.”
Flagging a non-standard alerting state:
“Note: I suppressed the ‘high memory usage’ alert on the ml-inference-02 node at 07:30 UTC. The node is running a scheduled model retraining job that is memory-intensive — this is expected and will complete by 10:00 UTC. You can re-enable the alert after 10:00.”
Summarising a quiet shift:
“Uneventful shift. No incidents. One alert fired at 03:15 UTC for elevated 5xx rate on the search API — it self-resolved within 4 minutes and did not require intervention. I’ve linked the graph in the handoff doc for reference.”
Highlighting what to watch:
“Watch the database connection pool on the primary read replica — utilisation has been trending upward all week and is currently sitting at 78%. It hasn’t breached the alert threshold but I’d keep an eye on it during the morning traffic peak.”
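The phrases above follow a fixed order: header, active incidents, non-standard alerting, and what to watch. Teams that write handoffs in a shared document sometimes generate the skeleton from a small template so that no section is silently skipped. The sketch below is one possible shape in Python, assuming a hypothetical HandoffNotes class; the names and the section titles are illustrative, not a standard.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffNotes:
    """Hypothetical container for the sections of a written handoff."""
    shift_start: str  # for example "09:00 UTC 19 June 2026"
    outgoing: str
    incoming: str
    incidents: list[str] = field(default_factory=list)
    alert_changes: list[str] = field(default_factory=list)
    watch_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Build plain-text notes; empty sections are stated, not omitted."""
        lines = [
            f"Handoff notes for the shift starting {self.shift_start}.",
            f"Outgoing on-call: {self.outgoing}. Incoming on-call: {self.incoming}.",
        ]
        sections = [
            ("Active incidents", self.incidents),
            ("Non-standard alerting", self.alert_changes),
            ("What to watch", self.watch_items),
        ]
        for title, items in sections:
            lines.append(f"{title}:")
            # An empty section is written as "None." so silence is explicit.
            lines.extend(f"  - {item}" for item in items or ["None."])
        return "\n".join(lines)


notes = HandoffNotes(
    shift_start="09:00 UTC 19 June 2026",
    outgoing="[name]",
    incoming="[name]",
    alert_changes=[
        "'high memory usage' alert on ml-inference-02 suppressed at 07:30 UTC; "
        "re-enable after 10:00 UTC."
    ],
)
print(notes.render())
```

Writing "None." for an empty section is a deliberate choice: the incoming engineer can tell that the section was considered, not forgotten.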
Common Mistakes
Assuming “no news is good news”: A common error in handoff communication is leaving out information because nothing dramatic happened. The incoming on-call needs to know about gradual trends, suppressed alerts, and manual interventions just as much as they need to know about active incidents. A handoff that says only “quiet shift, nothing to report” is incomplete if there are known issues or non-standard monitoring states.
Using imprecise time language: In a global SRE team, saying “this happened this morning” or “the alert fired last night” is ambiguous. Always use UTC timestamps in handoff notes: “The alert fired at 03:15 UTC” is unambiguous regardless of where the incoming engineer is based. Make it a habit to include the timezone explicitly even when it feels obvious.
Failing to separate observation from diagnosis: “The database is struggling because of a memory leak” presents an unconfirmed hypothesis as a fact. In handoff notes, clearly distinguish between what you have observed and what you believe the cause to be: “We are seeing elevated query times on the primary replica (observation). Our current hypothesis is a memory pressure issue, but this has not been confirmed (hypothesis).” This saves the incoming engineer from anchoring on the wrong diagnosis.
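The advice on UTC timestamps can be made mechanical when the times come from a dashboard or a script that reports local time. The sketch below uses only the Python standard library; the helper name utc_stamp and the UTC+2 office in the example are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def utc_stamp(moment: datetime) -> str:
    """Render a timezone-aware datetime as an explicit UTC timestamp."""
    if moment.tzinfo is None:
        # A naive datetime carries no timezone, so converting it would be a guess.
        raise ValueError("naive datetime: attach a timezone before converting")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


# An alert seen at 05:15 local time in a hypothetical UTC+2 office.
local = datetime(2026, 6, 19, 5, 15, tzinfo=timezone(timedelta(hours=2)))
print(utc_stamp(local))  # 2026-06-19 03:15 UTC
```

Rejecting naive datetimes mirrors the point of the paragraph on time language: a time without a stated timezone is ambiguous, so the helper refuses to guess.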
A well-written handoff is a form of documentation — it should give the incoming on-call everything they need to carry on without you in the room.