This set builds vocabulary for on-call rotations, escalation, and incident notification.
1 / 5
At standup, a dev mentions being the designated engineer responsible for responding to production alerts during a scheduled rotation. What is this responsibility called?
Being on-call means an engineer is designated to respond to production incidents and alerts during a specific scheduled period, typically as part of a rotating roster shared across a team. This ensures there's always a clear point of responsibility for responding to urgent issues. On-call scheduling is a core practice in maintaining reliable production systems.
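To make the rotating roster concrete, here is a minimal sketch of how a fixed-length rotation could map a calendar date to the engineer on call. The function name `on_call_for`, the sample roster, the start date, and the one-week shift length are all hypothetical choices for illustration; real scheduling tools also handle overrides, time zones, and swaps.

```python
from datetime import date

def on_call_for(day: date, roster: list[str], start: date, shift_days: int = 7) -> str:
    """Return the roster member on call on `day` for a fixed-length rotation.

    Assumes shifts are `shift_days` long, begin at `start`, and cycle
    through `roster` in order (all hypothetical simplifications).
    """
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days  # how many full shifts have elapsed
    return roster[shift_index % len(roster)]        # wrap around the roster

# Hypothetical roster and start date, for illustration only.
roster = ["alice", "bob", "carol"]
start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), roster, start))
```

Because the result depends only on the date, anyone on the team can work out who is responsible at a given moment without asking around.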
2 / 5
During a design review, the team configures a rule so an unacknowledged critical alert automatically pages a secondary responder after a set delay. What is this mechanism called?
An escalation policy defines what happens if an alert isn't acknowledged within a set time, such as automatically paging a secondary responder or the next level of the on-call chain. This ensures a critical issue doesn't go unaddressed if the primary responder is unavailable. Well-designed escalation policies are essential for reliable incident response.
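An escalation policy can be pictured as an ordered list of steps, each with a delay. The sketch below is an assumption-laden illustration, not the configuration format of any particular paging tool: `EscalationStep`, `responders_to_page`, the responder names, and the delays are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    responder: str
    delay_minutes: int  # minutes after the alert fires before this step is paged

def responders_to_page(steps: list[EscalationStep],
                       minutes_since_alert: int,
                       acknowledged: bool) -> list[str]:
    """Return everyone who should have been paged by now.

    Once the alert is acknowledged, escalation stops and nobody
    further is paged.
    """
    if acknowledged:
        return []
    return [s.responder for s in steps if minutes_since_alert >= s.delay_minutes]

# Hypothetical policy: page the primary immediately, the secondary after
# 15 unacknowledged minutes, and the engineering manager after 30.
policy = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 15),
    EscalationStep("manager", 30),
]
print(responders_to_page(policy, minutes_since_alert=20, acknowledged=False))
```

The acknowledgement check is what distinguishes escalation from simply notifying everyone: later steps only fire while the alert remains unacknowledged.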
3 / 5
In a code review, a dev references the automated notification sent to the on-call engineer when a monitored system crosses an alert threshold. What is this called?
A page is the automated notification sent to the on-call engineer when a monitored condition crosses an alert threshold. It is usually delivered by phone call, SMS, or push notification so that it reliably interrupts the engineer, even outside normal working hours. That interruptive delivery is what distinguishes a page from a routine notification such as an email. Pages are reserved for issues that genuinely require immediate human attention.
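The difference between a page and a routine notification can be sketched as a routing decision. This is a simplified illustration under stated assumptions: the function `delivery_channel`, the severity labels, and the rule "critical pages, everything else emails" are hypothetical, not how any specific monitoring product behaves.

```python
from typing import Optional

def delivery_channel(value: float, threshold: float, severity: str) -> Optional[str]:
    """Decide how to notify for one monitored metric.

    Returns None when the threshold is not crossed, "page" for a
    critical breach, and "email" for anything less urgent.
    """
    if value <= threshold:
        return None  # condition is healthy, so send nothing
    return "page" if severity == "critical" else "email"

# Hypothetical example: a 7% error rate against a 5% threshold.
print(delivery_channel(0.07, 0.05, "critical"))
print(delivery_channel(0.07, 0.05, "warning"))
```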
4 / 5
An incident report shows the same low-priority alert paged the on-call engineer dozens of times overnight, causing fatigue and a delayed response to a genuinely critical issue the next day. What practice would address this?
The fix is to regularly review and tune which conditions actually warrant a page. Excessive noise from poorly tuned thresholds desensitizes on-call engineers to genuinely critical pages, a well-documented failure mode called alert fatigue. Keeping low-priority conditions from paging keeps the signal meaningful, and this tuning discipline is essential to sustaining an effective on-call practice over the long term.
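One way to start such a review is to count how often each low-priority alert paged someone and flag the repeat offenders. The sketch below assumes a hypothetical page log of `(alert_name, severity)` pairs and a hypothetical cutoff; the function name `noisy_alerts` and the sample data are illustrative only.

```python
from collections import Counter

def noisy_alerts(page_log: list[tuple[str, str]], max_pages: int) -> list[str]:
    """Return low-priority alerts that paged more than `max_pages` times.

    `page_log` holds one (alert_name, severity) pair per page sent.
    Flagged alerts are candidates for a higher threshold or for
    demotion to a non-paging notification.
    """
    counts = Counter(name for name, severity in page_log if severity == "low")
    return sorted(name for name, n in counts.items() if n > max_pages)

# Hypothetical overnight log: one chatty low-priority alert, one real incident.
log = [("disk-usage-warning", "low")] * 24 + [("checkout-errors", "critical")]
print(noisy_alerts(log, max_pages=3))
```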
5 / 5
During a PR review, a teammate asks why the team maintains a formal on-call rotation instead of relying on whoever happens to notice a problem first. What is the reasoning?
A formal on-call rotation guarantees that a specific, accountable person is responsible for responding at any given time. Relying on whoever happens to notice an issue can lead to slow or missed responses, especially outside working hours. This structured ownership is core to reliable incident response for production systems. It also distributes the responsibility fairly across the team rather than constantly burdening one person.