Cluster Incident Communication
5 exercises — Practice communicating Node NotReady events, etcd degradation, cluster upgrades, and post-incident validation in professional English.
Quick reference: incident communication vocabulary
- NotReady node — kubelet health check failing; pods rescheduled after toleration timeout
- etcd — distributed key-value store holding all cluster state; latency impacts the entire control plane
- cordon / drain — cordon marks nodes unschedulable; drain evicts pods gracefully before maintenance
1 / 5
Node worker-03 has been in NotReady state for 4 minutes. Pods on that node are being rescheduled. You need to post a Slack update to your engineering channel. Which message is most appropriate for a professional incident communication?
Professional incident Slack messages follow a consistent structure: severity indicator, affected component, timestamp, current impact, actions in progress, and update cadence — all in one first message.
"Not sure why. Working on it." signals panic rather than control and erodes confidence in the on-call engineer. "All services may be impacted" is an unverified claim that causes unnecessary alarm. The best message tells stakeholders what they need to make decisions: Is the service down? Are we investigating or waiting? When is the next update? Keep technical investigation details in a separate Slack thread so the main channel stays readable.
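The structure described above (severity indicator, affected component, timestamp, current impact, actions in progress, update cadence) can be sketched as a small message template. This is only an illustration, not a prescribed format: the `IncidentUpdate` class, the `SEV-3` label, and the clock times are all hypothetical, and only the node name and the 4-minute duration come from the scenario.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentUpdate:
    severity: str          # hypothetical label scheme, e.g. "SEV-3"
    component: str         # affected component, e.g. a node name
    condition: str         # observed state, e.g. "NotReady"
    since: datetime        # when the condition was first observed
    impact: str            # verified impact only, no speculation
    actions: str           # what is being done right now
    cadence_minutes: int   # committed interval between updates

    def render(self, posted_at: datetime) -> str:
        """Return the first message for the main incident channel."""
        next_update = posted_at + timedelta(minutes=self.cadence_minutes)
        return "\n".join([
            f"[{self.severity}] {self.component} {self.condition} "
            f"since {self.since:%H:%M} UTC",
            f"Impact: {self.impact}",
            f"Actions: {self.actions}",
            f"Next update: {next_update:%H:%M} UTC "
            f"(every {self.cadence_minutes} min)",
        ])


# Illustrative values only: the date, clock times, and severity are made up.
posted_at = datetime(2024, 1, 15, 14, 6, tzinfo=timezone.utc)
update = IncidentUpdate(
    severity="SEV-3",
    component="Node worker-03",
    condition="NotReady",
    since=posted_at - timedelta(minutes=4),  # NotReady for 4 minutes
    impact="Pods on worker-03 are being rescheduled.",
    actions="Investigating the node; details in thread.",
    cadence_minutes=15,
)
message = update.render(posted_at)
print(message)
```

Computing the next update time from the posting time, rather than writing "soon", turns the update cadence into a concrete commitment that readers can hold the on-call engineer to.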
Key vocabulary:
• NotReady — node condition indicating the kubelet health check is failing; pods may be rescheduled
• rescheduled — pods evicted from the failing node and placed on healthy nodes by the scheduler
• update cadence — a committed interval for communicating incident status (for example, every 10, 15, or 30 minutes)