How to Write a Runbook Handoff for a Multi-Region Failover in English

Learn the English structure for writing a runbook handoff document so an on-call engineer in another region can execute a multi-region failover correctly without you present.

A runbook handoff for a multi-region failover has to work at 3am, in a timezone you’re asleep in, for an engineer who may have never touched this system before and cannot ask you a clarifying question. That constraint changes how you write English for it: every step needs to be unambiguous, every assumption needs to be stated rather than implied, and every “obviously” needs to be removed, because nothing is obvious to someone executing under pressure without the author in the room.

Key Vocabulary

Failover — the process of switching traffic or operations from a failed or degraded region to a healthy one, ideally with minimal disruption. “This runbook covers failover from the EU-West region to US-East in the event of a regional outage.”

Preconditions — the specific conditions that must be true before it is safe to begin executing the runbook. “Preconditions: confirm that US-East is reporting healthy status in the dashboard, and that replication lag is under 60 seconds before proceeding.”

Point of no return — a step in a procedure after which reversing course becomes significantly harder or impossible, and which should be flagged explicitly. “Step 6 is the point of no return — once DNS is repointed, reverting requires a separate rollback procedure, not simply redoing the earlier steps in reverse.”

Verification step — an explicit check inserted after an action to confirm it had the intended effect before moving to the next step. “After promoting the replica, the verification step is to confirm write traffic is landing in US-East by checking the connection count dashboard.”

Rollback path — the documented alternative procedure to reverse a failover if something goes wrong partway through. “If verification fails at step 4, follow the rollback path in Appendix B rather than continuing with the remaining failover steps.”

Structuring the Handoff Document

  • Purpose: This runbook fails traffic over from EU-West to US-East when EU-West is degraded or unreachable. It should take approximately 15 minutes to execute.”
  • Preconditions (check all before starting): US-East health dashboard shows green; replication lag under 60 seconds; you have access to the DNS management console.”
  • Step 1: Open the DNS console at [link] and locate the record for api.example.com. Do not modify anything yet — this step is only to confirm you have access.”
  • Step 2 (verification): Run curl -I https://api-useast.example.com/health and confirm you see 200 OK before proceeding to step 3.”
  • Step 6 (point of no return): Update the DNS record to point to the US-East load balancer IP. Propagation typically completes within 5 minutes given our TTL settings.”

Writing for an Engineer Who Isn’t There to Ask You

  • “If replication lag is above 60 seconds at the precondition check, stop and escalate to the on-call database engineer rather than proceeding — do not attempt to force promotion.”
  • “If you are unsure whether a step completed successfully, treat it as failed and use the verification command listed, rather than assuming it worked and moving on.”
  • “This runbook assumes you have already been granted the ‘failover-operator’ role; if you have not, request it through [process] before continuing, as steps 5 onward will fail without it.”

Professional Tips

  • State preconditions as a checklist, not prose. Someone under pressure needs to scan and confirm each item quickly, not parse a paragraph to extract what to check.
  • Mark the point of no return explicitly, in its own labeled step. Naming it removes the guesswork about which actions are still safely reversible and which aren’t.
  • Attach a verification step to every action that changes system state. A step without a way to confirm success leaves the reader guessing whether to proceed, which is exactly the wrong position to be in during an outage.
  • Write escalation triggers as explicit if/then statements. “If X, then escalate to Y” removes the judgment call from someone who may not have the context to make it confidently.

Practice Exercise

  1. Write a “Preconditions” checklist of three items for a hypothetical failover runbook.
  2. Write one verification step that follows a state-changing action, including the exact command or check to run.
  3. Write an escalation sentence in the form “If [condition], then [action]” for a step that could plausibly fail.