Build fluency in the vocabulary of setting, and actually achieving, data-loss and downtime targets.
1 / 5
At standup, a dev mentions two targets the team sets for a major outage: how much data loss is acceptable, and how long the system is allowed to stay down before service is restored. What are these two targets called?
Recovery Point Objective (RPO) defines how much data loss is acceptable, typically expressed as a time window such as fifteen minutes. Recovery Time Objective (RTO) defines how long the system is allowed to stay down before service is restored. Service-level indicators, by contrast, measure a system's actual current performance rather than defining these forward-looking disaster targets. Setting both targets ahead of time is what lets a team design a backup and failover strategy sized to the business's real tolerance for loss and downtime.
2 / 5
During a design review, the team picks a continuous replication strategy over a nightly backup because the business can tolerate only a few minutes of data loss rather than up to a full day's worth. Which capability does this choice reflect?
This reflects choosing a backup or replication strategy sized to meet a tight Recovery Point Objective: a nightly backup can lose up to a full day of data, while continuous replication can shrink that loss window to minutes or less. A nightly backup would only make sense if the RPO were loose enough to tolerate a full day's loss, which isn't the scenario described. This is how a stated RPO target should drive the concrete technical strategy a team picks, rather than the strategy being chosen arbitrarily.
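As a rough illustration of that sizing decision, the sketch below compares a strategy's worst-case loss window against an RPO. It assumes a simplified model in which the worst-case loss is the interval between copies plus any lag before a copy is safely stored. The function names, the five-minute RPO, and the thirty-second replication interval are hypothetical values chosen for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(copy_interval: timedelta,
                         copy_lag: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost data under a simplified model (an assumption):
    everything written since the last copy, plus the lag before that copy
    was safely stored."""
    return copy_interval + copy_lag


def meets_rpo(copy_interval: timedelta,
              rpo: timedelta,
              copy_lag: timedelta = timedelta(0)) -> bool:
    """True when the strategy's worst-case loss fits inside the RPO."""
    return worst_case_data_loss(copy_interval, copy_lag) <= rpo


rpo = timedelta(minutes=5)                              # assumed business target
nightly_ok = meets_rpo(timedelta(hours=24), rpo)        # nightly backup
continuous_ok = meets_rpo(timedelta(seconds=30), rpo)   # assumed replication interval
```

Under these assumed numbers the nightly backup fails the check and continuous replication passes it, which mirrors the choice in the scenario.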
3 / 5
In a code review, a dev notices the disaster recovery runbook hasn't actually been executed as a drill in over a year, even though it's kept up to date on paper. What risk does this represent?
An untested runbook risks a failover that takes far longer than its stated Recovery Time Objective once it's actually needed. A runbook that reads correctly on paper can still hide a step that no longer works, a stale credential, or a missing permission, and those failures only surface during a real execution. A read replica is an unrelated deployment pattern. Reviewing a runbook for accuracy on paper is therefore a much weaker check than actually executing it.
4 / 5
An incident report shows a regional failover took six hours to complete against a stated one-hour Recovery Time Objective, because the runbook had never actually been drilled and a critical step failed the first time it was executed for real. What practice would prevent this?
Running regular disaster recovery drills that actually execute the runbook surfaces a broken step, like a stale credential or a missing permission, well before a real outage forces the team to discover it under pressure. Continuing to keep the runbook updated only on paper with no actual drills is exactly what led to the six-hour failover against a one-hour target in this incident. This drilling discipline is what turns a written RTO target into a genuinely achievable one rather than an aspirational number.
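One way to picture a drill is as a small harness that executes each runbook step, records which ones fail, and compares the elapsed time to the RTO. The following is a minimal sketch under stated assumptions, not a real tool: `run_drill`, `DrillResult`, and the step names are hypothetical, and the steps are stand-in functions rather than real failover actions.

```python
import time
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable, List, Tuple


@dataclass
class DrillResult:
    elapsed: timedelta
    rto: timedelta
    failed_steps: List[str] = field(default_factory=list)

    @property
    def within_rto(self) -> bool:
        # A drill only counts as passing if every step worked AND it was fast enough.
        return not self.failed_steps and self.elapsed <= self.rto


def run_drill(steps: List[Tuple[str, Callable[[], None]]],
              rto: timedelta,
              clock: Callable[[], float] = time.monotonic) -> DrillResult:
    """Execute every step, collecting failures instead of stopping at the first."""
    start = clock()
    failed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            failed.append(name)
    elapsed = timedelta(seconds=clock() - start)
    return DrillResult(elapsed=elapsed, rto=rto, failed_steps=failed)


def expired_credential() -> None:
    # Stand-in for a step that reads fine on paper but fails when run.
    raise PermissionError("credential expired")


steps = [("promote replica", lambda: None), ("switch DNS", expired_credential)]
result = run_drill(steps, rto=timedelta(hours=1))
```

Here `result.failed_steps` names the broken step, so the team learns about the expired credential during the drill rather than during an outage.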
5 / 5
During a PR review, a teammate asks why the team invests in regular disaster recovery drills instead of trusting that a well-written runbook will work correctly the first time it's actually needed. What is the reasoning?
A runbook that reads correctly on paper can still hide a step that fails only under real conditions, such as a credential that quietly expired or a permission that was revoked since the runbook was last written, none of which a paper review would catch. A drill surfaces that failure in a controlled setting, where the team can fix it calmly, instead of discovering it for the first time during an actual outage under real pressure. The tradeoff is the ongoing time investment of running a drill regularly enough that the runbook stays validated against a system that keeps changing underneath it.
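To keep that cadence from slipping, a team could flag a runbook whose last drill is older than an agreed maximum age. This is only a sketch: `drill_overdue` is a hypothetical helper, and the 90-day threshold and the dates are assumed examples, not recommendations from the material above.

```python
from datetime import date, timedelta


def drill_overdue(last_drill: date, today: date, max_age: timedelta) -> bool:
    """True when the most recent drill is older than the agreed maximum age."""
    return today - last_drill > max_age


max_age = timedelta(days=90)  # assumed cadence
stale = drill_overdue(date(2023, 1, 10), date(2024, 3, 1), max_age)
fresh = drill_overdue(date(2024, 2, 1), date(2024, 3, 1), max_age)
```

With these example dates, the runbook last drilled over a year ago is flagged as overdue, while the one drilled a month ago is not.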