Multi-Region Deployment
Vocabulary for designing, discussing, and operating systems deployed across multiple geographic regions for resilience and performance.
- Active-Active /ˈæktɪv ˈæktɪv/
A multi-region architecture where all regions simultaneously serve live traffic and can process writes. Provides the highest availability and lowest latency for global users, but requires conflict resolution for concurrent writes to shared data and is the most complex to operate correctly.
"Our active-active setup routes users to the nearest region — European users to Frankfurt, North American to Virginia. Both regions accept writes, so we use CRDTs for the shopping cart to handle concurrent updates without conflicts. Failover is seamless: if Frankfurt goes down, DNS automatically routes all traffic to Virginia within 60 seconds."
- Active-Passive /ˈæktɪv ˈpæsɪv/
A multi-region architecture where one region (active) serves all live traffic and another (passive/standby) is kept warm and ready to take over. Simpler than active-active, but it involves a failover step: the standby does not serve traffic until it is promoted, and the replication lag at the moment of failure determines how much data can be lost.
"We use active-passive for the billing service — the passive region in Tokyo replicates from primary Singapore in near-real-time. Failover is manual and tested quarterly: we flip DNS, confirm replication lag was under 5 seconds, and declare the new primary within 20 minutes of an outage declaration."
- RPO (Recovery Point Objective) /ɑː piː əʊ/
The maximum acceptable amount of data loss measured in time — how far back in time can you afford to recover data? An RPO of 1 hour means you accept losing up to 1 hour of transactions. Drives decisions about replication frequency, backup intervals, and cross-region data synchronisation.
"Our RPO is 15 minutes for the orders database — we cannot lose more than 15 minutes of transactions. This drives our replication strategy: synchronous replication to the standby region for orders, with an RDS read replica that lags no more than 5 minutes. We test RPO recovery quarterly."
- RTO (Recovery Time Objective) /ɑː tiː əʊ/
The maximum acceptable time to restore service after a failure — how long can the system be unavailable? An RTO of 2 hours means service must be restored within 2 hours of a failure event. Drives decisions about automation level, pre-provisioned standby capacity, and runbook complexity.
"Our RTO is 30 minutes for the core application — the SLA to customers guarantees restoration within that window. To meet it, we pre-provision warm standby infrastructure in the DR region rather than cold-starting it, and our runbooks are automated to the point where a single command initiates the full failover sequence."
- Failover /ˈfeɪləʊvər/
The process of switching from a failed primary system to a standby system to restore service. Can be automated (triggered by health check failures) or manual (requiring human decision). The failover process itself must be regularly tested — an untested failover is not a recovery plan.
"We run quarterly failover drills where we deliberately terminate the primary region to test the automated failover path. The last drill confirmed automatic DNS failover in 47 seconds and full application recovery in 8 minutes — within our 30-minute RTO. The drill revealed a misconfigured health check that would have delayed detection in a real incident."
- Failback /ˈfeɪlbæk/
The process of returning traffic and operations to the original primary region after a failover, once the primary has been restored and validated. Often more complex and risky than the initial failover — requires synchronising any data written to the standby back to the primary and carefully cutting over traffic.
"After the US-East outage, failing back from EU-West took 4 hours: we replayed 2 hours of writes from EU-West logs to the restored US-East database, validated consistency, then shifted traffic back in a canary pattern at 10% increments rather than an instant cut-over."
- Data Sovereignty /ˈdeɪtə ˈsɒvrɪnti/
The principle that data is subject to the laws and regulations of the country or jurisdiction in which it is collected or stored. Regulations such as GDPR (EU) and PIPL (China) restrict where personal data may be stored and transferred, so certain customer data must stay within a specific geographic region or leave it only under defined legal safeguards.
"To comply with GDPR, we keep our EU customers' personal data inside the EU: we operate a fully isolated EU region with dedicated databases. The multi-region architecture has a hard routing rule: EU-origin sessions always hit the Frankfurt cluster, and data never replicates to US or APAC regions."
- Anycast /ˈeniˌkɑːst/
A network routing technique where the same IP address is advertised from multiple geographic locations — traffic is routed by BGP to the topologically nearest instance. Used by CDNs and DNS providers to route users to the closest point of presence without application-level logic.
"We use Anycast for our global API gateway — the same IP address is announced from 12 PoPs on all continents. When a user in Seoul makes a request, BGP routes it to the nearest PoP in Tokyo rather than across the Atlantic to our origin servers. This alone reduced average API latency from 280ms to 22ms for APAC users."
- Latency-Based Routing /ˈleɪtənsi beɪst ˈruːtɪŋ/
A DNS routing policy that directs users to the regional endpoint with the lowest measured network latency from their location. Offered by Route 53, Cloudflare, and other DNS and global load-balancing services. It differs from geographic routing, which maps the client's IP address to a location: latency-based routing relies on measured network performance rather than distance.
"We switched from geographic routing to latency-based routing in Route 53 and saw p50 latency drop by 18% globally — geographic proximity does not always correlate with lowest latency, especially for users near international cable landing points."
- Disaster Recovery /dɪˈzɑːstər rɪˈkʌvəri/
The set of policies, tools, and procedures for restoring IT systems and data after a catastrophic failure — whether hardware failure, data corruption, natural disaster, or cyberattack. DR strategy is defined by RPO and RTO requirements and the acceptable cost of the standby infrastructure.
"Our disaster recovery strategy uses a warm standby in a separate AWS region: the DR region runs all application tiers at 25% capacity with continuous database replication. In a full primary region loss, we scale up the DR region to 100% capacity within our 30-minute RTO, meeting our RPO of 5 minutes."
- Pilot Light /ˈpaɪlət laɪt/
A DR strategy where only the core data replication infrastructure is kept running in the standby region: databases replicate continuously but application servers are not provisioned. Cheaper than warm standby but slower to recover, because the application tier must be provisioned from scratch during a failover.
"We use the pilot light DR pattern for our analytics platform — the data warehouse replicates to the DR region continuously, but no compute is provisioned there. Recovery takes 45 minutes to spin up the application tier from AMIs, which is acceptable for our 2-hour RTO for non-critical services."
- Warm Standby /wɔːm ˈstændbaɪ/
A DR strategy where a scaled-down version of the full system runs continuously in the standby region — databases replicate in real time and application servers run at minimum capacity. Faster recovery than pilot light (just scale up, no provisioning), at higher ongoing cost.
"Our warm standby runs all services at 20% production capacity — just enough to handle monitoring, replication, and smoke tests. When we initiate failover, Auto Scaling groups expand to full capacity in 8 minutes. The trade-off is 20% of our production infrastructure cost running continuously in the DR region."