High Availability Database Vocabulary: Replication, Failover, and RPO/RTO
Essential HA database vocabulary for DBAs and engineers: replication types, failover concepts, RPO, RTO, MTTR, clustering, consensus, and more — explained in plain English.
High availability (HA) database architectures ensure that data remains accessible and correct even when components fail. Whether you’re designing HA systems, documenting them for auditors, or explaining trade-offs in architecture reviews, you need this vocabulary. This guide covers the terms at the core of database reliability.
Availability Fundamentals
High Availability (HA)
High availability is the ability of a system to remain operational and accessible for a high percentage of time — typically 99.9% (three nines) or 99.99% (four nines) uptime.
| Availability | Annual downtime |
|---|---|
| 99% (two nines) | ~3.6 days |
| 99.9% (three nines) | ~8.7 hours |
| 99.99% (four nines) | ~52 minutes |
| 99.999% (five nines) | ~5.3 minutes |
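The downtime figures in the table fall out of simple arithmetic. A minimal sketch, assuming a 365-day year (which is how the rough numbers above are usually derived):

```python
def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per year at a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 -- assumes a non-leap year
    return minutes_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {annual_downtime_minutes(pct):,.1f} minutes/year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why every additional nine is disproportionately expensive to achieve.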
Fault Tolerance
Fault tolerance is the ability to continue operating correctly in the presence of component failures: no service interruption, at worst degraded performance.
“Our database cluster is fault tolerant — the loss of one node doesn’t cause any downtime; the cluster continues serving reads and writes.”
Resilience
Resilience is the ability to recover quickly from failures. A resilient system detects failures and recovers fast, minimising the impact window.
Recovery Objectives
RPO (Recovery Point Objective)
RPO is the maximum acceptable amount of data loss measured in time. If RPO = 1 hour, the system can tolerate losing at most the last 1 hour of transactions.
“Our RPO is 15 minutes — we can’t afford to lose more than 15 minutes of transaction data in a disaster.”
RPO determines: backup frequency, replication synchrony, transaction log shipping interval.
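The link between shipping cadence and RPO can be made concrete. A small sketch (the intervals are illustrative, not from any specific system): with periodic log shipping or backups alone, the worst-case data loss equals the interval, because a failure just before the next shipment loses everything since the last one.

```python
def worst_case_loss_minutes(ship_interval_min: float) -> float:
    """Worst-case data loss when recovery relies on shipped logs/backups alone."""
    return ship_interval_min

def satisfies_rpo(ship_interval_min: float, rpo_min: float) -> bool:
    """Shipping cadence meets the RPO only if it runs at least that often."""
    return worst_case_loss_minutes(ship_interval_min) <= rpo_min

print(satisfies_rpo(ship_interval_min=5, rpo_min=15))   # 5-minute log shipping
print(satisfies_rpo(ship_interval_min=60, rpo_min=15))  # hourly backups only
```

Continuous replication tightens this further: synchronous replication drives the worst case to zero, at a latency cost discussed below.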
RTO (Recovery Time Objective)
RTO is the maximum acceptable time to restore service after a failure. If RTO = 4 hours, the system must be back online within 4 hours of a disaster.
“Our RTO is 1 hour — we need to be able to restore the database within an hour of a complete primary failure.”
RTO determines: failover automation, standby strategy, DR site readiness.
MTTR (Mean Time to Recovery)
MTTR is the average time it takes to restore service after a failure, and a key performance indicator for reliability teams.
“This quarter’s MTTR for database incidents is 34 minutes — we’re targeting under 20 minutes by Q3.”
MTBF (Mean Time Between Failures)
MTBF is the average time between successive failures. Higher MTBF = more reliable system.
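MTBF and MTTR together imply a steady-state availability via the standard reliability formula availability = MTBF / (MTBF + MTTR). A quick sketch with illustrative figures:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative: one failure every 30 days (720 h) with a 34-minute MTTR.
print(f"{availability(720, 34 / 60) * 100:.4f}%")
```

The formula makes the trade space explicit: you can buy nines either by failing less often (raising MTBF) or by recovering faster (lowering MTTR).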
RLO (Recovery Level Objective)
RLO is the minimum functional level of service acceptable during or after recovery — e.g., read-only mode acceptable while writes are restored.
Replication
Database Replication
Replication is the process of maintaining copies of data on multiple servers — for availability, scalability (read distribution), and disaster recovery.
Primary / Master
The primary (formerly “master”) is the server that accepts writes. All data modifications go to the primary first.
Replica / Standby / Secondary
A replica (or standby or secondary) is a copy of the primary that receives and applies changes. Replicas can serve reads (read replicas) or be kept on standby purely for failover (hot standby).
Synchronous Replication
In synchronous replication, the primary waits for acknowledgement from at least one replica before confirming a transaction commit.
Guarantees: zero data loss on failover (RPO = 0). Trade-off: write latency increases; if the replica is slow, the primary is slow.
“We use synchronous replication between AZs — we accept slightly higher write latency in exchange for zero-RPO failover.”
Asynchronous Replication
In asynchronous replication, the primary commits immediately without waiting for the replica. The replica catches up in the background.
Guarantees: lower write latency; potentially some data loss on failover (RPO > 0). Trade-off: replica may lag behind the primary.
“Cross-region replication is asynchronous — network latency makes synchronous replication impractical. Our cross-region RPO is ~30 seconds.”
Replication Lag
Replication lag is the delay between a write on the primary and its appearance on the replica. Monitored as a key metric. Excessive lag = risk of stale reads and higher RPO on failover.
“Replica lag spiked to 4 hours during the incident — reports using the read replica showed 4-hour-old data.”
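A lag monitor can be sketched in a few lines. Here `fetch_lag_seconds` is a hypothetical stand-in for however your database exposes lag (on a PostgreSQL standby, one common source is `SELECT now() - pg_last_xact_replay_timestamp()`), and the threshold is illustrative:

```python
LAG_WARN_SECONDS = 60  # illustrative threshold; tune it to your RPO target

def check_lag(fetch_lag_seconds, alert) -> None:
    """Alert when replica lag exceeds the warning threshold."""
    lag = fetch_lag_seconds()
    if lag > LAG_WARN_SECONDS:
        alert(f"replica lag {lag:.0f}s exceeds {LAG_WARN_SECONDS}s threshold")

# Usage with a stubbed-in lag value and print as the alert channel:
check_lag(lambda: 245.0, print)
```

Alerting on lag relative to the RPO target, rather than on an absolute number, keeps the monitor aligned with what the business actually promised.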
Streaming Replication
Streaming replication continuously streams write-ahead log (WAL) records from primary to standby in near-real-time. Used by PostgreSQL for HA standbys.
Logical Replication
Logical replication replicates specific tables at the logical (row change) level — more flexible than physical/streaming replication. Enables replication across major versions or partial replication of selected tables.
Failover
Failover
Failover is the process of switching database traffic from the failed primary to a standby — making the standby the new primary.
Automatic Failover
Automatic failover detects primary failure and promotes a standby to primary without human intervention. Requires robust failure detection and consensus mechanisms.
“We have automatic failover configured — if the primary is unresponsive for 30 seconds, the standby is automatically promoted.”
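The core of a failure detector like the one quoted above is a timeout loop. A minimal sketch, where `ping_primary`, `promote`, `now`, and `sleep` are hypothetical injected hooks (in production they would wrap real health checks, a promotion command, and a wall clock):

```python
FAILOVER_TIMEOUT = 30.0  # seconds of unresponsiveness before promotion (illustrative)

def monitor(ping_primary, promote, now, sleep):
    """Promote the standby once the primary has been down past the timeout."""
    last_ok = now()
    while True:
        if ping_primary():
            last_ok = now()  # primary answered; reset the grace window
        elif now() - last_ok >= FAILOVER_TIMEOUT:
            promote()        # grace window exhausted: promote the standby
            return
        sleep(1)
```

A real implementation must do more before promoting: fence the old primary and confirm quorum, as covered under Split Brain below, or a transient network blip can produce two primaries.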
Manual Failover
Manual failover requires a human to initiate the promotion of a standby to primary. Higher RTO than automatic failover.
Split Brain
Split brain is a dangerous condition where two nodes in a cluster both believe they are the primary — typically caused by a network partition. Can lead to data divergence and corruption.
Prevented by: fencing, quorum/consensus mechanisms, and STONITH.
Fencing (STONITH — Shoot The Other Node In The Head)
Fencing prevents split brain by isolating or forcibly shutting down the node that should no longer be primary before promoting the new primary. “STONITH” is the memorable acronym for this pattern.
Quorum
Quorum is the minimum number of nodes that must agree for a cluster to make a decision (promotion, write acceptance). A 3-node cluster with quorum of 2 can survive one node failure.
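For majority-based clusters the arithmetic is fixed: quorum is floor(n/2) + 1, so a cluster of n nodes tolerates n minus quorum failures. A small sketch:

```python
def quorum_size(n_nodes: int) -> int:
    """Smallest majority of an n-node cluster: floor(n/2) + 1."""
    return n_nodes // 2 + 1

def tolerated_failures(n_nodes: int) -> int:
    """Nodes that can fail while the cluster still holds quorum."""
    return n_nodes - quorum_size(n_nodes)

for n in (3, 5, 7):
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

This is also why clusters use odd node counts: going from 3 to 4 nodes raises the quorum to 3 without tolerating any additional failures.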
Consensus Protocol
Consensus protocols (Raft, Paxos) ensure all nodes agree on the same state. Used by distributed databases (CockroachDB, etcd, Patroni for PostgreSQL) to manage leader election and data consistency.
Topology Patterns
Hot Standby
A hot standby is a replica that is running, in sync, and ready to accept connections immediately after promotion. Minimum RTO.
Warm Standby
A warm standby is running and receiving replication but not accepting connections. Requires startup and promotion steps before serving traffic.
Cold Standby
A cold standby is a server that is not running — requires starting up, restoring a backup, and catching up before it can serve traffic. Highest RTO.
Multi-AZ (Multi-Availability Zone)
Multi-AZ deploys primary and standby in different availability zones (physically separate data centres within a region). Protects against AZ-level failures with automatic failover.
“Our RDS instance is Multi-AZ — if the AZ containing the primary goes down, AWS automatically fails over to the standby in a different AZ within 60-120 seconds.”
Read Replica
A read replica is a replica configured to serve SELECT queries — offloading read traffic from the primary.
“We have three read replicas serving our reporting queries — the primary only handles writes and critical operational reads.”
Active-Active
Active-active replication means multiple nodes accept writes simultaneously, with conflict resolution mechanisms. It offers the highest availability and write throughput but is the most complex to implement correctly.
Active-Passive
Active-passive means one primary accepts writes; standbys are passive (receive replication but don’t accept writes). Simpler to reason about; failover requires promotion.
Backup and Recovery
Full Backup
A full backup is a complete copy of the database at a point in time.
Incremental Backup
An incremental backup captures only changes since the last backup — smaller and faster but requires the full backup + all incrementals to restore.
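The restore-chain requirement can be shown with a toy model, assuming each backup is represented as a dict of changed keys (real engines work at the block or page level, but the ordering logic is the same):

```python
def restore(full: dict, incrementals: list) -> dict:
    """Rebuild state from a full backup plus its incrementals, oldest first."""
    state = dict(full)
    for inc in incrementals:  # order matters: each layers on top of the previous
        state.update(inc)
    return state

full = {"a": 1, "b": 2}
incrementals = [{"b": 3}, {"c": 4}]
print(restore(full, incrementals))  # -> {'a': 1, 'b': 3, 'c': 4}
```

The dependency chain is the operational risk: lose or corrupt any incremental in the sequence and everything after it becomes unrestorable.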
WAL (Write-Ahead Log) / Transaction Log / Redo Log
Database engines use a write-ahead log (WAL) — transaction changes are written to the log before being applied to data files. WAL enables PITR and replication.
PITR (Point-in-Time Recovery)
PITR is the ability to restore a database to any specific point in time within the backup retention window — using a base backup plus replaying the WAL.
“We had a bad DELETE query run at 14:30 — we used PITR to restore the database to 14:29, recovering all the deleted records.”
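The 14:29 recovery above can be modelled in miniature: start from the base backup and replay WAL records committed strictly before the recovery target. This is a toy sketch (records as timestamped key/value tuples, not a real WAL format):

```python
from datetime import datetime

def pitr(base: dict, wal: list, target: datetime) -> dict:
    """Replay WAL records committed strictly before the target time."""
    state = dict(base)
    for committed_at, key, value in wal:  # records arrive in commit order
        if committed_at >= target:
            break  # stop just before the damaging transaction
        state[key] = value
    return state

base = {"orders": 100}
wal = [
    (datetime(2024, 1, 1, 14, 0), "orders", 120),
    (datetime(2024, 1, 1, 14, 30), "orders", 0),  # the bad DELETE at 14:30
]
print(pitr(base, wal, target=datetime(2024, 1, 1, 14, 29)))  # -> {'orders': 120}
```

Choosing the target is the hard part in practice: too late and the bad transaction replays, too early and legitimate work committed just before it is lost.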
Useful Phrases
Discussing RPO/RTO in design:
- “Given an RPO of zero and RTO of under 60 seconds, we need synchronous Multi-AZ replication with automatic failover.”
Explaining a failover:
- “The primary failed at 09:43 UTC. Automatic failover promoted the standby at 09:44 — 61 seconds total. All connections were automatically rerouted.”
Explaining replication lag:
- “Replication lag is currently 45 seconds — within our RPO target of 1 minute. We’ll monitor it through the bulk load.”
Practice
Deepen your DBA vocabulary with the Database Administration exercise set and the DBA learning path.