High Availability Database Vocabulary: Replication, Failover, and RPO/RTO
Essential HA database vocabulary for DBAs and engineers: replication types, failover concepts, RPO, RTO, MTTR, clustering, consensus, and more — explained in plain English.
High availability (HA) database architectures ensure that data remains accessible and correct even when components fail. Whether you’re designing HA systems, documenting them for auditors, or explaining trade-offs in architecture reviews, you need this vocabulary. This guide covers the terms at the core of database reliability.
Availability Fundamentals
High Availability (HA)
High availability is the ability of a system to remain operational and accessible for a high percentage of time — typically 99.9% (three nines) or 99.99% (four nines) uptime.
| Availability | Annual downtime |
|---|---|
| 99% (two nines) | ~3.6 days |
| 99.9% (three nines) | ~8.7 hours |
| 99.99% (four nines) | ~52 minutes |
| 99.999% (five nines) | ~5.3 minutes |
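The downtime figures in the table fall out of simple arithmetic. A minimal sketch, assuming a 365-day year (which is how the rough numbers above are usually derived):

```python
def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per year at a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 -- assumes a non-leap year
    return minutes_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {annual_downtime_minutes(pct):,.1f} minutes/year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why every additional nine is disproportionately expensive to achieve.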
Fault Tolerance
Fault tolerance is the ability to continue operating correctly in the presence of component failures: no service interruption, at worst degraded performance.
“Our database cluster is fault tolerant — the loss of one node doesn’t cause any downtime; the cluster continues serving reads and writes.”
Resilience
Resilience is the ability to recover quickly from failures. A resilient system detects failures and recovers fast, minimising the impact window.
Recovery Objectives
RPO (Recovery Point Objective)
RPO is the maximum acceptable amount of data loss measured in time. If RPO = 1 hour, the system can tolerate losing at most the last 1 hour of transactions.
“Our RPO is 15 minutes — we can’t afford to lose more than 15 minutes of transaction data in a disaster.”
RPO determines: backup frequency, replication synchrony, transaction log shipping interval.
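The link between shipping cadence and RPO can be made concrete. A small sketch (the intervals are illustrative, not from any specific system): with periodic log shipping or backups alone, the worst-case data loss equals the interval, because a failure just before the next shipment loses everything since the last one.

```python
def worst_case_loss_minutes(ship_interval_min: float) -> float:
    """Worst-case data loss when recovery relies on shipped logs/backups alone."""
    return ship_interval_min

def satisfies_rpo(ship_interval_min: float, rpo_min: float) -> bool:
    """Shipping cadence meets the RPO only if it runs at least that often."""
    return worst_case_loss_minutes(ship_interval_min) <= rpo_min

print(satisfies_rpo(ship_interval_min=5, rpo_min=15))   # 5-minute log shipping
print(satisfies_rpo(ship_interval_min=60, rpo_min=15))  # hourly backups only
```

Continuous replication tightens this further: synchronous replication drives the worst case to zero, at a latency cost discussed below.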
RTO (Recovery Time Objective)
RTO is the maximum acceptable time to restore service after a failure. If RTO = 4 hours, the system must be back online within 4 hours of a disaster.
“Our RTO is 1 hour — we need to be able to restore the database within an hour of a complete primary failure.”
RTO determines: failover automation, standby strategy, DR site readiness.
MTTR (Mean Time to Recovery)
MTTR is the average time it takes to restore service after a failure, and a key performance indicator for reliability teams.
“This quarter’s MTTR for database incidents is 34 minutes — we’re targeting under 20 minutes by Q3.”
MTBF (Mean Time Between Failures)
MTBF is the average time between successive failures. Higher MTBF = more reliable system.
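MTBF and MTTR together imply a steady-state availability via the standard reliability formula availability = MTBF / (MTBF + MTTR). A quick sketch with illustrative figures:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative: one failure every 30 days (720 h) with a 34-minute MTTR.
print(f"{availability(720, 34 / 60) * 100:.4f}%")
```

The formula makes the trade space explicit: you can buy nines either by failing less often (raising MTBF) or by recovering faster (lowering MTTR).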
RLO (Recovery Level Objective)
RLO is the minimum functional level of service acceptable during or after recovery — e.g., read-only mode acceptable while writes are restored.
Replication
Database Replication
Replication is the process of maintaining copies of data on multiple servers — for availability, scalability (read distribution), and disaster recovery.
Primary / Master
The primary (formerly “master”) is the server that accepts writes. All data modifications go to the primary first.
Replica / Standby / Secondary
A replica (or standby or secondary) is a copy of the primary that receives and applies changes. Replicas can serve reads (read replicas) or be kept on standby purely for failover (hot standby).
Synchronous Replication
In synchronous replication, the primary waits for acknowledgement from at least one replica before confirming a transaction commit.
Guarantees: zero data loss on failover (RPO = 0). Trade-off: write latency increases; if the replica is slow, the primary is slow.
“We use synchronous replication between AZs — we accept slightly higher write latency in exchange for zero-RPO failover.”
Asynchronous Replication
In asynchronous replication, the primary commits immediately without waiting for the replica. The replica catches up in the background.
Guarantees: lower write latency; potentially some data loss on failover (RPO > 0). Trade-off: replica may lag behind the primary.
“Cross-region replication is asynchronous — network latency makes synchronous replication impractical. Our cross-region RPO is ~30 seconds.”
Replication Lag
Replication lag is the delay between a write on the primary and its appearance on the replica. Monitored as a key metric. Excessive lag = risk of stale reads and higher RPO on failover.
“Replica lag spiked to 4 hours during the incident — reports using the read replica showed 4-hour-old data.”
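A lag monitor can be sketched in a few lines. Here `fetch_lag_seconds` is a hypothetical stand-in for however your database exposes lag (on a PostgreSQL standby, one common source is `SELECT now() - pg_last_xact_replay_timestamp()`), and the threshold is illustrative:

```python
LAG_WARN_SECONDS = 60  # illustrative threshold; tune it to your RPO target

def check_lag(fetch_lag_seconds, alert) -> None:
    """Alert when replica lag exceeds the warning threshold."""
    lag = fetch_lag_seconds()
    if lag > LAG_WARN_SECONDS:
        alert(f"replica lag {lag:.0f}s exceeds {LAG_WARN_SECONDS}s threshold")

# Usage with a stubbed-in lag value and print as the alert channel:
check_lag(lambda: 245.0, print)
```

Alerting on lag relative to the RPO target, rather than on an absolute number, keeps the monitor aligned with what the business actually promised.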
Streaming Replication
Streaming replication continuously streams write-ahead log (WAL) records from primary to standby in near-real-time. Used by PostgreSQL for HA standbys.
Logical Replication
Logical replication replicates specific tables at the logical (row change) level — more flexible than physical/streaming replication. Enables replication across major versions or partial replication of selected tables.
Failover
Failover
Failover is the process of switching database traffic from the failed primary to a standby — making the standby the new primary.
Automatic Failover
Automatic failover detects primary failure and promotes a standby to primary without human intervention. Requires robust failure detection and consensus mechanisms.
“We have automatic failover configured — if the primary is unresponsive for 30 seconds, the standby is automatically promoted.”
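The core of a failure detector like the one quoted above is a timeout loop. A minimal sketch, where `ping_primary`, `promote`, `now`, and `sleep` are hypothetical injected hooks (in production they would wrap real health checks, a promotion command, and a wall clock):

```python
FAILOVER_TIMEOUT = 30.0  # seconds of unresponsiveness before promotion (illustrative)

def monitor(ping_primary, promote, now, sleep):
    """Promote the standby once the primary has been down past the timeout."""
    last_ok = now()
    while True:
        if ping_primary():
            last_ok = now()  # primary answered; reset the grace window
        elif now() - last_ok >= FAILOVER_TIMEOUT:
            promote()        # grace window exhausted: promote the standby
            return
        sleep(1)
```

A real implementation must do more before promoting: fence the old primary and confirm quorum, as covered under Split Brain below, or a transient network blip can produce two primaries.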
Manual Failover
Manual failover requires a human to initiate the promotion of a standby to primary. Higher RTO than automatic failover.
Split Brain
Split brain is a dangerous condition where two nodes in a cluster both believe they are the primary — typically caused by a network partition. Can lead to data divergence and corruption.
Prevented by: fencing, quorum/consensus mechanisms, and STONITH.
Fencing (STONITH — Shoot The Other Node In The Head)
Fencing prevents split brain by isolating or forcibly shutting down the node that should no longer be primary before promoting the new primary. “STONITH” is the memorable acronym for this pattern.
Quorum
Quorum is the minimum number of nodes that must agree for a cluster to make a decision (promotion, write acceptance). A 3-node cluster with quorum of 2 can survive one node failure.
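For majority-based clusters the arithmetic is fixed: quorum is floor(n/2) + 1, so a cluster of n nodes tolerates n minus quorum failures. A small sketch:

```python
def quorum_size(n_nodes: int) -> int:
    """Smallest majority of an n-node cluster: floor(n/2) + 1."""
    return n_nodes // 2 + 1

def tolerated_failures(n_nodes: int) -> int:
    """Nodes that can fail while the cluster still holds quorum."""
    return n_nodes - quorum_size(n_nodes)

for n in (3, 5, 7):
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

This is also why clusters use odd node counts: going from 3 to 4 nodes raises the quorum to 3 without tolerating any additional failures.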
Consensus Protocol
Consensus protocols (Raft, Paxos) ensure all nodes agree on the same state. Used by distributed databases (CockroachDB, etcd, Patroni for PostgreSQL) to manage leader election and data consistency.
Topology Patterns
Hot Standby
A hot standby is a replica that is running, in sync, and ready to accept connections immediately after promotion. Minimum RTO.
Warm Standby
A warm standby is running and receiving replication but not accepting connections. Requires startup and promotion steps before serving traffic.
Cold Standby
A cold standby is a server that is not running — requires starting up, restoring a backup, and catching up before it can serve traffic. Highest RTO.
Multi-AZ (Multi-Availability Zone)
Multi-AZ deploys primary and standby in different availability zones (physically separate data centres within a region). Protects against AZ-level failures with automatic failover.
“Our RDS instance is Multi-AZ — if the AZ containing the primary goes down, AWS automatically fails over to the standby in a different AZ within 60-120 seconds.”
Read Replica
A read replica is a replica configured to serve SELECT queries — offloading read traffic from the primary.
“We have three read replicas serving our reporting queries — the primary only handles writes and critical operational reads.”
Active-Active
Active-active replication means multiple nodes accept writes simultaneously, with conflict resolution mechanisms. It offers the highest availability and write throughput but is the most complex to implement correctly.
Active-Passive
Active-passive means one primary accepts writes; standbys are passive (receive replication but don’t accept writes). Simpler to reason about; failover requires promotion.
Backup and Recovery
Full Backup
A full backup is a complete copy of the database at a point in time.
Incremental Backup
An incremental backup captures only changes since the last backup — smaller and faster but requires the full backup + all incrementals to restore.
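The restore-chain requirement can be shown with a toy model, assuming each backup is represented as a dict of changed keys (real engines work at the block or page level, but the ordering logic is the same):

```python
def restore(full: dict, incrementals: list) -> dict:
    """Rebuild state from a full backup plus its incrementals, oldest first."""
    state = dict(full)
    for inc in incrementals:  # order matters: each layers on top of the previous
        state.update(inc)
    return state

full = {"a": 1, "b": 2}
incrementals = [{"b": 3}, {"c": 4}]
print(restore(full, incrementals))  # -> {'a': 1, 'b': 3, 'c': 4}
```

The dependency chain is the operational risk: lose or corrupt any incremental in the sequence and everything after it becomes unrestorable.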
WAL (Write-Ahead Log) / Transaction Log / Redo Log
Database engines use a write-ahead log (WAL) — transaction changes are written to the log before being applied to data files. WAL enables PITR and replication.
PITR (Point-in-Time Recovery)
PITR is the ability to restore a database to any specific point in time within the backup retention window — using a base backup plus replaying the WAL.
“We had a bad DELETE query run at 14:30 — we used PITR to restore the database to 14:29, recovering all the deleted records.”
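The 14:29 recovery above can be modelled in miniature: start from the base backup and replay WAL records committed strictly before the recovery target. This is a toy sketch (records as timestamped key/value tuples, not a real WAL format):

```python
from datetime import datetime

def pitr(base: dict, wal: list, target: datetime) -> dict:
    """Replay WAL records committed strictly before the target time."""
    state = dict(base)
    for committed_at, key, value in wal:  # records arrive in commit order
        if committed_at >= target:
            break  # stop just before the damaging transaction
        state[key] = value
    return state

base = {"orders": 100}
wal = [
    (datetime(2024, 1, 1, 14, 0), "orders", 120),
    (datetime(2024, 1, 1, 14, 30), "orders", 0),  # the bad DELETE at 14:30
]
print(pitr(base, wal, target=datetime(2024, 1, 1, 14, 29)))  # -> {'orders': 120}
```

Choosing the target is the hard part in practice: too late and the bad transaction replays, too early and legitimate work committed just before it is lost.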
Useful Phrases
Discussing RPO/RTO in design:
- “Given an RPO of zero and RTO of under 60 seconds, we need synchronous Multi-AZ replication with automatic failover.”
Explaining a failover:
- “The primary failed at 09:43 UTC. Automatic failover promoted the standby at 09:44 — 61 seconds total. All connections were automatically rerouted.”
Explaining replication lag:
- “Replication lag is currently 45 seconds — within our RPO target of 1 minute. We’ll monitor it through the bulk load.”
Practice
Deepen your DBA vocabulary with the Database Administration exercise set and the DBA learning path.