Database high-availability comparison

Synchronous vs asynchronous replication

Every replicated database has to answer one question on every single write: does the primary wait for a replica to confirm before telling the client "done"? The answer trades write latency against the risk of losing the last few writes during a failover — and most production incidents involving "we lost some data during failover" trace back to this one setting.

TL;DR

Synchronous replication waits for one or more replicas to confirm a write before acknowledging it to the client. Acknowledged writes survive failover to a confirming replica, at the cost of extra latency on every write.
Asynchronous replication acknowledges the write as soon as the primary commits it, replicating to followers in the background — fast writes, but a crash before replication completes loses those writes.
Semi-synchronous is the common middle ground: wait for at least one replica, replicate to the rest asynchronously.
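
The difference between the modes can be made concrete with a toy model. The sketch below is not how any real database replicates: `ToyCluster`, `ship_pending`, and `lost_on_failover` are hypothetical names, and the "replica" is just a Python list. It only shows which acknowledged writes a promoted replica would be missing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy in-memory model: one primary log plus one replica log."""
    synchronous: bool
    primary_log: list = field(default_factory=list)
    replica_log: list = field(default_factory=list)
    acknowledged: list = field(default_factory=list)

    def write(self, value, replica_reachable=True):
        """Commit on the primary; return True once the client is acknowledged."""
        self.primary_log.append(value)
        if self.synchronous:
            if not replica_reachable:
                # A real system would block here; the toy simply refuses to ack.
                return False
            self.replica_log.append(value)
        # In async mode the replica is only updated later by ship_pending().
        self.acknowledged.append(value)
        return True

    def ship_pending(self):
        """Background replication: copy everything the replica is missing."""
        self.replica_log = list(self.primary_log)

    def lost_on_failover(self):
        """Acknowledged writes the replica would lack if promoted right now."""
        return [v for v in self.acknowledged if v not in self.replica_log]


sync = ToyCluster(synchronous=True)
sync.write("a")
sync.write("b")

lazy = ToyCluster(synchronous=False)
lazy.write("a")
lazy.ship_pending()  # replica catches up to "a"
lazy.write("b")      # acknowledged, but not yet shipped

print(sync.lost_on_failover())  # []
print(lazy.lost_on_failover())  # ['b']
```

The synchronous cluster pays for its empty loss list by refusing to acknowledge when the replica is unreachable, which is the stall risk described in the comparison below.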

Side-by-side comparison

Aspect	Synchronous replication	Asynchronous replication
Write acknowledged when	Primary + configured replica(s) both confirm	Primary alone commits it
Write latency	Higher — bounded by slowest required replica + network RTT	Lower — no waiting on replicas
Data loss on primary crash	None, for the synchronous replica set	Possible — unreplicated writes are lost
Replication lag	~Zero for the synchronous replica	Non-zero, variable (ms to seconds)
Single point of failure risk	Yes, if the sync replica is unavailable, writes can stall	No — async replica outage doesn't block writes
Typical use case	Financial systems, regulated data, zero-RPO requirements	Read replicas, cross-region DR, most general workloads
Failover data loss	Zero (promoted replica has every acknowledged write)	Possible — must promote the least-stale replica
Relation to consensus	Similar spirit to Raft/Paxos quorum writes, but usually a fixed replica, not a majority vote	No agreement protocol — just "fire and eventually catch up"

Code / config side-by-side

PostgreSQL — synchronous replication

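# postgresql.conf on the primary
# The name must match the standby's application_name;
# the hyphen requires double quotes inside the string.
synchronous_standby_names = '"replica-a"'
synchronous_commit = on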

# COMMIT now blocks until replica-a
# confirms it has received AND flushed
# the write-ahead log entry.
# Slower writes, zero loss on failover
# to replica-a.

PostgreSQL — asynchronous replication (default)

# postgresql.conf on the primary
synchronous_standby_names = ''
# (empty = async, the default)

# COMMIT returns as soon as the primary
# itself has flushed the write.
# Replicas apply the WAL stream in the
# background ("streaming replication").
# If the primary dies before a replica
# catches up, those last writes are lost.

When to use synchronous replication

Zero data loss (RPO = 0) is a hard requirement. Financial ledgers, payment processors, and regulated healthcare data typically cannot tolerate losing even the last few committed transactions.
You have a reliable, low-latency network between primary and replica. Synchronous replication within the same data centre or availability zone keeps the added latency small (single-digit milliseconds).
You need a guaranteed-consistent failover target. When the promoted replica must have every write the old primary ever acknowledged, synchronous replication is the only way to guarantee it without a full consensus protocol.
Write volume is moderate enough to absorb the latency cost. Extremely high-throughput write workloads may find the per-write latency penalty unacceptable at scale.

When to use asynchronous replication

Write latency matters more than zero-RPO guarantees. Most consumer applications tolerate losing the last few milliseconds of writes during a rare crash far better than they tolerate slower writes on every request.
Replicas are geographically distant. Cross-region disaster-recovery replicas almost always use async replication — synchronous replication across continents would make every write intolerably slow.
You're scaling reads, not guaranteeing zero loss. Read replicas that exist purely to offload query traffic don't need to block every write on their own confirmation.
You can bound and monitor acceptable lag. Systems that alert on replication lag and switch to read-only mode above a threshold cap how much acknowledged data a crash can lose, while keeping async write latency in normal operation. This does not reach zero loss, but it bounds the exposure.
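
The lag-threshold idea in the last point can be sketched as a small policy function. This is a minimal illustration: the function name `write_policy`, the action labels, and both thresholds are hypothetical, and how lag is measured is left to the monitoring system.

```python
def write_policy(lag_seconds, alert_after=1.0, read_only_after=5.0):
    """Map a measured replication lag to an action.

    The default thresholds are placeholders, not recommendations.
    """
    if lag_seconds < 0:
        raise ValueError("lag cannot be negative")
    if alert_after > read_only_after:
        raise ValueError("alert threshold must not exceed read-only threshold")
    if lag_seconds >= read_only_after:
        return "read-only"      # stop accepting writes until replicas catch up
    if lag_seconds >= alert_after:
        return "alert"          # keep writing, but page someone
    return "accept-writes"


for lag in (0.03, 2.0, 7.5):
    print(lag, write_policy(lag))
# 0.03 accept-writes
# 2.0 alert
# 7.5 read-only
```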

English phrases engineers use

Synchronous replication conversations

"We run sync replication to one standby for zero RPO."
"Writes are blocking on replica acknowledgement — that's the latency trade-off."
"We use semi-synchronous so one slow replica can't stall every write."
"The promoted replica has every committed write — no data loss on failover."

Asynchronous replication conversations

"The replica is 30ms behind the primary right now."
"Replication lag spiked during the backup window."
"We promoted the least-stale replica after the primary failed."
"A lag threshold triggers read-only mode to protect consistency."

Quick decision tree

Zero tolerance for losing acknowledged writes (finance, compliance) → Synchronous replication
Cross-region disaster recovery → Asynchronous replication (sync is usually too slow across regions)
Read scaling only, writes still go to one primary → Asynchronous read replicas
Want most safety without full latency cost → Semi-synchronous replication
Extremely high write throughput, latency-sensitive → Asynchronous, with monitored lag thresholds
Need automatic failover with provable no-loss guarantee → Consider a consensus protocol (Raft) instead of simple sync replication
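
The same rules can be written as a function, which forces a choice the list leaves open: what happens when several conditions hold at once. In the sketch below, the parameter names and the precedence (loss requirements first, then topology, then throughput) are assumptions made for illustration.

```python
def recommend_replication(
    zero_loss_required=False,
    provable_automatic_failover=False,
    cross_region=False,
    read_scaling_only=False,
    high_throughput_latency_sensitive=False,
):
    """Return the replication mode suggested by the decision tree above."""
    if zero_loss_required and provable_automatic_failover:
        return "consensus protocol (e.g. Raft)"
    if zero_loss_required:
        return "synchronous"
    if cross_region:
        return "asynchronous"
    if read_scaling_only:
        return "asynchronous read replicas"
    if high_throughput_latency_sensitive:
        return "asynchronous with monitored lag thresholds"
    # Nothing forces either extreme: take the middle ground.
    return "semi-synchronous"


print(recommend_replication(zero_loss_required=True))  # synchronous
print(recommend_replication(cross_region=True))        # asynchronous
print(recommend_replication())                         # semi-synchronous
```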

Frequently asked questions

What exactly is replication lag?

Replication lag is the delay between a write being committed on the primary and that same write being applied on a replica. If a replica is "30ms behind," writes acknowledged within the last 30ms on the primary have not yet been applied there, so a read from that replica can return data that is up to 30ms stale. Asynchronous replication always has some lag. Synchronous replication keeps the synchronous replicas from missing any acknowledged write, but "confirmed" usually means received and flushed rather than applied: in PostgreSQL, a read on the synchronous standby can still miss a just-acknowledged write unless synchronous_commit is set to remote_apply.
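
Lag can also be expressed in bytes of log rather than in time. PostgreSQL reports log positions as LSNs, two hexadecimal numbers separated by a slash (for example from `pg_current_wal_lsn()` on the primary and the `replay_lsn` column of `pg_stat_replication`). A minimal sketch of turning two such values into a byte lag follows; the helper names and the sample LSNs are made up for illustration.

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN such as '0/3000060' to a byte position.

    The part before the slash is the high 32 bits, the part after it
    the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn, replica_lsn):
    """Bytes of WAL the replica still has to process (never negative)."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(replica_lsn))


print(lag_bytes("0/3000060", "0/3000000"))  # 96
```

Byte lag says how much work is outstanding; converting it to time requires knowing how fast the replica is currently applying the log.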

Why would anyone choose asynchronous replication if it risks data loss?

Because the write latency cost of synchronous replication is real and constant, paid on every single write, while the data-loss risk of async replication only materialises during the (hopefully rare) event of an unclean primary failure. For most applications, shaving 5-20ms off every write is worth accepting a small, bounded risk window of losing the last few unreplicated writes during a genuine crash.

What is semi-synchronous replication?

A middle ground: the primary waits for acknowledgement from at least one replica (not all of them) before confirming the write to the client, then replicates to the remaining replicas asynchronously. An acknowledged write therefore exists on at least two nodes, so it survives the loss of the primary as long as failover promotes a replica that received it; only writes that had not yet been acknowledged to the client can be lost. The latency cost is set by the fastest replica rather than the slowest.
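
The latency effect of waiting for one replica versus all of them can be sketched with a simplified model. It assumes the primary flushes locally first and then waits for the required number of acknowledgements, so commit latency is the local flush plus the k-th fastest replica acknowledgement. The function name and the sample timings are hypothetical.

```python
def commit_latency_ms(primary_flush_ms, replica_ack_ms, required_acks):
    """Estimate commit latency when waiting for `required_acks` replicas.

    replica_ack_ms holds, per replica, the time from the primary's local
    flush until that replica's acknowledgement arrives (round trip plus
    the replica's own flush).
    """
    if not 0 <= required_acks <= len(replica_ack_ms):
        raise ValueError("required_acks must be between 0 and the replica count")
    if required_acks == 0:
        return primary_flush_ms  # asynchronous: no waiting on replicas
    kth_fastest = sorted(replica_ack_ms)[required_acks - 1]
    return primary_flush_ms + kth_fastest


acks = [2.0, 40.0, 3.5]  # one nearby replica, one distant, one in between
print(commit_latency_ms(1.0, acks, 0))  # 1.0   asynchronous
print(commit_latency_ms(1.0, acks, 1))  # 3.0   semi-synchronous
print(commit_latency_ms(1.0, acks, 3))  # 41.0  wait for every replica
```

With one required acknowledgement, the distant replica no longer affects write latency at all, which is the point of the semi-synchronous compromise.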

How does synchronous replication relate to consensus protocols like Raft?

They solve overlapping problems with different guarantees. Raft-style consensus requires acknowledgement from a majority of nodes before committing an entry, which is a form of synchronous replication with a specific quorum rule — it survives the failure of a minority of nodes without any data loss or need for a human to choose which replica becomes primary. Traditional "synchronous replication" in a primary-replica database setup often means waiting for one designated replica, which is simpler but creates a single point of failure if that replica goes down (writes can stall until it recovers or is removed from the config).
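
The majority rule comes down to simple arithmetic, sketched below. The function names are illustrative, and `acks` is assumed to count the leader's own copy of the entry.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority."""
    if cluster_size < 1:
        raise ValueError("cluster must have at least one node")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Nodes that can fail while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


def is_committed(acks, cluster_size):
    """True once a majority (leader included) holds the entry."""
    return acks >= majority(cluster_size)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-node cluster tolerates no more failures than a three-node one, which is why consensus clusters are usually sized with an odd number of nodes.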

What does "promote a replica" mean, and how does replication mode affect it?

Promoting a replica means reconfiguring it to accept writes and become the new primary after a failover. With asynchronous replication, promoting the most-caught-up replica can still mean losing whatever writes were acknowledged by the old primary but never made it to that replica. With synchronous replication, a replica that had to confirm every write before it was acknowledged holds all of them, so promoting it loses nothing. When the primary only waits for some of several candidate replicas, failover must pick one that actually holds the latest acknowledged write.
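
Choosing the promotion target can be sketched as picking the replica with the highest log position. The replica names, the integer positions, and both helper functions below are hypothetical; real failover tooling also has to confirm that the old primary is truly down before promoting anything.

```python
def pick_promotion_candidate(replica_positions):
    """Return the name of the replica with the highest log position."""
    if not replica_positions:
        raise ValueError("no replicas available to promote")
    return max(replica_positions, key=replica_positions.get)


def unreplicated_writes(last_acknowledged, promoted_position):
    """Log positions acknowledged by the old primary but absent on the new one."""
    return max(0, last_acknowledged - promoted_position)


positions = {"replica-a": 118, "replica-b": 120}
candidate = pick_promotion_candidate(positions)
print(candidate)                                       # replica-b
print(unreplicated_writes(125, positions[candidate]))  # 5
```

Under asynchronous replication the second number can be greater than zero even for the best candidate; with a synchronous replica it is zero by construction.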

Does synchronous replication protect against corruption, not just loss?

No — it only protects against loss of already-acknowledged writes if the primary dies. It does nothing to protect against application-level bugs, bad migrations, or malicious deletes, which get replicated (synchronously or not) to every replica just as faithfully as good data. Point-in-time backups and audit logs are the tools for that class of problem, not replication mode.

Why does replication lag spike during a backup window?

Taking a backup consumes I/O and CPU resources on the primary or replica that would otherwise be used to apply incoming write-ahead-log entries, so the replica falls further behind during the backup and catches up afterward. This is a common operational gotcha with asynchronous replicas used both for read scaling and backups.
