Database high-availability comparison

Synchronous vs asynchronous replication

Every replicated database has to answer one question on every single write: does the primary wait for a replica to confirm before telling the client "done"? The answer trades write latency against the risk of losing the last few writes during a failover, and many "we lost some data during failover" incidents trace back to this one setting.

TL;DR

  • Synchronous replication waits for one or more replicas to confirm a write before acknowledging it to the client. No acknowledged write is lost as long as a synchronous replica is the one promoted, at the cost of extra latency on every write.
  • Asynchronous replication acknowledges the write as soon as the primary commits it, replicating to followers in the background — fast writes, but a crash before replication completes loses those writes.
  • Semi-synchronous is the common middle ground: wait for at least one replica, replicate to the rest asynchronously.
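
The three acknowledgement rules above can be sketched as a toy latency model. This is only an illustration under simplifying assumptions (the primary flushes first and then waits on replicas), not any database's actual implementation; the function name `ack_latency_ms` and all millisecond figures are hypothetical.

```python
def ack_latency_ms(primary_flush_ms, replica_rtts_ms, mode):
    """Toy model of how long one write takes to be acknowledged.

    replica_rtts_ms holds the round-trip time to each replica the
    primary could wait on; all numbers are illustrative.
    """
    if mode == "async":
        # The primary never waits on a replica.
        return primary_flush_ms
    if not replica_rtts_ms:
        raise ValueError("sync and semi-sync need at least one replica")
    if mode == "sync":
        # Wait for every required replica: the slowest one sets the pace.
        return primary_flush_ms + max(replica_rtts_ms)
    if mode == "semi-sync":
        # Wait for the first replica to confirm; the rest catch up later.
        return primary_flush_ms + min(replica_rtts_ms)
    raise ValueError(f"unknown mode: {mode!r}")


rtts = [5, 40]  # hypothetical: one nearby replica, one distant replica
for mode in ("async", "sync", "semi-sync"):
    print(mode, ack_latency_ms(2, rtts, mode))
```

With these made-up numbers the modes come out at 2, 42 and 7 ms respectively, which shows how a single distant replica dominates the fully synchronous case.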

Side-by-side comparison

| Aspect | Synchronous replication | Asynchronous replication |
| --- | --- | --- |
| Write acknowledged when | Primary + configured replica(s) both confirm | Primary alone commits it |
| Write latency | Higher: bounded by slowest required replica + network RTT | Lower: no waiting on replicas |
| Data loss on primary crash | None, for the synchronous replica set | Possible: unreplicated writes are lost |
| Replication lag | ~Zero for the synchronous replica | Non-zero, variable (ms to seconds) |
| Single point of failure risk | Yes: if the sync replica is unavailable, writes can stall | No: async replica outage doesn't block writes |
| Typical use case | Financial systems, regulated data, zero-RPO requirements | Read replicas, cross-region DR, most general workloads |
| Failover data loss | Zero (promoted replica has every acknowledged write) | Possible: must promote the least-stale replica |
| Relation to consensus | Similar spirit to Raft/Paxos quorum writes, but usually a fixed replica, not a majority vote | No agreement protocol, just "fire and eventually catch up" |
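
"Promote the least-stale replica" can be made concrete with a small sketch. It assumes each replica reports how far it has replayed the primary's log as a plain integer position; the function names and the positions are hypothetical, and in a real outage the failed primary's final position may not be knowable.

```python
def pick_promotion_candidate(replayed_positions):
    """Return the name of the replica that has replayed the most log."""
    if not replayed_positions:
        raise ValueError("no replicas available to promote")
    return max(replayed_positions, key=replayed_positions.get)


def unreplicated_amount(primary_position, replayed_positions):
    """Log the primary had written that even the best replica is missing."""
    best = pick_promotion_candidate(replayed_positions)
    return primary_position - replayed_positions[best]


# Hypothetical positions reported just before the primary failed.
positions = {"replica-a": 1000, "replica-b": 1450, "replica-c": 1200}
print(pick_promotion_candidate(positions))
print(unreplicated_amount(1500, positions))
```

Here replica-b is the least stale, and the 50 units of log it never received are what an asynchronous failover would lose.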

Code / config side-by-side

PostgreSQL — synchronous replication

# postgresql.conf on the primary
# ('replica-a' must match the standby's application_name)
synchronous_standby_names = 'replica-a'
synchronous_commit = on

# COMMIT now blocks until replica-a
# confirms it has received AND flushed
# the write-ahead log entry.
# Slower writes, zero loss on failover
# to replica-a.

PostgreSQL — asynchronous replication (default)

# postgresql.conf on the primary
synchronous_standby_names = ''
# (empty = async, the default)

# COMMIT returns as soon as the primary
# itself has flushed the write.
# Replicas apply the WAL stream in the
# background ("streaming replication").
# If the primary dies before a replica
# catches up, those last writes are lost.
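
The difference in failover behaviour between the two configurations can be illustrated with a few lines of Python. This is a deliberately simplified sketch in which writes are just numbered 1, 2, 3, ...; the function name `lost_on_failover` is hypothetical and nothing here talks to a real database.

```python
def lost_on_failover(acknowledged, replicated):
    """Writes the client was told were committed but the replica never got."""
    return sorted(set(acknowledged) - set(replicated))


# Async: the primary acknowledged writes 1..5, but the replica had only
# received 1..3 when the primary died.
async_lost = lost_on_failover([1, 2, 3, 4, 5], [1, 2, 3])

# Sync: a write is acknowledged only after the replica has it, so the
# acknowledged set can never run ahead of the replicated set. (The replica
# may hold write 4, which was flushed but not yet acknowledged to the client.)
sync_lost = lost_on_failover([1, 2, 3], [1, 2, 3, 4])

print(async_lost, sync_lost)
```

In the asynchronous case writes 4 and 5 are gone after promotion; in the synchronous case the list of lost acknowledged writes is empty.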

When to use synchronous replication

  • Zero data loss (RPO = 0) is a hard requirement. Financial ledgers, payment processors, and regulated healthcare data typically cannot tolerate losing even the last few committed transactions.
  • You have a reliable, low-latency network between primary and replica. Synchronous replication within the same data centre or availability zone keeps the added latency small (single-digit milliseconds).
  • You need a guaranteed-consistent failover target. When the promoted replica must have every write the old primary ever acknowledged, synchronous replication is the only way to guarantee it without a full consensus protocol.
  • Write volume is moderate enough to absorb the latency cost. Extremely high-throughput write workloads may find the per-write latency penalty unacceptable at scale.

When to use asynchronous replication

  • Write latency matters more than zero-RPO guarantees. Most consumer applications tolerate losing the last few milliseconds of writes during a rare crash far better than they tolerate slower writes on every request.
  • Replicas are geographically distant. Cross-region disaster-recovery replicas almost always use async replication — synchronous replication across continents would make every write intolerably slow.
  • You're scaling reads, not guaranteeing zero loss. Read replicas that exist purely to offload query traffic don't need to block every write on their own confirmation.
  • You can bound and monitor acceptable lag. Systems that alert on replication lag and switch to read-only mode above a threshold cap how much data a crash can lose while keeping async write latency in normal operation.
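
The lag-threshold idea in the last bullet might look roughly like the following. This is a minimal sketch, assuming some external monitor feeds in lag measurements; the class name `LagGuard`, the thresholds, and the separate lower "resume" threshold (added here to avoid flapping) are all hypothetical choices, not part of any particular database.

```python
from dataclasses import dataclass


@dataclass
class LagGuard:
    max_lag_ms: float      # enter read-only mode above this lag
    resume_lag_ms: float   # leave read-only mode at or below this lag
    read_only: bool = False

    def observe(self, lag_ms):
        """Record one lag measurement and return the current mode."""
        if not self.read_only and lag_ms > self.max_lag_ms:
            self.read_only = True
        elif self.read_only and lag_ms <= self.resume_lag_ms:
            self.read_only = False
        return self.read_only


guard = LagGuard(max_lag_ms=500, resume_lag_ms=100)
for lag in (50, 600, 300, 80):
    print(lag, "read-only" if guard.observe(lag) else "read-write")
```

Using two thresholds means a replica hovering around the limit does not toggle the system in and out of read-only mode on every measurement.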

English phrases engineers use

Synchronous replication conversations

  • "We run sync replication to one standby for zero RPO."
  • "Writes are blocking on replica acknowledgement — that's the latency trade-off."
  • "We use semi-synchronous so one slow replica can't stall every write."
  • "The promoted replica has every committed write — no data loss on failover."

Asynchronous replication conversations

  • "The replica is 30ms behind the primary right now."
  • "Replication lag spiked during the backup window."
  • "We promoted the least-stale replica after the primary failed."
  • "A lag threshold triggers read-only mode to protect consistency."

Quick decision tree

  • Zero tolerance for losing acknowledged writes (finance, compliance) → Synchronous replication
  • Cross-region disaster recovery → Asynchronous replication (sync is usually too slow across regions)
  • Read scaling only, writes still go to one primary → Asynchronous read replicas
  • Want most safety without full latency cost → Semi-synchronous replication
  • Extremely high write throughput, latency-sensitive → Asynchronous, with monitored lag thresholds
  • Need automatic failover with provable no-loss guarantee → Consider a consensus protocol (Raft) instead of simple sync replication

Frequently asked questions

What exactly is replication lag?

Replication lag is the delay between a write being committed on the primary and that same write being applied on a replica. If a replica is "30ms behind," writes acknowledged within the last 30ms on the primary may not yet have been applied there, so a read from that replica can return data that is up to 30ms stale. Asynchronous replication always has some lag. Synchronous replication guarantees that the synchronous replicas have received every acknowledged write, but a read on such a replica is only guaranteed to see that write if the primary also waits for it to be applied (in PostgreSQL, synchronous_commit = remote_apply).
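
Lag can also be expressed in bytes of write-ahead log rather than milliseconds. PostgreSQL identifies log positions with LSNs written as two hexadecimal numbers separated by a slash, and the sketch below shows how the distance between two of them could be computed by hand. The helper names are hypothetical and the LSN values are made up; on a real server the built-in pg_wal_lsn_diff function does this subtraction for you.

```python
def parse_lsn(lsn):
    """Convert an LSN string such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32.
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn, replica_lsn):
    """Bytes of WAL the replica still has to catch up on."""
    return parse_lsn(primary_lsn) - parse_lsn(replica_lsn)


# Made-up example positions for a primary and one replica.
print(lag_bytes("0/3000060", "0/3000000"))
```

Byte lag and time lag answer different questions: bytes say how much data is outstanding, while milliseconds say how stale a read might be.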

Why would anyone choose asynchronous replication if it risks data loss?

Because the write latency cost of synchronous replication is real and constant, paid on every single write, while the data-loss risk of async replication only materialises during the (hopefully rare) event of an unclean primary failure. For most applications, shaving 5-20ms off every write is worth accepting a small, bounded risk window of losing the last few unreplicated writes during a genuine crash.
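
A back-of-the-envelope calculation makes the trade-off easier to see. All inputs below are hypothetical (200 writes per second, a 10ms synchronous penalty, 30ms of asynchronous lag at the moment of a crash); substitute your own measurements.

```python
SECONDS_PER_DAY = 86_400


def added_wait_seconds_per_day(writes_per_second, sync_penalty_ms):
    """Total extra time clients spend waiting per day under sync replication."""
    return writes_per_second * SECONDS_PER_DAY * sync_penalty_ms / 1000


def writes_at_risk(writes_per_second, lag_ms):
    """Rough count of acknowledged writes not yet replicated at crash time."""
    return writes_per_second * lag_ms / 1000


print(added_wait_seconds_per_day(200, 10))  # paid every day, crash or not
print(writes_at_risk(200, 30))              # lost only if the primary crashes
```

Under these assumptions the synchronous setup costs 172,800 seconds of cumulative client waiting per day, while the asynchronous setup exposes about 6 writes to loss per unclean failure. Whether those 6 writes are acceptable is the real decision.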

What is semi-synchronous replication?

A middle ground: the primary waits for acknowledgement from at least one replica (not all of them) before confirming the write to the client, then replicates to the remaining replicas asynchronously. Every acknowledged write therefore exists on at least one replica, so promoting the most up-to-date replica loses no acknowledged write, while the primary avoids the full latency cost of waiting for every replica to confirm.