Distributed Systems Consensus Vocabulary: Raft, Paxos, CAP, and Linearizability Explained
Master the vocabulary of distributed consensus for senior engineers: Raft leader election, quorum, linearizability vs serializability, CAP and PACELC theorems, split-brain, vector clocks, 2PC, and saga.
Distributed consensus is one of the most technically demanding areas of backend and platform engineering — and one of the most vocabulary-rich. When your team debates “do we need linearizability here, or is causal consistency enough?” or “can we tolerate split-brain if we use leader fencing?”, precision matters. The wrong word costs you hours in a design review.
This article covers the vocabulary you need to participate in distributed systems design discussions, pass senior backend and infrastructure interviews, and read foundational papers such as the original Raft and Paxos papers.
Why Consensus Is Hard — The Vocabulary of the Problem Space
Distributed Consensus
The problem of getting multiple nodes in a network to agree on a single value, even in the presence of node failures and network delays. Solved by consensus protocols like Raft and Paxos.
“Distributed consensus is required whenever multiple replicas must agree on the order of writes — otherwise, different replicas can see different histories.”
Fault Tolerance
The ability of a distributed system to continue operating correctly when some nodes fail. A system that tolerates f failures requires at least 2f + 1 nodes for Raft-style consensus.
“Our Raft cluster has 5 nodes — it can tolerate 2 simultaneous node failures without losing the ability to elect a leader or commit writes.”
Quorum
A majority of nodes that must agree for an operation to proceed. In a 5-node cluster, quorum = 3. Operations that don’t reach quorum are blocked until enough nodes are available.
“We lost quorum when the second node crashed — the cluster stopped accepting writes until we brought the third node back online.”
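A minimal sketch of the arithmetic behind these two definitions (the cluster sizes are just illustrative):

```python
def quorum_size(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def max_tolerated_failures(n: int) -> int:
    """How many nodes can fail while a majority can still be formed."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes: quorum = {quorum_size(n)}, tolerates {max_tolerated_failures(n)} failures")
# 3 nodes: quorum = 2, tolerates 1 failures
# 5 nodes: quorum = 3, tolerates 2 failures
# 7 nodes: quorum = 4, tolerates 3 failures
```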
Network Partition
A failure where nodes cannot communicate with each other — the cluster is divided into two or more isolated groups (partitions). Network partitions are among the most challenging failure scenarios in distributed systems.
“The network partition split our 5-node cluster into a 3-node and a 2-node group. The 3-node partition had quorum and kept operating. The 2-node partition correctly refused writes.”
Section 1: Raft — The Vocabulary of the Modern Consensus Algorithm
Raft is the consensus algorithm used by etcd (Kubernetes), CockroachDB, TiKV, and many other modern distributed systems. Understanding Raft vocabulary is essential for platform engineers.
Leader, Follower, Candidate
The three states a Raft node can be in. The Leader handles all client requests and replicates entries to Followers. When a Leader fails, Followers become Candidates and start an election. The Candidate that receives votes from a quorum becomes the new Leader.
“The leader node crashed. Within 150ms, three followers had timed out, declared themselves candidates, and started requesting votes. The node with the most up-to-date log won the election.”
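A toy sketch of the three roles and the vote-counting rule described above; this is not etcd’s implementation, and the node names are made up:

```python
from enum import Enum, auto

class Role(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()

def run_election(candidate_id: str, votes_granted: int, cluster_size: int) -> Role:
    """A candidate becomes leader only with votes from a majority (its own vote included)."""
    if votes_granted >= cluster_size // 2 + 1:
        print(f"{candidate_id} won the election")
        return Role.LEADER
    # Without a majority the node stays a candidate and retries after a new timeout.
    return Role.CANDIDATE

print(run_election("node-b", votes_granted=3, cluster_size=5))  # Role.LEADER
```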
Election Term
A monotonically increasing number that identifies a leadership period. Each election starts a new term. Terms are used to detect stale messages — a node that receives a message with a higher term knows it has missed an election and updates itself.
“The leader was in term 12. When it crashed and the new election completed, the new leader was in term 13. We can see this in the etcd logs.”
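A small sketch of the stale-message rule that terms enable (the message fields are illustrative, not a real Raft RPC):

```python
def on_message(local_term: int, message_term: int) -> tuple[int, str]:
    """Term rule: a higher term wins, a lower term is rejected as stale."""
    if message_term > local_term:
        # We missed at least one election: adopt the newer term and step down to follower.
        return message_term, "step_down"
    if message_term < local_term:
        # The sender is behind: reject so it learns the current term.
        return local_term, "reject_stale"
    return local_term, "accept"

print(on_message(local_term=12, message_term=13))  # (13, 'step_down')
```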
Log Replication
The process by which the leader sends log entries (client commands) to all followers. An entry is considered committed only after the leader has replicated it to a quorum of followers.
“Log replication was slow — commits were waiting on the follower in the other availability zone, because quorum couldn’t be reached without it. We solved this by making that follower a non-voting learner so it no longer counted towards quorum.”
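A sketch of the commit rule: an entry counts as committed once the leader plus enough followers (a majority) have it in their logs. The match indexes below are illustrative:

```python
def committed_index(leader_last_index: int, follower_match_indexes: list[int]) -> int:
    """Highest log index replicated on a majority of the cluster (the leader counts as one replica)."""
    indexes = sorted(follower_match_indexes + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    return indexes[majority - 1]

# 5-node cluster: leader at index 10, followers at 10, 9, 7, 4.
print(committed_index(10, [10, 9, 7, 4]))  # 9: replicated on 3 of 5 nodes
```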
Heartbeat
A periodic message sent by the leader to all followers to prevent them from starting an election. If a follower doesn’t receive a heartbeat within the election timeout, it assumes the leader is dead and starts an election.
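A sketch of the randomised election timeout that ties heartbeats and elections together. The 150 to 300 ms range follows the Raft paper’s suggested defaults; the callback is a stand-in for real network receive logic:

```python
import random
import time

ELECTION_TIMEOUT_RANGE_MS = (150, 300)  # randomised per node so elections rarely collide

def wait_for_heartbeat(heartbeat_received) -> bool:
    """Return True if a heartbeat arrived in time, False if this node should start an election."""
    timeout_ms = random.uniform(*ELECTION_TIMEOUT_RANGE_MS)
    deadline = time.monotonic() + timeout_ms / 1000
    while time.monotonic() < deadline:
        if heartbeat_received():
            return True
        time.sleep(0.005)
    return False  # no heartbeat: become a candidate and request votes

print(wait_for_heartbeat(lambda: False))  # False after 150-300 ms of silence
```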
Section 2: Paxos and ZooKeeper
Paxos
The original distributed consensus algorithm, described by Leslie Lamport. Known for being notoriously difficult to understand and implement correctly. Most real implementations use Multi-Paxos (an optimisation for replicated state machines) or switch to Raft.
“ZooKeeper uses Zab, a Paxos variant. Kafka relies on ZooKeeper for broker coordination — this is why Kafka clusters had ZooKeeper as a dependency for so long.”
Zab (ZooKeeper Atomic Broadcast)
The consensus protocol used by Apache ZooKeeper. Designed for primary-backup replication in systems like Kafka broker coordination. Conceptually similar to Paxos. (Newer Kafka versions remove the ZooKeeper dependency in favour of KRaft, Kafka’s own Raft-based controller quorum.)
Section 3: Consistency Models
Linearizability (Strong Consistency)
The strongest consistency model. Every read returns the most recent write, and all operations appear to execute instantaneously at a single point in time. Linearizability is expensive — it requires coordination across replicas on every operation.
“etcd provides linearizable reads by default — every read goes through the leader to ensure it reflects all committed writes. This adds latency but guarantees correctness for Kubernetes controller decisions.”
Serializability
A consistency model for transactions (multiple operations). Transactions appear to execute one at a time (as if serialised), even though they run concurrently. Different from linearizability, which applies to individual operations.
“The database provides serializable transactions — concurrent transactions are scheduled to produce results equivalent to some serial execution order. This prevents write skew and phantom reads.”
Causal Consistency
Operations that are causally related appear in the same order to all nodes. Operations without a causal relationship may be seen in different orders by different nodes. Weaker than linearizability, but significantly cheaper.
“We use causal consistency for the comment reply thread — replies always appear after the comment they reference, but two unrelated comments may appear in different orders to different users.”
Eventual Consistency
Given enough time without further updates, all replicas converge to the same value. The weakest useful consistency model — provides no guarantees about when convergence happens or what intermediate states look like.
“The shopping cart uses eventual consistency — if you add an item on your phone while off WiFi, it will sync when connectivity is restored. We handle conflicts with a last-write-wins policy.”
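A sketch of the last-write-wins policy mentioned in the example; timestamps and values are made up, and real systems also have to worry about clock skew between writers:

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    written_at_ms: int  # wall-clock timestamp attached by the writer

def last_write_wins(a: Versioned, b: Versioned) -> Versioned:
    """Resolve a conflict by keeping the replica value with the newest timestamp."""
    return a if a.written_at_ms >= b.written_at_ms else b

phone = Versioned("cart: 2 items", written_at_ms=1_700_000_100_000)
laptop = Versioned("cart: 1 item", written_at_ms=1_700_000_050_000)
print(last_write_wins(phone, laptop).value)  # "cart: 2 items"
```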
Section 4: CAP and PACELC Theorems
CAP Theorem
A foundational theorem (Brewer, 2000): a distributed system can guarantee at most two of three properties simultaneously:
- Consistency: every read receives the most recent write
- Availability: every request receives a response (not an error)
- Partition Tolerance: the system continues to operate despite network partitions
In practice, network partitions are unavoidable — so the real choice is whether to favour consistency (CP) or availability (AP) during a partition.
“Cassandra is an AP system — during a partition, it prioritises availability and accepts writes on both sides of the partition. You get eventual consistency, but at low consistency levels a partition won’t force it to refuse writes.”
PACELC Theorem
An extension of CAP by Daniel Abadi: even when there is no partition (P), a distributed system must trade off between Latency (L) and Consistency (C). This captures the real-world tradeoff that runs constantly, not only during failures.
“DynamoDB is PA/EL — partition-available, and the normal-operation tradeoff is low latency over strong consistency. That’s why DynamoDB defaults to eventually consistent reads for latency-sensitive applications.”
CP vs. AP Systems
- CP systems (Consistent + Partition-tolerant): prefer to become unavailable rather than serve stale data. Examples: HBase, ZooKeeper, etcd, Spanner.
- AP systems (Available + Partition-tolerant): prefer to continue serving requests even with potentially stale data. Examples: Cassandra, DynamoDB (eventual), CouchDB.
Section 5: Split-Brain and Fencing
Split-Brain
A scenario where a network partition causes two parts of the cluster to each believe they are the active leader/primary — both sides accept writes independently, leading to data divergence that may be impossible to reconcile automatically.
“The split-brain event caused two primary database replicas to accept conflicting writes for 4 minutes. Reconciling the diverged state required manual intervention and some data loss.”
Leader Fencing
A mechanism to prevent a former leader from making decisions after it has been dethroned — ensuring there is only ever one authoritative leader at a time. Implemented using epoch tokens, lease-based locks, or STONITH (Shoot The Other Node In The Head).
“We use epoch-based fencing — each new leader has a higher epoch number. Old-epoch messages from the deposed leader are rejected by all nodes.”
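A sketch of the epoch-based fencing check described above: any write stamped with an older epoch than the highest one a node has seen is rejected (the storage class is purely illustrative):

```python
class FencedStore:
    """Accepts writes only from the leader with the highest epoch seen so far."""

    def __init__(self) -> None:
        self.highest_epoch_seen = 0
        self.data: dict[str, str] = {}

    def write(self, key: str, value: str, epoch: int) -> bool:
        if epoch < self.highest_epoch_seen:
            return False  # message from a deposed leader: fence it off
        self.highest_epoch_seen = epoch
        self.data[key] = value
        return True

store = FencedStore()
print(store.write("k", "from new leader", epoch=13))  # True
print(store.write("k", "from old leader", epoch=12))  # False: rejected
```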
STONITH (Shoot The Other Node In The Head)
A fencing technique that forcibly shuts down a node believed to be misbehaving — by cutting power, revoking its network access, or sending a remote shutdown command. Brutal but reliable.
“Our cluster uses STONITH via IPMI — if a node fails to respond to the fencing agent, we cut power to that server via the out-of-band management interface.”
Section 6: Clocks in Distributed Systems
Lamport Timestamp
A logical clock that orders events in a distributed system. Each node increments its counter on every event; when receiving a message, it sets its counter to max(local, received) + 1. If event a happened before event b, then L(a) < L(b), but the converse doesn’t hold: Lamport timestamps can’t tell whether two events were concurrent.
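A minimal Lamport clock following the rules above (node and event names are illustrative):

```python
class LamportClock:
    def __init__(self) -> None:
        self.counter = 0

    def local_event(self) -> int:
        """Tick on every local event or message send."""
        self.counter += 1
        return self.counter

    def on_receive(self, received: int) -> int:
        """On receive, jump past the sender's timestamp."""
        self.counter = max(self.counter, received) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t_send = a.local_event()       # a: 1
t_recv = b.on_receive(t_send)  # b: 2, so the receive is ordered after the send
print(t_send, t_recv)
```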
Vector Clock
An extension of Lamport timestamps that captures causal relationships. Each node maintains a vector of counters for all nodes. Can detect concurrent events (unrelated) and causally ordered events (one happened before the other). Used in distributed databases to detect conflicts.
“Amazon’s original Dynamo (the 2007 paper) used vector clocks to detect conflicting writes. DynamoDB later favoured a simpler last-write-wins approach instead.”
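A sketch of the comparison a vector clock makes possible: detecting whether two versions are causally ordered or concurrent (node names are made up):

```python
def happened_before(vc_a: dict[str, int], vc_b: dict[str, int]) -> bool:
    """True if A is causally before B: every counter <= B's and at least one strictly smaller."""
    nodes = sorted(vc_a.keys() | vc_b.keys())
    a = [vc_a.get(n, 0) for n in nodes]
    b = [vc_b.get(n, 0) for n in nodes]
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(vc_a: dict[str, int], vc_b: dict[str, int]) -> bool:
    """Neither happened before the other: a genuine conflict the application must resolve."""
    return not happened_before(vc_a, vc_b) and not happened_before(vc_b, vc_a)

print(happened_before({"n1": 1}, {"n1": 2, "n2": 1}))      # True: causally ordered
print(concurrent({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))  # True: concurrent writes
```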
Hybrid Logical Clock (HLC)
Combines physical time and logical time — stays close to wall-clock time (for human readability and SQL compatibility) while maintaining Lamport’s ordering properties. Used by CockroachDB and YugabyteDB for globally distributed SQL.
“CockroachDB uses HLCs so transaction timestamps are close to real time — this enables geographically distributed transactions to read data with millisecond-accurate time boundaries.”
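A simplified HLC sketch following the update rules from the original HLC paper; it is not CockroachDB’s implementation, and it omits the bounds checking a production clock needs:

```python
import time

class HybridLogicalClock:
    """Simplified HLC: a physical-time part (ms) plus a logical counter for ties."""

    def __init__(self) -> None:
        self.wall_ms = 0   # highest physical time observed so far
        self.logical = 0   # breaks ties when physical time has not advanced

    def _physical_now(self) -> int:
        return int(time.time() * 1000)

    def send(self) -> tuple[int, int]:
        """Timestamp for a local event or outgoing message."""
        now = self._physical_now()
        if now > self.wall_ms:
            self.wall_ms, self.logical = now, 0
        else:
            self.logical += 1
        return self.wall_ms, self.logical

    def receive(self, remote_ms: int, remote_logical: int) -> tuple[int, int]:
        """Merge a remote timestamp so local timestamps never go backwards relative to it."""
        now = self._physical_now()
        new_ms = max(self.wall_ms, remote_ms, now)
        if new_ms == self.wall_ms == remote_ms:
            self.logical = max(self.logical, remote_logical) + 1
        elif new_ms == self.wall_ms:
            self.logical += 1
        elif new_ms == remote_ms:
            self.logical = remote_logical + 1
        else:
            self.logical = 0
        self.wall_ms = new_ms
        return self.wall_ms, self.logical

clock = HybridLogicalClock()
print(clock.send())                     # close to wall-clock time, counter 0
print(clock.receive(2_000_000_000_000, 4))  # jumps past a remote timestamp from the "future"
```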
Section 7: Distributed Transactions
Two-Phase Commit (2PC)
A distributed transaction protocol where a coordinator asks all participants to vote (prepare phase), and if all vote yes, commits the transaction on all nodes (commit phase). Provides atomic, all-or-nothing commits across multiple nodes. Main weakness: if the coordinator crashes after prepare but before commit, participants are blocked indefinitely (the “blocking problem”).
“We use 2PC for cross-shard write transactions in our payments system — it ensures either all shards commit a payment or all roll it back. The coordinator failure risk is mitigated by a redundant coordinator with Raft-based leader election.”
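A sketch of the two phases from the coordinator’s point of view. The participant interface is invented for illustration; a real coordinator also persists its decision so it can recover after a crash:

```python
class Participant:
    """Illustrative participant: in reality each call is an RPC to another node."""

    def __init__(self, name: str, will_vote_yes: bool = True) -> None:
        self.name, self.will_vote_yes = name, will_vote_yes

    def prepare(self) -> bool:
        return self.will_vote_yes  # phase 1: vote yes only if the local transaction can commit

    def commit(self) -> None:
        print(f"{self.name}: committed")

    def rollback(self) -> None:
        print(f"{self.name}: rolled back")

def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1 (prepare): every participant must vote yes.
    if all(p.prepare() for p in participants):
        # Phase 2 (commit): once the decision is made, every participant commits.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote (or timeout) aborts the whole transaction.
    for p in participants:
        p.rollback()
    return False

two_phase_commit([Participant("orders-shard"), Participant("payments-shard", will_vote_yes=False)])
```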
Saga (as a distributed transaction alternative)
A long-running transaction split into a sequence of local transactions, each with a compensating transaction that undoes its effect if a later step fails. Avoids the blocking problem of 2PC but does not provide full ACID isolation.
“We use Sagas for the order fulfilment flow across 5 services — each step commits locally and emits an event. If the shipping service fails, compensating transactions roll back the warehouse reservation and refund the payment.”
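A sketch of the saga pattern from the example: each step pairs a local transaction with a compensating action, and compensations run in reverse order when a later step fails (step names are illustrative):

```python
from typing import Callable

Step = tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, compensation)

def run_saga(steps: list[Step]) -> bool:
    """Run each local transaction in order; on failure, run compensations in reverse."""
    completed: list[Step] = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, action, compensate))
        except Exception as exc:
            print(f"{name} failed ({exc}); compensating earlier steps")
            for done_name, _, undo in reversed(completed):
                undo()
                print(f"compensated {done_name}")
            return False
    return True

def create_shipment() -> None:
    raise RuntimeError("carrier API unavailable")

run_saga([
    ("reserve-stock",   lambda: print("stock reserved"),  lambda: print("stock released")),
    ("charge-payment",  lambda: print("payment charged"), lambda: print("payment refunded")),
    ("create-shipment", create_shipment,                  lambda: print("nothing to undo")),
])
```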
Discussion Language
| Context | Phrase |
|---|---|
| Recommending consistency level | “For the leader election use case, we need linearizability — strong consistency is non-negotiable. For the analytics read replica, eventual consistency is fine.” |
| Explaining a failure scenario | “If we have a 3-node consensus cluster and two nodes go down simultaneously, we lose quorum. The surviving node will refuse writes until quorum is restored.” |
| CAP trade-off | “Given our SLA requires 99.99% availability, we should build an AP system and handle conflict resolution at the application layer. A CP system will reject writes during any partition.” |
| Explaining split-brain risk | “Without a fencing mechanism, a slow leader could rejoin the cluster after a partition and attempt to write — resulting in split-brain. Leader fencing via epoch tokens prevents this.” |
Related Resources
- Software Architecture Vocabulary — system design vocabulary
- Distributed Systems Vocabulary Practice — vocabulary exercises
- Technical Interview System Design English — using this vocabulary in interviews