Advanced Distributed Systems #split-brain #fencing #leases #ZooKeeper

Split-Brain & Network Partitions

5 exercises — master network partition vocabulary: split-brain causes and prevention, fencing tokens, lease timeouts, minority partition behaviour, epoch numbers, and ZooKeeper distributed lock patterns.

Split-brain & partition quick reference
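  • Split-brain — two nodes simultaneously believe they are primary and both accept writes; typically caused by a network partition that stops them from seeing each other.
  • Prevention: quorum (a majority must vote) + fencing tokens + STONITH ("Shoot The Other Node In The Head": forcibly power off the suspect node).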
  • Fencing token — monotonically increasing epoch; storage rejects writes from lower-token leaders.
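  • Lease — time-bounded grant of authority that expires automatically; followers must wait out the full lease timeout (plus a margin for clock drift) before electing a new leader.
  • Minority partition — cannot form quorum → stops accepting writes and goes read-only or unavailable (the CP choice: consistency over availability).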
  • Epoch / generation — higher epoch supersedes lower; old leader's writes rejected if epoch is stale.
  • ZooKeeper ephemeral node — auto-deleted when the creating client's session ends (crash or session timeout); crash-safe lock primitive.
  • Watch-predecessor pattern — each lock waiter watches only the node immediately ahead of it in the queue, so a release wakes one waiter instead of all of them; prevents the herd effect.
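
The quorum, fencing-token, and lease rules above can be sketched in a few lines of Python. This is a minimal illustration, not any real system's API: `has_quorum`, `FencedStore`, and `Lease` are hypothetical names, the store is an in-memory dict, and clock values are passed in by the caller so the example stays deterministic.

```python
from dataclasses import dataclass, field


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A partition may elect a leader only with a strict majority of nodes."""
    return reachable > cluster_size // 2


@dataclass
class FencedStore:
    """Hypothetical storage service that remembers the highest token seen."""
    highest_token: int = 0
    data: dict = field(default_factory=dict)

    def write(self, token: int, key: str, value: str) -> bool:
        # A token lower than one already seen means the writer is a
        # superseded leader, so the write is rejected.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


@dataclass
class Lease:
    """Hypothetical lease record; times are seconds on the grantor's clock."""
    holder: str
    expires_at: float

    def safe_to_elect(self, now: float, max_clock_drift: float) -> bool:
        # Followers wait for the full lease plus an assumed drift margin
        # before treating the old leader's authority as expired.
        return now >= self.expires_at + max_clock_drift


store = FencedStore()
old_leader_ok = store.write(1, "x", "from old leader")    # accepted
new_leader_ok = store.write(2, "x", "from new leader")    # accepted, token advances
stale_write_ok = store.write(1, "x", "late, stale write")  # rejected
```

In a 5-node cluster split 3/2, `has_quorum(3, 5)` is true and `has_quorum(2, 5)` is false, so only the majority side can elect a leader. An even 2/2 split of a 4-node cluster leaves neither side with quorum, which is one reason odd cluster sizes are common.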
1 / 5

A site reliability engineer describes an incident: "We had a split-brain scenario in our database cluster. Both nodes thought they were the primary and accepted writes. The result was diverged data that took 4 hours to reconcile."

What causes split-brain and how is it prevented?