โš ๏ธ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram

[Diagram: Raft leader election in a 3-node cluster, shown in two phases.
Normal operation: Node A 👑 is the leader (term 3) and sends periodic heartbeats to followers B and C.
Election (leader fails): Node A ✕ crashes and heartbeats stop. Node B's randomized election timeout (300-500 ms) fires first; B becomes a candidate 🗳️, sends RequestVote to C, receives the vote, and wins with a 2/3 majority, becoming the new leader for term 4.]

How It Works

In distributed systems, sometimes exactly one node must be responsible for a task: accepting writes, running a batch job, or coordinating others. Leader election algorithms ensure one node is chosen, and a new leader is elected if the current one fails.

Raft Consensus (Most Interview-Relevant)

Nodes exist in one of three states: Follower, Candidate, or Leader. The leader sends periodic heartbeats. If followers don't hear from the leader within an election timeout, they transition to candidate and start an election.

  1. Normal operation: Leader sends heartbeats every ~150ms. Followers reset their election timer on each heartbeat.
  2. Leader fails: Followers stop receiving heartbeats. After a randomized election timeout (300-500ms), the first follower to time out becomes a candidate.
  3. Election: Candidate increments the term number, votes for itself, and sends RequestVote RPCs to other nodes.
  4. Majority wins: If the candidate receives votes from a majority (e.g., 2/3 nodes), it becomes the new leader.
  5. New leader: Starts sending heartbeats immediately. All writes go through the leader, which replicates to followers.
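The five steps above can be sketched as a toy in-memory election. All names and fields here are illustrative, and real Raft additionally compares log freshness before granting a vote, which this sketch omits:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.state = "follower"
        self.term = 3          # current term during normal operation
        self.voted_for = None
        self.alive = True

    def request_vote(self, term, candidate):
        # Grant at most one vote per term (Raft's core safety rule).
        if not self.alive:
            return False
        if term > self.term:           # newer term: adopt it, forget old vote
            self.term = term
            self.voted_for = None
        if term == self.term and self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False

def run_election(candidate, cluster):
    # Step 3: increment term, vote for self, send RequestVote to peers.
    candidate.state = "candidate"
    candidate.term += 1
    candidate.voted_for = candidate.name
    votes = 1
    for peer in cluster:
        if peer is not candidate and peer.request_vote(candidate.term, candidate.name):
            votes += 1
    # Step 4: majority wins.
    if votes > len(cluster) // 2:
        candidate.state = "leader"
    return votes

a, b, c = Node("A"), Node("B"), Node("C")
a.alive = False                     # step 2: leader A has crashed
votes = run_election(b, [a, b, c])  # B's timeout fired first
print(b.state, b.term, votes)       # leader 4 2
```

B collects its own vote plus C's, reaching the 2/3 majority from the diagram and becoming leader for term 4.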

Approaches to Leader Election

Consensus-Based (Raft, Paxos, ZAB)

Formal, proven algorithms, used internally by etcd, ZooKeeper, and Consul. They guarantee safety (at most one leader per term) even during network partitions.

Coordination Service (ZooKeeper/etcd)

Create an ephemeral lock. The node holding the lock is leader. If it dies, the lock expires, and another node takes over. Simpler to use as a client than implementing Raft yourself.
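The pattern can be illustrated with a toy lock that mimics a leased key in etcd or an ephemeral node in ZooKeeper. This is a sketch of the idea only, not a real client API; in practice the TTL expiry happens server-side when the holder's session or lease lapses:

```python
class EphemeralLock:
    """Toy coordination-service lock with a TTL-based lease."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node, now):
        # The lock is free if never held, or if the holder's lease expired
        # (e.g. the holder crashed and stopped renewing).
        if self.holder is None or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return self.holder == node  # current holder may re-acquire

    def renew(self, node, now):
        # A live leader keeps its lease alive by renewing before expiry.
        if self.holder == node and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False

lock = EphemeralLock(ttl=5.0)
assert lock.try_acquire("node-1", now=0.0)      # node-1 becomes leader
assert not lock.try_acquire("node-2", now=2.0)  # lease still live, rejected
# node-1 crashes and stops renewing; by t=6 its lease has expired.
assert lock.try_acquire("node-2", now=6.0)      # node-2 takes over
print(lock.holder)  # node-2
```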

Split Brain Prevention

The most dangerous failure mode is split brain: a network partition causes two nodes to think they're leader. Raft prevents this by requiring a majority quorum โ€” in a 3-node cluster, you need 2/3 votes. During a partition, only the partition with the majority can elect a leader.

Key Design Decisions

🔢

3 nodes vs 5 nodes: A 3-node cluster tolerates 1 failure (2/3 majority). A 5-node cluster tolerates 2 failures (3/5 majority), but every write needs 3 ACKs instead of 2, so write latency is higher. Never use even numbers: a 4-node cluster tolerates only 1 failure (same as 3 nodes) but with higher write latency.
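The arithmetic behind these sizing rules, and behind quorum-based split-brain prevention, fits in two small functions (a quick illustration, not from any particular library):

```python
def majority(n):
    # Smallest vote count that constitutes a quorum in an n-node cluster.
    return n // 2 + 1

def fault_tolerance(n):
    # How many nodes can fail while the survivors still reach a quorum.
    return (n - 1) // 2

for n in (3, 4, 5):
    print(n, majority(n), fault_tolerance(n))
# 3 nodes: quorum 2, tolerates 1 failure
# 4 nodes: quorum 3, tolerates 1 failure (no gain over 3 nodes)
# 5 nodes: quorum 3, tolerates 2 failures
```

During a partition, at most one side can hold `majority(n)` nodes, which is why two leaders can never be elected in the same term.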

โฑ๏ธ

Election timeout tuning: Too short = false elections during GC pauses or network blips. Too long = slow failover. The timeout must be randomized (300-500ms) to prevent two candidates from always splitting votes.
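A minimal sketch of that randomization, with the bounds taken from the figures above:

```python
import random

HEARTBEAT_MS = 150  # leader heartbeat interval from the section above

def election_timeout_ms(rng=random):
    # Each follower draws its own timeout from a band well above the
    # heartbeat interval, so a single delayed heartbeat doesn't trigger
    # an election and two followers rarely time out at the same moment.
    return rng.uniform(300, 500)

timeouts = [election_timeout_ms() for _ in range(3)]
print(all(300 <= t <= 500 for t in timeouts))  # True
```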

🔒

Fencing tokens: Even with leader election, a slow leader might not realize it has been replaced. Use monotonically increasing fencing tokens so downstream services can reject a stale leader's writes.
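A hypothetical downstream store that enforces fencing tokens might look like this (class and field names are illustrative):

```python
class FencedStore:
    """Downstream service that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, value):
        # A stale leader holds a token issued before its replacement's,
        # so the comparison below rejects its late writes.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(33, "x", "old-leader")   # leader holding token 33
assert store.write(34, "x", "new-leader")   # replacement got a larger token
assert not store.write(33, "x", "stale")    # old leader's late write rejected
print(store.data["x"])  # new-leader
```

The tokens are issued by the coordination service (e.g. as the lock's version number), so they increase monotonically across successive leaders.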

๐Ÿ—๏ธ

Embedded vs external coordination: Implement Raft in your service (complex, full control) or use etcd/ZooKeeper as a coordination service (simpler, extra dependency). Most teams should use existing coordination services.

When to Use

Leader election appears whenever you need a single-writer guarantee or exactly-once task execution in a distributed system.

  • "How do you ensure only one instance runs the scheduled job?" โ€” Leader election via distributed lock
  • "What happens if the primary database node goes down?" โ€” Raft-based failover, automatic leader promotion
  • "Design a distributed lock service" โ€” ZooKeeper/etcd with ephemeral nodes
  • "How does Kubernetes manage cluster state?" โ€” etcd uses Raft for leader election and log replication

Interview signal: The interviewer wants to see you understand how distributed systems achieve single-writer guarantees and handle leader failures without split brain.

Real-World Examples

  • etcd (Kubernetes) โ€” Every K8s cluster runs etcd, which uses Raft for leader election and log replication. When you kubectl apply, the write goes through Raft consensus. If the etcd leader dies, Raft elects a new one in ~1 second.
  • CockroachDB โ€” Each range (chunk of data) has its own Raft group with a leader. When a node fails, Raft automatically elects new leaders for affected ranges โ€” self-healing without human intervention.
  • Apache Kafka โ€” Uses ZooKeeper (or KRaft since 3.3+) for controller election. The controller manages partition leadership and broker membership across the cluster.
  • Redis Sentinel โ€” Monitors Redis master, performs automatic failover using quorum-based leader election. Requires majority of sentinels to agree on failure before promoting a replica.

Back-of-Envelope Numbers

Metric                           Value
Raft heartbeat interval          100–150 ms
Election timeout (randomized)    300–500 ms (2–5× heartbeat)
Typical election completion      ~300–500 ms
Worst case (split votes)         ~2 seconds
etcd write throughput            ~10K writes/sec
etcd read throughput             ~50K reads/sec
3-node fault tolerance           1 node failure
5-node fault tolerance           2 node failures