Leader election / consensus
When to reach for this
Reach for this when…
- Distributed locks
- Primary election in a replica set
- Workflow / job coordination ("only one runs at a time")
- Configuration change coordination
- Singleton services across a fleet
Not really this pattern when…
- Work is naturally partitioned (each worker owns disjoint keys)
- At-most-once semantics can be relaxed (most async work)
- Single-master DB handles it for you
Good vs bad answer
Interviewer probe
“One worker runs a nightly migration. Ensure only one runs.”
Weak answer
"Each worker checks a flag in the database; whoever sets it first wins."
Strong answer
"Flag-in-DB races under concurrent reads and can't handle leader failure without manual cleanup. Use etcd: each worker tries a transactional put to /migrations/nightly with a 30-second lease. The one who wins the CAS is the leader; others watch the key. Leader refreshes the lease every 10 s while working. If leader crashes or partitions, lease expires in ≤30 s; a watcher becomes the new leader. Every write from the worker includes the fencing token from its lease; the migration tracker rejects writes with stale tokens — so even a split-brain old leader that comes back can't corrupt. Migration itself is idempotent (keyed by step_id) so duplicate execution under worst-case is safe."
Why it wins: Real consensus system, lease with fencing, idempotent sink as defence in depth.
Cheat sheet
- •Use etcd / ZooKeeper / Consul. Don't roll your own.
- •Lease + heartbeat + watch. CAS to acquire.
- •Fencing tokens enforced at the sink.
- •Quorum: 3 or 5 nodes. Never 2 or 4.
- •TTL longer than worst-case GC pause.
- •Redlock for best-effort only. Not for correctness.
- •Idempotent writes as a safety net.
Core concept
Consensus gives you "exactly one node holds this role, and the rest agree". The algorithms are Paxos (theory) and Raft (practical, used by etcd, Consul, CockroachDB). Don't implement them. Use a consensus service.
Practical recipe with etcd / ZooKeeper:
- 1Each node calls
LeaseGrant(ttl=10s); gets a unique lease. - 2Each node attempts
Put(key="/service/leader", value=node_id, lease=<mine>)with a compare-and-swap that succeeds only if key is empty. - 3The winner is leader; others
Watchthe key for changes. - 4Leader
KeepAlives the lease (heartbeat). If it dies or partitions, the lease expires; key is deleted; watchers see the change; they race for the key again.
The fencing problem: network partition makes old leader think it's still leader; new leader elected; both try to write. Fix: fencing tokens — every lease has a monotonically increasing fencing token; downstream (storage) rejects writes with stale tokens. Without fencing, split-brain is inevitable.
Quorum sizing: 3 or 5 nodes for the consensus cluster. N nodes survive ⌊(N-1)/2⌋ failures. Don't run 2 or 4 (no majority on tie). Don't run > 7 (write latency suffers).
Client library > rolling your own. Use curator (ZooKeeper), etcd-client lease libs, or go-redsync/redlock (if you accept redlock's well-known trade-offs for Redis-based locking).
Canonical examples
- →Kafka controller election
- →MongoDB / Redis replica-set primary election
- →Distributed cron (one runner per job)
- →Kubernetes scheduler leader
- →Custom migration runner ("only one pod migrates")
Decision levers
Consensus service choice
etcd: default for Kubernetes ecosystem, great Go client. ZooKeeper: mature, JVM ecosystem (Kafka, Hadoop). Consul: KV + service discovery baked in. Avoid DIY on Redis for anything that must be correct.
Lease TTL
Too short: heartbeat misses under GC pause → leader thrashes. Too long: dead leader holds lease, blocking progress. 10–30 s typical; must be longer than the longest realistic GC / network pause.
Fencing token enforcement
Non-optional if correctness matters. Storage must reject writes with stale tokens. If storage can't fence, accept that split-brain risk exists and design for idempotent writes so it's recoverable.
Lock granularity
One lock per entity (per-job, per-shard) vs one global lock. Fine granularity = less contention, more locks to manage. Coarse = simpler, contention bottleneck.
Failure modes
Redlock has known correctness issues (Kleppmann's critique). For exactly-once requirements, use a real consensus system. Redlock is OK only for best-effort locks where double-execution is tolerable.
Split-brain: old leader + new leader both think they're leader. Both write. Data corrupted. Add fencing tokens; enforce at storage.
"Leader valid for 10 s from grant time" assumes clocks agree. They don't. Lease is refreshed by the consensus service's own clock; trust that, not wall time on the node.
etcd shares a host with the app; heavy app load starves etcd; leases miss; thrashing. Isolate the consensus cluster.
In a 3-node cluster, 2 down = no quorum = no writes. Monitor cluster health; have a playbook for disaster recovery of the consensus cluster itself.
Drills
Why is "set a flag in a DB row" wrong?Reveal
Three reasons. (1) Crash with flag set = permanent block until manual cleanup. (2) No way to detect a dead leader — the flag stays set forever. (3) No fencing — if the previous "leader" comes back, it still thinks it's leader. Leases solve all three: automatic expiry on death, automatic transfer to a watcher, fencing token to defeat split-brain.
Your etcd cluster is 3 nodes; one dies. Safe?Reveal
Yes — 2 of 3 is still a majority; writes continue. If a second dies, you lose write quorum and the cluster becomes read-only until a node returns. This is why production runs 5-node clusters for workloads that can't tolerate even a brief write outage from a double failure.