Pattern

Leader election / consensus

When to reach for this

Reach for this when…

Distributed locks
Primary election in a replica set
Workflow / job coordination ("only one runs at a time")
Configuration change coordination
Singleton services across a fleet

Not really this pattern when…

Work is naturally partitioned (each worker owns disjoint keys)
At-most-once semantics can be relaxed (most async work)
Single-master DB handles it for you

Good vs bad answer

Interviewer probe

“One worker runs a nightly migration. Ensure only one runs.”

Weak answer

"Each worker checks a flag in the database; whoever sets it first wins."

Strong answer

"Flag-in-DB races under concurrent reads and can't handle leader failure without manual cleanup. Use etcd: each worker tries a transactional put to /migrations/nightly with a 30-second lease. The one who wins the CAS is the leader; others watch the key. Leader refreshes the lease every 10 s while working. If leader crashes or partitions, lease expires in ≤30 s; a watcher becomes the new leader. Every write from the worker includes the fencing token from its lease; the migration tracker rejects writes with stale tokens — so even a split-brain old leader that comes back can't corrupt. Migration itself is idempotent (keyed by step_id) so duplicate execution under worst-case is safe."

Why it wins: Real consensus system, lease with fencing, idempotent sink as defence in depth.

Cheat sheet

•Use etcd / ZooKeeper / Consul. Don't roll your own.
•Lease + heartbeat + watch. CAS to acquire.
•Fencing tokens enforced at the sink.
•Quorum: 3 or 5 nodes. Never 2 or 4.
•TTL longer than worst-case GC pause.
•Redlock for best-effort only. Not for correctness.
•Idempotent writes as a safety net.

Core concept

Consensus gives you "exactly one node holds this role, and the rest agree". The algorithms are Paxos (theory) and Raft (practical, used by etcd, Consul, CockroachDB). Don't implement them. Use a consensus service.

Practical recipe with etcd / ZooKeeper:

1Each node calls LeaseGrant(ttl=10s); gets a unique lease.
2Each node attempts Put(key="/service/leader", value=node_id, lease=<mine>) with a compare-and-swap that succeeds only if key is empty.
3The winner is leader; others Watch the key for changes.
4Leader KeepAlives the lease (heartbeat). If it dies or partitions, the lease expires; key is deleted; watchers see the change; they race for the key again.

The fencing problem: network partition makes old leader think it's still leader; new leader elected; both try to write. Fix: fencing tokens — every lease has a monotonically increasing fencing token; downstream (storage) rejects writes with stale tokens. Without fencing, split-brain is inevitable.

Quorum sizing: 3 or 5 nodes for the consensus cluster. N nodes survive ⌊(N-1)/2⌋ failures. Don't run 2 or 4 (no majority on tie). Don't run > 7 (write latency suffers).

Client library > rolling your own. Use curator (ZooKeeper), etcd-client lease libs, or go-redsync/redlock (if you accept redlock's well-known trade-offs for Redis-based locking).

Canonical examples

→Kafka controller election
→MongoDB / Redis replica-set primary election
→Distributed cron (one runner per job)
→Kubernetes scheduler leader
→Custom migration runner ("only one pod migrates")

Decision levers

Consensus service choice

etcd: default for Kubernetes ecosystem, great Go client. ZooKeeper: mature, JVM ecosystem (Kafka, Hadoop). Consul: KV + service discovery baked in. Avoid DIY on Redis for anything that must be correct.

Lease TTL

Too short: heartbeat misses under GC pause → leader thrashes. Too long: dead leader holds lease, blocking progress. 10–30 s typical; must be longer than the longest realistic GC / network pause.

Fencing token enforcement

Non-optional if correctness matters. Storage must reject writes with stale tokens. If storage can't fence, accept that split-brain risk exists and design for idempotent writes so it's recoverable.

Lock granularity

One lock per entity (per-job, per-shard) vs one global lock. Fine granularity = less contention, more locks to manage. Coarse = simpler, contention bottleneck.

Failure modes

DIY leader election on RedisAdvanced

Redlock has known correctness issues (Kleppmann's critique). For exactly-once requirements, use a real consensus system. Redlock is OK only for best-effort locks where double-execution is tolerable.

No fencing token

Split-brain: old leader + new leader both think they're leader. Both write. Data corrupted. Add fencing tokens; enforce at storage.

Clock skew assumptionsAdvanced

"Leader valid for 10 s from grant time" assumes clocks agree. They don't. Lease is refreshed by the consensus service's own clock; trust that, not wall time on the node.

Consensus cluster co-located with workload

etcd shares a host with the app; heavy app load starves etcd; leases miss; thrashing. Isolate the consensus cluster.

Quorum lossAdvanced

In a 3-node cluster, 2 down = no quorum = no writes. Monitor cluster health; have a playbook for disaster recovery of the consensus cluster itself.

Drills

Why is "set a flag in a DB row" wrong?Reveal

Three reasons. (1) Crash with flag set = permanent block until manual cleanup. (2) No way to detect a dead leader — the flag stays set forever. (3) No fencing — if the previous "leader" comes back, it still thinks it's leader. Leases solve all three: automatic expiry on death, automatic transfer to a watcher, fencing token to defeat split-brain.

Your etcd cluster is 3 nodes; one dies. Safe?Reveal

Yes — 2 of 3 is still a majority; writes continue. If a second dies, you lose write quorum and the cluster becomes read-only until a node returns. This is why production runs 5-node clusters for workloads that can't tolerate even a brief write outage from a double failure.

11% complete

Current

When to reach for this

Step 1 of 9

Good vs bad answer

Jump to next

All patterns

Pattern

Leader election / consensus

When to reach for this

Reach for this when…

Distributed locks
Primary election in a replica set
Workflow / job coordination ("only one runs at a time")
Configuration change coordination
Singleton services across a fleet

Not really this pattern when…

Work is naturally partitioned (each worker owns disjoint keys)
At-most-once semantics can be relaxed (most async work)
Single-master DB handles it for you

Good vs bad answer

Interviewer probe

“One worker runs a nightly migration. Ensure only one runs.”

Weak answer

"Each worker checks a flag in the database; whoever sets it first wins."

Strong answer

Why it wins: Real consensus system, lease with fencing, idempotent sink as defence in depth.

Cheat sheet

•Use etcd / ZooKeeper / Consul. Don't roll your own.
•Lease + heartbeat + watch. CAS to acquire.
•Fencing tokens enforced at the sink.
•Quorum: 3 or 5 nodes. Never 2 or 4.
•TTL longer than worst-case GC pause.
•Redlock for best-effort only. Not for correctness.
•Idempotent writes as a safety net.

Core concept

Practical recipe with etcd / ZooKeeper:

1Each node calls LeaseGrant(ttl=10s); gets a unique lease.
2Each node attempts Put(key="/service/leader", value=node_id, lease=<mine>) with a compare-and-swap that succeeds only if key is empty.
3The winner is leader; others Watch the key for changes.
4Leader KeepAlives the lease (heartbeat). If it dies or partitions, the lease expires; key is deleted; watchers see the change; they race for the key again.

Quorum sizing: 3 or 5 nodes for the consensus cluster. N nodes survive ⌊(N-1)/2⌋ failures. Don't run 2 or 4 (no majority on tie). Don't run > 7 (write latency suffers).

Client library > rolling your own. Use curator (ZooKeeper), etcd-client lease libs, or go-redsync/redlock (if you accept redlock's well-known trade-offs for Redis-based locking).

Canonical examples

→Kafka controller election
→MongoDB / Redis replica-set primary election
→Distributed cron (one runner per job)
→Kubernetes scheduler leader
→Custom migration runner ("only one pod migrates")

Decision levers

Consensus service choice

Lease TTL

Too short: heartbeat misses under GC pause → leader thrashes. Too long: dead leader holds lease, blocking progress. 10–30 s typical; must be longer than the longest realistic GC / network pause.

Fencing token enforcement

Non-optional if correctness matters. Storage must reject writes with stale tokens. If storage can't fence, accept that split-brain risk exists and design for idempotent writes so it's recoverable.

Lock granularity

One lock per entity (per-job, per-shard) vs one global lock. Fine granularity = less contention, more locks to manage. Coarse = simpler, contention bottleneck.

Failure modes

DIY leader election on RedisAdvanced

Redlock has known correctness issues (Kleppmann's critique). For exactly-once requirements, use a real consensus system. Redlock is OK only for best-effort locks where double-execution is tolerable.

No fencing token

Split-brain: old leader + new leader both think they're leader. Both write. Data corrupted. Add fencing tokens; enforce at storage.

Clock skew assumptionsAdvanced

"Leader valid for 10 s from grant time" assumes clocks agree. They don't. Lease is refreshed by the consensus service's own clock; trust that, not wall time on the node.

Consensus cluster co-located with workload

etcd shares a host with the app; heavy app load starves etcd; leases miss; thrashing. Isolate the consensus cluster.

Quorum lossAdvanced

In a 3-node cluster, 2 down = no quorum = no writes. Monitor cluster health; have a playbook for disaster recovery of the consensus cluster itself.

Drills

Why is "set a flag in a DB row" wrong?Reveal

Your etcd cluster is 3 nodes; one dies. Safe?Reveal

11% complete

Current

When to reach for this

Step 1 of 9

Good vs bad answer

Jump to next