Consistent hashing
Hash ring, virtual nodes, minimal key reshuffling on membership change.
Simple hash(key) % N breaks the moment N changes — nearly every key remaps, every cache goes cold, and every shard rebalances. Consistent hashing moves only 1/N of keys. It's the algorithm behind every production cache cluster and most distributed databases.
Read this if your last attempt…
- You said "we'll hash and mod by the number of shards" without considering what happens when shards change
- You can't explain virtual nodes or why they matter
- You're designing a distributed cache, database, or anything with elastic membership
- You've heard of consistent hashing but couldn't explain it at a whiteboard
The concept
The modulo problem
Suppose you have 3 database shards and use shard = hash(key) % 3 to decide where each key lives. This works perfectly — until you add a 4th shard. Now almost every key maps to a different shard. For a cache with 1M keys, adding one node invalidates roughly 75% of them. The entire system effectively goes cold. The same catastrophe happens when a node dies.
When to use consistent hashing vs alternatives.
| Scenario | Best scheme | Why |
|---|---|---|
| Distributed cache (Redis, Memcached) | Consistent hashing or fixed slots | Nodes come and go; minimise cache invalidation |
| Distributed database (Cassandra) | Consistent hashing | Elastic membership with minimal data migration |
| Time-series data (partitioned by hour) | Range partitioning | Range scans are the primary access pattern |
| Stable cluster, never rescales | Modulo hash | Simplest; data movement is not a concern |
| Multi-tenant SaaS (tenant per shard) | Directory partitioning | Tenants vary wildly in size; need per-tenant control |
How interviewers grade this
- You explain why modulo hashing breaks on membership change and how consistent hashing fixes it.
- You describe the hash ring and clockwise assignment in one clear sentence.
- You know virtual nodes and can say why they're needed — balance and failure distribution.
- You distinguish hot keys (workload imbalance) from uneven distribution (structural imbalance).
- You name real systems that use it: Cassandra, DynamoDB, Memcached ketama, Redis Cluster's fixed-slot alternative.
Variants
The modulo problem
Why hash(key) % N fails when N changes.
With modulo, adding node 4 remaps ~75% of keys. With consistent hashing, only ~25% of keys move — the ones between the new node and its predecessor.
The most straightforward approach to distribute data: hash the key, mod by the number of servers. With 3 servers:
Event 1234 → hash(1234) % 3 = 1 → Server 1
Event 5678 → hash(5678) % 3 = 0 → Server 0
Event 9012 → hash(9012) % 3 = 2 → Server 2This works until you add a 4th server. Now it's hash(key) % 4 — and most keys map to a different server. For 1M cache keys, roughly 750K suddenly miss. The database gets hammered by a thundering herd of cache misses all at once. The same thing happens when a server dies: hash(key) % 2 redistributes almost everything.
The core issue: changing N invalidates nearly all mappings, not just the ones affected by the membership change. Any system where nodes are added, removed, or fail needs a distribution scheme where membership changes only move a bounded fraction of keys.
Choose this variant when
- Never, unless your cluster size is permanently fixed
The hash ring
Arrange nodes and keys on a circle; walk clockwise to find the owner.
Imagine the hash function output space (0 to 2³² - 1) arranged as a circle. Hash each node's identifier to get its position on the ring. Hash each key to get its position. The key belongs to the first node encountered walking clockwise from the key's position.
When you add a new node at position P, only the keys between P and the previous node move to the new node — everything else stays put. When a node dies, only its keys move to the next clockwise node.
For a ring with N nodes, adding or removing one moves approximately 1/N of all keys. With 10 nodes, that's 10% — compared to ~90% with modulo. The improvement is massive: for a 10M-key cache, modulo would cause 9M cache misses; consistent hashing causes 1M.
The limitation of a basic ring: with only a few physical nodes, the ring is divided unevenly. One node might own 60% of the keyspace while another owns 10%. And when a node dies, its entire range goes to a single neighbour — potentially doubling that neighbour's load. Virtual nodes fix this.
Choose this variant when
- Understanding the concept — but production systems need virtual nodes
Virtual nodes (vnodes)
Each physical node gets K positions on the ring for even distribution and graceful failure.
Instead of placing each physical node at one position, place it at K positions by hashing variations: "NodeA-vn0", "NodeA-vn1", through "NodeA-vn255". Each hash produces a different ring position. Now instead of one large range per physical node, you have many small ranges scattered across the ring.
Three benefits:
- 1Even load distribution — the law of large numbers. With 256 virtual nodes per physical node, each physical node owns close to 1/N of the keyspace regardless of hash distribution.
- 1Failure load spreading — when a physical node dies, its 256 virtual nodes are spread around the ring. The keys from each virtual node redistribute to a different surviving node. Instead of one neighbour absorbing 100% of the dead node's keys, the load spreads roughly evenly across all survivors.
- 1Smooth scaling — adding a node places 256 new virtual nodes around the ring, each absorbing a small slice from a different existing node. Load shifts gradually, not in one big chunk.
The tradeoff: more ring metadata and slightly slower lookups (binary search over a larger sorted array). With 10 nodes × 256 vnodes = 2,560 entries — trivially small. Cassandra defaults to 256 vnodes per node. The typical sweet spot is 128-256.
Pros
- +Near-perfect load balance across nodes
- +Failure impact distributed to all survivors, not just one neighbour
- +Smooth capacity absorption when scaling
Cons
- −More metadata per node (trivial in practice)
- −Rebalancing during bootstrap involves more coordination
- −Slightly slower ring lookup (binary search over larger array)
Choose this variant when
- Any production deployment with 3+ nodes
- Any system where nodes can fail or be added
- This is not optional in production — it's the default
Fixed hash slots (Redis Cluster approach)
Divide keyspace into fixed slots (16,384 in Redis); assign slot ranges to nodes explicitly.
Redis Cluster takes a different approach: CRC16(key) % 16384 maps each key to one of 16,384 fixed slots. Slots are assigned to nodes in contiguous ranges. Moving data means migrating specific slots with CLUSTER MIGRATE — explicit, controlled, slot-by-slot.
Why Redis chose this over a ring: simpler to implement, explicit control over which slots live where, and easier to reason about during operations. You can see exactly which node owns which slots, and migration is deterministic.
The tradeoff: less automatic balancing (you manually assign slots), and the fixed 16K slot count limits granularity. With 100 nodes, each gets ~164 slots — fine. With 1000 nodes, you'd want more slots.
Both approaches — consistent hashing with virtual nodes and fixed hash slots — achieve the same goal: controlled data movement when membership changes. The ring is more elegant and automatic; fixed slots are more explicit and operational. Martin Kleppmann notes in DDIA that the term "consistent hashing" is used loosely — some systems that claim it actually use variations closer to fixed-slot schemes.
Pros
- +Simple mental model — you can list which node owns which slots
- +Explicit migration control
- +Easy to implement correctly
Cons
- −Manual rebalancing — you assign slots yourself
- −Fixed slot count limits granularity
- −Less automatic than a ring with virtual nodes
Choose this variant when
- Redis Cluster deployments
- When you want operational visibility into data placement
- Smaller cluster sizes where manual control is practical
Hot key mitigation
Consistent hashing distributes keys evenly but can't fix one key getting 100x traffic.
A celebrity's profile might be read 500K times/sec. Consistent hashing maps it to one node — perfectly balanced structurally, but that node is on fire.
Three production approaches:
- 1Read replicas for hot keys — replicate the hot key to all cache nodes. Client-side round-robin reads across replicas. Simple, effective for read-heavy hot keys.
- 1Key-space salting — store as "celebrity:123:0" through "celebrity:123:9". Writes go to a random suffix; reads aggregate from all 10. Trades read complexity for write distribution. Good for hot counters.
- 1Local in-process caching (L1 cache) — each app server caches the hottest keys in memory with a short TTL (5-10 seconds). Often the simplest fix. The CDN is the global equivalent.
- 1Adaptive rebalancing — monitor per-node traffic in real time and move specific key ranges. DynamoDB does this automatically. Operationally complex for DIY systems.
The interview distinction: virtual nodes prevent structural imbalance (uneven key distribution). Replication and salting prevent workload imbalance (uneven traffic per key). Know which problem you're solving.
Choose this variant when
- One key gets disproportionate traffic (celebrity, viral content, popular product)
- Your monitoring shows one cache shard at 90% CPU while others are at 20%
Worked example
Scenario: designing a distributed cache for a social media app. 3 cache nodes, 10M keys, expect to scale to 10 nodes.
Why not modulo: hash(key) % 3. Adding a 4th node changes N — 75% of 10M keys remap. That's 7.5M cache misses hitting the database simultaneously. The DB will likely collapse under the thundering herd.
Consistent hashing setup:
- Hash ring: 0 to 2^32 - 1.
- 3 physical nodes, 256 virtual nodes each = 768 ring positions.
- Lookup: hash(key), binary search the sorted vnode array, return owning physical node.
Adding node 4:
- Node 4 gets 256 new vnode positions spread across the ring.
- Each new vnode takes over a small range from one existing node.
- Total keys moved: ~1/4 of 10M = ~2.5M. These are cache misses that refill from DB — manageable with a warm-up phase.
- The other 7.5M keys stay on their existing nodes — no miss.
Node 2 dies:
- Node 2's 256 vnodes are spread around the ring.
- Their key ranges get absorbed by many different surviving nodes, not just one.
- Load increase per survivor: ~50% more keys each (1/3 to 1/2), spread evenly.
- Misses: only the ~3.3M keys from node 2. The other 6.7M are unaffected.
Hot key (Taylor Swift's profile, 100K reads/sec):
- Lives on one node via consistent hashing. Hash can't help — this is workload imbalance.
- Fix: replicate to all 3 cache nodes, client round-robins reads. Or add L1 in-process cache with 5s TTL in each app server.
Good vs bad answer
Interviewer probe
“How do you distribute keys across your cache cluster?”
Weak answer
"hash(key) mod number_of_nodes — simple and even distribution."
Strong answer
"Consistent hashing with virtual nodes. Each of our 3 cache nodes gets 256 virtual positions on a hash ring. Keys hash onto the ring and belong to the next vnode clockwise. When we add a 4th node, only about 25% of keys remap — the rest stay put. Without this, modulo would invalidate 75% and the DB would die under the thundering herd. Virtual nodes give us even distribution: without them, 3 physical nodes could split the ring 60/20/20. With 256 vnodes each, it converges to ~33% per node. When a node dies, its load spreads across all survivors, not just one neighbour. For hot keys like a celebrity profile at 100K reads/sec, I'd replicate the key to all cache nodes and round-robin reads."
Why it wins: Explains the modulo problem, the ring, virtual nodes, failure behaviour, and hot key mitigation — each with the concrete reasoning.
When it comes up
- When you sketch a distributed cache (Redis, Memcached) or elastic data tier
- When the interviewer asks "what happens when you add or lose a node?"
- When discussing Cassandra, DynamoDB, or any partitioned database placement
- When a hot-key or thundering-herd scenario surfaces in deep-dive
- Whenever you write `hash(key) % N` on the board and the interviewer pauses
Order of reveal
- 1Name the modulo failure mode first. "Naive hash mod N is fine until N changes — then roughly (N-1)/N of keys remap. For a 10M-key cache adding one node, that's 7.5M cold misses hammering the DB at once."
- 2Introduce the ring in one sentence. "Consistent hashing puts both nodes and keys on a circular hash space; each key is owned by the next node clockwise. Adding or losing a node moves only ~1/N of keys."
- 3Bring in virtual nodes unprompted. "In production I'd use 128-256 virtual nodes per physical node. Without them three nodes can split the ring 60/20/20, and a dead node dumps all its load on one neighbour."
- 4Explain failure behaviour. "When a node dies, its 256 vnodes are scattered around the ring, so the load spreads across all survivors — no single neighbour doubles."
- 5Separate structural from workload imbalance. "Consistent hashing fixes uneven distribution. It does NOT fix one key getting 100x traffic — that's workload imbalance, which needs replication, key salting, or L1 caching."
- 6Name real systems for credibility. "Cassandra and DynamoDB use the ring with vnodes; Redis Cluster uses 16,384 fixed slots — same goal, controlled data movement on membership change."
Signature phrases
- “Modulo moves almost everything; the ring moves 1/N” — Crispest one-line justification for consistent hashing.
- “Virtual nodes, 128 to 256, non-negotiable” — Signals production experience, not textbook understanding.
- “Structural imbalance vs workload imbalance” — Distinction separates strong candidates from memorisers.
- “The ring decides where; replication decides how many” — Prevents conflating placement with durability.
- “Hot keys need replication, not better hashing” — Shows you know the limits of the tool.
Likely follow-ups
?“How exactly is the ring implemented in memory? What's the lookup cost?”Reveal
A sorted array (or red-black tree) of (hashPosition, physicalNodeId) tuples — one entry per virtual node. For a 10-node cluster with 256 vnodes each, that's 2,560 entries, a few KB total.
Lookup:
- 1Compute
pos = hash(key) - 2Binary search the sorted array for the first entry with
hashPosition >= pos(wrap around if past the end). - 3Return the owning physical node.
Cost: O(log V) where V = total vnode count. For 2,560 entries that's ~12 comparisons — single-digit microseconds. Effectively free compared to a network hop.
The whole ring fits in CPU cache, so even at millions of lookups per second per client the overhead is negligible. Clients cache the ring locally and a gossip protocol (or coordinator) pushes updates on membership change.
?“What if two nodes happen to hash to nearby ring positions? Doesn't that starve one of them?”Reveal
Yes — with a basic ring this is exactly the problem. Without virtual nodes, if hash(NodeA) and hash(NodeB) land close together, the node just clockwise of both owns a tiny slice while others own huge slices. The ring is balanced only on average; individual trials can be wildly off.
Virtual nodes solve this by the law of large numbers. With 256 hash positions per physical node, occasional clustering cancels out. Variance of ownership share drops from ~O(1) with 1 vnode per node to ~O(1/√K) with K vnodes. At K=256 the standard deviation of each node's share is small single-digit percentages.
This is also why consistent hashing is not "one ring position per node" in any real system — that's a pedagogical simplification. Cassandra, DynamoDB-style systems, and Memcached ketama all use many virtual positions per physical node.
?“In Cassandra, how does consistent hashing interact with replication factor?”Reveal
They compose cleanly. Consistent hashing decides the primary owner — hash the partition key, walk clockwise to the first vnode. Replication factor N (say 3) means: also store copies on the next N-1 distinct physical nodes walking clockwise on the ring.
So for RF=3, a write lands on the primary plus the two next physical nodes (skipping vnodes that map to the same physical node). This gives:
- Automatic placement of replicas — no separate coordination.
- Natural failure handling — if the primary is down, reads/writes fall to the next replica on the ring.
- Rack-/DC-awareness layered on top — Cassandra's
NetworkTopologyStrategywalks the ring but skips replicas in the same rack or DC to avoid correlated failure.
The key insight: the ring gives you placement; the replication strategy gives you durability. They're independent concerns that plug together.
?“If consistent hashing is so good, why does Redis Cluster use fixed slots instead?”Reveal
Different design priorities, same goal.
Redis Cluster uses 16,384 fixed slots via CRC16(key) % 16384. Slots are assigned to nodes in contiguous ranges and migrated explicitly with CLUSTER MIGRATE.
Why this tradeoff works for Redis:
- 1Operational simplicity — you can print the slot-to-node map as a flat list. Debugging is trivial.
- 2Explicit migration — slot-by-slot movement with atomic hand-off, not probabilistic rebalancing.
- 3Multi-key operations —
MGET/transactions need keys on the same node. Redis exposes{hashtag}syntax to force keys into the same slot, which is easier to reason about than "adjacent on the ring".
Why not for Cassandra/DynamoDB: those systems scale to thousands of nodes where 16K slots gets coarse, and automatic membership handling matters more than manual control.
Both are valid — the ring is elegant and automatic; fixed slots are explicit and operational. Martin Kleppmann (DDIA ch. 6) notes that "consistent hashing" in literature is often loose terminology; several systems that claim it use a scheme closer to fixed slots.
?“A celebrity profile gets 500K req/sec and lives on one vnode. What do you do?”Reveal
Consistent hashing can't help here — by design the key is pinned to one owner. You need workload-imbalance tools:
- 1Replicate the hot key to every cache node. Clients pick a random node for reads. Trivial to implement, works immediately for read-heavy hot keys. Tradeoff: writes fan out to all nodes.
- 1L1 in-process cache in each app server, 5-10 s TTL. Often the fastest fix — 500K req/sec from one node becomes 500K req/sec across N app processes, each doing ~one origin fetch every 5 seconds. CDN is the globally-scaled version of this.
- 1Key salting for hot counters. Write to
celebrity:123:{0..9}, read by summing all 10. Distributes writes but complicates reads. Good for counters, not for read-mostly objects.
- 1Promote to a dedicated tier. Detect the hot key, move it out of the general cache into a replicated hot-key cache cluster. DynamoDB's adaptive capacity does this transparently.
In an interview I'd name all four and pick L1 cache + replication as my default — simplest two that cover 95% of cases.
Code examples
import bisect
import hashlib
class HashRing:
"""Consistent-hash ring. Vnode count defaults to 256 — matches Cassandra."""
def __init__(self, nodes: list[str], vnodes: int = 256) -> None:
self._vnodes = vnodes
self._ring: list[int] = [] # sorted hash positions
self._owner: dict[int, str] = {} # pos -> physical node
for n in nodes:
self._add(n)
def _hash(self, key: str) -> int:
# MD5 is fine here — we need distribution, not cryptographic strength.
return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)
def _add(self, node: str) -> None:
for i in range(self._vnodes):
pos = self._hash(f"{node}-vn{i}")
bisect.insort(self._ring, pos)
self._owner[pos] = node
def _remove(self, node: str) -> None:
for i in range(self._vnodes):
pos = self._hash(f"{node}-vn{i}")
self._ring.remove(pos)
del self._owner[pos]
def get_node(self, key: str) -> str:
"""O(log V). V = total vnodes. At 2,560 entries, ~12 comparisons."""
pos = self._hash(key)
idx = bisect.bisect_right(self._ring, pos)
if idx == len(self._ring):
idx = 0 # wrap around
return self._owner[self._ring[idx]]
def add_node(self, node: str) -> None:
self._add(node)
def remove_node(self, node: str) -> None:
self._remove(node)# Empirically verify that consistent hashing beats modulo by 10x+.
# Run this and compare the two outputs.
from collections import Counter
KEYS = [f"key-{i}" for i in range(100_000)]
# --- Modulo hashing: adding a node remaps nearly everything.
def modulo_owner(key: str, n: int) -> int:
return hash(key) % n
before = {k: modulo_owner(k, 3) for k in KEYS}
after = {k: modulo_owner(k, 4) for k in KEYS}
moved = sum(1 for k in KEYS if before[k] != after[k])
print(f"modulo 3 -> 4: {moved / len(KEYS):.1%} keys moved") # ~75%
# --- Consistent hashing: adding a node moves ~1/N.
ring3 = HashRing(["n1", "n2", "n3"])
ring4 = HashRing(["n1", "n2", "n3", "n4"])
moved = sum(1 for k in KEYS if ring3.get_node(k) != ring4.get_node(k))
print(f"ring 3 -> 4: {moved / len(KEYS):.1%} keys moved") # ~25%
# Balance check: 256 vnodes gives near-perfect distribution.
counts = Counter(ring4.get_node(k) for k in KEYS)
spread = max(counts.values()) - min(counts.values())
print(f"load spread across 4 nodes: {spread / len(KEYS):.1%}") # ~1-2%Common mistakes
Some candidates say "consistent hashing" but then describe hash % N. The interviewer knows the difference. If you cite consistent hashing, explain the ring — why only 1/N of keys move on membership change, not nearly all.
A basic ring with 3 nodes gives uneven splits and single-neighbour failure overload. Virtual nodes are not optional in production. Always mention them and know the typical count (128-256).
Consistent hashing decides WHERE data lives. Replication decides HOW MANY copies exist. Cassandra uses consistent hashing for placement AND replicates to N consecutive nodes on the ring. They work together but are separate concepts.
It solves structural imbalance (uneven key distribution across nodes). It does NOT solve workload imbalance (one key getting 100x traffic). For hot keys you need replication, salting, or L1 caching.
If your cluster size never changes, modulo is simpler and gives perfect distribution. Consistent hashing shines when membership changes — elastic caches, node failures, rolling deployments.
Practice drills
You have a 5-node cache cluster using consistent hashing. One node dies. How much data moves, and where?Reveal
Approximately 1/5 (20%) of all keys move — specifically, the keys that belonged to the dead node. With virtual nodes, these keys redistribute to multiple surviving nodes (not just one). Each survivor absorbs roughly 5% more keys. For caches, "movement" means cache misses that refill from the DB — no explicit migration needed.
Explain virtual nodes in 30 seconds.Reveal
Instead of one ring position per physical node, place it at K positions (e.g. 256) by hashing "NodeA-vn0", "NodeA-vn1", etc. This turns one large range into many small ranges. Three benefits: (1) even load distribution via the law of large numbers, (2) failure load spreads to all survivors instead of one neighbour, (3) adding a node absorbs a little from everyone. Tradeoff: slightly more ring metadata. Cassandra defaults to 256. The metadata for 10 nodes × 256 vnodes fits in a few KB.
Why does Redis Cluster use fixed hash slots instead of a hash ring?Reveal
Redis Cluster uses 16,384 fixed slots (CRC16 % 16384). Design choice: simpler implementation, explicit control over which slots live on which node, and slot-by-slot migration via CLUSTER MIGRATE. Tradeoff: less automatic balancing (you assign slots manually) and fixed 16K granularity. Both approaches achieve controlled data movement on membership change. The ring is more elegant; fixed slots are more operational.
A celebrity profile is read 500K times/sec from your cache. It lives on one node via consistent hashing. Fix it.Reveal
Consistent hashing can't help — the key maps deterministically to one node. Options: (1) Replicate the hot key to all cache nodes, client round-robins reads. Simplest. (2) Key salting: store as celebrity:123:{0..9}, read from random suffix each time. Good for counters. (3) L1 in-process cache in each app server with short TTL (5-10s). Often the quickest win.
Cheat sheet
- •Modulo: adding 1 node to N remaps ~(N-1)/N keys. Catastrophic for caches.
- •Consistent hashing: adding 1 node moves ~1/N keys. Ring + clockwise walk.
- •Virtual nodes (128-256 per physical node): even distribution + failure load spread.
- •Hot keys ≠ uneven distribution. Hashing fixes distribution, not popularity.
- •Hot key fixes: replicate to all nodes, salt the key, or L1 in-process cache.
- •Real systems: Cassandra (ring + vnodes), DynamoDB (evolved ring), Redis Cluster (16K fixed slots), Memcached (ketama algorithm).
- •Ring metadata is tiny — 10 nodes × 256 vnodes = 2,560 entries. Fits in a few KB.
Practice this skill
These problems exercise Consistent hashing. Try one now to apply what you just learned.