Sharding & partitioning
Partition key selection, hot spots, rebalancing, consistent hashing.
The partition key is the single most consequential decision in a distributed data design. Pick it wrong and no amount of horsepower recovers you — the hot shard stays hot, the rebalance never finishes, and the team spends a quarter migrating.
Read this if your last attempt…
- Your design says "shard by user_id" and you can't explain why
- You were asked "what happens when one shard is twice as busy as the others?" and stalled
- You don't know the difference between logical and physical shards
- You said "we'll rebalance later" without naming how
The concept
Sharding (a.k.a. horizontal partitioning) is the deliberate split of one logical dataset across many physical nodes so that each node only owns a slice of it.
Three choices, in order:
Range = ordered, great for scans, risks hot tails. Hash = uniform spread, no range queries. Directory = explicit map, flexible, needs a router that itself scales.
Which partition scheme for which workload
| Range | Hash | Consistent hash | Directory | |
|---|---|---|---|---|
| Write distribution | Skewed on monotonic keys | Uniform | Uniform | As good as the mapping |
| Range queries | Cheap | Scatter-gather | Scatter-gather | Scatter-gather |
| Resharding cost | Medium | High (naive) / medium (w/ virtual shards) | Low (~1/N) | Low |
| Operational complexity | Low | Low | Medium | High (mapping is state) |
How interviewers grade this
- You name the partition key explicitly and defend it against "what's the read path?" and "what's the write distribution?"
- You distinguish logical shards (virtual buckets) from physical nodes and explain how rebalancing moves buckets, not rows.
- You call out hot-key risk concretely: "celebrity users would concentrate writes; I handle this by splitting their keyspace."
- You pick a scheme (range / hash / directory) on evidence — not "hash because it's even".
- You mention the cross-shard-query cost and what queries become expensive after sharding.
Variants
Range partitioning
Split the key space into ordered ranges.
Good for:
- Range scans (time, alphabetical)
- Time-series data with TTL per partition
- Small, predictable key spaces
Bad for:
- Monotonic keys (auto-increment, timestamp) — last shard takes all writes
- Uniform write distribution is a hard requirement
Pros
- +Range scans (time, alphabetical)
- +Time-series data with TTL per partition
- +Small, predictable key spaces
Cons
- −Monotonic keys (auto-increment, timestamp) — last shard takes all writes
- −Uniform write distribution is a hard requirement
Choose this variant when
- Time-series ingestion where you partition by day or week and drop whole partitions to expire data.
Hash partitioning
hash(key) % N → shard id.
1024 virtual buckets map onto N physical nodes. Adding a node reassigns ~1/N of buckets — not a re-hash of every row. This is how you scale without downtime.
Good for:
- Uniform write distribution out of the box
- Point-lookup workloads where range scans aren't needed
- When you can't predict what "hot" looks like
Bad for:
- Range queries become scatter-gather
- Naive modulo makes resharding catastrophic
Pros
- +Uniform write distribution out of the box
- +Point-lookup workloads where range scans aren't needed
- +When you can't predict what "hot" looks like
Cons
- −Range queries become scatter-gather
- −Naive modulo makes resharding catastrophic
Choose this variant when
- URL shortener, user-profile store, session cache — any point-lookup-dominant workload.
Consistent hashing
Hash onto a ring; nodes own arcs; add a node, move ~1/N of keys.
Good for:
- Dynamic cluster membership (autoscaling, failures)
- Distributed caches (Redis cluster, memcached)
- Minimising data movement on rebalance
Bad for:
- Workloads that still have hot keys — consistent hashing doesn't fix a skewed key
Pros
- +Dynamic cluster membership (autoscaling, failures)
- +Distributed caches (Redis cluster, memcached)
- +Minimising data movement on rebalance
Cons
- −Workloads that still have hot keys — consistent hashing doesn't fix a skewed key
Choose this variant when
- Cache clusters that scale up and down frequently; Dynamo-style distributed stores.
Directory / lookup-based
Explicit mapping table says which shard owns each key.
Lookup service maps tenant to its shard. Top-10 tenants get dedicated shards; long-tail shares a pool. Migrate a tenant across tiers by updating the directory.
Good for:
- Multi-tenancy where tenants vary wildly in size
- Flexible rebalancing without moving data
- Different tenants on different physical tiers
Bad for:
- The lookup itself becomes a hot path and must be cached
Pros
- +Multi-tenancy where tenants vary wildly in size
- +Flexible rebalancing without moving data
- +Different tenants on different physical tiers
Cons
- −The lookup itself becomes a hot path and must be cached
Choose this variant when
- Large SaaS where you want to put the top 10 tenants on dedicated shards.
Good vs bad answer
Interviewer probe
“How would you shard the `posts` table for a Twitter-scale system?”
Weak answer
"I'd shard by user_id using hash partitioning. That distributes writes evenly and scales horizontally."
Strong answer
"Two observations shape the answer. First, reads (timeline loads) dominate writes by ~100:1, so the read path is the one I optimise. Second, the user distribution is heavy-tailed — a few accounts are 10,000× more active than the median.
For posts I'd partition by user_id with hash-based virtual shards (1024 buckets over N physical nodes). But I'd treat the top ~1,000 users as a special case: their posts get their own partition strategy (hash of user_id + post_id to spread within their keyspace) and the read path for their followers hits a pre-materialised fan-out cache rather than scanning the posts table.
Resharding uses the virtual-shard layer so adding nodes moves ~1/N of the buckets rather than rehashing the whole table. I'd budget rehearsed rebalancing quarterly — never on the critical path."
Why it wins: Names the workload skew explicitly and partitions differently for the tail. Uses virtual shards so adding capacity is a bounded operation. Defends the choice against the actual access pattern (read:write ratio) rather than generic "scales well".
When it comes up
- When data volume or write throughput estimate exceeds one box (typically > 1 TB or > 50K writes/sec)
- During data model step, right after you name the primary entity
- When the interviewer asks "how does this scale to 100× more users?"
- When you propose a relational store at scale — sharding is what makes it scale
- When fan-out queries, hot keys, or read-replicas come up in deep-dive
Order of reveal
- 1State the three decisions in order. "Partition key first, scheme second, physical layout third. The partition key is the consequential choice — scheme and layout are mechanical once the key is right."
- 2Pick the partition key from the top access pattern. "The top query is 'timeline for a user'. That query must be single-shard, so the key is user_id. If the top query were 'all posts for a tag', it'd be tag_id instead."
- 3Choose scheme on evidence. "Hash partitioning because writes are point-inserts, I don't need range scans on the partition key, and I want uniform write distribution."
- 4Introduce virtual shards explicitly. "1024 virtual shards mapped onto N physical nodes. Adding capacity moves buckets, not rows — only ~1/N of data moves per new node."
- 5Call out hot-shard risk with a concrete number. "The top 0.1% of users will be 10,000× the median. I'll either sub-partition their keyspace with a composite key, or move them to a dedicated tier."
- 6Name what queries get expensive. "Cross-shard joins and queries on non-partition columns are now scatter-gather. For the top expensive secondary queries I'd maintain a materialised view or secondary index, partitioned on that query's key."
- 7Close with rebalance discipline. "Rebalancing is rehearsed quarterly, not improvised on the day a shard tips over."
Signature phrases
- “Partition key first, scheme second, layout third” — Orders the decision correctly — most candidates jump to "hash vs range".
- “Virtual shards decouple logical from physical” — Single phrase that tells the interviewer you know how to rebalance without pain.
- “Hot shards are key-choice bugs, not capacity bugs” — Reframes "add more nodes" as the wrong answer.
- “The top 0.1% of keys need their own strategy” — Demonstrates you've thought about heavy-tailed distributions.
- “Every cross-shard query is a latency bill” — Forces you to justify each scatter-gather.
- “Rebalancing is rehearsed, not improvised” — Operational maturity; separates senior from mid.
Likely follow-ups
?“Walk me through what happens when you add a new physical node to a 16-node cluster using virtual shards.”Reveal
Assume 1024 vshards mapped onto 16 nodes (64 vshards per node). Adding a 17th node:
- 1Compute target mapping. 1024 / 17 ≈ 60 vshards per node. The new node takes ~60 vshards, pulled ~4 from each of the existing 16.
- 1Dual-write phase. For each migrating vshard, writes go to both source and destination while the source bulk-copies historical data. Hours to a day depending on data size.
- 1Catch-up phase. Destination replays the change stream until within a few seconds of source.
- 1Cutover. Briefly freeze writes on the vshard (~100 ms), swap the routing table, unfreeze. Next request hits the new node.
- 1Drain source. Once confirmed, delete the migrated vshard data from source.
Why this is bounded: ~6% of data moved (1/17), not 100% re-hashed. At 1 TB total that's 60 GB — hours, not a week. And cutover per vshard is ~100 ms, so aggregate availability impact is negligible.
Without virtual shards you either rehash the whole table (days of dual-write, scary cutover) or accept the skew permanently.
?“What happens when one shard gets 10× the load of its peers in production?”Reveal
Diagnose first. Structural (bad key-space distribution) or workload (one hot key) hotspot?
- Structural hotspot → partition key has low entropy. Example: sharding by country where 60% of users are in one country. Fix: change the key, or salt-prefix (country_hash_prefix) to spread within that keyspace. This is a migration.
- Workload hotspot → one specific key is hot, rest fine. Three mitigations without full resharding:
1. Read replicas for the hot shard — spin up extras specifically for that shard. 2. Key salting — spread writes across user:123:ctr:0..9, aggregate on read. Good for counters. 3. L1 cache in app tier — short-TTL (5-10 s) in-process cache absorbs most reads before the shard sees them.
Short-term triage: add capacity to the hot shard (bigger box) to buy time. Don't mistake this for a fix — it's a bigger bucket under the leak.
Long-term: if the same hotspot recurs weekly, the key is structurally wrong. Migrate.
?“Your design shards by user_id. A PM asks for "posts by hashtag" feed. How?”Reveal
The new query doesn't filter on user_id → hitting the primary index becomes scatter-gather across every shard. Three options:
- 1Secondary index, scatter-gather. Each shard has a local index on hashtag; query fans out to all shards and merges. Latency is O(slowest shard). Unacceptable for a user-facing feed; fine for offline analytics.
- 1Separate index store. Maintain a second table partitioned by hashtag mapping hashtag → post_ids. Primary-table writes emit an event (CDC/outbox) that updates the hashtag index async. Reads hit it directly — single-shard by hashtag. Standard pattern for social "by tag" feeds.
- 1Search index (Elasticsearch). If the query is a real search (ranking, filtering, free-text within), push to ES. Primary stays user-partitioned; ES receives post change stream and serves the hashtag query with ranking.
Design lesson: when a new access pattern doesn't match the primary partition key, don't force the primary store to serve it. Maintain a secondary read model optimised for that query. Accept eventual consistency (~seconds) between primary and secondary — call this out explicitly.
?“How do you handle transactions that span multiple shards?”Reveal
Start with: avoid them. If top transactional flows cross shards, the partition key is wrong. Redesign first.
If unavoidable:
- 1Saga pattern (most common at scale). Break the transaction into a sequence of local-shard ops with compensating actions. On failure, run compensations in reverse. No distributed lock; you must accept intermediate states are observable. Works for business processes (refund, cancel) — not invariant-critical atomicity.
- 1Two-phase commit (2PC). Coordinator locks all participants, then commits atomically. ACID but at high latency (two RTTs minimum, locks held) and the coordinator is a failure mode. Modern distributed DBs (Spanner, CockroachDB) implement this with Paxos/Raft-backed coordinators.
- 1Reshard participants onto one shard for that workflow. If strict atomicity is required, colocate. Example: both legs of a transfer for accounts owned by the same customer go to the same shard keyed by customer_id.
In an interview: saga as default, flag 2PC's latency and availability cost, push back on "we need distributed ACID" by asking whether eventual consistency + compensation solves the business problem.
?“How do you decide between 100 small shards vs 10 large shards?”Reveal
More, smaller shards (e.g. 100 × 100 GB):
- Faster recovery — rebuilding a failed shard moves 100 GB, not 1 TB.
- Finer rebalance granularity.
- Better scatter-gather parallelism.
- Higher coordination overhead — more heartbeats, more metadata.
Fewer, larger shards (e.g. 10 × 1 TB):
- Lower coordination cost — simpler mapping.
- Slower recovery — failed-shard rebuild takes hours.
- Coarser rebalance.
Rule of thumb: size a physical shard so it rebuilds from scratch in under 1 hour at your available replication bandwidth. At 10 Gbps that's ~4 TB; at 1 Gbps ~400 GB. Below that, smaller is usually better.
Separately: always have many more virtual shards than physical nodes — 1024+ vshards on 10-100 nodes is standard. Virtual layer makes rebalancing cheap; physical shard size is a cost/recovery tradeoff.
Code examples
import hashlib
VSHARDS = 1024
# vshard_id -> node_id. Persisted in a small config table / ZooKeeper.
VSHARD_MAP: dict[int, str] = {i: f"node-{i % 16}" for i in range(VSHARDS)}
def vshard_for(key: str) -> int:
h = hashlib.blake2b(key.encode(), digest_size=8).digest()
return int.from_bytes(h, "big") % VSHARDS
def node_for(key: str) -> str:
return VSHARD_MAP[vshard_for(key)]
def add_node(new_node: str, take_per_source: int = 4) -> list[tuple[int, str, str]]:
"""Plan the migration: pull ~1/N of vshards from each existing node."""
by_node: dict[str, list[int]] = {}
for vs, n in VSHARD_MAP.items():
by_node.setdefault(n, []).append(vs)
plan = []
for src, vshards in by_node.items():
for vs in vshards[:take_per_source]:
plan.append((vs, src, new_node)) # dual-write → backfill → cutover
return plan-- 16 physical partitions, each can itself be sub-partitioned or moved later.
CREATE TABLE posts (
post_id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
body TEXT,
PRIMARY KEY (user_id, post_id)
) PARTITION BY HASH (user_id);
DO $$ BEGIN
FOR i IN 0..15 LOOP
EXECUTE format(
'CREATE TABLE posts_p%1$s PARTITION OF posts '
'FOR VALUES WITH (MODULUS 16, REMAINDER %1$s)', i);
END LOOP;
END $$;
-- Queries filtering on user_id hit exactly one partition (partition pruning).
-- A query without user_id becomes a scatter-gather across all 16.Common mistakes
Twitter can't shard by user_id alone — one Justin Bieber account would put 10M writes/s on a single shard. The fix is either (a) per-user sub-partitioning for the top 0.01%, or (b) separate the write path (per-user inbox) from the read path (fan-out cache).
Viral user would hammer one shard. Salt the write key with 0..9 suffix to spread across 10 virtual shards; reader aggregates across the 10 salt prefixes.
hash(key) % N means adding one node re-hashes every key. Virtual shards (e.g. 1024 buckets mapped onto physical nodes) turn a rebalance from "move 100% of the data" into "move 1/N of the buckets".
1024 virtual buckets map onto N physical nodes. Adding a node reassigns ~1/N of buckets — not a re-hash of every row. This is how you scale without downtime.
Timestamp or auto-increment as partition key → every new write goes to the last shard. Either prefix with a hash, or partition by something else and cluster by timestamp within the partition.
"We'll just query both shards and merge" works at 10 shards and destroys latency at 1000. Design so the top 3 queries are single-shard.
If you have to rebalance under load for the first time on the day a shard tipped over, you will have an outage. Rehearse it.
Practice drills
You partition a users table by hash(user_id) mod 16. You need to go from 16 nodes to 20. What does this cost?Reveal
With naive % N, nearly every row moves — you've effectively re-hashed the whole dataset, which at scale is hours of downtime or months of dual-write migration. The fix is to use virtual shards: hash into (say) 1024 buckets, then map those buckets onto physical nodes. Going from 16 → 20 nodes moves only ~20% of the buckets. Same idea as consistent hashing, simpler to reason about.
A time-series DB storing IoT events, partitioned by range on timestamp. Writes are skewed to one shard. Why, and how do you fix it?Reveal
All new writes have "now" as their timestamp, so they all land in the latest range partition. Classic monotonic-key hotspot. Two fixes: (1) composite partition key of (device_id_hash_prefix, timestamp) so writes spread across many partitions while still clustering by time within a partition; (2) keep the range partitioning but shard each time partition further by device — most time-series DBs (Cassandra, InfluxDB) support this natively.
SaaS app with 10,000 tenants, heavy-tailed. Top 10 tenants are 80% of load. Hash-partition? Range-partition? Directory?Reveal
Directory. Hash would spread big tenants across shards but stuff small tenants into a pool where they still contend. The directory lets you put the top 10 tenants on dedicated shards (or even their own clusters), put the long tail on a shared pool, and migrate tenants across tiers as they grow — without moving data you didn't want to move.
Cheat sheet
- •Partition key first, scheme second, physical layout third.
- •Virtual shards (1024+) decouple logical from physical — always.
- •Hot shards are key-choice bugs, not capacity bugs. Audit key entropy before scaling out.
- •Monotonic keys → prefix with a hash or partition on something else.
- •The top 0.1% of keys need their own strategy; don't pretend the distribution is uniform.
- •Rehearse rebalancing. Production day is not the dress rehearsal.
Practice this skill
These problems exercise Sharding & partitioning. Try one now to apply what you just learned.