intermediatedata

Sharding & partitioning

Partition key selection, hot spots, rebalancing, consistent hashing.

The partition key is the single most consequential decision in a distributed data design. Pick it wrong and no amount of horsepower recovers you — the hot shard stays hot, the rebalance never finishes, and the team spends a quarter migrating.

Read this if your last attempt…

Your design says "shard by user_id" and you can't explain why
You were asked "what happens when one shard is twice as busy as the others?" and stalled
You don't know the difference between logical and physical shards
You said "we'll rebalance later" without naming how

The concept

Sharding (a.k.a. horizontal partitioning) is the deliberate split of one logical dataset across many physical nodes so that each node only owns a slice of it.

Three choices, in order:

Architecture diagram· Three ways to split a dataset

Range = ordered, great for scans, risks hot tails. Hash = uniform spread, no range queries. Directory = explicit map, flexible, needs a router that itself scales.

Which partition scheme for which workload

Range	Hash	Consistent hash	Directory
Write distribution	Skewed on monotonic keys	Uniform	Uniform	As good as the mapping
Range queries	Cheap	Scatter-gather	Scatter-gather	Scatter-gather
Resharding cost	Medium	High (naive) / medium (w/ virtual shards)	Low (~1/N)	Low
Operational complexity	Low	Low	Medium	High (mapping is state)

How interviewers grade this

You name the partition key explicitly and defend it against "what's the read path?" and "what's the write distribution?"
You distinguish logical shards (virtual buckets) from physical nodes and explain how rebalancing moves buckets, not rows.
You call out hot-key risk concretely: "celebrity users would concentrate writes; I handle this by splitting their keyspace."
You pick a scheme (range / hash / directory) on evidence — not "hash because it's even".
You mention the cross-shard-query cost and what queries become expensive after sharding.

Variants

Range partitioning

Split the key space into ordered ranges.

Good for:

Range scans (time, alphabetical)
Time-series data with TTL per partition
Small, predictable key spaces

Bad for:

Monotonic keys (auto-increment, timestamp) — last shard takes all writes
Uniform write distribution is a hard requirement

Pros

+Range scans (time, alphabetical)
+Time-series data with TTL per partition
+Small, predictable key spaces

Cons

−Monotonic keys (auto-increment, timestamp) — last shard takes all writes
−Uniform write distribution is a hard requirement

Choose this variant when

Time-series ingestion where you partition by day or week and drop whole partitions to expire data.

Hash partitioning

hash(key) % N → shard id.

Architecture diagram· Virtual shards decouple logical from physical

1024 virtual buckets map onto N physical nodes. Adding a node reassigns ~1/N of buckets — not a re-hash of every row. This is how you scale without downtime.

Good for:

Uniform write distribution out of the box
Point-lookup workloads where range scans aren't needed
When you can't predict what "hot" looks like

Bad for:

Range queries become scatter-gather
Naive modulo makes resharding catastrophic

Pros

+Uniform write distribution out of the box
+Point-lookup workloads where range scans aren't needed
+When you can't predict what "hot" looks like

Cons

−Range queries become scatter-gather
−Naive modulo makes resharding catastrophic

Choose this variant when

URL shortener, user-profile store, session cache — any point-lookup-dominant workload.

Consistent hashing

Hash onto a ring; nodes own arcs; add a node, move ~1/N of keys.

Good for:

Dynamic cluster membership (autoscaling, failures)
Distributed caches (Redis cluster, memcached)
Minimising data movement on rebalance

Bad for:

Workloads that still have hot keys — consistent hashing doesn't fix a skewed key

Pros

+Dynamic cluster membership (autoscaling, failures)
+Distributed caches (Redis cluster, memcached)
+Minimising data movement on rebalance

Cons

−Workloads that still have hot keys — consistent hashing doesn't fix a skewed key

Choose this variant when

Cache clusters that scale up and down frequently; Dynamo-style distributed stores.

Directory / lookup-based

Explicit mapping table says which shard owns each key.

Architecture diagram· Directory-based sharding for heterogeneous tenants

Lookup service maps tenant to its shard. Top-10 tenants get dedicated shards; long-tail shares a pool. Migrate a tenant across tiers by updating the directory.

Good for:

Multi-tenancy where tenants vary wildly in size
Flexible rebalancing without moving data
Different tenants on different physical tiers

Bad for:

The lookup itself becomes a hot path and must be cached

Pros

+Multi-tenancy where tenants vary wildly in size
+Flexible rebalancing without moving data
+Different tenants on different physical tiers

Cons

−The lookup itself becomes a hot path and must be cached

Choose this variant when

Large SaaS where you want to put the top 10 tenants on dedicated shards.

Good vs bad answer

Interviewer probe

“How would you shard the `posts` table for a Twitter-scale system?”

Weak answer

"I'd shard by user_id using hash partitioning. That distributes writes evenly and scales horizontally."

Strong answer

"Two observations shape the answer. First, reads (timeline loads) dominate writes by ~100:1, so the read path is the one I optimise. Second, the user distribution is heavy-tailed — a few accounts are 10,000× more active than the median.

For posts I'd partition by user_id with hash-based virtual shards (1024 buckets over N physical nodes). But I'd treat the top ~1,000 users as a special case: their posts get their own partition strategy (hash of user_id + post_id to spread within their keyspace) and the read path for their followers hits a pre-materialised fan-out cache rather than scanning the posts table.

Resharding uses the virtual-shard layer so adding nodes moves ~1/N of the buckets rather than rehashing the whole table. I'd budget rehearsed rebalancing quarterly — never on the critical path."

Why it wins: Names the workload skew explicitly and partitions differently for the tail. Uses virtual shards so adding capacity is a bounded operation. Defends the choice against the actual access pattern (read:write ratio) rather than generic "scales well".

Interview playbook3-4 minutes in the data design step, plus ~1 minute per deep-dive when a new query or access pattern is introduced.

When it comes up

When data volume or write throughput estimate exceeds one box (typically > 1 TB or > 50K writes/sec)
During data model step, right after you name the primary entity
When the interviewer asks "how does this scale to 100× more users?"
When you propose a relational store at scale — sharding is what makes it scale
When fan-out queries, hot keys, or read-replicas come up in deep-dive

Order of reveal

1
State the three decisions in order. "Partition key first, scheme second, physical layout third. The partition key is the consequential choice — scheme and layout are mechanical once the key is right."
2
Pick the partition key from the top access pattern. "The top query is 'timeline for a user'. That query must be single-shard, so the key is user_id. If the top query were 'all posts for a tag', it'd be tag_id instead."
3
Choose scheme on evidence. "Hash partitioning because writes are point-inserts, I don't need range scans on the partition key, and I want uniform write distribution."
4
Introduce virtual shards explicitly. "1024 virtual shards mapped onto N physical nodes. Adding capacity moves buckets, not rows — only ~1/N of data moves per new node."
5
Call out hot-shard risk with a concrete number. "The top 0.1% of users will be 10,000× the median. I'll either sub-partition their keyspace with a composite key, or move them to a dedicated tier."
6
Name what queries get expensive. "Cross-shard joins and queries on non-partition columns are now scatter-gather. For the top expensive secondary queries I'd maintain a materialised view or secondary index, partitioned on that query's key."
7
Close with rebalance discipline. "Rebalancing is rehearsed quarterly, not improvised on the day a shard tips over."

Signature phrases

“Partition key first, scheme second, layout third”

“Virtual shards decouple logical from physical”

“Hot shards are key-choice bugs, not capacity bugs”

“The top 0.1% of keys need their own strategy”

“Every cross-shard query is a latency bill”

“Rebalancing is rehearsed, not improvised”

“Partition key first, scheme second, layout third” — Orders the decision correctly — most candidates jump to "hash vs range".
“Virtual shards decouple logical from physical” — Single phrase that tells the interviewer you know how to rebalance without pain.
“Hot shards are key-choice bugs, not capacity bugs” — Reframes "add more nodes" as the wrong answer.
“The top 0.1% of keys need their own strategy” — Demonstrates you've thought about heavy-tailed distributions.
“Every cross-shard query is a latency bill” — Forces you to justify each scatter-gather.
“Rebalancing is rehearsed, not improvised” — Operational maturity; separates senior from mid.

Likely follow-ups

?“Walk me through what happens when you add a new physical node to a 16-node cluster using virtual shards.”Reveal

Assume 1024 vshards mapped onto 16 nodes (64 vshards per node). Adding a 17th node:

1Compute target mapping. 1024 / 17 ≈ 60 vshards per node. The new node takes ~60 vshards, pulled ~4 from each of the existing 16.

1Dual-write phase. For each migrating vshard, writes go to both source and destination while the source bulk-copies historical data. Hours to a day depending on data size.

1Catch-up phase. Destination replays the change stream until within a few seconds of source.

1Cutover. Briefly freeze writes on the vshard (~100 ms), swap the routing table, unfreeze. Next request hits the new node.

1Drain source. Once confirmed, delete the migrated vshard data from source.

Why this is bounded: ~6% of data moved (1/17), not 100% re-hashed. At 1 TB total that's 60 GB — hours, not a week. And cutover per vshard is ~100 ms, so aggregate availability impact is negligible.

Without virtual shards you either rehash the whole table (days of dual-write, scary cutover) or accept the skew permanently.

?“What happens when one shard gets 10× the load of its peers in production?”Reveal

Diagnose first. Structural (bad key-space distribution) or workload (one hot key) hotspot?

Structural hotspot → partition key has low entropy. Example: sharding by country where 60% of users are in one country. Fix: change the key, or salt-prefix (country_hash_prefix) to spread within that keyspace. This is a migration.

Workload hotspot → one specific key is hot, rest fine. Three mitigations without full resharding:

1. Read replicas for the hot shard — spin up extras specifically for that shard. 2. Key salting — spread writes across user:123:ctr:0..9, aggregate on read. Good for counters. 3. L1 cache in app tier — short-TTL (5-10 s) in-process cache absorbs most reads before the shard sees them.

Short-term triage: add capacity to the hot shard (bigger box) to buy time. Don't mistake this for a fix — it's a bigger bucket under the leak.

Long-term: if the same hotspot recurs weekly, the key is structurally wrong. Migrate.

?“Your design shards by user_id. A PM asks for "posts by hashtag" feed. How?”Reveal

The new query doesn't filter on user_id → hitting the primary index becomes scatter-gather across every shard. Three options:

1Secondary index, scatter-gather. Each shard has a local index on hashtag; query fans out to all shards and merges. Latency is O(slowest shard). Unacceptable for a user-facing feed; fine for offline analytics.

1Separate index store. Maintain a second table partitioned by hashtag mapping hashtag → post_ids. Primary-table writes emit an event (CDC/outbox) that updates the hashtag index async. Reads hit it directly — single-shard by hashtag. Standard pattern for social "by tag" feeds.

1Search index (Elasticsearch). If the query is a real search (ranking, filtering, free-text within), push to ES. Primary stays user-partitioned; ES receives post change stream and serves the hashtag query with ranking.

Design lesson: when a new access pattern doesn't match the primary partition key, don't force the primary store to serve it. Maintain a secondary read model optimised for that query. Accept eventual consistency (~seconds) between primary and secondary — call this out explicitly.

?“How do you handle transactions that span multiple shards?”Reveal

Start with: avoid them. If top transactional flows cross shards, the partition key is wrong. Redesign first.

If unavoidable:

1Saga pattern (most common at scale). Break the transaction into a sequence of local-shard ops with compensating actions. On failure, run compensations in reverse. No distributed lock; you must accept intermediate states are observable. Works for business processes (refund, cancel) — not invariant-critical atomicity.

1Two-phase commit (2PC). Coordinator locks all participants, then commits atomically. ACID but at high latency (two RTTs minimum, locks held) and the coordinator is a failure mode. Modern distributed DBs (Spanner, CockroachDB) implement this with Paxos/Raft-backed coordinators.

1Reshard participants onto one shard for that workflow. If strict atomicity is required, colocate. Example: both legs of a transfer for accounts owned by the same customer go to the same shard keyed by customer_id.

In an interview: saga as default, flag 2PC's latency and availability cost, push back on "we need distributed ACID" by asking whether eventual consistency + compensation solves the business problem.

?“How do you decide between 100 small shards vs 10 large shards?”Reveal

More, smaller shards (e.g. 100 × 100 GB):

Faster recovery — rebuilding a failed shard moves 100 GB, not 1 TB.
Finer rebalance granularity.
Better scatter-gather parallelism.
Higher coordination overhead — more heartbeats, more metadata.

Fewer, larger shards (e.g. 10 × 1 TB):

Lower coordination cost — simpler mapping.
Slower recovery — failed-shard rebuild takes hours.
Coarser rebalance.

Rule of thumb: size a physical shard so it rebuilds from scratch in under 1 hour at your available replication bandwidth. At 10 Gbps that's ~4 TB; at 1 Gbps ~400 GB. Below that, smaller is usually better.

Separately: always have many more virtual shards than physical nodes — 1024+ vshards on 10-100 nodes is standard. Virtual layer makes rebalancing cheap; physical shard size is a cost/recovery tradeoff.

Code examples

pythonVirtual-shard router — add a node without a full re-hash

import hashlib

VSHARDS = 1024

# vshard_id -> node_id. Persisted in a small config table / ZooKeeper.
VSHARD_MAP: dict[int, str] = {i: f"node-{i % 16}" for i in range(VSHARDS)}

def vshard_for(key: str) -> int:
    h = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(h, "big") % VSHARDS

def node_for(key: str) -> str:
    return VSHARD_MAP[vshard_for(key)]

def add_node(new_node: str, take_per_source: int = 4) -> list[tuple[int, str, str]]:
    """Plan the migration: pull ~1/N of vshards from each existing node."""
    by_node: dict[str, list[int]] = {}
    for vs, n in VSHARD_MAP.items():
        by_node.setdefault(n, []).append(vs)
    plan = []
    for src, vshards in by_node.items():
        for vs in vshards[:take_per_source]:
            plan.append((vs, src, new_node))   # dual-write → backfill → cutover
    return plan

sqlPostgres declarative partitioning by hash

-- 16 physical partitions, each can itself be sub-partitioned or moved later.
CREATE TABLE posts (
    post_id   BIGINT NOT NULL,
    user_id   BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    body      TEXT,
    PRIMARY KEY (user_id, post_id)
) PARTITION BY HASH (user_id);

DO $$ BEGIN
    FOR i IN 0..15 LOOP
        EXECUTE format(
            'CREATE TABLE posts_p%1$s PARTITION OF posts '
            'FOR VALUES WITH (MODULUS 16, REMAINDER %1$s)', i);
    END LOOP;
END $$;

-- Queries filtering on user_id hit exactly one partition (partition pruning).
-- A query without user_id becomes a scatter-gather across all 16.

Common mistakes

Sharding by user_id on a celebrity-skewed workload

Twitter can't shard by user_id alone — one Justin Bieber account would put 10M writes/s on a single shard. The fix is either (a) per-user sub-partitioning for the top 0.01%, or (b) separate the write path (per-user inbox) from the read path (fan-out cache).

Architecture diagram· Hot-key mitigation by salting

Viral user would hammer one shard. Salt the write key with 0..9 suffix to spread across 10 virtual shards; reader aggregates across the 10 salt prefixes.

Modulo-hash with no virtual shards

hash(key) % N means adding one node re-hashes every key. Virtual shards (e.g. 1024 buckets mapped onto physical nodes) turn a rebalance from "move 100% of the data" into "move 1/N of the buckets".

Architecture diagram· Virtual shards decouple logical from physical

1024 virtual buckets map onto N physical nodes. Adding a node reassigns ~1/N of buckets — not a re-hash of every row. This is how you scale without downtime.

Monotonic partition keys

Timestamp or auto-increment as partition key → every new write goes to the last shard. Either prefix with a hash, or partition by something else and cluster by timestamp within the partition.

Cross-shard joins pretended away

"We'll just query both shards and merge" works at 10 shards and destroys latency at 1000. Design so the top 3 queries are single-shard.

Rebalancing as a production-day exercise

If you have to rebalance under load for the first time on the day a shard tipped over, you will have an outage. Rehearse it.

Practice drills

You partition a users table by hash(user_id) mod 16. You need to go from 16 nodes to 20. What does this cost?Reveal

With naive % N, nearly every row moves — you've effectively re-hashed the whole dataset, which at scale is hours of downtime or months of dual-write migration. The fix is to use virtual shards: hash into (say) 1024 buckets, then map those buckets onto physical nodes. Going from 16 → 20 nodes moves only ~20% of the buckets. Same idea as consistent hashing, simpler to reason about.

A time-series DB storing IoT events, partitioned by range on timestamp. Writes are skewed to one shard. Why, and how do you fix it?Reveal

All new writes have "now" as their timestamp, so they all land in the latest range partition. Classic monotonic-key hotspot. Two fixes: (1) composite partition key of (device_id_hash_prefix, timestamp) so writes spread across many partitions while still clustering by time within a partition; (2) keep the range partitioning but shard each time partition further by device — most time-series DBs (Cassandra, InfluxDB) support this natively.

SaaS app with 10,000 tenants, heavy-tailed. Top 10 tenants are 80% of load. Hash-partition? Range-partition? Directory?Reveal

Directory. Hash would spread big tenants across shards but stuff small tenants into a pool where they still contend. The directory lets you put the top 10 tenants on dedicated shards (or even their own clusters), put the long tail on a shared pool, and migrate tenants across tiers as they grow — without moving data you didn't want to move.

Cheat sheet

•Partition key first, scheme second, physical layout third.
•Virtual shards (1024+) decouple logical from physical — always.
•Hot shards are key-choice bugs, not capacity bugs. Audit key entropy before scaling out.
•Monotonic keys → prefix with a hash or partition on something else.
•The top 0.1% of keys need their own strategy; don't pretend the distribution is uniform.
•Rehearse rebalancing. Production day is not the dress rehearsal.

Practice this skill

These problems exercise Sharding & partitioning. Try one now to apply what you just learned.

url shortener news feed twitter timeline

8% complete

Current

Read this if

Step 1 of 13

The concept

Jump to next

Range

Hash

Consistent hash