Data model design
Entities, relationships, keys, normalization vs denormalization.
"We'll put it in Postgres" is not a data model. The data model is entities, keys, relationships, cardinalities, and the access patterns each one has to serve — and it locks in every trade-off you will chase for the rest of the design.
Read this if your last attempt…
- You jumped to "we use Postgres / DynamoDB / Cassandra" before naming a single entity
- Your reviewer asked "what does one row look like?" and you paused
- You don't know whether your design needs joins
- You can't answer "what's the cardinality of that relationship?"
- You always reach for 3NF and feel guilty about denormalising
The concept
Data modelling in an interview is the discipline of moving from "what queries do we serve" to "what does one row look like" — in that direction, never the reverse.
The candidate who draws an ER diagram first and then asks how the home page renders has it backwards. The candidate who says "the home page does these three queries, here are the entities those queries touch, here is the key for each one" is going to design something that survives launch.
The arrow only points this way. You derive the schema from the queries you need to serve, not the other way around.
Normalise vs denormalise
| Axis | Keep normalised | Denormalise |
|---|---|---|
| Primary concern | Writes consistent and cheap | Reads fast without joins |
| Good when | Strong consistency required; writes dominate; data mutates often | Read:write ratio > 10:1; the read path must be one lookup |
| Cost | Join on every read; risk of N+1; harder to shard | Writes fan out; need an outbox, CDC, or batch rebuild |
| Update propagation | Immediate via single source of truth | Eventually consistent — staleness window must be acceptable |
| Operational cost | Low — one schema, one path | High — multiple copies, drift detection, rebuild |
| Canonical example | Bank accounts → transactions | Twitter timeline (author tweets fanned into follower inboxes) |
How interviewers grade this
- Before naming a database, you list the top access patterns with QPS and latency.
- For each entity you state: primary key (natural or surrogate), cardinality of each relationship, and growth rate over time.
- You distinguish the write model from the read model when the access patterns diverge.
- Your denormalisation decisions cite a specific read that would otherwise be a multi-hop join.
- You name which fields are immutable vs mutable — mutability drives caching and sharding choices.
- You name which fields are indexed and tie each index to an access pattern.
- You answer "soft delete or hard delete?" before being asked.
Variants
Relational, normalised (3NF)
3NF + foreign keys; joins at read time.
The default for transactional systems. Each entity gets a table; relationships are foreign keys; reads use JOINs to assemble views.
When it shines:
- Strong consistency required (financial, inventory, ACID)
- Mixed read patterns — ad-hoc analytics, reporting
- Schema is well-understood and stable
- Read QPS is low to medium (sub-10k/s on joined paths)
When it breaks:
- Read QPS > 10k on a multi-table join
- Need to shard joined tables horizontally (cross-shard joins are slow)
- Denormalised read views become essential — but you keep adding indexes instead
Tools: Postgres, MySQL, Aurora, Spanner.
Pros
- +Strong consistency, ACID
- +Ad-hoc queries are easy
- +Familiar to every team
- +Single source of truth
Cons
- −Joins get expensive at scale
- −Hard to shard tables that join
- −Schema migrations can be painful
Choose this variant when
- OLTP transactional workloads
- Compliance/financial systems
- Mixed read patterns; ad-hoc query needs
Document / aggregate-oriented
Whole object stored together; no joins; embedded sub-entities.
The aggregate is the unit of read and write. A user document contains the user's profile, preferences, and short bounded sub-collections (e.g., shipping addresses).
When it shines:
- Access pattern is "fetch one aggregate and render"
- Schema evolves per-tenant (no migrations)
- Sub-collections are bounded (< ~100 items)
- Eventual consistency on cross-aggregate views is acceptable
When it breaks:
- Sub-collections grow unbounded (16 MB MongoDB doc limit; 400 KB DynamoDB item)
- Cross-document joins become essential (you reinvent JOIN in app code, badly)
- Two consumers need different sub-slices of the same doc
Tools: MongoDB, DynamoDB single-table, Cosmos DB, Firestore.
Pros
- +One read = one fetch
- +Schema flexibility per document
- +Natural fit for object models
- +Horizontal scaling is easier
Cons
- −Cross-doc analytics is awkward
- −Bounded sub-collections only
- −Eventual consistency on related entities
Choose this variant when
- Aggregate is the natural unit of read
- Schema varies by tenant
- Mobile-style "fetch one screen" patterns
Wide-column / one table per query
Rows are keyed tuples; design the table for the exact query.
Wide-column stores (Cassandra, ScyllaDB, HBase, BigTable) flip data modelling on its head: you design one table per query, denormalised at write time, with the partition key chosen for the query.
The mental model:
- A table is a giant sorted map:
partition key → cluster key → row - Every query is a partition lookup + range scan within the partition
- No joins. No secondary indexes worth using at scale. No ad-hoc queries.
When it shines:
- Write-heavy time-series (IoT, telemetry, logs)
- Predictable, finite set of access patterns known up front
- Need linear horizontal write scaling
- Eventual consistency is acceptable
When it breaks:
- Access patterns change (you have to rebuild the table)
- Need ad-hoc analytics (use a separate OLAP store)
- Hot partitions appear (one celebrity user gets 100x the traffic)
Pros
- +Linear horizontal write scale
- +Predictable single-digit-ms p99
- +Great for time-series
Cons
- −One table per query — denormalisation everywhere
- −No ad-hoc queries
- −Hot partition risk
Choose this variant when
- Write-heavy time-series
- Known, finite query set
- Need to scale writes linearly across nodes
CQRS / materialised read model
Normalised write side; denormalised read side; rebuilt from a log.
When write and read patterns diverge sharply, split the model. Writes go to a normalised OLTP store (Postgres). A change-data-capture (CDC) stream projects each write into one or more denormalised read models — search index (Elasticsearch), analytics warehouse (Snowflake), cache (Redis), pre-computed view (Postgres materialised view).
The contract:
- Write side: small number of normalised tables, ACID transactions
- CDC pipeline (Debezium, Kafka Connect, or app-level outbox)
- Read side: one or more shapes, each tuned to its consumer
- Eventual consistency between write and read sides (typically sub-second)
When it shines:
- Read:write ratio is huge (search, analytics, dashboards)
- Multiple consumers want different shapes of the same data
- Replay-from-log is needed for new consumers or recovery
When it hurts:
- Small teams: operational cost of the pipeline is real
- Strict read-your-writes consistency: clients may need to read from the write side
Pros
- +Each side optimised for its workload
- +New consumers added by adding a new projection
- +Replay enables recovery and migration
Cons
- −Operational cost of the CDC pipeline
- −Eventual consistency on the read side
- −Requires logs/events as first-class infra
Choose this variant when
- Read patterns diverge sharply from writes
- Multiple read consumers (search, analytics, cache)
- Team can run a streaming pipeline
Event-sourced
The source of truth is an append-only event log; current state is a projection.
Instead of storing current state directly, store every state-changing event as an immutable record. Current state is a fold over the event log — usually cached in a snapshot for fast reads.
The contract:
- Domain events are the source of truth (
AccountOpened,MoneyDeposited,MoneyWithdrawn) - Read model is a projection (current balance) computed from events
- Append-only — never update or delete
- Snapshots periodically materialised for performance
When it shines:
- Audit and compliance are first-class (banking, healthcare)
- Need temporal queries ("what was the balance on March 15?")
- Multiple read projections share the same event log
- Recovery / replay / debugging are critical
When it hurts:
- Schema evolution is hard (events are immutable forever)
- Eventual consistency on read projections
- Significantly higher engineering and operational complexity
Pros
- +Perfect audit trail
- +Temporal queries are natural
- +New projections from old events
Cons
- −Event schema evolution is hard
- −Higher engineering bar
- −Eventual consistency
Choose this variant when
- Audit/compliance is core requirement
- Temporal queries needed
- Banking, accounting, healthcare
Worked example
Scenario: Design the data model for an Instagram-style feed (users, posts, likes, comments, follows).
Step 1 — list access patterns
- A1: Render a user's home feed (latest posts from accounts they follow). p95 < 100 ms. ~100k QPS.
- A2: Render a user profile (their photos, paginated). p95 < 200 ms. ~10k QPS.
- A3: Like a post (idempotent per user-post pair). ~50k QPS.
- A4: Comment on a post (latest 20 shown inline). ~10k QPS.
- A5: Follow / unfollow a user. ~1k QPS.
Step 2 — entities and keys
- User: PK user_id (UUIDv7). Fields: username (unique, indexed), display_name, avatar_url, follower_count (denormalised), following_count (denormalised), created_at.
- Post: PK post_id (UUIDv7, time-ordered). Fields: user_id (FK, indexed (user_id, created_at DESC) for A2), media_url, caption, like_count (denormalised), comment_count (denormalised), created_at.
- Like: composite PK (user_id, post_id). Idempotent — re-like is a no-op.
- Comment: PK comment_id (UUIDv7). Fields: post_id (indexed (post_id, created_at DESC) for A4), user_id, body, created_at.
- Follow: composite PK (follower_id, followee_id). Indexed both ways.
Step 3 — relationship shapes
- User → Posts: 1-to-N, unbounded. REFERENCE — never embed. Paginate via cursor on (created_at, post_id).
- Post → Likes: 1-to-N, can be massive. REFERENCE. Like_count denormalised on Post for the render path. Updated async from a like-counter pipeline (e.g., Redis INCR + periodic flush) so a viral post does not slam Postgres.
- Post → Comments: 1-to-N, can be many. REFERENCE. Embed only the latest 5 comments into the post document for the inline render (A4 hot path); fetch the rest on demand.
- User → Follows: N-to-N, can be huge (millions of followers for celebrities). REFERENCE. Two-way indexed.
Step 4 — write model vs read model for the home feed (A1)
- Write model: Post stored once in Postgres, partitioned by user_id.
- Read model: home feed served from a per-user "inbox" KV store (Redis or DynamoDB). On each new post by a user with < 1M followers, fan out the post_id into each follower's inbox. For celebrities (>1M followers), do not fan out — instead, do a pull-on-read merge of the celebrity's posts with the inbox. This is the hybrid Twitter timeline pattern.
- Consistency: home feed is eventually consistent (within seconds of post). Acceptable trade-off for the latency budget.
Step 5 — soft deletes and audit
- Posts: soft delete (deleted_at column) for 30 days, then hard delete. Deleted posts are excluded from feeds via a filter.
- Likes / follows: hard delete. No audit need.
- User account deletion: soft delete with 14-day grace period (GDPR-compliant), then hard delete cascading to posts.
Step 6 — index sanity check
- User: PK user_id, unique index on username, no others (avoid bloat).
- Post: PK post_id, index (user_id, created_at DESC) for A2.
- Like: PK (user_id, post_id), index (post_id) for "who liked this post".
- Comment: PK comment_id, index (post_id, created_at DESC) for A4.
- Follow: PK (follower_id, followee_id), index (followee_id) for "who follows this user".
Total indexes: 9. Each tied to an access pattern. No speculative indexes.
That is a complete data model in three minutes that holds up to "now what about celebrities" and "now what about the like counter spam".
Good vs bad answer
Interviewer probe
“Design the data model for a Q&A site (users, questions, answers, votes).”
Weak answer
"I'd use Postgres. Users table with id, email. Questions table with id, user_id, title, body. Answers table with id, question_id, user_id, body. Votes table with id, user_id, target_id, value."
Strong answer
"Top access patterns are (1) render a question page — question + its answers + vote counts, (2) render a user profile — their questions and answers, (3) voting — write-heavy and idempotent per (user, target).
Entities: User (surrogate id, natural email), Question (surrogate id, FK user, indexed by (user_id, created_at) for the profile, (tag, score DESC) for "top by tag"), Answer (FK question, FK user, indexed (question_id, score DESC)), Vote (composite PK (user_id, target_type, target_id), value ±1 — idempotent on PK conflict).
Denormalised score on Question and Answer, updated async from a vote pipeline — the render path cannot afford COUNT-on-read at 10k QPS. I would start in Postgres for everything and move votes to a separate KV store (Redis or DynamoDB) only if vote QPS exceeds 5k/s sustained. Soft delete on questions and answers (30-day grace), hard delete on votes."
Why it wins: Access patterns named first and tied to indexes. Primary keys justified (composite PK on Vote gives idempotency for free). Denormalisation of score is explicit, defended on a concrete read-path latency argument, and has an async update path. Starts simple (one store) with a clear migration trigger. Soft/hard delete addressed.
When it comes up
- Right after API design, when interviewer says "what does your data look like"
- When the prompt involves complex relationships (feed, social graph, e-commerce catalog)
- During scaling deep-dives (the model determines the storage and the sharding)
- When asked "what does one row look like" — they want PK + key fields + indexes
- When the read path looks like it needs a join the system cannot afford
Order of reveal
- 1Re-state the top 3–5 access patterns. "Before I model entities, the top reads are: A1 (...), A2 (...). The top writes are: W1 (...), W2 (...). Each with rough QPS and latency target."
- 2Name the entities. "Entities are: <Entity1>, <Entity2>, <Entity3>. For each I will give you PK, hot-path fields, and relationships."
- 3Pick PK type for each. "User PK is UUIDv7 (opaque, time-ordered, no enumeration). Vote PK is composite (user_id, target_id) — gives idempotency for free."
- 4Name relationship cardinalities. "User → Posts is 1-to-N unbounded — reference, never embed. Post → Comments is 1-to-N bounded for the hot render — embed latest 5, reference the rest."
- 5Call out denormalisation. "like_count denormalised on Post — read path needs it at 100k QPS, COUNT(*) every render is unaffordable. Updated async from a counter pipeline."
- 6Split write vs read model if patterns diverge. "Write model is normalised in Postgres. Home feed is a separate per-user inbox in Redis, populated by a fan-out worker on new posts."
- 7Name indexes and tie to patterns. "Indexes: Post(user_id, created_at DESC) for profile reads; Comment(post_id, created_at DESC) for thread reads. Each tied to an access pattern."
- 8Address soft/hard delete + audit. "Posts soft-deleted with 30-day grace; likes hard-deleted. Audit via an event log if compliance requires it."
Signature phrases
- “Queries first, schema second” — Reverses the wrong-direction approach interviewers see most often.
- “Embed only when bounded and read together” — Sharp rule for embed-vs-reference decisions.
- “Read:write ratio > 10:1 justifies denormalisation” — Concrete trigger for an otherwise hand-wavy decision.
- “Composite PK gives idempotency for free” — Subtle senior insight on Like / Follow modelling.
- “Project to terminal cardinality, not launch day” — Catches the "bounded today, unbounded in 2 years" trap.
- “Each index tied to an access pattern” — Shows index discipline, not speculative bloat.
Likely follow-ups
?“Why composite PK on Vote / Like / Follow rather than a surrogate id?”Reveal
Three reasons:
- 1Idempotency for free. A composite PK
(user_id, post_id)enforces "one row per (user, post)" in the database.INSERT ... ON CONFLICT DO NOTHINGmakes "like a post" idempotent at the storage layer. No application-level dedupe needed.
- 1Lookup is the PK. Every interesting query is "did user X like post Y" or "all likes for post Y" — both served directly by the composite PK and the (post_id) index. A surrogate id adds nothing.
- 1No wasted space. A surrogate id costs 8 bytes per row × billions of rows = real money. The composite PK is already there.
The only reason to add a surrogate id: if the relationship has independent identity (rare for Like/Follow). For most join entities, composite PK is the right call.
?“When do you embed vs reference a sub-collection?”Reveal
Embed when ALL three are true:
- 1The sub-collection is bounded at design time (< ~100 items, ideally < 20)
- 2The sub-collection is always read with the parent on the hot path
- 3The sub-collection has no independent access pattern (nobody queries it without the parent)
Reference otherwise. Specifically:
- Unbounded growth (a user's posts, a thread's messages) → reference. Embedded growth blows up the document size limit (16 MB Mongo, 400 KB Dynamo) and forces re-reads of the whole thing on every change.
- Independent access (you sometimes query "all comments by user X across all posts") → reference. Embedded sub-collections are invisible to indexes outside the parent.
- Different write rate than the parent (likes change every second; user profile changes monthly) → reference. Embedded means the parent doc gets rewritten on every like, blowing your write QPS.
Hybrid pattern: embed the latest 5 comments on a post for the inline render, reference the full collection for the "view all" page. Best of both.
?“How do you handle a celebrity user with 50M followers in a feed system?”Reveal
The standard fan-out-on-write breaks for celebrities — writing 50M inbox entries per tweet is infeasible. The fix is a hybrid model:
- 1Regular users (< ~1M followers): fan out on write. New post → write post_id into every follower's inbox.
- 2Celebrities (> ~1M followers): do NOT fan out. Their posts stay in their own outbox.
- 3Read path: when serving a user's feed, merge two sources:
- Their inbox (regular fan-out) - The outboxes of all celebrities they follow (pull on read)
- 1Sort and paginate the merged set in memory (typically capped at 100 candidates per page).
Operational details:
- Threshold (1M) is tunable based on read QPS vs write QPS economics
- Inboxes are usually a per-user sorted set in Redis, capped at ~1000 entries
- Celebrity outboxes are aggressively cached
- The threshold is often dynamic: a user above 1M and posting frequently may stay in fan-out if their followers are mostly active
This is the actual Twitter / Instagram pattern. Naming it shows you have read past blog-post depth.
?“You said "denormalised score updated async". What if the user wants their like to show up immediately?”Reveal
Read-your-writes via local update + reconciliation.
The pattern:
- 1Server writes the like row (
(user_id, post_id)composite PK) — synchronous, <10 ms. - 2Server returns 201 to the client immediately.
- 3Client locally increments the displayed like count by 1 — feels instant to the user.
- 4Async pipeline (Kafka or Redis Streams) eventually updates the denormalised
like_counton the Post row — typically within seconds. - 5On the next page load, the client reads the canonical denormalised count from the server.
The user perceives an instant update; the server-side counter catches up shortly. Eventual consistency is invisible because the client UI is the local source of truth for the duration of the session.
Edge case: if the user refreshes within the propagation window, they may see a count that does not include their own like yet. Mitigations: (a) cache the user's recent likes client-side and add 1 to the displayed count; (b) maintain a per-user "my recent likes" cache server-side and merge on read.
Code examples
CREATE TABLE likes (
user_id uuid NOT NULL,
post_id uuid NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, post_id)
);
-- "like this post" — idempotent at the storage layer
INSERT INTO likes (user_id, post_id)
VALUES ($1, $2)
ON CONFLICT (user_id, post_id) DO NOTHING;
-- "all likes for this post" — uses the secondary index
CREATE INDEX likes_by_post ON likes (post_id, created_at DESC);-- A1: latest posts by user, paginated
SELECT post_id, caption, created_at, like_count
FROM posts
WHERE user_id = $1
AND (created_at, post_id) < ($2_after_created_at, $2_after_post_id)
ORDER BY created_at DESC, post_id DESC
LIMIT 20;
-- The (created_at, post_id) tuple is the cursor.
-- Index: (user_id, created_at DESC, post_id DESC)
-- Constant time per page regardless of how deep you paginate.// posts/post_abc123 — MongoDB / Firestore aggregate
{
"_id": "post_abc123",
"user_id": "user_xyz789",
"media_url": "https://cdn.../abc.jpg",
"caption": "Sunset over the bridge",
"created_at": "2025-04-20T18:30:00Z",
"like_count": 4287, // denormalised, async updated
"comment_count": 312, // denormalised, async updated
"latest_comments": [ // BOUNDED — latest 5, for inline render
{ "user_id": "user_a", "body": "🔥", "at": "2025-04-20T18:31Z" },
{ "user_id": "user_b", "body": "wow", "at": "2025-04-20T18:32Z" }
// ... 3 more
]
}
// Full comment list lives in /comments collection, fetched on "view all"Common mistakes
You drew a perfect 3NF ERD and then discovered the home screen needs a 5-way join at 50k QPS. Work backwards from the queries — always.
"User has many comments — I'll embed them" works until a user has 100k comments. Embed only when the child list is bounded (e.g. <100) and always read with the parent.
An email address or an ISBN is already unique and already the lookup key. Adding user_id on top creates a double-lookup on every foreign reference. Use natural PK when stable; surrogate when mutable or composite.
A 1-N relationship where N is small today but "posts per user" over five years is unbounded. Model for the terminal cardinality, not the launch-day one.
The same table serves the transactional write path AND a 100× higher analytics read — and neither path is happy. Name the two models explicitly and split via CDC when the divergence justifies it.
UUIDv4 is random, so insert location is random, so B-tree indexes fragment, so write QPS drops 5–10× as the table grows. Use UUIDv7, ULID, or Snowflake for time-ordered IDs that play nicely with indexes.
Practice drills
You're modelling a URL shortener. Short code is 7 chars, 100M links, ~10k creations/s at peak. What's the PK and why?Reveal
The natural PK is the short code itself — it's already unique, it's the lookup key on every read (GET /:code → long URL), and adding a surrogate int creates a useless second index that has to be maintained on every write.
Uniqueness is enforced at generation time: pre-reserved code ranges per shard (e.g., shard 1 owns codes starting with a–f) eliminates collisions across shards; within a shard, retry on collision (rare with 7 base62 chars = 3.5T space).
Storage layout: (short_code PRIMARY KEY, long_url, created_at, expires_at). With 100M rows and the PK being the lookup key, every read is a single B-tree lookup. No JOIN, no secondary index needed.
Instagram-style photo app. A User has many Photos. Embed or reference?Reveal
Reference. Three reasons:
- 1Unbounded — a heavy user has 10k–100k photos. That blows past the document size limit (16 MB MongoDB, 400 KB DynamoDB).
- 2Different write rate — User profile changes monthly; Photos are created daily. Embedding means rewriting the User doc on every photo creation.
- 3Independent access — Photos are queried in their own right (timeline, search), not just via User.
Model: Photos in their own collection, indexed by (user_id, created_at DESC) for profile pagination. User document stores cached photo_count (denormalised, async-updated) plus latest_photo_id for the profile header thumbnail. Best of both worlds.
Ticketmaster. Seat inventory for one high-demand event. What's the data model constraint that drives storage choice?Reveal
Oversell-prevention on writes. The access pattern is "reserve seat X atomically; fail if someone else got it first."
That demands one of:
- Single-row ACID update in a relational store (
UPDATE seats SET reserved_by=$1 WHERE seat_id=$2 AND reserved_by IS NULL RETURNING *) - Conditional write in a KV store (DynamoDB
ConditionExpression: attribute_not_exists(reserved_by)) - Serialised queue of reservation attempts
It rules out eventual-consistency stores for the reservation row itself — if two clients can both succeed and reconciliation happens later, you have oversold the event.
The rest of the system (browsing seats, viewing event details) can be eventually consistent and cached aggressively. The constraint applies only to the reservation write path. This separation — strong consistency where it matters, eventual elsewhere — is the design instinct interviewers reward.
You have a `posts` table. Every page render does `SELECT COUNT(*) FROM likes WHERE post_id = ?`. What's wrong and what's the fix?Reveal
What's wrong: COUNT(*) on the likes table is O(N) per render — 10k likes means 10k row reads per page view. At 100k page views/s with viral posts (1M+ likes), this is multiple millions of row reads per second. Postgres dies.
The fix: denormalise like_count onto the posts row.
- Write path:
INSERT INTO likes ... ON CONFLICT DO NOTHINGreturns whether the row was new. If yes, incrementlike_counton the post (either synchronously with a trigger, or async via a counter pipeline). - Counter pipeline (preferred for hot posts): writes go to a Redis INCR per post; a worker flushes the increments to Postgres every few seconds. Survives 100k+ writes/s on a single post without hammering Postgres.
- Read path:
SELECT like_count FROM posts WHERE post_id = ?— single row lookup, sub-millisecond.
Trade-off: like_count is eventually consistent (lag of seconds). Acceptable — nobody notices their like is reflected 2 seconds late, and the local UI optimistically increments on click.
When would you choose event sourcing over a normal CRUD model?Reveal
Event sourcing wins when at least two of these are true:
- Audit trail is a hard requirement — banking, healthcare, accounting, anything regulated. Replaying events gives perfect history; CRUD requires bolt-on audit tables that drift.
- Temporal queries matter — "what was the balance on March 15?" is trivial from events, painful from CRUD.
- Multiple read projections of the same data — current balance, monthly statement, fraud signal, search index. Each is a projection over the same event log.
- Replay / recovery / debugging is critical — you can rebuild any read model from the events.
Choose CRUD when:
- The domain has no audit need (most product features)
- The team is small and cannot afford the operational complexity (event store, projections, replay tooling)
- The data shape is genuinely simple and won't support multiple read views
The most common trap: choosing event sourcing because it sounds elegant, then drowning in schema-evolution complexity (events are immutable forever — every renamed field is a versioned event). Event sourcing is power that costs ongoing engineering tax. Justify it.
Cheat sheet
- •3-step ritual: queries → entities → relationships. Never schema-first.
- •For every entity: PK choice, growth rate, mutability, hot-path fields.
- •Embed only when the child collection is bounded AND always read together.
- •Denormalise when read:write > 10:1 AND the denormalised column is expensive to compute.
- •Composite PK on join entities (Like, Follow, Vote) — idempotency for free.
- •Time-ordered IDs (UUIDv7, ULID) over UUIDv4 — index-friendly.
- •Call out "write model vs read model" the moment they diverge.
- •Cardinality drift kills designs — project 3 years out, not launch day.
- •Each index must be justified by an access pattern. No speculative bloat.
- •Name soft delete vs hard delete per entity. Mention GDPR if user data is in scope.
Practice this skill
These problems exercise Data model design. Try one now to apply what you just learned.