coredata

Data model design

Entities, relationships, keys, normalization vs denormalization.

"We'll put it in Postgres" is not a data model. The data model is entities, keys, relationships, cardinalities, and the access patterns each one has to serve — and it locks in every trade-off you will chase for the rest of the design.

Read this if your last attempt…

You jumped to "we use Postgres / DynamoDB / Cassandra" before naming a single entity
Your reviewer asked "what does one row look like?" and you paused
You don't know whether your design needs joins
You can't answer "what's the cardinality of that relationship?"
You always reach for 3NF and feel guilty about denormalising

The concept

Data modelling in an interview is the discipline of moving from "what queries do we serve" to "what does one row look like" — in that direction, never the reverse.

The candidate who draws an ER diagram first and then asks how the home page renders has it backwards. The candidate who says "the home page does these three queries, here are the entities those queries touch, here is the key for each one" is going to design something that survives launch.

Architecture diagram· Access patterns → entities → schema

The arrow only points this way. You derive the schema from the queries you need to serve, not the other way around.

Normalise vs denormalise

Axis	Keep normalised	Denormalise
Primary concern	Writes consistent and cheap	Reads fast without joins
Good when	Strong consistency required; writes dominate; data mutates often	Read:write ratio > 10:1; the read path must be one lookup
Cost	Join on every read; risk of N+1; harder to shard	Writes fan out; need an outbox, CDC, or batch rebuild
Update propagation	Immediate via single source of truth	Eventually consistent — staleness window must be acceptable
Operational cost	Low — one schema, one path	High — multiple copies, drift detection, rebuild
Canonical example	Bank accounts → transactions	Twitter timeline (author tweets fanned into follower inboxes)

How interviewers grade this

Before naming a database, you list the top access patterns with QPS and latency.
For each entity you state: primary key (natural or surrogate), cardinality of each relationship, and growth rate over time.
You distinguish the write model from the read model when the access patterns diverge.
Your denormalisation decisions cite a specific read that would otherwise be a multi-hop join.
You name which fields are immutable vs mutable — mutability drives caching and sharding choices.
You name which fields are indexed and tie each index to an access pattern.
You answer "soft delete or hard delete?" before being asked.

Variants

Relational, normalised (3NF)

3NF + foreign keys; joins at read time.

The default for transactional systems. Each entity gets a table; relationships are foreign keys; reads use JOINs to assemble views.

When it shines:

Strong consistency required (financial, inventory, ACID)
Mixed read patterns — ad-hoc analytics, reporting
Schema is well-understood and stable
Read QPS is low to medium (sub-10k/s on joined paths)

When it breaks:

Read QPS > 10k on a multi-table join
Need to shard joined tables horizontally (cross-shard joins are slow)
Denormalised read views become essential — but you keep adding indexes instead

Tools: Postgres, MySQL, Aurora, Spanner.

Pros

+Strong consistency, ACID
+Ad-hoc queries are easy
+Familiar to every team
+Single source of truth

Cons

−Joins get expensive at scale
−Hard to shard tables that join
−Schema migrations can be painful

Choose this variant when

OLTP transactional workloads
Compliance/financial systems
Mixed read patterns; ad-hoc query needs

Document / aggregate-oriented

Whole object stored together; no joins; embedded sub-entities.

The aggregate is the unit of read and write. A user document contains the user's profile, preferences, and short bounded sub-collections (e.g., shipping addresses).

When it shines:

Access pattern is "fetch one aggregate and render"
Schema evolves per-tenant (no migrations)
Sub-collections are bounded (< ~100 items)
Eventual consistency on cross-aggregate views is acceptable

When it breaks:

Sub-collections grow unbounded (16 MB MongoDB doc limit; 400 KB DynamoDB item)
Cross-document joins become essential (you reinvent JOIN in app code, badly)
Two consumers need different sub-slices of the same doc

Tools: MongoDB, DynamoDB single-table, Cosmos DB, Firestore.

Pros

+One read = one fetch
+Schema flexibility per document
+Natural fit for object models
+Horizontal scaling is easier

Cons

−Cross-doc analytics is awkward
−Bounded sub-collections only
−Eventual consistency on related entities

Choose this variant when

Aggregate is the natural unit of read
Schema varies by tenant
Mobile-style "fetch one screen" patterns

Wide-column / one table per query

Rows are keyed tuples; design the table for the exact query.

Wide-column stores (Cassandra, ScyllaDB, HBase, BigTable) flip data modelling on its head: you design one table per query, denormalised at write time, with the partition key chosen for the query.

The mental model:

A table is a giant sorted map: partition key → cluster key → row
Every query is a partition lookup + range scan within the partition
No joins. No secondary indexes worth using at scale. No ad-hoc queries.

When it shines:

Write-heavy time-series (IoT, telemetry, logs)
Predictable, finite set of access patterns known up front
Need linear horizontal write scaling
Eventual consistency is acceptable

When it breaks:

Access patterns change (you have to rebuild the table)
Need ad-hoc analytics (use a separate OLAP store)
Hot partitions appear (one celebrity user gets 100x the traffic)

Pros

+Linear horizontal write scale
+Predictable single-digit-ms p99
+Great for time-series

Cons

−One table per query — denormalisation everywhere
−No ad-hoc queries
−Hot partition risk

Choose this variant when

Write-heavy time-series
Known, finite query set
Need to scale writes linearly across nodes

CQRS / materialised read model

Normalised write side; denormalised read side; rebuilt from a log.

When write and read patterns diverge sharply, split the model. Writes go to a normalised OLTP store (Postgres). A change-data-capture (CDC) stream projects each write into one or more denormalised read models — search index (Elasticsearch), analytics warehouse (Snowflake), cache (Redis), pre-computed view (Postgres materialised view).

The contract:

Write side: small number of normalised tables, ACID transactions
CDC pipeline (Debezium, Kafka Connect, or app-level outbox)
Read side: one or more shapes, each tuned to its consumer
Eventual consistency between write and read sides (typically sub-second)

When it shines:

Read:write ratio is huge (search, analytics, dashboards)
Multiple consumers want different shapes of the same data
Replay-from-log is needed for new consumers or recovery

When it hurts:

Small teams: operational cost of the pipeline is real
Strict read-your-writes consistency: clients may need to read from the write side

Pros

+Each side optimised for its workload
+New consumers added by adding a new projection
+Replay enables recovery and migration

Cons

−Operational cost of the CDC pipeline
−Eventual consistency on the read side
−Requires logs/events as first-class infra

Choose this variant when

Read patterns diverge sharply from writes
Multiple read consumers (search, analytics, cache)
Team can run a streaming pipeline

Event-sourced

The source of truth is an append-only event log; current state is a projection.

Instead of storing current state directly, store every state-changing event as an immutable record. Current state is a fold over the event log — usually cached in a snapshot for fast reads.

The contract:

Domain events are the source of truth (AccountOpened, MoneyDeposited, MoneyWithdrawn)
Read model is a projection (current balance) computed from events
Append-only — never update or delete
Snapshots periodically materialised for performance

When it shines:

Audit and compliance are first-class (banking, healthcare)
Need temporal queries ("what was the balance on March 15?")
Multiple read projections share the same event log
Recovery / replay / debugging are critical

When it hurts:

Schema evolution is hard (events are immutable forever)
Eventual consistency on read projections
Significantly higher engineering and operational complexity

Pros

+Perfect audit trail
+Temporal queries are natural
+New projections from old events

Cons

−Event schema evolution is hard
−Higher engineering bar
−Eventual consistency

Choose this variant when

Audit/compliance is core requirement
Temporal queries needed
Banking, accounting, healthcare

Worked example

Scenario: Design the data model for an Instagram-style feed (users, posts, likes, comments, follows).

Step 1 — list access patterns

A1: Render a user's home feed (latest posts from accounts they follow). p95 < 100 ms. ~100k QPS.
A2: Render a user profile (their photos, paginated). p95 < 200 ms. ~10k QPS.
A3: Like a post (idempotent per user-post pair). ~50k QPS.
A4: Comment on a post (latest 20 shown inline). ~10k QPS.
A5: Follow / unfollow a user. ~1k QPS.

Step 2 — entities and keys

User: PK user_id (UUIDv7). Fields: username (unique, indexed), display_name, avatar_url, follower_count (denormalised), following_count (denormalised), created_at.
Post: PK post_id (UUIDv7, time-ordered). Fields: user_id (FK, indexed (user_id, created_at DESC) for A2), media_url, caption, like_count (denormalised), comment_count (denormalised), created_at.
Like: composite PK (user_id, post_id). Idempotent — re-like is a no-op.
Comment: PK comment_id (UUIDv7). Fields: post_id (indexed (post_id, created_at DESC) for A4), user_id, body, created_at.
Follow: composite PK (follower_id, followee_id). Indexed both ways.

Step 3 — relationship shapes

User → Posts: 1-to-N, unbounded. REFERENCE — never embed. Paginate via cursor on (created_at, post_id).
Post → Likes: 1-to-N, can be massive. REFERENCE. Like_count denormalised on Post for the render path. Updated async from a like-counter pipeline (e.g., Redis INCR + periodic flush) so a viral post does not slam Postgres.
Post → Comments: 1-to-N, can be many. REFERENCE. Embed only the latest 5 comments into the post document for the inline render (A4 hot path); fetch the rest on demand.
User → Follows: N-to-N, can be huge (millions of followers for celebrities). REFERENCE. Two-way indexed.

Step 4 — write model vs read model for the home feed (A1)

Write model: Post stored once in Postgres, partitioned by user_id.
Read model: home feed served from a per-user "inbox" KV store (Redis or DynamoDB). On each new post by a user with < 1M followers, fan out the post_id into each follower's inbox. For celebrities (>1M followers), do not fan out — instead, do a pull-on-read merge of the celebrity's posts with the inbox. This is the hybrid Twitter timeline pattern.
Consistency: home feed is eventually consistent (within seconds of post). Acceptable trade-off for the latency budget.

Step 5 — soft deletes and audit

Posts: soft delete (deleted_at column) for 30 days, then hard delete. Deleted posts are excluded from feeds via a filter.
Likes / follows: hard delete. No audit need.
User account deletion: soft delete with 14-day grace period (GDPR-compliant), then hard delete cascading to posts.

Step 6 — index sanity check

User: PK user_id, unique index on username, no others (avoid bloat).
Post: PK post_id, index (user_id, created_at DESC) for A2.
Like: PK (user_id, post_id), index (post_id) for "who liked this post".
Comment: PK comment_id, index (post_id, created_at DESC) for A4.
Follow: PK (follower_id, followee_id), index (followee_id) for "who follows this user".

Total indexes: 9. Each tied to an access pattern. No speculative indexes.

That is a complete data model in three minutes that holds up to "now what about celebrities" and "now what about the like counter spam".

Good vs bad answer

Interviewer probe

“Design the data model for a Q&A site (users, questions, answers, votes).”

Weak answer

"I'd use Postgres. Users table with id, email. Questions table with id, user_id, title, body. Answers table with id, question_id, user_id, body. Votes table with id, user_id, target_id, value."

Strong answer

"Top access patterns are (1) render a question page — question + its answers + vote counts, (2) render a user profile — their questions and answers, (3) voting — write-heavy and idempotent per (user, target).

Entities: User (surrogate id, natural email), Question (surrogate id, FK user, indexed by (user_id, created_at) for the profile, (tag, score DESC) for "top by tag"), Answer (FK question, FK user, indexed (question_id, score DESC)), Vote (composite PK (user_id, target_type, target_id), value ±1 — idempotent on PK conflict).

Denormalised score on Question and Answer, updated async from a vote pipeline — the render path cannot afford COUNT-on-read at 10k QPS. I would start in Postgres for everything and move votes to a separate KV store (Redis or DynamoDB) only if vote QPS exceeds 5k/s sustained. Soft delete on questions and answers (30-day grace), hard delete on votes."

Why it wins: Access patterns named first and tied to indexes. Primary keys justified (composite PK on Vote gives idempotency for free). Denormalisation of score is explicit, defended on a concrete read-path latency argument, and has an async update path. Starts simple (one store) with a clear migration trigger. Soft/hard delete addressed.

Interview playbook5–8 min, immediately after API design

When it comes up

Right after API design, when interviewer says "what does your data look like"
When the prompt involves complex relationships (feed, social graph, e-commerce catalog)
During scaling deep-dives (the model determines the storage and the sharding)
When asked "what does one row look like" — they want PK + key fields + indexes
When the read path looks like it needs a join the system cannot afford

Order of reveal

1
Re-state the top 3–5 access patterns. "Before I model entities, the top reads are: A1 (...), A2 (...). The top writes are: W1 (...), W2 (...). Each with rough QPS and latency target."
2
Name the entities. "Entities are: <Entity1>, <Entity2>, <Entity3>. For each I will give you PK, hot-path fields, and relationships."
3
Pick PK type for each. "User PK is UUIDv7 (opaque, time-ordered, no enumeration). Vote PK is composite (user_id, target_id) — gives idempotency for free."
4
Name relationship cardinalities. "User → Posts is 1-to-N unbounded — reference, never embed. Post → Comments is 1-to-N bounded for the hot render — embed latest 5, reference the rest."
5
Call out denormalisation. "like_count denormalised on Post — read path needs it at 100k QPS, COUNT(*) every render is unaffordable. Updated async from a counter pipeline."
6
Split write vs read model if patterns diverge. "Write model is normalised in Postgres. Home feed is a separate per-user inbox in Redis, populated by a fan-out worker on new posts."
7
Name indexes and tie to patterns. "Indexes: Post(user_id, created_at DESC) for profile reads; Comment(post_id, created_at DESC) for thread reads. Each tied to an access pattern."
8
Address soft/hard delete + audit. "Posts soft-deleted with 30-day grace; likes hard-deleted. Audit via an event log if compliance requires it."

Signature phrases

“Queries first, schema second”

“Embed only when bounded and read together”

“Read:write ratio > 10:1 justifies denormalisation”

“Composite PK gives idempotency for free”

“Project to terminal cardinality, not launch day”

“Each index tied to an access pattern”

“Queries first, schema second” — Reverses the wrong-direction approach interviewers see most often.
“Embed only when bounded and read together” — Sharp rule for embed-vs-reference decisions.
“Read:write ratio > 10:1 justifies denormalisation” — Concrete trigger for an otherwise hand-wavy decision.
“Composite PK gives idempotency for free” — Subtle senior insight on Like / Follow modelling.
“Project to terminal cardinality, not launch day” — Catches the "bounded today, unbounded in 2 years" trap.
“Each index tied to an access pattern” — Shows index discipline, not speculative bloat.

Likely follow-ups

?“Why composite PK on Vote / Like / Follow rather than a surrogate id?”Reveal

Three reasons:

1Idempotency for free. A composite PK (user_id, post_id) enforces "one row per (user, post)" in the database. INSERT ... ON CONFLICT DO NOTHING makes "like a post" idempotent at the storage layer. No application-level dedupe needed.

1Lookup is the PK. Every interesting query is "did user X like post Y" or "all likes for post Y" — both served directly by the composite PK and the (post_id) index. A surrogate id adds nothing.

1No wasted space. A surrogate id costs 8 bytes per row × billions of rows = real money. The composite PK is already there.

The only reason to add a surrogate id: if the relationship has independent identity (rare for Like/Follow). For most join entities, composite PK is the right call.

?“When do you embed vs reference a sub-collection?”Reveal

Embed when ALL three are true:

1The sub-collection is bounded at design time (< ~100 items, ideally < 20)
2The sub-collection is always read with the parent on the hot path
3The sub-collection has no independent access pattern (nobody queries it without the parent)

Reference otherwise. Specifically:

Unbounded growth (a user's posts, a thread's messages) → reference. Embedded growth blows up the document size limit (16 MB Mongo, 400 KB Dynamo) and forces re-reads of the whole thing on every change.
Independent access (you sometimes query "all comments by user X across all posts") → reference. Embedded sub-collections are invisible to indexes outside the parent.
Different write rate than the parent (likes change every second; user profile changes monthly) → reference. Embedded means the parent doc gets rewritten on every like, blowing your write QPS.

Hybrid pattern: embed the latest 5 comments on a post for the inline render, reference the full collection for the "view all" page. Best of both.

?“How do you handle a celebrity user with 50M followers in a feed system?”Reveal

The standard fan-out-on-write breaks for celebrities — writing 50M inbox entries per tweet is infeasible. The fix is a hybrid model:

1Regular users (< ~1M followers): fan out on write. New post → write post_id into every follower's inbox.
2Celebrities (> ~1M followers): do NOT fan out. Their posts stay in their own outbox.
3Read path: when serving a user's feed, merge two sources:

- Their inbox (regular fan-out) - The outboxes of all celebrities they follow (pull on read)

1Sort and paginate the merged set in memory (typically capped at 100 candidates per page).

Operational details:

Threshold (1M) is tunable based on read QPS vs write QPS economics
Inboxes are usually a per-user sorted set in Redis, capped at ~1000 entries
Celebrity outboxes are aggressively cached
The threshold is often dynamic: a user above 1M and posting frequently may stay in fan-out if their followers are mostly active

This is the actual Twitter / Instagram pattern. Naming it shows you have read past blog-post depth.

?“You said "denormalised score updated async". What if the user wants their like to show up immediately?”Reveal

Read-your-writes via local update + reconciliation.

The pattern:

1Server writes the like row ((user_id, post_id) composite PK) — synchronous, <10 ms.
2Server returns 201 to the client immediately.
3Client locally increments the displayed like count by 1 — feels instant to the user.
4Async pipeline (Kafka or Redis Streams) eventually updates the denormalised like_count on the Post row — typically within seconds.
5On the next page load, the client reads the canonical denormalised count from the server.

The user perceives an instant update; the server-side counter catches up shortly. Eventual consistency is invisible because the client UI is the local source of truth for the duration of the session.

Edge case: if the user refreshes within the propagation window, they may see a count that does not include their own like yet. Mitigations: (a) cache the user's recent likes client-side and add 1 to the displayed count; (b) maintain a per-user "my recent likes" cache server-side and merge on read.

Code examples

sqlComposite PK gives idempotency for free

CREATE TABLE likes (
  user_id   uuid NOT NULL,
  post_id   uuid NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (user_id, post_id)
);

-- "like this post" — idempotent at the storage layer
INSERT INTO likes (user_id, post_id)
VALUES ($1, $2)
ON CONFLICT (user_id, post_id) DO NOTHING;

-- "all likes for this post" — uses the secondary index
CREATE INDEX likes_by_post ON likes (post_id, created_at DESC);

sqlCursor pagination with a composite sort key

-- A1: latest posts by user, paginated
SELECT post_id, caption, created_at, like_count
FROM posts
WHERE user_id = $1
  AND (created_at, post_id) < ($2_after_created_at, $2_after_post_id)
ORDER BY created_at DESC, post_id DESC
LIMIT 20;

-- The (created_at, post_id) tuple is the cursor.
-- Index: (user_id, created_at DESC, post_id DESC)
-- Constant time per page regardless of how deep you paginate.

jsonAggregate document with bounded embedded latest-comments

// posts/post_abc123  — MongoDB / Firestore aggregate
{
  "_id": "post_abc123",
  "user_id": "user_xyz789",
  "media_url": "https://cdn.../abc.jpg",
  "caption": "Sunset over the bridge",
  "created_at": "2025-04-20T18:30:00Z",
  "like_count": 4287,            // denormalised, async updated
  "comment_count": 312,          // denormalised, async updated
  "latest_comments": [           // BOUNDED — latest 5, for inline render
    { "user_id": "user_a", "body": "🔥", "at": "2025-04-20T18:31Z" },
    { "user_id": "user_b", "body": "wow", "at": "2025-04-20T18:32Z" }
    // ... 3 more
  ]
}
// Full comment list lives in /comments collection, fetched on "view all"

Common mistakes

Schema-first design

You drew a perfect 3NF ERD and then discovered the home screen needs a 5-way join at 50k QPS. Work backwards from the queries — always.

Unbounded collections in a document

"User has many comments — I'll embed them" works until a user has 100k comments. Embed only when the child list is bounded (e.g. <100) and always read with the parent.

Surrogate key when a natural one existsAdvanced

An email address or an ISBN is already unique and already the lookup key. Adding user_id on top creates a double-lookup on every foreign reference. Use natural PK when stable; surrogate when mutable or composite.

Ignoring cardinality growth

A 1-N relationship where N is small today but "posts per user" over five years is unbounded. Model for the terminal cardinality, not the launch-day one.

Mixing write model and read model unconsciouslyAdvanced

The same table serves the transactional write path AND a 100× higher analytics read — and neither path is happy. Name the two models explicitly and split via CDC when the divergence justifies it.

Random UUIDv4 as PK on a hot tableAdvanced

UUIDv4 is random, so insert location is random, so B-tree indexes fragment, so write QPS drops 5–10× as the table grows. Use UUIDv7, ULID, or Snowflake for time-ordered IDs that play nicely with indexes.

Practice drills

You're modelling a URL shortener. Short code is 7 chars, 100M links, ~10k creations/s at peak. What's the PK and why?Reveal

The natural PK is the short code itself — it's already unique, it's the lookup key on every read (GET /:code → long URL), and adding a surrogate int creates a useless second index that has to be maintained on every write.

Uniqueness is enforced at generation time: pre-reserved code ranges per shard (e.g., shard 1 owns codes starting with a–f) eliminates collisions across shards; within a shard, retry on collision (rare with 7 base62 chars = 3.5T space).

Storage layout: (short_code PRIMARY KEY, long_url, created_at, expires_at). With 100M rows and the PK being the lookup key, every read is a single B-tree lookup. No JOIN, no secondary index needed.

Instagram-style photo app. A User has many Photos. Embed or reference?Reveal

Reference. Three reasons:

1Unbounded — a heavy user has 10k–100k photos. That blows past the document size limit (16 MB MongoDB, 400 KB DynamoDB).
2Different write rate — User profile changes monthly; Photos are created daily. Embedding means rewriting the User doc on every photo creation.
3Independent access — Photos are queried in their own right (timeline, search), not just via User.

Model: Photos in their own collection, indexed by (user_id, created_at DESC) for profile pagination. User document stores cached photo_count (denormalised, async-updated) plus latest_photo_id for the profile header thumbnail. Best of both worlds.

Ticketmaster. Seat inventory for one high-demand event. What's the data model constraint that drives storage choice?Reveal

Oversell-prevention on writes. The access pattern is "reserve seat X atomically; fail if someone else got it first."

That demands one of:

Single-row ACID update in a relational store (UPDATE seats SET reserved_by=$1 WHERE seat_id=$2 AND reserved_by IS NULL RETURNING *)
Conditional write in a KV store (DynamoDB ConditionExpression: attribute_not_exists(reserved_by))
Serialised queue of reservation attempts

It rules out eventual-consistency stores for the reservation row itself — if two clients can both succeed and reconciliation happens later, you have oversold the event.

The rest of the system (browsing seats, viewing event details) can be eventually consistent and cached aggressively. The constraint applies only to the reservation write path. This separation — strong consistency where it matters, eventual elsewhere — is the design instinct interviewers reward.

You have a `posts` table. Every page render does `SELECT COUNT(*) FROM likes WHERE post_id = ?`. What's wrong and what's the fix?Reveal

What's wrong: COUNT(*) on the likes table is O(N) per render — 10k likes means 10k row reads per page view. At 100k page views/s with viral posts (1M+ likes), this is multiple millions of row reads per second. Postgres dies.

The fix: denormalise like_count onto the posts row.

Write path: INSERT INTO likes ... ON CONFLICT DO NOTHING returns whether the row was new. If yes, increment like_count on the post (either synchronously with a trigger, or async via a counter pipeline).
Counter pipeline (preferred for hot posts): writes go to a Redis INCR per post; a worker flushes the increments to Postgres every few seconds. Survives 100k+ writes/s on a single post without hammering Postgres.
Read path: SELECT like_count FROM posts WHERE post_id = ? — single row lookup, sub-millisecond.

Trade-off: like_count is eventually consistent (lag of seconds). Acceptable — nobody notices their like is reflected 2 seconds late, and the local UI optimistically increments on click.

When would you choose event sourcing over a normal CRUD model?Reveal

Event sourcing wins when at least two of these are true:

Audit trail is a hard requirement — banking, healthcare, accounting, anything regulated. Replaying events gives perfect history; CRUD requires bolt-on audit tables that drift.
Temporal queries matter — "what was the balance on March 15?" is trivial from events, painful from CRUD.
Multiple read projections of the same data — current balance, monthly statement, fraud signal, search index. Each is a projection over the same event log.
Replay / recovery / debugging is critical — you can rebuild any read model from the events.

Choose CRUD when:

The domain has no audit need (most product features)
The team is small and cannot afford the operational complexity (event store, projections, replay tooling)
The data shape is genuinely simple and won't support multiple read views

The most common trap: choosing event sourcing because it sounds elegant, then drowning in schema-evolution complexity (events are immutable forever — every renamed field is a versioned event). Event sourcing is power that costs ongoing engineering tax. Justify it.

Cheat sheet

•3-step ritual: queries → entities → relationships. Never schema-first.
•For every entity: PK choice, growth rate, mutability, hot-path fields.
•Embed only when the child collection is bounded AND always read together.
•Denormalise when read:write > 10:1 AND the denormalised column is expensive to compute.
•Composite PK on join entities (Like, Follow, Vote) — idempotency for free.
•Time-ordered IDs (UUIDv7, ULID) over UUIDv4 — index-friendly.
•Call out "write model vs read model" the moment they diverge.
•Cardinality drift kills designs — project 3 years out, not launch day.
•Each index must be justified by an access pattern. No speculative bloat.
•Name soft delete vs hard delete per entity. Mention GDPR if user data is in scope.

Practice this skill

These problems exercise Data model design. Try one now to apply what you just learned.

url shortener news feed ticketmaster

7% complete

Current

Read this if

Step 1 of 14

The concept

Jump to next

Axis

Keep normalised

Denormalise

Primary concern

Writes consistent and cheap

Reads fast without joins

Good when

Strong consistency required; writes dominate; data mutates often

Read:write ratio > 10:1; the read path must be one lookup

Cost

Join on every read; risk of N+1; harder to shard

Writes fan out; need an outbox, CDC, or batch rebuild

Update propagation

Immediate via single source of truth

Eventually consistent — staleness window must be acceptable

Operational cost

Low — one schema, one path

High — multiple copies, drift detection, rebuild

Canonical example

Bank accounts → transactions

Twitter timeline (author tweets fanned into follower inboxes)

CREATE TABLE likes ( user_id uuid NOT NULL, post_id uuid NOT NULL, created_at timestamptz NOT NULL DEFAULT now(), PRIMARY KEY (user_id, post_id) ); -- "like this post" — idempotent at the storage layer INSERT INTO likes (user_id, post_id) VALUES ($1, $2) ON CONFLICT (user_id, post_id) DO NOTHING; -- "all likes for this post" — uses the secondary index CREATE INDEX likes_by_post ON likes (post_id, created_at DESC);

-- A1: latest posts by user, paginated SELECT post_id, caption, created_at, like_count FROM posts WHERE user_id = $1 AND (created_at, post_id) < ($2_after_created_at, $2_after_post_id) ORDER BY created_at DESC, post_id DESC LIMIT 20; -- The (created_at, post_id) tuple is the cursor. -- Index: (user_id, created_at DESC, post_id DESC) -- Constant time per page regardless of how deep you paginate.

// posts/post_abc123 — MongoDB / Firestore aggregate { "_id": "post_abc123", "user_id": "user_xyz789", "media_url": "https://cdn.../abc.jpg", "caption": "Sunset over the bridge", "created_at": "2025-04-20T18:30:00Z", "like_count": 4287, // denormalised, async updated "comment_count": 312, // denormalised, async updated "latest_comments": [ // BOUNDED — latest 5, for inline render { "user_id": "user_a", "body": "🔥", "at": "2025-04-20T18:31Z" }, { "user_id": "user_b", "body": "wow", "at": "2025-04-20T18:32Z" } // ... 3 more ] } // Full comment list lives in /comments collection, fetched on "view all"