coredata

Storage choice justification

Picking SQL vs KV vs doc vs blob vs timeseries based on access patterns.

Picking a database is a first-principles decision, not a defaults one. "We use Postgres" is a cultural statement; "the access pattern is point-lookup at 100k QPS with eventual consistency, so we use DynamoDB" is a design.

Read this if your last attempt…

You've said "we'll use Postgres" or "we'll use DynamoDB" without justifying it
You can't explain when Cassandra wins over Postgres
You were asked "why not SQL?" and didn't have a crisp answer
You're confused about when S3 is the right answer vs a database BLOB column
You default to polyglot persistence without a migration story

The concept

When an interviewer asks "which database?", they aren't asking for your favourite store. They're asking: do you understand the shape of your access pattern well enough to match it to a storage shape?

The candidate who names the access pattern first and the store second is doing design. The candidate who names the store first and justifies it after is signalling cultural preference. The direction of reasoning is the entire signal.

Architecture diagram· Access pattern → storage shape

Match the shape of your reads + writes to the shape the storage is optimised for. There is no universal default.

Which store for which access pattern

Shape	Best at	Latency (p50)	Horizontal scale	Consistency	Wrong answer when	Typical tools
SQL	Multi-row tx, joins, ad-hoc queries	1–10 ms	Replicas easy; writes need sharding	Strong by default	>10k/s on joined queries with unbounded growth	Postgres, MySQL, Aurora
KV / wide-col	Linear scale on PK lookup	<5 ms p50	Native (designed for it)	Eventual, tunable	Ad-hoc queries; cross-PK transactions	DynamoDB, Cassandra, Bigtable
Document	Read one aggregate by id	2–10 ms	Native via collection sharding	Tunable per write	Deeply relational domain	MongoDB, Firestore, Cosmos
Object / blob	Large opaque bytes (>100 KB)	30–100 ms	Native (global namespace)	Read-after-write strong	Per-byte random access; sub-100 ms required	S3, GCS, Azure Blob
Search	Ranked free-text queries, facets	10–50 ms	Native via index sharding	Near-real-time (seconds)	Source-of-truth for transactional data	Elasticsearch, OpenSearch
Time-series / columnar	Windowed aggregates over time	10–100 ms	Native via time partitioning	Eventual	Point lookups by non-time key	ClickHouse, InfluxDB, Druid
Cache	Sub-ms hot-read acceleration	<1 ms	Native via key partitioning	Best-effort (in-memory)	Source of truth if loss unacceptable	Redis, Memcached
Graph	Multi-hop relationship traversal	5–50 ms	Hard — most still vertical	Strong or tunable	Shallow queries dominated by filters	Neo4j, Neptune, JanusGraph

Start in SQL for anything ambiguous. Migrate when you have a named trigger.
Managed wins at small team size; self-managed wins when markup > ops cost.

How interviewers grade this

You name the access pattern before the store — "point lookup by user_id", not "we need a database".
You cite one concrete property that the chosen store gives you and no other store in the tier does.
You name the failure mode of the chosen store (e.g., "DynamoDB gives no cross-item transactions, so I handle that with an idempotent write + saga").
You know the difference between a BLOB column and an object store, and you pick the object store above ~100 KB.
You can defend "start with Postgres" as a position, with a named migration trigger.
You name managed vs self-managed explicitly and justify against team size.
You identify when search is an index and never the source of truth.

Variants

Relational / SQL — the right default

ACID transactions, JOINs, mature tooling. Postgres until proven otherwise.

The default for 80% of system design prompts. Postgres at a well-tuned instance handles 10k+ QPS with JOINs, transactions, and a query planner that handles queries nobody anticipated.

What SQL gives you that nothing else does:

Multi-row ACID transactions (booking systems, financial ledgers)
FOREIGN KEY integrity enforced at the DB
JOINs across normalised tables
A decade of debugging tools (psql, pganalyze, EXPLAIN)
Mature replication (physical, logical, CDC via WAL)

Scaling path: read replicas first, connection pooling, query optimisation, partitioning, finally horizontal sharding (Citus, Vitess) or migration to a sharded SQL-compatible (Spanner, CockroachDB) / moving hot tables to KV.

Named migration triggers: sustained write QPS > 20k/s on a single table; dataset > 5 TB; the workload has fractured into multiple independent access patterns.

Pros

+ACID transactions and JOINs
+Mature ops and debugging
+Preserves optionality for future decisions
+Familiar to every backend engineer

Cons

−Horizontal scaling is hard (sharding or migration)
−Large blobs hurt (use object store)
−Per-shard hot tables can become bottlenecks

Choose this variant when

Default for most prompts
Multi-row transactions required
Ad-hoc query and reporting needs
Team has no distributed-systems ops bench

KV / wide-column — linear scale, sharp trade-offs

Single-partition point lookup at unbounded scale. DynamoDB, Cassandra.

The choice when one access pattern dominates, QPS is extreme, and you will trade JOINs for linear horizontal scale.

What you get:

Single-digit-ms p99 at any scale (if modelled correctly)
Linear horizontal scale — add nodes, throughput goes up proportionally
DynamoDB: zero ops, pay-per-request, serverless
Cassandra / ScyllaDB: cheaper at extreme scale, need an ops team

What you give up:

JOINs (you denormalise into one-table-per-query)
Cross-partition transactions (DynamoDB now has TransactWriteItems but it is limited and slow)
Ad-hoc queries (the schema is the access pattern)
Secondary indexes are expensive or limited

Hot partition trap: if one partition key gets 10× the traffic (celebrity user, hot product), the whole shard is throttled. Design keys for uniform distribution.

Pros

+Linear horizontal scale
+Single-digit-ms p99 at any size
+Predictable cost per operation
+Zero-ops option (DynamoDB)

Cons

−No JOINs — denormalise everywhere
−No ad-hoc queries
−Hot partitions are a real failure mode
−One table per query mindset is a cultural shift

Choose this variant when

Dominant access pattern is PK lookup
QPS > 20k on a single table
No JOIN or cross-row transaction requirement
Need linear write scaling

Document / aggregate — flexible schema, bounded sub-collections

MongoDB / Firestore for aggregate-per-read, schema-per-tenant.

Document stores optimise for reading and writing a single aggregate document. The schema flexes per document, sub-collections are embedded up to a bound, and cross-document queries are possible but awkward.

Sweet spot:

Access pattern: fetch one aggregate, render it
Schema varies per tenant (e.g., SaaS where each customer has custom fields)
Sub-collections are bounded and always read with the parent
Eventual consistency across aggregates is acceptable

Traps:

Sub-collections grow unbounded → hit 16 MB MongoDB doc limit or 400 KB DynamoDB item
Cross-document JOINs reinvent SQL in application code, badly
Schema-less is not schema-less, it is schema-on-read — the schema exists, just in application code

Note: MongoDB 4+ has multi-doc transactions and JOINs (via $lookup). Both work. Neither is fast.

Pros

+Schema flexibility per document
+One read per aggregate
+Horizontal scaling baked in
+Natural fit for object models

Cons

−Cross-document analytics is awkward
−Doc size limits bite at scale
−Transactions are slow (MongoDB) or limited (Firestore)

Choose this variant when

Aggregate-per-read access pattern
Multi-tenant with custom schema per tenant
Bounded sub-collections
Rapid prototyping where schema is fluid

Object / blob store — large bytes, cheap and durable

S3 / GCS for anything > ~100 KB served via CDN.

Object stores are where large opaque bytes live. They are not databases — you cannot query the contents — but at cents per GB-month with 11 nines of durability and native CDN integration, they are the only right answer for large files.

The canonical pattern:

1Client uploads to S3 via pre-signed URL (no traffic through your app)
2S3 triggers a notification (Lambda, EventBridge) on upload complete
3Your app writes the metadata row (s3_key, size, mime, uploader, created_at) in SQL
4Read path: DB gives URL → CloudFront / CloudFlare serves the bytes

When NOT an object store:

Bytes < 10 KB (just put in DB; row-level cache works)
Random access to middle of the file at low latency (use a block store or specialised format)
Strong consistency across many keys (S3 is strong for single-key read-after-write; multi-key is best-effort)

Pros

+$0.02/GB/month — cheapest storage tier
+11 nines of durability
+CDN integration is trivial
+No ops; scales forever

Cons

−Not queryable
−30–100 ms first-byte latency (CDN mitigates)
−Cost per request adds up at high QPS
−No partial updates — rewrite the whole object

Choose this variant when

File size > 100 KB
Serving static assets, images, videos
Archival / backup / audit logs
ML model artefacts or dataset files

Search / inverted index — ranked full-text

Elasticsearch for search; NEVER the source of truth.

Inverted indexes make "find all documents matching free text, ranked by relevance" fast and feature-rich (facets, fuzzy match, boosting, multilingual analysis). They are an index, not a database.

The critical mental model:

Source of truth lives in SQL or a document store
Dual-write or CDC (Debezium, Kafka Connect) into Elasticsearch
When the index is corrupt or schema evolves, rebuild from the primary
NEVER let application code write only to ES

Failure modes candidates miss:

Using ES as the primary = data loss risk (ES has had silent data-loss bugs historically)
Not having a rebuild path = index divergence with no recovery
Hot query patterns (expensive aggregations) = cluster-wide slowdown, not just one shard

When to reach for it: product search, log search, document search, faceted filters on complex criteria, relevance ranking beyond what SQL LIKE can do.

Pros

+Ranked free-text with facets
+Fuzzy matching, multilingual analysis
+Near-real-time index updates
+Scales via sharding

Cons

−Not a primary store — dual-write needed
−Expensive aggregations can take down a cluster
−Ops-intensive (JVM, heap tuning)
−Consistency lag of seconds

Choose this variant when

Product / content / log search
Faceted filtering on complex criteria
Ranked relevance beyond SQL LIKE
Multilingual or fuzzy text matching

Time-series / columnar — append + aggregate

ClickHouse, InfluxDB, Druid for metrics and telemetry at scale.

When the workload is time-ordered appends with queries that aggregate over time windows, time-series / columnar stores destroy SQL on both axes. ClickHouse can ingest a million rows per second per node and run a GROUP BY region, minute over a billion rows in seconds.

Shape:

Data is time-stamped (metrics, events, logs, ticks)
Writes are append-only (updates to historical data are rare / impossible)
Reads are aggregations over time windows ("sum requests per region per 5-minute window")
Point lookup by non-time key is terrible

Architecture pattern:

Ingest via Kafka → ClickHouse / Druid via a consumer
Retention policy (downsample old data, drop after TTL)
Query layer in front (Grafana for ops, custom dashboards for product)

Adjacent category: OLAP data warehouses (BigQuery, Snowflake, Redshift) — same columnar principles, different operational model (loaded in batch, queried ad-hoc, pay-per-query).

Pros

+Millions of appends/s per node
+Aggregations over billions of rows in seconds
+Compression is extreme (10×+ for time-series)
+Downsampling / retention native

Cons

−Point lookup by non-time key is awful
−Updates are painful or impossible
−Adds a distinct ops surface
−Overkill below 100k events/s

Choose this variant when

Metrics, telemetry, observability
Ad events, financial ticks
Log aggregation at scale
Workload is > 100k events/s sustained

Worked example

Scenario: Pick storage for an Uber-like ride-sharing backend.

Sub-domain 1 — Rider / Driver profiles

Access pattern: CRUD on user records; JOINs to account settings, payment methods.
Volume: 100M users, < 1k QPS writes, < 10k QPS reads.
Pick: Postgres. Standard OLTP. Nothing weird. ACID matters for payment-method updates.

Sub-domain 2 — Driver location (live)

Access pattern: drivers post location every 3 s; riders query "nearest drivers to me" every 2 s.
Volume: 1M active drivers × 0.33 writes/s = 330k writes/s. Geo query at 50k/s.
Pick: Redis with geospatial commands (GEOADD, GEORADIUS) or a specialised geo store (Tile38). In-memory is fine because 3-second TTL is acceptable — if Redis crashes, drivers will re-report within 3 s and inventory rebuilds.
Trade-off: no durability on driver location. Acceptable because the data is ephemeral.

Sub-domain 3 — Ride history / receipts

Access pattern: append a ride on completion; read a rider's last 50 rides; analytics queries on revenue per city per day.
Volume: 10M rides/day writes, 100M rides/year. Read QPS modest.
Pick (primary): Postgres partitioned by month, indexed by (rider_id, completed_at DESC). Covers CRUD + rider-history access patterns.
Pick (analytics): ClickHouse loaded from Postgres via CDC, for the revenue-per-city-per-day aggregations. SQL can handle the OLAP workload at this scale, but ClickHouse runs the queries in 100× less time and gives the analytics team a separate blast radius.

Sub-domain 4 — Trip photos / receipts as PDFs

Access pattern: upload on trip complete, fetch on demand, archive after 90 days.
Volume: 2M uploads/day, ~500 KB average.
Pick: S3 with CloudFront in front. DB holds only the S3 key + mime + size. Move to Glacier after 90 days for cost.

Sub-domain 5 — Support ticket search

Access pattern: support agents search "find all tickets mentioning X in the last 30 days".
Volume: 100k tickets/day written, low read QPS but complex queries.
Pick: Postgres as source of truth. Elasticsearch as the search index, populated via CDC from Postgres. Agents query ES; the ticket record lives in Postgres.

Sub-domain 6 — Surge pricing state

Access pattern: geohash → current multiplier. Written by a pricing service every 30 s; read by the booking flow at 50k QPS.
Volume: 100k geohashes × hot read.
Pick: Redis. Hot read path with sub-ms latency, data is derived and safely re-computable on crash.

Summary of stores used: Postgres (primary transactional), Redis (live driver locations + surge), S3 (photos / PDFs), ClickHouse (analytics), Elasticsearch (support search). Five stores for five sharply different access patterns — justified each time, not polyglot for its own sake.

Good vs bad answer

Interviewer probe

“Which store would you use for the core data of a chat application (messages, channels, members)?”

Weak answer

"I'd use MongoDB because it's flexible and scales well."

Strong answer

"Three distinct access patterns, three shapes.

• Messages are write-heavy, append-only, and read in time order within a channel — that's a wide-column fit. I'd use Cassandra partitioned on channel_id, clustered on (timestamp, message_id). One partition = one channel's history. Read is a single partition scan from the newest N messages. Scales linearly on writes across channels.

• Channel metadata (name, members, settings) is low-volume and read-heavy — Postgres is fine; ACID matters when someone leaves a channel or a settings write has to be atomic with an audit log entry.

• Membership lookup on the hot path (is user X in channel Y?) hits a KV store — Redis or DynamoDB — keyed on user_id → channel_set. That lets the message-send path check membership in <1 ms without hitting Postgres on every message.

The trade-off I'm buying: message history is eventually consistent across replicas; a reader may see messages out of order for a ~100 ms window. That's acceptable for chat UX and is handled by client-side message-id ordering."

Why it wins: Separated three access patterns before naming three stores. Each store is defended against the specific property it provides. The consistency trade-off is named explicitly and justified against the UX. Shows maturity by knowing when to combine stores rather than force one shape to fit everything.

Interview playbook3–8 min, either as part of "pick a database" or during a scaling deep-dive

When it comes up

When the interviewer asks "which database" after you've described the data model
In scaling deep-dives when the single-store answer breaks
When the prompt involves search, media, or analytics — each is a distinct shape
During trade-off discussions: "why not SQL? why not DynamoDB?"
When asked about "polyglot persistence" — the answer is "only when justified"

Order of reveal

1
Restate the access pattern. "Before I name a store, the dominant access pattern is X at Y QPS with Z latency target."
2
Name the shape. "That shape is SQL / KV / document / object / search / time-series because..."
3
Pick the specific tool. "Specifically Postgres / DynamoDB / Cassandra / S3 / Elasticsearch / ClickHouse — and here is why this one over the others in the tier."
4
Name the single property. "What I'm buying from this choice that nothing else gives me: <ACID / linear scale / schema flex / cheap bytes / ranked search / time-window aggs>."
5
Name the trade-off. "What I'm giving up: <JOINs / ad-hoc queries / durability / sub-ms latency / strong consistency>. That's acceptable because <reason>."
6
Separate sub-domains if needed. "This sub-system wants a different shape than the main path. For the search path I'd layer Elasticsearch on top; for the media path, S3."
7
Name the migration trigger. "I'd start with one store. I'd migrate/add when <specific QPS threshold, dataset size, or access-pattern divergence>."
8
Managed vs self-managed. "For a small team I'd go managed (RDS / DynamoDB / Atlas). At scale X I'd revisit self-managed to save cost."

Signature phrases

“Access pattern first, store second”

“The property this store gives me that others do not”

“Bytes above 100 KB go to the object store”

“Elasticsearch is an index, not a primary store”

“Start in Postgres, migrate when trigger X fires”

“Eventual consistency is fine when the UX tolerates a 100 ms window”

“Access pattern first, store second” — Reverses the wrong-direction reasoning the interviewer sees most often.
“The property this store gives me that others do not” — Forces a real defence instead of a cultural preference.
“Bytes above 100 KB go to the object store” — Sharp heuristic that shows you have dealt with BLOB pain.
“Elasticsearch is an index, not a primary store” — Catches the data-loss trap candidates routinely walk into.
“Start in Postgres, migrate when trigger X fires” — Demonstrates storage maturity and avoids premature polyglot.
“Eventual consistency is fine when the UX tolerates a 100 ms window” — Concrete trade-off rather than hand-wavy "it's fine".

Likely follow-ups

?“Why not just use Postgres for everything?”Reveal

You can — for most prompts, you should. Postgres is the right default. The migration triggers are:

1Sustained write QPS > 20k/s on a single table — Postgres will struggle; the hot table becomes a bottleneck even with sharding.
2Dataset > 5 TB with mixed access patterns — operational weight (backups, dumps, index rebuilds) becomes prohibitive.
3Workload has fractured into distinct access patterns — a single store optimised for all of them ends up optimal for none.
4Media / large blobs — always to object storage, never in a BLOB column.
5Full-text search beyond `LIKE` — layer Elasticsearch.
6OLAP analytics on hot OLTP data — CDC into a columnar store (ClickHouse, BigQuery, Snowflake).

Before any of those triggers fire, "just Postgres" is the correct design. The senior move is to name the triggers explicitly rather than reach for polyglot on day one.

?“When would you pick DynamoDB over Cassandra (or vice versa)?”Reveal

Both are KV / wide-column with similar data models. The choice is operational:

Pick DynamoDB when:

You want zero-ops and serverless pricing
Your team has no distributed-systems ops bench
You're on AWS already
Traffic is bursty (pay-per-request scales to zero)

Pick Cassandra / ScyllaDB when:

You're at extreme scale where DynamoDB cost gets painful (>$50k/month is often the crossover)
You need multi-cloud or on-prem
You have an ops team comfortable with JVM / distributed-systems work
You need features DynamoDB doesn't have (materialised views, user-defined types, tunable consistency per read)

ScyllaDB specifically when Cassandra's tail latencies (JVM GC) hurt — ScyllaDB is Cassandra rewritten in C++ and routinely ships 5–10× better p99.

Default for a new AWS team: DynamoDB. Default for a mature team at scale: Cassandra / ScyllaDB. The right answer depends more on your org than your access pattern.

?“Where does Redis fit — cache or primary store?”Reveal

Primarily a cache, but legitimately a primary store in three specific cases:

Cache (the default use):

In-memory hot-read layer in front of a primary store
Sub-ms latency
Data can be rebuilt from the primary on loss

Primary store, legitimately when:

1Ephemeral data — session tokens, rate-limit counters, WebSocket connection state. Loss on restart is acceptable.
2Hot volatile data — live scores, current driver locations, surge pricing. Data is derived and re-computable.
3Specialised data structures — leaderboards (ZSETs), feed inboxes (LISTs), real-time analytics (HyperLogLog, TopK). SQL would be orders of magnitude slower.

Durability: RDB snapshots + AOF append-only log give Redis persistence that is "good enough for most things, not good enough for money." Production Redis runs with AOF on every-second-fsync and replication. Do not use it for anything where a 1-second data loss is catastrophic.

?“You chose object storage (S3) for user-uploaded images. How does the client upload without going through your app servers?”Reveal

Pre-signed URLs. The pattern:

1Client calls your API: POST /uploads with metadata (filename, mime, size).
2Server validates, generates a pre-signed S3 PUT URL (valid for 15 min) with constraints (content-type, size limit).
3Server returns the URL + the eventual asset ID to the client.
4Client PUTs the file bytes directly to S3 using that URL. Your app servers never see the bytes.
5S3 triggers an ObjectCreated event → Lambda / EventBridge → your app updates the asset row with status: ready.

Why this matters:

Your app servers don't bottleneck on upload bandwidth
Cost drops 10× (no egress through your compute)
Scale is trivial (S3 handles 10k+ PUTs/s per prefix natively)
Security: the pre-signed URL is scoped to one object, one operation, one short window

Gotchas:

CORS on the S3 bucket must allow browser uploads
The client needs to handle retries (S3 rejects malformed uploads)
You need the "file uploaded but metadata never updated" reconciliation path (periodic sweeper, or handle it in the S3 event)

Code examples

sqlPostgres handling "multi-row ACID transaction" — the canonical SQL superpower

BEGIN;

-- Reserve the seat atomically; fail if already taken
UPDATE seats
SET reserved_by = $user_id, reserved_at = now()
WHERE seat_id = $seat_id AND reserved_by IS NULL
RETURNING seat_id;

-- Insert the booking record (fails if no seat was reserved)
INSERT INTO bookings (booking_id, user_id, seat_id, amount_cents)
VALUES ($booking_id, $user_id, $seat_id, $amount)
ON CONFLICT DO NOTHING;

COMMIT;

-- If UPDATE returned no rows, the seat was taken by someone else.
-- If INSERT conflicted, client retried with same booking_id (idempotent).
-- Both guaranteed atomic — you cannot end up with a booking without a reservation.

javascriptDynamoDB single-table design — one table, multiple access patterns

// One table: PK = "USER#<id>" / SK = "PROFILE" | "ORDER#<ts>" | "ADDR#<id>"
// Lets you fetch a user profile + recent orders + addresses in ONE query.

await ddb.query({
  TableName: 'AppData',
  KeyConditionExpression: 'PK = :pk AND begins_with(SK, :prefix)',
  ExpressionAttributeValues: {
    ':pk': `USER#${userId}`,
    ':prefix': '',  // all items for this user
  },
  ScanIndexForward: false,
  Limit: 25,
});

// Returns: 1 profile + latest 24 orders + all addresses, sorted by SK DESC.
// Single round-trip. No JOIN. Linear scale per partition.

javascriptPre-signed S3 upload URL — client uploads directly, server never sees bytes

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

async function createUploadUrl(userId: string, mime: string) {
  const key = `uploads/${userId}/${crypto.randomUUID()}`;
  const cmd = new PutObjectCommand({
    Bucket: 'app-uploads',
    Key: key,
    ContentType: mime,
    ContentLength: 10_000_000,  // 10 MB cap
  });

  const url = await getSignedUrl(s3, cmd, { expiresIn: 900 });

  await db.assets.insert({
    asset_id: key,
    owner_id: userId,
    status: 'pending',
    mime,
  });

  return { upload_url: url, asset_id: key };
}

// Client PUTs directly to upload_url.
// S3 event -> Lambda -> db.assets.update({ asset_id }, { status: 'ready' })

sqlClickHouse — append + aggregate over a billion rows in seconds

-- Events table, partitioned by day, ordered by (region, event_at)
CREATE TABLE events (
  event_at  DateTime,
  region    LowCardinality(String),
  user_id   UInt64,
  event_kind String,
  value_ms  Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_at)
ORDER BY (region, event_at);

-- "Average latency per region per 5-min window over the last hour"
SELECT
  region,
  toStartOfInterval(event_at, INTERVAL 5 MINUTE) AS bucket,
  avg(value_ms) AS p_avg,
  quantile(0.99)(value_ms) AS p99
FROM events
WHERE event_at >= now() - INTERVAL 1 HOUR
  AND event_kind = 'api_request'
GROUP BY region, bucket
ORDER BY region, bucket;

-- Runs in 200 ms over a billion rows. SQL would take minutes.

Common mistakes

"We chose X because it scales"

"Scales" is not a property. Scales-in-what-dimension matters. Cassandra scales writes linearly but gives you almost nothing on ad-hoc reads. Name the dimension.

Storing a 50 MB file in a database BLOB column

You've just made your backups 50× bigger, your rows un-cacheable, your replication bandwidth spike, and your dump/restore a multi-hour operation. The rule: > ~100 KB → object store, DB holds only URL + metadata.

Using the search index as the source of truth

Elasticsearch is a great index and a mediocre database. Dual-write from your primary store into ES; never the reverse. When the index is corrupt you rebuild it from the primary. Treating ES as a database is how teams end up with data loss.

Polyglot persistence without an owner or planAdvanced

Two teams end up with three databases and no coherent cross-store consistency story. Introduce new stores deliberately, with a named owner, a dual-write / CDC plan, and an understanding of failure modes.

Treating "eventually consistent" as equivalent to "sometimes wrong"Advanced

Eventual consistency has a bounded convergence time (usually milliseconds). The correct question is "is the UX OK if a user sees a 100 ms stale read?" — for most read paths, yes. For money and inventory, no.

Ignoring the managed-vs-self-managed axis

Small teams underestimate the ops cost of running Cassandra / Kafka / Elasticsearch themselves. The managed markup (DynamoDB vs Cassandra, Confluent vs self-Kafka) is often cheaper than one ops engineer.

Practice drills

A URL shortener at 100M links, 20k redirects/s. Which store and why?Reveal

KV store (DynamoDB or Redis-backed).

The access pattern is pure point-lookup by short code at high QPS with no JOINs and no transactions. A KV store gives you <5 ms p50 and linear scale.

Specifically DynamoDB if managed-first: short_code as PK, long_url + expires_at + owner_id as attributes. On-demand pricing handles the burst pattern of viral links. Global tables give you multi-region reads sub-10 ms.

Or Redis if the dataset fits in memory: 100M × 200 bytes = 20 GB, fits on a single m5.2xlarge. Replication + AOF gives you durability. 100× cheaper than DynamoDB at this scale.

Postgres works until the cache tier fails and the DB eats 20k/s on a hot SELECT — the KV store doesn't care.

A hotel booking system that must prevent double-booking a room. Which store?Reveal

SQL — Postgres or MySQL with SERIALIZABLE isolation on the reservation.

The access pattern requires multi-row ACID transactions: reserve the room row AND write the booking row atomically, or fail. Eventual consistency here means two guests in one room.

Implementation:

sql

BEGIN;
UPDATE rooms
SET reserved_by = $user, reserved_at = now()
WHERE room_id = $room AND reserved_by IS NULL
RETURNING *;

-- If zero rows returned, someone else got it. Rollback.
INSERT INTO bookings (user_id, room_id, start_date, end_date)
VALUES ($user, $room, $start, $end);
COMMIT;

DynamoDB's conditional writes + TransactWriteItems could also work but are slower and have item-count limits. Postgres is the natural answer.

A photo-sharing app storing originals + thumbnails + EXIF. Where does each piece live?Reveal

Originals and thumbnails → S3 (large opaque bytes, CDN-served). EXIF + photo metadata (user_id, timestamp, caption, album_id) → SQL or document store.

The architecture:

Client uploads original to S3 via pre-signed URL
S3 event triggers a thumbnail generator (Lambda)
Thumbnails (3 sizes) stored as photos/<id>/thumb_{sm,md,lg}.jpg
Metadata row in Postgres: (photo_id, user_id, album_id, taken_at, s3_key, width, height, exif_blob, created_at)
CDN (CloudFront) in front of S3 for reads

Why this split:

S3: $0.02/GB/month, CDN-friendly, infinite scale
Postgres: queryable (find photos by user, by date, by album), indexable, tiny rows (~1 KB)
BLOB-in-database would 10× your storage cost, kill replication bandwidth, and make thumbnails un-cacheable

This is the standard media architecture; worth memorising as a reflex answer.

A ride-sharing app needs to answer "find the 10 nearest drivers to this rider" at 50k QPS. Which store?Reveal

Redis with geospatial commands (GEOADD, GEORADIUS) — or a specialised geo-store (Tile38, Elasticsearch with geo-point fields).

Why Redis:

In-memory = sub-ms geo-query latency
GEORADIUS computes "drivers within X km of (lat, lng)" natively
Driver locations update every 3 s → TTL handles stale data for free
If Redis crashes, drivers re-report within 3 s and inventory rebuilds — durability doesn't matter

Sharding: by geohash prefix across multiple Redis shards. Each shard handles one region; the "nearest drivers" query hits 1–4 shards max.

Why NOT SQL: PostGIS can do geo-queries, but not at 50k QPS on churning data. Writes at 330k/s (1M drivers × 0.33 Hz) would require aggressive sharding, and the 3-s churn invalidates B-tree indexes constantly.

The pattern: hot derived state → Redis; source of truth (ride history, driver profile) → Postgres.

You have 500M user events per day. Product wants "what's the conversion rate by funnel step by country by day for the last 90 days?" Which store, and how do you get data into it?Reveal

ClickHouse (or BigQuery / Snowflake / Druid if on that stack). Columnar time-series store.

Why columnar: the query is a GROUP BY country, funnel_step, day aggregation over 45 billion rows. Columnar compression shrinks that to ~1–5 TB on disk; the aggregation reads only the relevant columns. ClickHouse runs this in seconds.

Ingestion pipeline:

1Product emits events to Kafka (events.raw, partitioned by country)
2A Kafka consumer writes batches (every 5 s) into ClickHouse via native protocol
3ClickHouse MergeTree engine partitioned by toYYYYMMDD(event_at), ordered by (country, funnel_step, event_at)
4Retention: drop partitions > 90 days via ALTER TABLE ... DROP PARTITION

Why NOT Postgres: 500M/day = 5800 writes/s sustained, manageable, but the aggregation query over 45B rows would take hours even with indexes. Postgres is OLTP; this is OLAP.

Why NOT a generic document store: no columnar compression, no vectorised query engine. Ingest is fine; query is the blocker.

Serving layer: pre-compute the top dashboards (hourly cron) into a small Postgres table for the main product analytics UI. ClickHouse handles ad-hoc queries from the analytics team.

Cheat sheet

•Access pattern before storage. Never the reverse.
•Name the one property the chosen store gives you that others don't.
•BLOBs > ~100 KB → object store; keep only URL in your DB.
•Search index is an index, not a primary store. Source of truth lives elsewhere.
•Start simple (one SQL store) if unsure; name the migration trigger.
•"Scales" is not a decision — scales in what dimension, at what cost?
•Pre-signed URLs for client uploads — bytes bypass your app servers.
•Managed for small teams; self-managed when markup > ops cost.
•Redis is primarily a cache. Primary store only for ephemeral or volatile data.
•Time-series workloads ≥ 100k events/s → columnar (ClickHouse, Druid), not SQL.

Practice this skill

These problems exercise Storage choice justification. Try one now to apply what you just learned.

url shortener news feed dropbox youtube

7% complete

Current

Read this if

Step 1 of 14

The concept

Jump to next

Shape

Best at

Latency (p50)

Horizontal scale

Consistency

Wrong answer when

Typical tools

SQL

Multi-row tx, joins, ad-hoc queries

1–10 ms

Replicas easy; writes need sharding

Strong by default

>10k/s on joined queries with unbounded growth

Postgres, MySQL, Aurora

KV / wide-col

Linear scale on PK lookup

<5 ms p50

Native (designed for it)

Eventual, tunable

Ad-hoc queries; cross-PK transactions

DynamoDB, Cassandra, Bigtable

Document

Read one aggregate by id

2–10 ms

Native via collection sharding

Tunable per write

Deeply relational domain

MongoDB, Firestore, Cosmos

Object / blob

Large opaque bytes (>100 KB)

30–100 ms

Native (global namespace)

Read-after-write strong

Per-byte random access; sub-100 ms required

S3, GCS, Azure Blob

Ranked free-text queries, facets

10–50 ms

Native via index sharding

Near-real-time (seconds)

Source-of-truth for transactional data

Elasticsearch, OpenSearch

Time-series / columnar

Windowed aggregates over time

10–100 ms

Native via time partitioning

Eventual

Point lookups by non-time key

ClickHouse, InfluxDB, Druid

Cache

Sub-ms hot-read acceleration

<1 ms

Native via key partitioning

Best-effort (in-memory)

Source of truth if loss unacceptable

Redis, Memcached

Graph

Multi-hop relationship traversal

5–50 ms

Hard — most still vertical

Strong or tunable

Shallow queries dominated by filters

Neo4j, Neptune, JanusGraph

BEGIN; -- Reserve the seat atomically; fail if already taken UPDATE seats SET reserved_by = $user_id, reserved_at = now() WHERE seat_id = $seat_id AND reserved_by IS NULL RETURNING seat_id; -- Insert the booking record (fails if no seat was reserved) INSERT INTO bookings (booking_id, user_id, seat_id, amount_cents) VALUES ($booking_id, $user_id, $seat_id, $amount) ON CONFLICT DO NOTHING; COMMIT; -- If UPDATE returned no rows, the seat was taken by someone else. -- If INSERT conflicted, client retried with same booking_id (idempotent). -- Both guaranteed atomic — you cannot end up with a booking without a reservation.

// One table: PK = "USER#<id>" / SK = "PROFILE" | "ORDER#<ts>" | "ADDR#<id>" // Lets you fetch a user profile + recent orders + addresses in ONE query. await ddb.query({ TableName: 'AppData', KeyConditionExpression: 'PK = :pk AND begins_with(SK, :prefix)', ExpressionAttributeValues: { ':pk': `USER#${userId}`, ':prefix': '', // all items for this user }, ScanIndexForward: false, Limit: 25, }); // Returns: 1 profile + latest 24 orders + all addresses, sorted by SK DESC. // Single round-trip. No JOIN. Linear scale per partition.

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'; import { getSignedUrl } from '@aws-sdk/s3-request-presigner'; async function createUploadUrl(userId: string, mime: string) { const key = `uploads/${userId}/${crypto.randomUUID()}`; const cmd = new PutObjectCommand({ Bucket: 'app-uploads', Key: key, ContentType: mime, ContentLength: 10_000_000, // 10 MB cap }); const url = await getSignedUrl(s3, cmd, { expiresIn: 900 }); await db.assets.insert({ asset_id: key, owner_id: userId, status: 'pending', mime, }); return { upload_url: url, asset_id: key }; } // Client PUTs directly to upload_url. // S3 event -> Lambda -> db.assets.update({ asset_id }, { status: 'ready' })

-- Events table, partitioned by day, ordered by (region, event_at) CREATE TABLE events ( event_at DateTime, region LowCardinality(String), user_id UInt64, event_kind String, value_ms Float64 ) ENGINE = MergeTree PARTITION BY toYYYYMMDD(event_at) ORDER BY (region, event_at); -- "Average latency per region per 5-min window over the last hour" SELECT region, toStartOfInterval(event_at, INTERVAL 5 MINUTE) AS bucket, avg(value_ms) AS p_avg, quantile(0.99)(value_ms) AS p99 FROM events WHERE event_at >= now() - INTERVAL 1 HOUR AND event_kind = 'api_request' GROUP BY region, bucket ORDER BY region, bucket; -- Runs in 200 ms over a billion rows. SQL would take minutes.

BEGIN; UPDATE rooms SET reserved_by = $user, reserved_at = now() WHERE room_id = $room AND reserved_by IS NULL RETURNING *; -- If zero rows returned, someone else got it. Rollback. INSERT INTO bookings (user_id, room_id, start_date, end_date) VALUES ($user, $room, $start, $end); COMMIT;