Storage choice justification
Picking SQL vs KV vs doc vs blob vs timeseries based on access patterns.
Picking a database is a first-principles decision, not a defaults one. "We use Postgres" is a cultural statement; "the access pattern is point-lookup at 100k QPS with eventual consistency, so we use DynamoDB" is a design.
Read this if your last attempt…
- You've said "we'll use Postgres" or "we'll use DynamoDB" without justifying it
- You can't explain when Cassandra wins over Postgres
- You were asked "why not SQL?" and didn't have a crisp answer
- You're confused about when S3 is the right answer vs a database BLOB column
- You default to polyglot persistence without a migration story
The concept
When an interviewer asks "which database?", they aren't asking for your favourite store. They're asking: do you understand the shape of your access pattern well enough to match it to a storage shape?
The candidate who names the access pattern first and the store second is doing design. The candidate who names the store first and justifies it after is signalling cultural preference. The direction of reasoning is the entire signal.
Match the shape of your reads + writes to the shape the storage is optimised for. There is no universal default.
Which store for which access pattern
| Shape | Best at | Latency (p50) | Horizontal scale | Consistency | Wrong answer when | Typical tools |
|---|---|---|---|---|---|---|
| SQL | Multi-row tx, joins, ad-hoc queries | 1–10 ms | Replicas easy; writes need sharding | Strong by default | >10k/s on joined queries with unbounded growth | Postgres, MySQL, Aurora |
| KV / wide-col | Linear scale on PK lookup | <5 ms p50 | Native (designed for it) | Eventual, tunable | Ad-hoc queries; cross-PK transactions | DynamoDB, Cassandra, Bigtable |
| Document | Read one aggregate by id | 2–10 ms | Native via collection sharding | Tunable per write | Deeply relational domain | MongoDB, Firestore, Cosmos |
| Object / blob | Large opaque bytes (>100 KB) | 30–100 ms | Native (global namespace) | Read-after-write strong | Per-byte random access; sub-100 ms required | S3, GCS, Azure Blob |
| Search | Ranked free-text queries, facets | 10–50 ms | Native via index sharding | Near-real-time (seconds) | Source-of-truth for transactional data | Elasticsearch, OpenSearch |
| Time-series / columnar | Windowed aggregates over time | 10–100 ms | Native via time partitioning | Eventual | Point lookups by non-time key | ClickHouse, InfluxDB, Druid |
| Cache | Sub-ms hot-read acceleration | <1 ms | Native via key partitioning | Best-effort (in-memory) | Source of truth if loss unacceptable | Redis, Memcached |
| Graph | Multi-hop relationship traversal | 5–50 ms | Hard — most still vertical | Strong or tunable | Shallow queries dominated by filters | Neo4j, Neptune, JanusGraph |
- Start in SQL for anything ambiguous. Migrate when you have a named trigger.
- Managed wins at small team size; self-managed wins when markup > ops cost.
How interviewers grade this
- You name the access pattern before the store — "point lookup by user_id", not "we need a database".
- You cite one concrete property that the chosen store gives you and no other store in the tier does.
- You name the failure mode of the chosen store (e.g., "DynamoDB gives no cross-item transactions, so I handle that with an idempotent write + saga").
- You know the difference between a BLOB column and an object store, and you pick the object store above ~100 KB.
- You can defend "start with Postgres" as a position, with a named migration trigger.
- You name managed vs self-managed explicitly and justify against team size.
- You identify when search is an index and never the source of truth.
Variants
Relational / SQL — the right default
ACID transactions, JOINs, mature tooling. Postgres until proven otherwise.
The default for 80% of system design prompts. Postgres at a well-tuned instance handles 10k+ QPS with JOINs, transactions, and a query planner that handles queries nobody anticipated.
What SQL gives you that nothing else does:
- Multi-row ACID transactions (booking systems, financial ledgers)
- FOREIGN KEY integrity enforced at the DB
- JOINs across normalised tables
- A decade of debugging tools (psql, pganalyze, EXPLAIN)
- Mature replication (physical, logical, CDC via WAL)
Scaling path: read replicas first, connection pooling, query optimisation, partitioning, finally horizontal sharding (Citus, Vitess) or migration to a sharded SQL-compatible (Spanner, CockroachDB) / moving hot tables to KV.
Named migration triggers: sustained write QPS > 20k/s on a single table; dataset > 5 TB; the workload has fractured into multiple independent access patterns.
Pros
- +ACID transactions and JOINs
- +Mature ops and debugging
- +Preserves optionality for future decisions
- +Familiar to every backend engineer
Cons
- −Horizontal scaling is hard (sharding or migration)
- −Large blobs hurt (use object store)
- −Per-shard hot tables can become bottlenecks
Choose this variant when
- Default for most prompts
- Multi-row transactions required
- Ad-hoc query and reporting needs
- Team has no distributed-systems ops bench
KV / wide-column — linear scale, sharp trade-offs
Single-partition point lookup at unbounded scale. DynamoDB, Cassandra.
The choice when one access pattern dominates, QPS is extreme, and you will trade JOINs for linear horizontal scale.
What you get:
- Single-digit-ms p99 at any scale (if modelled correctly)
- Linear horizontal scale — add nodes, throughput goes up proportionally
- DynamoDB: zero ops, pay-per-request, serverless
- Cassandra / ScyllaDB: cheaper at extreme scale, need an ops team
What you give up:
- JOINs (you denormalise into one-table-per-query)
- Cross-partition transactions (DynamoDB now has TransactWriteItems but it is limited and slow)
- Ad-hoc queries (the schema is the access pattern)
- Secondary indexes are expensive or limited
Hot partition trap: if one partition key gets 10× the traffic (celebrity user, hot product), the whole shard is throttled. Design keys for uniform distribution.
Pros
- +Linear horizontal scale
- +Single-digit-ms p99 at any size
- +Predictable cost per operation
- +Zero-ops option (DynamoDB)
Cons
- −No JOINs — denormalise everywhere
- −No ad-hoc queries
- −Hot partitions are a real failure mode
- −One table per query mindset is a cultural shift
Choose this variant when
- Dominant access pattern is PK lookup
- QPS > 20k on a single table
- No JOIN or cross-row transaction requirement
- Need linear write scaling
Document / aggregate — flexible schema, bounded sub-collections
MongoDB / Firestore for aggregate-per-read, schema-per-tenant.
Document stores optimise for reading and writing a single aggregate document. The schema flexes per document, sub-collections are embedded up to a bound, and cross-document queries are possible but awkward.
Sweet spot:
- Access pattern: fetch one aggregate, render it
- Schema varies per tenant (e.g., SaaS where each customer has custom fields)
- Sub-collections are bounded and always read with the parent
- Eventual consistency across aggregates is acceptable
Traps:
- Sub-collections grow unbounded → hit 16 MB MongoDB doc limit or 400 KB DynamoDB item
- Cross-document JOINs reinvent SQL in application code, badly
- Schema-less is not schema-less, it is schema-on-read — the schema exists, just in application code
Note: MongoDB 4+ has multi-doc transactions and JOINs (via $lookup). Both work. Neither is fast.
Pros
- +Schema flexibility per document
- +One read per aggregate
- +Horizontal scaling baked in
- +Natural fit for object models
Cons
- −Cross-document analytics is awkward
- −Doc size limits bite at scale
- −Transactions are slow (MongoDB) or limited (Firestore)
Choose this variant when
- Aggregate-per-read access pattern
- Multi-tenant with custom schema per tenant
- Bounded sub-collections
- Rapid prototyping where schema is fluid
Object / blob store — large bytes, cheap and durable
S3 / GCS for anything > ~100 KB served via CDN.
Object stores are where large opaque bytes live. They are not databases — you cannot query the contents — but at cents per GB-month with 11 nines of durability and native CDN integration, they are the only right answer for large files.
The canonical pattern:
- 1Client uploads to S3 via pre-signed URL (no traffic through your app)
- 2S3 triggers a notification (Lambda, EventBridge) on upload complete
- 3Your app writes the metadata row (
s3_key,size,mime,uploader,created_at) in SQL - 4Read path: DB gives URL → CloudFront / CloudFlare serves the bytes
When NOT an object store:
- Bytes < 10 KB (just put in DB; row-level cache works)
- Random access to middle of the file at low latency (use a block store or specialised format)
- Strong consistency across many keys (S3 is strong for single-key read-after-write; multi-key is best-effort)
Pros
- +$0.02/GB/month — cheapest storage tier
- +11 nines of durability
- +CDN integration is trivial
- +No ops; scales forever
Cons
- −Not queryable
- −30–100 ms first-byte latency (CDN mitigates)
- −Cost per request adds up at high QPS
- −No partial updates — rewrite the whole object
Choose this variant when
- File size > 100 KB
- Serving static assets, images, videos
- Archival / backup / audit logs
- ML model artefacts or dataset files
Search / inverted index — ranked full-text
Elasticsearch for search; NEVER the source of truth.
Inverted indexes make "find all documents matching free text, ranked by relevance" fast and feature-rich (facets, fuzzy match, boosting, multilingual analysis). They are an index, not a database.
The critical mental model:
- Source of truth lives in SQL or a document store
- Dual-write or CDC (Debezium, Kafka Connect) into Elasticsearch
- When the index is corrupt or schema evolves, rebuild from the primary
- NEVER let application code write only to ES
Failure modes candidates miss:
- Using ES as the primary = data loss risk (ES has had silent data-loss bugs historically)
- Not having a rebuild path = index divergence with no recovery
- Hot query patterns (expensive aggregations) = cluster-wide slowdown, not just one shard
When to reach for it: product search, log search, document search, faceted filters on complex criteria, relevance ranking beyond what SQL LIKE can do.
Pros
- +Ranked free-text with facets
- +Fuzzy matching, multilingual analysis
- +Near-real-time index updates
- +Scales via sharding
Cons
- −Not a primary store — dual-write needed
- −Expensive aggregations can take down a cluster
- −Ops-intensive (JVM, heap tuning)
- −Consistency lag of seconds
Choose this variant when
- Product / content / log search
- Faceted filtering on complex criteria
- Ranked relevance beyond SQL LIKE
- Multilingual or fuzzy text matching
Time-series / columnar — append + aggregate
ClickHouse, InfluxDB, Druid for metrics and telemetry at scale.
When the workload is time-ordered appends with queries that aggregate over time windows, time-series / columnar stores destroy SQL on both axes. ClickHouse can ingest a million rows per second per node and run a GROUP BY region, minute over a billion rows in seconds.
Shape:
- Data is time-stamped (metrics, events, logs, ticks)
- Writes are append-only (updates to historical data are rare / impossible)
- Reads are aggregations over time windows ("sum requests per region per 5-minute window")
- Point lookup by non-time key is terrible
Architecture pattern:
- Ingest via Kafka → ClickHouse / Druid via a consumer
- Retention policy (downsample old data, drop after TTL)
- Query layer in front (Grafana for ops, custom dashboards for product)
Adjacent category: OLAP data warehouses (BigQuery, Snowflake, Redshift) — same columnar principles, different operational model (loaded in batch, queried ad-hoc, pay-per-query).
Pros
- +Millions of appends/s per node
- +Aggregations over billions of rows in seconds
- +Compression is extreme (10×+ for time-series)
- +Downsampling / retention native
Cons
- −Point lookup by non-time key is awful
- −Updates are painful or impossible
- −Adds a distinct ops surface
- −Overkill below 100k events/s
Choose this variant when
- Metrics, telemetry, observability
- Ad events, financial ticks
- Log aggregation at scale
- Workload is > 100k events/s sustained
Worked example
Scenario: Pick storage for an Uber-like ride-sharing backend.
Sub-domain 1 — Rider / Driver profiles
- Access pattern: CRUD on user records; JOINs to account settings, payment methods.
- Volume: 100M users, < 1k QPS writes, < 10k QPS reads.
- Pick: Postgres. Standard OLTP. Nothing weird. ACID matters for payment-method updates.
Sub-domain 2 — Driver location (live)
- Access pattern: drivers post location every 3 s; riders query "nearest drivers to me" every 2 s.
- Volume: 1M active drivers × 0.33 writes/s = 330k writes/s. Geo query at 50k/s.
- Pick: Redis with geospatial commands (GEOADD, GEORADIUS) or a specialised geo store (Tile38). In-memory is fine because 3-second TTL is acceptable — if Redis crashes, drivers will re-report within 3 s and inventory rebuilds.
- Trade-off: no durability on driver location. Acceptable because the data is ephemeral.
Sub-domain 3 — Ride history / receipts
- Access pattern: append a ride on completion; read a rider's last 50 rides; analytics queries on revenue per city per day.
- Volume: 10M rides/day writes, 100M rides/year. Read QPS modest.
- Pick (primary): Postgres partitioned by month, indexed by (rider_id, completed_at DESC). Covers CRUD + rider-history access patterns.
- Pick (analytics): ClickHouse loaded from Postgres via CDC, for the revenue-per-city-per-day aggregations. SQL can handle the OLAP workload at this scale, but ClickHouse runs the queries in 100× less time and gives the analytics team a separate blast radius.
Sub-domain 4 — Trip photos / receipts as PDFs
- Access pattern: upload on trip complete, fetch on demand, archive after 90 days.
- Volume: 2M uploads/day, ~500 KB average.
- Pick: S3 with CloudFront in front. DB holds only the S3 key + mime + size. Move to Glacier after 90 days for cost.
Sub-domain 5 — Support ticket search
- Access pattern: support agents search "find all tickets mentioning X in the last 30 days".
- Volume: 100k tickets/day written, low read QPS but complex queries.
- Pick: Postgres as source of truth. Elasticsearch as the search index, populated via CDC from Postgres. Agents query ES; the ticket record lives in Postgres.
Sub-domain 6 — Surge pricing state
- Access pattern: geohash → current multiplier. Written by a pricing service every 30 s; read by the booking flow at 50k QPS.
- Volume: 100k geohashes × hot read.
- Pick: Redis. Hot read path with sub-ms latency, data is derived and safely re-computable on crash.
Summary of stores used: Postgres (primary transactional), Redis (live driver locations + surge), S3 (photos / PDFs), ClickHouse (analytics), Elasticsearch (support search). Five stores for five sharply different access patterns — justified each time, not polyglot for its own sake.
Good vs bad answer
Interviewer probe
“Which store would you use for the core data of a chat application (messages, channels, members)?”
Weak answer
"I'd use MongoDB because it's flexible and scales well."
Strong answer
"Three distinct access patterns, three shapes.
• Messages are write-heavy, append-only, and read in time order within a channel — that's a wide-column fit. I'd use Cassandra partitioned on channel_id, clustered on (timestamp, message_id). One partition = one channel's history. Read is a single partition scan from the newest N messages. Scales linearly on writes across channels.
• Channel metadata (name, members, settings) is low-volume and read-heavy — Postgres is fine; ACID matters when someone leaves a channel or a settings write has to be atomic with an audit log entry.
• Membership lookup on the hot path (is user X in channel Y?) hits a KV store — Redis or DynamoDB — keyed on user_id → channel_set. That lets the message-send path check membership in <1 ms without hitting Postgres on every message.
The trade-off I'm buying: message history is eventually consistent across replicas; a reader may see messages out of order for a ~100 ms window. That's acceptable for chat UX and is handled by client-side message-id ordering."
Why it wins: Separated three access patterns before naming three stores. Each store is defended against the specific property it provides. The consistency trade-off is named explicitly and justified against the UX. Shows maturity by knowing when to combine stores rather than force one shape to fit everything.
When it comes up
- When the interviewer asks "which database" after you've described the data model
- In scaling deep-dives when the single-store answer breaks
- When the prompt involves search, media, or analytics — each is a distinct shape
- During trade-off discussions: "why not SQL? why not DynamoDB?"
- When asked about "polyglot persistence" — the answer is "only when justified"
Order of reveal
- 1Restate the access pattern. "Before I name a store, the dominant access pattern is X at Y QPS with Z latency target."
- 2Name the shape. "That shape is SQL / KV / document / object / search / time-series because..."
- 3Pick the specific tool. "Specifically Postgres / DynamoDB / Cassandra / S3 / Elasticsearch / ClickHouse — and here is why this one over the others in the tier."
- 4Name the single property. "What I'm buying from this choice that nothing else gives me: <ACID / linear scale / schema flex / cheap bytes / ranked search / time-window aggs>."
- 5Name the trade-off. "What I'm giving up: <JOINs / ad-hoc queries / durability / sub-ms latency / strong consistency>. That's acceptable because <reason>."
- 6Separate sub-domains if needed. "This sub-system wants a different shape than the main path. For the search path I'd layer Elasticsearch on top; for the media path, S3."
- 7Name the migration trigger. "I'd start with one store. I'd migrate/add when <specific QPS threshold, dataset size, or access-pattern divergence>."
- 8Managed vs self-managed. "For a small team I'd go managed (RDS / DynamoDB / Atlas). At scale X I'd revisit self-managed to save cost."
Signature phrases
- “Access pattern first, store second” — Reverses the wrong-direction reasoning the interviewer sees most often.
- “The property this store gives me that others do not” — Forces a real defence instead of a cultural preference.
- “Bytes above 100 KB go to the object store” — Sharp heuristic that shows you have dealt with BLOB pain.
- “Elasticsearch is an index, not a primary store” — Catches the data-loss trap candidates routinely walk into.
- “Start in Postgres, migrate when trigger X fires” — Demonstrates storage maturity and avoids premature polyglot.
- “Eventual consistency is fine when the UX tolerates a 100 ms window” — Concrete trade-off rather than hand-wavy "it's fine".
Likely follow-ups
?“Why not just use Postgres for everything?”Reveal
You can — for most prompts, you should. Postgres is the right default. The migration triggers are:
- 1Sustained write QPS > 20k/s on a single table — Postgres will struggle; the hot table becomes a bottleneck even with sharding.
- 2Dataset > 5 TB with mixed access patterns — operational weight (backups, dumps, index rebuilds) becomes prohibitive.
- 3Workload has fractured into distinct access patterns — a single store optimised for all of them ends up optimal for none.
- 4Media / large blobs — always to object storage, never in a BLOB column.
- 5Full-text search beyond `LIKE` — layer Elasticsearch.
- 6OLAP analytics on hot OLTP data — CDC into a columnar store (ClickHouse, BigQuery, Snowflake).
Before any of those triggers fire, "just Postgres" is the correct design. The senior move is to name the triggers explicitly rather than reach for polyglot on day one.
?“When would you pick DynamoDB over Cassandra (or vice versa)?”Reveal
Both are KV / wide-column with similar data models. The choice is operational:
Pick DynamoDB when:
- You want zero-ops and serverless pricing
- Your team has no distributed-systems ops bench
- You're on AWS already
- Traffic is bursty (pay-per-request scales to zero)
Pick Cassandra / ScyllaDB when:
- You're at extreme scale where DynamoDB cost gets painful (>$50k/month is often the crossover)
- You need multi-cloud or on-prem
- You have an ops team comfortable with JVM / distributed-systems work
- You need features DynamoDB doesn't have (materialised views, user-defined types, tunable consistency per read)
ScyllaDB specifically when Cassandra's tail latencies (JVM GC) hurt — ScyllaDB is Cassandra rewritten in C++ and routinely ships 5–10× better p99.
Default for a new AWS team: DynamoDB. Default for a mature team at scale: Cassandra / ScyllaDB. The right answer depends more on your org than your access pattern.
?“Where does Redis fit — cache or primary store?”Reveal
Primarily a cache, but legitimately a primary store in three specific cases:
Cache (the default use):
- In-memory hot-read layer in front of a primary store
- Sub-ms latency
- Data can be rebuilt from the primary on loss
Primary store, legitimately when:
- 1Ephemeral data — session tokens, rate-limit counters, WebSocket connection state. Loss on restart is acceptable.
- 2Hot volatile data — live scores, current driver locations, surge pricing. Data is derived and re-computable.
- 3Specialised data structures — leaderboards (ZSETs), feed inboxes (LISTs), real-time analytics (HyperLogLog, TopK). SQL would be orders of magnitude slower.
Durability: RDB snapshots + AOF append-only log give Redis persistence that is "good enough for most things, not good enough for money." Production Redis runs with AOF on every-second-fsync and replication. Do not use it for anything where a 1-second data loss is catastrophic.
?“You chose object storage (S3) for user-uploaded images. How does the client upload without going through your app servers?”Reveal
Pre-signed URLs. The pattern:
- 1Client calls your API:
POST /uploadswith metadata (filename, mime, size). - 2Server validates, generates a pre-signed S3 PUT URL (valid for 15 min) with constraints (content-type, size limit).
- 3Server returns the URL + the eventual asset ID to the client.
- 4Client PUTs the file bytes directly to S3 using that URL. Your app servers never see the bytes.
- 5S3 triggers an
ObjectCreatedevent → Lambda / EventBridge → your app updates the asset row withstatus: ready.
Why this matters:
- Your app servers don't bottleneck on upload bandwidth
- Cost drops 10× (no egress through your compute)
- Scale is trivial (S3 handles 10k+ PUTs/s per prefix natively)
- Security: the pre-signed URL is scoped to one object, one operation, one short window
Gotchas:
- CORS on the S3 bucket must allow browser uploads
- The client needs to handle retries (S3 rejects malformed uploads)
- You need the "file uploaded but metadata never updated" reconciliation path (periodic sweeper, or handle it in the S3 event)
Code examples
BEGIN;
-- Reserve the seat atomically; fail if already taken
UPDATE seats
SET reserved_by = $user_id, reserved_at = now()
WHERE seat_id = $seat_id AND reserved_by IS NULL
RETURNING seat_id;
-- Insert the booking record (fails if no seat was reserved)
INSERT INTO bookings (booking_id, user_id, seat_id, amount_cents)
VALUES ($booking_id, $user_id, $seat_id, $amount)
ON CONFLICT DO NOTHING;
COMMIT;
-- If UPDATE returned no rows, the seat was taken by someone else.
-- If INSERT conflicted, client retried with same booking_id (idempotent).
-- Both guaranteed atomic — you cannot end up with a booking without a reservation.// One table: PK = "USER#<id>" / SK = "PROFILE" | "ORDER#<ts>" | "ADDR#<id>"
// Lets you fetch a user profile + recent orders + addresses in ONE query.
await ddb.query({
TableName: 'AppData',
KeyConditionExpression: 'PK = :pk AND begins_with(SK, :prefix)',
ExpressionAttributeValues: {
':pk': `USER#${userId}`,
':prefix': '', // all items for this user
},
ScanIndexForward: false,
Limit: 25,
});
// Returns: 1 profile + latest 24 orders + all addresses, sorted by SK DESC.
// Single round-trip. No JOIN. Linear scale per partition.import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
async function createUploadUrl(userId: string, mime: string) {
const key = `uploads/${userId}/${crypto.randomUUID()}`;
const cmd = new PutObjectCommand({
Bucket: 'app-uploads',
Key: key,
ContentType: mime,
ContentLength: 10_000_000, // 10 MB cap
});
const url = await getSignedUrl(s3, cmd, { expiresIn: 900 });
await db.assets.insert({
asset_id: key,
owner_id: userId,
status: 'pending',
mime,
});
return { upload_url: url, asset_id: key };
}
// Client PUTs directly to upload_url.
// S3 event -> Lambda -> db.assets.update({ asset_id }, { status: 'ready' })-- Events table, partitioned by day, ordered by (region, event_at)
CREATE TABLE events (
event_at DateTime,
region LowCardinality(String),
user_id UInt64,
event_kind String,
value_ms Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_at)
ORDER BY (region, event_at);
-- "Average latency per region per 5-min window over the last hour"
SELECT
region,
toStartOfInterval(event_at, INTERVAL 5 MINUTE) AS bucket,
avg(value_ms) AS p_avg,
quantile(0.99)(value_ms) AS p99
FROM events
WHERE event_at >= now() - INTERVAL 1 HOUR
AND event_kind = 'api_request'
GROUP BY region, bucket
ORDER BY region, bucket;
-- Runs in 200 ms over a billion rows. SQL would take minutes.Common mistakes
"Scales" is not a property. Scales-in-what-dimension matters. Cassandra scales writes linearly but gives you almost nothing on ad-hoc reads. Name the dimension.
You've just made your backups 50× bigger, your rows un-cacheable, your replication bandwidth spike, and your dump/restore a multi-hour operation. The rule: > ~100 KB → object store, DB holds only URL + metadata.
Elasticsearch is a great index and a mediocre database. Dual-write from your primary store into ES; never the reverse. When the index is corrupt you rebuild it from the primary. Treating ES as a database is how teams end up with data loss.
Two teams end up with three databases and no coherent cross-store consistency story. Introduce new stores deliberately, with a named owner, a dual-write / CDC plan, and an understanding of failure modes.
Eventual consistency has a bounded convergence time (usually milliseconds). The correct question is "is the UX OK if a user sees a 100 ms stale read?" — for most read paths, yes. For money and inventory, no.
Small teams underestimate the ops cost of running Cassandra / Kafka / Elasticsearch themselves. The managed markup (DynamoDB vs Cassandra, Confluent vs self-Kafka) is often cheaper than one ops engineer.
Practice drills
A URL shortener at 100M links, 20k redirects/s. Which store and why?Reveal
KV store (DynamoDB or Redis-backed).
The access pattern is pure point-lookup by short code at high QPS with no JOINs and no transactions. A KV store gives you <5 ms p50 and linear scale.
Specifically DynamoDB if managed-first: short_code as PK, long_url + expires_at + owner_id as attributes. On-demand pricing handles the burst pattern of viral links. Global tables give you multi-region reads sub-10 ms.
Or Redis if the dataset fits in memory: 100M × 200 bytes = 20 GB, fits on a single m5.2xlarge. Replication + AOF gives you durability. 100× cheaper than DynamoDB at this scale.
Postgres works until the cache tier fails and the DB eats 20k/s on a hot SELECT — the KV store doesn't care.
A hotel booking system that must prevent double-booking a room. Which store?Reveal
SQL — Postgres or MySQL with SERIALIZABLE isolation on the reservation.
The access pattern requires multi-row ACID transactions: reserve the room row AND write the booking row atomically, or fail. Eventual consistency here means two guests in one room.
Implementation:
BEGIN;
UPDATE rooms
SET reserved_by = $user, reserved_at = now()
WHERE room_id = $room AND reserved_by IS NULL
RETURNING *;
-- If zero rows returned, someone else got it. Rollback.
INSERT INTO bookings (user_id, room_id, start_date, end_date)
VALUES ($user, $room, $start, $end);
COMMIT;DynamoDB's conditional writes + TransactWriteItems could also work but are slower and have item-count limits. Postgres is the natural answer.
A photo-sharing app storing originals + thumbnails + EXIF. Where does each piece live?Reveal
Originals and thumbnails → S3 (large opaque bytes, CDN-served). EXIF + photo metadata (user_id, timestamp, caption, album_id) → SQL or document store.
The architecture:
- Client uploads original to S3 via pre-signed URL
- S3 event triggers a thumbnail generator (Lambda)
- Thumbnails (3 sizes) stored as
photos/<id>/thumb_{sm,md,lg}.jpg - Metadata row in Postgres:
(photo_id, user_id, album_id, taken_at, s3_key, width, height, exif_blob, created_at) - CDN (CloudFront) in front of S3 for reads
Why this split:
- S3: $0.02/GB/month, CDN-friendly, infinite scale
- Postgres: queryable (find photos by user, by date, by album), indexable, tiny rows (~1 KB)
- BLOB-in-database would 10× your storage cost, kill replication bandwidth, and make thumbnails un-cacheable
This is the standard media architecture; worth memorising as a reflex answer.
A ride-sharing app needs to answer "find the 10 nearest drivers to this rider" at 50k QPS. Which store?Reveal
Redis with geospatial commands (GEOADD, GEORADIUS) — or a specialised geo-store (Tile38, Elasticsearch with geo-point fields).
Why Redis:
- In-memory = sub-ms geo-query latency
- GEORADIUS computes "drivers within X km of (lat, lng)" natively
- Driver locations update every 3 s → TTL handles stale data for free
- If Redis crashes, drivers re-report within 3 s and inventory rebuilds — durability doesn't matter
Sharding: by geohash prefix across multiple Redis shards. Each shard handles one region; the "nearest drivers" query hits 1–4 shards max.
Why NOT SQL: PostGIS can do geo-queries, but not at 50k QPS on churning data. Writes at 330k/s (1M drivers × 0.33 Hz) would require aggressive sharding, and the 3-s churn invalidates B-tree indexes constantly.
The pattern: hot derived state → Redis; source of truth (ride history, driver profile) → Postgres.
You have 500M user events per day. Product wants "what's the conversion rate by funnel step by country by day for the last 90 days?" Which store, and how do you get data into it?Reveal
ClickHouse (or BigQuery / Snowflake / Druid if on that stack). Columnar time-series store.
Why columnar: the query is a GROUP BY country, funnel_step, day aggregation over 45 billion rows. Columnar compression shrinks that to ~1–5 TB on disk; the aggregation reads only the relevant columns. ClickHouse runs this in seconds.
Ingestion pipeline:
- 1Product emits events to Kafka (
events.raw, partitioned by country) - 2A Kafka consumer writes batches (every 5 s) into ClickHouse via native protocol
- 3ClickHouse MergeTree engine partitioned by
toYYYYMMDD(event_at), ordered by(country, funnel_step, event_at) - 4Retention: drop partitions > 90 days via
ALTER TABLE ... DROP PARTITION
Why NOT Postgres: 500M/day = 5800 writes/s sustained, manageable, but the aggregation query over 45B rows would take hours even with indexes. Postgres is OLTP; this is OLAP.
Why NOT a generic document store: no columnar compression, no vectorised query engine. Ingest is fine; query is the blocker.
Serving layer: pre-compute the top dashboards (hourly cron) into a small Postgres table for the main product analytics UI. ClickHouse handles ad-hoc queries from the analytics team.
Cheat sheet
- •Access pattern before storage. Never the reverse.
- •Name the one property the chosen store gives you that others don't.
- •BLOBs > ~100 KB → object store; keep only URL in your DB.
- •Search index is an index, not a primary store. Source of truth lives elsewhere.
- •Start simple (one SQL store) if unsure; name the migration trigger.
- •"Scales" is not a decision — scales in what dimension, at what cost?
- •Pre-signed URLs for client uploads — bytes bypass your app servers.
- •Managed for small teams; self-managed when markup > ops cost.
- •Redis is primarily a cache. Primary store only for ephemeral or volatile data.
- •Time-series workloads ≥ 100k events/s → columnar (ClickHouse, Druid), not SQL.
Practice this skill
These problems exercise Storage choice justification. Try one now to apply what you just learned.