corerequirements

Requirements & scope framing

Separating functional from non-functional, surfacing assumptions, bounding scope.

The first five minutes decide the next forty. Candidates who skip scoping design the wrong system brilliantly — and still fail. The ones who nail it look senior before they've drawn a single box.

Read this if your last attempt…

You got 15 minutes into HLD and the interviewer said "but what about X?"
You listed ten features and built none of them
Your NFRs were "fast and scalable" — nothing numeric
You asked six clarifying questions before drawing anything
You designed a global multi-region system for a niche internal tool

The concept

Requirements scoping is the deliberate narrowing of an open-ended prompt into a designable system. It's the single highest-leverage activity in a system-design interview because every choice you make after it depends on it.

Ninety percent of bad interviews aren't bad because the candidate forgot consistent hashing. They're bad because the candidate was designing a general-purpose social platform when the interviewer wanted a consumer chat app with specific latency requirements. Narrow first, design second.

Architecture diagram· The scoping funnel

Open prompt narrows to designable system via functional (in/out), non-functional with numbers, and assumptions.

NFR starter set — pick a number for each line, not an adjective.

NFR	Consumer target	Enterprise target	Payments target	Trade-off it forces
p99 latency (read)	< 200 ms	< 500 ms	< 300 ms	Cache layer, CDN, regional replicas
p99 latency (write)	< 500 ms	< 2 s	< 1 s	Sync vs async commit, single-leader shape
Availability	99.9–99.99%	99.95%	99.999%	Replication, multi-AZ, graceful degrade
Durability	Zero loss on commit	Zero loss on commit	Zero loss + audit	Quorum writes, backups, PITR
Consistency	Eventual on read, strong on write	Strong	Linearizable + serializable	Single-leader vs multi-leader
Scale (DAU + peak RPS)	100M / 20K	100K / 500	10M / 5K	Shard count, cache tier
Security / abuse	Rate limit + captcha	SSO + audit	PCI-DSS + fraud	Edge stack, logging, retention

Numbers are starters for conversation; adjust on prompt cues.
Payments systems often need availability higher than consumer social — downtime costs real money.

How interviewers grade this

You list 3–5 must-have functional requirements and explicitly mark 2–3 as out-of-scope or "nice to have".
You name the non-functional requirements that will drive design (latency target, availability target, durability, scale) — not generic "fast and reliable".
You ask clarifying questions before assuming — "who are the users, what is the scale, is this read-heavy?" — but only 1–2, then state assumptions for the rest.
You distinguish read-path requirements from write-path requirements because they often have different NFRs.
You tie each NFR to a number ("p99 < 100ms", "99.99% availability", not "fast" and "highly available").
You bundle assumptions and invite correction once, instead of asking questions serially.

Variants

MoSCoW (must/should/could/won't)

Four explicit priority buckets; most compact way to state product judgment.

Structure:

Must-have: the three functional musts without which the system is not v1
Should-have: important but not blocking; acknowledge and park
Could-have: nice-to-have; mention once, move on
Won't-have: explicit out-of-scope; tells the interviewer you chose

Why it works in interviews: it takes 60–90 seconds to say, the interviewer can follow it without you drawing anything, and it produces an obvious agenda for the rest of the interview. Must-haves become the features you design; should-haves become the "if time permits" deep-dive candidates.

Example for a chat app:

Must: send message, receive message, conversation list
Should: typing indicators, read receipts
Could: voice messages, reactions
Won't: end-to-end encryption, video calls, group admin roles

Pros

+Compact, universally understood
+Shows priority-setting as a skill
+Sets up a natural deep-dive agenda

Cons

−Easy to over-populate "should-have" and lose the three-musts focus

Choose this variant when

The prompt has obvious sprawl (multiple sub-systems in one product)
You need a compact structure to impose on an open question

NFR elicitation (5-number checklist)

Walk the five NFRs (latency, availability, durability, consistency, scale) and put a number on each.

The five NFRs, with starter numbers:

Latency — read p99 target (100–500 ms), write p99 target (200 ms–2 s)
Availability — 99.9% (4 nines is stretch for non-payments), 99.99% if on the critical user path
Durability — almost always "zero data loss on commit"; some systems (metrics, analytics) can tolerate 0.1% loss
Consistency — strong vs eventual; read-your-write vs lax; name the guarantee
Scale — DAU, peak QPS, total storage over retention window

Do not say "fast, reliable, scalable." Say "p99 under 200 ms, 99.99% available, linearizable on the write path, 100M DAU peaking at 20K RPS."

The hack: run this checklist in under 90 seconds. Walking through five lines keeps you from forgetting one, and any NFR you skip is a question the interviewer will ask later.

Pros

+Catches NFRs you would forget otherwise
+Forces numeric precision on every line
+Sets up the capacity pass cleanly

Cons

−Can feel robotic if delivered without context

Choose this variant when

You tend to forget one or two NFRs under pressure
The prompt is light on explicit NFR hints

Assumption-stating (bundled)

State multiple assumptions at once, invite challenge once.

Instead of asking "is it consumer or enterprise? what is the DAU? what is the read:write ratio?" (three round-trips), bundle them:

"I am going to assume consumer, 100M DAU, read-heavy by 100:1, global but deployed single-region on day one. Push back on any of that if it is off."

Why it works: the interviewer can correct any line in 10 seconds, and the other lines stay valid. You have shown that you know which knobs matter without using three minutes of setup time.

Watch out for: stating an assumption that contradicts an explicit prompt. Re-read the prompt before stating assumptions.

Pros

+Fast; compresses three minutes of Q&A into 30 seconds
+Shows which knobs you know matter
+Keeps momentum — you can start designing immediately

Cons

−If interviewer corrects several lines, you lose a minute rebasing

Choose this variant when

Prompt is open-ended with multiple obvious unknowns
Time pressure (round is < 45 min)

Scope-cutting under time pressure

Mid-interview, the interviewer adds a feature; cut something else.

When the interviewer pulls a scope expansion — "what about notifications?" or "what if we also support group chat?" — you have three options:

1Absorb into existing design if it fits naturally. ("Notifications run off the existing event bus; here is the topic shape.")
2Acknowledge and park if it does not fit. ("I can address group chat at the HLD level but we will not have time to do the deep dive.")
3Swap for something else if the new ask is more important than an existing must-have. ("Let me swap this with click analytics — which do you care about more?")

The move senior engineers make: never just say "sure, I'll add that" without acknowledging the time cost. Protecting your scope is as important as defining it.

Pros

+Keeps the design deliverable in the time remaining
+Shows product judgment under pressure

Cons

−Can feel confrontational if delivered without grace

Choose this variant when

Interviewer expands scope mid-design
You are 25+ minutes in and a new ask would derail the deep dive

Read-path vs write-path NFR split

Name separate latency, consistency, and availability targets for reads and writes.

Most systems have wildly different NFRs on the read and write paths.

URL shortener: redirect latency < 100 ms (user waits), create latency < 500 ms (user clicks a button, waits).
Payments: read latency < 300 ms, write latency < 1 s but durable and linearizable.
Social feed: read latency < 200 ms (eventual consistency OK), write latency < 500 ms (linearizable for the author's read-your-write).

How to state it: "The read path and write path have different requirements. Reads need X with Y consistency; writes need Z with W consistency."

This framing pays dividends in the design — read path gets cached aggressively with eventual consistency; write path goes through the primary with strict guarantees. Interviewers love it because it signals you will not over-engineer both paths.

Pros

+Prevents over-engineering one path
+Sets up the read/write-path split that the architecture will follow

Cons

−Adds 30 seconds to the scoping pass; worth it

Choose this variant when

System has obvious read-path and write-path asymmetry
Read:write ratio is > 10:1 or < 1:10

Worked example

Prompt: "Design a URL shortener."

Your 3-minute scoping (spoken out loud):

MoSCoW

Must-have (in scope):

Create short URL from long URL
Redirect short → long via HTTP 302
Click analytics (count, referrer, geo)

Should-have (acknowledge, park):

Link expiration / TTL
User account management

Could-have (mention once, move on):

Custom slugs
QR code generation

Will-not-have (explicit out of scope):

Custom branded domains
User-facing analytics dashboard
Team / organisation accounts

Non-functional requirements with numbers:

Redirect latency: p99 < 100 ms — on the critical user path
Create latency: p99 < 500 ms — user is clicking a button
Availability: 99.99% for redirects, 99.9% for creates
Durability: zero data loss on committed writes — quorum required
Consistency: eventual on read, strong on write
Scale: 100M new URLs/month, 100:1 read:write, ~20K peak redirect RPS
Security: short codes must not be enumerable

Read-path vs write-path split:

Read path (redirect): tight latency, cache-heavy, eventual consistency, 99.99% availability
Write path (create): moderate latency, durable, strong consistency, 99.9% availability

Clarifying questions (2 max):

"Consumer or internal?" - affects auth and abuse surface
"Global on day one, or single region?" - affects cost and complexity

Bundled assumptions (state once):

"I am going to assume consumer, global but single-region on day one, links are publicly accessible (no per-link auth on redirect), 302 is fine. Push back if any of that is off."

End of scoping: 3 minutes. Every downstream decision will be measured against these NFRs. The interviewer now has a chance to correct any line before we invest in architecture.

Good vs bad answer

Interviewer probe

“What are the requirements?”

Weak answer

"We need to shorten URLs, redirect them, maybe track clicks, support custom URLs, analytics dashboards, team accounts, API keys, and it should be fast and scalable."

Strong answer

"Three functional musts: create, redirect, click analytics. Link expiry and user accounts are should-haves — I will park those. Custom domains, team accounts, and API keys are out of scope.

NFRs with numbers: redirect p99 < 100 ms, create p99 < 500 ms, 99.99% availability on the redirect path, 99.9% on create, durability of created URLs must be bulletproof. Consistency: eventual on read, strong on write. Scale: 100M creates/month, 100:1 reads:writes so ~20K peak redirect RPS.

Bundled assumptions: consumer-facing, single-region to start, 302 redirects. Does that match what you have in mind?"

Why it wins: Three musts, explicit should-haves and will-not-haves, seven numeric NFRs split into read/write, bundled assumptions, invites correction once. Every downstream decision now has a measuring stick.

Interview playbook3–5 min at the start

When it comes up

Always — the first 3–5 minutes of every system design round
Before you draw any architecture box
When the interviewer says "how would you design X?" without any numbers
When the interviewer expands scope mid-interview ("what if we also support...")

Order of reveal

1
Acknowledge the prompt. "Before I start designing, let me scope this. I want to make sure we are designing the same system."
2
List functional musts (three). "Three must-haves: X, Y, Z. These are what define v1 of the system."
3
List explicit out-of-scope. "Out of scope: A, B, C. These are real product concerns but not part of this design."
4
Walk the 5 NFRs with numbers. "Latency target is X; availability is Y; durability is Z; consistency is W on reads and V on writes; scale is N DAU with M peak RPS."
5
Split read vs write path. "Reads are cache-friendly with eventual consistency; writes need strict durability. That asymmetry will drive the architecture."
6
Bundle assumptions. "I am going to assume consumer, single-region, [specific defaults]. Push back on any of that."
7
Invite correction. "Does this match what you have in mind, or should I adjust anything before I start designing?"

Signature phrases

“Let me scope this before I draw anything”

“Three must-haves, everything else is out of scope”

“Every NFR has a number”

“Reads and writes have different NFRs”

“I am going to assume X; push back if that is off”

“Out of scope is a feature”

“Let me scope this before I draw anything” — Signals discipline; buys three minutes without looking slow.
“Three must-haves, everything else is out of scope” — Imposes a priority structure the interviewer can follow.
“Every NFR has a number” — Demonstrates the "design is measured" mindset.
“Reads and writes have different NFRs” — Shows you avoid the over-engineer-both-paths trap.
“I am going to assume X; push back if that is off” — Invites correction without stalling.
“Out of scope is a feature” — Product judgment in a four-word frame.

Likely follow-ups

?“Why did you cut custom domains from scope?”Reveal

Two reasons:

1Feature depth: custom domains involve DNS management, SSL provisioning, and billing complexity. Doing them right is a month of engineering, not a design deep-dive.
2Orthogonal to core shape: the core system (create + redirect + analytics) does not depend on custom domains. Adding them later is a CDN config change and an ownership-model addition — not a redesign.

If the interviewer specifically wants custom domains, I would swap them in for analytics. But by default, I optimize for designing the core deeply over covering every feature shallowly.

?“You said 99.99% availability. Defend that number.”Reveal

99.99% = ~52 minutes of downtime per year, ~4.3 minutes per month. For a URL shortener, broken redirects break every shared link — a 1-hour outage means every link shared on social media for that hour is dead. That is a brand-damaging event.

At 99.9% we get ~8.7 hours of downtime per year — unacceptable for something on the public web. At 99.99% we need multi-AZ deployment, graceful degradation on cache failure, and an error budget gating risky deploys. At 99.999% (five nines) we would need multi-region active-active, which adds cost and complexity not justified for a social share tool.

99.99% is the right tier for this product. If the interviewer said "this is for internal use only," I would drop to 99.9%.

?“What if I told you we actually need 1B DAU, not 100M?”Reveal

That changes three things:

1Capacity: peak RPS goes from 20K to 200K. That crosses sharding thresholds — primary Postgres cannot do 20K writes/sec; we would need sharded storage (Cassandra or sharded Postgres).
2Cache: working set grows 10×. Hot set is now ~600 GB instead of 60 GB; we need a proper Redis cluster.
3Geographic: 1B DAU is globally distributed; a single region adds 300+ ms for some users. Multi-region is now required.

Let me rebase: I will re-state the assumptions, revisit the capacity pass, and flag which architecture choices change. The functional scope stays the same; the NFRs tighten; the architecture grows a region and a shard dimension.

?“Can we add group chat to this messaging system?”Reveal

Let me think about what that costs. Group chat changes three things in the existing design:

Fan-out: 1-to-N delivery instead of 1-to-1; group size bounds (10? 100? 1000?) change the write amplification significantly.
Read model: members list, read receipts, typing indicators all become per-member-per-group.
Permissions: who can add/remove, who can read history, admin roles.

Options:

1Absorb: if we assume small groups (< 50 members), fan-out is bounded; we can add it as a variant of the DM delivery path. Adds ~5 minutes of design.
2Swap: drop something else from must-haves. Which would you prefer to cut?
3Park: mark it as HLD-level with "here is how it would plug in" but do not deep-dive.

Which do you want?

Common mistakes

Asking six clarifying questions in a row

Drains time and looks hesitant. Ask at most 2, state the rest as assumptions. Assumptions the interviewer disagrees with are a cheap round-trip to correct; question-asking theatre is not.

Non-functional requirements that are adjectives

"Fast", "scalable", "robust" — you have said nothing. Every NFR has a number. Missing numbers is the most common reason interviews drift.

No read/write splitAdvanced

Read-path and write-path often have dramatically different NFRs (latency target, consistency tolerance, cache tolerance). Not splitting them forces you to over-engineer one or under-engineer the other.

Assuming the biggest possible scale unprompted

Jumping to "1B DAU" forces an over-engineered design that reads as "doesn't know when to simplify". Anchor to the prompt; state your number; let the interviewer nudge it up.

No out-of-scope list

A scope without an out-of-scope list is aspirational, not designable. Interviewers read "I will design everything" as "I will design nothing deeply". Name three things you are not doing.

Mid-design scope creep without re-anchoringAdvanced

Interviewer says "what about group chat?"; you just add it. Now your 3-feature design is a 4-feature design and the time budget no longer works. Name the cost: "adding this pushes out the deep-dive; should I swap, absorb, or park?"

Practice drills

Interviewer: "Let's design Twitter." You've got 3 minutes for scoping. Go.Reveal

MoSCoW

Must: (1) post a tweet, (2) home timeline (follow-based feed), (3) user profile page
Should: notifications, replies
Could: trending, search
Won't: DMs, ads, media upload, advertiser dashboards

NFRs with numbers:

Feed read p99 < 200 ms (consumer social)
Tweet write p99 < 500 ms, durable on commit
Availability: 99.99% on read, 99.9% on write
Consistency: eventual on feed read, strong on tweet write (read-your-write for the author)
Scale: 300M DAU, avg user reads 50 tweets/day → ~175k read RPS avg, ~700k peak at 4× (consumer social)

Read/write split: read-heavy by 100:1. That shapes the fan-out decision (fan-out on write vs read).

Assumptions: "Single-region first; global is a follow-up. Tweets are public by default. Push back if any of that is off."

You state "99.99% availability". Interviewer asks: "What does that mean exactly?"Reveal

99.99% = ~52 minutes of downtime per year, or ~4.3 minutes per month. "Available" means the read path returns a successful response within the latency SLO.

That's a budget, not a promise. Every deploy, bad config push, or regional outage burns part of that budget. To actually achieve it you need:

Multi-AZ deployment (one AZ failure = still available)
Graceful degradation (read path stays up even if write path is down)
An error budget that gates risky changes — if you are over budget, the next risky deploy waits

Interviewer expands scope: "Oh, we also need multi-region replication." You are 25 minutes in. What do you do?Reveal

Name the cost before agreeing. "Multi-region changes three things: latency on the write path (async vs sync replication), consistency model (single-leader vs multi-leader), and operational complexity (conflict resolution, failover).

I have two options:

1HLD-level only — I can show how the current design extends to multi-region at the architecture level; we will not have time for deep-dive on conflict resolution.
2Swap the deep-dive — I can drop the [current deep-dive target] and replace it with the multi-region design, which gets us a real deep-dive on the hardest part.

Which do you prefer?"

Never just agree to expand scope without showing you know what you're giving up.

How do you decide the right availability target — 99.9%, 99.99%, or 99.999%?Reveal

Work from the cost of downtime.

99.9% (~9 hours/year): internal tools, dev dashboards, non-blocking features. Cheap to operate, not worth more.
99.99% (~52 min/year): consumer products where downtime damages brand (social, shopping, streaming). Needs multi-AZ, good observability, graceful degradation.
99.999% (~5 min/year): payments, critical infrastructure, auth services that gate everything else. Needs multi-region active-active, extensive testing, organizational maturity.

The test: "what does a 1-hour outage cost us in reputation, revenue, or safety?" If the answer is "annoying", 99.9% is fine. If "brand damaging", 99.99%. If "existential", 99.999%.

The prompt says "design a chat app." What are three clarifying questions worth asking?Reveal

Pick the two whose answers most change the design:

11-to-1 only or group chat? — group chat changes fan-out, permissions, and storage shape dramatically.
2Consumer or enterprise? — affects auth (SSO vs phone), abuse surface (spam vs insider threat), and retention (product chat vs compliance logs).
3Global or regional? — affects latency budget and whether cross-region replication is a must.

Pick two, state the third as an assumption. Do not ask about programming language, cloud provider, team size, or budget — those do not change the design.

Cheat sheet

•Spend 3–5 minutes on scoping. Not 1, not 10.
•MoSCoW: must / should / could / won't. Three musts, three explicit won'ts.
•Every NFR has a number. "99.99%", "p99 < 100 ms", "3 TB over 5 years".
•Walk the 5-NFR checklist: latency, availability, durability, consistency, scale.
•Read-path and write-path NFRs often differ. Split them if they do.
•Ask 1–2 clarifying questions max. Bundle everything else as stated assumptions.
•"Out of scope" is a feature — it shows you prioritised.
•Invite correction once: "push back on any of that."
•Scope creep mid-interview: name the cost, offer absorb / swap / park.
•If you can't defend an NFR with one sentence, drop the number.

Practice this skill

These problems exercise Requirements & scope framing. Try one now to apply what you just learned.

url shortener rate limiter

8% complete

Current

Read this if

Step 1 of 13

The concept

Jump to next

NFR

Consumer target

Enterprise target

Payments target

Trade-off it forces

p99 latency (read)

< 200 ms

< 500 ms

< 300 ms

Cache layer, CDN, regional replicas

p99 latency (write)

< 500 ms

< 2 s

< 1 s

Sync vs async commit, single-leader shape

Availability

99.9–99.99%

99.95%

99.999%

Replication, multi-AZ, graceful degrade

Durability

Zero loss on commit

Zero loss + audit

Quorum writes, backups, PITR

Consistency

Eventual on read, strong on write

Strong

Linearizable + serializable

Single-leader vs multi-leader

Scale (DAU + peak RPS)

100M / 20K

100K / 500

10M / 5K

Shard count, cache tier

Security / abuse

Rate limit + captcha

SSO + audit

PCI-DSS + fraud

Edge stack, logging, retention