Requirements & scope framing
Separating functional from non-functional, surfacing assumptions, bounding scope.
The first five minutes decide the next forty. Candidates who skip scoping design the wrong system brilliantly — and still fail. The ones who nail it look senior before they've drawn a single box.
Read this if your last attempt…
- You got 15 minutes into the high-level design (HLD) and the interviewer said "but what about X?"
- You listed ten features and built none of them
- Your NFRs were "fast and scalable" — nothing numeric
- You asked six clarifying questions before drawing anything
- You designed a global multi-region system for a niche internal tool
The concept
Requirements scoping is the deliberate narrowing of an open-ended prompt into a designable system. It's the single highest-leverage activity in a system-design interview because every choice you make after it depends on it.
Most bad interviews aren't bad because the candidate forgot consistent hashing. They're bad because the candidate was designing a general-purpose social platform when the interviewer wanted a consumer chat app with specific latency requirements. Narrow first, design second.
Figure: an open prompt narrows to a designable system through functional requirements (in and out of scope), non-functional requirements with numbers, and stated assumptions.
NFR starter set — pick a number for each line, not an adjective.
| NFR | Consumer target | Enterprise target | Payments target | Trade-off it forces |
|---|---|---|---|---|
| p99 latency (read) | < 200 ms | < 500 ms | < 300 ms | Cache layer, CDN, regional replicas |
| p99 latency (write) | < 500 ms | < 2 s | < 1 s | Sync vs async commit, single-leader shape |
| Availability | 99.9–99.99% | 99.95% | 99.999% | Replication, multi-AZ, graceful degrade |
| Durability | Zero loss on commit | Zero loss on commit | Zero loss + audit | Quorum writes, backups, PITR |
| Consistency | Eventual on read, strong on write | Strong | Linearizable + serializable | Single-leader vs multi-leader |
| Scale (DAU + peak RPS) | 100M / 20K | 100K / 500 | 10M / 5K | Shard count, cache tier |
| Security / abuse | Rate limit + captcha | SSO + audit | PCI-DSS + fraud | Edge stack, logging, retention |
- Numbers are starters for conversation; adjust on prompt cues.
- Payments systems often need availability higher than consumer social — downtime costs real money.
How interviewers grade this
- You list 3–5 must-have functional requirements and explicitly mark 2–3 as out-of-scope or "nice to have".
- You name the non-functional requirements that will drive design (latency target, availability target, durability, scale) — not generic "fast and reliable".
- You ask clarifying questions before assuming — "who are the users, what is the scale, is this read-heavy?" — but only 1–2, then state assumptions for the rest.
- You distinguish read-path requirements from write-path requirements because they often have different NFRs.
- You tie each NFR to a number ("p99 < 100 ms", "99.99% availability", not "fast" and "highly available").
- You bundle assumptions and invite correction once, instead of asking questions serially.
Variants
MoSCoW (must/should/could/won't)
Four explicit priority buckets; most compact way to state product judgment.
Structure:
- Must-have: the three functional musts without which the system is not v1
- Should-have: important but not blocking; acknowledge and park
- Could-have: nice-to-have; mention once, move on
- Won't-have: explicit out-of-scope; tells the interviewer you chose
Why it works in interviews: it takes 60–90 seconds to say, the interviewer can follow it without you drawing anything, and it produces an obvious agenda for the rest of the interview. Must-haves become the features you design; should-haves become the "if time permits" deep-dive candidates.
Example for a chat app:
- Must: send message, receive message, conversation list
- Should: typing indicators, read receipts
- Could: voice messages, reactions
- Won't: end-to-end encryption, video calls, group admin roles
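When practising, it can help to write the four buckets down as data and check their shape. The following is a minimal sketch: the `MoscowScope` class, its `problems` method, and the three thresholds are hypothetical, chosen to mirror the "three musts" and "2-3 out-of-scope" guidance in this guide rather than any standard.

```python
from dataclasses import dataclass, field


@dataclass
class MoscowScope:
    """Hypothetical container for one MoSCoW scoping pass."""

    must: list[str]
    should: list[str] = field(default_factory=list)
    could: list[str] = field(default_factory=list)
    wont: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return rule-of-thumb violations; an empty list means the scope is usable."""
        issues = []
        # Assumed thresholds, taken from the "three musts" and "2-3 out-of-scope" guidance.
        if not 1 <= len(self.must) <= 3:
            issues.append("keep the must-have list to three items or fewer")
        if len(self.wont) < 2:
            issues.append("name at least two explicit won't-haves")
        if len(self.should) > len(self.must):
            issues.append("should-have list is crowding out the musts")
        return issues


chat = MoscowScope(
    must=["send message", "receive message", "conversation list"],
    should=["typing indicators", "read receipts"],
    could=["voice messages", "reactions"],
    wont=["end-to-end encryption", "video calls", "group admin roles"],
)
print(chat.problems())  # an empty list: the chat example passes all three checks
```

A scope with seven musts and no won't-haves would come back with two complaints, which is the sprawl this variant is meant to catch.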
Pros
- Compact, universally understood
- Shows priority-setting as a skill
- Sets up a natural deep-dive agenda
Cons
- Easy to over-populate "should-have" and lose the three-musts focus
Choose this variant when
- The prompt has obvious sprawl (multiple sub-systems in one product)
- You need a compact structure to impose on an open question
NFR elicitation (5-number checklist)
Walk the five NFRs (latency, availability, durability, consistency, scale) and put a number on each.
The five NFRs, with starter numbers:
- Latency — read p99 target (100–500 ms), write p99 target (200 ms–2 s)
- Availability: 99.9% as the default; 99.99% (four nines) if on the critical user path, which is already a stretch for most non-payments systems
- Durability — almost always "zero data loss on commit"; some systems (metrics, analytics) can tolerate 0.1% loss
- Consistency — strong vs eventual; read-your-write vs lax; name the guarantee
- Scale — DAU, peak QPS, total storage over retention window
Do not say "fast, reliable, scalable." Say "p99 under 200 ms, 99.99% available, linearizable on the write path, 100M DAU peaking at 20K RPS."
The hack: run this checklist in under 90 seconds. Walking through five lines keeps you from forgetting one, and any NFR you skip is a question the interviewer will ask later.
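The checklist can be rehearsed mechanically. The sketch below is one possible self-check, not a standard tool: `weak_nfrs`, the `VAGUE` word list, and the rule that latency, availability, and scale need a digit while durability and consistency need a named guarantee are all assumptions made for illustration.

```python
CHECKLIST = ("latency", "availability", "durability", "consistency", "scale")
NEEDS_NUMBER = {"latency", "availability", "scale"}
# Assumed list of adjectives that say nothing on their own.
VAGUE = {"fast", "reliable", "scalable", "robust", "highly available"}


def weak_nfrs(stated: dict[str, str]) -> list[str]:
    """Return checklist items that are missing, vague, or lack a number."""
    weak = []
    for name in CHECKLIST:
        value = stated.get(name, "").strip().lower()
        if not value or value in VAGUE:
            weak.append(name)
        elif name in NEEDS_NUMBER and not any(ch.isdigit() for ch in value):
            weak.append(name)
    return weak


strong = {
    "latency": "p99 under 200 ms",
    "availability": "99.99%",
    "durability": "zero data loss on commit",
    "consistency": "linearizable on the write path",
    "scale": "100M DAU, 20K peak RPS",
}
vague = {"latency": "fast", "availability": "reliable", "scale": "scalable"}

print(weak_nfrs(strong))  # nothing flagged
print(weak_nfrs(vague))   # all five flagged: three vague, two never mentioned
```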
Pros
- Catches NFRs you would forget otherwise
- Forces numeric precision on every line
- Sets up the capacity pass cleanly
Cons
- Can feel robotic if delivered without context
Choose this variant when
- You tend to forget one or two NFRs under pressure
- The prompt is light on explicit NFR hints
Assumption-stating (bundled)
State multiple assumptions at once, invite challenge once.
Instead of asking "is it consumer or enterprise? what is the DAU? what is the read:write ratio?" (three round-trips), bundle them:
"I am going to assume consumer, 100M DAU, read-heavy by 100:1, global but deployed single-region on day one. Push back on any of that if it is off."
Why it works: the interviewer can correct any line in 10 seconds, and the other lines stay valid. You have shown that you know which knobs matter without using three minutes of setup time.
Watch out for: stating an assumption that contradicts an explicit prompt. Re-read the prompt before stating assumptions.
Pros
- Fast; compresses three minutes of Q&A into 30 seconds
- Shows which knobs you know matter
- Keeps momentum: you can start designing immediately
Cons
- If interviewer corrects several lines, you lose a minute rebasing
Choose this variant when
- Prompt is open-ended with multiple obvious unknowns
- Time pressure (round is < 45 min)
Scope-cutting under time pressure
Mid-interview, the interviewer adds a feature; cut something else.
When the interviewer pulls a scope expansion — "what about notifications?" or "what if we also support group chat?" — you have three options:
1. Absorb into existing design if it fits naturally. ("Notifications run off the existing event bus; here is the topic shape.")
2. Acknowledge and park if it does not fit. ("I can address group chat at the HLD level but we will not have time to do the deep dive.")
3. Swap for something else if the new ask is more important than an existing must-have. ("Let me swap this with click analytics. Which do you care about more?")
The move senior engineers make: never just say "sure, I'll add that" without acknowledging the time cost. Protecting your scope is as important as defining it.
Pros
- Keeps the design deliverable in the time remaining
- Shows product judgment under pressure
Cons
- Can feel confrontational if delivered without grace
Choose this variant when
- Interviewer expands scope mid-design
- You are 25+ minutes in and a new ask would derail the deep dive
Read-path vs write-path NFR split
Name separate latency, consistency, and availability targets for reads and writes.
Most systems have wildly different NFRs on the read and write paths.
- URL shortener: redirect latency < 100 ms (user waits), create latency < 500 ms (user clicks a button, waits).
- Payments: read latency < 300 ms, write latency < 1 s but durable and linearizable.
- Social feed: read latency < 200 ms (eventual consistency OK), write latency < 500 ms (read-your-writes for the author).
How to state it: "The read path and write path have different requirements. Reads need X with Y consistency; writes need Z with W consistency."
This framing pays dividends in the design — read path gets cached aggressively with eventual consistency; write path goes through the primary with strict guarantees. Interviewers love it because it signals you will not over-engineer both paths.
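The "how to state it" template can be filled in from two small records, one per path. This is only an illustrative sketch: `PathTargets` and `state_split` are hypothetical names, and the numbers are the URL shortener targets used elsewhere in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathTargets:
    """Hypothetical per-path NFR record: latency target and consistency guarantee."""

    p99_ms: int
    consistency: str


def state_split(read: PathTargets, write: PathTargets) -> str:
    """Fill the spoken template with one path's targets on each side."""
    return (
        "The read path and write path have different requirements. "
        f"Reads need p99 under {read.p99_ms} ms with {read.consistency} consistency; "
        f"writes need p99 under {write.p99_ms} ms with {write.consistency} consistency."
    )


redirect = PathTargets(p99_ms=100, consistency="eventual")
create = PathTargets(p99_ms=500, consistency="strong")
print(state_split(redirect, create))
```

Keeping the two records separate makes it harder to copy one path's target onto the other by accident.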
Pros
- Prevents over-engineering one path
- Sets up the read/write-path split that the architecture will follow
Cons
- Adds 30 seconds to the scoping pass; worth it
Choose this variant when
- System has obvious read-path and write-path asymmetry
- Read:write ratio is > 10:1 or < 1:10
Worked example
Prompt: "Design a URL shortener."
Your 3-minute scoping (spoken out loud):
MoSCoW
Must-have (in scope):
- Create short URL from long URL
- Redirect short → long via HTTP 302
- Click analytics (count, referrer, geo)
Should-have (acknowledge, park):
- Link expiration / TTL
- User account management
Could-have (mention once, move on):
- Custom slugs
- QR code generation
Will-not-have (explicit out of scope):
- Custom branded domains
- User-facing analytics dashboard
- Team / organisation accounts
Non-functional requirements with numbers:
- Redirect latency: p99 < 100 ms — on the critical user path
- Create latency: p99 < 500 ms — user is clicking a button
- Availability: 99.99% for redirects, 99.9% for creates
- Durability: zero data loss on committed writes — quorum required
- Consistency: eventual on read, strong on write
- Scale: 100M new URLs/month, 100:1 read:write, ~20K peak redirect RPS
- Security: short codes must not be enumerable
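The scale line packs three numbers together, and it is worth being able to show how they relate. The sketch below works through the arithmetic under two assumptions that are not in the prompt: a 30-day month, and a hypothetical helper named `redirect_load`. It suggests that ~20K peak redirect RPS corresponds to roughly five times the average rate.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def redirect_load(new_urls_per_month: float, reads_per_write: float, peak_rps: float) -> dict:
    """Derive average create/redirect rates and the peak factor the stated peak implies."""
    avg_create_rps = new_urls_per_month / SECONDS_PER_MONTH
    avg_redirect_rps = avg_create_rps * reads_per_write
    return {
        "avg_create_rps": avg_create_rps,
        "avg_redirect_rps": avg_redirect_rps,
        "implied_peak_factor": peak_rps / avg_redirect_rps,
    }


load = redirect_load(new_urls_per_month=100_000_000, reads_per_write=100, peak_rps=20_000)
for name, value in load.items():
    print(f"{name}: {value:,.1f}")
```

If the interviewer pushes on the 20K figure, the defensible answer is the peak factor: average redirects are a little under 4K per second, and the peak number is a multiple chosen on top of that.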
Read-path vs write-path split:
- Read path (redirect): tight latency, cache-heavy, eventual consistency, 99.99% availability
- Write path (create): moderate latency, durable, strong consistency, 99.9% availability
Clarifying questions (2 max):
- "Consumer or internal?" - affects auth and abuse surface
- "Global on day one, or single region?" - affects cost and complexity
Bundled assumptions (state once):
- "I am going to assume consumer, global but single-region on day one, links are publicly accessible (no per-link auth on redirect), 302 is fine. Push back if any of that is off."
End of scoping: 3 minutes. Every downstream decision will be measured against these NFRs. The interviewer now has a chance to correct any line before we invest in architecture.
Good vs bad answer
Interviewer probe
“What are the requirements?”
Weak answer
"We need to shorten URLs, redirect them, maybe track clicks, support custom URLs, analytics dashboards, team accounts, API keys, and it should be fast and scalable."
Strong answer
"Three functional musts: create, redirect, click analytics. Link expiry and user accounts are should-haves — I will park those. Custom domains, team accounts, and API keys are out of scope.
NFRs with numbers: redirect p99 < 100 ms, create p99 < 500 ms, 99.99% availability on the redirect path, 99.9% on create, durability of created URLs must be bulletproof. Consistency: eventual on read, strong on write. Scale: 100M creates/month, 100:1 reads:writes so ~20K peak redirect RPS.
Bundled assumptions: consumer-facing, single-region to start, 302 redirects. Does that match what you have in mind?"
Why it wins: Three musts, explicit should-haves and out-of-scope items, numeric NFRs split by read and write path, bundled assumptions, and a single invitation to correct. Every downstream decision now has a measuring stick.
When it comes up
- Always — the first 3–5 minutes of every system design round
- Before you draw any architecture box
- When the interviewer says "how would you design X?" without any numbers
- When the interviewer expands scope mid-interview ("what if we also support...")
Order of reveal
1. Acknowledge the prompt. "Before I start designing, let me scope this. I want to make sure we are designing the same system."
2. List functional musts (three). "Three must-haves: X, Y, Z. These are what define v1 of the system."
3. List explicit out-of-scope. "Out of scope: A, B, C. These are real product concerns but not part of this design."
4. Walk the 5 NFRs with numbers. "Latency target is X; availability is Y; durability is Z; consistency is W on reads and V on writes; scale is N DAU with M peak RPS."
5. Split read vs write path. "Reads are cache-friendly with eventual consistency; writes need strict durability. That asymmetry will drive the architecture."
6. Bundle assumptions. "I am going to assume consumer, single-region, [specific defaults]. Push back on any of that."
7. Invite correction. "Does this match what you have in mind, or should I adjust anything before I start designing?"
Signature phrases
- “Let me scope this before I draw anything” — Signals discipline; buys three minutes without looking slow.
- “Three must-haves, everything else is out of scope” — Imposes a priority structure the interviewer can follow.
- “Every NFR has a number” — Demonstrates the "design is measured" mindset.
- “Reads and writes have different NFRs” — Shows you avoid the over-engineer-both-paths trap.
- “I am going to assume X; push back if that is off” — Invites correction without stalling.
- “Out of scope is a feature”: product judgment in a six-word frame.
Likely follow-ups
“Why did you cut custom domains from scope?”
Two reasons:
1. Feature depth: custom domains involve DNS management, SSL provisioning, and billing complexity. Doing them right is a month of engineering, not a design deep-dive.
2. Orthogonal to core shape: the core system (create + redirect + analytics) does not depend on custom domains. Adding them later is a CDN config change and an ownership-model addition, not a redesign.
If the interviewer specifically wants custom domains, I would swap them in for analytics. But by default, I optimize for designing the core deeply over covering every feature shallowly.
“You said 99.99% availability. Defend that number.”
99.99% = ~52 minutes of downtime per year, ~4.3 minutes per month. For a URL shortener, broken redirects break every shared link — a 1-hour outage means every link shared on social media for that hour is dead. That is a brand-damaging event.
At 99.9% we get ~8.7 hours of downtime per year — unacceptable for something on the public web. At 99.99% we need multi-AZ deployment, graceful degradation on cache failure, and an error budget gating risky deploys. At 99.999% (five nines) we would need multi-region active-active, which adds cost and complexity not justified for a social share tool.
99.99% is the right tier for this product. If the interviewer said "this is for internal use only," I would drop to 99.9%.
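The downtime figures in this answer follow from simple arithmetic, and being able to reproduce them on the spot is part of defending the number. A minimal sketch, assuming a 365-day year and a hypothetical helper name:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year, in minutes, for an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    per_year = downtime_minutes_per_year(target)
    print(f"{target}%: {per_year:.1f} min/year, {per_year / 12:.2f} min/month")
```

Each extra nine divides the budget by ten, which is why the jump from 99.99% to 99.999% leaves only about five minutes a year.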
“What if I told you we actually need 1B DAU, not 100M?”
That changes three things:
1. Capacity: peak redirect RPS goes from 20K to 200K, and at 100:1 peak creates go from roughly 200/sec to 2K/sec. Together with 10× the stored data, that is more than I would trust to a single Postgres primary, so we would move to sharded storage (Cassandra or sharded Postgres).
2. Cache: working set grows 10×. Hot set is now ~600 GB instead of 60 GB; we need a proper Redis cluster.
3. Geographic: 1B DAU is globally distributed; a single region adds 300+ ms for some users. Multi-region is now required.
Let me rebase: I will re-state the assumptions, revisit the capacity pass, and flag which architecture choices change. The functional scope stays the same; the NFRs tighten; the architecture grows a region and a shard dimension.
“Can we add group chat to this messaging system?”
Let me think about what that costs. Group chat changes three things in the existing design:
- Fan-out: 1-to-N delivery instead of 1-to-1; group size bounds (10? 100? 1000?) change the write amplification significantly.
- Read model: members list, read receipts, typing indicators all become per-member-per-group.
- Permissions: who can add/remove, who can read history, admin roles.
Options:
1. Absorb: if we assume small groups (< 50 members), fan-out is bounded; we can add it as a variant of the DM delivery path. Adds ~5 minutes of design.
2. Swap: drop something else from must-haves. Which would you prefer to cut?
3. Park: mark it as HLD-level with "here is how it would plug in" but do not deep-dive.
Which do you want?
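The fan-out point in this answer can be made concrete with a quick calculation. The sketch below assumes a hypothetical incoming rate of 1,000 messages per second (not a figure from the prompt) and assumes every message is delivered to each member except the sender.

```python
def deliveries_per_second(messages_per_second: float, group_size: int) -> float:
    """Deliveries generated when each message goes to every member except the sender."""
    return messages_per_second * (group_size - 1)


ASSUMED_MESSAGE_RATE = 1_000  # hypothetical incoming messages per second

for size in (2, 10, 100, 1000):  # size 2 is a 1-to-1 conversation
    rate = deliveries_per_second(ASSUMED_MESSAGE_RATE, size)
    print(f"group size {size}: {rate:,.0f} deliveries/sec")
```

Under these assumptions a 1,000-member group multiplies delivery work by about a thousand compared with a DM, which is why the group size bound is worth asking about before choosing absorb, swap, or park.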
Common mistakes
Asking too many clarifying questions
Drains time and looks hesitant. Ask at most 2, state the rest as assumptions. Assumptions the interviewer disagrees with are a cheap round-trip to correct; question-asking theatre is not.
Adjectives instead of numbers
"Fast", "scalable", "robust": you have said nothing. Every NFR has a number. Missing numbers is the most common reason interviews drift.
One set of NFRs for both paths
Read-path and write-path often have dramatically different NFRs (latency target, consistency tolerance, cache tolerance). Not splitting them forces you to over-engineer one or under-engineer the other.
Assuming maximum scale by default
Jumping to "1B DAU" forces an over-engineered design that reads as "doesn't know when to simplify". Anchor to the prompt; state your number; let the interviewer nudge it up.
No out-of-scope list
A scope without an out-of-scope list is aspirational, not designable. Interviewers read "I will design everything" as "I will design nothing deeply". Name three things you are not doing.
Absorbing scope creep silently
Interviewer says "what about group chat?"; you just add it. Now your 3-feature design is a 4-feature design and the time budget no longer works. Name the cost: "adding this pushes out the deep-dive; should I swap, absorb, or park?"
Practice drills
Interviewer: "Let's design Twitter." You've got 3 minutes for scoping. Go.
MoSCoW
- Must: (1) post a tweet, (2) home timeline (follow-based feed), (3) user profile page
- Should: notifications, replies
- Could: trending, search
- Won't: DMs, ads, media upload, advertiser dashboards
NFRs with numbers:
- Feed read p99 < 200 ms (consumer social)
- Tweet write p99 < 500 ms, durable on commit
- Availability: 99.99% on read, 99.9% on write
- Consistency: eventual on feed read, strong on tweet write (read-your-write for the author)
- Scale: 300M DAU, avg user reads 50 tweets/day → ~175k read RPS avg, ~700k peak at 4× (consumer social)
Read/write split: read-heavy by 100:1. That shapes the fan-out decision (fan-out on write vs read).
Assumptions: "Single-region first; global is a follow-up. Tweets are public by default. Push back if any of that is off."
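The read RPS figures in this drill come from a DAU-based estimate rather than a per-month one. A minimal sketch of that arithmetic, where `read_rps` is a hypothetical helper and the 4× peak factor is the one stated in the drill:

```python
SECONDS_PER_DAY = 86_400


def read_rps(dau: float, reads_per_user_per_day: float, peak_factor: float) -> tuple[float, float]:
    """Return (average, peak) read requests per second from daily usage."""
    average = dau * reads_per_user_per_day / SECONDS_PER_DAY
    return average, average * peak_factor


average, peak = read_rps(dau=300_000_000, reads_per_user_per_day=50, peak_factor=4)
print(f"average: {average:,.0f} RPS, peak: {peak:,.0f} RPS")
```

The result lands within about 1% of the rounded ~175k and ~700k quoted in the drill, which is as precise as a scoping pass needs to be.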
You state "99.99% availability". Interviewer asks: "What does that mean exactly?"
99.99% = ~52 minutes of downtime per year, or ~4.3 minutes per month. "Available" means the read path returns a successful response within the latency SLO.
That's a budget, not a promise. Every deploy, bad config push, or regional outage burns part of that budget. To actually achieve it you need:
- Multi-AZ deployment (one AZ failure = still available)
- Graceful degradation (read path stays up even if write path is down)
- An error budget that gates risky changes — if you are over budget, the next risky deploy waits
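The error-budget gate in the last bullet can be sketched as a small check. This is a simplified illustration, not a production policy: the function names, the 30-day window, and the rule "allow risky deploys only while some budget remains" are assumptions.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Total downtime allowed in the window for a given availability SLO."""
    return window_days * 24 * 60 * (1 - slo_pct / 100)


def risky_deploy_allowed(slo_pct: float, downtime_so_far_min: float, window_days: int = 30) -> bool:
    """Assumed policy: risky deploys proceed only while budget remains in the window."""
    remaining = error_budget_minutes(slo_pct, window_days) - downtime_so_far_min
    return remaining > 0


budget = error_budget_minutes(99.99)
print(f"30-day budget at 99.99%: {budget:.2f} minutes")
print(risky_deploy_allowed(99.99, downtime_so_far_min=1.5))  # budget remains
print(risky_deploy_allowed(99.99, downtime_so_far_min=5.0))  # budget exhausted
```

Real policies usually measure failed requests rather than wall-clock downtime, but the gating logic has the same shape.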
Interviewer expands scope: "Oh, we also need multi-region replication." You are 25 minutes in. What do you do?
Name the cost before agreeing. "Multi-region changes three things: latency on the write path (async vs sync replication), consistency model (single-leader vs multi-leader), and operational complexity (conflict resolution, failover).
I have two options:
1. HLD-level only: I can show how the current design extends to multi-region at the architecture level; we will not have time for a deep dive on conflict resolution.
2. Swap the deep-dive: I can drop the [current deep-dive target] and replace it with the multi-region design, which gets us a real deep-dive on the hardest part.
Which do you prefer?"
Never just agree to expand scope without showing you know what you're giving up.
How do you decide the right availability target: 99.9%, 99.99%, or 99.999%?
Work from the cost of downtime.
- 99.9% (~9 hours/year): internal tools, dev dashboards, non-blocking features. Cheap to operate, not worth more.
- 99.99% (~52 min/year): consumer products where downtime damages brand (social, shopping, streaming). Needs multi-AZ, good observability, graceful degradation.
- 99.999% (~5 min/year): payments, critical infrastructure, auth services that gate everything else. Needs multi-region active-active, extensive testing, organizational maturity.
The test: "what does a 1-hour outage cost us in reputation, revenue, or safety?" If the answer is "annoying", 99.9% is fine. If "brand damaging", 99.99%. If "existential", 99.999%.
The prompt says "design a chat app." What are three clarifying questions worth asking?
Three candidates whose answers most change the design:
1. 1-to-1 only or group chat? Group chat changes fan-out, permissions, and storage shape dramatically.
2. Consumer or enterprise? This affects auth (SSO vs phone), abuse surface (spam vs insider threat), and retention (product chat vs compliance logs).
3. Global or regional? This affects the latency budget and whether cross-region replication is a must.
Pick two, state the third as an assumption. Do not ask about programming language, cloud provider, team size, or budget; those do not change the design.
Cheat sheet
- Spend 3–5 minutes on scoping. Not 1, not 10.
- MoSCoW: must / should / could / won't. Three musts, three explicit won'ts.
- Every NFR has a number. "99.99%", "p99 < 100 ms", "3 TB over 5 years".
- Walk the 5-NFR checklist: latency, availability, durability, consistency, scale.
- Read-path and write-path NFRs often differ. Split them if they do.
- Ask 1–2 clarifying questions max. Bundle everything else as stated assumptions.
- "Out of scope" is a feature; it shows you prioritised.
- Invite correction once: "push back on any of that."
- Scope creep mid-interview: name the cost, offer absorb / swap / park.
- If you can't defend an NFR with one sentence, drop the number.