High-level architecture
Component boundaries, request flow, sync vs async paths.
The HLD is the diagram every subsequent question is asked against. Clear boundaries + explicit dataflow beats clever components every time. Most candidates over-draw; seniors underdraw and label.
Read this if any of these describe your last attempt:
- Your whiteboard ended up as a mess of boxes with no visible request path
- You got the feedback "the design was hard to follow"
- You jumped to components (Kafka! Redis!) before drawing the request shape
- You couldn't answer "walk me through a single request end-to-end"
- You drew 15 boxes because you thought more = senior
The concept
A high-level architecture diagram has exactly one job: make the request flow legible. If someone can look at your drawing and narrate "the client calls X, which does Y, which persists in Z", the HLD is working. Every decoration that doesn't serve that goal is noise.
The signal a strong HLD sends is not "I know the names of many technologies." It's "I can explain what happens in this system from the moment a user taps a button to the moment their screen updates."
Every HLD is a version of this: client → edge → app tier → data tier, with async side-channels for anything that doesn't have to block the response.
What to draw vs what to leave as an aside.
| Category | Always draw | Sometimes draw | Mention verbally |
|---|---|---|---|
| Entry | LB / edge | WAF / DDoS if asked | Cert termination |
| Compute | Stateless app tier | Service breakdown if >3 services | Framework / language |
| Primary data store | Yes | Read replicas if read-heavy | Replication lag budget |
| Cache | If on hot path | Multi-tier (edge + app) if relevant | Eviction policy |
| Queue / stream | If async work exists | DLQ / retry topology | Visibility timeout tuning |
| Search index | If search matters | CDC pipeline to it | Rebuild procedure |
| Object storage | If media is involved | Lifecycle rules (Glacier) | CDN on top |
| Auth | No — off to the side | Login flow arrow | JWT / session tactics |
| Observability | No | Only if prompt demands | Metrics / logs / traces |
How interviewers grade this
- You draw fewer than 10 boxes on the main diagram. More than that and the interviewer has stopped following.
- Every edge has a verb (SELECT, enqueue, publish, stream) — not just an arrow.
- You label the protocol on at least one edge (HTTP, gRPC, AMQP, WebSocket).
- Async paths are visually distinct from sync paths (dashed, via a queue).
- Stateful components (DB, cache, queue) are drawn with replicas or clusters, not as single boxes.
- You name the bottleneck before the interviewer asks where it is.
- Read and write paths are split when they genuinely diverge, not preemptively.
- You name what you are NOT drawing — control plane, auth, observability as verbal asides.
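Several of these criteria are mechanical enough to check before you stop drawing. The sketch below models a diagram as boxes and labelled edges and flags three of them: box count, a verb on every edge, and at least one named protocol. The `Edge`, `Diagram`, and `lintDiagram` names are hypothetical, invented here for illustration.

```typescript
// Hypothetical model of a whiteboard diagram; names are illustrative only.
type Edge = {
  from: string;
  to: string;
  verb: string; // e.g. "SELECT", "publish"
  protocol?: string; // e.g. "HTTPS", "gRPC"
  mode: "sync" | "async"; // async edges would be drawn dashed
};

type Diagram = { boxes: string[]; edges: Edge[] };

// Returns a list of rubric violations; an empty list means the sketch passes.
function lintDiagram(d: Diagram): string[] {
  const problems: string[] = [];
  if (d.boxes.length >= 10) {
    problems.push(`too many boxes: ${d.boxes.length}`);
  }
  for (const e of d.edges) {
    if (e.verb.trim() === "") {
      problems.push(`edge ${e.from} -> ${e.to} has no verb`);
    }
  }
  if (!d.edges.some((e) => e.protocol !== undefined)) {
    problems.push("no edge names a protocol");
  }
  return problems;
}

// A five-box URL shortener sketch that satisfies all three checks.
const urlShortener: Diagram = {
  boxes: ["client", "lb", "app", "redis", "postgres"],
  edges: [
    { from: "client", to: "lb", verb: "GET /:code", protocol: "HTTPS", mode: "sync" },
    { from: "lb", to: "app", verb: "route", mode: "sync" },
    { from: "app", to: "redis", verb: "GET", mode: "sync" },
    { from: "app", to: "postgres", verb: "SELECT", mode: "sync" },
  ],
};

const lintResult = lintDiagram(urlShortener);
```

The remaining criteria (naming the bottleneck, splitting read and write paths only when they diverge) are judgment calls and are deliberately left out of the check.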
Variants
Monolith + read replicas
One app service, one primary DB, 2–3 read replicas, optional cache.
The honest answer for most prompts under ~10K QPS. Boring, correct, and always the baseline against which fancier designs get compared.
The shape:
- LB → N app replicas (stateless) → primary DB
- Read replicas for scaling reads (lag < 1 s)
- Redis in front for hot reads
- Single deploy unit; single on-call rotation
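The routing rule implied by this shape is small: writes always go to the primary, reads rotate across replicas. A minimal sketch, assuming in-memory stand-ins for the database endpoints (`Endpoint` and `ReadWriteRouter` are hypothetical names, not part of any driver):

```typescript
// Hypothetical stand-in for a database connection target.
type Endpoint = { name: string };

class ReadWriteRouter {
  private next = 0;
  private readonly primary: Endpoint;
  private readonly replicas: Endpoint[];

  constructor(primary: Endpoint, replicas: Endpoint[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  // All writes go to the single primary.
  forWrite(): Endpoint {
    return this.primary;
  }

  // Reads rotate across replicas; fall back to the primary if there are none.
  forRead(): Endpoint {
    if (this.replicas.length === 0) return this.primary;
    const chosen = this.replicas[this.next % this.replicas.length] ?? this.primary;
    this.next += 1;
    return chosen;
  }
}

const router = new ReadWriteRouter({ name: "primary" }, [
  { name: "replica-1" },
  { name: "replica-2" },
]);
const readTargets = [router.forRead().name, router.forRead().name, router.forRead().name];
const writeTarget = router.forWrite().name;
```

Round-robin ignores replica lag. A request that must read its own write would need to be pinned to the primary, which this sketch does not model.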
Why it wins for 80% of prompts:
- Cheapest to build and operate
- No distributed transactions, no eventual consistency
- Failure modes are few and understood
- Easy to reason about latency and capacity
When it breaks:
- Write QPS exceeds what a single primary handles (~20k sustained)
- Teams exceed ~30 engineers sharing one codebase
- Workloads diverge — one path is 100× the QPS of another
Pros
- Fastest to reason about
- One deploy unit, one DB, one team
- Strong consistency by default
- Ops cost is minimal
Cons
- Single deploy blast radius
- Vertical scaling ceiling on the DB
- Hard to scale teams past ~30 engineers
Choose this variant when
- < 10K QPS combined
- Single domain / tight coupling
- Small team (< 30 engineers)
- Most system design prompts as the v1
Domain-split services
Services drawn along domain boundaries (feed, profile, messaging) with their own stores.
Each service owns its data. Cross-service reads/writes go over RPC or async events. Boundaries follow domains, not technologies.
The right answer when the prompt has clearly separable workloads (e.g., feed reads and messaging writes have nothing in common).
The shape:
- LB → API gateway → N services (each with its own datastore)
- Services communicate via gRPC or async events
- Each service: own deploy, own dashboards, own on-call
Service boundaries that work:
- Feed service (read-heavy) separate from posting service (write-heavy)
- User service (identity, profile) separate from everything else
- Payment service (compliance) separate from product service
- Media service (CPU/bandwidth) separate from API service (low-resource)
Traps:
- Splitting too early (< 30 engineers, single domain)
- Splitting along technology rather than domain (a "Redis service" is not a domain)
- No answer for cross-service transactions (reach for saga / outbox)
Pros
- Independent scaling per workload
- Team ownership maps to service boundary
- Blast radius bounded by service
- Technology diversity where it matters
Cons
- Cross-service transactions are a design problem (sagas, outbox)
- More moving parts in ops (deploys, dashboards, on-call)
- Easy to over-split too early
- Distributed tracing becomes mandatory
Choose this variant when
- Clearly separable domains
- > 30 engineers
- Different scaling profiles per workload
- Different technology needs per domain
CQRS / separate read path
Writes go to the system of record; a derived read model is materialised into a denormalised store.
Writes go to the source of truth (normalised). A projector materialises the read model into a denormalised store optimised for the access pattern.
Use when read load dwarfs writes and the read shape differs sharply from the write shape (e.g., timelines, feeds, search). The read path can be rebuilt from the write log at any time.
The shape:
- Write path: client → app → primary DB (normalised)
- Projector: CDC stream → read models (cache, search index, denormalised KV)
- Read path: client → app → read model (fast, pre-joined)
Example — Instagram feed:
- Write: user posts → Postgres (source of truth)
- Projector: CDC → fan out to follower inbox in Redis
- Read: user loads feed → single Redis LRANGE per user
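The three paths in this example can be sketched with in-memory stand-ins: an array for the Postgres table, and a map of per-user lists for the Redis inboxes. Everything here is hypothetical scaffolding (`writePost`, `project`, `readFeed`, `rebuild`), and the projector is called directly rather than driven by a real CDC stream.

```typescript
type Post = { id: number; authorId: string; body: string };

// Write side: normalised source of truth (stand-in for Postgres).
const posts: Post[] = [];
// Hypothetical follower graph: author -> followers.
const followers = new Map<string, string[]>([["alice", ["bob", "carol"]]]);
// Read side: one pre-built inbox per user (stand-in for a Redis list).
const inboxes = new Map<string, number[]>();

// Write path: append to the source of truth only.
function writePost(post: Post): void {
  posts.push(post);
}

// Projector: takes one change event and fans it out to follower inboxes.
function project(post: Post): void {
  for (const follower of followers.get(post.authorId) ?? []) {
    const inbox = inboxes.get(follower) ?? [];
    inbox.unshift(post.id); // newest first, similar to LPUSH
    inboxes.set(follower, inbox);
  }
}

// Read path: a single range read with no joins, similar to LRANGE.
function readFeed(userId: string, limit: number): number[] {
  return (inboxes.get(userId) ?? []).slice(0, limit);
}

// Rebuild: drop the read model and replay the write log in order.
function rebuild(): void {
  inboxes.clear();
  for (const post of posts) project(post);
}

const first: Post = { id: 1, authorId: "alice", body: "hello" };
const second: Post = { id: 2, authorId: "alice", body: "again" };
for (const p of [first, second]) {
  writePost(p);
  project(p);
}
const bobFeed = readFeed("bob", 10); // newest post id first
```

Because `rebuild` only needs the `posts` log, the read model can be thrown away and regenerated, which is the property that makes new read shapes cheap to add.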
When NOT CQRS:
- Read:write ratio < 10:1 — operational cost not worth it
- Small team — the projector pipeline is real ongoing work
- Strict read-your-writes consistency — eventual consistency between sides breaks UX
Pros
- Read and write paths scale independently
- Read model is tailored to the access pattern (fast)
- Rebuildable from the log — no schema-migration pain
- New read models are cheap to add (new projector)
Cons
- Operational cost of the projector pipeline is real
- Eventual consistency between write and read sides
- Overkill when read:write < ~10:1
- Requires streaming infrastructure
Choose this variant when
- Read:write > 10:1
- Read shape ≠ write shape
- Can tolerate seconds of staleness on reads
- Multiple read consumers want different shapes
Event-driven / async-first
Kafka (or equivalent) is the backbone; services publish events and consume independently.
Producers publish facts to Kafka once. Each consumer reads independently with its own offset. Add new consumers without touching producers.
When multiple services need to react to the same facts without tight coupling, an event log becomes the architecture's spine.
The shape:
- Sync path: client → app → primary DB (write) → publish event to Kafka
- Kafka retention: 7–30 days, partitioned by entity id
- Consumers: search indexer, analytics pipeline, notification service, audit logger — each with independent offsets
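Independent offsets are what let a new consumer join late, or an old one replay, without the producer noticing. The sketch below is a toy in-memory log, not a Kafka client; `EventLog`, `poll`, and `rewind` are hypothetical names chosen for illustration.

```typescript
type LogEvent = { type: string; entityId: string };

class EventLog {
  private readonly events: LogEvent[] = [];
  private readonly offsets = new Map<string, number>();

  // Producers append once and know nothing about consumers.
  publish(event: LogEvent): void {
    this.events.push(event);
  }

  // Each consumer reads from its own offset; polling advances only that offset.
  poll(consumer: string): LogEvent[] {
    const from = this.offsets.get(consumer) ?? 0;
    const batch = this.events.slice(from);
    this.offsets.set(consumer, this.events.length);
    return batch;
  }

  // Replay: rewind one consumer without affecting the others.
  rewind(consumer: string, offset: number): void {
    this.offsets.set(consumer, offset);
  }
}

const log = new EventLog();
log.publish({ type: "vote.created", entityId: "post-1" });
log.publish({ type: "post.created", entityId: "post-2" });

const indexerFirst = log.poll("search-indexer"); // both events
log.publish({ type: "vote.created", entityId: "post-2" });
const indexerSecond = log.poll("search-indexer"); // only the new event
const analyticsFirst = log.poll("analytics"); // a late consumer sees all three

log.rewind("search-indexer", 0);
const indexerReplay = log.poll("search-indexer"); // replays all three
```

Retention, partitioning, and consumer groups are left out; the point is only that offsets belong to consumers, not to the log.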
Why it wins:
- New consumers added without touching producers
- Replay from the log recovers from consumer bugs
- Decouples write latency from downstream work
- Audit trail is built-in (the log itself)
When it hurts:
- Small systems drown in ops cost (Kafka is not free)
- Eventual consistency is the default everywhere
- Debugging cross-service issues requires distributed tracing
- Event schema evolution needs real discipline
Pros
- Add consumers without touching producers
- Replay for recovery and new features
- Audit trail from the log
- Decoupled write latency
Cons
- Kafka ops / cost is real
- Eventual consistency everywhere
- Debugging needs distributed tracing
- Schema evolution requires discipline
Choose this variant when
- Many independent consumers of the same facts
- Audit/replay requirements
- Fan-out is a core pattern
- Team has streaming infra expertise
Edge-heavy / CDN-first
CDN/edge workers handle most traffic; origin servers only serve the uncacheable tail.
CDN absorbs cacheable reads at the edge. Edge workers personalise where needed. Origin only sees cache misses + writes — typically <5% of total traffic.
When the workload is dominated by reads of cacheable content (media sites, e-commerce catalogs, documentation, blogs, content APIs), serving from the edge can cut read latency from roughly 200 ms to 20 ms and reduce serving cost by about 10×.
The shape:
- Cloudflare / CloudFront / Fastly at the edge, cache aggressively
- Edge workers (Cloudflare Workers, Lambda@Edge) for personalisation
- Origin tier is small and serves only cache misses + writes
- Static assets on S3 / GCS with lifecycle rules
Signals to reach for it:
- Geographically distributed users
- Content is public or semi-public (cacheable without user-specific data)
- Latency budget < 100 ms p95 globally
- Origin QPS would be prohibitive without cache
Constraints:
- Cache invalidation is the eternal problem
- Personalised content complicates things — edge workers or hybrid cache keys
- Cost model is different (CDN egress vs origin compute)
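One reading of "hybrid cache keys" is to key on the few dimensions that change the response and never on the user id, so one cached copy serves everyone in the same bucket. The sketch below assumes locale and plan tier are the only varying dimensions; the `EdgeRequest` shape and `cacheKey` function are hypothetical, not any CDN's API.

```typescript
// Hypothetical request shape seen by an edge worker.
type EdgeRequest = {
  path: string;
  locale: string;
  userId?: string;
  plan?: "free" | "pro";
};

// Build the key from response-shaping dimensions only.
// The user id is deliberately excluded so users in the same bucket share a copy.
function cacheKey(req: EdgeRequest): string {
  const plan = req.plan ?? "anon";
  return `${req.path}|${req.locale}|${plan}`;
}

const keyForBob = cacheKey({ path: "/catalog", locale: "en", userId: "bob", plan: "pro" });
const keyForCarol = cacheKey({ path: "/catalog", locale: "en", userId: "carol", plan: "pro" });
const keyForGerman = cacheKey({ path: "/catalog", locale: "de", userId: "bob", plan: "pro" });
```

Every extra dimension multiplies the number of cached variants and lowers the hit rate, so the key should stay as coarse as the content allows.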
Pros
- Sub-50 ms latency globally
- 10× cost reduction on static-heavy workloads
- Origin tier is small and simple
- DDoS protection is native to CDN
Cons
- Cache invalidation is hard
- Personalised content needs edge workers or hybrid caching
- Debugging edge issues is harder (less observability)
- Lock-in to CDN provider features
Choose this variant when
- Media / content / e-commerce catalogs
- Geographically distributed users
- Read-heavy with cacheable content
- Latency budget < 100 ms p95 globally
Worked example
Prompt: Design a Reddit-style link-sharing app.
Step 1 — establish the request shape
- Dominant read path: GET /feed (personalised, per-subreddit, paginated). ~100k QPS at peak.
- Dominant write path: POST /submit (~500/s), POST /vote (~20k/s).
- Non-functional: p95 < 200 ms on feed, writes durable, 3-region deploy eventually.
Step 2 — draw the edge → compute → state skeleton
- Edge: Cloudflare CDN for static (images, CSS, JS) + ALB for dynamic
- Compute: stateless app tier behind ALB (auto-scaled, 20–100 instances)
- State:
  - Primary: Postgres (submissions, votes, comments), the ACID source of truth
  - Cache: Redis (feed cache, vote counts, hot comment threads)
  - Media: S3 for uploaded images/videos, CloudFront CDN in front
  - Search: Elasticsearch (projected from Postgres via CDC) for /search
Step 3 — add async side-channels
- Vote written → publish event to Kafka
  - Consumer 1: update denormalised score on submission row (async)
  - Consumer 2: update hot-feed ranking cache (async)
  - Consumer 3: analytics pipeline (async)
- Submit written → publish event
  - Consumer 1: index for search (async)
  - Consumer 2: notify subscribed users (async)
  - Consumer 3: analytics (async)
Step 4 — label every edge
- Client → Cloudflare: 10 ms (global anycast)
- Cloudflare → ALB: 20 ms (regional)
- ALB → app: 2 ms (intra-AZ)
- App → Redis: 1 ms (cache hit, 95% hit rate on feeds)
- App → Postgres: 5 ms (cache miss path)
- App → Kafka: 2 ms (fire-and-forget publish)
Total sync budget: 40 ms even if a request touched every edge above (about 33 ms on a cache hit), well inside the 200 ms p95 target.
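Summing the labelled edges per path is a quick check on the budget. The sketch below uses the figures listed in this step; splitting them into a cache-hit path and a cache-miss path is an interpretation added here, not something the worked example spells out.

```typescript
// Per-edge latencies in milliseconds, copied from the labelled edges.
const edgeMs = {
  clientToCdn: 10,
  cdnToAlb: 20,
  albToApp: 2,
  appToRedis: 1,
  appToPostgres: 5,
  appToKafka: 2,
};

const sumMs = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

// Cache-hit read: the request stops at Redis.
const hitPathMs = sumMs([
  edgeMs.clientToCdn,
  edgeMs.cdnToAlb,
  edgeMs.albToApp,
  edgeMs.appToRedis,
]); // 33

// Cache-miss read: Redis lookup first, then Postgres.
const missPathMs = hitPathMs + edgeMs.appToPostgres; // 38

// Every labelled edge added together, including the Kafka publish.
const allEdgesMs = missPathMs + edgeMs.appToKafka; // 40

const p95TargetMs = 200;
const headroomMs = p95TargetMs - allEdgesMs; // 160
```

The remaining headroom has to cover application time, serialisation, and tail latency, none of which appear as edges.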
Step 5 — read vs write path
- Read path: Client → CDN → ALB → App → Redis (hit) / else Postgres read replicas
- Write path: Client → ALB → App → Postgres primary → Kafka publish
- Separation lets reads scale horizontally via replicas; writes remain bounded by the primary
Step 6 — what I'm NOT drawing
- Auth service (users carry a signed JWT; app validates locally)
- Observability stack (Prometheus, Grafana, Jaeger, PagerDuty — all present, off diagram)
- Deploy pipeline, feature flags, secrets manager (control plane)
- Admin tools for moderation (separate low-QPS service)
Step 7 — named bottleneck
"The pressure point is the feed cache hit rate. At 95% we serve 100k QPS with 5k/s hitting Postgres read replicas, which is manageable. Drop to 85% and Postgres gets 15k/s reads, which starts fighting with vote writes. The fix would be either more Postgres read replicas (short term) or pre-computing hot subreddit feeds into Redis with a write-time fan-out (longer term). The next scaling move after that is sharding Postgres by subreddit, which we'd do at ~10M active users."
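The hit-rate arithmetic behind this bottleneck is one line: whatever the cache misses lands on the origin. A small sketch (the `originQps` helper is a hypothetical name) makes the sensitivity visible:

```typescript
// Reads that miss the cache land on the origin database.
function originQps(totalQps: number, hitRate: number): number {
  return Math.round(totalQps * (1 - hitRate));
}

const at95 = originQps(100_000, 0.95); // 5000
const at85 = originQps(100_000, 0.85); // 15000
const cacheDown = originQps(100_000, 0); // 100000
```

A ten-point drop in hit rate triples the origin load here, which is why the hit rate, and not raw QPS, is the number to watch.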
That is the whole HLD in three minutes with clear dataflow, named bottleneck, and a scaling path.
Good vs bad answer
Interviewer probe
“Can you draw the architecture for a URL shortener?”
Weak answer
"Sure — we'll have a load balancer, a cluster of app servers, Postgres, Redis, Kafka for async events, Elasticsearch for search, S3 for storage, Cassandra for high write throughput, and probably a GraphQL layer on top." (Proceeds to draw 15 boxes with arrows going everywhere.)
Strong answer
"Four boxes to start. Client → LB → app tier → Postgres. One box off to the side: Redis, sitting next to the app tier, for the hot-read path (GET /:code). That's it for v1.
Let me walk a request through: client hits /abc → LB routes to any app instance → app checks Redis → on miss, SELECT long_url FROM urls WHERE short_code = 'abc' from Postgres → caches the result → 302 redirect. Writes are the reverse: INSERT INTO urls in Postgres, no cache involvement, return the short code.
This design handles ~20K reads/sec with one Redis at a 95% hit rate, which leaves ~1K QPS of cache misses for Postgres. The pressure point is the Redis hit rate.
I'd only add boxes when we hit a specific limit:
- Multi-region: when latency SLO forces geo-distribution, Cloudflare Workers at the edge with Redis replicas per region.
- Kafka: only when we need async analytics or click-stream processing for separate consumers.
- Sharding: at 500M+ short codes when Postgres becomes a bottleneck.
Not in v1."
Why it wins: The strong answer has a visible request path, a stated cache hit rate, a concrete QPS number, a named pressure point, and explicit triggers for when to add complexity. The weak answer is a box buffet with no dataflow and no reasoning for why each piece exists.
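The request walk in the strong answer is a plain cache-aside read. A minimal sketch with in-memory maps standing in for Redis and the Postgres `urls` table; the `shorten` and `resolve` names, the `originReads` counter, and the 404 on an unknown code are assumptions added for illustration.

```typescript
// In-memory stand-ins: hotCache plays Redis, urlTable plays the Postgres table.
const hotCache = new Map<string, string>();
const urlTable = new Map<string, string>();
let originReads = 0; // counts reads that reach the table

// Write path: insert into the table only; the cache is not touched.
function shorten(code: string, longUrl: string): void {
  urlTable.set(code, longUrl);
}

// Read path: cache first, table on miss, then populate the cache.
function resolve(code: string): { status: 302 | 404; location?: string } {
  const cached = hotCache.get(code);
  if (cached !== undefined) return { status: 302, location: cached };

  originReads += 1;
  const longUrl = urlTable.get(code);
  if (longUrl === undefined) return { status: 404 };

  hotCache.set(code, longUrl);
  return { status: 302, location: longUrl };
}

shorten("abc", "https://example.com/a/very/long/path");
const firstHit = resolve("abc"); // miss: reads the table, fills the cache
const secondHit = resolve("abc"); // hit: served from the cache
const readsAfterTwo = originReads; // 1
```

Unknown codes are not cached in this sketch, so repeated lookups of a missing code would each reach the table.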
When it comes up
- After capacity estimation, when the interviewer says "sketch the architecture"
- When you need to orient the interviewer before deep-diving a specific component
- When the interviewer asks "walk me through a single request end-to-end"
- When asked "where are the bottlenecks?"
- When the conversation has drifted and you need to reset the shared mental model
Order of reveal
1. State the request shape. "Two dominant paths: read (GET /X at Y QPS) and write (POST /X at Z QPS). Latency budget is W. Let me draw the shape that serves those."
2. Draw the 3-layer skeleton. "Edge → compute → state. Client hits the LB, LB fans to stateless app tier, app tier reads/writes state. Three arrows, three layers."
3. Place the primary store. "Primary is Postgres (ACID matters for the write path). It's the source of truth."
4. Add the cache on the hot path. "Redis sits in front of Postgres on the read path. 95% hit rate target. The read goes to Redis first, to Postgres on a miss, and the result is then written into Redis."
5. Name async side-channels. "Async work (analytics, search indexing, notifications) goes via Kafka. Dashed lines. Fire-and-forget from the write path."
6. Label the edges. "Each edge gets a verb and a protocol. Client → LB over HTTPS. App → Redis via TCP GET. App → Kafka via producer.send()."
7. Call out what you are NOT drawing. "Auth service is off-diagram; JWTs are validated in-process. Observability, deploy pipeline, feature flags: control plane, mentioned not drawn."
8. Name the bottleneck proactively. "The pressure point is the cache hit rate. If it drops below 90%, Postgres is on the hook and that is where I'd scale next: more read replicas, then sharding."
Signature phrases
- “Edge → compute → state” — Sharp skeleton that sets the shared mental model fast.
- “Every edge has a verb” — Forces dataflow over decoration.
- “Async is dashed, via a queue” — Visual convention that shows seniority.
- “The pressure point is X” — Pre-empts the bottleneck question and demonstrates failure thinking.
- “What I am NOT drawing” — Signals discipline instead of omission.
- “Stateless compute, state is where decisions live” — Core architecture mantra that frames every subsequent discussion.
Likely follow-ups
“Your design has four services. Why not one monolith?”
Fair question. Usually the right answer IS a monolith. Services here are justified because:
1. Distinct scaling profiles. The feed service runs at 100k QPS; the posting service at 500 QPS. Putting them in one binary means overprovisioning compute for the low-traffic path or starving the high-traffic one.
2. Blast radius. A bug in the comment service should not take down the feed read path. Separate processes = separate failures.
3. Team boundaries. Each service has a clear owning team with independent deploy cadence.
But I would start with a monolith if any of those were absent. Premature service split is the more common mistake — the operational cost (dashboards, on-call, distributed tracing, deploys) is real and only pays back when the scaling / blast-radius / team arguments are concrete.
If the interviewer pushes back further, I would collapse to a monolith for the v1 and split only when the specific pain appears.
“Why is auth off the main diagram?”
Because it is not on the critical path of every request. The pattern:
1. Once per session: client hits `POST /auth/login` → auth service verifies credentials → returns a signed JWT (or opaque session id).
2. Every subsequent request: client sends the token in `Authorization: Bearer`. App server verifies the signature locally (JWT) or hits a local session cache (Redis). No network call to auth service.
3. Token refresh: periodic background call to auth service to get a new token. Off the critical path.
Drawing auth as a middleman on every request is a common mistake — it implies the auth service is on the hot path, which would be disastrous for latency and availability. The correct architecture decouples auth verification from auth issuance.
The exception is fine-grained authorization (per-resource ACLs that can't fit in a token) — that may need a local cache with a short TTL or an explicit policy service call, but usually cached aggressively.
“You said "Redis cache at 95% hit rate." What happens if Redis goes down?”
Graceful degradation, not total failure. The pattern:
1. App catches Redis connection errors and falls through to Postgres directly.
2. Circuit breaker on the Redis client: after N failures, stop trying Redis for ~30 seconds and go straight to Postgres.
3. Postgres now eats 100% of read traffic (20k QPS instead of 1k). This is where read replicas matter: at least 3 replicas sized to handle full load without the cache.
4. Alert fires: "Redis unavailable, serving from primary." On-call engineer paged.
5. Recovery: Redis comes back, app re-connects, hit rate climbs over a few minutes.
What NOT to do:
- Fail the request if Redis is down (terrible — cache should be optional)
- Retry Redis indefinitely (cascading failure as app threads block)
- Size Postgres for the steady-state cache hit rate alone (size it for the full read load minus only what you are confident the cache will still catch during an incident)
This is the "cache is an optimisation, not a dependency" discipline — essential for any production system.
“Walk me through a single write end-to-end.”
Take "user submits a vote" on the Reddit-style system:
1. Client → Cloudflare (10 ms): HTTPS POST /v1/votes
2. Cloudflare → ALB (20 ms, anycast to region): regional ingress
3. ALB → app instance (2 ms): round-robin, stateless
4. App validates JWT locally (< 1 ms, no network)
5. App validates request body (is vote ±1, are user/post IDs valid)
6. App writes to Postgres primary: `INSERT INTO votes (user_id, post_id, value) VALUES (...) ON CONFLICT (user_id, post_id) DO UPDATE ...` (5 ms, idempotent)
7. App publishes a `vote.created` event to Kafka (2 ms, fire-and-forget)
8. App returns 201 to client: total p50 ~40 ms, p99 ~150 ms
Async consumers (not on the critical path):
- Consumer A updates denormalised `score` on the post (eventual consistency, < 1 s lag)
- Consumer B updates hot-feed ranking cache in Redis
- Consumer C writes to analytics data warehouse
What if the Kafka publish fails?
- Use the outbox pattern: write the event to a `vote_events` table in the same Postgres transaction as the vote. A separate CDC process publishes the event to Kafka at-least-once. Never lose events even if Kafka is down.
This end-to-end walk shows the interviewer I understand sync vs async, the idempotency story, the failure mode, and the latency budget.
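At-least-once publishing means the score consumer will sometimes see the same vote event twice, so it has to be idempotent. A minimal sketch, assuming each event carries a unique `eventId` (a field this walk-through does not specify) and using an in-memory set where a real consumer would more likely rely on a unique constraint or a bounded dedup store:

```typescript
// Hypothetical event shape; eventId is an assumed unique identifier.
type VoteEvent = { eventId: string; postId: string; value: 1 | -1 };

class ScoreConsumer {
  private readonly seen = new Set<string>();
  private readonly scores = new Map<string, number>();

  // Applying the same event twice must not change the score twice.
  handle(event: VoteEvent): void {
    if (this.seen.has(event.eventId)) return; // duplicate delivery, ignore
    this.seen.add(event.eventId);
    this.scores.set(event.postId, this.score(event.postId) + event.value);
  }

  score(postId: string): number {
    return this.scores.get(postId) ?? 0;
  }
}

const consumer = new ScoreConsumer();
const upvote: VoteEvent = { eventId: "e1", postId: "post-1", value: 1 };
consumer.handle(upvote);
consumer.handle(upvote); // redelivered by the relay
consumer.handle({ eventId: "e2", postId: "post-1", value: 1 });
const finalScore = consumer.score("post-1"); // 2, not 3
```

The `seen` set grows without bound in this sketch; it only illustrates the dedup check itself.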
Code examples
```typescript
// Write business data AND the event in the SAME DB transaction.
// A separate CDC process publishes events to Kafka at-least-once.
await db.transaction(async (tx) => {
  // 1. Business write (source of truth)
  await tx.votes.insert({
    user_id: userId,
    post_id: postId,
    value: voteValue,
  });
  // 2. Event written in the SAME transaction
  await tx.outbox.insert({
    event_type: 'vote.created',
    payload: JSON.stringify({ userId, postId, voteValue }),
    created_at: new Date(),
  });
});
// A separate process reads outbox rows and publishes to Kafka:
//   SELECT * FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT 100;
//   for row in batch: kafka.produce(row.payload)
//   UPDATE outbox SET published_at = now() WHERE id IN (...);
//
// Guarantees: event is published iff the business write committed.
// At-least-once delivery; consumers must be idempotent.
```

```typescript
class RedisCircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold = 5;
  private readonly cooldownMs = 30_000;

  async get(key: string): Promise<string | null> {
    if (this.isOpen()) return null; // skip Redis, go to origin
    try {
      const result = await redis.get(key);
      this.failures = 0; // reset on success
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) {
        this.openedAt = Date.now();
        logger.warn('Redis circuit opened; degrading to origin');
      }
      return null; // treat as cache miss
    }
  }

  private isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (Date.now() - this.openedAt > this.cooldownMs) {
      this.openedAt = null;
      this.failures = 0;
      return false; // half-open; try again
    }
    return true;
  }
}
```

```yaml
# Reddit-style system — top-level topology
edge:
  cdn: cloudflare
  lb: aws-alb
  waf: cloudflare-waf
compute:
  app:
    type: stateless
    replicas: 20-100 (auto-scaled)
    runtime: node.js
  workers:
    type: stateless
    replicas: 10
    consumes: kafka.topics.[vote.created, post.created]
state:
  primary_db:
    type: postgres
    replicas: 1 primary + 3 read replicas
    size: db.r5.4xlarge
  cache:
    type: redis
    mode: cluster
    nodes: 6 (3 primary + 3 replica)
  stream:
    type: kafka
    brokers: 3
    retention: 7d
    topics: [vote.created, post.created, user.created]
  media:
    type: s3
    bucket: reddit-media
    cdn: cloudfront
  search:
    type: elasticsearch
    nodes: 3
    source: cdc-from-postgres
# Flows
sync_read: client -> cdn -> alb -> app -> redis (hit) / postgres-replica (miss)
sync_write: client -> alb -> app -> postgres-primary -> kafka (outbox)
async_fanout: kafka -> workers -> [denormalised-tables, search, analytics]
```
Common mistakes
A diagram of 10 boxes without arrows tells the interviewer nothing. The whole point is the flow — draw the arrows first, label them with the verb and the protocol, then let the boxes emerge.
You don't need Kafka, Elasticsearch, and 5 microservices in the v1 design. Start with the simplest thing that handles the stated requirements, then justify every addition with a concrete trigger ("at X QPS we'd add Y").
If email/indexing/fan-out happens asynchronously, it must be visually distinct on the diagram. Candidates who pretend every side-effect is sync end up designing impossible latency budgets.
Auth server, deploy system, feature-flag service, secrets manager — all important, all cluttering. Mention them verbally; draw them only if the prompt is specifically about control-plane concerns.
A senior HLD shows where the system scales (app tier → horizontal, DB → replicas or shards, cache → cluster) and where it doesn't. If every component is a single box, you've drawn a prototype, not an architecture.
A "Redis service" or "Kafka service" is not a service boundary — technology is horizontal. Services split along domain boundaries: feed, profile, payment, messaging. Check every service against "what product capability does this own?"
Practice drills
You've drawn the HLD for a chat app. The interviewer asks "where does the message go first?". What's your answer?
Walk the request end-to-end:
1. Client (WebSocket) → WebSocket-terminating LB (sticky by connection)
2. LB → connection-handler service (holds the WS connection open)
3. Connection-handler publishes to a Kafka topic `messages.published`, partitioned by channel_id
4. Fan-out workers consume from Kafka:
   - Worker A: writes to the message DB (Postgres or Cassandra), the durable store for history queries
   - Worker B: looks up channel members, pushes the message to their active connections (via Redis pub/sub cross-cluster)
5. Recipient clients receive the message over their existing WS connection
Key insight: the message hits durable storage IN PARALLEL with fan-out, not before it. Receiver latency is not gated on disk writes.
If they push on durability: "The Kafka log is the system of record. Even if the DB write fails, the message is in Kafka with 7-day retention — we can replay it. The DB is a materialised read model for history queries."
If they push on ordering: "Kafka partitioning by channel_id gives per-channel ordering. Within a channel, messages are delivered in publish order."
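Per-channel ordering follows from one property: every message for a given channel_id maps to the same partition. The sketch below uses a simple deterministic string hash as a stand-in; it is not the hash Kafka's own partitioner uses, and `hashString` and `partitionFor` are hypothetical names.

```typescript
// Simple deterministic string hash (illustrative stand-in only).
function hashString(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit integer
  }
  return h;
}

// Same channel id always yields the same partition for a fixed partition count.
function partitionFor(channelId: string, partitions: number): number {
  return hashString(channelId) % partitions;
}

const partitionCount = 12;
const firstMessagePartition = partitionFor("channel-42", partitionCount);
const secondMessagePartition = partitionFor("channel-42", partitionCount);
```

Ordering holds only within a partition, and changing the partition count remaps channels, so the count is usually fixed up front.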
Your interviewer says "the diagram has 6 boxes but no labels on any arrow. What do they do?" What went wrong?
You drew components without dataflow. Fix in 60 seconds:
1. Label each arrow with the verb: GET, SELECT, publish, enqueue, fan-out.
2. Label each arrow with the protocol: HTTP, gRPC, AMQP, WebSocket, TCP.
3. Label at least one arrow with a latency ballpark: "1 ms (intra-AZ)", "30 ms (cross-region)".
4. Distinguish sync from async: solid for sync, dashed for async via a queue.
The interviewer grades understanding of the flow, not knowledge of component names. The boxes are just anchors for the arrows. An HLD with labelled arrows and unlabeled boxes is better than the reverse.
You want to add a Kafka cluster to your v1 HLD. The interviewer asks "why?". What's the shape of a good justification?
Good justification has three parts:
1. Concrete trigger. "At X QPS / Y fan-out consumers / Z durability requirement, the write path would have to fan out synchronously and exceed our 100 ms latency budget."
2. Alternative considered. "Simpler alternatives: direct RPC calls from the writer to each consumer (breaks when a consumer is down), or a single Redis pub/sub (no persistence, no replay). Kafka gives us decoupling, retention, and replay that those don't."
3. Ops acknowledgement. "Kafka is not free — we need a 3-broker cluster, schema registry, monitoring, DLQ strategy. I'd add it at <specific trigger>, not day 1."
The anti-pattern: adding Kafka because it sounds senior. If you can't name the trigger and the alternative, strip it out. "We start without Kafka; we add it when consumer X appears" is the right v1.
Your HLD has 12 services. The interviewer frowns. How do you respond?
Acknowledge and consolidate.
"You're right — 12 services is too many for this prompt. Let me collapse. The real domain boundaries here are:
1. Identity (auth, profile, session)
2. Content (posts, comments, media)
3. Feed (timeline, ranking)
4. Notifications (async fan-out)
That's 4 services. Everything else I drew (search-indexer, analytics-ingestor, media-transcoder) are workers that consume from Kafka, not services with their own APIs. They don't need their own deploy, dashboards, or on-call.
For the v1, I'd even collapse further — Content + Feed could be one service initially, split when the scaling profiles diverge."
The lesson: splitting is a tool, not a virtue. The right number of services is the smallest number that maps cleanly to team boundaries and scaling profiles. Everything else is workers, libraries, or modules inside one service.
Interviewer: "How would this HLD change if we added multi-region requirements?"
Three layers change:
1. Edge layer:
- CDN remains (already global)
- LB becomes geo-DNS (Route 53 / Cloudflare) routing users to their nearest region
- Each region has its own regional LB + app tier
2. Compute layer:
- Replicated per region — identical stateless app tier in each
- Session state in Redis per region (with cross-region replication for failover)
3. State layer — the hard part:
- Primary DB: either single-region primary with cross-region read replicas (if writes tolerate ~100 ms cross-region latency) OR multi-primary with conflict resolution (Cassandra, Spanner, DynamoDB Global Tables)
- Cache: independent per region (eventual consistency between regions is usually fine)
- Object store: S3 cross-region replication or native multi-region bucket
- Kafka / event stream: MirrorMaker 2 replicating topics between regions
Trade-offs to name:
- Active-active is 2× the cost but gives latency + availability wins
- Cross-region write consistency is the hard problem — either go to a multi-primary store or accept write-region pinning
- Data residency rules (GDPR — EU data stays in EU) constrain which regions can hold which data
When to propose multi-region: when the prompt demands sub-100 ms p95 globally, or when regional outage resilience is stated. Don't volunteer multi-region to look senior — it is a 3× complexity multiplier.
Cheat sheet
- Shape: edge → compute → state. Every HLD starts here.
- Draw < 10 boxes on the main diagram. More = lost interviewer.
- Every edge: verb + protocol (e.g., "SELECT over TCP", "publish via AMQP").
- Dashed lines for async. Through a queue, never direct.
- Stateless compute = trivially replicated. State = where hard decisions live.
- Name the bottleneck before asked. "The cache hit rate is the pressure point."
- Justify every extra component with a QPS / storage / team threshold.
- Separate control plane from data plane; mention control plane verbally.
- Auth is off-diagram (JWT validated locally). Draw login flow, not every request.
- Name what you are NOT drawing to signal discipline, not omission.