
System design patterns.

A pattern is a system shape — read-heavy, write-heavy, fan-out, long-running, real-time. Interviewers recognise these on sight and expect you to pattern-match fast. Each write-up covers when you reach for it, the canonical skeleton, a scaling path, and the failure modes that kill it.

Workload shape

How traffic is distributed between reads and writes drives every other decision.

Read-heavy

80m

Reads dominate writes by 10:1 or more. Every layer exists to keep the primary out of the hot path.

You see it when: The interviewer states or implies a read:write ratio of 10:1, 100:1, or higher
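One way to picture "keeping the primary out of the hot path" is a cache-aside read. The sketch below is illustrative only: `CacheAside`, `load_from_primary`, and the 60-second TTL are assumed names and values, and a real design would put a shared cache (and usually read replicas) where this in-process dict sits.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Serve reads from a cache; fall back to the primary only on a miss."""

    def __init__(self, load_from_primary: Callable[[str], str], ttl_seconds: float = 60.0):
        self._load = load_from_primary          # hypothetical primary-store read
        self._ttl = ttl_seconds                 # assumed TTL, tune per workload
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.primary_reads = 0                  # counter to show how rarely the primary is hit

    def get(self, key: str) -> str:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                     # cache hit: primary untouched
        self.primary_reads += 1                 # miss or expired: one primary read
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read refetches from the primary."""
        self._entries.pop(key, None)
```

With a 100:1 read:write ratio, most calls to `get` return from the dict and `primary_reads` grows only on misses, expiries, and invalidations.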

Write-heavy

90m

Writes arrive faster than any single node can persist them synchronously. The design is about absorbing, spreading, and deferring them.

You see it when: Write QPS is 10K+ per region and climbing

Classic request-response

60m

The boring default. Synchronous HTTP with a cache. Works for 70% of APIs and you should say so.

You see it when: Standard CRUD APIs — user profile, settings, admin consoles

Execution shape

Synchronous, asynchronous, real-time — how work flows through the system.

Long-running tasks

80m

Accept work with 202 + job_id, process asynchronously, and let clients track progress via poll, push, or webhook.

You see it when: Processing takes 10 seconds to hours — far longer than HTTP timeout budgets
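The 202 + job_id handshake can be sketched without any web framework. Everything here is an assumption for illustration: `JobService`, the state names, and the in-memory job table stand in for a real API layer, a durable job store, and a worker fleet.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple


class JobService:
    """Accept work immediately, run it in the background, expose a status endpoint."""

    def __init__(self, work: Callable[[Any], Any], max_workers: int = 2):
        self._work = work
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}      # in-memory only; a real system persists this

    def submit(self, payload: Any) -> Tuple[int, dict]:
        """Equivalent of POST /jobs: return 202 Accepted plus a job_id right away."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(self._work, payload)
        return 202, {"job_id": job_id}

    def status(self, job_id: str) -> Tuple[int, dict]:
        """Equivalent of GET /jobs/{job_id}: the polling variant of progress tracking."""
        future = self._jobs.get(job_id)
        if future is None:
            return 404, {"error": "unknown job"}
        if not future.done():
            return 200, {"state": "running"}
        error = future.exception()
        if error is not None:
            return 200, {"state": "failed", "error": str(error)}
        return 200, {"state": "succeeded", "result": future.result()}
```

Push and webhook delivery would replace the client's polling loop, but the accept-then-track contract stays the same.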

Producer-consumer / work queue

35m

Decouple rate of production from rate of consumption with a durable queue and autoscaled workers.

You see it when: A synchronous call does work the user does not need to wait on (email send, image resize, index update)
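A minimal in-process sketch of the decoupling, assuming a fixed worker count. `run_work_queue` is a hypothetical helper, and `queue.Queue` is neither durable nor autoscaled; it only stands in for a real broker to show the shape.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List, Tuple

_STOP = object()  # sentinel telling a worker to exit


def run_work_queue(
    tasks: Iterable[Any], handler: Callable[[Any], Any], workers: int = 3
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Feed tasks through a bounded queue to a pool of worker threads."""
    q: queue.Queue = queue.Queue(maxsize=100)   # bounded: a full queue blocks the producer
    done: List[Any] = []
    failed: List[Tuple[Any, Exception]] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = q.get()
            try:
                if item is _STOP:
                    return
                result = handler(item)
                with lock:
                    done.append(result)
            except Exception as exc:            # record the failure, keep the worker alive
                with lock:
                    failed.append((item, exc))
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for task in tasks:                          # producer side
        q.put(task)
    for _ in threads:                           # one stop signal per worker
        q.put(_STOP)
    q.join()
    for t in threads:
        t.join()
    return done, failed
```

The producer never calls `handler` directly, so a slow consumer shows up as queue depth rather than as request latency.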

Event-driven / saga

70m

Coordinate multi-service workflows via compensating transactions instead of distributed locks — choreography for simple flows, orchestration for everything else.

You see it when: Multi-service workflow (order → payment → inventory → shipping)
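The orchestration variant can be sketched as a loop that runs each step and, on failure, runs the compensations for completed steps in reverse order. `run_saga` and the step names are hypothetical; a real orchestrator would also persist its progress and retry compensations.

```python
from typing import Callable, List, Tuple

# (name, action, compensation): the compensation undoes a completed action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps newest-first."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            break
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return log


def reserve_inventory() -> None:
    raise RuntimeError("out of stock")          # simulated failure in the third service


if __name__ == "__main__":
    noop = lambda: None
    print(run_saga([
        ("order", noop, noop),
        ("payment", noop, noop),
        ("inventory", reserve_inventory, noop),
        ("shipping", noop, noop),
    ]))
```

No lock is held across services; consistency comes from every forward action having a compensating one.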

Real-time delivery

80m

Poll, long-poll, SSE, or WebSocket — the choice is about update frequency, direction, and how many persistent connections you can afford.

You see it when: Server-initiated updates to the client

Data movement & fan-out

One-to-many delivery, geographic distribution, and the trade-offs that come with them.

Fan-out: on write vs on read

75m

Where does the work live — at write time (push to every follower's inbox) or at read time (gather from each followed user)? Both break at the extremes; hybrids win.

You see it when: Social graph with asymmetric follow counts (power-law follower distribution)
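A hybrid can be sketched as: push posts from ordinary authors into follower inboxes at write time, and pull posts from high-follower authors at read time. The `Feed` class and the follower threshold are assumptions for illustration, and the sketch ignores backfill for users who follow an author after a post was pushed.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Post = Tuple[int, str, str]  # (timestamp, author, text)


class Feed:
    def __init__(self, celebrity_threshold: int = 10_000):
        self.threshold = celebrity_threshold     # assumed cutoff; tune from real data
        self.followers: DefaultDict[str, Set[str]] = defaultdict(set)
        self.following: DefaultDict[str, Set[str]] = defaultdict(set)
        self.inbox: DefaultDict[str, List[Post]] = defaultdict(list)
        self.outbox: DefaultDict[str, List[Post]] = defaultdict(list)

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def post(self, author: str, ts: int, text: str) -> None:
        entry = (ts, author, text)
        self.outbox[author].append(entry)
        # Fan-out on write only when the follower list is small enough to afford it.
        if len(self.followers[author]) < self.threshold:
            for follower in self.followers[author]:
                self.inbox[follower].append(entry)

    def timeline(self, user: str, limit: int = 10) -> List[Post]:
        entries = set(self.inbox[user])
        # Fan-out on read for high-follower authors, merged at query time.
        for author in self.following[user]:
            if len(self.followers[author]) >= self.threshold:
                entries.update(self.outbox[author])
        return sorted(entries, reverse=True)[:limit]
```

The set merge in `timeline` avoids duplicates when an author crosses the threshold between posts.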

Edge caching / CDN-first

70m

The cheapest request is the one that never hits your origin. Push static and near-static content to the edge and let the CDN absorb 80–99% of reads.

You see it when: Public, shareable content (product pages, articles, media)

Multi-region active-passive / active-active

75m

Geographic distribution for latency, DR, and compliance. Active-passive is operationally sane; active-active is a conflict-resolution project.

You see it when: Global user base with regional latency SLOs (<100 ms)

Specialised shapes

Problem-specific patterns you'll recognise on sight once you've seen them.

Search over content

Inverted index + ranking service. The hard part isn't indexing — it's relevance and keeping the index fresh.

You see it when: Full-text search over a corpus (millions+ of docs)
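For reference, the "easy part" looks roughly like this: a toy inverted index with term-frequency scoring. `InvertedIndex` and `tokenize` are illustrative names, and the scoring is deliberately naive; real relevance ranking and incremental index refresh are where the effort goes.

```python
import re
from collections import defaultdict
from typing import DefaultDict, Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumerics; no stemming or stop words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self) -> None:
        # term -> {doc_id: term frequency}
        self._postings: DefaultDict[str, Dict[int, int]] = defaultdict(dict)

    def add(self, doc_id: int, text: str) -> None:
        for term in tokenize(text):
            self._postings[term][doc_id] = self._postings[term].get(doc_id, 0) + 1

    def search(self, query: str) -> List[int]:
        """Return doc ids ordered by summed term frequency, ties broken by id."""
        scores: DefaultDict[int, int] = defaultdict(int)
        for term in tokenize(query):
            for doc_id, tf in self._postings.get(term, {}).items():
                scores[doc_id] += tf
        return sorted(scores, key=lambda d: (-scores[d], d))
```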

Geospatial / proximity lookup

Geohash / S2 / H3 for "nearby X" queries. A plain B-tree index on lat/lng cannot narrow both dimensions at once, so proximity queries degrade quickly as the table grows.

You see it when: "Find N nearest" queries
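To show why a cell encoding helps, here is a sketch of the standard geohash encoding: interleave longitude and latitude bisection bits and emit base-32 characters. The function name and default precision are assumptions; S2 and H3 use different cell schemes that are not shown here.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lng: float, precision: int = 7) -> str:
    """Encode a point as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lng = True                               # bits alternate, longitude first
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits, lng_lo = (bits << 1) | 1, mid
            else:
                bits, lng_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:                       # every 5 bits becomes one character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

Points in the same cell share a prefix, so "nearby" becomes a prefix lookup on a one-dimensional index. Points just across a cell boundary do not share a prefix, which is why real lookups also query the neighbouring cells.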

Large file upload & blob handling

Chunked + resumable uploads direct to blob storage. Signed URLs. App server never touches the bytes.

You see it when: User-uploaded media (video, images, audio)
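The signed-URL idea can be sketched with an HMAC over the method, object key, and expiry. This is a made-up scheme for illustration, not any cloud provider's actual signing format; `sign_upload_url`, `verify_upload_url`, the query parameter names, and the secret are all assumptions.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # placeholder; a real key lives in a secrets manager


def _signature(object_key: str, expires_at: int, secret: bytes) -> str:
    message = f"PUT\n{object_key}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_upload_url(base_url: str, object_key: str, expires_at: int,
                    secret: bytes = SECRET) -> str:
    """App server side: hand the client a short-lived URL for a direct upload."""
    query = urlencode({"expires": expires_at, "sig": _signature(object_key, expires_at, secret)})
    return f"{base_url}/{object_key}?{query}"


def verify_upload_url(url: str, now: int, secret: bytes = SECRET) -> bool:
    """Storage side: accept the upload only if the signature matches and has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    object_key = parsed.path.lstrip("/")         # assumes base_url has no path component
    return hmac.compare_digest(sig, _signature(object_key, expires_at, secret))
```

The app server only signs; the bytes travel from the client straight to storage, chunk by chunk.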

Content recommendation

Feature store + candidate generation + ranking. Offline training, online serving. Separate the cheap retrieval from the expensive scoring.

You see it when: "Recommended for you" feeds

Rate limiting / quota enforcement

Token bucket + distributed counter. The token bucket math is trivial; the distribution is the hard part.

You see it when: Public API with tiered plans (free, paid)
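The single-node token bucket math is short enough to show in full. This sketch assumes one process and an injectable clock; the `TokenBucket` name and parameters are illustrative, and the distributed counter (the hard part) is not covered.

```python
import time
from typing import Callable


class TokenBucket:
    """Refill at `rate_per_sec` up to `capacity`; each request spends tokens."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock                      # injectable so the logic is testable
        self._tokens = float(capacity)           # start full: allows an initial burst
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, capped at the bucket size.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Tiered plans map to different `rate_per_sec` and `capacity` values per API key. Making the token count consistent across many gateway nodes is the distribution problem the blurb refers to.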

Reliability & infra

The disciplines that turn "it works" into "it stays working".

High Availability

70m

Redundancy + graceful degradation + operational discipline. You don't buy 99.99% — you earn it.

You see it when: Availability target >= 99.95% (about 4.4 hours of downtime per year or less)
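The downtime budget behind an availability target is simple arithmetic, sketched here assuming a 365-day year. The function names are illustrative, and the serial-composition helper assumes independent failures, which real systems rarely satisfy exactly.

```python
from typing import Iterable

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_percent: float) -> float:
    """Hours per year a service may be down and still meet the target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def serial_availability(component_percents: Iterable[float]) -> float:
    """Availability of a request path that needs every component up (independence assumed)."""
    total = 1.0
    for percent in component_percents:
        total *= percent / 100
    return total * 100


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        print(f"{target}% -> {downtime_budget_hours(target):.2f} h/year")
```

At 99.95% the budget is about 4.38 hours per year; at 99.99% it shrinks to under an hour, which is why each extra nine costs redundancy at every layer in the request path.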

Leader election / consensus

Raft / Paxos via etcd / ZooKeeper. When exactly one of N nodes must act (a leader, a lock holder, a scheduler), use a consensus service; don't roll your own.

You see it when: Distributed locks
