Classic request-response
When to reach for this
Reach for this when…
- Standard CRUD APIs — user profile, settings, admin consoles
- Latency SLO is > 100ms; payload < 100 KB
- Reads and writes are both bounded; neither dominates 100:1
- No real-time push, no fan-out, no long-running work
- Interviewer says "start simple" or "walk me through the basics"
Not really this pattern when…
- User needs push updates without polling (→ real-time)
- A single request triggers many downstream effects (→ event-driven)
- Single response would take > a few seconds (→ long-running-tasks)
- Read volume 100× writes and shared across users (→ read-heavy + CDN)
- Write bursts exceed what a single primary can handle (→ write-heavy)
Good vs bad answer
Interviewer probe
“Design an API for a small B2B SaaS app.”
Weak answer
"Kafka for events, gRPC between microservices, Redis Cluster, Elasticsearch for search, CDC with Debezium, Kubernetes for orchestration..."
Strong answer
"Monolith Postgres + Node/Go app behind a load balancer. Cache-aside with Redis on hot endpoints — 60s TTL, invalidate on write. Cursor-based pagination on every list endpoint. Idempotency-Key header on retry-able POSTs. Connection pool at 25 per app instance, 5 instances, pgBouncer between them with 50 backend connections. This handles thousands of tenants. When primary CPU climbs past 60% I'll add a read replica. When side-effects slow p95, I'll add a queue. Until then, boring is a feature."
Why it wins: Resists premature complexity. Names concrete numbers, pagination, idempotency, connection pooling, and the specific upgrade triggers.
Cheat sheet
- Most APIs are request-response. Don't pretend otherwise.
- LB → stateless app → Redis cache-aside → Postgres. That's the stack.
- Paginate every list endpoint from day one. Cursor > offset.
- Idempotency-Key on every retry-able POST. Non-negotiable for payments.
- pgBouncer between app and DB. 25 conns per app, 50 real conns to Postgres.
- Cache TTL = staleness budget from the requirements. Not a guess.
- Invalidate cache on every write path. TTL is the safety net, not the strategy.
- Scale reads: replicas. Scale writes: sharding. Scale side-effects: queue.
- Graduate when a specific metric hits a specific limit. Not when it "feels" complex.
- Instrument from day one: p99 latency, DB CPU, connection count, cache hit rate.
- Resist Kafka until you have event-driven requirements, not just event-driven vibes.
- The boring answer is usually the right answer. Say so with confidence.
Core concept
Most production APIs are request-response. A client sends an HTTP request, a server does some work, and the client gets a response — usually in under 500 ms. No WebSockets, no Kafka, no CQRS. The interview trap is reaching for complexity when the boring answer is right.
Client → LB → App → Cache / DB. The boring right answer for 70% of APIs.
The premature-complexity trap. Interviewers watch for it. You hear "design an API for a SaaS product" and instinctively sketch Kafka, event sourcing, microservices. Stop. Ask: is the user waiting for a response? Is the work under 500 ms? Are reads and writes roughly balanced? If yes to all three, the answer is synchronous request-response with a cache. Say so explicitly — "this is the boring CRUD pattern, and I'm choosing it on purpose because nothing in the requirements justifies complexity."
The canonical stack. Load balancer → stateless application instances → Redis (cache-aside) → Postgres. Each piece earns its place:
- Load balancer distributes requests across N identical app instances. Stateless means any instance handles any request — no sticky sessions needed.
- Redis cache-aside absorbs repeated reads. App checks Redis first; on miss, reads Postgres and populates Redis with a TTL. Target 80%+ hit rate on hot endpoints. TTL is your staleness budget made concrete.
- Postgres primary is the source of truth. Writes go here; reads go here only when cache misses and it's not worth a replica yet.
When to add read replicas. When the primary's CPU exceeds 60% sustained and read queries dominate the mix. Two replicas plus the primary give you up to roughly 3× read capacity. pgBouncer pools connections but does not split reads from writes, so route reads to the replicas in the app's data-access layer or through a routing proxy such as Pgpool-II. Budget for replication lag: use a write-recency cookie to pin a user's reads to primary for 5 seconds after their write, or write-through the cache so the user sees their own write immediately.
Connection pooling is non-negotiable. Postgres handles thousands of queries/s but only hundreds of connections. Five app instances × 25 connections = 125 connections to the DB. Put pgBouncer in front; it multiplexes to 50 actual backend connections. Without it, a traffic spike exhausts the pool and cascading timeouts follow.
Pagination from day one. Every list endpoint returns paginated results. For live data, prefer cursor-based pagination over offset: each page is an index seek whose cost does not grow with page depth, results do not shift when rows are inserted or deleted, and the client just passes back the opaque cursor. The trade-off is that clients cannot jump to an arbitrary page. Offset-based pagination breaks under concurrent writes and gets slower as the offset grows.
Idempotent writes are critical for retries. PUT is idempotent by design. POST is not — and clients retry. The Idempotency-Key pattern solves this: client sends a unique key in a header, server checks Redis for that key. If seen, return the cached result. If new, process the request, cache the result, return it. This is non-negotiable for payments, order creation, or any mutation where double-execution has real consequences.
Cache-aside is the default cache strategy. App reads Redis; on miss, reads DB, writes Redis with TTL. On every write path, invalidate the relevant cache key. TTL is your safety net — even if you miss an invalidation, the stale value expires. Keep TTLs short (30s–5min for most endpoints) rather than long. The most common bug is a write path that forgets to invalidate.
Graduation triggers — when to leave this pattern. The request-response pattern fails when a specific axis breaks, not when it "feels too simple":
1. Push required: user needs real-time updates → WebSocket / SSE → real-time pattern.
2. Reads dominate 100:1: add CDN, read replicas, multi-tier cache → read-heavy pattern.
3. Write bursts: single primary can't keep up → sharding, write-ahead queue → write-heavy pattern.
4. Long-running work > 1s: user can't wait → 202 Accepted + async workers → long-running-tasks pattern.
5. Fan-out: one write triggers notifications/feeds for N users → event-driven pattern.
Five specific triggers that push you to a more complex pattern.
The discipline is naming the exact trigger before you graduate. "We need Kafka because our industry uses it" is wrong. "Primary CPU is at 72% sustained and 85% of queries are reads — I'm adding a read replica and cache-aside" is right. Start boring. Stay boring. Graduate only when a number tells you to.
Canonical examples
- User profile GET / PUT
- Settings and preferences endpoints
- Small-to-medium B2B SaaS backends
- Admin consoles and internal tools
- Standard REST / gRPC CRUD APIs
Variants
Monolith CRUD
Single server, single DB. The starting point for every service.
One process, one DB. Good enough for thousands of users.
The monolith CRUD variant is what every service should start as. One application process — Rails, Django, Express, Spring Boot — talking to one Postgres instance. No cache, no queue, no load balancer. The deployment is a single binary or container; the infrastructure is one VM or one managed DB instance.
This is not a compromise. Modern Postgres on a db.r6g.xlarge (4 vCPU, 32 GB RAM) handles 5,000-10,000 transactions per second for a typical CRUD workload with proper indexing. That covers the first two years of most B2B SaaS products, most internal tools, and many consumer products up to 50k DAU. The single-server variant is also the easiest to debug (one log stream), the cheapest to operate (one instance), and the fastest to iterate on (no service boundaries to negotiate).
The mistake is leaving this variant too early. Teams add Redis at 12% DB CPU because "caching is best practice." They add a load balancer at 200 RPM because "we should be stateless." Both are correct principles applied at the wrong time. Every component you add is a component you must monitor, fail over, and debug at 3 AM. Stay here until a specific metric — CPU, connection count, p99 latency — tells you to move.
When to leave. When any of these are true: (a) DB CPU exceeds 50% sustained and growing, (b) peak connection count approaches the DB's max_connections limit, (c) you need horizontal app scaling for availability (multiple AZs), (d) a hot endpoint's latency would benefit measurably from caching. These are concrete, measurable triggers — not vibes.
Pros
- Simplest possible deployment: one process, one DB
- No cache invalidation, no replication lag, no distributed debugging
- Handles 5-10k TPS with proper indexing, which is years of headroom for most products
- One log stream, one metrics dashboard, one on-call runbook
Cons
- Single point of failure unless you add a standby
- Vertical scaling has a ceiling (largest instance ≠ infinite)
- No read scaling independent of write scaling
Choose this variant when
- Greenfield product with < 50k DAU
- B2B SaaS with < 1,000 tenants
- Internal tools and admin consoles
- Any service where DB CPU is below 50%
With cache layer
Add Redis cache-aside when hot endpoints justify it.
Horizontal app tier, Redis cache-aside, single primary.
The cache variant adds Redis in a cache-aside configuration between the app and the primary DB. This is the first scaling move most services make, and it's the right one when: (a) DB CPU is climbing, (b) a subset of reads are hot (80/20 rule), and (c) stale data for 30-60 seconds is acceptable on those endpoints.
Cache-aside contract: app reads Redis; on miss, reads Postgres, writes to Redis with a TTL. On every write to a cached entity, the app invalidates the cache key. TTL is the safety net — even if a write path forgets to invalidate, the stale value expires. Keep TTLs short (30s-5min for most endpoints) rather than optimistically long.
Hit-rate is the metric. Target 80%+ on hot endpoints; alert if it drops below 70%. A declining hit rate means your working set has outgrown cache RAM, your key design is too granular, or your invalidation is too aggressive. Instrument per-endpoint hit rates, not just the aggregate.
The operational cost is real: you now have a Redis cluster to provision, monitor, and fail over. Redis Sentinel or Redis Cluster for HA; memory sizing based on working set + headroom; eviction policy (allkeys-lru for cache-aside). And you inherit the invalidation tax: every write path must know which cache keys to clear. Miss one and users see stale data. This tax is manageable when the cache covers 5-10 entity types; it becomes a full-time job at 50+.
Connection pooling matters more now. With cache in play, each request potentially makes two network calls (Redis + Postgres) instead of one. The app's connection pool to both Redis and Postgres must be sized to handle peak concurrency without queueing.
Pros
- Drops read latency from ~10ms (DB) to ~1ms (Redis) for cached data
- Reduces DB CPU by 50-80% on read-heavy endpoints
- A Redis node usually costs far less than a DB replica sized for the same read load
Cons
- Cache invalidation is now your problem on every write path
- Another service to provision, monitor, and fail over (Redis HA)
- Debugging stale-data bugs requires checking both DB and cache state
Choose this variant when
- DB CPU exceeding 50% and reads are the dominant query type
- A few endpoints account for most of the read traffic (hot endpoints)
- 30-60 second staleness on cached data is acceptable
With read replicas + pgBouncer
Scale reads at the DB tier when cache alone is not enough.
Replicas absorb reads; pgBouncer multiplexes connections.
When the cache hit rate is already 85%+ but the primary's CPU still climbs because of cache misses and the long tail of uncacheable queries, read replicas are the next move. Two Postgres replicas plus the primary give you up to roughly 3× the read capacity. pgBouncer itself only pools connections and does not inspect queries, so the read/write split has to live somewhere else: either the app's data-access layer uses separate read and write connection targets, or a routing proxy such as Pgpool-II sends reads to replicas and writes to the primary.
pgBouncer is the multiplexer. Five app instances × 25 connections each = 125 app-side connections. pgBouncer collapses these into 50 real backend connections to Postgres. Without it, scaling the app tier eventually hits the DB's max_connections wall (100 by default in stock PostgreSQL; managed services often derive it from instance size). pgBouncer's transaction pooling mode gives you the best connection reuse; use session pooling if you rely on session state such as SET commands or session-level advisory locks. Prepared statements also need session pooling unless your pgBouncer version supports them in transaction mode.
Replication lag is the design conversation. Streaming replication in Postgres typically lags by milliseconds under normal load, but can spike to seconds during write bursts or vacuum operations. You need a read-your-writes strategy documented before you ship:
- Write-recency cookie: after a user writes, set a cookie with a timestamp. For reads within 5 seconds of that timestamp, route to primary. After the window, route to replica.
- Cache write-through: on write, populate the cache immediately. The user reads from cache (fresh) while the replica catches up.
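- Synchronous replication (semi-synchronous in MySQL terms): one replica must acknowledge before the primary confirms the write. In Postgres this is configured with synchronous_standby_names and synchronous_commit. It adds a network round-trip, typically around 1ms within a region, to every write. By default the standby acknowledges once the WAL is durably written, not yet replayed, so reads on that replica are only guaranteed current with synchronous_commit = remote_apply.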
Pick one. Document it. Don't let different teams pick different strategies — you'll get inconsistent behaviour that's impossible to debug.
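The write-recency cookie from the list above reduces to a small routing decision. The following is a minimal sketch, assuming the app stores the user's last write time (for example in a cookie) and passes it in; `choose_target` and `PIN_WINDOW_S` are hypothetical names, not part of any library.

```python
from typing import Optional

PIN_WINDOW_S = 5.0  # pin reads to primary for 5 seconds after a write


def choose_target(
    is_write: bool,
    last_write_at: Optional[float],
    now: float,
    pin_window_s: float = PIN_WINDOW_S,
) -> str:
    """Return "primary" or "replica" for one query.

    last_write_at is the timestamp from the user's write-recency cookie,
    or None if the user has not written recently.
    """
    if is_write:
        return "primary"
    if last_write_at is not None and now - last_write_at < pin_window_s:
        # The replica may not have replayed this user's write yet.
        return "primary"
    return "replica"
```

Keeping the decision in one function makes it easier to enforce a single strategy across teams, since every data-access path can call the same code.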
The operational surface grows: you now manage primary + N replicas + pgBouncer + Redis. Failover playbooks are mandatory. Automated replica promotion on primary failure. Monitoring replication lag with alerting at your SLO threshold.
Pros
- Roughly 3× read capacity with two replicas, with app changes limited to read/write routing
- pgBouncer removes connection exhaustion as a limit on scaling the app tier
- Replicas serve the long-tail queries that cache cannot help
Cons
- Replication lag requires a read-your-writes strategy
- More infrastructure to operate: primary, replicas, pgBouncer, Redis
- Replicas cost full DB instances, which is not cheap
Choose this variant when
- Cache hit rate is already 80%+ but primary CPU still high
- Long-tail uncacheable queries dominate the miss path
- You need read scaling beyond what a single DB can handle
Graduated hybrid (sync + async)
Return 200 synchronously; fire side-effects through a queue.
Return 200 immediately; enqueue side-effects for workers.
The graduated hybrid is the natural evolution when some work triggered by a request doesn't need to complete before the user gets a response. The canonical example: user places an order → the API validates, persists to DB, returns 200 with the order ID → then enqueues side-effects: send confirmation email, notify warehouse, update analytics, charge payment asynchronously if pre-authorised.
The contract is simple. The synchronous path does the minimum: validate, persist, respond. Everything else goes to a queue (SQS, RabbitMQ, Redis Streams; pick the one your team knows). Workers consume the queue and process side-effects. If a worker fails, the message retries. Because delivery is at-least-once, each worker must be idempotent: a redelivered message should not send a second email or fire a second webhook. The user never sees the failure because they already got their 200.
This variant is not event-driven architecture. You're not building a pub-sub mesh or an event store. You're putting a queue between "thing the user cares about" and "things that happen because of it." The app is still a monolith; the queue is a tactical tool, not an architectural philosophy.
When to use this inside request-response. When your p95 latency budget is 300ms but the full request lifecycle takes 1.2 seconds because of email sending, PDF generation, webhook delivery, or third-party API calls. Move the slow parts to the queue; the synchronous path drops to 50ms.
Failure isolation is the real win. If the email service is down, the order still succeeds. If the analytics pipeline is backed up, users don't notice. Each side-effect retries independently. Compare this to the monolith where a slow SMTP server makes the entire POST /orders endpoint time out.
The cost is eventual consistency on the side-effects. The user gets their 200 but the confirmation email might take 30 seconds. For most products, that's fine. For anything where the user needs to see the result of the side-effect immediately (e.g., "your payment was charged"), keep that step synchronous and queue only the truly deferrable work.
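The split between the synchronous path and the queued side-effects can be sketched in a few lines. This is an illustration only: an in-process `queue.Queue` stands in for SQS or RabbitMQ, a list stands in for the orders table, and `OrderService`, `place_order`, and `run_worker_once` are hypothetical names.

```python
import queue
from typing import Callable, Dict, List, Tuple

Job = Tuple[str, int, int]  # (effect name, order id, attempt number)


class OrderService:
    def __init__(self, handlers: Dict[str, Callable[[int], None]],
                 max_attempts: int = 3) -> None:
        self.orders: List[int] = []  # stands in for the orders table
        self.jobs: "queue.Queue[Job]" = queue.Queue()
        self.dead_letters: List[Job] = []
        self.handlers = handlers
        self.max_attempts = max_attempts

    def place_order(self, order_id: int) -> Dict[str, int]:
        # Synchronous path: validate, persist, respond.
        if order_id <= 0:
            raise ValueError("invalid order id")
        self.orders.append(order_id)
        # Everything else is enqueued; no side-effect runs before the response.
        for effect in self.handlers:
            self.jobs.put((effect, order_id, 1))
        return {"status": 200, "order_id": order_id}

    def run_worker_once(self) -> bool:
        """Process one job. Returns False when the queue is empty."""
        try:
            effect, order_id, attempt = self.jobs.get_nowait()
        except queue.Empty:
            return False
        try:
            self.handlers[effect](order_id)
        except Exception:
            if attempt < self.max_attempts:
                # Each side-effect retries independently of the others.
                self.jobs.put((effect, order_id, attempt + 1))
            else:
                self.dead_letters.append((effect, order_id, attempt))
        return True
```

A failing email handler never affects the value returned by `place_order`; the job is retried up to `max_attempts` times and then parked in `dead_letters` for inspection.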
Pros
- Response latency drops to just the persistence step (~50ms)
- Side-effect failures don't block the user-facing response
- Each side-effect retries independently, so partial failures are isolated
- Still a monolith, with no microservice coordination overhead
Cons
- Side-effects are eventually consistent; the user may wait for the email
- Queue adds operational surface: dead-letter queues, retry policies, monitoring
- Debugging requires correlating request logs with worker logs (use trace IDs)
Choose this variant when
- POST endpoints do > 500ms of non-critical work (emails, webhooks, analytics)
- Third-party API calls in the hot path hurt p95 latency
- You need failure isolation between the user-facing response and downstream effects
Scaling path
v1 — single server monolith
Ship the product. One DB handles everything.
One process, one DB. Good enough for thousands of users.
Modern Postgres on a properly-sized instance handles 5,000-10,000 TPS for a typical CRUD workload. Zero cache, zero replica, zero operational surface beyond one DB. The discipline at this stage is resisting "best practice" that adds components before the numbers justify them.
At this stage, instrument everything: DB CPU, connection count, p99 latency per endpoint, slow query log. These numbers are what tell you when to move to v2. Without them you'll add complexity based on fear instead of data.
What triggers the next iteration
- DB CPU climbs past 50% sustained — read queries are the dominant cost
- Connection pool saturates under traffic spikes
- Single-AZ deployment means maintenance windows = downtime
v2 — LB + cache + horizontal app tier
Scale the app tier horizontally; absorb hot reads in Redis.
Horizontal app tier, Redis cache-aside, single primary.
Add a load balancer in front of 2+ stateless app instances. Add Redis cache-aside on the top 5-10 hottest endpoints. This alone typically reduces DB CPU by 50-70%.
The LB also gives you multi-AZ deployment for availability. The cache gives you reads in about a millisecond on hot data. The cost: cache invalidation is now your problem, and Redis is another service to operate.
What triggers the next iteration
- Cache hit rate below 80% — working set outgrew Redis memory
- Primary CPU still high from uncacheable long-tail queries
- Connection count approaching max_connections on Postgres
v3 — read replicas + pgBouncer
Scale reads at the DB tier; multiplex connections.
Replicas absorb reads; pgBouncer multiplexes connections.
Add 2 read replicas behind pgBouncer. Route reads to replicas, writes to primary. pgBouncer collapses 100+ app connections into 50 backend connections.
Document your read-your-writes strategy before shipping. Replication lag is milliseconds normally but can spike during vacuum or write bursts. Write-recency cookie (pin to primary for 5s after write) is the simplest approach.
What triggers the next iteration
- Write throughput hits the primary's ceiling — reads are fine but writes back up
- Replication lag spikes during write bursts break read-your-writes guarantees
- Operational complexity: primary + replicas + pgBouncer + Redis + LB
v4 — sync + async hybrid
Decouple side-effects from the response path.
Return 200 immediately; enqueue side-effects for workers.
When response latency is dominated by side-effects (emails, webhooks, analytics writes), move them to a queue. The synchronous path does validate → persist → respond. Workers handle the rest.
This is the last stop before you leave the request-response pattern entirely. If even the core persist step can't keep up, you're looking at write-heavy or event-driven patterns. If users need push updates, you need real-time. Name the trigger explicitly.
What triggers the next iteration
- Primary write throughput is the bottleneck — need sharding or write-heavy pattern
- User needs push updates — need WebSocket / SSE → real-time pattern
- Single trigger fans to N downstream systems — need event-driven architecture
Deep dives
Connection pooling
N app instances × M conns collapses to P actual DB connections.
Postgres handles thousands of queries per second but only hundreds of connections. Each connection is a separate backend process (one fork per connection) and can cost on the order of 10 MB of RAM. When you have 10 app instances each opening 25 connections, that's 250 backend connections, well above stock PostgreSQL's default max_connections of 100 (managed services often set their own default based on instance size).
pgBouncer sits between app and DB, multiplexing the N × M app-side connections into P backend connections where P << N × M. In transaction pooling mode, a backend connection is assigned only for the duration of a transaction, then returned to the pool. This means 250 app connections can share 50 backend connections if, on average, each app connection is inside a transaction less than 20% of the time (50 / 250).
Sizing the pool. Rule of thumb: set pgBouncer's backend pool size to 2-4× the number of CPU cores on the DB instance. A 4-core instance works well with 8-16 backend connections. By the same rule, the 50 backend connections used in this guide's running example imply a DB instance of roughly 13-25 cores. Many more connections than cores means context switching and lock contention: you get worse throughput, not better.
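The sizing arithmetic can be checked with a small helper. This is a sketch of the rule of thumb stated in the text (backend pool of 2-4× DB cores) plus the sharing ratio between app-side and backend connections; `plan_pool` and `PoolPlan` are hypothetical names, and the rule itself is a starting point to validate under load, not a guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPlan:
    app_connections: int      # total client connections arriving at the pooler
    backend_min: int          # 2 x DB cores
    backend_max: int          # 4 x DB cores
    max_busy_fraction: float  # share of time each app connection may hold a backend


def plan_pool(app_instances: int, conns_per_instance: int,
              db_cores: int, backend_pool: int) -> PoolPlan:
    if min(app_instances, conns_per_instance, db_cores, backend_pool) <= 0:
        raise ValueError("all inputs must be positive")
    app_connections = app_instances * conns_per_instance
    return PoolPlan(
        app_connections=app_connections,
        backend_min=2 * db_cores,
        backend_max=4 * db_cores,
        # If every app connection is busy more than this fraction of the
        # time, clients start queueing for a backend connection.
        max_busy_fraction=min(1.0, backend_pool / app_connections),
    )
```

For example, `plan_pool(10, 25, 8, 50)` describes 250 app-side connections sharing 50 backend connections on an 8-core instance, where the rule of thumb suggests a backend pool between 16 and 32.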
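App-side pool sizing. Each app instance should open just enough connections to saturate its throughput without queueing. 20-30 per instance is typical. If your app does async I/O (Node.js, Go, async Python), you usually need fewer: a connection is checked out only while a query is running instead of being tied to a blocked thread, so a small pool is shared across many in-flight requests. A single Postgres connection still executes one query at a time.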
Failure mode. When the pool is full and a new request arrives, the request blocks until a connection frees up. If the timeout is too short, the request fails. If it's too long, upstream clients time out first and retry, making things worse. Set the pool wait timeout to less than your HTTP timeout to fail fast at the right layer.
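The fail-fast behaviour can be illustrated with a bounded pool whose checkout gives up after a wait timeout. This is a toy sketch: a semaphore stands in for the pool, a placeholder object stands in for a real connection, and the two timeout constants are assumed values chosen only to show the ordering.

```python
import threading
from contextlib import contextmanager

HTTP_TIMEOUT_S = 5.0  # assumed upstream request timeout
POOL_WAIT_S = 1.0     # assumed pool wait timeout; kept below HTTP_TIMEOUT_S


class PoolTimeout(Exception):
    """Raised when no connection frees up within the wait timeout."""


class BoundedPool:
    def __init__(self, size: int, wait_timeout: float = POOL_WAIT_S) -> None:
        self._slots = threading.BoundedSemaphore(size)
        self._wait_timeout = wait_timeout

    @contextmanager
    def connection(self):
        # Block for at most wait_timeout, then fail instead of queueing forever.
        if not self._slots.acquire(timeout=self._wait_timeout):
            raise PoolTimeout("no free connection within wait timeout")
        try:
            yield object()  # placeholder for a real DB connection
        finally:
            self._slots.release()
```

A handler would typically translate `PoolTimeout` into a 503 so the caller sees a fast, explicit failure before its own HTTP timeout fires.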
Cursor-based pagination
Client sends cursor; server returns page + next cursor.
Every list endpoint must be paginated from day one. The two approaches are offset-based (ORDER BY id LIMIT 10 OFFSET 20) and cursor-based (WHERE id > last_seen_id ORDER BY id LIMIT 10). Cursor-based wins on the axes that matter in production: stable results under concurrent writes and a page cost that does not grow with depth.
Why offset breaks. OFFSET N forces the DB to scan and discard N rows before returning the page. At page 1000 with 10 items per page, the DB scans 10,000 rows to return 10. As the dataset grows, deep pages get slower linearly. Worse: if rows are inserted or deleted between page fetches, rows shift — users see duplicates or miss items entirely.
Cursor contract. The server returns each page with an opaque cursor (typically the last row's ID, base64-encoded). The client passes it back to get the next page. The server query is WHERE id > decoded_cursor ORDER BY id LIMIT page_size. This costs the same at any depth because the DB seeks to the cursor position through the index (an O(log n) lookup) instead of scanning and discarding the preceding rows.
Compound cursors. If sorting by created_at (not unique), the cursor must include both created_at and id: WHERE (created_at, id) > (cursor_ts, cursor_id). This handles ties correctly. The index must match: CREATE INDEX idx ON table(created_at, id).
Client contract. Return a JSON envelope: { "data": [...], "next_cursor": "abc123", "has_more": true }. Clients iterate by passing next_cursor until has_more is false. Never expose raw database IDs as cursors in public APIs — encode them.
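The compound cursor and the response envelope fit together as in the sketch below. It uses the standard-library `sqlite3` module so it runs without a Postgres server, and it assumes a SQLite build with row-value comparison (3.15 or later); the `items` table, `list_items`, `encode_cursor`, and `decode_cursor` are hypothetical names for illustration.

```python
import base64
import json
import sqlite3
from typing import Any, Dict, Optional, Tuple


def encode_cursor(created_at: int, row_id: int) -> str:
    raw = json.dumps([created_at, row_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> Tuple[int, int]:
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return int(created_at), int(row_id)


def list_items(conn: sqlite3.Connection, cursor: Optional[str] = None,
               page_size: int = 10) -> Dict[str, Any]:
    # Fetch one extra row to learn whether another page exists.
    if cursor is None:
        rows = conn.execute(
            "SELECT id, created_at, name FROM items "
            "ORDER BY created_at, id LIMIT ?",
            (page_size + 1,),
        ).fetchall()
    else:
        ts, row_id = decode_cursor(cursor)
        # Row-value comparison resolves ties on created_at by id.
        rows = conn.execute(
            "SELECT id, created_at, name FROM items "
            "WHERE (created_at, id) > (?, ?) "
            "ORDER BY created_at, id LIMIT ?",
            (ts, row_id, page_size + 1),
        ).fetchall()
    has_more = len(rows) > page_size
    page = rows[:page_size]
    next_cursor = encode_cursor(page[-1][1], page[-1][0]) if has_more else None
    return {
        "data": [{"id": r[0], "created_at": r[1], "name": r[2]} for r in page],
        "next_cursor": next_cursor,
        "has_more": has_more,
    }
```

A client loops by passing `next_cursor` back into `list_items` until `has_more` is false. Base64 here only makes the cursor opaque; it is not a security boundary, so a public API might additionally sign or encrypt it.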
Cache integration. Paginated responses are cacheable if the cursor is part of the cache key. First-page results (no cursor) cache well because many users fetch page 1. Deep pages are long-tail and rarely worth caching.
Idempotent writes
Redis tracks seen keys; retries return cached result.
PUT is idempotent by HTTP specification — applying it twice produces the same result. POST is not. But clients retry POSTs: network timeout, 502 from the LB, mobile app backgrounded mid-request. Without idempotency, a retried POST /orders creates two orders. For payments, this is a billing incident.
The Idempotency-Key pattern. Client generates a UUID and sends it in the Idempotency-Key header. Server flow:
1. Check Redis for the key. If found, return the cached response immediately; this is a retry.
2. If not found, set the key in Redis with status "processing" and a TTL (24-48 hours).
3. Process the request. Persist to DB.
4. Store the response body in Redis under the same key.
5. Return the response.
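If the server crashes between steps 2 and 4, the key is stuck in the "processing" state. The retry hits step 1, sees "processing," and either waits briefly or returns 409 Conflict. With a 24-48 hour TTL on that marker the client would stay locked out for a day, so give the "processing" marker its own short TTL (tens of seconds, comfortably above the worst-case request time) and apply the long TTL only when the response is stored in step 4. Once the short TTL lapses the key is gone and the request can be reprocessed. Two caveats: steps 1 and 2 must be a single atomic operation (for example Redis SET with NX), otherwise two concurrent retries can both pass the check; and a crash after step 3 has already persisted the row, so reprocessing is only safe if the DB write is itself deduplicated, for example by a unique constraint on the idempotency key.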
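The flow can be sketched with an in-memory store standing in for Redis. One choice in this sketch goes beyond the steps as listed and is an assumption: the "processing" marker gets a short TTL of its own (`PROCESSING_TTL`) so a crashed request does not block retries for the full retention period. `IdempotencyStore`, `handle_post`, and `Response` are hypothetical names, and a real store would need an atomic claim such as Redis SET with NX.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

PROCESSING_TTL = 30.0        # assumed: short lock while a request is in flight
COMPLETED_TTL = 24 * 3600.0  # 24 hours, the low end of the range in the text


@dataclass
class Response:
    status: int
    body: str


class IdempotencyStore:
    """In-memory stand-in for Redis, single-threaded for illustration."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Optional[Response]]] = {}

    def claim(self, key: str) -> Tuple[bool, Optional[Response]]:
        """Return (claimed, stored). claimed=True means the caller processes."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return False, entry[1]  # stored is None while still processing
        self._entries[key] = (now + PROCESSING_TTL, None)
        return True, None

    def complete(self, key: str, response: Response) -> None:
        self._entries[key] = (self._clock() + COMPLETED_TTL, response)


def handle_post(store: IdempotencyStore, key: str,
                process: Callable[[], Response]) -> Response:
    claimed, stored = store.claim(key)
    if not claimed:
        if stored is None:
            return Response(409, "a request with this key is still processing")
        return stored  # retry: replay the original response
    response = process()
    store.complete(key, response)
    return response
```

If `process` raises or the server dies, the key stays in the processing state until `PROCESSING_TTL` lapses, after which a retry can claim it again.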
What to store. Cache the full HTTP response: status code, body, and any headers the client depends on. The retry must return exactly what the original request returned. This is critical for clients that parse the response.
Scope. Apply idempotency to mutations that have real consequences: order creation, payment charges, account provisioning. Don't apply it to idempotent-by-design operations (GET, PUT, DELETE) — they don't need it. The overhead is one Redis check per request; apply it where the cost of double-execution exceeds that overhead.
Cache-aside strategy
App reads cache first. On miss, fetch from DB and populate cache.
Cache-aside is the default caching strategy for request-response services. The app owns both the read and write paths; the cache is a passive store the app manages explicitly.
Read path: App checks Redis → hit: return → miss: read from Postgres, write to Redis with TTL, return to client. Two round-trips on miss (Redis + Postgres), one on hit (Redis only).
Write path: App writes to Postgres → invalidate (DEL) the Redis key. Do not update the cache on write — that's write-through, which has stronger consistency but higher complexity. Invalidation is simpler: the next read will miss and repopulate.
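The read and write paths above can be sketched as follows. A small in-memory TTL cache stands in for Redis and a dict stands in for the Postgres table; `TTLCache`, `ProfileStore`, and the `profile:` key prefix are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for Redis (GET / SETEX / DEL only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def setex(self, key: str, ttl_seconds: float, value: str) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class ProfileStore:
    """Cache-aside over a dict that stands in for the database table."""

    def __init__(self, cache: TTLCache, ttl_seconds: float = 60.0) -> None:
        self.cache = cache
        self.ttl_seconds = ttl_seconds
        self.db: Dict[int, str] = {}
        self.db_reads = 0  # counts trips to the "database"

    def _key(self, user_id: int) -> str:
        return f"profile:{user_id}"

    def read(self, user_id: int) -> Optional[str]:
        cached = self.cache.get(self._key(user_id))
        if cached is not None:
            return cached  # hit: one round-trip
        self.db_reads += 1  # miss: read the DB, then populate the cache
        value = self.db.get(user_id)
        if value is not None:
            self.cache.setex(self._key(user_id), self.ttl_seconds, value)
        return value

    def write(self, user_id: int, value: str) -> None:
        self.db[user_id] = value
        # Invalidate rather than update; the next read repopulates.
        self.cache.delete(self._key(user_id))
```

The clock is injectable so expiry can be exercised in tests without sleeping.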
TTL is your staleness budget. The product requirement "user profiles can be 60 seconds stale" translates directly to TTL=60s. Pick the number from the requirements, not from intuition. Short TTLs (30s-2min) are safe defaults; long TTLs (hours) require aggressive invalidation on every write path.
The invalidation tax. Every write path that touches a cached entity must invalidate. Miss one and users see stale data for up to TTL. Real teams defend with three layers: (1) explicit invalidation in every write handler, (2) short TTLs as a safety net, (3) CDC-based invalidation for write paths the app team didn't write (admin tools, migrations, scripts).
Thundering herd on miss. When a hot key expires, every concurrent reader misses simultaneously and all hit the DB. Defences: (a) single-flight — the first reader loads from DB while others wait on a shared future, (b) probabilistic early refresh — readers randomly refresh before TTL expires, spreading the load, (c) background refresh — a separate process refreshes hot keys before they expire.
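Of the three defences, single-flight is the simplest to show in code. The sketch below is a thread-based version under the assumption that the app serves requests from threads in one process; `SingleFlight` is a hypothetical helper, and a multi-instance deployment would still see one load per instance.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict


class SingleFlight:
    """Collapse concurrent loads of the same key into one loader call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, Future] = {}

    def do(self, key: str, loader: Callable[[], str]) -> str:
        with self._lock:
            future = self._inflight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._inflight[key] = future
        if not leader:
            # Followers wait on the leader's shared future.
            return future.result()
        try:
            value = loader()
            future.set_result(value)
            return value
        except BaseException as exc:
            future.set_exception(exc)  # followers see the same failure
            raise
        finally:
            with self._lock:
                del self._inflight[key]
```

On a cache miss the read path would call `flight.do(cache_key, load_from_db)`, so a burst of readers for one expired key produces a single database query.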
Sizing Redis. Estimate working set: number of cached entities × average serialised size. Add 2× headroom for fragmentation and overhead. A 10 GB Redis instance covers most SaaS products' hot data sets. Monitor memory usage and eviction rate — if evictions are happening, you need more memory or shorter TTLs.
API design discipline
The request-response pattern lives or dies on API design. A well-designed API is cacheable, paginated, idempotent, and versioned. A badly designed one creates the problems that push you to "need" more complex patterns.
Resource-oriented URLs. GET /users/123, POST /users, PUT /users/123, DELETE /users/123. Not GET /getUserById?id=123. Resource URLs cache better (the URL is the cache key), are self-documenting, and follow the HTTP method semantics that clients and CDNs understand.
Consistent envelope. Every response: { "data": ..., "meta": { "request_id": "...", "pagination": { ... } } }. Request IDs enable end-to-end tracing. Pagination metadata enables clients to iterate. Consistent envelopes mean one client-side parser, not one per endpoint.
Versioning strategy. URL-path versioning (/v1/users) is the simplest and most visible. Header versioning with a vendor media type (for example Accept: application/vnd.example.v2+json) is more RESTful but harder to test in a browser. Pick one on day one; changing later is painful.
Error contract. { "error": { "code": "INVALID_EMAIL", "message": "Email format is invalid", "field": "email" } }. Machine-readable code for client logic; human-readable message for debugging. Never leak stack traces or internal IDs in production error responses.
Rate limiting on every endpoint. Even internal APIs. A misbehaving batch job that hammers an unprotected endpoint at 10k RPS will take down the database. Use token bucket per API key or per user ID. Return 429 with Retry-After header. Log rate-limited requests for visibility.
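A per-key token bucket that yields a 429 with Retry-After can be sketched as below. It is an in-process illustration only: `TokenBucket` and `RateLimiter` are hypothetical names, the `(status, headers)` return shape is an assumption, and a multi-instance deployment would need shared state (for example in Redis) rather than a local dict.

```python
import math
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    def __init__(self, rate_per_s: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so bursts are allowed
        self._updated = clock()

    def try_acquire(self) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds)."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._updated) * self.rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True, 0
        return False, math.ceil((1.0 - self._tokens) / self.rate)


class RateLimiter:
    """One bucket per API key."""

    def __init__(self, rate_per_s: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate, self._capacity, self._clock = rate_per_s, capacity, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def check(self, api_key: str) -> Tuple[int, Dict[str, str]]:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[api_key] = bucket
        allowed, retry_after = bucket.try_acquire()
        if allowed:
            return 200, {}
        return 429, {"Retry-After": str(retry_after)}
```

The bucket capacity sets the permitted burst and the rate sets the sustained limit, so the two can be tuned separately per client tier.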
Request validation at the boundary. Validate request bodies (schema, types, ranges) before any business logic. Return 400 with specific field errors. This is the cheapest possible rejection — no DB query, no cache check, no side-effects. Every invalid request that makes it past validation is a wasted round-trip.
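Boundary validation combined with the error contract described earlier might look like the sketch below. The `email` and `name` fields, the `MISSING_NAME` code, the deliberately loose email regex, and the handler name are all assumptions for illustration; only the error shape and the `INVALID_EMAIL` example come from the text.

```python
import re
from typing import Any, Dict, Optional, Tuple

# Deliberately loose check; real email validation is more involved.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def error(code: str, message: str, field: str) -> Dict[str, Any]:
    return {"error": {"code": code, "message": message, "field": field}}


def validate_create_user(body: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return an error payload for the first invalid field, or None if valid."""
    email = body.get("email")
    if not isinstance(email, str) or not EMAIL_RE.match(email):
        return error("INVALID_EMAIL", "Email format is invalid", "email")
    name = body.get("name")
    if not isinstance(name, str) or not name.strip():
        return error("MISSING_NAME", "Name is required", "name")
    return None


def handle_create_user(body: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
    # Validate before any DB query, cache check, or side-effect.
    problem = validate_create_user(body)
    if problem is not None:
        return 400, problem
    # Business logic would run here; this sketch just echoes the input.
    return 201, {"data": {"email": body["email"], "name": body["name"].strip()}}
```

Returning only the first failing field keeps the sketch short; an API that reports all field errors at once would return a list under the same envelope.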
Graduation triggers
Five specific triggers that push you to a more complex pattern.
The request-response pattern is not forever. But leaving it should be a deliberate decision driven by a specific, measured trigger — not a vague sense that "we've outgrown it." Here are the five triggers and where they point.
1. Push required. The user needs to see updates without refreshing — chat messages, live scores, collaborative editing. Polling is a stopgap (and an expensive one at scale). The answer is WebSocket or Server-Sent Events, which means the real-time delivery pattern.
2. Reads dominate 100:1 or more. When a single write fans out to thousands or millions of reads — video metadata, product catalogs, URL shorteners — the read-heavy pattern with CDN, multi-tier cache, and read replicas is the right shape. Request-response with a Redis cache won't cut it because the cache key space is too large and the hit rate drops.
3. Write bursts exceed primary throughput. If your single Postgres primary can't keep up with write volume — IoT telemetry, event logging, high-frequency trading — you need the write-heavy pattern: write-ahead queues, batch inserts, sharding, or append-only stores.
4. Work takes > 1 second. If a single request requires running an ML model, generating a PDF, calling five third-party APIs, or processing a video, the user can't wait. Return 202 Accepted with a job ID, process asynchronously, and let the client poll or subscribe for completion. This is the long-running-tasks pattern.
5. One trigger → N downstream effects. When a single user action (post a tweet, place an order) must notify feeds, send emails, update analytics, trigger webhooks — and you need each effect to be independently retryable and loosely coupled — you've entered event-driven territory.
The discipline. Before graduating, write down: (a) which trigger you hit, (b) the metric that proves it, (c) which part of the system moves to the new pattern, (d) which parts stay as request-response. Most systems are hybrids — the order API stays request-response while the notification subsystem goes event-driven.
Case studies
Basecamp Rails monolith — 15+ years on one pattern
Basecamp is the canonical example of a request-response monolith that never graduated. A single Ruby on Rails application, one primary MySQL database, and a Redis cache serve millions of users. DHH (David Heinemeier Hansson) has been publicly vocal about this: they don't use microservices, they don't use Kafka, they don't use event sourcing.
Their stack is literally: Nginx → Rails app → MySQL + Redis. The app is deployed on their own hardware (37signals own their servers). They handle email ingestion, file storage, real-time updates (via Turbo/Hotwire — still HTTP, not WebSocket for most features), project management, and chat — all in one monolith.
The lesson isn't "monoliths are always right." The lesson is that a well-optimised request-response stack scales further than most teams believe. Basecamp's team invested in database optimisation, efficient queries, proper indexing, and cache-aside on hot paths. They didn't invest in distributed systems infrastructure because they didn't need to. Every hour not spent on Kafka configuration was spent on product features.
The key numbers: ~3.5 million accounts, tens of thousands of concurrent users, single MySQL primary with a few read replicas, Redis for caching and background jobs (via Sidekiq). Total engineering team: ~15 programmers.
Takeaway
A well-tuned request-response monolith serves millions of users with a small team — complexity tax is a real cost.
Shopify monolith-first — one of the largest Rails apps in the world
Shopify is one of the largest Ruby on Rails monoliths ever built. At peak (Black Friday/Cyber Monday 2023), they processed over $9.3 billion in sales over the weekend. The core of their architecture is still a monolith — albeit one that's been carefully modularised.
Their evolution followed the request-response scaling path almost textbook-style: single DB → read replicas → sharded DB (pods) → cache layers → async side-effects via background jobs. They use MySQL with ProxySQL (their equivalent of pgBouncer), Redis for caching, and Sidekiq/Kafka for async processing.
The critical insight from Shopify is their "pods" architecture: they shard by merchant, so each merchant's data lives on one pod (a set of DB shards). The request-response pattern stays the same within a pod — it's still LB → app → cache → DB. The sharding is transparent to the application code via their routing layer.
They resisted microservices for years. When they finally extracted services, they did it for specific, measured reasons: Storefront Renderer was extracted because it had different scaling characteristics than the admin. Checkout was extracted because it needed independent deployment. Each extraction was driven by a named trigger, not by architectural fashion.
Takeaway
Even at $9B+ GMV, the core is still request-response. Shard by tenant to scale writes; extract services only when a specific trigger demands it.
Stack Overflow — 1.3 billion page views/month on 9 web servers
Stack Overflow serves 1.3 billion page views per month on 9 web servers, 4 SQL Server instances (2 primary + 2 replicas), 2 Redis instances, and 3 Elasticsearch instances. The entire site runs on about 25 servers. For context, this is one of the top-50 most-visited websites in the world.
Their architecture is aggressively simple: IIS → ASP.NET MVC → Dapper (micro-ORM) → SQL Server + Redis. No microservices. No message queues for the core request path. No container orchestration. They deploy to bare metal that they own and operate.
The performance secret is not architecture — it's engineering discipline. They obsessively profile every query (using MiniProfiler, which they built and open-sourced). Every page load shows query count and timing in development. N+1 queries are caught immediately. Indexes are reviewed religiously. Redis is used surgically for hot data sets.
Key numbers from their public architecture posts: average page render time ~18ms. Total SQL Server queries per page: ~3-5. Redis operations per page: ~5-10. The entire site's daily SQL Server CPU averages 12%. They have so much headroom that they run Stack Overflow, all Stack Exchange sites, and several other properties on the same infrastructure.
Nick Craver (their infrastructure lead) has said repeatedly: "Performance is a feature." They chose boring technology, invested in making it fast, and achieved scale that makes most distributed systems look overengineered.
Takeaway
Top-50 website on 9 web servers. Boring technology + engineering discipline + aggressive profiling beats complex architecture.
Decision levers
Cache-aside vs write-through
Cache-aside (app reads cache, on miss reads DB + populates cache) is the default for request-response. Write-through (app writes cache and DB together) gives stronger consistency after writes but adds complexity. Choose write-through only when read-your-writes consistency is a hard requirement and pinning a user's reads to the primary for a few seconds after their own write is insufficient.
Read replicas — when to add
Add replicas when primary CPU stays above 60% and reads dominate the query mix. Two replicas plus the primary, behind pgBouncer, give up to roughly 3× read capacity, assuming reads spread evenly across all three. Budget replication lag into your read-your-writes strategy: sticky-read to primary for N seconds after a write, or write-through the cache so users see their own writes immediately.
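The sticky-read idea can be sketched as a small router that remembers when each user last wrote. This is an illustrative sketch, not a library API: `ReadRouter`, `pin_seconds`, and the in-process dictionary are assumptions. With several app instances, the last-write time would more likely live in a cookie or shared store.

```python
import time


class ReadRouter:
    """Route a user's reads to the primary for a short window after a write."""

    def __init__(self, pin_seconds: float = 5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock            # injectable so the window is testable
        self._last_write = {}         # user_id -> time of most recent write

    def record_write(self, user_id: int) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: int) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"          # replica may not have the write yet
        return "replica"


now = [100.0]                         # fake clock for the demo
router = ReadRouter(pin_seconds=5.0, clock=lambda: now[0])
router.record_write(user_id=1)
print(router.target_for_read(1), router.target_for_read(2))
```

The pin window should be longer than the replication lag you actually observe; 5 seconds here is a placeholder.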
Connection pooling strategy
pgBouncer in transaction pooling mode is the default. Set backend pool to 2-4× DB CPU cores. App-side pool at 20-30 per instance. Session pooling only if you use prepared statements or SET commands. Without pooling, scaling the app tier hits max_connections and every traffic spike becomes a cascading timeout.
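The sizing rules above reduce to simple arithmetic. The helper below is a hypothetical sketch (the function name and return shape are assumptions); it applies the 2-4× cores guideline and compares the app-side total with Postgres's default `max_connections` of 100.

```python
def pool_plan(app_instances: int, conns_per_instance: int,
              db_cpu_cores: int, max_connections: int = 100) -> dict:
    """Estimate client-side connections and a backend pool range."""
    client_conns = app_instances * conns_per_instance
    backend_low, backend_high = 2 * db_cpu_cores, 4 * db_cpu_cores
    return {
        "client_conns": client_conns,
        "backend_pool_range": (backend_low, backend_high),
        # True means the app tier alone would exhaust Postgres without a pooler.
        "needs_pooler": client_conns > max_connections,
        # How many client connections share each backend connection.
        "multiplex_ratio": client_conns / backend_high,
    }


plan = pool_plan(app_instances=10, conns_per_instance=25, db_cpu_cores=8)
print(plan)
```

With 10 instances at 25 connections each and an 8-core database, the app tier asks for 250 connections while the guideline suggests 16 to 32 real ones, which is the gap pgBouncer closes.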
Idempotency scope
Apply Idempotency-Key to every POST that creates something with real consequences: orders, payments, account provisioning. Don't apply to GETs, PUTs, DELETEs (already idempotent by design). Store the full response in Redis with 24-48h TTL. The cost is one Redis check per POST — trivial compared to the cost of double-charging a customer.
Sync vs async boundary
The sync path should do: validate → persist → respond. Everything else — emails, webhooks, analytics, third-party API calls — goes to a queue. The trigger for introducing the queue is p95 latency exceeding the SLO because of side-effects in the hot path. Start fully sync; add the queue when you measure the need.
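A minimal sketch of that boundary, using the standard library's `queue.Queue` and a thread as stand-ins for a real job queue and worker process. The names `create_task`, `side_effects`, and `delivered` are illustrative assumptions.

```python
import queue
import threading

tasks_db = {}                 # stand-in for the primary store
side_effects = queue.Queue()  # stand-in for a real job queue
delivered = []                # records what the worker has processed


def create_task(task_id: int, title: str) -> dict:
    """Sync path: validate -> persist -> respond. Nothing else."""
    if not title.strip():
        return {"status": 422, "error": "title is required"}
    tasks_db[task_id] = {"id": task_id, "title": title}
    # Enqueue instead of sending email or calling webhooks inline.
    side_effects.put({"kind": "task_created", "task_id": task_id})
    return {"status": 201, "task": tasks_db[task_id]}


def worker() -> None:
    """Async path: a slow downstream only delays this loop, not the response."""
    while True:
        job = side_effects.get()
        try:
            if job is None:                 # sentinel: stop the worker
                return
            delivered.append(job["kind"])   # e.g. send email, call webhook
        finally:
            side_effects.task_done()


t = threading.Thread(target=worker, daemon=True)
t.start()
resp = create_task(1, "Write the design doc")
side_effects.join()       # demo only: wait until the worker has caught up
side_effects.put(None)    # stop the worker
t.join(timeout=2)
print(resp["status"], delivered)
```

An in-process queue loses jobs on restart, so a production version would typically use a durable queue; the shape of the handler stays the same.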
Failure modes
N+1 queries. App fetches a list, then runs one query per item. Latency grows linearly with list size. Fix: eager loading, batched queries, or a single JOIN with proper indexing.
Connection pool exhaustion. Traffic spike → all pool slots taken → new requests queue → upstream timeouts → retries make it worse. Fix: pgBouncer, bounded pool size, fail-fast when pool is full (don't queue indefinitely).
Unbounded list endpoint. GET /orders returns all 100k records. Memory spikes, latency spikes, client crashes. Fix: cursor-based pagination on every list endpoint from day one. No exceptions.
Stale cache after a write. Write updates DB but forgets to invalidate the cache key. Users see stale data for the full TTL. Fix: invalidate on every write path + short TTLs as safety net + CDC for paths you don't control.
Duplicate POST on retry. Client retries a POST after a timeout. Server processes it twice (double order, double charge). Fix: Idempotency-Key header on all retry-able mutations. Check Redis before processing.
Side-effects in the hot path. Sending email, calling webhooks, or writing analytics inline in the request handler. P95 latency balloons when any downstream is slow. Fix: move side-effects to a queue; the sync path only persists.
Cache stampede. Hot key TTL expires → all concurrent readers miss simultaneously → DB gets the full uncached load. Fix: single-flight (one loader, others wait), probabilistic early refresh, or background refresh for known hot keys.
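The single-flight fix for the hot-key expiry case above can be sketched with a lock and an event per in-flight key. This is an illustrative implementation, not a library API; the `shared` counter exists only so the demo can hold the load open until the other callers have joined.

```python
import threading
import time


class SingleFlight:
    """Collapse concurrent loads of the same key into one loader call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> (done event, result holder)
        self.shared = 0       # calls that piggybacked on another caller's load

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
            else:
                self.shared += 1
        done, box = entry
        if leader:
            try:
                box["value"] = loader()
            except Exception as exc:       # followers see the same failure
                box["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()
        if "error" in box:
            raise box["error"]
        return box["value"]


sf = SingleFlight()
calls = []
results = []


def load_from_db():
    calls.append(1)
    # Demo only: keep the load open until the other 7 callers have joined.
    deadline = time.monotonic() + 5
    while sf.shared < 7 and time.monotonic() < deadline:
        time.sleep(0.01)
    return {"id": 42}


threads = [
    threading.Thread(target=lambda: results.append(sf.do("task:42", load_from_db)))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls), len(results))
```

Eight readers miss at once but the database sees one load. This only deduplicates within one process; across app instances a short-lived lock in the cache plays the same role.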
Decision table
When to use request-response vs neighbouring patterns
| Dimension | Request-Response | Read-Heavy | Write-Heavy | Event-Driven |
|---|---|---|---|---|
| Read:write ratio | 1:1 to 10:1 | 100:1+ | 1:100+ | varies |
| Latency SLO | 100-500ms | <50ms reads | writes can queue | eventual OK |
| User waits? | Yes, sync | Yes, cached | No, async ACK | No |
| Typical infra | LB+app+cache+DB | CDN+cache+replicas | Queue+batch+shard | Broker+consumers |
| Complexity cost | Low | Medium | Medium-High | High |
| Best for | CRUD APIs, SaaS | Catalogs, feeds | IoT, logging | Workflows, fan-out |
- Most systems start as request-response and graduate one axis at a time.
- Hybrid is common: request-response core with read-heavy or async subsystems.
Worked example
Worked example: B2B SaaS API for project management
Scenario. You're designing the backend API for a B2B project management tool — think Basecamp-lite. 500 companies, ~10,000 users, CRUD for projects, tasks, comments, and file attachments. SLO: p99 < 200ms. Availability target: 99.9%.
Step 1: Recognise the pattern. This is textbook request-response. Users send HTTP requests, wait for a response, and see the result. No real-time collaboration (v1), no fan-out, no long-running processing. Read:write ratio is ~5:1 — moderate, not extreme.
Client → LB → App → Cache / DB. The boring right answer for 70% of APIs.
Step 2: Start with the monolith. One Node.js (or Rails, or Django) app. One Postgres instance (db.r6g.large — 2 vCPU, 16 GB RAM). Connection pool at 20 per app instance. One app instance to start; scale to 2-3 behind a load balancer when you need multi-AZ availability.
Schema: normalised tables for organisations, projects, tasks, comments, users. Foreign keys everywhere. Indexes on every foreign key and every column used in WHERE clauses. Composite index on (project_id, created_at, id) for the task list endpoint, so it matches the (created_at, id) pagination cursor.
Step 3: Add cache where it matters. The dashboard endpoint (GET /projects with task counts) is the hottest endpoint — every user hits it on login. Add Redis cache-aside with 60s TTL. Cache key: dashboard:{org_id}. Invalidate on any project or task mutation for that org. Expected hit rate: 90%+ because users check the dashboard repeatedly without changes.
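To make the key scheme and invalidation rule concrete, here is a minimal sketch with an in-memory TTL cache standing in for Redis. `TTLCache`, `get_dashboard`, and `on_project_or_task_mutation` are hypothetical names; the `dashboard:{org_id}` key and the 60s TTL come from the step above.

```python
import time


class TTLCache:
    """In-memory stand-in for Redis with hit/miss counters."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}       # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[0] > self._clock():
            self.hits += 1
            return entry[1]
        self.misses += 1
        return None

    def set(self, key, value, ttl):
        self._data[key] = (self._clock() + ttl, value)

    def delete(self, key):
        self._data.pop(key, None)


def get_dashboard(cache, org_id, load_from_db, ttl=60):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    key = f"dashboard:{org_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(org_id)
    cache.set(key, value, ttl)
    return value


def on_project_or_task_mutation(cache, org_id):
    """Call from every write path that changes what the dashboard shows."""
    cache.delete(f"dashboard:{org_id}")


now = [0.0]                                   # fake clock for the demo
cache = TTLCache(clock=lambda: now[0])
loads = []


def load(org_id):
    loads.append(org_id)
    return {"org": org_id, "projects": 3}


get_dashboard(cache, 7, load)                 # miss, loads from DB
get_dashboard(cache, 7, load)                 # hit
on_project_or_task_mutation(cache, 7)         # a write invalidates the key
get_dashboard(cache, 7, load)                 # miss again
print(cache.hits, cache.misses, len(loads))
```

Tracking hits and misses per endpoint is what lets you check the 90%+ expectation against real traffic instead of assuming it.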
Step 4: Pagination everywhere. GET /tasks?project_id=X returns cursor-paginated results. Cursor is (created_at, id) base64-encoded. Default page size: 50. Max page size: 200. The DB query: WHERE project_id = $1 AND (created_at, id) > ($2, $3) ORDER BY created_at, id LIMIT 51 — fetch one extra to determine has_more.
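The cursor query can be exercised end to end with SQLite, which supports the same row-value comparison in recent versions (3.15 or later). This is a sketch under that assumption; the table layout and sample rows are made up for the demo, and timestamps are stored as ISO strings so they sort lexicographically.

```python
import base64
import json
import sqlite3


def encode_cursor(created_at: str, task_id: int) -> str:
    raw = json.dumps([created_at, task_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> tuple:
    created_at, task_id = json.loads(base64.urlsafe_b64decode(cursor))
    return created_at, task_id


def list_tasks(conn, project_id, cursor=None, limit=50):
    limit = max(1, min(limit, 200))           # clamp to the max page size
    if cursor is None:
        rows = conn.execute(
            "SELECT id, created_at, title FROM tasks WHERE project_id = ? "
            "ORDER BY created_at, id LIMIT ?",
            (project_id, limit + 1),
        ).fetchall()
    else:
        ts, last_id = decode_cursor(cursor)
        rows = conn.execute(
            "SELECT id, created_at, title FROM tasks WHERE project_id = ? "
            "AND (created_at, id) > (?, ?) ORDER BY created_at, id LIMIT ?",
            (project_id, ts, last_id, limit + 1),
        ).fetchall()
    has_more = len(rows) > limit              # the extra row signals another page
    items = rows[:limit]
    next_cursor = encode_cursor(items[-1][1], items[-1][0]) if has_more else None
    return items, next_cursor, has_more


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, project_id INTEGER, "
    "created_at TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO tasks (project_id, created_at, title) VALUES (?, ?, ?)",
    [(1, "2024-01-01T00:00:00Z", f"task {i}") for i in range(3)]
    + [(1, "2024-01-02T00:00:00Z", "task 3"), (1, "2024-01-02T00:00:00Z", "task 4")],
)

seen, cur = [], None
while True:
    items, cur, more = list_tasks(conn, 1, cur, limit=2)
    seen.extend(row[0] for row in items)
    if not more:
        break
print(seen)
```

Including `id` in the cursor is what keeps paging correct when several tasks share the same `created_at`, as the first three rows do here.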
Step 5: Idempotency on mutations. POST /tasks accepts an Idempotency-Key header. The middleware checks Redis; if the key exists, return the cached response. If not, process and cache. TTL: 24 hours. This protects against mobile clients retrying on flaky connections.
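The check-then-process flow can be sketched without any web framework. Here an in-memory store stands in for Redis, and `IdempotencyStore` and `post_task` are hypothetical names used only for illustration.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for Redis: key -> (expires_at, response)."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self.clock():
            self._data.pop(key, None)     # drop expired entries lazily
            return None
        return entry[1]

    def put(self, key, response):
        self._data[key] = (self.clock() + self.ttl, response)


tasks = []                                # stand-in for the tasks table


def post_task(store, idempotency_key, payload):
    cached = store.get(idempotency_key)
    if cached is not None:
        # Replay the stored response instead of creating a second task.
        return {**cached, "replayed": True}
    tasks.append(payload)
    response = {"status": 201, "task_id": len(tasks)}
    store.put(idempotency_key, response)
    return response


store = IdempotencyStore()
first = post_task(store, "key-123", {"title": "Ship v1"})
retry = post_task(store, "key-123", {"title": "Ship v1"})
print(first["task_id"] == retry["task_id"], len(tasks))
```

This sketch checks and then writes in two steps, so two truly concurrent retries could both pass the check. With Redis, claiming the key atomically first (for example `SET` with the `NX` option) is one way to close that gap.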
Step 6: Instrument and wait. Structured logging with request IDs. Per-endpoint latency histograms. DB query count per request (alert if > 10). Slow query log at 100ms threshold. Cache hit rate per endpoint. Do NOT add replicas, sharding, or Kafka until a metric tells you to.
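The "DB query count per request" metric is easy to prototype. The sketch below wraps a SQLite connection in a hypothetical `CountingDB` helper and compares an N+1 version of the dashboard query with a single JOIN; the budget of 10 is the alert threshold named above, and the sample data is invented.

```python
import sqlite3

QUERY_BUDGET = 10   # alert threshold from the instrumentation step


class CountingDB:
    """Thin wrapper that counts queries issued during one request."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def dashboard_n_plus_one(db):
    """One query for the list, then one more per project."""
    projects = db.query("SELECT id, name FROM projects ORDER BY id")
    return {
        name: db.query(
            "SELECT COUNT(*) FROM tasks WHERE project_id = ?", (pid,)
        )[0][0]
        for pid, name in projects
    }


def dashboard_joined(db):
    """Same result in a single query."""
    rows = db.query(
        "SELECT p.name, COUNT(t.id) FROM projects p "
        "LEFT JOIN tasks t ON t.project_id = p.id "
        "GROUP BY p.id ORDER BY p.id"
    )
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, project_id INTEGER, title TEXT);
    INSERT INTO projects (name) VALUES ('a'), ('b'), ('c');
    INSERT INTO tasks (project_id, title) VALUES (1, 't1'), (1, 't2'), (2, 't3');
""")

slow, fast = CountingDB(conn), CountingDB(conn)
slow_result = dashboard_n_plus_one(slow)
fast_result = dashboard_joined(fast)
print(slow.queries, fast.queries, slow_result == fast_result)
```

With 3 projects the N+1 version stays under the budget, which is exactly why it slips through in development; with 50 projects it would issue 51 queries and trip the alert.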
When to evolve. At 500 companies and 10,000 users, the monolith on a single Postgres instance will use maybe 5% of its capacity. You'll likely need a read replica when you hit 5,000 companies (50,000 users) — and even that depends on usage patterns. If you add real-time task updates (v2), extract that to a WebSocket layer. If you add file processing (thumbnails, previews), move that to async workers. Each evolution is driven by a measured trigger, not by architecture FOMO.
Interview playbook
When it comes up
- Standard CRUD API design — "design the backend for X"
- Any SaaS or admin console prompt without extreme scale
- Interviewer says "start simple" or "walk me through the basics first"
- Requirements mention REST, HTTP APIs, or standard web backends
Order of reveal
1. Identify the pattern. This is synchronous request-response. User sends request, waits for response. No streaming, no push, no long-running work. The boring default is the right answer.
2. Draw the skeleton. LB in front of stateless app instances. Postgres as the primary store. Redis cache-aside on hot endpoints. This handles thousands of TPS.
3. Name the API contract. Resource-oriented URLs, cursor-based pagination on every list endpoint, Idempotency-Key on retry-able POSTs, consistent JSON envelope with request IDs.
4. Show connection pooling. pgBouncer between app and DB. 25 connections per app instance, pgBouncer collapses to 50 backend connections. Without this, scaling the app tier hits max_connections.
5. Add cache-aside. Redis cache-aside with 60s TTL on hot endpoints. Invalidate on write. Target 80%+ hit rate. TTL is the staleness budget from the requirements.
6. Name the scaling path. When primary CPU exceeds 60% and reads dominate: add read replicas. When side-effects slow the response: add a queue for async work. Each step is triggered by a specific metric.
7. Name graduation triggers. This pattern stops being enough when: push is needed (→ real-time), reads are 100:1 (→ read-heavy), writes burst (→ write-heavy), work > 1s (→ async), one trigger fans to N (→ event-driven). I'll name the trigger before I leave this pattern.
Signature phrases
- “Boring is a feature” — Shows you resist complexity tax and know when simple is right.
- “Paginate from day one” — Demonstrates API discipline — every senior engineer has debugged an unpaginated endpoint.
- “Idempotency-Key on retry-able mutations” — Shows awareness of real-world failure modes — networks are unreliable.
- “TTL is the staleness budget made concrete” — Connects cache design to product requirements, not arbitrary numbers.
- “Graduate when a number tells you to” — Shows metric-driven architecture — the opposite of resume-driven development.
- “pgBouncer between app and DB is non-negotiable” — Names a specific operational practice that many candidates skip.
Likely follow-ups
“Why not microservices?”
Because this workload doesn't justify the coordination cost. Microservices add network calls, distributed tracing, deployment orchestration, and contract management. For a standard CRUD API, those costs buy you nothing. I'd extract a service only when a specific component has different scaling characteristics or a different team owns it.
“What if the DB becomes the bottleneck?”
First: identify whether reads or writes are the bottleneck. If reads: add Redis cache-aside (usually the first move) and read replicas. If writes: optimise queries and indexes first, then consider write-ahead batching, and finally sharding by tenant if nothing else works. Measure before acting.
“How do you handle cache invalidation?”
Three layers: (1) explicit invalidation in every write handler, (2) short TTLs as a safety net (30s-2min), (3) CDC-based invalidation for write paths the app team didn't write. The honest answer is all three, layered. No single approach is reliable alone.
“How do you ensure consistency between cache and DB?”
Cache-aside with invalidation gives eventual consistency bounded by TTL. For read-your-writes after a user's own mutation: either write-through the cache or pin reads to primary for 5 seconds after a write (write-recency cookie). I'd document which approach we use and make it consistent across the team.
“What about rate limiting?”
Token bucket per API key or per user ID on every endpoint. Return 429 with Retry-After header. Implement at the LB or API gateway level — not in the app. Rate limiting protects the DB from misbehaving clients and batch jobs, not just external abuse.
“When do you move to async?”
When side-effects in the request path push p95 past the SLO. The sync path should only do validate → persist → respond. Everything else — emails, webhooks, analytics, third-party calls — goes to a queue. The trigger is latency, measured at the p95, exceeding the SLO.
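The token-bucket answer to the rate-limiting follow-up above can be sketched in a few lines. In practice this would usually be a gateway or load-balancer feature rather than hand-written code; `TokenBucket` and `check` are hypothetical names, and the rate and capacity values are placeholders.

```python
import math
import time


class TokenBucket:
    """Refill at `rate` tokens per second up to `capacity`; spend one per request."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        # Seconds until one full token is available, rounded up.
        return False, math.ceil((1 - self.tokens) / self.rate)


buckets = {}   # api_key -> TokenBucket


def check(api_key, rate=5, capacity=10, clock=time.monotonic):
    """Return (status_code, headers) for one request from `api_key`."""
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(rate, capacity, clock)
    allowed, retry_after = buckets[api_key].allow()
    if allowed:
        return 200, {}
    return 429, {"Retry-After": str(retry_after)}


now = [100.0]                                   # fake clock for the demo
fake_clock = lambda: now[0]
statuses = [check("key-a", rate=1, capacity=2, clock=fake_clock)[0] for _ in range(3)]
print(statuses)
```

The capacity sets how large a burst is tolerated and the rate sets the sustained limit, so the two can be tuned separately per API key or user.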
Code snippets
# Cache-aside decorator with Redis (invalidate separately on writes)
import functools
import json
from typing import Any, Callable

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def cache_aside(prefix: str, ttl: int = 60):
    """Decorator: cache-aside with Redis. Invalidate separately on writes."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs) -> Any:
            # The key uses positional args only, so call the wrapped function
            # positionally or the key will not match the invalidation key.
            key = f"{prefix}:{':'.join(str(a) for a in args)}"
            cached = r.get(key)
            if cached is not None:
                return json.loads(cached)
            result = fn(*args, **kwargs)
            r.setex(key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator


@cache_aside("user", ttl=120)
def get_user(user_id: int) -> dict:
    # DB query here
    return {"id": user_id, "name": "Alice"}


def update_user(user_id: int, data: dict):
    # DB write here
    r.delete(f"user:{user_id}")  # invalidate

# Idempotency-Key middleware (FastAPI / Starlette)
import json
import redis
from fastapi import Request, Response
from starlette.middleware.base import BaseHTTPMiddleware

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
IDEMP_TTL = 86400  # 24 hours


class IdempotencyMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        if request.method != "POST":
            return await call_next(request)
        key = request.headers.get("Idempotency-Key")
        if not key:
            return await call_next(request)
        cache_key = f"idemp:{key}"
        # The sync Redis client blocks the event loop; acceptable in a
        # sketch, but an async client is the better fit in production.
        cached = r.get(cache_key)
        if cached is not None:
            data = json.loads(cached)
            return Response(
                content=data["body"],
                status_code=data["status"],
                media_type=data.get("media_type"),
                headers={"X-Idempotent-Replay": "true"},
            )
        response = await call_next(request)
        body = b""
        async for chunk in response.body_iterator:
            body += chunk
        # Do not cache server errors, so that a retry can still succeed.
        if response.status_code < 500:
            r.setex(cache_key, IDEMP_TTL, json.dumps({
                "status": response.status_code,
                "body": body.decode(),
                "media_type": response.headers.get("content-type"),
            }))
        return Response(
            content=body,
            status_code=response.status_code,
            headers=dict(response.headers),
        )

# Cursor-paginated list endpoint (FastAPI)
import base64
import json
from typing import Optional

from fastapi import FastAPI, Query

app = FastAPI()
# `db` is assumed to be a query helper that returns a list of dict-like rows;
# it is not defined in this snippet.


def encode_cursor(created_at, id: int) -> str:
    # Accept a datetime or a string; datetimes are not JSON-serialisable as-is.
    ts = created_at.isoformat() if hasattr(created_at, "isoformat") else created_at
    return base64.urlsafe_b64encode(
        json.dumps({"ts": ts, "id": id}).encode()
    ).decode()


def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor))


@app.get("/tasks")
async def list_tasks(
    project_id: int,
    cursor: Optional[str] = Query(None),
    limit: int = Query(50, ge=1, le=200),
):
    if cursor:
        c = decode_cursor(cursor)
        rows = db.execute(
            """SELECT * FROM tasks
               WHERE project_id = %s
                 AND (created_at, id) > (%s, %s)
               ORDER BY created_at, id
               LIMIT %s""",
            (project_id, c["ts"], c["id"], limit + 1),
        )
    else:
        rows = db.execute(
            """SELECT * FROM tasks
               WHERE project_id = %s
               ORDER BY created_at, id
               LIMIT %s""",
            (project_id, limit + 1),
        )
    # Fetching limit + 1 rows tells us whether another page exists.
    has_more = len(rows) > limit
    items = rows[:limit]
    next_cursor = (
        encode_cursor(items[-1]["created_at"], items[-1]["id"])
        if has_more else None
    )
    return {"data": items, "next_cursor": next_cursor, "has_more": has_more}

# App-side connection pool pointed at pgBouncer (psycopg2)
from psycopg2 import pool
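from typing import Optional

from psycopg2.extras import RealDictCursor

# App-side pool: 5 min, 20 max connections per app instance
# With 5 app instances → 100 app-side conns → pgBouncer → 50 real conns
db_pool = pool.ThreadedConnectionPool(
    minconn=5,
    maxconn=20,
    host="pgbouncer.internal",  # point at pgBouncer, not Postgres
    port=6432,
    dbname="myapp",
    user="app_user",
    password="...",
    # 5s query timeout. pgBouncer may reject the `options` startup parameter
    # unless it is listed in ignore_startup_parameters; setting
    # statement_timeout on the database role is an alternative.
    options="-c statement_timeout=5000",
)


def get_user(user_id: int) -> Optional[dict]:
    conn = db_pool.getconn()
    try:
        # RealDictCursor returns rows as dicts; the default cursor returns tuples,
        # which dict(row) cannot convert.
        with conn.cursor(cursor_factory=RealDictCursor) as cur:
            cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            row = cur.fetchone()
            return dict(row) if row else None
    finally:
        db_pool.putconn(conn)  # always return to pool

-- Core tables for a B2B SaaS project management API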
CREATE TABLE organisations (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE projects (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    org_id     BIGINT NOT NULL REFERENCES organisations(id),
    name       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Index for listing projects by org (cursor pagination)
CREATE INDEX idx_projects_org_created
    ON projects(org_id, created_at, id);

CREATE TABLE tasks (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    project_id BIGINT NOT NULL REFERENCES projects(id),
    title      TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'open'
               CHECK (status IN ('open', 'in_progress', 'done')),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Index for listing tasks by project (cursor pagination)
CREATE INDEX idx_tasks_project_created
    ON tasks(project_id, created_at, id);

-- Index for filtering by status within a project
CREATE INDEX idx_tasks_project_status
    ON tasks(project_id, status);

-- Idempotency key tracking (alternative to Redis)
CREATE TABLE idempotency_keys (
    key        TEXT PRIMARY KEY,
    response   JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Auto-expire old keys (run pg_cron daily)
-- DELETE FROM idempotency_keys WHERE created_at < now() - interval '48 hours';
Drills
When do you add a cache to a request-response service?
When DB CPU exceeds 50% sustained and a subset of reads are hot (80/20 rule). Not "because caching is best practice." Measure first: if the DB is at 12% CPU, Redis just adds invalidation tax for zero benefit.
Why cursor pagination over offset?
Offset is O(N): the DB scans and discards N rows before returning the page. Cursor seeks straight to the start of the page via the index (O(log N) for a B-tree), so the cost does not grow with page depth. Offset also breaks under concurrent writes (rows shift). Cursor is stable regardless of inserts/deletes.
A client retries a POST and creates a duplicate order. How do you prevent this?
Idempotency-Key header. Client sends a UUID; server checks Redis. If found, return cached response. If new, process and cache. TTL 24-48h. This makes POST safe to retry without double-execution.
Your service has 10 app instances each opening 25 DB connections. What happens?
250 connections to Postgres, which exceeds the default max_connections of 100 and is more than most small instances are tuned for. Put pgBouncer in front; it multiplexes 250 app connections into 50 real backend connections. Without it, connection exhaustion → cascading timeouts.
Your team wants to add Kafka to a CRUD API. What do you say?
Kafka is right for event streams, log ingestion, and cross-service integration. It's wrong for "user updates profile, app saves to DB, returns 200." That's a synchronous operation. Adding Kafka here replaces a DB call with infrastructure that takes months to operate. Use Kafka when the coupling or volume justifies it.
Name three signals that you should graduate off request-response.
(1) User needs push updates without polling → real-time pattern. (2) Reads dominate 100:1 and the working set is huge → read-heavy pattern. (3) A single request triggers many downstream effects that should be independently retryable → event-driven pattern. Each is a specific, measurable trigger.