Real-time delivery
When to reach for this
Reach for this when…
- Server-initiated updates to the client
- Freshness budget measured in seconds or less
- Many concurrent clients watching the same data
- Updates arrive irregularly, not on a fixed schedule
- Bidirectional communication between client and server
- Presence or typing indicators required
Not really this pattern when…
- User-initiated refresh is acceptable — a simple GET with ETag works
- Updates are batched and rare (daily digest) — webhook or email is cheaper
- The client is a backend service — use a message queue, not WebSocket
Good vs bad answer
Interviewer probe
“Design the real-time delivery for a chat app with 10M daily users.”
Weak answer
"I'd use WebSocket between clients and the server. When a message arrives, the server pushes it to the recipient's WebSocket connection. For presence, I'd track who's connected and broadcast status changes."
Strong answer
"WebSocket from clients to an edge tier of ~60 stateless servers behind a sticky L4 load balancer. Each edge holds ~50K concurrent connections. The connection budget: 10M DAU × 30% peak online = 3M concurrent, divided by 50K per edge = 60 servers, provisioned for 4M to handle spikes.
Messages follow a dual-write path: the chat API writes to Postgres first (durability), then PUBLISHes to a Redis pub/sub channel keyed by conversation_id. Each edge subscribes to channels for the conversations of its connected clients. On publish, Redis routes the event to subscribing edges, which forward it over WebSocket to the relevant clients.
Reconnect: the client tracks last_seen_id. On reconnect, it sends the id; the edge replays missed messages from a Redis ZSET cache (last 15 minutes per conversation), falling back to Postgres for longer gaps. The edge subscribes to the pub/sub channel before replaying, buffers incoming messages, and deduplicates by id to prevent double delivery.
Presence: heartbeat 30s, Redis key with 60s TTL. Expiry triggers a status_changed event to contacts via pub/sub. Typing indicators are ephemeral — fire-and-forget on the bus, no persistence.
Mobile: ping 25s for NAT keepalive. When backgrounded, switch to APNs/FCM push notifications. On foreground, reconnect and replay the gap.
Deploy safety: connection draining with 30s GOAWAY, client jittered backoff (100ms base, 0-1s jitter, 30s cap), load balancer admission control at 500 new connections/second per edge."
Why it wins: Names the transport, sizes the edge tier with math, describes the dual-write durability path, handles reconnect with dedup, covers presence and typing, addresses mobile keepalive, and mitigates thundering herd on deploy.
Cheat sheet
- •Four transports: poll (>5s tolerance), long-poll (medium), SSE (one-way push), WebSocket (bidirectional).
- •Default to SSE; upgrade to WebSocket only if the client sends at high frequency.
- •Connection budget: peak_concurrent = DAU × peak_online_fraction. Size the edge tier from this.
- •50K idle WebSocket connections per edge server is a practical ceiling (memory-bound).
- •Edge tier + pub/sub bus = clean horizontal scale. Sticky sessions alone do not scale.
- •Dual-write: DB first (durability), pub/sub second (speed). Pub/sub is ephemeral.
- •Reconnect: client sends last_seen_id, server replays from cache/DB, dedup by id.
- •Subscribe to pub/sub before replaying from DB to prevent gaps.
- •Presence: heartbeat 30s, Redis key TTL 60s, broadcast status_changed on expiry.
- •Typing indicators are ephemeral — never persist, never replay on reconnect.
- •Mobile: ping 25s (NAT timeout ~60s). Push notification fallback when backgrounded.
- •Deploy safety: GOAWAY + 30s drain + jittered backoff + admission control.
Core concept
Every system-design interview that involves server-initiated updates boils down to one question: which transport do you pick, and why? Most candidates jump straight to WebSocket because it sounds sophisticated. That instinct is often wrong — and interviewers know it. The right answer depends on four variables: direction (one-way or two-way), update frequency (seconds vs minutes), connection budget (how many persistent connections the edge tier can afford), and network reality (corporate proxies, mobile NAT, battery).
The four transports, in ascending order of operational complexity:
Short-poll — the client hits GET /updates?since=<ts> on a fixed interval, typically every 5-10 seconds. The server responds immediately with whatever is new (often nothing). This is the simplest possible approach and is correct for dashboards, leaderboards, or any UI where a 5-second delay is invisible. The cost is wasted requests: if updates arrive once per minute, 11 out of 12 polls return empty. At 100K clients polling every 5 seconds, that is 20K requests per second — all returning 304 or empty 200s.
Long-poll — the client sends a GET, the server holds the connection open until an event arrives or a timeout (typically 30 seconds) expires. The client immediately re-requests after each response. Latency is near-instant for the first event after the request; subsequent events queue until the next request cycle. The main cost is connection count: each waiting client holds one HTTP connection open on the server. At 50K concurrent clients, that is 50K threads or coroutines, which is manageable on modern async servers but starts hurting at 500K.
Server-Sent Events (SSE) — a single long-lived HTTP response (text/event-stream) over which the server pushes events. The protocol is dead simple: each event is a few lines of text (`event: msg data: {...} id: 42
). The browser's EventSource API handles reconnection automatically and sends the last received id as Last-Event-ID` on reconnect, so the server can replay the gap. SSE is one-way: server-to-client only. If the client needs to send data back, it uses normal HTTP requests alongside the SSE stream. This is the right choice for notification feeds, live comment streams, and any scenario where the client mostly listens.
WebSocket — a full-duplex, persistent connection upgraded from HTTP. Both sides can send frames at any time. This is the only option when the client needs to send at high frequency (chat messages, cursor positions, game inputs). The operational cost is highest: the load balancer must support long-lived connections, the edge tier must track connection state, and horizontal scaling requires a pub/sub bus to route events across edge servers.
Clients hold persistent connections to edge servers; a pub/sub bus routes events to the correct edge based on subscription.
The architecture that emerges at scale is the edge-tier + pub/sub bus pattern. Clients connect to a fleet of stateless edge servers via WebSocket (or SSE). Each edge server subscribes to pub/sub channels corresponding to the rooms or conversations its connected clients belong to. When a publisher emits an event, the pub/sub bus (Redis, NATS, or Kafka) routes it to every subscribing edge, which forwards it to the relevant connected clients. This decouples the publish path from the connection path — the publisher does not need to know which edge server holds which client.
Connection budget math is the interview differentiator. Take a chat app with 10M DAU. If 30% are online at peak, that is 3M concurrent connections. A single edge server on modern hardware handles ~50K idle WebSocket connections (the bottleneck is memory for connection state, not CPU). That means 3M / 50K = 60 edge servers at peak. This is the sizing number interviewers want to hear — not "we use WebSocket" but "we need 60 edge servers in the connection tier, each subscribing to the pub/sub channels of its connected users."
Real-time delivery is not durable delivery. The pub/sub bus is ephemeral — if a client is disconnected when an event fires, the event is lost from the client's perspective. Durability requires a separate write path: the publisher writes the message to a persistent store (Postgres, DynamoDB) first, then publishes to the bus. On reconnect, the client sends its last-seen cursor and the server replays the gap from the store. This dual-write pattern — durable store + ephemeral pub/sub — is the canonical architecture for chat, notifications, and any system where missing a message is unacceptable.
Cross-references: the producer-consumer pattern covers the durable queue side of message delivery; the fan-out pattern covers pub/sub fan-out at the data layer. Real-time delivery is the transport layer that sits on top of both.
Canonical examples
- →Chat / messaging (Slack, Discord, WhatsApp)
- →Presence / typing indicators
- →Live scores / sports tickers
- →Collaborative editors (Figma, Google Docs)
- →Trading / price tickers
Variants
Short-poll
Client polls on a fixed interval. Simplest transport; correct when latency tolerance exceeds the poll period.
Client polls GET /updates every 5 seconds. Simple but wasteful when there are no updates.
Short-polling is the right answer more often than candidates think. If the product can tolerate a 5-10 second update delay — dashboards, leaderboards, inventory counts, delivery tracking — polling eliminates every operational headache of persistent connections: no WebSocket upgrade negotiation, no connection state tracking, no pub/sub bus, no reconnect logic, no mobile keepalive.
The implementation is one HTTP endpoint: GET /updates?since=<timestamp>&limit=50. The server queries the database (or cache) for rows newer than the timestamp, returns them with the latest timestamp for the next poll. The client sets a setInterval and fires the request every N seconds. HTTP caching (ETag, 304 Not Modified) eliminates redundant payload transfer when nothing has changed.
Client polls GET /updates every 5 seconds. Simple but wasteful when there are no updates.
The failure mode is obvious: wasted requests. At 100K clients polling every 5 seconds, the server handles 20K req/s even when nothing is happening. This is fine for a CDN-cacheable endpoint (all clients get the same leaderboard) but ruinous for per-user endpoints (each client has a different since value, so caching doesn't help). The break-even point is roughly: if the update frequency is higher than 1 update per poll interval, polling becomes efficient because every response carries data. If updates are rare (1/minute), 90%+ of polls are empty.
When to choose: dashboards with >5s freshness tolerance, public data that can be CDN-cached, systems where simplicity outweighs latency, or as a last-resort fallback when SSE and WebSocket are blocked.
When to move on: latency requirement drops below the poll interval, or per-user poll traffic exceeds the server's request budget.
Server-Sent Events (SSE)
One-way server push over a single HTTP stream with built-in reconnect and Last-Event-ID.
Server pushes events over a single long-lived HTTP response. Client reconnects automatically with Last-Event-ID.
SSE is the most underused transport in system design interviews. Candidates skip it because they think WebSocket is strictly better, but SSE wins on three axes: simplicity, proxy compatibility, and built-in reconnection.
The protocol is a single HTTP response with Content-Type: text/event-stream. The server writes events as newline-delimited text: `event: message data: {"text":"hello"} id: 42
. The browser's EventSource API handles parsing, buffering, and automatic reconnection. On disconnect, the browser reconnects and sends Last-Event-ID: 42` as a header, allowing the server to replay missed events from a durable store.
Server pushes events over a single long-lived HTTP response. Client reconnects automatically with Last-Event-ID.
SSE works through HTTP/2 multiplexing, which means multiple SSE streams can share a single TCP connection — eliminating the "6 connections per domain" limit that plagued HTTP/1.1 SSE. It works through corporate proxies that block WebSocket upgrades. It works through CDNs that support streaming responses (most do). The only thing it cannot do is send data from client to server on the same connection — the client uses normal POST/PUT requests alongside the SSE stream.
This makes SSE the correct choice for: notification feeds (server → client only), live comment streams, stock tickers, CI/CD build logs, and any scenario where the client is a consumer, not a producer. Even in chat applications, SSE can handle the downstream path (receiving messages) while regular HTTP handles the upstream path (sending messages) — but at that point, WebSocket's single-connection model is simpler.
When to choose: one-directional server push, need proxy compatibility, want built-in reconnect without custom client code, or when HTTP/2 is available and you need multiple event streams.
When to move on: client sends data at high frequency (typing indicators, cursor positions), or the product requires true bidirectional streaming.
WebSocket
Full-duplex persistent connection. Required for bidirectional, high-frequency communication.
Full-duplex persistent connections. Edge tier fans out via pub/sub bus.
WebSocket is the right answer when the client and server both need to send data at high frequency on the same connection. Chat messages, collaborative editing cursors, multiplayer game state, real-time trading — these all require bidirectional streaming that SSE cannot provide.
The connection starts as an HTTP upgrade (Upgrade: websocket). Once upgraded, both sides send binary or text frames with minimal overhead (2-14 bytes per frame vs ~100 bytes for HTTP headers). The connection is persistent — no per-message handshake, no header repetition, no HTTP parsing overhead.
Full-duplex persistent connections. Edge tier fans out via pub/sub bus.
The operational cost is where most candidates underestimate WebSocket. Every persistent connection is state that must be tracked: which user is on which edge server, which rooms they have joined, what their last-seen message is. The load balancer must support long-lived connections (L4 sticky by connection, not L7 round-robin per request). Horizontal scaling requires a pub/sub bus — when a message arrives at one edge server, it must be forwarded to all other edge servers that hold connections from the recipient's conversation members.
Connection lifecycle adds complexity: heartbeat/ping frames to detect dead connections, reconnect with gap replay, graceful shutdown during deploys (drain connections before killing the process), and admission control to prevent thundering herds.
The edge tier architecture that emerges: N stateless edge servers, each holding ~50K connections, subscribing to a Redis pub/sub bus. The publisher writes to the database, then publishes to the bus. Each edge receives events for the channels its clients are subscribed to, and forwards them over the WebSocket. Adding capacity = adding edge servers and subscribing them to the bus.
When to choose: bidirectional communication (chat, collaborative editing, gaming), sub-100ms latency requirement, or when the client sends data at high frequency.
When to move on: there is no simpler transport beyond this — WebSocket is the terminal transport. Evaluate whether you truly need bidirectional or if SSE + HTTP POST would suffice.
Hybrid with transport fallback
WebSocket primary, SSE fallback (corporate proxies), short-poll last resort. Feature detection at connect time.
In production, no single transport works for every client. Corporate proxies block WebSocket upgrades. Some CDNs buffer SSE responses. Older mobile browsers lack EventSource support. The production answer is a transport negotiation protocol that tries the best option first and falls back gracefully.
The negotiation sequence: (1) attempt WebSocket upgrade — if the handshake succeeds within 5 seconds, use WebSocket for the session. (2) If the upgrade fails (HTTP 400, proxy rejection, timeout), fall back to SSE — open an EventSource for downstream and use HTTP POST for upstream. (3) If SSE also fails (response buffered, no events received within 10 seconds), fall back to long-poll. (4) If long-poll also fails (unlikely), fall back to short-poll at 5-second intervals.
The server exposes a /negotiate endpoint that returns the supported transports and their URLs. The client tries them in order. Once a transport is established, the session is pinned to it — no mid-session switching, which would require state synchronisation. If the connection drops, the client re-negotiates from step 1.
Socket.IO popularised this pattern (though its implementation conflates transport negotiation with application protocol). The key engineering insight is that the application layer should be transport-agnostic: messages are messages regardless of whether they arrive over WebSocket frames, SSE events, or poll responses. The transport layer handles framing, reconnection, and buffering; the application layer handles message routing and business logic.
Feature flags per client segment reduce complexity. If analytics show that 95% of clients successfully connect via WebSocket, the remaining 5% (enterprise customers behind aggressive proxies) can be routed to an SSE-only edge tier, avoiding the complexity of runtime negotiation for the majority.
When to choose: any production system with diverse client environments — enterprise customers behind proxies, mobile users on unreliable networks, or global users with varying ISP configurations.
Scaling path
V1: Short-poll, single server
Ship a working real-time feature with zero infrastructure beyond the app server.
Client polls GET /updates every 5 seconds. Simple but wasteful when there are no updates.
The simplest real-time delivery: client polls GET /updates?since=<ts> every 5 seconds. The server queries the database for new rows and returns them. No persistent connections, no pub/sub, no connection tracking. A single app server handles this with standard request/response routing.
This works for the first prototype and for low-traffic dashboards. At 1K concurrent clients polling every 5 seconds, that is 200 req/s — trivial. The latency floor is the poll interval: a message sent 0.1s after a poll waits 4.9s to be seen. For a chat app, this is unacceptable; for a delivery-tracking dashboard, it is invisible.
The transition trigger is clear: when the product demands sub-5-second latency, or when empty polls exceed 80% of traffic, short-poll becomes wasteful and you need a persistent-connection transport.
What triggers the next iteration
- Latency floor = poll interval — cannot deliver sub-second updates
- Wasted requests dominate when updates are sparse
- Per-user poll queries are not cacheable
- No server-initiated push — client must initiate every exchange
V2: Long-poll, single server
Near-instant delivery without the operational cost of persistent connections.
Client sends GET, server holds until an event arrives or a 30s timeout. Repeat.
Upgrade from short-poll: the client sends GET /updates, the server holds the connection open until either (a) a new event arrives, or (b) a 30-second timeout expires. The client immediately re-requests after each response. Latency drops from the poll interval to near-instant for the first event.
Implementation requires an async event loop: the server registers each waiting request in a lookup table keyed by user or room. When a new event arrives (via database trigger, internal pub, or polling), the server finds the matching requests and responds. Frameworks like Node.js, Go, and Python asyncio handle this naturally.
At 10K concurrent clients, this means 10K held connections. An async server handles this comfortably. At 100K+ concurrent connections, the memory and file-descriptor overhead starts to matter, and you need to move to a dedicated connection tier with SSE or WebSocket.
What triggers the next iteration
- Each waiting client holds one connection — 100K clients = 100K connections
- Timeout-and-retry creates periodic request storms
- No multiplexing — one event per request/response cycle
- Load balancers may terminate held connections prematurely
V3: SSE/WebSocket with edge tier + Redis pub/sub
Handle millions of concurrent connections with horizontal scaling.
Full-duplex persistent connections. Edge tier fans out via pub/sub bus.
The production architecture: clients connect via WebSocket (or SSE) to a fleet of stateless edge servers. Each edge subscribes to a Redis pub/sub bus for the rooms/channels its connected clients belong to. When a publisher emits an event, Redis routes it to every subscribing edge, which forwards it to the relevant clients.
Connection budget: 50K idle connections per edge server. For 3M concurrent connections (10M DAU × 30% online), that is 60 edge servers. Each edge is stateless — if it crashes, clients reconnect to any other edge and re-subscribe. The pub/sub bus is the coordination layer, not the edge.
Durability: the publisher writes to Postgres first, then publishes to Redis. If the pub/sub message is lost, the data is still in the DB. On reconnect, the client sends its last-seen cursor and the edge replays the gap from the database.
This architecture handles millions of concurrent connections, sub-100ms delivery latency, and survives edge failures. The next iteration adds multi-region edge deployment and global pub/sub.
What triggers the next iteration
- Single Redis pub/sub is a bottleneck above ~500K messages/sec
- All edge servers in one region — cross-region clients have high latency
- Deploy drops all connections on an edge — thundering herd on reconnect
- Presence requires a separate heartbeat system
V4: Multi-region edge with global pub/sub
Sub-50ms delivery for global users with presence and connection draining.
Deploy edge servers in every major region (US-East, EU-West, AP-Southeast). Each region has its own Redis pub/sub cluster. A global pub/sub tier (NATS JetStream or Kafka with MirrorMaker) replicates events across regions. Clients connect to the nearest edge via GeoDNS or Anycast.
Presence is promoted to a first-class service: a dedicated presence cluster receives heartbeats from all edge servers, maintains TTL-based online status in Redis, and broadcasts status changes to contacts via the pub/sub bus. Typing indicators are handled as ephemeral events — no persistence, no gap replay, just fire-and-forget over the bus.
Connection draining on deploy: before shutting down an edge server, it sends a GOAWAY frame to all connected clients, then waits 30 seconds for in-flight messages to drain. Clients reconnect with jittered backoff to prevent thundering herds. The load balancer's health check removes the draining server from the pool immediately, so new connections go elsewhere.
At this stage, the system handles tens of millions of concurrent connections across multiple regions, with sub-50ms delivery within a region and sub-200ms cross-region.
What triggers the next iteration
- Cross-region pub/sub replication lag adds 50-150ms for inter-region messages
- Global presence requires merging heartbeats across regions
- Operational cost of managing edge clusters in 3+ regions
- Client-side complexity for region failover when GeoDNS is stale
Deep dives
Transport choice: the decision matrix
Side-by-side view of the four real-time transport options and their trade-offs.
The transport decision is a function of four variables, and interviewers test whether you can navigate it rather than defaulting to WebSocket.
Direction: Does the client only receive (notifications, live scores), or does it also send at high frequency (chat messages, cursor positions)? If the client only receives, SSE is simpler and more proxy-compatible than WebSocket. If the client sends infrequently (one message per minute), SSE for downstream + HTTP POST for upstream is viable. WebSocket is only necessary when the client sends at high frequency.
Latency: What is the freshness budget? If 5-10 seconds is acceptable, short-poll is correct and eliminates all persistent-connection complexity. If sub-second is required, you need SSE or WebSocket. Long-poll sits in between — near-instant for the first event, then poll-interval latency for subsequent events within the same response cycle.
Connection cost: Every persistent connection consumes memory (connection state, buffers, kernel file descriptors). A server handles ~50K idle WebSocket connections before memory pressure becomes noticeable. At 10K concurrent users, connection cost is irrelevant. At 10M, it dominates the architecture. Short-poll has zero persistent connection cost; SSE and WebSocket have one connection per client.
Proxy compatibility: Corporate proxies routinely block WebSocket upgrades (they don't understand the Upgrade header). SSE works through any HTTP-compatible proxy. Short-poll works everywhere. If your users include enterprise customers behind restrictive proxies, you need a fallback strategy or SSE as the primary transport.
Side-by-side view of the four real-time transport options and their trade-offs.
The decision matrix: choose short-poll if latency tolerance > 5s and the data is cacheable. Choose SSE if the client is a consumer (server → client only) and you want built-in reconnection. Choose WebSocket if bidirectional streaming is required. Choose hybrid fallback in production when client environments are diverse.
Edge tier and pub/sub subscription management
N edge servers each holding K connections. Pub/sub routes events to the right edge. Adding a new edge = subscribe to channels.
The edge tier is the core architectural innovation that makes real-time delivery scale. Without it, every publisher needs to know which server holds which client — a coupling that makes horizontal scaling impossible.
Each edge server maintains a local subscription table: a map from channel/room id to a set of connected WebSocket handles. When a client connects and joins room "conv:123", the edge server adds the client to its local table and subscribes to the Redis pub/sub channel "conv:123". When a message arrives on that channel, the edge iterates its local subscribers and writes the message frame to each.
Adding an edge server is trivial: deploy it, register with the load balancer. It starts with zero subscriptions. As clients connect and join rooms, it subscribes to channels incrementally. Removing an edge: drain connections (GOAWAY), unsubscribe from all channels, deregister from the load balancer.
N edge servers each holding K connections. Pub/sub routes events to the right edge. Adding a new edge = subscribe to channels.
The subscription management has a subtle efficiency concern: if 10,000 clients on the same edge are in the same room, the edge receives one copy of each message from the pub/sub bus and fans it out locally to 10,000 WebSocket connections. This is efficient. But if each of those 10,000 clients is in a unique room, the edge subscribes to 10,000 channels. Redis pub/sub handles millions of channels with O(1) subscribe/unsubscribe, so this is fine — but the edge server now receives messages for 10,000 channels and must dispatch each to the correct client.
Connection count per edge is bounded by memory: each idle WebSocket connection costs ~5-10 KB (kernel buffers, userspace state). At 50K connections, that is 250-500 MB — well within a modern server's capacity. The CPU cost of message fan-out is negligible for idle connections; the bottleneck is the burst when a popular channel fires (e.g., a group chat with 1,000 members on the same edge). This is handled by buffering and batching writes: accumulate frames for 1-5ms, then write them in a single syscall (writev / scatter-gather I/O).
Reconnect and missed-message catch-up
Client disconnects, reconnects with last_seen_id, server replays the gap from durable store.
Networks drop. Mobile radios sleep. Deploys restart edge servers. Every real-time system must handle reconnection, and the critical question is: what about the messages the client missed while disconnected?
The protocol: every message has a monotonically increasing id (typically a database sequence or a Snowflake id). The client tracks the id of the last message it received. On reconnect, it sends last_seen_id as a parameter. The edge server (or a dedicated catch-up service) queries the message store for all messages with id > last_seen_id, replays them in order, then switches to live streaming from the pub/sub bus.
Client disconnects, reconnects with last_seen_id, server replays the gap from durable store.
The gap between "replay from DB" and "live from pub/sub" is the tricky part. While the server is querying the DB for missed messages, new messages may arrive on the pub/sub channel. If the server subscribes to the channel before replaying, some messages appear twice (once from DB replay, once from live). If it subscribes after replaying, some messages may be missed (arrived between the DB query and the subscription).
The solution: subscribe to the pub/sub channel first, buffer incoming messages, then replay from the DB. After replay is complete, drain the buffer, deduplicating by message id. Any message in the buffer with an id already replayed from the DB is skipped. This ensures exactly-once delivery from the client's perspective.
For SSE, this is handled automatically by the Last-Event-ID header — the browser sends it on reconnect, and the server replays from that point. For WebSocket, you must build this protocol yourself: a reconnect handshake that includes the cursor, followed by a gap-fill phase before switching to live mode.
Performance: the gap-fill query should hit a recent-message cache (Redis sorted set, keyed by room, with the last 1,000 messages) rather than the primary database. Only if the gap exceeds the cache window should the query fall back to the database. In practice, most disconnections are brief (seconds to minutes), so the cache handles 99%+ of reconnections without touching the DB.
Presence and typing indicators
Client pings every 30s. Presence service sets TTL 60s. Expiry = offline. Broadcast status to contacts.
Presence ("Alice is online") and typing indicators ("Bob is typing...") are the two most common ephemeral real-time features. They seem simple but have subtle scaling implications that interviewers love to probe.
Presence via heartbeat + TTL: The client sends a heartbeat ping over the WebSocket connection every 30 seconds. The edge server forwards this to a presence service, which sets a Redis key presence:{user_id} with a TTL of 60 seconds. If the TTL expires without a refresh, the user is considered offline. On status change (online → offline or vice versa), the presence service publishes a status_changed event to the pub/sub bus, and the edge tier forwards it to the user's contacts.
Client pings every 30s. Presence service sets TTL 60s. Expiry = offline. Broadcast status to contacts.
The scaling concern is fan-out of status changes. When a user with 500 contacts goes online, the presence service must notify 500 other users. If those 500 users are spread across 50 edge servers, that is 50 pub/sub messages. At 10M online users, status changes fire at a rate of thousands per second (users constantly going online/offline). The presence service must batch these changes and rate-limit broadcasts.
Typing indicators are simpler because they are ephemeral — no persistence, no gap replay, no catch-up on reconnect. When a user starts typing, the client sends a "typing" event over WebSocket. The edge forwards it to the conversation's pub/sub channel. All members of the conversation see "X is typing..." for 3 seconds (refreshed if the typing continues, cleared on timeout or on message send).
The key design decision: typing indicators should never be persisted and never be included in missed-message replay. They are fire-and-forget. If a user reconnects and someone was typing while they were disconnected, the typing indicator is irrelevant. This simplifies the system: typing events bypass the message store entirely and go straight to the pub/sub bus.
Presence can be batched for efficiency: instead of broadcasting every individual status change, the presence service batches changes over a 2-second window and sends a single "batch status update" with all changed users. Clients that care about real-time presence (the active chat window) can subscribe to individual status channels for instant updates; the contact list uses the batched channel.
Thundering herd: the deploy problem
Deploy drops all connections. All clients reconnect at once. Fix: jittered backoff staggers reconnects.
The thundering herd is the most predictable failure in real-time systems, and one of the easiest to prevent — yet most candidates miss it entirely.
The scenario: you deploy a new version of the edge server. The rolling deploy kills edge server 1, which holds 50K WebSocket connections. All 50K clients detect the disconnection (via TCP RST or close frame) and immediately reconnect. If the load balancer distributes them evenly across the remaining 59 edge servers, each receives ~850 new connections in under a second — manageable. But if the reconnect is truly simultaneous (within 100ms), the connection establishment rate spikes: TLS handshakes, WebSocket upgrades, room re-subscriptions, and gap-fill queries all hit at once.
Deploy drops all connections. All clients reconnect at once. Fix: jittered backoff staggers reconnects.
At scale, a rolling deploy that takes down 5 edge servers over 5 minutes means 250K reconnections concentrated in 5 bursts of 50K each. Without mitigation, each burst saturates the remaining edges' CPU (TLS) and the message store (gap-fill queries).
Client-side fix: jittered exponential backoff. On disconnect, the client waits a random delay before reconnecting: delay = min(base * 2^attempt + random_jitter, max_delay). With a base of 100ms, jitter of 0-1000ms, and max of 30s, 50K clients spread their reconnections over 1-2 seconds instead of 100ms. This alone reduces the peak reconnection rate by 10-20x.
Server-side fix: connection draining. Before killing an edge server, send a GOAWAY frame (WebSocket close with code 1001 "Going Away") and wait for clients to disconnect gracefully. The edge stops accepting new connections but continues serving existing ones for a drain period (30-60 seconds). During this window, clients receive the close frame and reconnect with backoff. The drain period spreads reconnections over the entire window rather than concentrating them at the kill instant.
Server-side fix: admission control. The load balancer rate-limits new WebSocket connections per edge server. If an edge receives more than N connection attempts per second, excess requests receive HTTP 503 with a Retry-After header. The client backs off and retries. This prevents any single edge from being overwhelmed during a reconnection storm.
The combination of all three — jittered backoff on the client, connection draining on the server, admission control on the load balancer — makes deploys invisible to users. This is the level of operational maturity interviewers look for in a staff-level answer.
Mobile network: NAT timeout, keepalive, and push fallback
NAT gateways silently drop idle connections after ~60s. Ping frames every 25s keep the mapping alive.
Mobile networks introduce three problems that don't exist on desktop: NAT timeout, battery constraints, and backgrounding.
NAT timeout: Mobile carriers use Network Address Translation (NAT) gateways to share public IPs across devices. These gateways maintain a mapping table: (device private IP:port) → (public IP:port). The mapping has a timeout — typically 30-120 seconds, varying by carrier. If no traffic flows through the mapping within the timeout, the NAT gateway silently drops it. The WebSocket connection is now broken, but neither the client nor the server knows — the TCP connection appears alive until the next write attempt, which may not happen for minutes.
NAT gateways silently drop idle connections after ~60s. Ping frames every 25s keep the mapping alive.
Fix: WebSocket ping frames. The client sends a ping frame every 25 seconds (comfortably within any carrier's NAT timeout). The server responds with a pong. This keeps the NAT mapping alive. The ping/pong is at the WebSocket protocol level — it does not hit the application layer, so it is cheap. Without pings, the connection silently dies, and the user stops receiving messages until the app detects the failure (often only when the user tries to send).
Battery impact: Sending a ping every 25 seconds means the mobile radio wakes up every 25 seconds. On LTE, the radio consumes ~1W when transmitting and ~0.01W when idle. Waking the radio 2,400 times per day (every 25s × 16 waking hours) adds measurable battery drain. The trade-off is acceptable for a chat app (users expect instant delivery) but may not be for a news feed (where 30-second latency is fine and SSE or long-poll with longer intervals saves battery).
Backgrounding: When the user switches to another app, the OS suspends the app's network connections after a brief grace period (iOS: ~30s, Android: varies). The WebSocket connection dies. On resume, the app reconnects with gap replay. But what about messages that arrive while the app is backgrounded?
Fix: push notification fallback. When the edge server detects that a client's connection has been idle for >60 seconds (no pings received), it marks the user as "push-eligible." New messages for that user are sent via APNs (iOS) or FCM (Android) as push notifications. The push notification wakes the app, which reconnects, replays the gap, and resumes the WebSocket. This dual-path (WebSocket when foregrounded, push when backgrounded) is the standard pattern for every mobile messaging app.
The server must track delivery state: a message is "delivered via WebSocket" or "delivered via push" or "pending." This prevents duplicate deliveries (push + WebSocket) and ensures no message is lost in the transition between states.
Case studies
WebSocket gateway and connection management at scale
Slack's real-time messaging infrastructure handles approximately 6 million concurrent WebSocket connections at peak. The architecture is a textbook edge-tier + pub/sub pattern.
Clients connect via WebSocket to a fleet of gateway servers (Slack calls them "flannel" servers). Each gateway holds tens of thousands of connections. Behind the gateways sits a message service that writes to MySQL (sharded by workspace), then publishes to a channel-based pub/sub system. Each gateway subscribes to channels corresponding to the workspaces of its connected clients.
When a user sends a message, the flow is: client → gateway → message service → MySQL (persist) → pub/sub → gateway(s) → WebSocket → recipient clients. The write path and the delivery path are fully separated — durability is in MySQL, speed is in pub/sub.
Reconnection uses a "latest" cursor model: the client stores the timestamp of the last event it received. On reconnect, it sends the cursor to the gateway, which queries MySQL for events newer than the cursor and replays them before switching to live streaming.
Slack's engineering blog details a key operational insight: connection storms during deploys were their biggest reliability challenge. They solved it with graduated connection draining (30-second drain window), client-side jittered backoff (exponential with random 0-5s jitter), and gateway-level admission control (max 500 new connections/second per gateway). Together, these made rolling deploys invisible to users.
Takeaway
Slack proves the edge-tier + pub/sub architecture works at 6M concurrent connections. Their biggest operational lesson: connection lifecycle management (draining, backoff, admission) matters more than raw throughput.
Gateway servers with binary encoding and guild-based pub/sub
Discord's real-time infrastructure handles over 10 million concurrent WebSocket connections. Their architecture extends the standard edge-tier pattern with two innovations: binary encoding and guild-scoped pub/sub.
Gateway servers are the edge tier. Each gateway handles approximately 1 million connections (much higher than Slack's per-gateway count, achieved through aggressive memory optimisation in Elixir/Erlang's BEAM VM). Clients connect via WebSocket and receive events in ETF (Erlang Term Format) — a binary encoding that is 30-40% smaller than JSON for typical Discord payloads. Large guilds (servers with >10K members) use zlib-compressed streams, reducing bandwidth by another 50-70%.
The pub/sub layer is scoped by guild (Discord's term for a server/community). Each guild is a pub/sub channel. When a user sends a message in a guild's text channel, the message service publishes to the guild's pub/sub channel. Every gateway that has at least one connected member of that guild receives the event and fans it out locally.
Presence is handled via a heartbeat_interval negotiated during the WebSocket handshake (typically 41.25 seconds). The client sends a heartbeat; the server responds with a heartbeat ACK. If the client misses two consecutive ACKs, it considers the connection dead and reconnects. The server-side presence TTL is 2× the heartbeat interval. This is more sophisticated than a fixed 60-second TTL — it adapts to the negotiated interval.
Discord's gateway resume protocol is notable: on reconnect, the client sends a RESUME opcode with a session_id and a sequence number (seq). The gateway replays all events since that sequence number from an in-memory buffer. If the buffer has been evicted (disconnection was too long), the gateway sends an INVALID_SESSION and the client must re-identify and re-subscribe — losing presence state but not messages (which are durable in the database).
Takeaway
Discord shows how binary encoding (ETF) and per-guild pub/sub reduce bandwidth and fan-out scope at massive scale. Their resume protocol with sequence numbers is a production-grade reconnect implementation.
WebSocket for CRDT-based collaborative editing
Figma's multiplayer editing system uses WebSocket connections to synchronise design state across all users viewing the same file. The real-time requirements are uniquely demanding: cursor positions update at 60fps, selection state changes instantly, and design operations (move, resize, recolor) must converge across all clients without conflicts.
The transport layer is WebSocket between the browser and Figma's multiplayer servers. Unlike chat (where messages are discrete events), collaborative editing transmits operations — CRDT (Conflict-free Replicated Data Type) operations that describe mutations to the shared document state. These operations are small (tens of bytes) but frequent (every mouse movement generates an operation for cursor position).
Figma's architecture separates operational data from presence data. Design operations go through a durable path: WebSocket → multiplayer server → operation log (persistent) → broadcast to other clients. Cursor and selection state go through an ephemeral path: WebSocket → multiplayer server → broadcast (no persistence). If a client disconnects and reconnects, they replay design operations from the log but not cursor positions — other users' cursors simply appear at their current positions.
The bandwidth challenge is unique: a design file with 20 simultaneous editors generates thousands of operations per second. Figma uses operational batching (aggregate operations over a 50ms window), delta encoding (send only the changed fields of a design node), and selective fan-out (cursor events go only to users viewing the same page of the design).
Presence in Figma shows not just "who's online" but "who's viewing which page" and "where their cursor is." This requires a finer-grained presence model than chat: page-level subscription (not file-level) to avoid broadcasting cursor positions to users viewing a different page.
Takeaway
Figma demonstrates that real-time delivery extends beyond chat to collaborative editing. CRDT operations over WebSocket, separated into durable (design ops) and ephemeral (cursors) paths, with page-level presence scoping.
Decision levers
Transport selection
The transport is the highest-impact decision and the one interviewers test first. Default heuristic: start with SSE (simplest persistent transport), upgrade to WebSocket only if the client needs to send at high frequency. Use short-poll as a fallback or for CDN-cacheable data. Never reach for WebSocket without articulating why SSE doesn't suffice.
Connection budget and edge tier sizing
Size the edge tier from peak concurrent connections, not DAU. Formula: peak_concurrent = DAU × peak_online_fraction. Edge_servers = ceil(peak_concurrent / connections_per_edge). A conservative connections_per_edge is 50K for WebSocket (memory-bound). If the number exceeds your budget, consider SSE (which shares TCP connections via HTTP/2) or push the low-frequency users to short-poll.
Pub/sub topology
Redis pub/sub is the default for single-region deployments — simple, fast, ephemeral. NATS is better for multi-region (JetStream provides durable pub/sub with at-least-once delivery). Kafka is appropriate when you also need event sourcing or replay from the pub/sub layer itself (not just the database). Do not over-engineer: Redis pub/sub handles hundreds of thousands of messages per second on a single node.
Reconnect strategy
Every persistent transport needs a reconnect protocol. SSE has it built in (Last-Event-ID). WebSocket requires a custom protocol: client sends last_seen_id on reconnect, server replays from a recent-message cache (Redis ZSET), falling back to the database for long gaps. The cache should hold the last ~15 minutes of messages per room — enough for 99%+ of reconnections.
Presence granularity
Presence can be binary (online/offline), status-based (online/away/busy/offline), or location-aware (viewing page X, editing component Y). Each level adds fan-out cost. Binary presence with a 30s heartbeat and 60s TTL is the baseline. Finer-grained presence (Figma-style page-level) should be scoped to the relevant context to limit broadcast fan-out.
Failure modes
A rolling deploy kills an edge server. 50K clients reconnect simultaneously, overwhelming the remaining edges with TLS handshakes and gap-fill queries. Fix: connection draining (GOAWAY + 30s drain), client jittered backoff (random 0-5s delay), load balancer admission control (max 500 new connections/second per edge).
Mobile carrier NAT drops the connection after 60s of inactivity. Neither client nor server detects it — the TCP connection appears alive. Messages are sent into the void. Fix: WebSocket ping 25s keeps the NAT mapping alive. If pong is not received within 10s, close and reconnect.
Client disconnects and reconnects, but the gap between DB replay and pub/sub subscription causes message loss. Fix: subscribe to pub/sub first, buffer incoming messages, then replay from DB. After replay, drain buffer with deduplication by message id.
Redis pub/sub goes down — all edge servers lose their event feed. No messages are delivered until recovery. Fix: Redis Sentinel or Cluster for HA. Fallback: edges fall back to polling the database every 1s if pub/sub is unavailable. Circuit breaker on the pub/sub path.
After a deploy, 50K users go offline→online within seconds. Each status change broadcasts to their contacts — potentially millions of presence events. Fix: batch presence updates over a 2-5 second window. Debounce rapid online→offline→online transitions.
An edge server accepts connections beyond its memory capacity. Kernel OOM killer terminates the process — all connections drop at once. Fix: per-edge connection limit enforced at the load balancer. Edge reports current connection count; LB stops routing new connections when the limit is reached.
A popular room (10K members online, all on different edges) receives a burst of messages. Each message fans out to all edges, then each edge fans out to its local subscribers. At 100 messages/second, this is 100 × (number of edges) pub/sub deliveries per second. Fix: message batching at the pub/sub layer (aggregate 50ms windows), client-side frame coalescing.
Decision table
Real-time transport comparison
| Transport | Direction | Latency | Connection cost | Best for |
|---|---|---|---|---|
| Short-poll | Client → Server → Client | = poll interval (seconds) | None — stateless HTTP | Dashboards, leaderboards, >5s tolerance |
| Long-poll | Client → Server (held) → Client | Near-instant (first event) | 1 held connection per client | Moderate traffic, proxy-constrained envs |
| SSE | Server → Client (one-way) | Near-instant (streaming) | 1 persistent HTTP stream per client | Notifications, feeds, live logs |
| WebSocket | Bidirectional (full-duplex) | Frame-level (~ms) | 1 persistent connection + state tracking | Chat, collaboration, gaming, trading |
| Hybrid fallback | Depends on negotiated transport | Varies by transport | Varies — WS preferred, SSE/poll fallback | Production systems with diverse clients |
- SSE over HTTP/2 multiplexes streams on one TCP connection — no 6-connection limit
- WebSocket connection budget: 50K idle connections per server is a practical ceiling
- Long-poll is a valid middle ground when SSE is blocked and WebSocket is overkill
Worked example
Worked example: design real-time delivery for a chat app with 10M DAU
Step 1: Transport choice
The app requires bidirectional communication — clients send messages and receive them. Latency budget is sub-second. WebSocket is the correct transport. SSE could handle the downstream path, but a single WebSocket is simpler than SSE + HTTP POST for upstream.
Step 2: Connection budget
10M DAU. At peak, 30% are online simultaneously = 3M concurrent connections. A single edge server handles ~50K idle WebSocket connections (memory-bound: ~10KB per connection × 50K = 500MB). Edge servers needed: 3M / 50K = 60 edge servers. Provision for 80 to handle spikes and rolling deploys (during a deploy, some edges are draining).
Step 3: Edge tier + pub/sub
Deploy 60 stateless edge servers behind an L4 load balancer with sticky routing (by connection, not by request). Each edge subscribes to Redis pub/sub channels for the conversations of its connected clients.
When client A sends a message in conversation 123:
- 1Client A sends a WebSocket frame to its edge server.
- 2The edge forwards the message to the Chat API.
- 3The Chat API writes to Postgres (INSERT INTO messages ...).
- 4The Chat API publishes to Redis: PUBLISH conv:123 {message payload}.
- 5Redis routes the event to every edge subscribed to conv:123.
- 6Each edge forwards the message to its locally connected members of conv:123.
Clients hold persistent connections to edge servers; a pub/sub bus routes events to the correct edge based on subscription.
Step 4: Durability and reconnect
Pub/sub is ephemeral — if a client is disconnected when a message fires, it misses it. The durable path is Postgres. On reconnect, the client sends last_seen_id. The edge queries a recent-message cache (Redis ZSET per conversation, last 1,000 messages) for id > last_seen_id. If the gap exceeds the cache, fallback to Postgres. After replay, switch to live streaming from the pub/sub.
To prevent the gap between replay and subscription, the edge subscribes to the pub/sub channel before replaying from the cache. Incoming messages are buffered during replay and deduplicated by message id after replay completes.
Step 5: Presence
Each client sends a heartbeat every 30 seconds over the WebSocket. The edge forwards it to the Presence service, which sets a Redis key presence:{user_id} with a 60-second TTL. If the key expires (no heartbeat for 60s), the user is offline. On status change, the Presence service publishes a status_changed event to the user's contacts via pub/sub.
Typing indicators: ephemeral events. Client sends "typing" over WebSocket → edge publishes to conv:123 → other members see "X is typing..." for 3 seconds. No persistence, no gap replay.
Step 6: Mobile
Ping frame every 25 seconds to survive mobile NAT timeout (typically 60s). If the edge detects no heartbeat for 60s, the user is marked "push-eligible." New messages are sent via APNs/FCM. On app foreground, the client reconnects, replays the gap, and resumes WebSocket.
Step 7: Thundering herd mitigation
Rolling deploy kills one edge at a time. Before shutdown: send GOAWAY (WebSocket close code 1001), wait 30s for drain. Clients reconnect with jittered exponential backoff: delay = min(100ms × 2^attempt + random(0, 1000ms), 30s). Load balancer enforces admission control: max 500 new WebSocket connections per second per edge.
Summary numbers
- Transport: WebSocket (bidirectional)
- Edge tier: 80 servers × 50K connections = 4M capacity (3M peak)
- Pub/sub: Redis pub/sub, keyed by conversation_id
- Durability: Postgres (write-ahead) + Redis ZSET cache (recent messages)
- Presence: heartbeat 30s, TTL 60s, Redis keys
- Mobile: ping 25s, push notification fallback
- Deploy: GOAWAY + 30s drain + jittered backoff + admission control
Interview playbook
When it comes up
- Prompt mentions "real-time", "live updates", "push notifications to clients"
- The system has server-initiated events (chat, scores, tickers, presence)
- Freshness requirement is under 5 seconds
- The prompt explicitly names WebSocket or streaming
Order of reveal
- 1Name the transport and why. Start by comparing the four transports (poll, long-poll, SSE, WebSocket) and pick one based on direction, latency, and connection budget. Most candidates skip this — do not.
- 2Connection budget math. Calculate peak concurrent connections from DAU and online fraction. Size the edge tier. This is the number interviewers want to hear.
- 3Edge tier + pub/sub. Describe the edge-tier architecture: stateless edge servers subscribing to a pub/sub bus. Explain why sticky sessions alone do not scale.
- 4Dual-write for durability. Write to DB first (durability), then publish to pub/sub (speed). Pub/sub is ephemeral — if the bus loses a message, the data is in the DB.
- 5Reconnect with gap replay. Client sends last_seen_id, server replays from cache/DB, then switches to live. Subscribe before replay, buffer and dedup to avoid gaps.
- 6Presence and typing. Heartbeat-based presence with TTL. Typing indicators are ephemeral — no persistence, fire-and-forget on the bus.
- 7Mobile and deploy safety. Ping frames for NAT keepalive. Push notification fallback when backgrounded. Jittered backoff + connection draining on deploy to prevent thundering herds.
Signature phrases
- “Connection budget, not DAU” — Shows you know the edge tier is sized by concurrent connections, not total users.
- “Dual-write: DB first, pub/sub second” — Demonstrates understanding that pub/sub is ephemeral and durability requires a separate write path.
- “Subscribe before replay, dedup after” — Solves the subtle gap between DB replay and live streaming that most candidates miss.
- “Jittered backoff is mandatory, not optional” — Names the thundering herd problem and the standard mitigation.
- “Typing indicators are ephemeral — never persist them” — Shows you can distinguish between durable events and fire-and-forget events.
- “Ping every 25s — mobile NAT dies at 60s” — Demonstrates mobile-specific knowledge that interviewers test for.
Likely follow-ups
?“Why not just use WebSocket for everything?”Reveal
WebSocket has the highest operational cost: connection tracking, pub/sub, edge tier management, reconnect logic. SSE is simpler for one-way push (notifications, feeds) and works through proxies that block WebSocket upgrades. Short-poll is correct when latency tolerance exceeds 5 seconds and eliminates all persistent-connection infrastructure. The right answer is the simplest transport that meets the requirements.
?“How do you handle 1 million concurrent connections?”Reveal
Edge tier with ~20 servers at 50K connections each. Each edge subscribes to a Redis pub/sub bus for the channels of its connected clients. Adding capacity = adding edge servers. Connection count is tracked per-edge; the load balancer stops routing new connections when an edge hits its limit. Memory is the bottleneck (50K × 10KB = 500MB per edge), not CPU.
?“What happens during a deploy?”Reveal
A deploy kills an edge server, dropping 50K connections. Without mitigation, all 50K clients reconnect simultaneously (thundering herd). Fix: (1) connection draining — send GOAWAY, wait 30s; (2) client jittered backoff — random 0-5s delay; (3) admission control — load balancer caps new connections per second per edge. Together, reconnections spread over 5-10 seconds instead of 100ms.
?“How does presence work at scale?”Reveal
Heartbeat every 30s, Redis key with 60s TTL. Expiry = offline. Status changes broadcast to contacts via pub/sub. The scaling challenge is fan-out of status changes — a user with 500 contacts triggers 500 notifications. Batch status updates over 2-5 second windows to reduce event volume. Debounce rapid online→offline→online transitions.
?“How do you handle missed messages on reconnect?”Reveal
Client tracks last_seen_id. On reconnect: (1) edge subscribes to pub/sub channel first, (2) buffers incoming messages, (3) replays from cache/DB where id > last_seen_id, (4) after replay, drains buffer with deduplication by message id. This ensures exactly-once delivery from the client perspective. The cache holds last 15 minutes per room — covers 99%+ of reconnections without hitting the database.
?“What about mobile battery and background apps?”Reveal
Ping frames every 25s keep mobile NAT alive but wake the radio 2,400 times/day — acceptable for chat, expensive for low-priority feeds. When the OS backgrounds the app, the WebSocket dies. The server marks the user "push-eligible" after 60s of no heartbeat. New messages go via APNs/FCM. On foreground, the app reconnects, replays the gap, and resumes WebSocket. Delivery state tracking prevents duplicate notifications.
Code snippets
import asyncio
import json
import aioredis
from websockets import serve, WebSocketServerProtocol
# Room subscriptions: room_id -> set of websocket connections
rooms: dict[str, set[WebSocketServerProtocol]] = {}
redis: aioredis.Redis | None = None
async def subscribe_to_redis():
"""Listen to Redis pub/sub and fan out to local WebSocket clients."""
global redis
redis = await aioredis.from_url("redis://localhost:6379")
pubsub = redis.pubsub()
while True:
# Dynamically subscribe to channels for active rooms
active_channels = [f"room:{rid}" for rid in rooms if rooms[rid]]
if active_channels:
await pubsub.subscribe(*active_channels)
async for msg in pubsub.listen():
if msg["type"] != "message":
continue
channel = msg["channel"].decode()
data = msg["data"].decode()
room_id = channel.split(":", 1)[1]
# Fan out to local WebSocket connections in this room
if room_id in rooms:
dead = set()
for ws in rooms[room_id]:
try:
await ws.send(data)
except Exception:
dead.add(ws)
rooms[room_id] -= dead
async def handle_client(ws: WebSocketServerProtocol):
"""Handle a single WebSocket client connection."""
joined_rooms: set[str] = set()
try:
async for raw in ws:
msg = json.loads(raw)
if msg["type"] == "join":
room_id = msg["room_id"]
rooms.setdefault(room_id, set()).add(ws)
joined_rooms.add(room_id)
elif msg["type"] == "message":
room_id = msg["room_id"]
payload = json.dumps({
"type": "message",
"room_id": room_id,
"text": msg["text"],
})
# Publish to Redis — all edge servers receive it
if redis:
await redis.publish(f"room:{room_id}", payload)
finally:
for rid in joined_rooms:
rooms.get(rid, set()).discard(ws)
async def main():
asyncio.create_task(subscribe_to_redis())
async with serve(handle_client, "0.0.0.0", 8765):
await asyncio.Future() # run forever
if __name__ == "__main__":
asyncio.run(main())from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import asyncio
import json
app = FastAPI()
# In-memory event buffer (production: Redis ZSET per channel)
event_buffer: list[dict] = []
event_id_counter = 0
async def event_generator(request: Request, last_event_id: int | None):
"""Generate SSE events, replaying missed events on reconnect."""
global event_id_counter
# Replay missed events if reconnecting
if last_event_id is not None:
for event in event_buffer:
if event["id"] > last_event_id:
yield format_sse(event)
# Stream live events
while True:
if await request.is_disconnected():
break
# In production, await a pub/sub message instead of polling
await asyncio.sleep(0.1)
# Check for new events in buffer
if event_buffer and event_buffer[-1]["id"] > (last_event_id or 0):
for event in event_buffer:
if event["id"] > (last_event_id or 0):
yield format_sse(event)
last_event_id = event["id"]
def format_sse(event: dict) -> str:
"""Format a dict as an SSE event string."""
lines = [f"id: {event['id']}"]
if "event" in event:
lines.append(f"event: {event['event']}")
lines.append(f"data: {json.dumps(event['data'])}")
return "
".join(lines) + "
"
@app.get("/stream")
async def stream(request: Request):
last_event_id = request.headers.get("Last-Event-ID")
last_id = int(last_event_id) if last_event_id else None
return StreamingResponse(
event_generator(request, last_id),
media_type="text/event-stream",
headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
)import redis
import json
import time
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
HEARTBEAT_TTL = 60 # seconds — 2x the client heartbeat interval (30s)
PRESENCE_CHANNEL = "presence:changes"
def handle_heartbeat(user_id: str) -> None:
"""Process a heartbeat from a connected client."""
key = f"presence:{user_id}"
was_offline = not r.exists(key)
# Set/refresh the presence key with TTL
r.setex(key, HEARTBEAT_TTL, "online")
if was_offline:
# User just came online — broadcast to contacts
broadcast_status_change(user_id, "online")
def check_and_broadcast_offline(user_id: str) -> None:
"""Called by a TTL-expiry listener or periodic sweep."""
key = f"presence:{user_id}"
if not r.exists(key):
broadcast_status_change(user_id, "offline")
def broadcast_status_change(user_id: str, status: str) -> None:
"""Publish presence change to the pub/sub bus."""
event = json.dumps({
"type": "presence",
"user_id": user_id,
"status": status,
"timestamp": time.time(),
})
# In production, publish to the user's contacts' channels
r.publish(PRESENCE_CHANNEL, event)
def get_online_contacts(user_id: str, contact_ids: list[str]) -> list[str]:
"""Check which contacts are currently online."""
pipe = r.pipeline(transaction=False)
for cid in contact_ids:
pipe.exists(f"presence:{cid}")
results = pipe.execute()
return [cid for cid, exists in zip(contact_ids, results) if exists]import random
import asyncio
import logging
logger = logging.getLogger(__name__)
BASE_DELAY_MS = 100
MAX_DELAY_MS = 30_000
JITTER_MAX_MS = 1_000
def calculate_backoff(attempt: int) -> float:
"""Calculate delay in seconds with exponential backoff + jitter.
Formula: min(base * 2^attempt + random_jitter, max_delay)
This prevents thundering herd on mass reconnect (e.g., after deploy).
"""
exponential = BASE_DELAY_MS * (2 ** attempt)
jitter = random.uniform(0, JITTER_MAX_MS)
delay_ms = min(exponential + jitter, MAX_DELAY_MS)
return delay_ms / 1000.0
async def reconnect_with_backoff(
connect_fn,
max_attempts: int = 10,
) -> object | None:
"""Attempt to reconnect with jittered exponential backoff.
Args:
connect_fn: async callable that attempts a WebSocket connection.
max_attempts: give up after this many failures.
Returns:
The connection object on success, None on exhaustion.
"""
for attempt in range(max_attempts):
try:
conn = await connect_fn()
if attempt > 0:
logger.info(f"Reconnected after {attempt} retries")
return conn
except Exception as exc:
delay = calculate_backoff(attempt)
logger.warning(
f"Connect attempt {attempt + 1} failed: {exc}. "
f"Retrying in {delay:.1f}s"
)
await asyncio.sleep(delay)
logger.error(f"Failed to reconnect after {max_attempts} attempts")
return Noneimport asyncio
import signal
import logging
from websockets import WebSocketServerProtocol
logger = logging.getLogger(__name__)
DRAIN_TIMEOUT_S = 30
connections: set[WebSocketServerProtocol] = set()
draining = False
async def graceful_shutdown():
"""Drain all WebSocket connections before shutting down.
Steps:
1. Stop accepting new connections (health check returns 503).
2. Send GOAWAY (close code 1001) to all connected clients.
3. Wait up to DRAIN_TIMEOUT_S for clients to disconnect.
4. Force-close any remaining connections.
"""
global draining
draining = True
logger.info(
f"Draining {len(connections)} connections "
f"(timeout: {DRAIN_TIMEOUT_S}s)"
)
# Send close frame to all clients — code 1001 = "Going Away"
close_tasks = []
for ws in list(connections):
close_tasks.append(
asyncio.create_task(
ws.close(1001, "Server shutting down — please reconnect")
)
)
# Wait for clients to disconnect gracefully
if close_tasks:
done, pending = await asyncio.wait(
close_tasks, timeout=DRAIN_TIMEOUT_S
)
if pending:
logger.warning(
f"Force-closing {len(pending)} connections after drain timeout"
)
for task in pending:
task.cancel()
# Force-close any stragglers
for ws in list(connections):
try:
await ws.close(1001, "Drain timeout exceeded")
except Exception:
pass
connections.clear()
logger.info("All connections drained — shutting down")
def setup_signal_handlers(loop: asyncio.AbstractEventLoop):
"""Register SIGTERM/SIGINT handlers for graceful shutdown."""
for sig in (signal.SIGTERM, signal.SIGINT):
loop.add_signal_handler(
sig,
lambda: asyncio.create_task(graceful_shutdown()),
)Drills
When should you choose SSE over WebSocket?Reveal
When the communication is one-directional (server → client only): notifications, live feeds, stock tickers, CI logs. SSE has built-in reconnection with Last-Event-ID, works through corporate proxies that block WebSocket upgrades, and multiplexes over HTTP/2. If the client also needs to send data, regular HTTP POST alongside the SSE stream works — but if the client sends at high frequency, WebSocket is simpler.
How do you size the edge tier for 10M DAU?Reveal
Calculate peak concurrent connections: 10M × 30% peak online = 3M concurrent. Each edge handles ~50K idle WebSocket connections (memory-bound at ~10KB per connection = 500MB). Edge servers needed: 3M / 50K = 60. Provision for 80 to handle spikes and rolling deploys. This is the math interviewers want — not "we add servers as needed."
Why is pub/sub not sufficient for message durability?Reveal
Redis pub/sub is fire-and-forget: if no subscriber is listening when a message is published, it is lost. A disconnected client misses every message published during the disconnection. The fix is dual-write: persist to the database first (guarantees durability), then publish to pub/sub (provides speed to connected clients). On reconnect, the client replays missed messages from the database, not from pub/sub.
Describe the reconnect catch-up protocol for WebSocket.Reveal
Client tracks last_seen_id (monotonically increasing message id). On reconnect: (1) edge subscribes to the pub/sub channel for the room, (2) buffers incoming pub/sub messages, (3) queries cache/DB for messages where id > last_seen_id and sends them to the client, (4) after replay, drains the buffer, skipping any message already sent during replay (dedup by id). This ensures exactly-once delivery with no gap between replay and live streaming.
How do you prevent a thundering herd when deploying new edge servers?Reveal
Three layers: (1) Server-side: send GOAWAY (close code 1001) and drain connections over 30 seconds before killing the process. (2) Client-side: jittered exponential backoff on reconnect — delay = min(100ms × 2^attempt + random(0, 1000ms), 30s). (3) Load balancer: admission control — cap new WebSocket connections at 500/second per edge. Together, reconnections spread over 5-10 seconds instead of 100ms.
How does presence work without polling the database?Reveal
Heartbeat-based with TTL: client sends a heartbeat 30s over WebSocket. The presence service sets a Redis key presence:{user_id} with a 60-second TTL. If the key expires without refresh, the user is offline. On status change, the presence service publishes a status_changed event to the user's contacts via pub/sub. No database polling — Redis key expiry is the detection mechanism.