Networking fundamentals
DNS, TCP/TLS handshakes, HTTP lifecycle, protocol choice, latency by hop.
You can't design distributed systems without understanding how bytes travel from client to server and back. DNS, TCP, TLS, HTTP — these aren't trivia. They're the latency budget, the failure modes, and the protocol choices that underpin every design decision you make.
Read this if your last attempt…
- You can't trace what happens between a client clicking Submit and seeing a response
- You don't know the latency cost of a TCP + TLS handshake
- You aren't sure when to reach for gRPC vs REST vs WebSocket vs SSE
- Your design includes 5 serial cross-region calls without budgeting the latency
The concept
When a user hits your API for the first time, the request travels through a precise sequence. DNS resolution to find the IP (50-200 ms cold). A TCP three-way handshake to establish a reliable connection (1 round-trip). A TLS handshake to encrypt the channel (1 more round-trip with TLS 1.3). Only then does the HTTP request flow. That cold start can cost 150 ms or more before your server processes a single byte. Subsequent requests on the same connection skip all of that and pay only for the HTTP exchange itself.
This matters because every design choice has networking consequences. Put a service in another region? Add 60-120 ms per call. Add a serial dependency? Its full latency stacks on top of every call before it. Choose WebSocket over SSE? You gain bidirectional communication but add stateful connection management that complicates your load balancer, failover, and scaling story.
Every cold request pays these costs in sequence. Subsequent requests on the same connection skip DNS + TCP + TLS. Understanding this pipeline is the foundation of every latency budget.
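The cold-versus-warm pipeline above can be sketched as simple arithmetic. This is a rough model, not a measurement: the function names and the sample numbers (30 ms RTT, 50 ms DNS, 20 ms processing) are assumptions chosen for illustration.

```python
def cold_request_ms(rtt_ms, dns_ms, processing_ms, tls_round_trips=1):
    """Latency of the first request on a brand-new connection."""
    tcp_ms = rtt_ms                     # SYN / SYN-ACK / ACK: one round-trip
    tls_ms = tls_round_trips * rtt_ms   # TLS 1.3 needs 1 round-trip, TLS 1.2 needs 2
    http_ms = rtt_ms + processing_ms    # request out, response back, plus server work
    return dns_ms + tcp_ms + tls_ms + http_ms


def warm_request_ms(rtt_ms, processing_ms):
    """Latency of a later request that reuses the open connection."""
    return rtt_ms + processing_ms


cold = cold_request_ms(rtt_ms=30, dns_ms=50, processing_ms=20)
warm = warm_request_ms(rtt_ms=30, processing_ms=20)
print(cold, warm)  # 160 50
```

With these assumed inputs the cold request is more than three times the warm one, which is why connection reuse is the first lever in most latency budgets.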
Protocol choice — default to REST unless you have a specific reason not to.
| Protocol | Transport | Direction | Best for | Avoid when |
|---|---|---|---|---|
| REST / HTTP | TCP | Request-response | Public APIs, browser clients, simplicity | High-throughput internal calls where JSON overhead matters |
| gRPC | HTTP/2 | Unary and streaming (client, server, bidirectional) | Service-to-service, binary data, streaming | Browser-facing APIs (no native browser support) |
| WebSocket | TCP (upgraded) | Bidirectional frames | Chat, live collaboration, gaming | Unidirectional updates (SSE is simpler) |
| SSE | HTTP (long-lived) | Server → client push | Notifications, live feeds, dashboards | Client needs to push data frequently |
| GraphQL | HTTP | Request-response | Diverse clients needing different data shapes | Simple CRUD, performance-critical paths |
How interviewers grade this
- You can trace the full request lifecycle: DNS → TCP → TLS → HTTP → response.
- You know the latency cost of each hop and factor it into your design.
- You choose the right protocol (REST vs gRPC vs WebSocket vs SSE) with specific justification.
- You use DNS features (GeoDNS, failover, weighted routing) in your global architecture.
- You understand connection pooling and why opening a new connection per request is wasteful.
Variants
TCP vs UDP
TCP guarantees delivery and ordering. UDP is fast but fire-and-forget.
TCP is the workhorse: it establishes a connection via a three-way handshake (SYN, SYN-ACK, ACK), guarantees that data arrives in order, retransmits lost packets, and provides flow and congestion control. The cost is overhead — each connection has setup latency and per-packet acknowledgements.
UDP is connectionless: you send a datagram and hope it arrives. No setup, no acknowledgement, no ordering guarantee. It's faster because there's no overhead, but the application must handle loss and reordering.
When to use UDP in a system design interview: real-time media (VoIP, video streaming, gaming) where a dropped frame is better than a delayed one, high-volume telemetry where occasional loss is acceptable, and DNS lookups (simple query-response, retried at the application level).
For everything else — web APIs, databases, file transfer, messaging — TCP is the default. Most interviewers will assume TCP unless you explicitly propose UDP and justify it.
Modern compromise: QUIC (the protocol under HTTP/3) runs on UDP but adds reliability, encryption, and multiplexing. Each stream has independent loss recovery, so a dropped packet in one stream doesn't block others — solving TCP's head-of-line blocking problem. QUIC is particularly good on mobile (survives network switches without reconnection).
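The TCP and UDP contrast described above shows up directly in the socket API. A minimal loopback sketch using Python's standard `socket` module; the helper names `udp_roundtrip` and `tcp_roundtrip` are made up for this example, and loopback hides the loss and reordering a real network would introduce.

```python
import socket


def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram: no handshake, no acknowledgement, no ordering."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)          # a lost datagram would otherwise block forever
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(4096)
        return data


def tcp_roundtrip(payload: bytes) -> bytes:
    """Send bytes over a connection: the handshake happens inside connect()."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(2.0)
        # create_connection() returns only after SYN / SYN-ACK / ACK completes.
        with socket.create_connection(listener.getsockname(), timeout=2.0) as client:
            server_side, _addr = listener.accept()
            with server_side:
                server_side.settimeout(2.0)
                client.sendall(payload)
                return server_side.recv(4096)


print(udp_roundtrip(b"ping"), tcp_roundtrip(b"ping"))
```

Note what the UDP path leaves to the application: the timeout is the only thing that tells the receiver a datagram went missing.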
Pros
- TCP: reliable, ordered, well-understood; the default
- UDP: lower latency, no connection overhead
- QUIC: reliable delivery without cross-stream head-of-line blocking
Cons
- TCP: head-of-line blocking (one lost packet stalls every stream sharing the connection)
- UDP: application must handle loss and ordering
- QUIC: newer, some firewalls block it, less tooling
Choose this variant when
- TCP: everything except real-time media and telemetry
- UDP: VoIP, video streaming, gaming, DNS
- QUIC/HTTP/3: mobile clients, high-latency networks
REST vs gRPC
REST for public APIs and simplicity. gRPC for internal service-to-service where performance matters.
REST uses HTTP + JSON. It's well-understood, works everywhere (including browsers), and interviewers expect it as the default. Resource-oriented (GET /users/123, POST /orders), uses standard HTTP verbs and status codes. Easy to debug with curl.
gRPC uses HTTP/2 + Protocol Buffers (binary serialization). It is typically faster than JSON over HTTP/1.1 because payloads are smaller and cheaper to parse; some benchmarks report up to 10x throughput, though the gain depends heavily on payload shape and workload. Strong typing via .proto definitions means many contract errors are caught at compile time rather than runtime. gRPC natively supports streaming (client, server, and bidirectional), deadlines for timeout propagation, and client-side load balancing.
The common pattern in production: REST for external/public APIs (browser-compatible, easy to integrate), gRPC for internal service-to-service calls (fast, typed, streaming). Many companies run this dual-protocol architecture.
In interviews, default to REST. Propose gRPC when the interviewer probes on internal service performance, when you have a microservices architecture with high call volumes between services, or when you need streaming. Don't propose gRPC just to sound sophisticated — justify it with a specific performance or streaming need.
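To make the "JSON overhead" argument concrete, here is a rough size comparison. It uses Python's `struct` module as a stand-in for a binary encoding; this is not Protocol Buffers (protobuf uses field tags and variable-length integers, so real sizes differ), and the `order` record is a made-up example.

```python
import json
import struct

# Hypothetical record with three integer fields.
order = {"user_id": 123, "order_id": 456789, "amount_cents": 2599}

# Text encoding: field names and punctuation travel with every message.
json_bytes = json.dumps(order).encode("utf-8")

# Binary encoding: three unsigned 32-bit integers in network byte order.
# The field names live in the shared schema, not on the wire.
binary_bytes = struct.pack(
    "!III", order["user_id"], order["order_id"], order["amount_cents"]
)

print(len(json_bytes), len(binary_bytes))  # the binary form is 12 bytes

# Decoding needs the same schema on the receiving side.
user_id, order_id, amount_cents = struct.unpack("!III", binary_bytes)
```

The trade-off is visible here too: the JSON payload is self-describing and readable with curl, while the binary payload is meaningless without the schema.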
Choose this variant when
- REST: public APIs, browser clients, simple CRUD, most interview scenarios
- gRPC: internal service-to-service, high throughput, streaming, binary data
WebSocket vs SSE
WebSocket for bidirectional real-time. SSE for server-push-only updates.
SSE (Server-Sent Events) is a one-way channel: the client opens an HTTP connection, and the server pushes messages down it as they happen. It's built on standard HTTP, works with existing load balancers and proxies, and auto-reconnects on disconnection. Think of it as a long-lived HTTP response that sends multiple messages over time.
WebSocket upgrades an HTTP connection to a persistent, bidirectional TCP channel. Both client and server can send messages at any time. It's necessary when the client needs to push data frequently (chat, collaborative editing, gaming).
The key decision: if you only need server-to-client push (notifications, live feeds, dashboards), SSE is simpler and should be your default. WebSockets add significant infrastructure complexity — you need stateful connection management, special load balancer handling, reconnection logic, and your horizontal scaling story gets harder because connections are pinned to specific servers.
In interviews, proposing WebSocket without justifying bidirectional need is a red flag. Many candidates default to WebSocket for anything "real-time" when SSE would be sufficient and simpler. Name the direction of data flow, then pick the simpler option.
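Part of why SSE is the simpler option is its wire format: plain text lines on a response served as `text/event-stream`, with a blank line ending each event. A minimal sketch of an event formatter; `format_sse` is a hypothetical helper name, while the `id`, `event`, and `data` fields are the ones the SSE format defines.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Event as text."""
    lines = []
    if event_id is not None:
        # Browsers send the last seen id back on reconnect (Last-Event-ID header).
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload becomes one "data:" line per line of text.
    lines.extend(f"data: {part}" for part in data.split("\n"))
    return "\n".join(lines) + "\n\n"   # the blank line terminates the event


print(format_sse('{"price": 101.5}', event="price", event_id="42"), end="")
```

The server writes strings like this to a long-lived HTTP response; there is no upgrade handshake and no custom framing for the load balancer to understand.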
Pros
- SSE: simple, works with existing HTTP infrastructure, auto-reconnect
- WebSocket: true bidirectional, low-latency framing, good for chat and games
Cons
- SSE: server-to-client only, some misbehaving proxies buffer responses
- WebSocket: stateful connections complicate LB and failover, more infrastructure work
Choose this variant when
- SSE: notifications, live feeds, auction updates, score boards
- WebSocket: chat, collaborative editing, multiplayer gaming, anything needing client push
DNS as infrastructure
DNS is not just name resolution — it's load balancing, failover, and traffic shaping.
GeoDNS (latency-based routing) resolves the same domain to different IPs based on the client's location. Users in Europe get the EU region IP, users in Asia get APAC. This is the first layer of global load balancing — before your request even reaches a server.
DNS failover: health checks probe an endpoint in each region. When a region is unhealthy, DNS stops returning its IP. With short TTLs (60s), clients whose resolvers honor the TTL shift to a healthy region roughly a minute after the check fails.
Weighted routing: 95% of DNS responses point to the stable deployment, 5% to canary. Cheap traffic splitting for canary deploys without touching load balancers.
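The 95/5 split can be simulated as a weighted random choice per DNS answer. This is a sketch of the idea only: the target names and the `pick_target` helper are assumptions, and real traffic will not match the weights exactly because resolvers cache answers and serve many users from one lookup.

```python
import random
from collections import Counter


def pick_target(weights, rng):
    """Choose one deployment, with probability proportional to its weight."""
    targets = list(weights)
    return rng.choices(targets, weights=[weights[t] for t in targets], k=1)[0]


weights = {"stable": 95, "canary": 5}
rng = random.Random(42)   # seeded so the simulation is repeatable

counts = Counter(pick_target(weights, rng) for _ in range(10_000))
print(counts)  # roughly 9,500 stable and 500 canary
```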
TTL tradeoffs: short TTL (60s) = fast failover but more DNS queries and more load on resolvers. Long TTL (3600s) = less DNS traffic but slow failover. Some ISP resolvers ignore TTLs entirely. Don't rely on DNS alone for fast failover — layer it with health checks at the load balancer.
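The TTL tradeoff is easiest to see from the resolver's side. Below is a minimal sketch of a TTL-respecting cache; `TtlDnsCache`, the injected `resolve` callable, and the documentation-range IP addresses are all assumptions for illustration, not a real resolver API.

```python
import time


class TtlDnsCache:
    """Caches each answer until its TTL expires, then asks upstream again."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve   # callable: hostname -> (ip, ttl_seconds)
        self._clock = clock       # injectable so the demo can fake time
        self._entries = {}        # hostname -> (ip, expires_at)

    def lookup(self, hostname):
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]      # still fresh: no upstream query, possibly stale IP
        ip, ttl = self._resolve(hostname)
        self._entries[hostname] = (ip, now + ttl)
        return ip


# Demo with a fake clock: the region fails over at t=0, TTL is 60 seconds.
fake_now = [0.0]
current_ip = ["203.0.113.10"]
cache = TtlDnsCache(lambda host: (current_ip[0], 60), clock=lambda: fake_now[0])

cache.lookup("api.example.com")        # caches the old region's IP
current_ip[0] = "198.51.100.20"        # authoritative DNS now returns the new region
fake_now[0] = 59.0
stale = cache.lookup("api.example.com")   # still the old IP
fake_now[0] = 61.0
fresh = cache.lookup("api.example.com")   # TTL expired, new IP
print(stale, fresh)
```

A client can keep hitting the dead region for up to one full TTL, which is the reason to layer load balancer health checks on top of DNS failover.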
Connection pooling: every new TCP connection costs a three-way handshake plus TLS setup, which is 2-3 round-trips of pure latency (1 for TCP, then 1 for TLS 1.3 or 2 for TLS 1.2). At 1000 QPS with a connection per request, that's 1000 handshakes/sec wasted. Connection pools pre-establish and reuse connections. HTTP/2 multiplexes many requests over one connection, making pools even more efficient. This is not optional at scale; it's table stakes.
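A connection pool is a small data structure: a bounded set of already-open connections that callers borrow and return. The sketch below assumes a caller-supplied `connect` factory and a hypothetical `ConnectionPool` class; a production pool would also detect and replace broken or idle-expired connections, which this one does not.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and then reused."""

    def __init__(self, connect, size):
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())   # pay the handshake cost up front

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while every connection is in use; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)        # return it for the next caller


# Demo with a fake connection factory that counts handshakes.
handshakes = []

def fake_connect():
    handshakes.append(1)
    return object()

pool = ConnectionPool(fake_connect, size=2)
for _ in range(1000):
    with pool.connection() as conn:
        pass   # send the request over conn here

print(len(handshakes))  # 2
```

A thousand requests cost two handshakes instead of a thousand. The LIFO queue is a design choice: it keeps reusing the most recently active connection, so surplus connections go idle and can be closed.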
Choose this variant when
- Multi-region deployments need GeoDNS
- Canary deploys need weighted routing
- Any service-to-service communication needs connection pooling
Worked example
Scenario: your interviewer asks "what happens when a user in Tokyo loads your app?"
DNS (50ms): Browser checks local cache → miss → ISP resolver → Route53 latency-based routing returns Tokyo-region ALB IP.
TCP (2ms): Three-way handshake to Tokyo ALB (same region = low latency).
TLS (4ms): TLS 1.3, 1 round-trip. ALPN negotiates HTTP/2. Encrypted channel ready.
HTTP/2 request: GET /api/feed → ALB routes to app server in same AZ.
App processing (15ms): JWT validation from cached public key (1ms) → Redis lookup for the feed (1ms), which returns a partial hit → DB query for the missing items (8ms) → response assembly (5ms).
Response (5ms): 10KB JSON, compressed with Brotli. Streamed back over the existing HTTP/2 connection.
Total cold start: ~76ms. Subsequent requests skip DNS + TCP + TLS — just HTTP/2 streams at ~20ms each.
If Tokyo region goes down: Route53 health check fails → DNS stops returning Tokyo IP within 60s TTL → next DNS lookup returns US-West IP → latency jumps to ~200ms (cross-Pacific) but service continues.
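The budget in this worked example can be checked by adding up the stages. The numbers below are copied from the walkthrough; only the variable names are invented.

```python
# Stage latencies from the Tokyo walkthrough, in milliseconds.
stages_ms = {"dns": 50, "tcp": 2, "tls": 4, "app": 15, "response": 5}
cold_only = {"dns", "tcp", "tls"}   # skipped when the connection is reused

cold_total = sum(stages_ms.values())
warm_total = sum(ms for stage, ms in stages_ms.items() if stage not in cold_only)

# The app stage broken down the same way.
app_ms = {"jwt": 1, "redis": 1, "db": 8, "assembly": 5}

print(cold_total, warm_total, sum(app_ms.values()))  # 76 20 15
```

Writing the budget as data like this makes it easy to see which stage dominates: here DNS is about two thirds of the cold start.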
Good vs bad answer
Interviewer probe
“What happens between a client clicking Submit and seeing the response?”
Weak answer
"The request goes to the server, gets processed, and the response comes back."
Strong answer
"First DNS resolution — cached after the first lookup for TTL seconds. Then TCP handshake, 1 round-trip. Then TLS 1.3, 1 more round-trip. Only then does the HTTP request flow — over HTTP/2 so subsequent requests multiplex on the same connection. Server-side: ALB routes to an app instance, JWT validates from a cached key, Redis serves the hot data, DB handles cache misses. Response is Brotli-compressed. Cold start: ~150ms. Warm connection: ~20ms. The biggest lever is keeping connections alive and caching DNS."
Why it wins: Traces the full lifecycle with latency at each stage, identifies the optimization levers, and distinguishes cold vs warm requests.
When it comes up
- When the interviewer asks "walk me through a single request end-to-end"
- When proposing a protocol (REST / gRPC / WebSocket / SSE)
- During latency budget discussions — every hop has a networking cost
- When the design goes multi-region and cross-continent latency becomes the bottleneck
- When designing real-time features (chat, live feeds, notifications)
Order of reveal
1. Trace the request lifecycle. "DNS → TCP → TLS → HTTP. Cold start is ~150 ms before any app logic runs. Warm connections skip DNS + TCP + TLS."
2. Pick the protocol with a reason. "REST for public/browser, gRPC for internal service-to-service, WebSocket for bidirectional, SSE for server push. Default REST unless specific need."
3. Budget latency by region topology. "Same AZ under 1 ms, cross-AZ 2 ms, cross-region 60 ms, cross-continent 100 ms. Serial hops add up; parallelise what you can."
4. Name connection reuse strategy. "Connection pool + HTTP/2 multiplexing. New TCP+TLS per request costs 60+ ms; unacceptable at scale."
5. Address DNS as infrastructure. "GeoDNS for global routing, 60-300 s TTL for failover speed. DNS is part of the load balancer, not just name resolution."
6. Flag cross-region cost before asked. "The cross-Pacific RTT is ~150 ms. No engineering beats physics; we need regional deployments with local data."
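The budgeting step comes down to one distinction: calls made in series add their latencies, while independent calls made in parallel cost only as much as the slowest one. A small sketch, assuming five cross-region calls at 60 ms each and hypothetical helper names:

```python
def serial_ms(call_latencies):
    """Each call waits for the previous one, so latencies add."""
    return sum(call_latencies)


def parallel_ms(call_latencies):
    """Independent calls issued together finish when the slowest one does."""
    return max(call_latencies, default=0)


cross_region_calls = [60, 60, 60, 60, 60]
print(serial_ms(cross_region_calls), parallel_ms(cross_region_calls))  # 300 60
```

The parallel figure only applies when no call needs another call's result; a true data dependency stays on the serial path.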
Signature phrases
- “Cross-region latency is physics, not engineering” — Forces honest conversation about what optimisation can and can't fix.
- “Default REST, justify gRPC” — Prevents the reflexive gRPC-to-sound-senior trap.
- “SSE before WebSocket unless client needs to push” — Sharp heuristic that catches a common over-reach.
- “TLS 1.3 saves one round-trip over 1.2” — Concrete latency detail showing depth.
- “Connection pool everything” — Core scaling discipline stated in three words.
- “QUIC eliminates head-of-line blocking” — Mobile-first latency insight few candidates mention.
Likely follow-ups
“Why is cross-continent latency so stubbornly high?”
Physics. The speed of light through optical fiber is roughly 200,000 km/s (2/3 the vacuum speed because of fiber index of refraction). The distance from New York to London is ~5,500 km. Round trip = 11,000 km / 200,000 km/s = 55 ms minimum. Add queuing, routing, and re-buffering through ~20 hops and you get 80–120 ms in practice.
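The lower bound above is a one-line calculation. A sketch using the same figures (200,000 km/s in fiber, ~5,500 km New York to London); the function name is made up, and the result ignores routing detours, queuing, and processing at each hop.

```python
FIBER_KM_PER_S = 200_000   # roughly 2/3 of the vacuum speed of light


def min_rtt_ms(one_way_km):
    """Best-case round-trip time over fiber for a given one-way distance."""
    return 2 * one_way_km * 1000 / FIBER_KM_PER_S


print(min_rtt_ms(5_500))  # 55.0
```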
No optimisation beats physics. The only remedies are:
1. Regional deployments: serve users from a PoP / region near them.
2. Edge caching (CDN): move cacheable content as close to users as possible.
3. Connection reuse + keep-alive: amortise the handshake cost across many requests.
4. Async / optimistic UI: don't make the user wait for the round-trip if you can avoid it.
For sub-100 ms p95 globally you need regional data, not one region serving the world.
“Your design opens a new TCP connection for every downstream call. What's wrong?”
TCP + TLS handshake costs ~30–60 ms of pure latency per new connection. At 1000 QPS of outbound calls, that's 1000 handshakes/s, which adds up to 30–60 seconds of cumulative handshake wait across requests every second, plus the CPU cost of 1000 TLS key exchanges.
The fix: connection pooling.
- Each process maintains a pool of N open connections per downstream
- Requests reuse existing connections from the pool
- HTTP/2 multiplexes many streams over one connection — one connection often suffices
- Idle connections get refreshed with heartbeats
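Example: a Node.js http.Agent with keepAlive: true or a Go http.Client with default transport gives you pooling for free. A gRPC channel keeps a long-lived HTTP/2 connection and multiplexes calls over it. Database connection poolers (PgBouncer as a proxy in front of Postgres, HikariCP inside a JVM app) pool DB connections.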
The cardinal sin: creating the client inside a Lambda handler with no pooling, so every invocation opens a new TLS connection to the DB instead of reusing one created outside the handler. Seen in production.
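The effect of reuse can be observed with Python's standard library alone. The sketch below starts a throwaway local HTTP server that counts accepted TCP connections, then sends five requests with and without connection reuse. `CountingHandler` and `connections_used` are names invented for this demo; it runs over plain HTTP on loopback, so it shows connection counts, not real handshake latency.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # allows the server to keep connections open
    connections_opened = 0

    def setup(self):
        # One handler instance is created per accepted TCP connection.
        CountingHandler.connections_opened += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass   # keep the demo quiet


def connections_used(reuse: bool, requests: int = 5) -> int:
    """Send `requests` GETs and return how many TCP connections the server saw."""
    CountingHandler.connections_opened = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address[:2]
    shared = http.client.HTTPConnection(host, port, timeout=5) if reuse else None
    try:
        for _ in range(requests):
            conn = shared if shared is not None else http.client.HTTPConnection(
                host, port, timeout=5)
            conn.request("GET", "/")
            conn.getresponse().read()   # drain the body so the socket can be reused
            if shared is None:
                conn.close()
    finally:
        if shared is not None:
            shared.close()
        server.shutdown()
        server.server_close()
    return CountingHandler.connections_opened


print(connections_used(reuse=True), connections_used(reuse=False))  # 1 5
```

In a serverless function the same idea applies: create the client once outside the handler so warm invocations find an open connection waiting.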
“How do you choose between WebSocket and SSE for a real-time feature?”
Ask: does the client need to push data back frequently?
SSE (default, simpler):
- Server-to-client one-way push over HTTP
- Works with existing HTTP infrastructure (LBs, proxies, CDNs)
- Auto-reconnects on disconnect (browser built-in)
- Stateless at the LB layer (just a long-lived HTTP response)
WebSocket (only when needed):
- Full bidirectional framing over a persistent TCP connection
- Required for chat, collaborative editing, gaming
- Needs special LB handling (sticky sessions, L7 with WS upgrade support)
- Stateful — connection affinity complicates failover and scaling
Rule of thumb: if the client sends an occasional message (sending a chat message at human typing speed), SSE + a regular POST is fine. If the client sends rapid or continuous frames (collaborative editing keystrokes, game inputs), WebSocket is worth the complexity.
Defaulting to WebSocket because it sounds "real-time" is the common mistake. SSE serves live feeds, notifications, score boards, and trading tickers beautifully.
Common mistakes
Cold DNS is 50-200ms. For mobile clients switching networks, DNS is re-resolved. Keep TTLs reasonable (60-300s) and factor this into your cold-start budget.
TCP + TLS costs 2-3 round-trips of pure latency (1 for TCP, then 1 for TLS 1.3 or 2 for TLS 1.2). Connection pools and HTTP/2 multiplexing amortise this across many requests. Opening a new connection per request is a performance anti-pattern.
WebSocket adds significant complexity (stateful connections, special LB handling, reconnection). If you only need server-to-client push, SSE is simpler and works with existing infrastructure.
Speed of light through fiber: US-East to US-West = ~60ms RTT. No optimization beats physics. If you need <50ms globally, you need regional deployments with local data.
Practice drills
Your app has 200ms p99 latency in US-East. A user in Singapore reports 800ms. Why?
Cross-Pacific RTT alone is ~200ms, and a cold request pays several round-trips in series. TCP handshake: 200ms (1 RTT). TLS handshake: 200ms (1 RTT with TLS 1.3; TLS 1.2 needs 2 RTTs and would add another 200ms). Request and response: 200ms (1 RTT). App processing in US-East, including its DB round-trips: ~200ms. Total: ~800ms. Fix: regional deployment in APAC, cache hot data locally, terminate TLS at the edge, and confirm TLS 1.3 is negotiated. Or at minimum: CDN for static assets and connection keepalive.
When would you use gRPC over REST?
Internal service-to-service calls where: (1) binary serialization matters (protobuf is smaller and faster to parse than JSON), (2) you need streaming (gRPC supports client, server, and bidirectional streaming natively), (3) strong typing and code generation across languages helps your dev velocity. NOT for browser-facing APIs — browsers don't natively support gRPC. The common production pattern: REST at the public edge, gRPC between internal services.
SSE or WebSocket for a live auction price feed?
SSE. The data flow is unidirectional: server pushes price updates to all connected bidders. Bidders don't push data back on this channel (bids go through the REST API). SSE works with existing HTTP infrastructure, auto-reconnects, and doesn't require special load balancer handling. WebSocket would add complexity for no benefit here. You'd only need WebSocket if bidders were sending rapid messages back — like a chat alongside the auction.
Cheat sheet
- Cold request: DNS (50ms) + TCP (30ms) + TLS (30ms) + processing = ~150ms+.
- Warm request (same connection): just processing. Often <20ms.
- Same AZ: ~0.5ms. Cross-AZ: ~2ms. Cross-region: ~60ms. Cross-continent: ~100ms.
- Default protocol: REST. Internal high-perf: gRPC. Bidirectional real-time: WebSocket. Push-only: SSE.
- Connection pool everything. New connections per request is a cardinal sin.
- GeoDNS for global routing. 60-300s TTL for failover speed.
- HTTP/2 multiplexes streams: one connection per destination is often enough.
- QUIC (HTTP/3) eliminates TCP head-of-line blocking. Good for mobile.