Networking fundamentals

DNS, TCP/TLS handshakes, HTTP lifecycle, protocol choice, latency by hop.

You can't design distributed systems without understanding how bytes travel from client to server and back. DNS, TCP, TLS, HTTP — these aren't trivia. They're the latency budget, the failure modes, and the protocol choices that underpin every design decision you make.

Read this if your last attempt…

You can't trace what happens between a client clicking Submit and seeing a response
You don't know the latency cost of a TCP + TLS handshake
You aren't sure when to reach for gRPC vs REST vs WebSocket vs SSE
Your design includes 5 serial cross-region calls without budgeting the latency

The concept

When a user hits your API for the first time, the request travels through a precise sequence. DNS resolution to find the IP (50-200 ms cold). A TCP three-way handshake to establish a reliable connection (1 round-trip). A TLS handshake to encrypt the channel (1 more round-trip with TLS 1.3). Only then does the HTTP request flow. That cold start costs 150 ms or more before your server processes a single byte. Subsequent requests on the same connection skip all of that. Just HTTP streams.

This matters because every design choice has networking consequences. Put a service in another region? Add 60-120 ms per call. Add a serial dependency? Its latency stacks on top of every call before it. Choose WebSocket over SSE? You gain bidirectional communication but add stateful connection management that complicates your load balancer, failover, and scaling story.

Architecture diagram: full request lifecycle, DNS → TCP → TLS → HTTP → response

Every cold request pays these costs in sequence. Subsequent requests on the same connection skip DNS + TCP + TLS. Understanding this pipeline is the foundation of every latency budget.
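The pipeline above can be turned into a rough budget calculator. This is a minimal sketch, assuming one round-trip for TCP, one for TLS 1.3 (two for a full TLS 1.2 handshake), and one for the HTTP exchange itself; the function name and the sample numbers are illustrative, not measurements.

```python
def request_latency_ms(rtt_ms, processing_ms, dns_ms=0, tls_round_trips=1, warm=False):
    """Estimate time to first response for a single HTTP request.

    warm=True models a reused connection: no DNS, TCP, or TLS cost.
    tls_round_trips is 1 for TLS 1.3 and 2 for a full TLS 1.2 handshake.
    """
    http_exchange = rtt_ms + processing_ms      # request out, response back
    if warm:
        return http_exchange
    tcp_handshake = rtt_ms                      # SYN, SYN-ACK, ACK: one round-trip
    tls_handshake = tls_round_trips * rtt_ms
    return dns_ms + tcp_handshake + tls_handshake + http_exchange


cold = request_latency_ms(rtt_ms=30, processing_ms=10, dns_ms=50)
warm = request_latency_ms(rtt_ms=30, processing_ms=10, warm=True)
print(cold, warm)  # 150 40
```

With a 30 ms round-trip, the cold request spends 110 ms on setup before the 40 ms exchange that a warm connection pays on its own.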

Protocol choice — default to REST unless you have a specific reason not to.

Protocol	Transport	Direction	Best for	Avoid when
REST / HTTP	TCP	Request-response	Public APIs, browser clients, simplicity	High-throughput internal calls where JSON overhead matters
gRPC	HTTP/2	Unary and streaming (client, server, bidirectional)	Service-to-service, binary data, streaming	Browser-facing APIs (no native browser support)
WebSocket	TCP (upgraded)	Bidirectional frames	Chat, live collaboration, gaming	Unidirectional updates (SSE is simpler)
SSE	HTTP (long-lived)	Server → client push	Notifications, live feeds, dashboards	Client needs to push data frequently
GraphQL	HTTP	Request-response	Diverse clients needing different data shapes	Simple CRUD, performance-critical paths
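One way to read the table is as a small decision function. The sketch below is an assumption about how the defaults compose, not a rule from any specification: the parameter names are hypothetical, and GraphQL is left out because it is a choice about query shape rather than transport.

```python
def choose_protocol(push_direction="none", browser_client=True, internal_high_volume=False):
    """Pick a default protocol from the direction of real-time data flow.

    push_direction: "none" (plain request-response), "server" (server pushes
    updates), or "both" (client and server both push frequently).
    """
    if push_direction == "both":
        return "WebSocket"
    if push_direction == "server":
        return "SSE"
    if internal_high_volume and not browser_client:
        return "gRPC"
    return "REST"  # the default unless there is a specific reason not to


print(choose_protocol())                                                 # REST
print(choose_protocol(push_direction="server"))                          # SSE
print(choose_protocol(browser_client=False, internal_high_volume=True))  # gRPC
```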

How interviewers grade this

You can trace the full request lifecycle: DNS → TCP → TLS → HTTP → response.
You know the latency cost of each hop and factor it into your design.
You choose the right protocol (REST vs gRPC vs WebSocket vs SSE) with specific justification.
You use DNS features (GeoDNS, failover, weighted routing) in your global architecture.
You understand connection pooling and why opening a new connection per request is wasteful.

Variants

TCP vs UDP

TCP guarantees delivery and ordering. UDP is fast but fire-and-forget.

TCP is the workhorse: it establishes a connection via a three-way handshake (SYN, SYN-ACK, ACK), guarantees that data arrives in order, retransmits lost packets, and provides flow and congestion control. The cost is overhead — each connection has setup latency and per-packet acknowledgements.

UDP is connectionless: you send a datagram and hope it arrives. No setup, no acknowledgement, no ordering guarantee. It's faster because there's no overhead, but the application must handle loss and reordering.

When to use UDP in a system design interview: real-time media (VoIP, video streaming, gaming) where a dropped frame is better than a delayed one, high-volume telemetry where occasional loss is acceptable, and DNS lookups (simple query-response, retried at the application level).

For everything else — web APIs, databases, file transfer, messaging — TCP is the default. Most interviewers will assume TCP unless you explicitly propose UDP and justify it.

Modern compromise: QUIC (the protocol under HTTP/3) runs on UDP but adds reliability, encryption, and multiplexing. Each stream has independent loss recovery, so a dropped packet in one stream doesn't block others — solving TCP's head-of-line blocking problem. QUIC is particularly good on mobile (survives network switches without reconnection).
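Head-of-line blocking is easier to see with numbers. The following is a toy model, not a protocol implementation: each packet is a (stream, arrival time) pair in send order, and a packet is handed to the application only once every earlier packet in its ordering scope has arrived. TCP has one scope per connection; QUIC has one per stream. The arrival times are made up for illustration.

```python
def delivery_times(packets, per_stream_ordering):
    """Return when each packet reaches the application.

    packets: list of (stream_id, arrival_ms) in the order they were sent.
    per_stream_ordering: False models TCP (one ordered byte stream per
    connection), True models QUIC (ordering enforced per stream only).
    """
    delivered = []
    latest_arrival = {}  # ordering scope -> latest arrival seen so far
    for stream_id, arrival_ms in packets:
        scope = stream_id if per_stream_ordering else "connection"
        latest_arrival[scope] = max(latest_arrival.get(scope, 0), arrival_ms)
        delivered.append(latest_arrival[scope])
    return delivered


# The first packet of stream A is lost and its retransmission lands at 110 ms.
packets = [("A", 110), ("B", 11), ("B", 12), ("A", 13)]
print(delivery_times(packets, per_stream_ordering=False))  # [110, 110, 110, 110]
print(delivery_times(packets, per_stream_ordering=True))   # [110, 11, 12, 110]
```

In the TCP-style run, stream B waits 110 ms for a packet it does not need; in the QUIC-style run, only stream A is delayed.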

Pros

+TCP: reliable, ordered, well-understood — the default
+UDP: lower latency, no connection overhead
+QUIC: best of both — reliable + no HoL blocking

Cons

−TCP: head-of-line blocking (one lost packet stalls all streams)
−UDP: application must handle loss and ordering
−QUIC: newer, some firewalls block it, less tooling

Choose this variant when

TCP: everything except real-time media and telemetry
UDP: VoIP, video streaming, gaming, DNS
QUIC/HTTP/3: mobile clients, high-latency networks

REST vs gRPC

REST for public APIs and simplicity. gRPC for internal service-to-service where performance matters.

REST uses HTTP + JSON. It's well-understood, works everywhere (including browsers), and interviewers expect it as the default. Resource-oriented (GET /users/123, POST /orders), uses standard HTTP verbs and status codes. Easy to debug with curl.

gRPC uses HTTP/2 + Protocol Buffers (binary serialization). It is usually faster than REST with JSON because payloads are smaller and cheaper to parse; some benchmarks report up to 10x throughput, though the gain depends heavily on payload shape and workload. Strong typing via .proto definitions and generated code means many contract errors are caught at compile time rather than at runtime. gRPC natively supports streaming (client, server, and bidirectional), deadlines for timeout propagation, and client-side load balancing.
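To get a feel for why binary serialization is smaller, compare the same record encoded as JSON and as fixed-width binary. This sketch uses Python's struct module as a stand-in; it is not Protocol Buffers (protobuf uses field tags and variable-length integers), and the record and its field names are hypothetical.

```python
import json
import struct

order = {"user_id": 123, "order_id": 456789, "amount_cents": 1999, "express": True}

# JSON repeats every field name as text in every message.
json_bytes = json.dumps(order).encode("utf-8")

# Binary layout agreed in advance by both sides, so no field names on the wire.
# "<" = little-endian without padding, I = uint32, Q = uint64, ? = bool.
ORDER_LAYOUT = "<IQI?"
binary_bytes = struct.pack(
    ORDER_LAYOUT,
    order["user_id"], order["order_id"], order["amount_cents"], order["express"],
)

print(len(binary_bytes))                    # 17
print(len(json_bytes) > len(binary_bytes))  # True
print(struct.unpack(ORDER_LAYOUT, binary_bytes))  # (123, 456789, 1999, True)
```

The trade-off is the one the section describes: the binary form needs a shared schema and is unreadable in curl, while JSON is self-describing and easy to debug.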

The common pattern in production: REST for external/public APIs (browser-compatible, easy to integrate), gRPC for internal service-to-service calls (fast, typed, streaming). Many companies run this dual-protocol architecture.

In interviews, default to REST. Propose gRPC when the interviewer probes on internal service performance, when you have a microservices architecture with high call volumes between services, or when you need streaming. Don't propose gRPC just to sound sophisticated — justify it with a specific performance or streaming need.

Choose this variant when

REST: public APIs, browser clients, simple CRUD, most interview scenarios
gRPC: internal service-to-service, high throughput, streaming, binary data

WebSocket vs SSE

WebSocket for bidirectional real-time. SSE for server-push-only updates.

SSE (Server-Sent Events) is a one-way channel: the client opens an HTTP connection, and the server pushes messages down it as they happen. It's built on standard HTTP, works with existing load balancers and proxies, and auto-reconnects on disconnection. Think of it as a long-lived HTTP response that sends multiple messages over time.
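The "long-lived HTTP response" is plain text in the text/event-stream format: each message is a few "field: value" lines ended by a blank line. Here is a minimal sketch of a formatter for that wire format; the function name and the sample event are illustrative.

```python
def format_sse(data, event=None, event_id=None):
    """Serialize one Server-Sent Events message.

    event_id becomes the "id:" field, which a browser sends back as the
    Last-Event-ID header when it reconnects. Multi-line data is split into
    one "data:" line per line of text.
    """
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    for chunk in data.split("\n"):
        lines.append(f"data: {chunk}")
    return "\n".join(lines) + "\n\n"  # the blank line terminates the message


print(repr(format_sse("price=101", event="bid", event_id=7)))
# 'id: 7\nevent: bid\ndata: price=101\n\n'
```

A server would write strings like this to a response with Content-Type: text/event-stream and keep the connection open.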

WebSocket upgrades an HTTP connection to a persistent, bidirectional TCP channel. Both client and server can send messages at any time. It's necessary when the client needs to push data frequently (chat, collaborative editing, gaming).
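The upgrade itself is an ordinary HTTP request carrying a Sec-WebSocket-Key header; the server proves it understood the request by answering 101 Switching Protocols with a Sec-WebSocket-Accept value derived from that key. A sketch of the derivation as defined in RFC 6455 (the function name is illustrative; the sample key is the one used in the RFC):

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept header for a client's key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

After this exchange the connection stops speaking HTTP and both sides send WebSocket frames, which is why intermediaries have to support the upgrade explicitly.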

The key decision: if you only need server-to-client push (notifications, live feeds, dashboards), SSE is simpler and should be your default. WebSockets add significant infrastructure complexity — you need stateful connection management, special load balancer handling, reconnection logic, and your horizontal scaling story gets harder because connections are pinned to specific servers.

In interviews, proposing WebSocket without justifying bidirectional need is a red flag. Many candidates default to WebSocket for anything "real-time" when SSE would be sufficient and simpler. Name the direction of data flow, then pick the simpler option.

Pros

+SSE: simple, works with existing HTTP infrastructure, auto-reconnect
+WebSocket: true bidirectional, low-latency framing, good for chat and games

Cons

−SSE: server-to-client only, some misbehaving proxies buffer responses
−WebSocket: stateful connections complicate LB and failover, more infrastructure work

Choose this variant when

SSE: notifications, live feeds, auction updates, score boards
WebSocket: chat, collaborative editing, multiplayer gaming, anything needing client push

DNS as infrastructure

DNS is not just name resolution — it's load balancing, failover, and traffic shaping.

GeoDNS resolves the same domain to different IPs based on the client's location; latency-based routing is a close cousin that picks the region with the lowest measured latency instead. Users in Europe get the EU region IP, users in Asia get APAC. This is the first layer of global load balancing, before the request even reaches a server.

DNS failover: health checks probe an endpoint in each region. When a region is unhealthy, DNS stops returning its IP. With short TTLs (60s), most clients move over about a minute after the failure is detected.

Weighted routing: 95% of DNS responses point to the stable deployment, 5% to canary. Cheap traffic splitting for canary deploys without touching load balancers.
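Weighted routing amounts to a weighted random choice made once per DNS answer. The sketch below simulates that split; it is a model of the behaviour, not any DNS provider's API, and the target names and seed are arbitrary.

```python
import random


def pick_target(weights, rng):
    """Choose one target with probability proportional to its weight."""
    targets = list(weights)
    return rng.choices(targets, weights=[weights[t] for t in targets], k=1)[0]


rng = random.Random(42)  # seeded so the simulation is repeatable
weights = {"stable": 95, "canary": 5}
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_target(weights, rng)] += 1

canary_share = counts["canary"] / 10_000
print(f"canary share: {canary_share:.1%}")
```

Note that the split applies to DNS answers, not requests: a resolver that caches one answer sends all of its clients to the same target until the TTL expires, so the real traffic ratio is only approximate.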

TTL tradeoffs: short TTL (60s) = fast failover but more DNS queries and more load on resolvers. Long TTL (3600s) = less DNS traffic but slow failover. Some ISP resolvers ignore TTLs entirely. Don't rely on DNS alone for fast failover — layer it with health checks at the load balancer.
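The TTL is only part of the failover window, because the health check has to notice the failure first. A rough worst-case estimate, assuming resolvers honour the TTL; the check interval and failure threshold below are hypothetical settings, not defaults of any particular provider.

```python
def worst_case_failover_s(check_interval_s, failure_threshold, ttl_s):
    """Upper bound on how long clients may keep hitting a dead region.

    Detection takes failure_threshold consecutive failed checks; after that,
    an answer cached just before detection can live for one more full TTL.
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + ttl_s


print(worst_case_failover_s(check_interval_s=10, failure_threshold=3, ttl_s=60))    # 90
print(worst_case_failover_s(check_interval_s=10, failure_threshold=3, ttl_s=3600))  # 3630
```

Resolvers that ignore TTLs can exceed even this bound, which is the reason to layer load balancer health checks on top.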

Connection pooling: every new TCP connection costs a 3-way handshake plus TLS setup, which is 2-3 round-trips of pure latency (1 for TCP, then 1 for TLS 1.3 or 2 for TLS 1.2). At 1000 QPS, that's 1000 handshakes/sec wasted. Connection pools pre-establish and reuse connections. HTTP/2 multiplexes many requests over one connection, making pools even more efficient. This is not optional at scale; it's table stakes.
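A pool is little more than a queue of already-open connections. The sketch below shows the shape of the idea with a fake connection class that counts how many times a "handshake" happens; both class names are hypothetical, and a production pool would add health checks, timeouts, and idle eviction.

```python
import queue


class FakeConnection:
    """Stand-in for a real TCP+TLS connection; counts simulated handshakes."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1  # this is where the handshake cost would be paid

    def send(self, payload):
        return f"ok:{payload}"


class ConnectionPool:
    def __init__(self, size, factory):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())  # pay the setup cost once, up front

    def request(self, payload):
        conn = self._idle.get()        # blocks if every connection is busy
        try:
            return conn.send(payload)
        finally:
            self._idle.put(conn)       # hand the connection back for reuse


pool = ConnectionPool(size=4, factory=FakeConnection)
responses = [pool.request(i) for i in range(1000)]
print(FakeConnection.opened)  # 4
```

A thousand requests cost four handshakes instead of a thousand.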

Choose this variant when

Multi-region deployments need GeoDNS
Canary deploys need weighted routing
Any service-to-service communication needs connection pooling

Worked example

Scenario: your interviewer asks "what happens when a user in Tokyo loads your app?"

DNS (50ms): Browser checks local cache → miss → ISP resolver → Route53 latency-based routing returns Tokyo-region ALB IP.

TCP (2ms): Three-way handshake to Tokyo ALB (same region = low latency).

TLS (4ms): TLS 1.3, 1 round-trip. ALPN negotiates HTTP/2. Encrypted channel ready.

HTTP/2 request: GET /api/feed → ALB routes to app server in same AZ.

App processing (15ms): JWT validation from cached public key (1ms) → Redis cache hit for feed (1ms) → partial cache miss → DB query (8ms) → response assembly (5ms).

Response (5ms): 10KB JSON, compressed with Brotli. Streamed back over the existing HTTP/2 connection.

Total cold start: ~76ms. Subsequent requests skip DNS + TCP + TLS — just HTTP/2 streams at ~20ms each.

If Tokyo region goes down: Route53 health check fails → DNS stops returning Tokyo IP within 60s TTL → next DNS lookup returns US-West IP → latency jumps to ~200ms (cross-Pacific) but service continues.
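The totals in this walkthrough are simple sums, and writing them down makes it obvious where the cold-start time goes. The numbers below are copied from the steps above; only the variable names are new.

```python
cold_stages_ms = {"dns": 50, "tcp": 2, "tls": 4, "app": 15, "response": 5}
app_breakdown_ms = {"jwt": 1, "redis": 1, "db": 8, "assembly": 5}

cold_total = sum(cold_stages_ms.values())
warm_total = cold_stages_ms["app"] + cold_stages_ms["response"]  # no DNS, TCP, TLS
slowest_stage = max(cold_stages_ms, key=cold_stages_ms.get)

print(cold_total)                      # 76
print(warm_total)                      # 20
print(sum(app_breakdown_ms.values()))  # 15
print(slowest_stage)                   # dns
```

DNS accounts for 50 of the 76 ms, so in a same-region setup the cold-start budget is dominated by name resolution rather than by the handshakes.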

Good vs bad answer

Interviewer probe

“What happens between a client clicking Submit and seeing the response?”

Weak answer

"The request goes to the server, gets processed, and the response comes back."

Strong answer

"First DNS resolution — cached after the first lookup for TTL seconds. Then TCP handshake, 1 round-trip. Then TLS 1.3, 1 more round-trip. Only then does the HTTP request flow — over HTTP/2 so subsequent requests multiplex on the same connection. Server-side: ALB routes to an app instance, JWT validates from a cached key, Redis serves the hot data, DB handles cache misses. Response is Brotli-compressed. Cold start: ~150ms. Warm connection: ~20ms. The biggest lever is keeping connections alive and caching DNS."

Why it wins: Traces the full lifecycle with latency at each stage, identifies the optimization levers, and distinguishes cold vs warm requests.

Interview playbook

5–10 min, usually inside protocol choice or latency budget discussion

When it comes up

When the interviewer asks "walk me through a single request end-to-end"
When proposing a protocol (REST / gRPC / WebSocket / SSE)
During latency budget discussions — every hop has a networking cost
When the design goes multi-region and cross-continent latency becomes the bottleneck
When designing real-time features (chat, live feeds, notifications)

Order of reveal

1. Trace the request lifecycle. "DNS → TCP → TLS → HTTP. Cold start is ~150 ms before any app logic runs. Warm connections skip DNS + TCP + TLS."
2. Pick the protocol with a reason. "REST for public/browser, gRPC for internal service-to-service, WebSocket for bidirectional, SSE for server push. Default REST unless specific need."
3. Budget latency by region topology. "Same AZ under 1 ms, cross-AZ 2 ms, cross-region 60 ms, cross-continent 100 ms. Serial hops add up, so parallelise what you can."
4. Name connection reuse strategy. "Connection pool + HTTP/2 multiplexing. New TCP+TLS per request costs 60+ ms; unacceptable at scale."
5. Address DNS as infrastructure. "GeoDNS for global routing, 60-300 s TTL for failover speed. DNS is part of the load balancer, not just name resolution."
6. Flag cross-region cost before asked. "The cross-Pacific RTT is ~150 ms. No engineering beats physics, so we need regional deployments with local data."
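The "parallelise what you can" advice in the latency budget step comes down to sum versus max. A minimal sketch, assuming calls inside one stage run concurrently and stages run one after another; the function name and the example plans are illustrative.

```python
def plan_latency_ms(stages):
    """Total latency of a call plan.

    stages: list of lists of per-call latencies in ms. Calls within a stage
    run in parallel (the stage costs its slowest call); stages run serially
    (their costs add).
    """
    return sum(max(stage) for stage in stages if stage)


five_serial_cross_region = [[60], [60], [60], [60], [60]]
five_parallel_cross_region = [[60, 60, 60, 60, 60]]

print(plan_latency_ms(five_serial_cross_region))    # 300
print(plan_latency_ms(five_parallel_cross_region))  # 60
print(plan_latency_ms([[2], [60, 60], [1]]))        # 63
```

Five serial cross-region calls spend 300 ms on the network alone; fanning them out in one stage brings that back to a single 60 ms hop.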

Signature phrases

“Cross-region latency is physics, not engineering” — Forces honest conversation about what optimisation can and can't fix.
“Default REST, justify gRPC” — Prevents the reflexive gRPC-to-sound-senior trap.
“SSE before WebSocket unless client needs to push” — Sharp heuristic that catches a common over-reach.
“TLS 1.3 saves one round-trip over 1.2” — Concrete latency detail showing depth.
“Connection pool everything” — Core scaling discipline stated in three words.
“QUIC eliminates head-of-line blocking” — Mobile-first latency insight few candidates mention.

Likely follow-ups

“Why is cross-continent latency so stubbornly high?”

Physics. The speed of light through optical fiber is roughly 200,000 km/s (2/3 the vacuum speed because of fiber index of refraction). The distance from New York to London is ~5,500 km. Round trip = 11,000 km / 200,000 km/s = 55 ms minimum. Add queuing, routing, and re-buffering through ~20 hops and you get 80–120 ms in practice.
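The lower bound in that calculation is easy to reproduce for any pair of cities. A small sketch using the same ~200,000 km/s figure for light in fiber; it assumes a straight-line cable, which real routes never are, so treat the result as a floor.

```python
FIBER_SPEED_KM_PER_S = 200_000  # roughly 2/3 of the speed of light in vacuum


def min_rtt_ms(distance_km):
    """Theoretical minimum round-trip time over fiber, in milliseconds."""
    return 2 * distance_km * 1000 / FIBER_SPEED_KM_PER_S


print(min_rtt_ms(5_500))  # 55.0  (New York to London)
```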

No optimisation beats physics. The only remedies are:

1. Regional deployments: serve users from a PoP / region near them.
2. Edge caching (CDN): move cacheable content as close to users as possible.
3. Connection reuse + keep-alive: amortise the handshake cost across many requests.
4. Async / optimistic UI: don't make the user wait for the round-trip if you can avoid it.

For sub-100 ms p95 globally you need regional data, not one region serving the world.

“Your design opens a new TCP connection for every downstream call. What's wrong?”

TCP + TLS handshake costs ~30–60 ms of pure latency per new connection. At 1000 QPS of outbound calls, that's 1000 handshakes/s: 30–60 seconds of cumulative handshake wait added for every second of traffic, plus the CPU cost of the TLS key exchanges.

The fix: connection pooling.

Each process maintains a pool of N open connections per downstream
Requests reuse existing connections from the pool
HTTP/2 multiplexes many streams over one connection — one connection often suffices
Idle connections get refreshed with heartbeats

Example: a Node.js http.Agent with keepAlive: true or a Go http.Client with the default transport gives you pooling for free. A gRPC channel keeps a long-lived HTTP/2 connection and multiplexes calls over it. For databases, use a pooler such as PgBouncer (a proxy in front of PostgreSQL) or HikariCP (a JDBC connection pool).

The cardinal sin: a Lambda that opens its connection inside the handler with no reuse, so every invocation pays for a new TLS connection to the DB. Seen in production.

“How do you choose between WebSocket and SSE for a real-time feature?”

Ask: does the client need to push data back frequently?

SSE (default, simpler):

Server-to-client one-way push over HTTP
Works with existing HTTP infrastructure (LBs, proxies, CDNs)
Auto-reconnects on disconnect (browser built-in)
Stateless at the LB layer (just a long-lived HTTP response)

WebSocket (only when needed):

Full bidirectional framing over a persistent TCP connection
Required for chat, collaborative editing, gaming
Needs special LB handling (sticky sessions, L7 with WS upgrade support)
Stateful — connection affinity complicates failover and scaling

Rule of thumb: if the client sends an occasional message (sending a chat message at human typing speed), SSE + a regular POST is fine. If the client sends rapid or continuous frames (collaborative editing keystrokes, game inputs), WebSocket is worth the complexity.

Defaulting to WebSocket because it sounds "real-time" is the common mistake. SSE serves live feeds, notifications, score boards, and trading tickers beautifully.

Common mistakes

Ignoring DNS latency in your budget

Cold DNS is 50-200ms. For mobile clients switching networks, DNS is re-resolved. Keep TTLs reasonable (60-300s) and factor this into your cold-start budget.
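What a TTL buys you is visible in a few lines of cache logic. This is a toy in-process cache with an injected clock and resolver so it can run without a network; the class, the fake resolver, and the example address (from the documentation range 203.0.113.0/24) are all hypothetical.

```python
class DnsCache:
    """Cache name -> IP answers until their TTL expires."""

    def __init__(self, resolve, clock):
        self._resolve = resolve  # callable: name -> (ip, ttl_seconds)
        self._clock = clock      # callable returning the current time in seconds
        self._entries = {}       # name -> (ip, expires_at)

    def lookup(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]                  # warm: no network round-trip
        ip, ttl_s = self._resolve(name)       # cold: pay the resolution cost
        self._entries[name] = (ip, now + ttl_s)
        return ip


resolver_calls = []

def fake_resolve(name):
    resolver_calls.append(name)
    return "203.0.113.10", 60  # 60-second TTL

now = [0.0]
cache = DnsCache(fake_resolve, clock=lambda: now[0])

cache.lookup("api.example.com")  # t=0s: miss, resolver is called
now[0] = 30.0
cache.lookup("api.example.com")  # t=30s: served from cache
now[0] = 61.0
cache.lookup("api.example.com")  # t=61s: TTL expired, resolver is called again
print(len(resolver_calls))       # 2
```

A shorter TTL moves the third lookup earlier, which is the failover benefit, and makes cold lookups more frequent, which is the latency cost.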

New TCP connection per request

TCP + TLS costs 2-3 round-trips of pure latency (2 with TLS 1.3, 3 with TLS 1.2). Connection pools and HTTP/2 multiplexing amortise this across many requests. Opening a new connection per request is a performance anti-pattern.

WebSocket for everything "real-time"

WebSocket adds significant complexity (stateful connections, special LB handling, reconnection). If you only need server-to-client push, SSE is simpler and works with existing infrastructure.

Forgetting that cross-region latency is physics (Advanced)

Speed of light through fiber: US-East to US-West = ~60ms RTT. No optimization beats physics. If you need <50ms globally, you need regional deployments with local data.

Practice drills

Your app has 200ms p99 latency in US-East. A user in Singapore reports 800ms. Why?

Cross-Pacific RTT alone is ~200ms, and a cold connection pays it several times: TCP handshake, 1 RTT (200ms); TLS 1.2 handshake, 2 RTTs (400ms); the HTTP request and response, 1 more RTT (200ms). That is ~800ms of network round-trips before server processing is even counted. Fix: regional deployment in APAC, cache hot data locally, terminate TLS at the edge, upgrade to TLS 1.3 (saves 1 RTT). Or at minimum: CDN for static assets and connection keepalive.

When would you use gRPC over REST?

Internal service-to-service calls where: (1) binary serialization matters (protobuf is smaller and faster to parse than JSON), (2) you need streaming (gRPC supports client, server, and bidirectional streaming natively), (3) strong typing and code generation across languages helps your dev velocity. NOT for browser-facing APIs — browsers don't natively support gRPC. The common production pattern: REST at the public edge, gRPC between internal services.

SSE or WebSocket for a live auction price feed?

SSE. The data flow is unidirectional: server pushes price updates to all connected bidders. Bidders don't push data back on this channel (bids go through the REST API). SSE works with existing HTTP infrastructure, auto-reconnects, and doesn't require special load balancer handling. WebSocket would add complexity for no benefit here. You'd only need WebSocket if bidders were sending rapid messages back — like a chat alongside the auction.

Cheat sheet

•Cold request: DNS (50ms) + TCP (30ms) + TLS (30ms) + processing = ~150ms+.
•Warm request (same connection): just processing. Often <20ms.
•Same AZ: ~0.5ms. Cross-AZ: ~2ms. Cross-region: ~60ms. Cross-continent: ~100ms.
•Default protocol: REST. Internal high-perf: gRPC. Bidirectional real-time: WebSocket. Push-only: SSE.
•Connection pool everything. New connections per request is a cardinal sin.
•GeoDNS for global routing. 60-300s TTL for failover speed.
•HTTP/2 multiplexes streams — one connection per destination is often enough.
•QUIC (HTTP/3) eliminates TCP head-of-line blocking. Good for mobile.
