Load balancing & traffic routing
L4 vs L7, session affinity, health checks, global routing.
L4 vs L7 is not a trivia question — it's about whether the LB can make decisions based on the request content. One is dumb and fast; the other is smart and expensive. Most prompts want L7 at the edge and L4 between services.
Read this if your last attempt…
- You said "we'll use a load balancer" without specifying L4 or L7
- You proposed sticky sessions without naming the consistency cost
- You couldn't say how the LB detects a dead backend
The concept
A load balancer has two jobs: distribute traffic and remove dead backends. Everything else (TLS termination, path routing, rate limiting) is a feature layered on top.
The core split is L4 vs L7:
L4 routes by IP/port at connection time. L7 parses HTTP/gRPC and routes by path, header, or cookie — smarter but more expensive per request.
Load-balancing algorithms — when each wins.
| Algorithm | Good for | Avoid when |
|---|---|---|
| Round-robin | Uniform, stateless backends | Backends have different capacities or warm caches |
| Least-connections | Long-lived connections (WebSocket, DB pool) | Short-lived HTTP where the count is always ~0 |
| Consistent hashing | Sticky routing to cache shards; stateful nodes | You want uniform load and backends are identical |
| Weighted | Canary / phased rollout of a new version | Baseline traffic; over-engineering for the common case |
| Power of two choices | Large fleets where full state is costly | Small fleets (<10) where tracking every backend is cheap, so plain least-connections does at least as well |
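To make the table concrete, here is a minimal Python sketch of three of the selection rules. The backend names and connection counts are invented for illustration, and a real LB would keep the counts up to date as connections open and close.

```python
import itertools
import random
from typing import Dict, Iterator, List, Optional


def round_robin(backends: List[str]) -> Iterator[str]:
    """Cycle through backends in a fixed order, ignoring their load."""
    return itertools.cycle(backends)


def least_connections(active: Dict[str, int]) -> str:
    """Scan every backend and pick the one with the fewest open connections."""
    return min(active, key=lambda name: active[name])


def power_of_two_choices(active: Dict[str, int],
                         rng: Optional[random.Random] = None) -> str:
    """Sample two distinct backends at random and keep the less loaded one."""
    rng = rng or random.Random()
    first, second = rng.sample(list(active), 2)
    return first if active[first] <= active[second] else second


# Hypothetical open-connection counts per backend.
active = {"a": 12, "b": 3, "c": 7}

rr = round_robin(list(active))
first_four = [next(rr) for _ in range(4)]   # ['a', 'b', 'c', 'a']
chosen = least_connections(active)           # 'b'
sampled = power_of_two_choices(active, random.Random(0))
```

Power of two choices only reads two counters per decision, which is why it suits large fleets; it can never pick the single most loaded backend, because that backend loses every pairwise comparison.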
How interviewers grade this
- You name L4 vs L7 explicitly and justify the choice.
- You state the health-check cadence and failure threshold (e.g. "2 fails in 5s → drained").
- You name the algorithm (round-robin, least-connections, consistent-hash) and when each wins.
- You distinguish global (DNS/anycast) from local (in-region) load balancing.
- If you propose sticky sessions, you name the cost (uneven load, failure-recovery pain).
Variants
L4 at the edge
Transport-layer LB; forwards TCP/UDP without parsing.
The right choice for non-HTTP workloads (game servers, TCP-based RPC, WebRTC signalling). Cheap, fast, and supports any protocol — but you lose path-based routing and per-request observability.
Pros
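- Very low latency overhead
- Protocol-agnostic (TCP/UDP/QUIC)
- Simple operational model
Cons
- No path/header routing
- No TLS termination in pure pass-through mode (some L4 LBs, such as NLB with TLS listeners, can terminate TLS but still can't route on request content)
- Limited observability (no request logs)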
Choose this variant when
- Non-HTTP workload
- Lowest-latency requirement
- WebSocket or long-lived TCP
L7 at the edge
Application-aware LB; parses HTTP/gRPC and routes by content.
Default for anything web-facing. TLS termination, path-based routing (/api/* → api-tier, /static/* → CDN), header-based A/B, per-route rate limits — all unlocked at this layer.
Pros
- Path/header/cookie routing
- TLS termination + HTTP/2/3 features
- Rich observability (per-request logs)
Cons
- More CPU per request
- Bound to HTTP family
- More attack surface (request parsing)
Choose this variant when
- HTTP / gRPC traffic
- Multiple services behind one hostname
- Need TLS termination + request-level visibility
Consistent-hash LB
Hash(request_key) → shard; minimal reshuffling on node changes.
Use when stateful nodes (cache shards, session stores) must receive the same keys repeatedly. Virtual nodes prevent hotspots. Rebalance cost on node add/remove is O(keys / shards) rather than O(keys).
Pros
- Stable key-to-node mapping
- Preserves cache warmth on scaling
- Minimal reshuffle on membership change
Cons
- Uneven load if keys are skewed
- Operationally more complex (ring state)
- Failures can cascade if the hash re-routes to an overloaded peer
Choose this variant when
- Sharded caches
- Affinity to a stateful node
- Keys are high-cardinality and well-distributed
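A minimal sketch of a hash ring with virtual nodes may help here. It is an illustration under assumptions, not how any particular LB implements it: the node names, the `HashRing` class, the SHA-256-based hash, and 128 virtual nodes per shard are all choices made for this example.

```python
import bisect
import hashlib
from typing import Dict, List, Tuple


def stable_hash(value: str) -> int:
    """64-bit hash that is stable across processes (built-in hash() is salted)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 128) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `vnodes` points for this node on the ring."""
        for i in range(self._vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next point, wrapping around."""
        idx = bisect.bisect_right(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"key-{i}" for i in range(10_000)]
ring = HashRing(["cache-0", "cache-1", "cache-2", "cache-3"])
before: Dict[str, str] = {k: ring.lookup(k) for k in keys}

ring.add("cache-4")  # scale from 4 to 5 shards
after: Dict[str, str] = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
moved_fraction = len(moved) / len(keys)  # expected to be near 1/5
```

Every key that moves lands on the new shard, and the moved fraction should sit near 1/5, which is the O(keys / shards) rebalance cost described above. A modulo-based scheme (`hash % N`) would instead remap most keys when N changes.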
Worked example
Scenario: You're designing a chat app and someone asks about load balancing.
Public edge (clients → our system):
- L7 LB (ALB / Envoy) terminates TLS, routes by path: /ws → WebSocket gateway, /api → REST API, /media → object-storage proxy.
- Health checks: HTTP GET /health every 2s, 2 failures → drain.
- Algorithm: least-connections (WebSocket connections are long-lived; round-robin creates pile-ups).
Service-to-service (within the cluster):
- L4 or service-mesh sidecar (Envoy). mTLS between services.
- Algorithm: round-robin or least-connections; consistent hash only when calling a sharded component (e.g. the user-state service, which is shard-keyed by user_id).
Global (across regions):
- Route53 latency-based DNS → closest region.
- Anycast IP for WebSocket connection setup so clients land near-optimally.
Stated trade-off: "I'm using least-connections at the WS gateway, not round-robin, because connections are long-lived. If one gateway holds 10K connections and another holds 2K, round-robin keeps sending both the same share of new connections, so the gap never closes; least-connections steers new connections to the lighter gateway until they even out."
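The 10K vs 2K scenario can be checked with a small simulation. This is a simplified sketch: it assumes no existing connection closes while the new ones arrive, and the gateway names are hypothetical.

```python
from typing import Dict


def add_connections(counts: Dict[str, int], n_new: int, policy: str) -> Dict[str, int]:
    """Assign n_new long-lived connections under the given policy."""
    counts = dict(counts)  # work on a copy
    names = list(counts)
    for i in range(n_new):
        if policy == "round_robin":
            target = names[i % len(names)]        # ignores current load
        elif policy == "least_connections":
            target = min(counts, key=counts.get)  # lightest gateway wins
        else:
            raise ValueError(f"unknown policy: {policy}")
        counts[target] += 1
    return counts


start = {"gw-1": 10_000, "gw-2": 2_000}
rr = add_connections(start, 4_000, "round_robin")
lc = add_connections(start, 4_000, "least_connections")
# rr -> {'gw-1': 12000, 'gw-2': 4000}   gap stays at 8K
# lc -> {'gw-1': 10000, 'gw-2': 6000}   gap shrinks to 4K
```

Round-robin preserves the imbalance it inherited; least-connections spends every new connection on closing it.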
Good vs bad answer
Interviewer probe
“Why L7 instead of L4 at the edge?”
Weak answer
"Because L7 is smarter. It's what everyone uses."
Strong answer
"Because I need path-based routing (/api vs /ws vs /static route to different tiers), TLS termination (clients are on HTTP/2, backends on HTTP/1.1), and per-request metrics for the gateway. L4 would force one LB per path or a weird port scheme, and I'd lose per-request observability. The CPU cost is real but marginal at our QPS. If this were a raw TCP protocol like a game server, I'd go L4."
Why it wins: Names the concrete features L7 unlocks (path routing, TLS, per-request metrics), costs them out, and names the condition that flips the decision.
When it comes up
- The moment the interviewer draws a client and asks how it reaches your services
- When you introduce a WebSocket tier, real-time layer, or multi-service edge
- When "how do we scale the stateless tier?" appears in deep-dive
- When health checks, failover, or sticky sessions get questioned
- When the topic pivots to multi-region — LB choice changes above and below the region
Order of reveal
1. Name the split upfront. "Two LB tiers: an L7 at the public edge for HTTP/gRPC traffic, and either L4 or a service-mesh sidecar for intra-cluster. Different layers, different jobs."
2. Justify L7 at the edge with features, not adjectives. "L7 because I need path-based routing, TLS termination, per-route rate limits, and per-request observability. L4 can't do any of that."
3. Pick the algorithm on evidence. "Round-robin is fine for stateless short-lived requests. For the WebSocket gateway I'd use least-connections because connections are long-lived and round-robin creates pile-ups."
4. Specify health checks with numbers. "HTTP GET /health every 2 s, 2 consecutive failures drains. Detection to drain ≈ 5 s. Requests in flight finish; new ones route elsewhere."
5. Address sticky sessions before asked. "No sticky sessions. Session state lives in Redis so any instance can serve any request. Sticky sessions are a crutch that creates cohort-level outages."
6. Call out the LB itself as HA. "Managed ALB/NLB is already HA across AZs. If we were self-hosting, active-active behind anycast."
7. Bridge to global when multi-region is needed. "Global traffic management via Route53 latency routing or anycast. Local L7 in each region."
Signature phrases
- “L4 is dumb and fast; L7 is smart and expensive” — Memorable framing that forces a conscious choice.
- “L7 at the edge, L4 or mesh between services” — Standard senior-engineer topology.
- “Round-robin for short requests, least-connections for long-lived” — Catches the WebSocket pile-up trap.
- “Health checks: endpoint, cadence, threshold — always cite all three” — Prevents the hand-wave.
- “Sticky sessions are a crutch, not a feature” — Takes a position instead of listing pros/cons.
- “The LB is not magic HA” — Forces you to name redundancy for the LB itself.
Likely follow-ups
“Walk me through what the LB does when a backend dies mid-request.”
Depends on the LB layer:
L4 (NLB / HAProxy TCP mode):
- The TCP connection to the dead backend errors.
- Client sees a connection-reset or a response abort.
- Client must retry — the LB can't retry on behalf of the client because it didn't parse the request.
- For idempotent requests this is safe; for non-idempotent (POST without idempotency key) it's a double-submit risk.
L7 (ALB / NGINX / Envoy):
- The LB sees the upstream connection reset or 5xx and can retry within the same client request to a different backend, but only if:
  1. The request is idempotent (GET, or explicitly marked retriable), OR
  2. The request carries an idempotency key that the backend honours.
- The retry budget is bounded (usually 1-3 attempts, capped at the request deadline).
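The L7 retry rule above can be sketched as a single predicate. This is an illustrative sketch, not any proxy's actual policy engine: the `Request` type, the `may_retry` name, and the `Idempotency-Key` header name are assumptions, while the idempotent method set follows HTTP semantics.

```python
from dataclasses import dataclass, field
from typing import Dict

# Methods HTTP defines as idempotent (safe to replay against another backend).
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"}


@dataclass
class Request:
    method: str
    headers: Dict[str, str] = field(default_factory=dict)


def may_retry(request: Request, attempts_made: int,
              max_attempts: int = 3, deadline_remaining_s: float = 1.0) -> bool:
    """Return True if the LB may replay this request on a different backend."""
    # Retry budget: bounded attempts, and never past the request deadline.
    if attempts_made >= max_attempts or deadline_remaining_s <= 0:
        return False
    if request.method.upper() in IDEMPOTENT_METHODS:
        return True
    # Non-idempotent methods need a key the backend deduplicates on.
    # The header name here is an assumed convention.
    return "idempotency-key" in {name.lower() for name in request.headers}


get_ok = may_retry(Request("GET"), attempts_made=1)                        # True
post_plain = may_retry(Request("POST"), attempts_made=1)                   # False
post_keyed = may_retry(Request("POST", {"Idempotency-Key": "abc"}), 1)     # True
budget_spent = may_retry(Request("GET"), attempts_made=3)                  # False
```

The key only makes a retry safe if the backend actually deduplicates on it, which is why it belongs in the API contract rather than in the LB alone.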
Health-check interaction:
- The LB marks the backend unhealthy after N failed health checks (typically 2 × 2 s = 4 s).
- During that window, some requests will hit the dead backend before it's drained. This is why retries matter.
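- For a 5 s detection window at 10K QPS, about 50K requests arrive in total, and the dead backend's share of them (roughly 50K / N for N evenly loaded backends, so ~5K with 10 backends) can fail before it's removed.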
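A rough way to size that exposure, assuming traffic is spread evenly across backends and ignoring the probe timeout (both simplifications; the function name is made up for this sketch):

```python
def requests_to_dead_backend(total_qps: float, n_backends: int,
                             check_interval_s: float, unhealthy_threshold: int) -> float:
    """Estimate requests routed to a dead backend before it is marked unhealthy."""
    detection_window_s = check_interval_s * unhealthy_threshold
    per_backend_qps = total_qps / n_backends  # only this share is exposed
    return detection_window_s * per_backend_qps


# 10K QPS over 10 backends, probes every 2 s, 2 failures to drain:
exposed = requests_to_dead_backend(10_000, 10, 2.0, 2)   # 4000.0
```

That number is the population the retry policy has to absorb, so a tighter interval or a smaller threshold shrinks it at the cost of more probe traffic and more false positives.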
Graceful shutdown:
- A backend being deployed should fail health checks before it stops accepting connections. That lets the LB drain traffic before the process exits. Typical pattern: /health starts returning 503, wait 2× health-check interval, then shut down.
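The shutdown ordering can be sketched as follows. This is a minimal model under assumptions: `DrainableBackend` is a hypothetical class, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time
from typing import Callable


class DrainableBackend:
    def __init__(self) -> None:
        self.draining = False
        self.accepting = True

    def health_status(self) -> int:
        """What /health would return: 503 once draining has started."""
        return 503 if self.draining else 200

    def graceful_shutdown(self, check_interval_s: float,
                          sleep: Callable[[float], None] = time.sleep) -> None:
        # 1. Fail health checks first so the LB stops sending new requests.
        self.draining = True
        # 2. Wait two check intervals so the LB has time to notice and drain.
        sleep(2 * check_interval_s)
        # 3. Only now stop accepting connections and let the process exit.
        self.accepting = False
```

The important property is the order: the backend still accepts connections during the wait, so requests that arrive before the LB reacts are served rather than reset.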
“You're routing to a 10-shard cache cluster. Should the LB use consistent hashing? What's the algorithm tradeoff?”
Yes — consistent hashing keyed on the cache key (not the user IP). Here's why each alternative fails:
Round-robin: every request has a 1-in-10 chance of hitting the shard that owns the key. The other 9 times, either you do a cross-shard lookup (wasted) or you miss. Cache warmth is destroyed.
Least-connections: doesn't help — even distribution of requests doesn't match the key-to-shard mapping the cache needs.
Consistent hashing on key: the LB deterministically routes a key to its owning shard, preserving cache warmth. When you add or remove a shard, only ~1/N of keys remap (the ones whose arc now crosses a new node).
Virtual nodes (128-256 per physical shard): mandatory to avoid uneven load when the ring is small. Without them, 3 shards might split 60/20/20 by chance.
Edge case (hot keys): consistent hashing routes a hot key to one shard deterministically. That shard can still be pinned at 100% CPU while others idle. Consistent hashing fixes structural imbalance (uneven arc sizes), not workload imbalance. For hot keys you still need replication, L1 cache, or key salting. See the Consistent hashing lesson for the full treatment.
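Key salting, one of the hot-key mitigations named above, can be sketched like this. For brevity the example maps keys to shards with a plain hash-modulo rather than a ring; the key name, salt count, and helper names are all assumptions.

```python
import hashlib
import random


def shard_for(key: str, n_shards: int) -> int:
    """Deterministic key-to-shard mapping (hash modulo, for illustration only)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def salted_key(key: str, n_salts: int, rng: random.Random) -> str:
    """Append a random salt so one hot key becomes n_salts distinct keys."""
    return f"{key}#{rng.randrange(n_salts)}"


rng = random.Random(42)
hot = "user:12345:feed"  # hypothetical hot key

# Without salting, every request for the hot key lands on the same shard.
unsalted_shards = {shard_for(hot, 10) for _ in range(1000)}

# With 8 salts, the same traffic spreads over the shards owning the salted keys.
salted_shards = {shard_for(salted_key(hot, 8, rng), 10) for _ in range(1000)}
```

The cost is on the write path: every salted copy has to be written or invalidated, so salting suits read-heavy hot keys.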
“Interviewer: "we're at 100K QPS and the L7 LB is the bottleneck". What do you do?”
Three main levers in order of effort, plus a last-resort fourth:
1. Scale out the L7 tier (easiest). Managed L7 LBs (ALB, Cloud Load Balancing) autoscale transparently. Self-hosted NGINX/Envoy — add instances behind an L4 front. Most L7 proxies push 20-50K RPS per core; 100K QPS on 4-8 cores is a starting point, not a ceiling.
2. Offload TLS to the edge/CDN. TLS handshakes dominate L7 CPU. Terminate at the CDN (Cloudflare, CloudFront, Fastly) so the LB receives plaintext HTTP. Typical CPU savings: 40-60% on the LB tier. Keep connections backend-side on HTTP/2 for reuse.
3. Bypass L7 for internal traffic. Service-to-service calls don't need path routing — they know where they're going. Use a service mesh sidecar (Envoy, Linkerd) that does L4 + mTLS + client-side load balancing. The L7 LB stays out of the internal hot path entirely.
4. (If truly needed) Tune the L7 proxy itself. HTTP/2 multiplexing, connection keep-alive, compiled filters only, disable per-request logging (sample instead). Usually unnecessary if 1-3 worked.
Watch out for: treating "add CPU to the LB" as the first answer. Architectural moves (2 and 3) give orders of magnitude more headroom than vertical scaling.
“Global traffic management: how does geo-DNS actually work, and when is anycast better?”
Geo-DNS (Route53 latency routing, Cloudflare GeoSteering):
- Client's DNS resolver queries the authoritative server.
- Authoritative server looks at the resolver's IP (NOT the client's) and returns the IP of the nearest region's LB.
- Works well when most users' resolvers are geographically close to them (typical consumer internet).
- Fails when users use a remote DNS resolver (VPN, public DNS farther away than the user).
- TTL tradeoff: short TTL (30-60 s) for fast failover, long TTL (5 min) for cache-friendliness. 60 s is a common default.
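The resolution step can be sketched as a lookup keyed on the resolver, not the client. Everything in this example is hypothetical: the latency table, the resolver and region names, and the IPs (taken from a documentation-only address range).

```python
from typing import Dict, Set

# Hypothetical measured latency (ms) from each resolver location to each region.
LATENCY_MS: Dict[str, Dict[str, int]] = {
    "resolver-frankfurt": {"eu-west": 15, "us-east": 95, "ap-south": 130},
    "resolver-virginia": {"eu-west": 85, "us-east": 8, "ap-south": 210},
}

# Hypothetical regional LB addresses.
REGION_LB_IP: Dict[str, str] = {
    "eu-west": "203.0.113.10",
    "us-east": "203.0.113.20",
    "ap-south": "203.0.113.30",
}


def resolve(resolver: str, healthy_regions: Set[str]) -> str:
    """Answer a DNS query with the lowest-latency healthy region's LB IP."""
    candidates = {region: ms for region, ms in LATENCY_MS[resolver].items()
                  if region in healthy_regions}
    best = min(candidates, key=lambda region: candidates[region])
    return REGION_LB_IP[best]


all_regions = set(REGION_LB_IP)
normal = resolve("resolver-frankfurt", all_regions)                  # 203.0.113.10
failover = resolve("resolver-frankfurt", all_regions - {"eu-west"})  # 203.0.113.20
```

A client in Frankfurt using a resolver in Virginia would be answered as if it were in Virginia, which is the failure mode listed above. The failover answer also only reaches clients once their cached record expires, which is where the TTL tradeoff bites.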
Anycast (Cloudflare, Google's frontend, AWS Global Accelerator):
- Same IP address advertised from multiple physical PoPs.
- BGP routing delivers the client's packet to the topologically-closest PoP automatically.
- No DNS trickery — the IP resolution returns the same answer globally.
- Better for latency-sensitive TCP/UDP traffic (initial connection setup, DNS itself, DDoS mitigation).
- Requires owning IP space that can be advertised, typically a cloud-managed feature.
When to use which:
- Geo-DNS for the long tail of services — cheap, works everywhere, TTL-based failover.
- Anycast for latency-critical global services (DNS, CDN, API edge) and for DDoS resilience (attacks spread across PoPs instead of concentrating).
- Both in practice — anycast for TCP connection setup, geo-DNS inside a region to pick a specific tier.
Anycast IP delivers the TCP handshake to the closest PoP. Geo-DNS steers HTTPS to the regional L7. Inside each region a local L7 balances across service instances.
“Sticky sessions: when is it actually the right answer?”
Rarely, but here are the legitimate cases:
1. WebSocket / long-lived TCP. The session IS the connection. You can't "share" a live WebSocket across backends — the client is bound to one. This isn't really "sticky sessions" in the web sense; it's the natural state of a long-lived connection.
2. Legacy apps that hold per-user state in memory. The backend stores user state locally and can't/won't externalise it. Sticky is a compatibility crutch. The correct long-term fix is to move state to a shared store (Redis, session DB). Sticky is a bandage while you migrate.
3. Performance optimization for consistent-cache affinity. You want a user's requests to hit the same backend so in-memory caches stay warm. Cheaper than a shared cache for small-scale or bursty workloads. Break-glass technique; doesn't scale.
Why it's usually wrong:
- One slow backend now serves its cohort of users until they expire or reconnect — cohort-scale outage.
- Load drifts uneven: users stay pinned to whichever backend happened to be lightly loaded when the cookie was set, however load shifts afterwards.
- Deploys are harder — draining traffic means forcing a cohort to re-auth elsewhere.
- Scale-down is painful — you can't remove a backend without terminating its sticky sessions.
The default answer: "Externalise session state. Any backend can serve any request." Sticky is a named exception, not a default.
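A toy sketch of the default answer: session state lives in a shared store, so two different instances can serve consecutive requests from the same session. The in-memory `SessionStore` here is a stand-in for Redis or a session DB, and all names are invented for the example.

```python
from typing import Dict, List


class SessionStore:
    """In-memory stand-in for a shared store such as Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, List[str]]] = {}

    def get(self, session_id: str) -> Dict[str, List[str]]:
        return self._data.get(session_id, {})

    def put(self, session_id: str, session: Dict[str, List[str]]) -> None:
        self._data[session_id] = dict(session)


class AppInstance:
    """A stateless backend: it keeps no per-user state between requests."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> List[str]:
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {"cart": cart})
        return cart


store = SessionStore()
a, b = AppInstance("app-a", store), AppInstance("app-b", store)

first = a.add_to_cart("sess-1", "book")   # served by app-a
second = b.add_to_cart("sess-1", "pen")   # served by app-b, sees app-a's write
```

Because neither instance owns the session, the LB is free to use any algorithm, and draining or removing an instance costs users nothing.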
Code examples
# NGINX: L7 edge in front of the WebSocket gateway
upstream ws_gateway {
    least_conn;   # long-lived WS connections → not round-robin
    keepalive 64;
    server 10.0.1.11:8080 max_fails=2 fail_timeout=10s;
    server 10.0.1.12:8080 max_fails=2 fail_timeout=10s;
    server 10.0.1.13:8080 max_fails=2 fail_timeout=10s;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;

    location /ws {
        proxy_pass http://ws_gateway;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;   # WS idle tolerance
        proxy_next_upstream error timeout http_502 http_503;
    }

    # /health on origin returns 503 during graceful shutdown so LB drains
    # before the process exits. Shutdown script: flip flag → sleep 2×check → stop.
}

# Envoy: consistent-hash routing to the cache shards
clusters:
- name: cache_shards
  type: STRICT_DNS
  connect_timeout: 0.25s
  lb_policy: RING_HASH   # consistent hashing with virtual nodes
  ring_hash_lb_config:
    minimum_ring_size: 1024   # virtual nodes; avoids small-ring skew
  health_checks:
  - timeout: 1s
    interval: 2s
    unhealthy_threshold: 2
    healthy_threshold: 2
    http_health_check: { path: "/health" }
  load_assignment:
    cluster_name: cache_shards
    endpoints:
    - lb_endpoints:
      - endpoint: { address: { socket_address: { address: cache-0, port_value: 6379 }}}
      - endpoint: { address: { socket_address: { address: cache-1, port_value: 6379 }}}
      - endpoint: { address: { socket_address: { address: cache-2, port_value: 6379 }}}

# Route uses the cache key as the hash input: deterministic per-key affinity.
route:
  cluster: cache_shards
  hash_policy:
  - header: { header_name: "x-cache-key" }

Common mistakes
Sticky sessions (cookie or IP hash) mean one slow backend now serves a cohort of users forever. They're a crutch for stateful web tiers. Prefer: externalise session state to a store, so any backend can serve any request.
Saying "it does health checks" is not enough. State the endpoint (/health), the cadence (1–5s), the failure threshold (2–3 fails), and what "unhealthy" drains to (another AZ, a circuit breaker). Without specifics, your failover budget is undefined.
LB probes /health every 2s. Two consecutive failures flip the backend to "unhealthy" and drain it from the pool. In-flight requests finish; new ones route elsewhere.
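The threshold logic behind that flip can be sketched as a small state machine. This is an illustration, not a specific LB's implementation; the `HealthTracker` name is made up, and the two thresholds are assumed to default to 2.

```python
class HealthTracker:
    """Tracks one backend's health from a stream of probe results."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 2) -> None:
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that disagree with the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the backend is in the pool."""
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state, reset the streak
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok  # flip state only after enough consecutive results
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
probes = [True, False, True, False, False, True, True]
states = [tracker.record(ok) for ok in probes]
# states -> [True, True, True, True, False, False, True]
```

A single failed probe (the second one here) does not drain the backend; only two in a row do. The same hysteresis applies on the way back in, which stops a flapping backend from bouncing in and out of the pool.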
Round-robin distributes requests equally, not load. With WebSocket or DB-pool-style long-lived connections, you need least-connections — otherwise pile-ups are guaranteed.
A single LB instance (one EC2 instance / one VM) is itself a single point of failure. Real deployments have LB redundancy: active-active pairs behind anycast, or a managed LB (ALB/NLB) which is already HA. Call this out.
Practice drills
Your interviewer asks "what happens when a backend dies mid-request?". What's the answer?
At L4: the TCP connection errors; client retries and hits a different backend on the next connection. At L7: the LB sees the upstream 5xx (or connect-fail) and can retry within the same request to a different backend — but only if the request is idempotent. State this: "L7 LB retries idempotent GETs automatically; for POST we rely on client-side retry with an idempotency key, which is designed at the API contract layer."
You're routing to a sharded cache. Round-robin or consistent-hash?
Consistent-hash keyed on the cache key. Round-robin means every request has a 1-in-N chance of hitting the shard that owns the key; the rest is wasted lookups. With consistent-hash, the LB deterministically picks the right shard, preserving cache warmth and reducing cross-shard traffic. Use virtual nodes (128–256 per physical node) to even out each node's share of the ring; they don't fix hot keys, which still need replication or key salting.
Interviewer: "we're at 100K QPS and the L7 LB is the bottleneck." What do you do?
Three levers in order: (1) scale out the L7 tier — most managed L7 LBs are already HA and scale horizontally; (2) push TLS termination to the edge/CDN so the LB sees plaintext HTTP (big CPU saving); (3) for internal traffic, bypass L7 entirely and use a service mesh sidecar (L4 + mTLS, routing done at the client) so the LB isn't in the hot path.
Cheat sheet
- L4 = transport (TCP/UDP). L7 = application (HTTP/gRPC).
- Default web edge: L7. Default service-to-service: L4 or mesh sidecar.
- Algorithms: round-robin (stateless short requests), least-connections (long-lived), consistent-hash (sharded/stateful).
- Health check: endpoint + cadence + fail threshold. Always cite all three.
- Global: GeoDNS / anycast. Local: L4 or L7 inside one region.
- Sticky sessions: avoid where possible; externalise session state instead.
- The LB is not magic HA: deploy it active-active or use a managed HA LB.