intermediate · scalability

Load balancing & traffic routing

L4 vs L7, session affinity, health checks, global routing.

L4 vs L7 is not a trivia question — it's about whether the LB can make decisions based on the request content. One is dumb and fast; the other is smart and expensive. Most prompts want L7 at the edge and L4 between services.

Read this if your last attempt…

You said "we'll use a load balancer" without specifying L4 or L7
You proposed sticky sessions without naming the consistency cost
You couldn't say how the LB detects a dead backend

The concept

A load balancer has two jobs: distribute traffic and remove dead backends. Everything else (TLS termination, path routing, rate limiting) is a feature layered on top.

The core split is L4 vs L7:

Architecture diagram · L4 vs L7 load balancers

L4 routes by IP/port at connection time. L7 parses HTTP/gRPC and routes by path, header, or cookie — smarter but more expensive per request.

Load-balancing algorithms — when each wins.

Algorithm	Good for	Avoid when
Round-robin	Uniform, stateless backends	Backends have different capacities or warm caches
Least-connections	Long-lived connections (WebSocket, DB pool)	Short-lived HTTP where the count is always ~0
Consistent hashing	Sticky routing to cache shards; stateful nodes	You want uniform load and backends are identical
Weighted	Canary / phased rollout of a new version	Baseline traffic; over-engineering for the common case
Power of two choices	Large fleets where full state is costly	Small fleets (<10) where tracking every backend is cheap and plain least-connections does at least as well
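To make the last row concrete, here is a minimal Python sketch of power of two choices: sample two backends at random and send the request to the one with fewer active requests. The function names and the simulation (requests that never complete) are illustrative assumptions, not a production balancer.

```python
import random


def pick_two_choices(active, rng):
    """Sample two distinct backends and return the index of the less-loaded one."""
    a, b = rng.sample(range(len(active)), 2)
    return a if active[a] <= active[b] else b


def simulate(n_backends, n_requests, seed=0):
    """Assign n_requests one at a time; requests never finish in this toy model."""
    rng = random.Random(seed)
    active = [0] * n_backends
    for _ in range(n_requests):
        active[pick_two_choices(active, rng)] += 1
    return active


loads = simulate(n_backends=10, n_requests=10_000)
print(loads, "spread:", max(loads) - min(loads))
```

The balancer only inspects two counters per request, which is why the approach suits large fleets where keeping a globally consistent view of every backend is expensive.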

How interviewers grade this

You name L4 vs L7 explicitly and justify the choice.
You state the health-check cadence and failure threshold (e.g. "2 fails in 5s → drained").
You name the algorithm (round-robin, least-connections, consistent-hash) and when each wins.
You distinguish global (DNS/anycast) from local (in-region) load balancing.
If you propose sticky sessions, you name the cost (uneven load, failure-recovery pain).

Variants

L4 at the edge

Transport-layer LB; forwards TCP/UDP without parsing.

The right choice for non-HTTP workloads (game servers, TCP-based RPC, WebRTC signalling). Cheap, fast, and supports any protocol — but you lose path-based routing and per-request observability.

Pros

+Very low latency overhead
+Protocol-agnostic (TCP/UDP/QUIC)
+Simple operational model

Cons

−No path/header routing
−No TLS termination at the LB
−Limited observability (no request logs)

Choose this variant when

Non-HTTP workload
Lowest-latency requirement
WebSocket or long-lived TCP

L7 at the edge

Application-aware LB; parses HTTP/gRPC and routes by content.

Default for anything web-facing. TLS termination, path-based routing (/api/* → api-tier, /static/* → CDN), header-based A/B, per-route rate limits — all unlocked at this layer.

Pros

+Path/header/cookie routing
+TLS termination + HTTP/2/3 features
+Rich observability (per-request logs)

Cons

−More CPU per request
−Bound to HTTP family
−More attack surface (request parsing)

Choose this variant when

HTTP / gRPC traffic
Multiple services behind one hostname
Need TLS termination + request-level visibility

Consistent-hash LB

Hash(request_key) → shard; minimal reshuffling on node changes.

Use when stateful nodes (cache shards, session stores) must receive the same keys repeatedly. Virtual nodes prevent hotspots. Rebalance cost on node add/remove is O(keys / shards) rather than O(keys).
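A minimal Python sketch of a hash ring with virtual nodes may help here. It assumes MD5 as the hash, 128 virtual nodes per shard, and made-up shard names (`cache-0` and so on); real load balancers use their own ring implementations.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=128):
        # Each physical node is placed on the ring `vnodes` times.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]


keys = [f"user:{i}" for i in range(10_000)]
before = HashRing(["cache-0", "cache-1", "cache-2"])
after = HashRing(["cache-0", "cache-1", "cache-2", "cache-3"])
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
moved_fraction = len(moved) / len(keys)
print(f"{moved_fraction:.1%} of keys moved after adding a fourth shard")
```

Only keys whose arc is taken over by the new shard move, so the moved fraction should land near one quarter here, whereas a plain `hash(key) % N` scheme would remap most keys when N changes.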

Pros

+Stable key-to-node mapping
+Preserves cache warmth on scaling
+Minimal reshuffle on membership change

Cons

−Uneven load if keys are skewed
−Operationally more complex (ring state)
−Failures can cascade if the hash re-routes to an overloaded peer

Choose this variant when

Sharded caches
Affinity to a stateful node
Keys are high-cardinality and well-distributed

Worked example

Scenario: You're designing a chat app and someone asks about load balancing.

Public edge (clients → our system):

L7 LB (ALB / Envoy) terminates TLS, routes by path: /ws → WebSocket gateway, /api → REST API, /media → object-storage proxy.
Health checks: HTTP GET /health every 2s, 2 failures → drain.
Algorithm: least-connections (WebSocket connections are long-lived; round-robin creates pile-ups).

Service-to-service (within the cluster):

L4 or service-mesh sidecar (Envoy). mTLS between services.
Algorithm: round-robin or least-connections; consistent hash only when calling a sharded component (e.g. the user-state service, which is shard-keyed by user_id).

Global (across regions):

Route53 latency-based DNS → closest region.
Anycast IP for WebSocket connection setup so clients land near-optimally.

Stated trade-off: "I'm using least-connections at the WS gateway, not round-robin, because connections are long-lived. If one gateway holds 10K connections and another holds 2K, round-robin keeps handing the overloaded one an equal share of new connections; least-connections steers new connections to the lighter gateway until they even out."
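The trade-off can be checked with a small Python sketch. It assumes two gateways starting at 10K and 2K connections, 4,000 new connections arriving, and no disconnects during the window; the function names are illustrative.

```python
from itertools import cycle


def add_round_robin(conns, new):
    """Hand out new connections in strict rotation, ignoring current load."""
    conns = list(conns)
    order = cycle(range(len(conns)))
    for _ in range(new):
        conns[next(order)] += 1
    return conns


def add_least_connections(conns, new):
    """Send each new connection to the gateway with the fewest open connections."""
    conns = list(conns)
    for _ in range(new):
        target = min(range(len(conns)), key=conns.__getitem__)
        conns[target] += 1
    return conns


start = [10_000, 2_000]
print(add_round_robin(start, 4_000))        # [12000, 4000]
print(add_least_connections(start, 4_000))  # [10000, 6000]
```

Round-robin preserves the 8K gap while both gateways grow; least-connections sends every new connection to the lighter gateway until the counts converge.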

Good vs bad answer

Interviewer probe

“Why L7 instead of L4 at the edge?”

Weak answer

"Because L7 is smarter. It's what everyone uses."

Strong answer

"Because I need path-based routing (/api vs /ws vs /static route to different tiers), TLS termination (clients are on HTTP/2, backends on HTTP/1.1), and per-request metrics for the gateway. L4 would force one LB per path or a weird port scheme, and I'd lose per-request observability. The CPU cost is real but marginal at our QPS. If this were a raw TCP protocol like a game server, I'd go L4."

Why it wins: Names the concrete features L7 unlocks (path routing, TLS, per-request metrics), costs them out, and names the condition that flips the decision.

Interview playbook

1-2 minutes in HLD, plus ~1 minute in deep-dive when health checks, failure, or algorithm choice comes up.

When it comes up

The moment the interviewer draws a client and asks how it reaches your services
When you introduce a WebSocket tier, real-time layer, or multi-service edge
When "how do we scale the stateless tier?" appears in deep-dive
When health checks, failover, or sticky sessions get questioned
When the topic pivots to multi-region — LB choice changes above and below the region

Order of reveal

1. Name the split upfront. "Two LB tiers: an L7 at the public edge for HTTP/gRPC traffic, and either L4 or a service-mesh sidecar for intra-cluster. Different layers, different jobs."
2. Justify L7 at the edge with features, not adjectives. "L7 because I need path-based routing, TLS termination, per-route rate limits, and per-request observability. L4 can't do any of that."
3. Pick the algorithm on evidence. "Round-robin is fine for stateless short-lived requests. For the WebSocket gateway I'd use least-connections because connections are long-lived and round-robin creates pile-ups."
4. Specify health checks with numbers. "HTTP GET /health every 2 s, 2 consecutive failures drains. Detection to drain ≈ 5 s. Requests in flight finish; new ones route elsewhere."
5. Address sticky sessions before asked. "No sticky sessions. Session state lives in Redis so any instance can serve any request. Sticky sessions are a crutch that creates cohort-level outages."
6. Call out the LB itself as HA. "Managed ALB/NLB is already HA across AZs. If we were self-hosting, active-active behind anycast."
7. Bridge to global when multi-region is needed. "Global traffic management via Route53 latency routing or anycast. Local L7 in each region."

Signature phrases

“L4 is dumb and fast; L7 is smart and expensive” — Memorable framing that forces a conscious choice.
“L7 at the edge, L4 or mesh between services” — Standard senior-engineer topology.
“Round-robin for short requests, least-connections for long-lived” — Catches the WebSocket pile-up trap.
“Health checks: endpoint, cadence, threshold — always cite all three” — Prevents the hand-wave.
“Sticky sessions are a crutch, not a feature” — Takes a position instead of listing pros/cons.
“The LB is not magic HA” — Forces you to name redundancy for the LB itself.

Likely follow-ups

“Walk me through what the LB does when a backend dies mid-request.”

Depends on the LB layer:

L4 (NLB / HAProxy TCP mode):

The TCP connection to the dead backend errors.
Client sees a connection-reset or a response abort.
Client must retry — the LB can't retry on behalf of the client because it didn't parse the request.
For idempotent requests this is safe; for non-idempotent (POST without idempotency key) it's a double-submit risk.

L7 (ALB / NGINX / Envoy):

The LB sees the upstream connection reset or 5xx and can retry within the same client request to a different backend — but only if:

1. The request is idempotent (GET, or explicitly marked retriable), OR
2. The request carries an idempotency key that the backend honours.

The retry budget is bounded (usually 1-3 attempts, capped at the request deadline).
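The retry rule can be sketched in a few lines of Python. This is an illustration under assumptions: the `Request` type, the `send` callback, and the `Idempotency-Key` header name are hypothetical, the idempotent-method set follows HTTP semantics, and the request-deadline cap is left out for brevity.

```python
from dataclasses import dataclass, field

# Methods defined as idempotent by HTTP semantics (assumed policy for this sketch).
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}


@dataclass
class Request:
    method: str
    headers: dict = field(default_factory=dict)


def is_retriable(req):
    """Retry only idempotent methods or requests carrying an idempotency key."""
    return req.method in IDEMPOTENT_METHODS or "Idempotency-Key" in req.headers


def send_with_retry(req, backends, send, max_attempts=3):
    """Try backends in order; return (final_status, attempts_made)."""
    attempts = 0
    status = None
    for backend in backends[:max_attempts]:
        attempts += 1
        status = send(backend, req)  # hypothetical callback returning an HTTP status
        if status < 500 or not is_retriable(req):
            break  # success, client error, or not safe to replay
    return status, attempts
```

For example, with a `send` that returns 503 for the first backend and 200 for the second, a GET succeeds on the second attempt, while a POST without a key surfaces the 503 to the client after one attempt.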

Health-check interaction:

The LB marks the backend unhealthy after N failed health checks (typically 2 × 2 s = 4 s).
During that window, some requests will hit the dead backend before it's drained. This is why retries matter.
For a 5 s detection window at 10K QPS spread evenly over N backends, roughly 50K / N requests reach the dead backend before it's removed (about 5K with 10 backends).
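A back-of-the-envelope Python helper makes the exposure explicit. It assumes traffic is spread evenly across backends and that detection takes exactly interval × threshold; the function name and the 20-backend fleet are illustrative.

```python
def requests_hitting_dead_backend(qps, n_backends, interval_s, threshold):
    """Estimate requests routed to one dead backend before health checks drain it."""
    detection_window_s = interval_s * threshold   # e.g. 2 s x 2 failures = 4 s
    per_backend_qps = qps / n_backends            # assumes an even split
    return per_backend_qps * detection_window_s


# 10K QPS over 20 backends, probes every 2 s, drained after 2 failures.
print(requests_hitting_dead_backend(10_000, 20, 2, 2))  # 2000.0
```

Only the dead backend's share of traffic is exposed, so a larger fleet or a shorter detection window shrinks the number of requests that depend on retries.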

Graceful shutdown:

A backend being deployed should fail health checks before it stops accepting connections. That lets the LB drain traffic before the process exits. Typical pattern: /health starts returning 503, wait at least the unhealthy threshold × the health-check interval (2 × 2 s here) so the LB has marked the backend down, then shut down.
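The shutdown sequence can be sketched in Python as follows. The class, its attribute names, and the injectable `sleep` are assumptions made for illustration; a real service would wire `health_status` into its HTTP handler and the shutdown into its signal handling.

```python
import time


class Backend:
    def __init__(self, check_interval_s=2.0, unhealthy_threshold=2):
        self.draining = False
        self.running = True
        # Wait long enough for the LB to observe the required failed probes.
        self.drain_wait_s = check_interval_s * unhealthy_threshold

    def health_status(self):
        """Status code served on /health: 503 once draining has started."""
        return 503 if self.draining else 200

    def shutdown(self, sleep=time.sleep):
        self.draining = True        # 1. start failing health checks
        sleep(self.drain_wait_s)    # 2. keep serving while the LB drains us
        self.running = False        # 3. stop accepting connections


backend = Backend()
backend.shutdown(sleep=lambda seconds: print(f"draining for {seconds}s"))
```

The key property is ordering: the process keeps serving in-flight and straggler requests during the wait, and only stops after the LB has had time to remove it from the pool.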

“You're routing to a 10-shard cache cluster. Should the LB use consistent hashing? What's the algorithm tradeoff?”

Yes — consistent hashing keyed on the cache key (not the user IP). Here's why each alternative fails:

Round-robin: every request has a 1-in-10 chance of hitting the shard that owns the key. The other 9 times, either you do a cross-shard lookup (wasted) or you miss. Cache warmth is destroyed.

Least-connections: doesn't help — even distribution of requests doesn't match the key-to-shard mapping the cache needs.

Consistent hashing on key: the LB deterministically routes a key to its owning shard, preserving cache warmth. When you add or remove a shard, only ~1/N of keys remap (the ones whose arc now crosses a new node).

Virtual nodes (128-256 per physical shard): mandatory to avoid uneven load when the ring is small. Without them, 3 shards might split 60/20/20 by chance.

Edge case (hot keys): consistent hashing routes a hot key to one shard deterministically. That shard can still be pinned at 100% CPU while others idle. Consistent hashing fixes structural imbalance (how keys are spread across shards), not workload imbalance (how traffic is spread across keys). For hot keys you still need replication, L1 cache, or key salting. See the Consistent hashing lesson for the full treatment.

“Interviewer: "we're at 100K QPS and the L7 LB is the bottleneck". What do you do?”

Four levers, in order of effort:

1. Scale out the L7 tier (easiest). Managed L7 LBs (ALB, Cloud Load Balancing) autoscale transparently. Self-hosted NGINX/Envoy — add instances behind an L4 front. Most L7 proxies push 20-50K RPS per core; 100K QPS on 4-8 cores is a starting point, not a ceiling.

2. Offload TLS to the edge/CDN. TLS handshakes dominate L7 CPU. Terminate at the CDN (Cloudflare, CloudFront, Fastly) so the LB receives plaintext HTTP. Typical CPU savings: 40-60% on the LB tier. Keep connections backend-side on HTTP/2 for reuse.

3. Bypass L7 for internal traffic. Service-to-service calls don't need path routing — they know where they're going. Use a service mesh sidecar (Envoy, Linkerd) that does L4 + mTLS + client-side load balancing. The L7 LB stays out of the internal hot path entirely.

4. (If truly needed) Tune the L7 proxy itself. HTTP/2 multiplexing, connection keep-alive, compiled filters only, disable per-request logging (sample instead). Usually unnecessary if 1-3 worked.

Watch out for: treating "add CPU to the LB" as the first answer. Architectural moves (2 and 3) give orders of magnitude more headroom than vertical scaling.

“Global traffic management: how does geo-DNS actually work, and when is anycast better?”

Geo-DNS (Route53 latency routing, Cloudflare GeoSteering):

Client's DNS resolver queries the authoritative server.
Authoritative server looks at the resolver's IP (NOT the client's) and returns the IP of the nearest region's LB.
Works well when most users' resolvers are geographically close to them (typical consumer internet).
Fails when users use a remote DNS resolver (VPN, public DNS farther away than the user).
TTL tradeoff: short TTL (30-60 s) for fast failover, long TTL (5 min) for cache-friendliness. 60 s is a common default.

Anycast (Cloudflare, Google's frontend, AWS Global Accelerator):

Same IP address advertised from multiple physical PoPs.
BGP routing delivers the client's packet to the topologically-closest PoP automatically.
No DNS trickery — the IP resolution returns the same answer globally.
Better for latency-sensitive TCP/UDP traffic (initial connection setup, DNS itself, DDoS mitigation).
Requires owning IP space that can be advertised, typically a cloud-managed feature.

When to use which:

Geo-DNS for the long tail of services — cheap, works everywhere, TTL-based failover.
Anycast for latency-critical global services (DNS, CDN, API edge) and for DDoS resilience (attacks spread across PoPs instead of concentrating).
Both in practice — anycast for TCP connection setup, geo-DNS inside a region to pick a specific tier.

Architecture diagram · Global routing: geo-DNS + anycast + regional LBs

Anycast IP delivers the TCP handshake to the closest PoP. Geo-DNS steers HTTPS to the regional L7. Inside each region a local L7 balances across service instances.

“Sticky sessions: when is it actually the right answer?”

Rarely, but here are the legitimate cases:

1. WebSocket / long-lived TCP. The session IS the connection. You can't "share" a live WebSocket across backends — the client is bound to one. This isn't really "sticky sessions" in the web sense; it's the natural state of a long-lived connection.

2. Legacy apps that hold per-user state in memory. The backend stores user state locally and can't/won't externalise it. Sticky is a compatibility crutch. The correct long-term fix is to move state to a shared store (Redis, session DB). Sticky is a bandage while you migrate.

3. Performance optimization for consistent-cache affinity. You want a user's requests to hit the same backend so in-memory caches stay warm. Cheaper than a shared cache for small-scale or bursty workloads. Break-glass technique; doesn't scale.

Why it's usually wrong:

One slow backend now serves its cohort of users until they expire or reconnect — cohort-scale outage.
Load is naturally uneven: each user stays pinned to whichever backend was picked when the cookie was set, however load shifts afterwards.
Deploys are harder — draining traffic means forcing a cohort to re-auth elsewhere.
Scale-down is painful — you can't remove a backend without terminating its sticky sessions.

The default answer: "Externalise session state. Any backend can serve any request." Sticky is a named exception, not a default.
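As a rough Python illustration of the default, the sketch below uses an in-memory dictionary as a stand-in for a shared store such as Redis. The `SessionStore` and `Backend` classes and the shopping-cart payload are hypothetical; the point is only that no backend keeps session state of its own.

```python
class SessionStore:
    """In-memory stand-in for a shared session store (Redis, session DB)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class Backend:
    """Stateless app instance: all per-user state lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {"cart": cart})
        return cart


store = SessionStore()
a, b = Backend("app-a", store), Backend("app-b", store)
a.add_to_cart("s1", "apple")
print(b.add_to_cart("s1", "pear"))  # ['apple', 'pear'], served by a different backend
```

Because either instance can serve the next request, the LB is free to use round-robin or least-connections, and draining a backend loses no user state.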

Code examples

NGINX upstream: least-connections + passive health checks (max_fails / fail_timeout)

upstream ws_gateway {
    least_conn;                     # long-lived WS connections → not round-robin
    keepalive 64;

    server 10.0.1.11:8080 max_fails=2 fail_timeout=10s;
    server 10.0.1.12:8080 max_fails=2 fail_timeout=10s;
    server 10.0.1.13:8080 max_fails=2 fail_timeout=10s;
}

server {
    listen 443 ssl http2;
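    server_name api.example.com;

    # An ssl listener needs a certificate; these paths are placeholders.
    ssl_certificate     /etc/nginx/tls/api.example.com.crt;
    ssl_certificate_key /etc/nginx/tls/api.example.com.key;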

    location /ws {
        proxy_pass http://ws_gateway;
        proxy_http_version 1.1;
        proxy_set_header Upgrade    $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;          # WS idle tolerance
        proxy_next_upstream error timeout http_502 http_503;
    }

    # /health on origin returns 503 during graceful shutdown so LB drains
    # before the process exits. Shutdown script: flip flag → sleep 2×check → stop.
}

Envoy (YAML): consistent-hash routing for a sharded cache

clusters:
- name: cache_shards
  type: STRICT_DNS
  connect_timeout: 0.25s
  lb_policy: RING_HASH        # consistent hashing with virtual nodes
  ring_hash_lb_config:
    minimum_ring_size: 1024   # virtual nodes — avoids small-ring skew
  health_checks:
  - timeout: 1s
    interval: 2s
    unhealthy_threshold: 2
    healthy_threshold: 2
    http_health_check: { path: "/health" }
  load_assignment:
    cluster_name: cache_shards
    endpoints:
    - lb_endpoints:
      - endpoint: { address: { socket_address: { address: cache-0, port_value: 6379 }}}
      - endpoint: { address: { socket_address: { address: cache-1, port_value: 6379 }}}
      - endpoint: { address: { socket_address: { address: cache-2, port_value: 6379 }}}

# Route uses the cache key as the hash input — deterministic per-key affinity.
route:
  cluster: cache_shards
  hash_policy:
  - header: { header_name: "x-cache-key" }

Common mistakes

Sticky sessions as the first answer to "state at the edge"

Sticky sessions (cookie or IP hash) mean one slow backend now serves a cohort of users forever. They're a crutch for stateful web tiers. Prefer: externalise session state to a store, so any backend can serve any request.

No health-check detail

Saying "it does health checks" is not enough. State the endpoint (/health), the cadence (1–5s), the failure threshold (2–3 fails), and what "unhealthy" drains to (another AZ, a circuit breaker). Without specifics, your failover budget is undefined.

Architecture diagram · Health-check drain: how a dead backend is removed

LB probes /health every 2s. Two consecutive failures flip the backend to "unhealthy" and drain it from the pool. In-flight requests finish; new ones route elsewhere.
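The threshold logic described in the caption can be sketched as a small Python state machine. The class name and the default thresholds of 2 are assumptions for illustration; real load balancers add probe timeouts and jitter on top of this.

```python
class HealthTracker:
    """Flip a backend's state only after N consecutive contradicting probes."""

    def __init__(self, unhealthy_threshold=2, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that disagree with the current state

    def record(self, probe_ok):
        """Feed one probe result; return whether the backend is in the pool."""
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state, reset the count
            return self.healthy
        self._streak += 1
        limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= limit:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
probes = [True, False, True, False, False, True, True]
print([tracker.record(p) for p in probes])
# [True, True, True, True, False, False, True]
```

A single failed probe does not drain the backend, which avoids flapping on one dropped packet; two in a row do, and two successes in a row bring it back.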

Round-robin on long-lived connections

Round-robin distributes requests equally, not load. With WebSocket or DB-pool-style long-lived connections, you need least-connections — otherwise pile-ups are guaranteed.

Forgetting the LB itself is a single point of failure (Advanced)

A single LB instance is one EC2 / one VM. Real deployments have LB redundancy: active-active pairs behind anycast, or a managed LB (ALB/NLB) which is already HA. Call this out.

Practice drills

Your interviewer asks "what happens when a backend dies mid-request?". What's the answer?

At L4: the TCP connection errors; client retries and hits a different backend on the next connection. At L7: the LB sees the upstream 5xx (or connect-fail) and can retry within the same request to a different backend — but only if the request is idempotent. State this: "L7 LB retries idempotent GETs automatically; for POST we rely on client-side retry with an idempotency key, which is designed at the API contract layer."

You're routing to a sharded cache. Round-robin or consistent-hash?

Consistent-hash keyed on the cache key. Round-robin means every request has a 1-in-N chance of hitting the shard that owns the key; the rest is wasted lookups. With consistent-hash, the LB deterministically picks the right shard, preserving cache warmth and reducing cross-shard traffic. Use virtual nodes (128–256 per physical node) so each shard owns a similar share of the ring; skewed or hot keys still need replication or key salting.

Interviewer: "we're at 100K QPS and the L7 LB is the bottleneck." What do you do?

Three levers in order: (1) scale out the L7 tier — most managed L7 LBs are already HA and scale horizontally; (2) push TLS termination to the edge/CDN so the LB sees plaintext HTTP (big CPU saving); (3) for internal traffic, bypass L7 entirely and use a service mesh sidecar (L4 + mTLS, routing done at the client) so the LB isn't in the hot path.

Cheat sheet

•L4 = transport (TCP/UDP). L7 = application (HTTP/gRPC).
•Default web edge: L7. Default service-to-service: L4 or mesh sidecar.
•Algorithms: round-robin (stateless short requests), least-connections (long-lived), consistent-hash (sharded/stateful).
•Health check: endpoint + cadence + fail threshold. Always cite all three.
•Global: GeoDNS / anycast. Local: L4 or L7 inside one region.
•Sticky sessions: avoid where possible; externalise session state instead.
•The LB is not magic HA — deploy it active-active or use a managed HA LB.
