⚠️ This guide is AI-generated and may contain inaccuracies. Always verify against authoritative sources and real-world documentation.

Architecture Diagram

[Architecture diagram: Web/mobile clients connect over HTTPS to a load balancer (Nginx/HAProxy or AWS ALB/NLB, operating at L4 TCP or L7 HTTP); the LB runs health checks against App Servers 1–3, with servers 1 and 2 healthy and server 3 marked unhealthy and out of rotation.]

How It Works

A load balancer sits between clients and a pool of backend servers. Every incoming request is routed to one of the servers based on a chosen algorithm. The load balancer continuously monitors server health and removes unhealthy instances from the rotation.
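The pattern above (a pool, a picking rule, and health-based eviction) can be sketched as a toy in Python. This is an illustrative skeleton with invented names, not any real LB's implementation:

```python
import itertools

class LoadBalancer:
    """Toy load balancer: round-robins over a pool, skipping unhealthy backends."""

    def __init__(self, backends):
        self.backends = list(backends)            # e.g. ["app1:8080", "app2:8080"]
        self.healthy = set(self.backends)         # updated by health checks
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        """Health checker calls this after N consecutive probe failures."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Health checker calls this once the backend passes again."""
        self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend in rotation."""
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")
```

A real LB would do this per-connection or per-request in its event loop; the point is that routing policy and health state are kept separate from the application servers themselves.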

Layer 4 (Transport) vs Layer 7 (Application)

L4 Load Balancing

  • Operates on TCP/UDP; routes on source/destination IP + port.
  • Very fast — inspects only packet headers, never the payload.
  • Can't make content-based routing decisions.
  • Examples: AWS NLB, HAProxy (TCP mode).

L7 Load Balancing

  • Operates on HTTP/HTTPS; can route based on URL path, headers, cookies.
  • Enables sticky sessions, A/B testing, canary deployments.
  • Examples: AWS ALB, Nginx, Envoy.

Balancing Algorithms

  1. Round Robin — Requests are distributed sequentially across servers. Simple, stateless. Works well when servers are homogeneous.
  2. Weighted Round Robin — Assign weights to servers based on capacity. A 4-core server gets 2× the traffic of a 2-core.
  3. Least Connections — Route to the server with the fewest active connections. Best for long-lived connections (WebSocket, DB).
  4. Least Response Time — Route to the server with the lowest average response time + fewest connections.
  5. IP Hash — Hash client IP → server. Ensures same client hits same server (poor man's sticky sessions).
  6. Random — Surprisingly effective at scale. With the "power of two choices," pick 2 random servers and choose the less loaded one.
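Two of the algorithms above fit in a few lines of Python. These are minimal sketches with made-up function names, not production code:

```python
import hashlib
import random

def ip_hash(client_ip, servers):
    """IP Hash: deterministically map a client IP to one server.
    Same client always lands on the same server (poor man's stickiness)."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

def power_of_two_choices(servers, active_conns, rng=random):
    """Power of two choices: sample two random servers, send the
    request to whichever currently has fewer active connections."""
    a, b = rng.sample(servers, 2)
    return a if active_conns[a] <= active_conns[b] else b
```

Note the tradeoff: `ip_hash` gives stickiness but can skew load if a few IPs (e.g. a corporate NAT) dominate traffic, while power-of-two-choices gives near-optimal balance with only two load lookups per request.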

Health Checks

The LB periodically probes each server (TCP connect, HTTP GET /health, or gRPC health check). If a server fails N consecutive checks, it's removed from the pool; once it passes again, it's re-added. This automatic eviction and re-admission is what lets a load balancer deliver high availability.

Key Design Decisions

L4 vs L7: L4 is faster (no packet inspection, ~μs overhead) but blind to HTTP semantics. L7 adds latency (~1ms) but enables content routing, SSL termination, and request transformation. Most web apps need L7.


Sticky sessions vs Stateless: Sticky sessions (via cookies) pin a user to one server — simpler app code but kills even distribution and complicates scaling. Better approach: externalize state to Redis/DB and go fully stateless.
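The stateless approach can be illustrated with a tiny sketch: session state lives in a shared store (a dict here stands in for Redis; all names are hypothetical), so any backend can serve any request:

```python
class SessionStore:
    """Stand-in for an external store like Redis: every app server
    reads and writes sessions here instead of keeping them in memory."""

    def __init__(self):
        self._data = {}   # session_id -> session dict

    def save(self, session_id, session):
        self._data[session_id] = session

    def load(self, session_id):
        return self._data.get(session_id, {})

def handle_request(server_name, session_id, store):
    """Any server can increment the same counter, because the
    state lives in the store rather than on the server."""
    session = store.load(session_id)
    session["hits"] = session.get("hits", 0) + 1
    store.save(session_id, session)
    return session["hits"]
```

Because no server holds state, the LB is free to use any algorithm, and adding or removing servers never strands user sessions.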


Single LB vs Multiple: A single LB is a SPOF. Use active-passive or active-active pairs with a floating IP (VRRP/keepalived) or DNS failover. Cloud LBs (ALB, GCP LB) handle this for you.


SSL termination at LB vs passthrough: Terminating SSL at the LB simplifies cert management and offloads crypto from app servers. But traffic between LB and backend is unencrypted unless you add mutual TLS (mTLS).

When to Use

If an interviewer asks you to design any scalable web service, load balancing is step one. Mention it early.

  • "Design a URL shortener" — LB in front of stateless redirect servers
  • "Design Twitter" — L7 LB routing /api vs /static to different services
  • "How would you handle 10× traffic?" — Add more servers behind the LB
  • "How do you achieve high availability?" — LB + health checks + auto-scaling

Interview signal: The interviewer wants to see you can separate traffic distribution from application logic and explain the tradeoffs of different algorithms.

Real-World Examples

  • Netflix — Uses Zuul (L7) and custom Eureka-based load balancing for microservices. Client-side LB with Ribbon.
  • Google — Maglev: custom L4 LB using consistent hashing, handles 10M+ packets/s per machine. Published at NSDI 2016.
  • GitHub — GLB Director: custom L4 LB using ECMP + consistent hashing, avoiding connection draining issues.
  • Cloudflare — Unimog: L4 LB using XDP/eBPF for line-rate packet processing across data centers.

Back-of-Envelope Numbers

  Metric                                Value
  Nginx max concurrent connections      ~10K–100K (event-driven)
  HAProxy throughput                    ~2M HTTP req/s (modern hardware)
  AWS ALB latency overhead              ~1–5 ms
  AWS NLB latency overhead              ~100 μs
  Health check interval (typical)       5–30 seconds
  Failover detection time               15–90 seconds (3 consecutive failures)
  Google Maglev throughput              ~10M packets/s per machine