Load Balancing Concepts

A load balancer is a server that sits in front of several application servers and spreads incoming requests across them. Instead of one machine handling every request (and falling over when traffic spikes), the load balancer hands each request to a healthy backend server, chosen by a balancing algorithm. This is one of the most important tools in Site Reliability Engineering (SRE), the practice of keeping systems fast, available, and able to grow. Done well, load balancing gives you both scale (handle more users) and resilience (survive a dead server without users noticing).

Why you need a load balancer

When you start out, your app runs on a single server. That works until two things happen:

Traffic outgrows one machine. One server can only handle so many requests per second before it slows down or crashes.
That one machine becomes a single point of failure. A single point of failure is any one component that, if it dies, takes the whole system down with it. If your only server reboots, your site is offline.

A load balancer fixes both. You run two or more identical copies of your app (called backends or upstream servers), and the load balancer distributes traffic between them. If one dies, traffic flows to the survivors.

When to use this: Add a load balancer the moment you need more than one app server, or the moment downtime from a single server failure becomes unacceptable. For a hobby project on one box, it’s overkill — keep it simple.

Where a load balancer sits

The load balancer sits between your users (the public internet) and your private application servers. A typical flow looks like this:

Users → DNS → Load Balancer → [ app-server-1 ]
                            → [ app-server-2 ]
                            → [ app-server-3 ]

Only the load balancer is exposed to the internet. The app servers live on a private network and never receive traffic directly from the internet. This is good for security: attackers can only reach the load balancer, and you can lock the backends down with ufw (Uncomplicated Firewall, Ubuntu’s default firewall tool) so they only accept connections from the load balancer’s IP.

Layer 4 vs Layer 7

Load balancers operate at one of two levels of the network stack, named after the OSI model layers:

Layer 4 (L4) works at the transport layer — it deals with raw TCP/UDP connections. It sees IP addresses and ports, but not the actual content of the request. It cannot read a URL or an HTTP header. It’s extremely fast because it just forwards packets.
Layer 7 (L7) works at the application layer — it understands HTTP. It can read the URL path, headers, and cookies, and make smart routing decisions (send /api to one pool, /images to another). It does more work, so it’s slightly slower, but far more flexible.

Feature	Layer 4 (L4)	Layer 7 (L7)
Sees	IP, port, TCP/UDP	Full HTTP request, headers, cookies
Speed	Very fast	Fast, slightly more overhead
Route by URL/path	No	Yes
TLS/SSL termination	Usually no (passes through)	Yes (decrypts here)
Best for	Databases, raw TCP, max throughput	Web apps, APIs, microservices
Examples	AWS NLB, HAProxy (TCP mode)	Nginx, AWS ALB, HAProxy (HTTP mode)

When to use which: Use L7 for almost all web traffic — it lets you route by URL, terminate TLS (decrypt HTTPS at the load balancer so backends speak plain HTTP), and inspect requests. Use L4 when you need raw speed, are balancing non-HTTP protocols (like a database or game server), or want the backend to handle its own encryption.
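To make the L7 idea concrete, here is a minimal Python sketch of path-based routing. The pool names and addresses are hypothetical placeholders, and the longest-prefix rule is one reasonable choice, not how any particular product does it. An L4 balancer could not make this decision at all, because it never sees the request path.

```python
# Hypothetical pools; the addresses are placeholders for illustration only.
POOLS = {
    "/api": ["10.0.0.21:8080", "10.0.0.22:8080"],
    "/images": ["10.0.0.31:8080"],
}
DEFAULT_POOL = ["10.0.0.11:8080", "10.0.0.12:8080"]


def choose_pool(path: str) -> list[str]:
    """Pick a backend pool by longest matching path prefix (an L7 decision)."""
    best = ""
    for prefix in POOLS:
        # Match "/api" and "/api/...", but not "/apiary".
        if path == prefix or path.startswith(prefix + "/"):
            if len(prefix) > len(best):
                best = prefix
    return POOLS[best] if best else DEFAULT_POOL


print(choose_pool("/api/users"))        # the /api pool
print(choose_pool("/images/logo.png"))  # the /images pool
print(choose_pool("/about"))            # falls through to the default pool
```

Once a pool is chosen, a balancing algorithm still has to pick one backend inside it.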

Balancing algorithms

Once a request arrives, the load balancer must pick which backend gets it. The rule it uses is the algorithm.

Algorithm	How it works	When to use
Round-robin	Sends requests to each backend in turn, evenly	Default; backends are equal and requests are similar in cost
Weighted round-robin	Like round-robin but bigger servers get a larger share	Backends have different sizes/CPU
Least-connections	Sends to the backend with the fewest active connections	Requests vary a lot in duration (some slow, some fast)
IP hash	Always sends the same client IP to the same backend	You need a client to stick to one server without cookies

Round-robin is the simplest and a fine default. Least-connections is smarter when some requests take much longer than others, because it avoids piling new work onto a server that’s still busy with a slow request.
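The selection rules in the table can be sketched in a few lines of Python. This is an illustrative model only: the backend names and connection counts are made up, and the weighted version simply repeats heavier backends in the cycle, whereas production balancers typically interleave the picks more smoothly.

```python
import itertools


def round_robin(backends):
    """Yield backends in a repeating cycle: a, b, c, a, b, c, ..."""
    return itertools.cycle(backends)


def weighted_round_robin(weights):
    """Naive weighted round-robin: repeat each backend `weight` times per cycle."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active):
    """Return the backend with the fewest active connections.

    Ties go to whichever backend comes first in the dict.
    """
    return min(active, key=active.get)


rr = round_robin(["app-1", "app-2", "app-3"])
print([next(rr) for _ in range(6)])

# app-3 has weight 2, so it appears twice in every cycle of four picks.
wrr = weighted_round_robin({"app-1": 1, "app-2": 1, "app-3": 2})
print([next(wrr) for _ in range(8)])

# app-1 is busy with slow requests, so the new request goes to app-2.
print(least_connections({"app-1": 12, "app-2": 3, "app-3": 7}))
```

Note that round-robin needs no knowledge of the backends' state, while least-connections requires the balancer to track open connections per backend.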

Here is a real Nginx L7 config showing both an upstream pool and the algorithm:

upstream app_backend {
    least_conn;                 # use least-connections instead of round-robin
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    server 10.0.0.13:8080 weight=2;   # weighted: this box takes roughly twice the share of the others
}

server {
    listen 80;
    server_name example.com;

    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Health checks

A health check is a small request the load balancer sends to each backend on a schedule to ask “are you alive and ready?”. If a backend stops responding (or returns an error), the load balancer marks it unhealthy and stops sending it traffic. When it recovers, traffic resumes automatically. This is what makes a load balancer resilient — a dead backend is quietly removed instead of serving errors to users.

Most apps expose a tiny endpoint like /health that returns HTTP 200 when the app is fine. You can test it by hand:

curl -i http://10.0.0.11:8080/health

Output:

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 15

{"status":"ok"}

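Nginx open source has passive health checks built in: it watches real client traffic and stops using a backend after repeated failures. Active checks that probe an endpoint on a schedule require NGINX Plus or a different balancer such as HAProxy. You can tune the passive checks per server: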

upstream app_backend {
    server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
}

This says: after 3 failed requests within 30 seconds, take the server out of rotation for 30 seconds, then try it again.
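The bookkeeping behind those two parameters can be modelled roughly as follows. This is a simplified sketch of the idea, not Nginx's actual implementation; the class name and the explicit `now` timestamps are assumptions made so the example is deterministic.

```python
class PassiveHealth:
    """Track failures for one backend, loosely modelled on max_fails / fail_timeout."""

    def __init__(self, max_fails=3, fail_timeout=30.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.failures = []        # timestamps of recent failed attempts
        self.down_until = 0.0     # backend is skipped until this time

    def record_failure(self, now):
        # Keep only failures that fall inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)
        if len(self.failures) >= self.max_fails:
            # Too many failures in the window: take the backend out of rotation.
            self.down_until = now + self.fail_timeout
            self.failures.clear()

    def is_available(self, now):
        return now >= self.down_until


backend = PassiveHealth(max_fails=3, fail_timeout=30.0)
for t in (0.0, 5.0, 10.0):          # three failures within 30 seconds
    backend.record_failure(t)

print(backend.is_available(15.0))   # False: still inside the 30 s penalty
print(backend.is_available(40.0))   # True: penalty ended at t=40
```

Because passive checks only learn from real traffic, the first few users who hit a failing backend still see errors or retries before it is removed.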

Gotcha: A health check should test that the app can actually serve traffic, not just that the process is running. A good /health endpoint checks the database connection too — otherwise the load balancer keeps sending traffic to a server whose database is down.
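A readiness-style endpoint might look like the sketch below, written with Python's standard library only. The `database_is_reachable` function is a hypothetical stand-in (a real app might run `SELECT 1` against its connection pool), and the 503 response for a failed dependency is a common convention rather than something the load balancer mandates.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_is_reachable() -> bool:
    """Hypothetical dependency check; replace with a real query such as SELECT 1."""
    return True


def health_response(db_ok: bool):
    """Return (status_code, body_bytes) for the /health endpoint."""
    status = 200 if db_ok else 503
    payload = {"status": "ok" if db_ok else "unavailable"}
    return status, json.dumps(payload, separators=(",", ":")).encode()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(database_is_reachable())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080):
    """Start the server; call this from your own entry point."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()


print(health_response(True))    # healthy: 200 with {"status":"ok"}
print(health_response(False))   # database down: 503
```

With this in place, a backend whose database is down answers 503, so a balancer that probes /health can take it out of rotation even though the process itself is still running.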

Sticky sessions

Normally any backend can serve any request. But some apps store user session data in the memory of a single server (for example, a login session). If the next request goes to a different server, the user appears logged out. Sticky sessions (also called session affinity) tie a given user to one backend so their data is always found.

upstream app_backend {
    ip_hash;                 # same client address goes to the same backend (IPv4: keyed on the first three octets)
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

When to use this: Only when your app stores session state in server memory. The better long-term fix is to make your app stateless: store sessions in a shared store like Redis so any backend can handle any request. Sticky sessions skew the load (a busy group of clients can overload one server), and users pinned to a backend lose their sessions when it dies.
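The mechanics of hash-based affinity, and why it breaks when a backend dies, can be shown with a short Python sketch. The SHA-256 plus modulo scheme here is an assumption chosen for clarity; it is not the hash Nginx uses, and the client address is a documentation-range placeholder.

```python
import hashlib


def pick_backend(client_ip: str, backends: list[str]) -> str:
    """Map a client IP to a backend with a stable hash (simple modulo scheme)."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["10.0.0.11:8080", "10.0.0.12:8080"]

# The same client lands on the same backend every time.
first = pick_backend("203.0.113.7", backends)
second = pick_backend("203.0.113.7", backends)
print(first == second)   # True

# If that backend dies, the client is remapped and its in-memory session is gone.
survivors = [b for b in backends if b != first]
print(pick_backend("203.0.113.7", survivors))
```

The sketch uses `hashlib` rather than Python's built-in `hash()`, because the built-in string hash is randomized per process and would not give a stable mapping across restarts.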

Cloud load balancers

You don’t have to run the load balancer yourself. Cloud providers offer managed ones that scale automatically and need no patching:

AWS Application Load Balancer (ALB): L7, routes by path/host, terminates TLS.
AWS Network Load Balancer (NLB): L4, very high throughput for TCP/UDP.
Google Cloud Load Balancing (L4 and L7 options) and Azure Load Balancer (L4; Azure Application Gateway is the L7 counterpart): equivalents on other clouds.

A common pattern is a cloud L7 load balancer (ALB) in front, with Nginx on each server doing local L7 routing — combining managed scale with fine-grained control.

Best Practices

Run at least two backends behind every load balancer, in different availability zones where possible, so no single failure causes downtime.
Always configure health checks that test real readiness (including the database), not just “the port is open”.
Default to L7 for web traffic; reach for L4 only when you need raw speed or non-HTTP protocols.
Make your app stateless and avoid sticky sessions — store sessions in Redis so any backend can serve any user.
Terminate TLS at the load balancer so you manage certificates in one place, and lock backends down with ufw to accept traffic only from the load balancer.
Monitor per-backend metrics (latency, error rate, active connections) so you can spot one slow server before it drags down the pool.
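As an illustration of the last point, the following Python sketch summarizes per-backend latency and error rate from a request log and flags a slow outlier. The log samples, the function names, and the "twice the pool median" threshold are all assumptions for the example; in practice these numbers would come from your metrics system.

```python
from statistics import median

# Hypothetical samples: (backend, latency in ms, HTTP status code).
requests_log = [
    ("app-1", 42, 200), ("app-1", 38, 200), ("app-1", 45, 200),
    ("app-2", 40, 200), ("app-2", 44, 200), ("app-2", 41, 500),
    ("app-3", 310, 200), ("app-3", 295, 200), ("app-3", 330, 200),
]


def summarize(log):
    """Return {backend: {"median_ms": ..., "error_rate": ...}}."""
    by_backend = {}
    for backend, latency_ms, status in log:
        by_backend.setdefault(backend, []).append((latency_ms, status))
    summary = {}
    for backend, rows in by_backend.items():
        latencies = [lat for lat, _ in rows]
        errors = sum(1 for _, status in rows if status >= 500)
        summary[backend] = {
            "median_ms": median(latencies),
            "error_rate": errors / len(rows),
        }
    return summary


def slow_backends(summary, factor=2.0):
    """Flag backends whose median latency exceeds `factor` times the pool median."""
    pool_median = median(s["median_ms"] for s in summary.values())
    return [b for b, s in summary.items() if s["median_ms"] > factor * pool_median]


stats = summarize(requests_log)
print(stats)
print(slow_backends(stats))   # ['app-3']
```

Comparing each backend against the pool, rather than against a fixed threshold, makes a single degraded server stand out even when overall traffic patterns shift.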