High Availability & Redundancy
High availability (HA) means designing a system so it keeps serving users even when individual parts break. Hardware fails, disks die, networks drop, and whole data centres go offline — that is normal, not exceptional. The job of an SRE (Site Reliability Engineer — someone who keeps production systems running smoothly) is to make sure one failure never takes the whole service down. This page explains how to do that with redundancy, failover, and multi-zone design, and how to weigh the very real cost that HA adds.
What “available” actually means
Availability is the percentage of time your service is up and answering correctly. People talk about it in “nines”. More nines means less downtime, but each extra nine is dramatically harder and more expensive to reach.
| Availability | Common name | Downtime per year | Downtime per month |
|---|---|---|---|
| 99% | “two nines” | ~3.65 days | ~7.2 hours |
| 99.9% | “three nines” | ~8.76 hours | ~43 minutes |
| 99.95% | | ~4.38 hours | ~21.6 minutes |
| 99.99% | “four nines” | ~52.6 minutes | ~4.3 minutes |
| 99.999% | “five nines” | ~5.26 minutes | ~26 seconds |
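The figures in the table are plain arithmetic: allowed downtime is the length of the period multiplied by (1 - availability). A minimal sketch, assuming a 365-day year and a 30-day month; the function name `downtime_minutes` is made up for illustration.

```shell
#!/bin/sh
# downtime_minutes AVAILABILITY_PERCENT PERIOD_DAYS
# Prints the downtime, in minutes, that the availability target allows
# over the given period. Hypothetical helper; awk does the float math.
downtime_minutes() {
    LC_ALL=C awk -v a="$1" -v d="$2" \
        'BEGIN { printf "%.1f\n", (100 - a) / 100 * d * 24 * 60 }'
}

downtime_minutes 99.9 365    # prints 525.6 (about 8.76 hours per year)
downtime_minutes 99.9 30     # prints 43.2  (per 30-day month)
downtime_minutes 99.99 365   # prints 52.6
```

Running the same function with your own target shows quickly how little room each extra nine leaves.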
Tip: Pick a target that matches the business, not your ego. Five nines for an internal dashboard is a waste of money. Decide the number first (it becomes your SLO — Service Level Objective), then build only the redundancy needed to hit it.
Single points of failure
A single point of failure (SPOF) is any one component that, if it dies, takes the whole service down. HA is mostly the discipline of finding SPOFs and removing them. Typical SPOFs are: one application server, one database, one load balancer, one network link, or even one availability zone.
The cure for a SPOF is redundancy — running two or more of the thing so the survivors keep working when one fails. To find SPOFs on a single Ubuntu box, walk through every layer and ask “what if this dies?”
# See what is actually running and could be a SPOF
systemctl list-units --type=service --state=running
sudo ss -tlnp # which ports/processes accept connections (sudo shows processes owned by other users)
df -h # is everything on one disk?
Output:
UNIT LOAD ACTIVE SUB DESCRIPTION
nginx.service loaded active running A high performance web server
postgresql@<version>-main.service loaded active running PostgreSQL Cluster
myapp.service loaded active running My Application
State Recv-Q Send-Q Local Address:Port Process
LISTEN 0 511 0.0.0.0:80 users:(("nginx",pid=812,fd=6))
LISTEN 0 244 127.0.0.1:5432 users:(("postgres",pid=901,fd=5))
Here nginx, the app, and PostgreSQL all live on one server and one disk. That single server is a SPOF — the first thing to fix.
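Once you have more than one machine, the same "what if this dies?" question can be asked of a simple inventory. The sketch below assumes a hypothetical plain-text inventory with one `tier host` pair per line (each line a distinct host) and lists every tier that has only one host behind it. The function name and the sample addresses for `db` and `lb` are illustrative, not taken from a real setup.

```shell
#!/bin/sh
# find_spofs: reads "tier host" pairs on stdin and prints, in sorted order,
# every tier that is served by exactly one host.
find_spofs() {
    awk '{ count[$1]++ }
         END { for (tier in count) if (count[tier] == 1) print tier }' | sort
}

# Example inventory: two web servers, but one database and one load balancer.
printf '%s\n' \
    'web 10.0.1.11' \
    'web 10.0.1.12' \
    'db 10.0.1.21' \
    'lb 10.0.1.5' | find_spofs
# prints:
# db
# lb
```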
Redundancy patterns
There are two basic ways to run redundant copies.
| Pattern | How it works | When to use |
|---|---|---|
| Active-active | All copies serve traffic at once, behind a load balancer | Stateless web/app tiers; you also gain capacity |
| Active-passive | One copy serves; a standby waits and takes over on failure | Databases and other stateful services where two writers would conflict |
For the app tier, run two or more identical servers and put a load balancer (a server that spreads incoming requests across several backends) in front. If one app server dies, the load balancer simply stops sending it traffic.
# /etc/nginx/sites-available/myapp — active-active app servers
upstream app_backend {
    server 10.0.1.11:3000 max_fails=3 fail_timeout=10s;
    server 10.0.1.12:3000 max_fails=3 fail_timeout=10s;
}
server {
    listen 80;
    server_name app.example.com;
    location / {
        proxy_pass http://app_backend;
        proxy_next_upstream error timeout http_502 http_503;
    }
}
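max_fails and fail_timeout make Nginx mark a backend as unavailable after 3 failed attempts within 10 seconds and then skip it for the next 10 seconds. Together with proxy_next_upstream, which retries a failed request on the other backend, that gives the app tier automatic failover.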
Failover
Failover is the act of switching from a failed component to a healthy one. For it to count as HA, failover must be automatic and fast. The chain is always: detect the failure (health check), then redirect traffic (load balancer, virtual IP, or DNS update).
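The detect-then-redirect chain can be sketched in a few lines of shell. This is an illustration under stated assumptions, not a production failover agent: the function names are hypothetical, the health check is whatever command you pass in, and a real checker would also wait between attempts.

```shell
#!/bin/sh
# is_down MAX_FAILS CMD [ARGS...]
# Runs the health-check command up to MAX_FAILS times. Returns 0 ("down")
# only if every attempt fails; the first success returns 1 ("healthy").
is_down() {
    max_fails=$1
    shift
    attempt=0
    while [ "$attempt" -lt "$max_fails" ]; do
        if "$@" >/dev/null 2>&1; then
            return 1    # check passed: the component is healthy
        fi
        attempt=$((attempt + 1))
    done
    return 0            # every attempt failed: treat it as down
}

# pick_backend PRIMARY STANDBY CMD [ARGS...]
# Prints STANDBY when the check keeps failing, otherwise PRIMARY.
pick_backend() {
    primary=$1
    standby=$2
    shift 2
    if is_down 3 "$@"; then
        echo "$standby"
    else
        echo "$primary"
    fi
}

# Example (assumes curl is installed and the app exposes a /health endpoint):
# pick_backend 10.0.1.11 10.0.1.12 \
#     curl -fsS --max-time 2 http://10.0.1.11:3000/health
```

Requiring several consecutive failures before switching is a deliberate choice: it keeps one dropped packet from triggering a failover.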
For a stateful service like PostgreSQL, use active-passive with a tool that promotes the standby automatically. On Ubuntu, the common stack is Patroni (a manager that handles leader election and promotion) plus a virtual IP. Install Patroni like this:
sudo apt update
sudo apt install -y python3-pip python3-psycopg2 etcd-server
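# Ubuntu 23.04 and later block system-wide pip installs (PEP 668); if this step
# fails, install Patroni into a Python virtual environment instead.
sudo pip3 install "patroni[etcd]"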
patroni --version
Output:
patroni 4.0.4
Patroni manages one primary and one or more replicas, which PostgreSQL streaming replication keeps in sync. If the primary’s host goes down, its leader lock in etcd expires and Patroni promotes a replica to primary, typically within the configured ttl (30 seconds by default), and your app reconnects through the same virtual IP. Avoid manual promotion in production: a human being paged at 3 a.m. is not “highly available”.
Gotcha: Two database primaries accepting writes at the same time (“split-brain”) corrupt data. Always use a quorum store (etcd/Consul) so the cluster agrees on exactly one leader before promoting.
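Quorum stores such as etcd and Consul only make decisions when a majority of members agree, which is why cluster size matters. The sketch below works out the majority and the number of members you can lose; the function names are illustrative.

```shell
#!/bin/sh
# quorum_size N: smallest majority of an N-member cluster.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: members that can fail while a majority survives.
tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 2 3 5; do
    echo "members=$n quorum=$(quorum_size "$n") can_lose=$(tolerated_failures "$n")"
done
# prints:
# members=2 quorum=2 can_lose=0
# members=3 quorum=2 can_lose=1
# members=5 quorum=3 can_lose=2
```

A two-member store cannot lose either node, so three members is the practical minimum for automatic failover.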
Spreading risk: zones and regions
Removing SPOFs on one machine is not enough if the whole building loses power. Cloud providers divide the world into regions (geographic areas like us-east-1) and, inside each region, availability zones (AZs: separate data centres with independent power and network). Spreading copies across zones or regions limits how much one failure can hurt you.
| Strategy | Protects against | Cost & complexity | When to use |
|---|---|---|---|
| Single AZ | Single server failure | Lowest | Dev, internal tools, low SLO |
| Multi-AZ (one region) | A whole data centre failing | Moderate (cross-AZ data transfer, 2x infra) | Most production services |
| Multi-region | An entire region outage | High (data replication lag, global routing) | Critical services, disaster recovery |
Multi-AZ is the sweet spot for most teams: it survives a data-centre outage at a reasonable cost. Multi-region is powerful but adds hard problems — replicating data across long distances introduces lag, and keeping two regions consistent is genuinely difficult. Reach for it only when an SLO or compliance rule demands it.
The cost and complexity trade-off
Each extra nine cuts the allowed downtime by a factor of ten, and the effort and spend needed to reach it rise steeply. Redundant servers cost money even while idle, cross-AZ traffic is billed, and more moving parts mean more things that can break. Sometimes added complexity lowers real availability because the failover machinery itself fails.
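A rough calculation shows how failover machinery can cancel out the gain from redundancy. The model is deliberately simple and rests on two assumptions: copies fail independently, and the failover mechanism behaves like one more component the service depends on. The function names and sample figures are illustrative.

```shell
#!/bin/sh
# parallel_availability A N
# Availability (as a fraction) of N independent copies, each with
# availability A, when any single surviving copy can serve traffic.
parallel_availability() {
    LC_ALL=C awk -v a="$1" -v n="$2" \
        'BEGIN { printf "%.6f\n", 1 - (1 - a) ^ n }'
}

# with_failover A N F
# The same redundant group, multiplied by the availability F of the
# failover machinery it depends on.
with_failover() {
    LC_ALL=C awk -v a="$1" -v n="$2" -v f="$3" \
        'BEGIN { printf "%.6f\n", (1 - (1 - a) ^ n) * f }'
}

parallel_availability 0.99 2    # prints 0.999900: two 99% servers
with_failover 0.99 2 0.98       # prints 0.979902: worse than one server
```

With a failover mechanism that only works 98% of the time, the redundant pair ends up less available than a single 99% server.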
A practical rule: add redundancy only where the math justifies it. Use your error budget (the small amount of downtime your SLO allows) to decide. If you are comfortably inside budget, do not spend on more nines — spend on features instead.
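The error budget check can be reduced to one number: how much of the allowed downtime is still unspent. A minimal sketch, assuming a 30-day window and downtime measured in minutes; the function name is hypothetical.

```shell
#!/bin/sh
# budget_remaining_percent SLO_PERCENT PERIOD_DAYS DOWNTIME_MINUTES
# Prints the share of the error budget that is still unspent.
# A negative result means the SLO has already been missed for the period.
budget_remaining_percent() {
    LC_ALL=C awk -v s="$1" -v d="$2" -v used="$3" 'BEGIN {
        budget = (100 - s) / 100 * d * 24 * 60   # allowed downtime, minutes
        printf "%.1f\n", (budget - used) / budget * 100
    }'
}

budget_remaining_percent 99.9 30 10    # prints 76.9: comfortably inside budget
budget_remaining_percent 99.9 30 54    # prints -25.0: budget overspent
```

A healthy positive number argues for shipping features; a figure near or below zero argues for spending on reliability first.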
Best practices
- Find and document every SPOF before adding redundancy; you cannot remove what you have not named.
- Make the stateless app tier active-active behind a load balancer with health checks.
- Use active-passive with automatic, quorum-based failover (e.g. Patroni) for databases.
- Target multi-AZ for production; only go multi-region when an SLO or compliance need forces it.
- Test failover regularly — kill a node on purpose in a controlled window and watch it recover.
- Set the availability target from business needs, not bragging rights, and let the error budget guide spend.
- Watch out for split-brain: never allow two writers without a quorum deciding the leader.