High Availability & Redundancy

High availability (HA) means designing a system so it keeps serving users even when individual parts break. Hardware fails, disks die, networks drop, and whole data centres go offline — that is normal, not exceptional. The job of an SRE (Site Reliability Engineer — someone who keeps production systems running smoothly) is to make sure one failure never takes the whole service down. This page explains how to do that with redundancy, failover, and multi-zone design, and how to weigh the very real cost that HA adds.

What “available” actually means

Availability is the percentage of time your service is up and answering correctly. People talk about it in “nines”. More nines means less downtime, but each extra nine is dramatically harder and more expensive to reach.

Availability	Common name	Downtime per year	Downtime per month
99%	“two nines”	~3.65 days	~7.2 hours
99.9%	“three nines”	~8.76 hours	~43 minutes
99.95%	“three and a half nines”	~4.38 hours	~21.6 minutes
99.99%	“four nines”	~52.6 minutes	~4.3 minutes
99.999%	“five nines”	~5.26 minutes	~26 seconds
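The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a 365-day year and a 30-day month, which matches how the table rounds; the function name is illustrative, not taken from any library.

```python
# Convert an availability target into the downtime it allows.
# Assumption: 365-day year and 30-day month, matching the table's rounding.

SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_days: float) -> float:
    """Return the downtime (in seconds) permitted over period_days."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year_min = allowed_downtime_seconds(target, 365) / 60
    per_month_min = allowed_downtime_seconds(target, 30) / 60
    print(f"{target}%: {per_year_min:.1f} min/year, {per_month_min:.2f} min/month")
```

Running the same function against your own target and measurement window gives the downtime allowance you are actually designing for.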

Tip: Pick a target that matches the business, not your ego. Five nines for an internal dashboard is a waste of money. Decide the number first (it becomes your SLO — Service Level Objective), then build only the redundancy needed to hit it.

Single points of failure

A single point of failure (SPOF) is any one component that, if it dies, takes the whole service down. HA is mostly the discipline of finding SPOFs and removing them. Typical SPOFs are: one application server, one database, one load balancer, one network link, or even one availability zone.

The cure for a SPOF is redundancy — running two or more of the thing so the survivors keep working when one fails. To find SPOFs on a single Ubuntu box, walk through every layer and ask “what if this dies?”

# See what is actually running and could be a SPOF
systemctl list-units --type=service --state=running
sudo ss -tlnp   # which ports/processes accept connections (sudo reveals processes owned by other users)
df -h           # is everything on one disk?

Output:

UNIT                        LOAD   ACTIVE SUB     DESCRIPTION
nginx.service               loaded active running A high performance web server
postgresql@<version>-main.service  loaded active running PostgreSQL Cluster
myapp.service               loaded active running My Application

State  Recv-Q Send-Q Local Address:Port  Process
LISTEN 0      511          0.0.0.0:80     users:(("nginx",pid=812,fd=6))
LISTEN 0      244        127.0.0.1:5432   users:(("postgres",pid=901,fd=5))

Here nginx, the app, and PostgreSQL all live on one server and one disk. That single server is a SPOF — the first thing to fix.
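To see why stacking everything on one box hurts, it helps to multiply availabilities: when every component must be up for the service to work, the chain is only as available as the product of its parts. The sketch below assumes failures are independent, and the figures are made-up illustrations, not measurements.

```python
from math import prod


def serial_availability(component_availabilities: list[float]) -> float:
    """Availability of a chain where every component must be up.

    Assumes components fail independently of each other.
    """
    return prod(component_availabilities)


# Hypothetical figures for illustration only: host, nginx, app, PostgreSQL.
single_box = serial_availability([0.999, 0.9999, 0.999, 0.999])
print(f"Whole chain: {single_box:.4%}")
```

The result is always lower than the weakest single component, which is why each layer in the chain needs its own answer to “what if this dies?”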

Redundancy patterns

There are two basic ways to run redundant copies.

Pattern	How it works	When to use
Active-active	All copies serve traffic at once, behind a load balancer	Stateless web/app tiers; you also gain capacity
Active-passive	One copy serves; a standby waits and takes over on failure	Databases and other stateful services where two writers would conflict
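The benefit of redundant copies can be estimated with simple probability. This is a rough sketch that assumes copies fail independently and that failover itself never fails, so treat the result as an upper bound rather than a prediction.

```python
def redundant_availability(single_copy: float, copies: int) -> float:
    """Availability when the service is up as long as at least one copy is up.

    Assumes independent failures and instant, flawless failover.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1 - (1 - single_copy) ** copies


for n in (1, 2, 3):
    print(f"{n} copies at 99% each: {redundant_availability(0.99, n):.6f}")
```

Correlated failures (a shared power feed, a bad deploy pushed to every copy) break the independence assumption, which is one reason real systems gain less than this formula suggests.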

For the app tier, run two or more identical servers and put a load balancer (a server that spreads incoming requests across several backends) in front. If one app server dies, the load balancer simply stops sending it traffic.

# /etc/nginx/sites-available/myapp — active-active app servers
upstream app_backend {
    server 10.0.1.11:3000 max_fails=3 fail_timeout=10s;
    server 10.0.1.12:3000 max_fails=3 fail_timeout=10s;
}

server {
    listen 80;
    server_name app.example.com;
    location / {
        proxy_pass http://app_backend;
        proxy_next_upstream error timeout http_502 http_503;
    }
}

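max_fails and fail_timeout tell Nginx to mark a backend as unavailable after 3 failed attempts within 10 seconds, and then to skip it for the next 10 seconds before trying it again. This is a passive health check: Nginx learns that a backend is down only when real requests to it fail. proxy_next_upstream then retries the failed request on the other backend (by default Nginx does not retry non-idempotent requests such as POST). Together these settings give the app tier automatic failover.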
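The bookkeeping behind max_fails and fail_timeout can be sketched in a few lines. This is a simplified model for illustration, not Nginx's actual implementation; the class name and the injectable clock are assumptions made so the behaviour is easy to test.

```python
import time


class PassiveHealthTracker:
    """Tracks failures for one backend, loosely modelled on max_fails / fail_timeout."""

    def __init__(self, max_fails: int = 3, fail_timeout: float = 10.0, clock=time.monotonic):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.clock = clock  # injectable so tests can control time
        self.failure_times: list[float] = []
        self.down_until = 0.0

    def record_failure(self) -> None:
        now = self.clock()
        # Keep only failures that happened inside the fail_timeout window.
        self.failure_times = [t for t in self.failure_times if now - t < self.fail_timeout]
        self.failure_times.append(now)
        if len(self.failure_times) >= self.max_fails:
            # Too many recent failures: take the backend out of rotation.
            self.down_until = now + self.fail_timeout
            self.failure_times.clear()

    def is_available(self) -> bool:
        return self.clock() >= self.down_until
```

A load balancer would keep one tracker per backend, call record_failure() whenever a proxied request fails, and skip any backend whose is_available() returns False.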

Failover

Failover is the act of switching from a failed component to a healthy one. For it to count as HA, failover must be automatic and fast. The chain is always the same: detect the failure (health check), promote a standby if the service is stateful, then redirect traffic (load balancer, virtual IP, or DNS update).
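The detect-then-redirect chain can be sketched with an active TCP probe. This is a minimal illustration using only the Python standard library; a real health check would usually query an application-level endpoint, and the function names here are made up for the example.

```python
import socket
from typing import Optional


def tcp_health_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def pick_target(candidates: list[tuple[str, int]]) -> Optional[tuple[str, int]]:
    """Return the first candidate that passes the health check, or None if all fail."""
    for host, port in candidates:
        if tcp_health_check(host, port):
            return (host, port)
    return None
```

Calling pick_target() with the primary first and the standby second gives a crude redirect step: traffic goes to the first address that still accepts connections.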

For a stateful service like PostgreSQL, use active-passive with a tool that promotes the standby automatically. On Ubuntu, the common stack is Patroni (a manager that handles leader election and promotion) plus a virtual IP. Install Patroni like this:

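sudo apt update
sudo apt install -y python3-pip python3-psycopg2 etcd-server
# On Ubuntu 23.04 and later, a system-wide pip install is refused (PEP 668);
# install into a virtual environment there instead.
# Use "patroni[etcd3]" if your etcd serves only the v3 API (the default since etcd 3.4).
sudo pip3 install "patroni[etcd]"
patroni --version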

Output:

patroni 4.0.4

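Patroni manages one primary and one or more replicas that follow it through PostgreSQL streaming replication. Replication is asynchronous by default, so a failover can lose the last few transactions unless you enable synchronous mode. If the primary’s host goes down, Patroni promotes a replica once the leader lock expires (the ttl setting, 30 seconds by default), and your app reconnects through the same virtual IP. Avoid manual promotion in production: a human paging in at 3am is not “highly available”.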
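On the application side, surviving a failover usually means retrying the connection for a short while instead of crashing on the first error. The sketch below shows one way to do that with exponential backoff. The function name is hypothetical, and it catches Python's built-in ConnectionError as a stand-in; a real database driver raises its own exception type, which you would catch instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the defaults the caller waits 0.5, 1, 2, and 4 seconds between the five attempts, so tune attempts and base_delay to cover however long your failover actually takes.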

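Gotcha: Two database primaries accepting writes at the same time (“split-brain”) leave you with diverging, conflicting data. Always use a quorum store (etcd/Consul) so the cluster agrees on exactly one leader before promoting, and run that store with an odd number of members (three or more) so a majority survives the loss of one.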
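The arithmetic behind quorum is worth seeing once. A quorum is a strict majority, so after a network partition at most one side can hold it, and only that side may elect a leader. The helper names below are illustrative.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum_size(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Two members tolerate no more failures than one, and four tolerate no more than three, which is why quorum stores are normally sized at three or five members.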

Spreading risk: zones and regions

Removing SPOFs on one machine is not enough if the whole building loses power. Cloud providers divide the world into regions (geographic areas like us-east-1) and, inside each region, availability zones (AZs — separate data centres with independent power and network). Placing copies in different places limits how much one failure can hurt you.

Strategy	Protects against	Cost & complexity	When to use
Single AZ	Single server failure	Lowest	Dev, internal tools, low SLO
Multi-AZ (one region)	A whole data centre failing	Moderate (cross-AZ data transfer, 2x infra)	Most production services
Multi-region	An entire region outage	High (data replication lag, global routing)	Critical services, disaster recovery

Multi-AZ is the sweet spot for most teams: it survives a data-centre outage at a reasonable cost. Multi-region is powerful but adds hard problems — replicating data across long distances introduces lag, and keeping two regions consistent is genuinely difficult. Reach for it only when an SLO or compliance rule demands it.
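A quick way to sanity-check a multi-AZ layout is to ask how much capacity is left if the busiest zone disappears. The sketch below assumes every instance carries equal load, and the zone names are hypothetical.

```python
from collections import Counter


def worst_case_surviving_fraction(instance_zones: list[str]) -> float:
    """Fraction of instances left if the most heavily populated zone fails."""
    if not instance_zones:
        raise ValueError("need at least one instance")
    per_zone = Counter(instance_zones)
    largest_zone = max(per_zone.values())
    return (len(instance_zones) - largest_zone) / len(instance_zones)


# Hypothetical layouts: one zone label per running instance.
print(worst_case_surviving_fraction(["az-a", "az-a", "az-a"]))  # everything in one zone
print(worst_case_surviving_fraction(["az-a", "az-b", "az-c"]))  # spread evenly
```

If the surviving fraction cannot carry peak traffic, the service is multi-AZ on paper only; either add instances or spread them more evenly.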

The cost and complexity trade-off

Each extra nine cuts the allowed downtime by a factor of ten, and the effort and spend to get there rise steeply. Redundant servers cost money even while idle, cross-AZ traffic is billed, and more moving parts mean more things that can break. Added complexity can even lower real availability when the failover machinery itself fails.

A practical rule: add redundancy only where the math justifies it. Use your error budget (the small amount of downtime your SLO allows) to decide. If you are comfortably inside budget, do not spend on more nines — spend on features instead.
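The error-budget check can be written down directly. This is a minimal sketch that assumes a 30-day window and measures the budget in minutes of downtime; real SLOs are often defined on request success rates instead, and the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total downtime (in minutes) the SLO allows over the window."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def budget_remaining(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is missed)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (budget - downtime_minutes) / budget


# Example: a 99.9% SLO with 10.8 minutes of downtime so far this window.
print(f"{budget_remaining(99.9, 10.8):.0%} of the budget is left")
```

A comfortably positive result argues for spending on features; a result near zero or below argues for spending on reliability.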

Best practices

Find and document every SPOF before adding redundancy; you cannot remove what you have not named.
Make the stateless app tier active-active behind a load balancer with health checks.
Use active-passive with automatic, quorum-based failover (e.g. Patroni) for databases.
Target multi-AZ for production; only go multi-region when an SLO or compliance need forces it.
Test failover regularly — kill a node on purpose in a controlled window and watch it recover.
Set the availability target from business needs, not bragging rights, and let the error budget guide spend.
Watch out for split-brain: never allow two writers without a quorum deciding the leader.