Designing for High Availability

High availability (HA) means your application keeps working even when individual pieces of it break. The core idea is simple: find every place where one failure can take you down (a “single point of failure”), and add a redundant copy so the system survives the loss of any one component automatically, with no human waking up at 3 a.m. to fix it. This page shows how to design HA on AWS by spreading your workload across Availability Zones, putting load balancers in front of your servers, and using managed databases that fail over on their own.

What “highly available” actually means

A system is highly available when it can lose a component and keep serving traffic without manual intervention. That last part is the part most tutorials skip. A single huge server is not HA, no matter how powerful it is, because if it dies, your site is down until someone rebuilds it. Availability is usually measured as a percentage of uptime over a year.

Availability	Downtime per year	Common name
99%	~3.65 days	“two nines”
99.9%	~8.76 hours	“three nines”
99.99%	~52.6 minutes	“four nines”
99.999%	~5.26 minutes	“five nines”
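
The downtime figures in the table come from simple arithmetic, and you can reproduce them for any target. The sketch below assumes a 365-day year; the function name downtime_minutes_per_year is made up for this example.

```shell
# Convert an availability percentage into allowed downtime per year, in minutes.
# ASSUMPTION: a 365-day year (525,600 minutes), matching the table above.
downtime_minutes_per_year() {
  awk -v pct="$1" 'BEGIN { printf "%.1f\n", (100 - pct) / 100 * 365 * 24 * 60 }'
}

downtime_minutes_per_year 99.9    # prints 525.6 (about 8.76 hours)
downtime_minutes_per_year 99.99   # prints 52.6
```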

To reach three or four nines, redundancy must be automatic. If recovery needs a person, you are not relying on redundancy; you are relying on how fast your on-call engineer can react.

Gotcha: HA (handling component failures), scalability (handling more load), and disaster recovery (recovering from a region-wide disaster) are three different goals. Do not conflate them. A bigger instance gives you more capacity (scalability) but zero extra availability: it is still one box.

The Availability Zone model

AWS organizes infrastructure into Regions (a geographic area like us-east-1) and, inside each Region, multiple Availability Zones (AZs). An AZ is one or more physically separate data centers with independent power, cooling, and networking. AZs in a Region are close enough for fast, low-latency communication but far enough apart that a fire, flood, or power loss in one will not hit another.

Most Regions have three or more AZs (for example us-east-1a, us-east-1b, us-east-1c). The HA building block on AWS is: deploy across at least two AZs, ideally three. If you run in one AZ and that AZ has an outage, you are down. If you run in two and one fails, you survive on the other, provided the surviving AZ has enough capacity to carry the full load.
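
To get a rough feel for why a second and third AZ help so much, you can estimate the availability of n redundant copies as 1 - (1 - a)^n. This is an idealized sketch: it assumes failures in different AZs are independent and that failover is instant, so treat the results as an upper bound. The function name combined_availability is made up for this example.

```shell
# Idealized availability (in percent) of n redundant copies, each with availability a percent.
# ASSUMPTION: failures are independent and failover is instant; real systems do worse.
combined_availability() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - (1 - a / 100) ^ n) * 100 }'
}

combined_availability 99 1   # prints 99.0000
combined_availability 99 2   # prints 99.9900
combined_availability 99 3   # prints 99.9999
```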

When to use this: for any production workload that users depend on. When NOT to: a throwaway dev sandbox or a batch job you can simply re-run does not need multi-AZ, and the duplicated resources cost money.

Removing single points of failure, tier by tier

A typical web app has three tiers: the load balancer, the application servers, and the database. Each tier needs its own redundancy.

Load balancer tier

An Application Load Balancer (ALB) is a managed AWS service that accepts incoming traffic and spreads it across your servers. It is multi-AZ by design: you attach it to subnets in several AZs and AWS keeps it running even if an AZ fails. You never run “one load balancer instance”; AWS handles that for you.

Console steps to create an ALB:

Open the EC2 console, choose Load Balancers in the left menu, click Create load balancer.
Pick Application Load Balancer, give it a name, and choose Internet-facing.
Under Network mapping, select your VPC and tick at least two AZs, choosing a subnet in each.
Create or pick a security group that allows inbound traffic on ports 80 and 443.
Create a target group pointing at your instances, set a health check path (e.g. /health), then create.

CLI equivalent:

aws elbv2 create-load-balancer \
  --name web-alb \
  --type application \
  --scheme internet-facing \
  --subnets subnet-0a1b2c3d subnet-0e4f5g6h \
  --security-groups sg-0a1b2c3d

Output:

{
    "LoadBalancers": [
        {
            "LoadBalancerArn": "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/web-alb/50dc6c495c0c9188",
            "DNSName": "web-alb-1234567890.us-east-1.elb.amazonaws.com",
            "AvailabilityZones": [
                {"ZoneName": "us-east-1a", "SubnetId": "subnet-0a1b2c3d"},
                {"ZoneName": "us-east-1b", "SubnetId": "subnet-0e4f5g6h"}
            ],
            "State": {"Code": "provisioning"}
        }
    ]
}
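
The create-load-balancer call only creates the load balancer itself. To match the console steps you also need a target group with a health check and a listener that forwards to it. The sketch below wraps both calls in a helper; the function name, the target group name web-tg, and the plain HTTP listener on port 80 are assumptions for illustration. A production listener would normally use HTTPS on port 443 with a certificate.

```shell
# Create a target group with a /health check, then an HTTP listener that forwards to it.
# Arguments: $1 = VPC ID, $2 = load balancer ARN (placeholders you supply).
create_target_group_and_listener() {
  # Capture the new target group's ARN so the listener can reference it.
  tg_arn=$(aws elbv2 create-target-group \
    --name web-tg \
    --protocol HTTP --port 80 \
    --vpc-id "$1" \
    --health-check-path /health \
    --query 'TargetGroups[0].TargetGroupArn' --output text) || return 1

  aws elbv2 create-listener \
    --load-balancer-arn "$2" \
    --protocol HTTP --port 80 \
    --default-actions "Type=forward,TargetGroupArn=$tg_arn"
}
```

Call it with your own VPC ID and the LoadBalancerArn from the output above, for example: create_target_group_and_listener vpc-0a1b2c3d "$LB_ARN".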

Application tier

Run your application servers in an Auto Scaling Group (ASG), a service that keeps a desired number of EC2 instances running and replaces any that crash or fail a health check. Spread the ASG across the same AZs as the ALB. If one instance dies, the ASG launches a fresh one automatically. If a whole AZ dies, the ASG launches replacement capacity in the surviving AZs.

The key design rule here is to keep the app tier stateless: store no session data, no uploaded files, no cache that only lives on that one server. Put session state in a shared store like Amazon ElastiCache or DynamoDB, and files in Amazon S3. Then any request can hit any instance, and losing an instance loses nothing.

CLI to create an ASG that spans AZs and registers with the target group:

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template "LaunchTemplateId=lt-0a1b2c3d,Version=1" \
  --min-size 2 --max-size 6 --desired-capacity 2 \
  --vpc-zone-identifier "subnet-0a1b2c3d,subnet-0e4f5g6h" \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web-tg/abcd1234 \
  --health-check-type ELB --health-check-grace-period 120

A --min-size of 2 across two AZs is the smallest sensible HA app tier: lose one instance and you are still serving.
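
It is worth checking that the instances really are spread across zones. The sketch below counts in-service instances per AZ; the two function names are made up for this example, and the query assumes the group name you pass exists.

```shell
# Count in-service instances per Availability Zone.
# Reads "InstanceId<TAB>AvailabilityZone<TAB>LifecycleState" lines on stdin.
count_in_service_per_az() {
  awk '$3 == "InService" { n[$2]++ } END { for (az in n) print az, n[az] }' | sort
}

# Fetch the instance list of an ASG (group name in $1) and summarize it.
asg_az_spread() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query 'AutoScalingGroups[0].Instances[].[InstanceId,AvailabilityZone,LifecycleState]' \
    --output text | count_in_service_per_az
}
```

Running asg_az_spread web-asg should list each AZ with at least one instance next to it. An AZ missing from the list means that zone currently serves no traffic.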

Tip: A single NAT Gateway is a hidden single point of failure. If your private instances reach the internet through one NAT Gateway in one AZ and that AZ fails, every AZ loses outbound access. Deploy one NAT Gateway per AZ and route each private subnet to the NAT Gateway in its own zone. Each NAT Gateway costs roughly $0.045/hour in us-east-1 (~$33/month) plus data processing charges, so two or three add up, but it is the price of real HA.
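
One way to set up a zonal NAT Gateway is sketched below. Run it once per AZ. The function name is made up for this example, and it assumes you already have a public subnet, an Elastic IP allocation, and a private route table for that AZ.

```shell
# Create a NAT Gateway in one AZ's public subnet and point that AZ's private
# route table at it.
# Arguments: $1 = public subnet ID, $2 = Elastic IP allocation ID, $3 = private route table ID
create_zonal_nat() {
  nat_id=$(aws ec2 create-nat-gateway \
    --subnet-id "$1" --allocation-id "$2" \
    --query 'NatGateway.NatGatewayId' --output text) || return 1

  # Wait until the gateway is available before routing traffic to it.
  aws ec2 wait nat-gateway-available --nat-gateway-ids "$nat_id" || return 1

  aws ec2 create-route \
    --route-table-id "$3" \
    --destination-cidr-block 0.0.0.0/0 \
    --nat-gateway-id "$nat_id"
}
```

Keeping one route table per AZ is the design choice that matters here: if all private subnets share a route table, they all share one NAT Gateway again.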

Database tier

A single database instance is the most common HA mistake. Amazon RDS solves it with Multi-AZ deployments: RDS keeps a synchronous standby copy in a second AZ. If the primary fails, RDS promotes the standby and repoints the database endpoint automatically, usually within 60 to 120 seconds, with no DNS change needed on your side.

Enable Multi-AZ when creating an RDS instance:

aws rds create-db-instance \
  --db-instance-identifier prod-db \
  --db-instance-class db.r6g.large \
  --engine postgres \
  --multi-az \
  --allocated-storage 100 \
  --master-username postgres \
  --manage-master-user-password

Output:

{
    "DBInstance": {
        "DBInstanceIdentifier": "prod-db",
        "Engine": "postgres",
        "MultiAZ": true,
        "DBInstanceStatus": "creating"
    }
}

Multi-AZ roughly doubles the database cost because you pay for the standby, and the standby cannot serve read traffic; it exists purely for failover. (If you also need to scale reads, add separate read replicas. That is scalability, not HA.)
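
If you do need read scaling, a read replica is a separate call. A minimal sketch, assuming the prod-db instance from above; the function name and the replica name are made up for this example.

```shell
# Add a read replica of an existing instance.
# Arguments: $1 = source DB instance identifier, $2 = name for the new replica
create_read_replica() {
  aws rds create-db-instance-read-replica \
    --db-instance-identifier "$2" \
    --source-db-instance-identifier "$1"
}
```

For example: create_read_replica prod-db prod-db-replica-1. The replica gets its own endpoint, so your application has to send read queries there explicitly.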

HA vs scalability vs disaster recovery

These three get mixed up constantly, so keep them separate:

Goal	Question it answers	Typical AWS tools
High availability	“Will I survive a component or AZ failure automatically?”	Multi-AZ, ALB, ASG, RDS Multi-AZ
Scalability	“Can I handle more load?”	Auto Scaling policies, read replicas, larger instances
Disaster recovery	“Can I recover if a whole Region is lost?”	Backups, cross-Region replication, multi-Region failover

A Multi-AZ database is HA but not DR: if the entire Region goes down, both your primary and standby go with it. DR requires copies in another Region.

Best Practices

Deploy across at least two AZs, preferably three, for every production tier.
Make the application tier stateless so any instance can serve any request and instances are disposable.
Put workloads behind a load balancer with health checks so unhealthy instances are removed automatically.
Use RDS Multi-AZ (or Aurora, which replicates storage across three AZs by default; add a replica in a second AZ so compute can fail over too) instead of a single database instance.
Eliminate hidden single points of failure: one NAT Gateway per AZ, no single shared instance everything depends on.
Set the ASG minimum size to 2 or more so losing one instance never drops you to zero.
Test failover regularly: kill an instance or trigger an AZ failover, and confirm recovery happens with no human action.
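
A failover drill can be as small as the two helpers sketched below. The function names are made up for this example. Run them against a test environment first, because both cause a real, if brief, disruption.

```shell
# Terminate one instance without lowering desired capacity, so the ASG replaces it.
# Argument: $1 = EC2 instance ID
kill_one_instance() {
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$1" \
    --no-should-decrement-desired-capacity
}

# Reboot a Multi-AZ database and force a failover to the standby.
# Argument: $1 = DB instance identifier
force_db_failover() {
  aws rds reboot-db-instance \
    --db-instance-identifier "$1" \
    --force-failover
}
```

After each call, watch whether traffic keeps flowing and how long recovery takes, for example force_db_failover prod-db followed by timing how long your application takes to reconnect.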