Disaster Recovery Strategies

Disaster recovery (DR) is your plan for getting an application back online after something big breaks: a Region goes down, a database gets corrupted, or someone deletes the wrong thing. AWS gives you four standard DR strategies, and they trade money against speed. The trick is to pick the cheapest one that still meets your recovery goals, then actually test it. A DR plan you have never tested is fiction.

Define RTO and RPO first

Before you choose a strategy, you must agree on two numbers with the business. Everything else depends on them.

RTO (Recovery Time Objective): how fast you must be back online. “We can be down for at most 1 hour” means RTO = 1 hour. It measures the downtime you can tolerate.
RPO (Recovery Point Objective): how much data you can afford to lose, measured in time. “We can lose at most 5 minutes of data” means RPO = 5 minutes. It measures the gap between your last good copy and the moment of failure.

A small RTO means you need infrastructure ready to take over fast. A small RPO means you need to copy data often (or continuously). Both push cost up.
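
To make the link between RPO and copy frequency concrete, here is a minimal Python sketch. It assumes a simplified model in which a backup captures data as of its start time and only becomes usable once it finishes; the function names and the model itself are illustrative, not an AWS formula.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, duration: timedelta) -> timedelta:
    """Worst-case gap between the last usable copy and a failure.

    Simplifying assumption: a backup captures data as of its start time and
    only becomes usable once it completes. A failure just before the next
    backup completes therefore loses up to interval + duration of data.
    """
    return interval + duration


def meets_rpo(interval: timedelta, duration: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss fits inside the agreed RPO."""
    return worst_case_data_loss(interval, duration) <= rpo


# Nightly backups that take 30 minutes cannot satisfy a 4-hour RPO...
nightly = meets_rpo(timedelta(hours=24), timedelta(minutes=30), timedelta(hours=4))
# ...but hourly backups can.
hourly = meets_rpo(timedelta(hours=1), timedelta(minutes=30), timedelta(hours=4))
print(nightly, hourly)  # False True
```

Continuous replication shrinks the interval toward zero, which is why the strategies with the smallest RPO rely on it rather than on scheduled backups.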

Pick the strategy from your RTO/RPO numbers, not the other way around. Building active/active multi-site when nightly backups would satisfy your RPO burns money for nothing. But under-investing, then discovering during a real outage that your “DR plan” cannot meet the RTO, is far worse.

The four strategies

The strategies form a spectrum. As you move down the table, RTO and RPO shrink (better recovery) but cost climbs (you keep more running in a second Region).

Strategy	Typical RTO	Typical RPO	Standby running?	Relative cost	Use when
Backup & restore	Hours	Hours	Nothing	$	Non-critical apps; generous recovery windows
Pilot light	Tens of minutes	Minutes	Core data only	$$	Important apps that can tolerate a short rebuild
Warm standby	Minutes	Seconds–minutes	Scaled-down full stack	$$$	Business-critical apps needing quick failover
Active/active (multi-site)	Near zero	Near zero	Full stack, both Regions live	$$$$	Mission-critical apps where downtime = lost revenue
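
You can read the table as a lookup: walk from cheapest to most expensive and stop at the first strategy that fits both targets. The sketch below shows that idea in Python. The cutoffs are illustrative assumptions that loosely follow the “Typical RTO” and “Typical RPO” columns, not AWS-published limits, so replace them with numbers you have measured.

```python
from datetime import timedelta

# Cheapest first. The cutoffs are illustrative assumptions that loosely follow
# the "Typical RTO" and "Typical RPO" columns; replace them with numbers you
# have measured in your own game days.
STRATEGIES = [
    # (name, best RTO it can typically meet, best RPO it can typically meet)
    ("Backup & restore", timedelta(hours=4), timedelta(hours=1)),
    ("Pilot light", timedelta(minutes=30), timedelta(minutes=5)),
    ("Warm standby", timedelta(minutes=5), timedelta(minutes=1)),
    ("Active/active (multi-site)", timedelta(0), timedelta(0)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Return the first (cheapest) strategy whose typical RTO and RPO both fit."""
    for name, typical_rto, typical_rpo in STRATEGIES:
        if typical_rto <= rto and typical_rpo <= rpo:
            return name
    # Unreachable with the zero cutoffs above, kept as a safe default.
    return STRATEGIES[-1][0]


# RTO = 1 hour and RPO = 5 minutes, the examples used earlier.
print(cheapest_strategy(timedelta(hours=1), timedelta(minutes=5)))  # Pilot light
```

Note that the tighter of the two targets decides the outcome: a generous RTO does not help if the RPO rules out scheduled backups.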

Backup & restore

You take regular backups and store them safely. After a disaster, you provision new infrastructure and restore from backup. Cheapest option, slowest recovery.

AWS services: AWS Backup (central backup policies), Amazon S3 for files, EBS snapshots for volumes, Amazon RDS automated backups and snapshots, and Amazon S3 Cross-Region Replication (which copies objects to a second Region automatically).

When to use it: internal tools, reporting systems, anything where being down for a few hours is acceptable. When NOT to: customer-facing systems with tight RTO.

Console steps to set up a centralized backup plan:

Open the AWS Backup console.
Choose Backup plans, then Create backup plan, then Build a new plan.
Set a rule: backup frequency (e.g. daily), a backup window, and retention (e.g. 35 days).
Under Copy to destination, add your DR Region to replicate backups cross-Region.
Choose Create plan, then Assign resources using tags or resource IDs.
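
If you would rather keep the plan in version control, the same settings can be written as a JSON document. The Python sketch below builds one with a daily rule, 35-day retention, and a cross-Region copy. The plan name, rule name, schedule, window lengths, and vault ARN are hypothetical placeholders, and resource assignment (the last console step) is a separate call that is not shown.

```python
import json

# Hypothetical account ID, Region, and vault name; substitute your own.
DR_VAULT_ARN = "arn:aws:backup:us-west-2:111122223333:backup-vault:Default"

backup_plan = {
    "BackupPlanName": "daily-with-dr-copy",
    "Rules": [
        {
            "RuleName": "daily",
            "TargetBackupVaultName": "Default",
            # Every day at 05:00 UTC.
            "ScheduleExpression": "cron(0 5 ? * * *)",
            # Backup window: start within 1 hour, finish within 3 hours.
            "StartWindowMinutes": 60,
            "CompletionWindowMinutes": 180,
            "Lifecycle": {"DeleteAfterDays": 35},
            # Copy each recovery point to the vault in the DR Region.
            "CopyActions": [
                {
                    "DestinationBackupVaultArn": DR_VAULT_ARN,
                    "Lifecycle": {"DeleteAfterDays": 35},
                }
            ],
        }
    ],
}

# Write the plan to disk so the CLI can read it.
with open("backup-plan.json", "w") as f:
    json.dump(backup_plan, f, indent=2)
```

You could then pass the file to aws backup create-backup-plan --backup-plan file://backup-plan.json.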

From the CLI, you can also start an on-demand backup of a single resource, here an EBS volume, without waiting for the plan’s schedule:

aws backup start-backup-job \
  --backup-vault-name Default \
  --resource-arn arn:aws:ec2:us-east-1:111122223333:volume/vol-0a1b2c3d4e5f \
  --iam-role-arn arn:aws:iam::111122223333:role/service-role/AWSBackupDefaultServiceRole

Output:

{
    "BackupJobId": "A1B2C3D4-5678-90AB-CDEF-EXAMPLE11111",
    "CreationDate": "2026-06-15T10:42:00.123000+00:00"
}

Pilot light

You keep the bare minimum running in the DR Region: usually the database, replicated continuously, plus pre-built machine images. The application servers are not running (either stopped or not yet launched). When disaster strikes, you “turn up the gas”: launch servers from the images and point traffic at them.

AWS services: Amazon RDS cross-Region read replicas or Aurora global database for the data layer, pre-baked AMIs (Amazon Machine Images, the templates EC2 launches from, e.g. ami-0abcdef1234567890), and infrastructure as code to launch the rest fast.

When to use it: important apps that can survive a 10–30 minute rebuild but cannot lose much data. When NOT to: apps needing sub-minute failover.

Create a cross-Region RDS read replica that you can promote to a standalone DR database during failover:

aws rds create-db-instance-read-replica \
  --db-instance-identifier app-db-dr \
  --source-db-instance-identifier arn:aws:rds:us-east-1:111122223333:db:app-db-prod \
  --region us-west-2

Output (abbreviated):

{
    "DBInstance": {
        "DBInstanceIdentifier": "app-db-dr",
        "DBInstanceStatus": "creating",
        "DBInstanceClass": "db.r6g.large"
    }
}

Warm standby

A scaled-down but fully working copy of your stack runs in the DR Region all the time. It handles no live traffic (or a tiny amount), but everything is on. During failover you scale it up and shift traffic. Recovery is minutes, not tens of minutes.

AWS services: Auto Scaling groups (kept at minimum capacity), Aurora global database, Elastic Load Balancing in both Regions, and Amazon Route 53 (AWS’s DNS service) health checks with failover routing.

When to use it: business-critical apps where minutes of downtime cost real money. When NOT to: low-value apps — you pay for idle capacity 24/7.

Route 53 failover routing sends users to the standby automatically when the primary’s health check fails:

aws route53 change-resource-record-sets \
  --hosted-zone-id Z0A1B2C3D4E5F6 \
  --change-batch file://failover-record.json

A failover record set marks one endpoint as PRIMARY and another as SECONDARY, tied to a health check ID — Route 53 only serves the secondary when the primary is unhealthy.
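
The command above reads its changes from failover-record.json, which is not shown. As a rough sketch, the Python below generates one possible version of that file with a PRIMARY and a SECONDARY record. The domain name, target hostnames, and health check ID are hypothetical placeholders, and plain CNAME records are used only to keep the example short.

```python
import json

# All names and IDs below are hypothetical placeholders.
RECORD_NAME = "app.example.com."
PRIMARY_TARGET = "primary-alb.us-east-1.example.com"
SECONDARY_TARGET = "standby-alb.us-west-2.example.com"
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"


def failover_record(role, target, health_check_id=None):
    """Build one UPSERT change for a failover CNAME record."""
    record = {
        "Name": RECORD_NAME,
        "Type": "CNAME",
        "SetIdentifier": role.lower(),
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,  # short TTL so clients pick up the failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Failover routing for warm standby",
    "Changes": [
        # The primary carries the health check that triggers failover.
        failover_record("PRIMARY", PRIMARY_TARGET, PRIMARY_HEALTH_CHECK_ID),
        failover_record("SECONDARY", SECONDARY_TARGET),
    ],
}

# Write the change batch where the CLI command expects to find it.
with open("failover-record.json", "w") as f:
    json.dump(change_batch, f, indent=2)
```

For load balancers you would more commonly use alias records than CNAMEs, which changes the record body but not the PRIMARY/SECONDARY pattern.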

Active/active (multi-site)

Both Regions run full production and serve live traffic at the same time. If one Region fails, the other absorbs everything with little or no interruption. RTO and RPO approach zero. It is also the most complex and most expensive.

AWS services: Amazon DynamoDB global tables (multi-Region read/write) or Aurora global database (one writer Region, with read-only secondary Regions), Route 53 latency or weighted routing, and Amazon CloudFront in front of it all.

When to use it: payments, trading, anything where downtime directly loses revenue or breaks trust. When NOT to: almost everything else — the cost and complexity rarely pay off.

Active/active is roughly 2x your steady-state bill because you run full capacity in two Regions. If your RPO is “a few hours,” nightly backups cost a tiny fraction of that. Match the spend to the requirement.

Test your plan

A DR strategy only counts if you have proven it works. Schedule game days where you fail over to the DR Region, time the actual RTO, and confirm data loss is within your RPO. AWS Resilience Hub can assess your app against your RTO/RPO targets and flag gaps. Automate the failover with Route 53 health checks and runbooks so a tired human at 3 a.m. is not improvising.
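
Measuring a game day comes down to three timestamps: when the failure happened, when service came back, and how fresh the newest data in the DR Region was. Here is a minimal Python sketch of that comparison; the function name and the drill times are hypothetical.

```python
from datetime import datetime, timedelta


def game_day_report(last_good_copy, failure, service_restored, rto, rpo):
    """Compare what a failover drill achieved against the agreed targets."""
    achieved_rto = service_restored - failure  # downtime
    achieved_rpo = failure - last_good_copy    # data lost, measured in time
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


# Hypothetical drill: failure injected at 10:00, service back at 10:47,
# newest replicated data in the DR Region was from 09:58.
report = game_day_report(
    last_good_copy=datetime(2026, 6, 15, 9, 58),
    failure=datetime(2026, 6, 15, 10, 0),
    service_restored=datetime(2026, 6, 15, 10, 47),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=5),
)
print(report["rto_met"], report["rpo_met"])  # True True
```

Recording these numbers after every drill gives you a trend, so you notice when a growing dataset or a changed deployment quietly pushes recovery past the target.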

Best practices

Agree on RTO and RPO with the business in writing before designing anything.
Choose the cheapest strategy that meets both numbers; do not over-engineer.
Replicate backups and data to a second Region — a single-Region backup is not DR.
Use infrastructure as code so the DR environment is reproducible and drift-free.
Automate failover with Route 53 health checks and tested runbooks.
Run regular game days and measure real RTO/RPO, not the numbers you hoped for.
Encrypt backups and replicas, and verify that restored data actually opens and is usable.