Health Checks & DNS Failover
Amazon Route 53 is AWS’s DNS service (DNS, the Domain Name System, turns names like app.example.com into IP addresses). On its own, DNS happily hands out the address of a server even if that server is on fire. Route 53 health checks fix that: they constantly probe your endpoints, and when one goes unhealthy, Route 53 stops returning it in DNS answers and sends traffic to a healthy backup instead. This page shows how health checks work, how to wire them into failover routing, and the one gotcha that surprises everyone: failover is often measured in minutes, not seconds.
What a health check actually is
A health check is a tiny monitor that Route 53 runs from many locations around the world. Each location asks your endpoint “are you alive?” on a schedule. If enough locations get a bad answer, Route 53 marks the check Unhealthy. The check has no effect by itself — it becomes useful only when you attach it to a DNS record so Route 53 can decide whether to return that record.
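To make “enough locations” concrete, here is a minimal sketch of the aggregation rule. It assumes the threshold given in the AWS documentation at the time of writing (an endpoint counts as healthy when more than 18% of checkers report it healthy); that figure is not from this page and may change. The function name `aggregate_status` is made up for illustration.

```shell
# Sketch only: healthy when MORE than 18% of checkers report healthy.
# The 18% threshold is an assumption taken from AWS documentation.
aggregate_status() {
  healthy_checkers="$1"   # number of checkers that reported healthy
  total_checkers="$2"     # number of checkers that reported at all
  # Integer math: healthy/total > 18/100  <=>  healthy*100 > total*18
  if [ $((healthy_checkers * 100)) -gt $((total_checkers * 18)) ]; then
    echo "Healthy"
  else
    echo "Unhealthy"
  fi
}

aggregate_status 3 16   # 18.75% healthy, prints Healthy
aggregate_status 2 16   # 12.5% healthy, prints Unhealthy
```

The practical consequence is that a few failing probe locations will not flip the check on their own; most of them have to agree the endpoint is down.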
There are three kinds of health check:
| Type | What it monitors | When to use it |
|---|---|---|
| Endpoint | Pings an IP or domain over HTTP, HTTPS, or TCP | A single web server, load balancer, or API you can reach over the internet |
| Calculated | Combines the status of other health checks with AND/OR/NOT logic | “Site is healthy only if at least 2 of 3 backends are up” |
| CloudWatch alarm | Watches an Amazon CloudWatch alarm (AWS’s metrics/alerting service) | Internal resources Route 53 can’t reach directly, e.g. an RDS database or a private endpoint behind a NAT |
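
As a rough illustration of the Calculated row, the sketch below builds the `HealthCheckConfig` JSON for an “at least 2 of 3” check. `CALCULATED`, `HealthThreshold`, and `ChildHealthChecks` are fields of the Route 53 API; the helper name `calculated_check_config` and the child IDs are placeholders, not values from this page.

```shell
# Print a HealthCheckConfig for a calculated check that is healthy when at
# least <threshold> of the listed child health checks are healthy.
calculated_check_config() {
  threshold="$1"; shift
  children=""
  for id in "$@"; do
    # Append each child ID as a quoted, comma-separated JSON string.
    children="${children:+$children, }\"$id\""
  done
  printf '{"Type": "CALCULATED", "HealthThreshold": %s, "ChildHealthChecks": [%s]}\n' \
    "$threshold" "$children"
}

calculated_check_config 2 child-id-1 child-id-2 child-id-3

# Possible usage (needs AWS credentials and real child check IDs; not run here):
# aws route53 create-health-check \
#   --caller-reference "app-calculated-$(date +%s)" \
#   --health-check-config "$(calculated_check_config 2 ID1 ID2 ID3)"
```

A threshold equal to the number of children behaves like AND, and a threshold of 1 behaves like OR.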
Use a CloudWatch-alarm health check whenever the thing you care about is private or measured by a metric (CPU, queue depth, error rate) rather than by a public HTTP request. Route 53’s probes only come from the public internet, so they cannot reach a private subnet.
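A CloudWatch-alarm check is configured by pointing at an existing alarm instead of an endpoint. The sketch below prints a possible `HealthCheckConfig` for that case. The alarm name `rds-high-cpu`, the region, and the helper name `alarm_check_config` are assumptions for illustration; `InsufficientDataHealthStatus` is set to `LastKnownStatus` here as one reasonable choice, not a recommendation from this page.

```shell
# Print a HealthCheckConfig that follows an existing CloudWatch alarm.
alarm_check_config() {
  region="$1"       # region where the alarm lives
  alarm_name="$2"   # name of the existing CloudWatch alarm (hypothetical here)
  cat <<EOF
{
  "Type": "CLOUDWATCH_METRIC",
  "AlarmIdentifier": {"Region": "${region}", "Name": "${alarm_name}"},
  "InsufficientDataHealthStatus": "LastKnownStatus"
}
EOF
}

alarm_check_config us-east-1 rds-high-cpu

# Possible usage (needs AWS credentials and a real alarm; not run here):
# aws route53 create-health-check \
#   --caller-reference "db-alarm-$(date +%s)" \
#   --health-check-config "$(alarm_check_config us-east-1 rds-high-cpu)"
```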
Create an endpoint health check
This check probes https://app.example.com/health and marks the endpoint unhealthy after 3 consecutive failures.
Console steps:
- Open the Route 53 console and choose Health checks in the left menu.
- Click Create health check.
- Name: app-primary-health. For What to monitor, choose Endpoint.
- Specify endpoint by: Domain name. Enter app.example.com, protocol HTTPS, port 443, path /health.
- Under Advanced configuration, set Request interval to Standard (30 seconds) (or Fast (10 seconds) for quicker detection at higher cost), and Failure threshold to 3.
- (Optional) Tick Create alarm to get notified via Amazon SNS when it fails.
- Click Create health check.
CLI equivalent (AWS CLI v2):
aws route53 create-health-check \
--caller-reference app-primary-2026-06-15 \
--health-check-config '{
"Type": "HTTPS",
"FullyQualifiedDomainName": "app.example.com",
"Port": 443,
"ResourcePath": "/health",
"RequestInterval": 30,
"FailureThreshold": 3
}'
Output:
{
"HealthCheck": {
"Id": "abcd1234-5678-90ab-cdef-EXAMPLE1111",
"CallerReference": "app-primary-2026-06-15",
"HealthCheckConfig": {
"Type": "HTTPS",
"FullyQualifiedDomainName": "app.example.com",
"Port": 443,
"ResourcePath": "/health",
"RequestInterval": 30,
"FailureThreshold": 3
}
}
}
Note the Id (abcd1234-...) — you attach that to your DNS records in the next step.
Wire it into failover routing
Health checks pair naturally with the Failover routing policy. You create two records with the same name: a Primary and a Secondary. Route 53 returns the primary while its health check is healthy, and starts answering with the secondary as soon as that check is marked unhealthy.
Suppose app.example.com should point to your main load balancer, and fall back to a static “we’ll be right back” page on another endpoint.
Console steps:
- In Route 53, open your hosted zone and click Create record.
- Record name: app. Record type: A. Turn on Alias if pointing at an AWS load balancer.
- Routing policy: Failover. Set Failover record type to Primary.
- Attach the health check app-primary-health. Give it a Record ID like primary. Save.
- Create a second record, same name app, type A. Routing policy Failover, type Secondary, value = your backup endpoint, Record ID secondary. Save.
CLI equivalent — apply both records with a change batch:
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABCDEFGHIJ \
--change-batch '{
"Changes": [
{"Action": "UPSERT", "ResourceRecordSet": {
"Name": "app.example.com", "Type": "A", "TTL": 60,
"SetIdentifier": "primary",
"Failover": "PRIMARY",
"HealthCheckId": "abcd1234-5678-90ab-cdef-EXAMPLE1111",
"ResourceRecords": [{"Value": "203.0.113.10"}]
}},
{"Action": "UPSERT", "ResourceRecordSet": {
"Name": "app.example.com", "Type": "A", "TTL": 60,
"SetIdentifier": "secondary",
"Failover": "SECONDARY",
"ResourceRecords": [{"Value": "203.0.113.99"}]
}}
]
}'
Output:
{
"ChangeInfo": {
"Id": "/change/C2EXAMPLE9ABCD",
"Status": "PENDING",
"SubmittedAt": "2026-06-15T10:42:00.000Z"
}
}
The gotcha: failover is not instant
This is the part most tutorials skip. The time before clients actually move to the secondary endpoint is roughly:
failure detection time + record TTL
- Detection: With a 30-second interval and a failure threshold of 3, Route 53 needs about 90 seconds just to decide an endpoint is unhealthy. A 10-second interval cuts this to ~30 seconds.
- TTL: The record’s TTL (Time To Live — how long resolvers and clients are allowed to cache the answer) means even after Route 53 stops returning the bad IP, cached answers keep sending users to the dead endpoint until they expire. A 300-second TTL adds up to 5 minutes.
So a record with TTL 300 and a standard check can take up to about 6.5 minutes (90 + 300 = 390 seconds) to fail over end to end. Lowering the TTL (e.g. to 60) and using fast checks helps, but you can never fully trust caches: some resolvers ignore low TTLs, and browsers and HTTP clients keep using an already-resolved address for the life of a connection.
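The arithmetic is easy to script when comparing settings. This is a minimal sketch of the upper bound given by the formula above (detection time plus TTL); the function name `worst_case_failover_seconds` is made up, and the result ignores caches that disregard the TTL.

```shell
# Upper bound on failover time in seconds: interval * threshold + TTL.
worst_case_failover_seconds() {
  interval="$1"    # health check request interval, seconds (30 or 10)
  threshold="$2"   # consecutive failures before the check is unhealthy
  ttl="$3"         # record TTL, seconds
  echo $(( interval * threshold + ttl ))
}

worst_case_failover_seconds 30 3 300   # standard check, TTL 300: prints 390
worst_case_failover_seconds 10 3 60    # fast check, TTL 60: prints 90
```

Moving from the first configuration to the second cuts the bound from 390 seconds to 90, and most of that saving comes from the TTL, not from the faster check.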
If you need failover in seconds rather than minutes — say for a regional outage of a latency-sensitive API — DNS failover is the wrong tool. Use AWS Global Accelerator, which routes over a fixed pair of anycast IPs and shifts traffic away from an unhealthy region at the network layer, with no DNS cache to wait on.
Cost note: Basic endpoint health checks against AWS endpoints are about $0.50/month each; non-AWS endpoints are about $0.75/month. Optional features (string matching, HTTPS, fast 10-second interval) each add roughly $1-2/month per check. Calculated and CloudWatch-alarm checks are about $0.50-$1/month. This is trivial next to the cost of an outage, so monitor generously.
Best practices
- Probe a dedicated /health endpoint that checks real dependencies (database, cache), not just that the web server returns 200.
- Keep failover-record TTLs low (60 seconds or less) so caches clear quickly, but don’t expect them to honor it perfectly.
- Use string matching so the check passes only when the response body contains an expected marker; a “degraded” body then fails the check even when the HTTP status is 200.
- Use calculated checks to avoid flapping when only one of several healthy backends blips.
- Add a CloudWatch alarm + Amazon SNS notification to every critical check so a human hears about failover.
- For regional failover of latency-sensitive traffic in seconds rather than minutes, reach for Global Accelerator instead of DNS.
- Test failover deliberately (block the health-check path) before you rely on it in a real incident.
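As a sketch of the string-matching practice above, the helper below prints a `HealthCheckConfig` that uses the `HTTPS_STR_MATCH` type with a `SearchString`. It assumes your /health endpoint includes a marker such as `status-ok` in its body when everything is fine; that marker and the helper name `string_match_config` are hypothetical.

```shell
# Print a HealthCheckConfig that passes only if the body contains <search>.
string_match_config() {
  fqdn="$1"     # domain name to probe
  path="$2"     # health endpoint path
  search="$3"   # marker expected in a healthy response body (assumed)
  cat <<EOF
{
  "Type": "HTTPS_STR_MATCH",
  "FullyQualifiedDomainName": "${fqdn}",
  "Port": 443,
  "ResourcePath": "${path}",
  "SearchString": "${search}",
  "RequestInterval": 30,
  "FailureThreshold": 3
}
EOF
}

string_match_config app.example.com /health status-ok

# Possible usage (needs AWS credentials; not run here):
# aws route53 create-health-check \
#   --caller-reference "app-strmatch-$(date +%s)" \
#   --health-check-config "$(string_match_config app.example.com /health status-ok)"
```

Keep the marker simple and free of quotes, since this sketch does no JSON escaping.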