Load Balancer Health Checks
A load balancer (a service that spreads incoming traffic across several servers) is only useful if it sends traffic to servers that can actually handle it. A health check is the test the load balancer runs against each target on a schedule to decide if that target is “healthy” (ready for traffic) or “unhealthy” (broken, so stop sending it requests). When a server crashes, freezes, or restarts during a deploy, health checks are what quietly route around it so your users never notice. Getting these settings right is the difference between a self-healing system and one that flaps in and out at the worst possible moment.
How health checks work
Application and Network Load Balancers, the current types of Elastic Load Balancer (ELB, AWS’s managed load balancer), send traffic to a target group (a list of backends like EC2 instances or containers). The target group owns the health-check configuration. On a fixed interval, the load balancer opens a connection to each registered target and runs the check.
- If a target passes the check enough times in a row, it is marked healthy and starts receiving traffic.
- If it fails enough times in a row, it is marked unhealthy and the load balancer stops routing new requests to it. Existing connections are drained (allowed to finish) according to your deregistration delay.
This happens continuously and automatically. You do not restart anything. A target that recovers will pass its checks again and rejoin the rotation on its own.
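The consecutive pass/fail counting described above can be sketched as a small state machine. This is an illustrative model only, not AWS's implementation: the `TargetHealth` class, its field names, and the assumption that a new target starts out of rotation are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class TargetHealth:
    """Tracks one target's state from a stream of check results (hypothetical model)."""
    healthy_threshold: int = 3
    unhealthy_threshold: int = 3
    healthy: bool = False  # assumption: a new target starts out of rotation
    streak: int = 0        # consecutive results that contradict the current state

    def record(self, passed: bool) -> bool:
        """Feed in one check result and return the state after it."""
        if passed == self.healthy:
            # The result agrees with the current state, so the opposing streak resets.
            self.streak = 0
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy


target = TargetHealth()
results = [True, True, True, False, True, False, False, False, True, True, True]
states = [target.record(r) for r in results]
print(states)
```

In this run the single failure in fourth position does not remove the target, because the pass that follows resets the streak. Only the three failures in a row flip it to unhealthy, and three passes in a row bring it back.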
The load balancer never deletes an unhealthy target. It only stops sending it traffic. Recovery (replacing the instance) is the job of an Auto Scaling group health check, which is a separate setting — see the link at the bottom.
The settings you configure
| Setting | What it means | Typical value |
|---|---|---|
| Protocol | How to reach the target: HTTP, HTTPS, or TCP. | HTTP |
| Path | The URL the check requests, e.g. /healthz. | /healthz |
| Port | Which port to probe. traffic-port reuses the app port. | traffic-port |
| Healthy threshold | Consecutive passes before a target is “healthy”. | 3 |
| Unhealthy threshold | Consecutive fails before a target is “unhealthy”. | 3 |
| Interval | Seconds between checks. | 30 |
| Timeout | Seconds to wait for a response before counting a fail. | 5 |
| Success codes | HTTP status codes that count as a pass, e.g. 200 or 200-299. | 200 |
The math matters. With an interval of 30 and an unhealthy threshold of 3, a dead target takes up to 90 seconds to be pulled out. Lower the interval and threshold to react faster — but read the gotcha below first.
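The arithmetic is simple enough to check for any combination of settings. The helper below is a rough sketch with a made-up name (`seconds_to_state_change`); it uses the same approximation as the text and ignores the extra seconds a timed-out check can add.

```python
def seconds_to_state_change(interval_s: int, threshold: int) -> int:
    """Approximate upper bound: consecutive results needed x seconds between checks."""
    return interval_s * threshold


# Settings from the table: 30-second interval, threshold of 3.
print(seconds_to_state_change(30, 3))  # 90

# Faster, but still requiring three consecutive failures.
print(seconds_to_state_change(10, 3))  # 30
```

The same formula applies in the other direction: a recovered target needs the healthy threshold times the interval before it rejoins the rotation.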
The two biggest gotchas
Aggressive thresholds cause flapping
It is tempting to set a 5-second interval and an unhealthy threshold of 2 so failures are caught fast. But under heavy load, a healthy server might occasionally answer slowly (a brief garbage-collection pause, a CPU spike). With tight settings, two slow responses in a row, roughly ten seconds of sluggishness, are enough to mark the target unhealthy and remove it. Its share of load is then dumped onto the remaining servers, which slow down too. The target then passes again, rejoins, and the cycle repeats. This is called flapping, and it can take down a healthy fleet.
Keep the timeout comfortably larger than your normal response time, and require at least 3 consecutive failures before pulling a target.
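To see why the tighter settings bite, here is a small simulation sketch. The response times are invented for illustration, and the `count_removals` helper is hypothetical; it models only the removal side of the cycle and treats the target as back in rotation right after each removal.

```python
def count_removals(response_times_s, timeout_s, unhealthy_threshold):
    """Count how often a working target would be pulled for answering slowly."""
    removals = 0
    consecutive_fails = 0
    for rt in response_times_s:
        if rt > timeout_s:
            consecutive_fails += 1   # a check that exceeds the timeout counts as a fail
        else:
            consecutive_fails = 0    # any pass resets the streak
        if consecutive_fails == unhealthy_threshold:
            removals += 1
            consecutive_fails = 0    # simplification: assume it rejoins immediately
    return removals


# Hypothetical response times (seconds) from a busy but working server,
# with two brief stalls.
samples = [0.2, 0.3, 2.5, 2.8, 0.2, 0.3, 0.2, 3.1, 2.6, 0.4, 0.2]

tight = count_removals(samples, timeout_s=2, unhealthy_threshold=2)
relaxed = count_removals(samples, timeout_s=5, unhealthy_threshold=3)
print(tight, relaxed)  # 2 0
```

With the tight settings the server is removed twice even though it never stopped working. With a 5-second timeout and three required failures it stays in rotation the whole time.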
A health check that touches the database can cascade
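If your /healthz endpoint runs a database query, then a brief database hiccup makes the health check fail on every target at once. The load balancer marks them all unhealthy, and a small DB blip becomes a fleet-wide incident even though your app servers were fine. Application and Network Load Balancers fail open when every target in a group is unhealthy and keep routing to all of them, but you have lost health-based routing exactly when you need it, and anything else that trusts these checks (an Auto Scaling group using ELB health checks, for example) may start replacing instances that were never broken.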
Use a lightweight, dependency-free health endpoint: a /healthz route that simply returns 200 OK without calling the database, cache, or any downstream service. It answers one question only: “is this process alive and serving?” Use separate, deeper “readiness” probes for dependency monitoring and alerting, not for load-balancer routing.
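As a rough illustration of how little such an endpoint needs to do, here is a minimal sketch using only Python's standard library. The `HealthHandler` and `make_server` names and the port 8080 default are assumptions for the example; in a real app you would add the same route in whatever web framework you already use.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers /healthz with 200 OK and touches no database, cache, or other service."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep frequent health checks out of the request log


def make_server(port: int = 8080, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Starting it is one line, `make_server().serve_forever()`. The handler does no I/O beyond writing the response, so it keeps passing while a dependency is having a bad minute.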
Configure it in the console
- Open the EC2 console and choose Target Groups in the left menu.
- Select your target group, e.g. tg-web-app.
- Open the Health checks tab and click Edit.
- Set Protocol to HTTP and Path to /healthz.
- Expand Advanced health check settings.
- Set Port to traffic-port, Healthy threshold to 3, Unhealthy threshold to 3, Timeout to 5, Interval to 30, and Success codes to 200.
- Click Save changes. New settings apply within a few seconds.
Configure it with the AWS CLI
This updates an existing target group’s health check (AWS CLI v2).
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg-web-app/0a1b2c3d4e5f6a7b \
  --health-check-protocol HTTP \
  --health-check-path /healthz \
  --health-check-port traffic-port \
  --healthy-threshold-count 3 \
  --unhealthy-threshold-count 3 \
  --health-check-timeout-seconds 5 \
  --health-check-interval-seconds 30 \
  --matcher HttpCode=200
Output:
{
  "TargetGroups": [
    {
      "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg-web-app/0a1b2c3d4e5f6a7b",
      "TargetGroupName": "tg-web-app",
      "Protocol": "HTTP",
      "Port": 8080,
      "VpcId": "vpc-0a1b2c3d",
      "HealthCheckProtocol": "HTTP",
      "HealthCheckPath": "/healthz",
      "HealthCheckIntervalSeconds": 30,
      "HealthCheckTimeoutSeconds": 5,
      "HealthyThresholdCount": 3,
      "UnhealthyThresholdCount": 3,
      "Matcher": { "HttpCode": "200" }
    }
  ]
}
To see the current health of every target, ask the target group directly:
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg-web-app/0a1b2c3d4e5f6a7b
Output:
{
  "TargetHealthDescriptions": [
    {
      "Target": { "Id": "i-0a1b2c3d4e5f", "Port": 8080 },
      "TargetHealth": { "State": "healthy" }
    },
    {
      "Target": { "Id": "i-0f9e8d7c6b5a", "Port": 8080 },
      "TargetHealth": {
        "State": "unhealthy",
        "Reason": "Target.ResponseCodeMismatch",
        "Description": "Health checks failed with these codes: [500]"
      }
    }
  ]
}
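If you check this output often, a few lines of scripting can pull out just the targets that need attention. The sketch below parses the JSON shown above with Python's standard library. The `unhealthy_targets` helper is a made-up name, and the sample string simply repeats the example output rather than calling AWS.

```python
import json


def unhealthy_targets(describe_output: str):
    """Return (target id, reason, description) for every target whose state is not 'healthy'."""
    data = json.loads(describe_output)
    rows = []
    for item in data["TargetHealthDescriptions"]:
        health = item["TargetHealth"]
        if health["State"] != "healthy":
            # Reason and Description are absent for healthy targets, so default to "".
            rows.append((
                item["Target"]["Id"],
                health.get("Reason", ""),
                health.get("Description", ""),
            ))
    return rows


# The example output from above, pasted in as a string.
sample = """
{
  "TargetHealthDescriptions": [
    {"Target": {"Id": "i-0a1b2c3d4e5f", "Port": 8080},
     "TargetHealth": {"State": "healthy"}},
    {"Target": {"Id": "i-0f9e8d7c6b5a", "Port": 8080},
     "TargetHealth": {"State": "unhealthy",
                      "Reason": "Target.ResponseCodeMismatch",
                      "Description": "Health checks failed with these codes: [500]"}}
  ]
}
"""

for target_id, reason, description in unhealthy_targets(sample):
    print(target_id, reason, description)
```

In practice you would feed it the real command's output, for example by saving the output to a file and reading that file in.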
Define it as infrastructure as code
Putting the health check in CloudFormation means every environment gets identical, reviewed settings.
WebTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    Name: tg-web-app
    VpcId: vpc-0a1b2c3d
    Protocol: HTTP
    Port: 8080
    TargetType: instance
    HealthCheckProtocol: HTTP
    HealthCheckPath: /healthz
    HealthCheckPort: traffic-port
    HealthCheckIntervalSeconds: 30
    HealthCheckTimeoutSeconds: 5
    HealthyThresholdCount: 3
    UnhealthyThresholdCount: 3
    Matcher:
      HttpCode: '200'
When to use which protocol
- HTTP / HTTPS: use for Application Load Balancers and any web app. You get to check a real route and status code, so the check confirms the app is serving, not just that the port is open. Use HTTPS only when you need to verify TLS termination on the target; otherwise HTTP is simpler and cheaper in CPU.
- TCP: use for Network Load Balancers fronting non-HTTP services (a database proxy, a game server). It only confirms the port accepts a connection, which is shallower but the only option when there is no HTTP layer.
There is no extra charge for health checks themselves; they are part of the load balancer you already pay for. The only indirect cost is request volume: a check every 30 seconds across many targets is negligible, but a 5-second interval (the shortest the target group settings accept) across hundreds of targets adds measurable request volume to your app.
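A back-of-the-envelope estimate makes the difference concrete. The function below is a rough sketch with a hypothetical name. The `lb_nodes` parameter is an assumption: a load balancer generally runs a node in each enabled Availability Zone and each node probes targets on its own, so treat the multiplier as approximate.

```python
def health_check_requests_per_second(targets: int, interval_s: float, lb_nodes: int = 1) -> float:
    """Estimate the fleet-wide health-check request rate (a rough approximation)."""
    return targets * lb_nodes / interval_s


# A small fleet on relaxed settings: well under one request per second in total.
print(round(health_check_requests_per_second(20, 30), 2))   # 0.67

# Hundreds of targets on a short interval, probed from three nodes.
print(health_check_requests_per_second(300, 5, lb_nodes=3))  # 180.0
```

The second case is still cheap per target, but it is steady background load worth knowing about when you read request metrics.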
Best Practices
- Point checks at a dedicated /healthz route that returns 200 and touches no database, cache, or external service.
- Keep the timeout well above your typical response time, and require at least 3 consecutive failures before removing a target to avoid flapping.
- Use HTTP checks with a specific success code (200) rather than TCP whenever an HTTP layer exists; it catches a far wider range of failures.
- Match the health-check port to traffic-port so the check exercises the same path real traffic takes.
- Keep a separate, deeper “readiness” probe for alerting on dependencies; do not wire dependency failures into the load-balancer check.
- Watch the UnHealthyHostCount metric in CloudWatch and alarm when it rises, so you learn about removals before users do.