Load Balancer Health Checks

A load balancer (a service that spreads incoming traffic across several servers) is only useful if it sends traffic to servers that can actually handle it. A health check is the test the load balancer runs against each target on a schedule to decide if that target is “healthy” (ready for traffic) or “unhealthy” (broken, so stop sending it requests). When a server crashes, freezes, or restarts during a deploy, health checks are what quietly route around it so your users never notice. Getting these settings right is the difference between a self-healing system and one that flaps in and out at the worst possible moment.

How health checks work

Current-generation Elastic Load Balancing (ELB, AWS’s managed load-balancing service) load balancers, meaning Application, Network, and Gateway Load Balancers, route traffic to a target group (a list of backends like EC2 instances or containers). The target group owns the health-check configuration. On a fixed interval, the load balancer sends a check to each registered target.

If a target passes the check enough times in a row, it is marked healthy and starts receiving traffic.
If it fails enough times in a row, it is marked unhealthy and the load balancer stops routing new requests to it. Note that the deregistration delay, which lets in-flight requests finish, governs targets that are being deregistered; it is not what controls a target that merely fails its checks.

This happens continuously and automatically. You do not restart anything. A target that recovers will pass its checks again and rejoin the rotation on its own.
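The consecutive-count rule above can be sketched as a small state machine. This is a simplified model for illustration only, not AWS's implementation: the `TargetHealthTracker` class, the `run` helper, and the assumption that a target starts out healthy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TargetHealthTracker:
    """Simplified model of consecutive-count health checking (hypothetical)."""
    healthy_threshold: int = 3
    unhealthy_threshold: int = 3
    state: str = "healthy"  # assumption: the target starts in rotation
    streak: int = 0         # consecutive results that contradict the current state

    def record(self, passed: bool) -> str:
        """Feed one check result and return the state after it."""
        agrees = passed == (self.state == "healthy")
        if agrees:
            self.streak = 0  # one agreeing result resets the opposing streak
            return self.state
        self.streak += 1
        if self.state == "healthy":
            needed = self.unhealthy_threshold
        else:
            needed = self.healthy_threshold
        if self.streak >= needed:
            self.state = "unhealthy" if self.state == "healthy" else "healthy"
            self.streak = 0
        return self.state


def run(results, **thresholds):
    """Replay a list of check results (True = pass) and return each state."""
    tracker = TargetHealthTracker(**thresholds)
    return [tracker.record(r) for r in results]


# Two failures, a pass, three failures, then three passes.
checks = [True, False, False, True, False, False, False, True, True, True]
states = run(checks)
for passed, state in zip(checks, states):
    print("pass" if passed else "fail", "->", state)
```

With thresholds of 3, the first two failures do not remove the target because the pass that follows resets the count. Only the run of three failures does, and three passes in a row bring it back.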

The load balancer never deletes an unhealthy target. It only stops sending it traffic. Recovery (replacing the instance) is the job of an Auto Scaling group health check, which is a separate setting.

The settings you configure

Setting	What it means	Typical value
Protocol	How to reach the target: `HTTP`, `HTTPS`, or `TCP`.	`HTTP`
Path	The URL the check requests, e.g. `/healthz`.	`/healthz`
Port	Which port to probe. `traffic-port` reuses the app port.	`traffic-port`
Healthy threshold	Consecutive passes before a target is “healthy”.	`3`
Unhealthy threshold	Consecutive fails before a target is “unhealthy”.	`3`
Interval	Seconds between checks.	`30`
Timeout	Seconds to wait for a response before counting a fail.	`5`
Success codes	HTTP status codes that count as a pass, e.g. `200` or `200-299`.	`200`

The math matters. With an interval of 30 and an unhealthy threshold of 3, a dead target takes up to 90 seconds (3 checks × 30 seconds) to be pulled out. Lower the interval and threshold to react faster, but read the first gotcha below before you do.
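As a rough worked example of that arithmetic, the sketch below multiplies the interval by the unhealthy threshold for a few settings. The function name `removal_window_seconds` is made up for illustration, and the result is the same approximation used above; it ignores the extra time a check may spend waiting for its timeout.

```python
def removal_window_seconds(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate upper bound on the time to mark a dead target unhealthy."""
    return interval_s * unhealthy_threshold


# (interval in seconds, unhealthy threshold)
for interval_s, threshold in [(30, 3), (10, 3), (5, 2)]:
    window = removal_window_seconds(interval_s, threshold)
    print(f"interval={interval_s}s threshold={threshold} -> up to ~{window}s")
```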

The two biggest gotchas

Aggressive thresholds cause flapping

It is tempting to set a 5-second interval and an unhealthy threshold of 2 so failures are caught fast. But under heavy load, a healthy server will occasionally answer a check slowly (a brief garbage-collection pause, a CPU spike). With tight settings, two slow responses in a row are enough to mark the target unhealthy and remove it. Its share of load lands on the remaining servers, which then slow down and start failing checks too. The removed target passes again, rejoins, and the cycle repeats. This is called flapping, and it can take down a healthy fleet.

Keep the timeout comfortably larger than your normal response time, and require at least 3 consecutive failures before pulling a target.

A health check that touches the database can cascade

If your /healthz endpoint runs a database query, then a brief database hiccup makes the health check fail on every target at once. The load balancer marks them all unhealthy and stops routing traffic entirely — a small DB blip becomes a full outage, even though your app servers were fine.

Use a lightweight, dependency-free health endpoint: a /healthz route that simply returns 200 OK without calling the database, cache, or any downstream service. It answers one question only: “is this process alive and serving?” Use separate, deeper “readiness” probes for dependency monitoring and alerting, not for load-balancer routing.
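A minimal sketch of such an endpoint, using only Python's standard library, might look like the following. The `HealthHandler` and `make_server` names, the `ok` response body, and the default port 8080 are assumptions for illustration; a real application would normally add the route to its existing web framework instead.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers /healthz with 200 and touches no database, cache, or service."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Health checks arrive constantly; keep them out of the request log.
        pass


def make_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Build (but do not start) a server exposing the health route."""
    return ThreadingHTTPServer((host, port), HealthHandler)


# To run it: make_server().serve_forever()
```

The handler does no I/O beyond writing the response, so a slow database or cache cannot make it fail.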

Configure it in the console

Open the EC2 console and choose Target Groups in the left menu.
Select your target group, e.g. tg-web-app.
Open the Health checks tab and click Edit.
Set Protocol to HTTP and Path to /healthz.
Expand Advanced health check settings.
Set Port to traffic-port, Healthy threshold 3, Unhealthy threshold 3, Timeout 5, Interval 30, and Success codes 200.
Click Save changes. New settings apply within a few seconds.

Configure it with the AWS CLI

This updates an existing target group’s health check (AWS CLI v2).

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg-web-app/0a1b2c3d4e5f6a7b \
  --health-check-protocol HTTP \
  --health-check-path /healthz \
  --health-check-port traffic-port \
  --healthy-threshold-count 3 \
  --unhealthy-threshold-count 3 \
  --health-check-timeout-seconds 5 \
  --health-check-interval-seconds 30 \
  --matcher HttpCode=200

Output:

{
    "TargetGroups": [
        {
            "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg-web-app/0a1b2c3d4e5f6a7b",
            "TargetGroupName": "tg-web-app",
            "Protocol": "HTTP",
            "Port": 8080,
            "VpcId": "vpc-0a1b2c3d",
            "HealthCheckProtocol": "HTTP",
            "HealthCheckPath": "/healthz",
            "HealthCheckIntervalSeconds": 30,
            "HealthCheckTimeoutSeconds": 5,
            "HealthyThresholdCount": 3,
            "UnhealthyThresholdCount": 3,
            "Matcher": { "HttpCode": "200" }
        }
    ]
}

To see the current health of every target, ask the target group directly:

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/tg-web-app/0a1b2c3d4e5f6a7b

Output:

{
    "TargetHealthDescriptions": [
        {
            "Target": { "Id": "i-0a1b2c3d4e5f", "Port": 8080 },
            "TargetHealth": { "State": "healthy" }
        },
        {
            "Target": { "Id": "i-0f9e8d7c6b5a", "Port": 8080 },
            "TargetHealth": {
                "State": "unhealthy",
                "Reason": "Target.ResponseCodeMismatch",
                "Description": "Health checks failed with these codes: [500]"
            }
        }
    ]
}
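For scripting, output in this shape can be filtered down to the targets that need attention. The sketch below parses a copy of the sample output above with the standard `json` module; the `unhealthy_targets` helper is hypothetical, and it assumes only the fields shown in that sample.

```python
import json


def unhealthy_targets(describe_output: str):
    """Return (target id, state, reason) for every target that is not healthy."""
    data = json.loads(describe_output)
    rows = []
    for item in data["TargetHealthDescriptions"]:
        health = item["TargetHealth"]
        if health["State"] != "healthy":
            # Reason is absent for healthy targets, so read it defensively.
            rows.append((item["Target"]["Id"], health["State"], health.get("Reason", "")))
    return rows


# A copy of the sample describe-target-health output shown above.
sample = """
{
    "TargetHealthDescriptions": [
        {
            "Target": { "Id": "i-0a1b2c3d4e5f", "Port": 8080 },
            "TargetHealth": { "State": "healthy" }
        },
        {
            "Target": { "Id": "i-0f9e8d7c6b5a", "Port": 8080 },
            "TargetHealth": {
                "State": "unhealthy",
                "Reason": "Target.ResponseCodeMismatch",
                "Description": "Health checks failed with these codes: [500]"
            }
        }
    ]
}
"""

for target_id, state, reason in unhealthy_targets(sample):
    print(target_id, state, reason)
```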

Define it as infrastructure as code

Putting the health check in CloudFormation means every environment gets identical, reviewed settings.

WebTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    Name: tg-web-app
    VpcId: vpc-0a1b2c3d
    Protocol: HTTP
    Port: 8080
    TargetType: instance
    HealthCheckProtocol: HTTP
    HealthCheckPath: /healthz
    HealthCheckPort: traffic-port
    HealthCheckIntervalSeconds: 30
    HealthCheckTimeoutSeconds: 5
    HealthyThresholdCount: 3
    UnhealthyThresholdCount: 3
    Matcher:
      HttpCode: '200'

When to use which protocol

HTTP / HTTPS — use for Application Load Balancers and any web app. You get to check a real route and status code, so the check confirms the app is serving, not just that the port is open. Use HTTPS only when you need to verify TLS termination on the target; otherwise HTTP is simpler and cheaper in CPU.
TCP — use for Network Load Balancers fronting non-HTTP services (a database proxy, a game server). It only confirms the port accepts a connection, which is shallower but the only option when there is no HTTP layer.

There is no extra charge for health checks themselves; they are part of the load balancer you already pay for. The only indirect cost is request load on your targets: a check every 30 seconds across many targets is negligible, but the minimum 5-second interval across hundreds of targets adds measurable request volume to your app.
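To put rough numbers on that request volume, the sketch below estimates fleet-wide health-check requests per second. The `fleet_checks_per_second` function and its `lb_nodes` parameter are illustrative assumptions: the parameter models each load balancer node probing every target independently, and the default of 1 gives a lower bound.

```python
def fleet_checks_per_second(targets: int, interval_s: int, lb_nodes: int = 1) -> float:
    """Estimated health-check requests per second across the whole fleet."""
    return targets * lb_nodes / interval_s


# 300 targets at a relaxed interval and at a tight one.
for interval_s in (30, 5):
    rate = fleet_checks_per_second(targets=300, interval_s=interval_s)
    print(f"interval={interval_s}s -> {rate:.0f} checks/second")
```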

Best Practices

Point checks at a dedicated /healthz route that returns 200 and touches no database, cache, or external service.
Keep the timeout well above your typical response time, and require at least 3 consecutive failures before removing a target to avoid flapping.
Use HTTP checks with a specific success code (200) rather than TCP whenever an HTTP layer exists — it catches a far wider range of failures.
Match the health-check port to traffic-port so the check exercises the same path real traffic takes.
Keep a separate, deeper “readiness” probe for alerting on dependencies — do not wire dependency failures into the load-balancer check.
Watch the UnHealthyHostCount metric in CloudWatch and alarm when it rises, so you learn about removals before users do.