ASG Health Checks & Instance Replacement

An Auto Scaling group (ASG) is a group of EC2 virtual machines that AWS keeps at a target size for you, and it is only as reliable as its health checks. A health check is a test that decides whether an instance is “healthy” or “unhealthy.” When an instance is marked unhealthy, the ASG terminates it and launches a fresh replacement, so your fleet self-heals without you waking up at 3 a.m. This page explains the two health-check types, the one default that quietly breaks self-healing, and how to fix it.

Why health checks matter

The whole promise of an ASG is “set a desired count and walk away.” But the ASG can only replace an instance if something tells it the instance is broken. That “something” is the health check. Pick the wrong type and a crashed app can sit there serving errors forever while the ASG happily reports everything is fine, because from its point of view the virtual machine is still running.

EC2 health checks vs ELB health checks

There are two sources of health information an ASG can listen to.

EC2 health checks use EC2 status checks. EC2 (Elastic Compute Cloud, AWS’s virtual machine service) runs two automatic checks on every instance: a system status check (the underlying AWS hardware and network) and an instance status check (the VM’s OS reachability). If either fails, the instance is impaired and the ASG replaces it. This catches dead hosts, but it knows nothing about your application.

ELB health checks use the target health reported by an Elastic Load Balancer (ELB — the service that spreads traffic across instances). The load balancer probes a target group (the set of instances behind it) on a port and path you choose, for example GET /healthz. If the app stops answering with a healthy HTTP status, the target goes unhealthy, and when ELB health checks are enabled the ASG treats that as a reason to replace the instance.

Aspect	EC2 health check	ELB health check
What it tests	VM/OS reachability (status checks)	Your app responds correctly (HTTP/TCP probe)
Detects a crashed app on a healthy VM?	No	Yes
Requires a load balancer?	No	Yes (instances attached to a target group)
Default on a new ASG?	Yes	No
Best for	Standalone instances, batch workers	Any web app or API behind an ALB/NLB
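
To make the ELB probe described above concrete, here is a minimal sketch of a /healthz endpoint using only Python's standard library. The names app_can_serve, health_response, and main, the port 8080, and the choice of 503 for an unready app are illustrative assumptions, not anything the load balancer requires.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def app_can_serve() -> bool:
    """Hypothetical readiness check. Replace with a cheap, real probe,
    for example a quick query against a key dependency."""
    return True


def health_response(ready: bool) -> tuple[int, bytes]:
    # 200 lets the target pass the probe; 503 makes the probe fail.
    return (200, b"ok\n") if ready else (503, b"unavailable\n")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Only the health path is handled in this sketch.
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_response(app_can_serve())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main() -> None:
    # Port 8080 is an assumption; it must match the target group's health check port.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Calling main() starts the server. In a real service the same check would normally live inside the application process itself, so that a crashed or hung app stops answering the probe.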

Gotcha: By default an ASG uses ONLY EC2 health checks. If your application process crashes (502s, hung server) but the VM keeps running, the EC2 status checks still pass, so the ASG never replaces the instance. Your users get errors and nothing self-heals. You must explicitly turn on ELB health checks.

When to use each

EC2 only — fine for workloads with no load balancer: a queue worker, a cron-style batch instance, or a single standalone VM where “the OS is up” is a good enough definition of healthy.
ELB (plus EC2) — the right choice for anything serving traffic. Enabling ELB health checks does NOT disable EC2 checks; the ASG now replaces an instance if either the VM fails or the load balancer marks the target unhealthy. This is what makes web apps actually heal.
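
The “either check can trigger a replacement” rule above can be written as a tiny decision function. This is a simplified model for reasoning, not AWS's implementation; the function and parameter names are made up for illustration, and it ignores the grace period and instance lifecycle states.

```python
def should_replace(vm_ok: bool, target_healthy: bool, elb_checks_enabled: bool) -> bool:
    """Simplified model of when an ASG replaces an instance."""
    # EC2 status checks always count, whatever the health check type.
    if not vm_ok:
        return True
    # Target health only counts once ELB health checks are turned on.
    return elb_checks_enabled and not target_healthy


# Crashed app on a healthy VM: the default (EC2 only) never replaces it.
print(should_replace(vm_ok=True, target_healthy=False, elb_checks_enabled=False))  # False
# Same failure with ELB health checks enabled: the instance is replaced.
print(should_replace(vm_ok=True, target_healthy=False, elb_checks_enabled=True))   # True
```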

The grace period (don’t skip this)

The health check grace period is the number of seconds the ASG ignores health checks after an instance launches. New instances need time to boot the OS, start the app, and warm up before they can pass a /healthz probe. If the grace period is too short, the ASG kills a brand-new instance before it finishes booting, launches another, and loops forever (“flapping”).

Set the grace period a bit longer than your real boot-to-ready time. A simple Node or Spring Boot app might need 60-120 seconds; an instance that runs database migrations on start might need 300.

Tip: If you see instances being terminated seconds after launch and replaced endlessly, your grace period is almost always too short, or your health-check path returns a non-2xx before the app is ready.
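
One way to pick a grace period is to measure boot-to-ready time on a few launches and add headroom. The sketch below assumes a 25% margin and rounding up to the next 10 seconds; both are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math


def grace_period_seconds(boot_to_ready_samples, margin=1.25, round_to=10):
    """Suggest a grace period from measured boot-to-ready times (seconds)."""
    slowest = max(boot_to_ready_samples)
    # Add headroom over the slowest observed boot, then round up.
    return math.ceil(slowest * margin / round_to) * round_to


def risks_flapping(boot_to_ready: float, grace_period: float) -> bool:
    # Simplified rule: an instance that is not ready when the grace period
    # ends can be judged unhealthy and replaced before it ever serves.
    return boot_to_ready > grace_period


print(grace_period_seconds([70, 85, 96]))  # 120
print(grace_period_seconds([240]))         # 300
print(risks_flapping(boot_to_ready=96, grace_period=60))  # True
```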

How to enable ELB health checks

AWS Management Console

Open the EC2 console and choose Auto Scaling Groups in the left menu.
Select your ASG and open the Details tab.
Find the Health checks card and choose Edit.
Under Additional health check types, tick Turn on Elastic Load Balancing health checks.
Set Health check grace period to a value longer than your boot time, for example 120 seconds.
Choose Update.

AWS CLI

This turns on both EC2 and ELB health checks and sets a 120-second grace period:

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --health-check-type ELB \
  --health-check-grace-period 120

Setting --health-check-type ELB keeps EC2 checks active too; ELB is added on top and does not replace them. Confirm the setting:

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names web-asg \
  --query "AutoScalingGroups[0].{Type:HealthCheckType,Grace:HealthCheckGracePeriod}"

Output:

{
    "Type": "ELB",
    "Grace": 120
}

You can watch the ASG decide an instance is unhealthy and replace it:

aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name web-asg \
  --max-items 1

Output:

{
    "Activities": [
        {
            "ActivityId": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
            "AutoScalingGroupName": "web-asg",
            "Description": "Terminating EC2 instance: i-0a1b2c3d4e5f",
            "Cause": "An instance was taken out of service in response to an ELB system health check failure.",
            "StatusCode": "InProgress",
            "Progress": 50
        }
    ]
}

The Cause line confirms the ELB health check, not the EC2 check, triggered the replacement.
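
If you save that JSON, a short script can pick out the replacements that the ELB health check triggered. This sketch matches on the phrase from the sample Cause above; treating that wording as stable is an assumption, and the function name is hypothetical.

```python
import json

# Phrase taken from the sample output above; assumed to appear in the Cause text.
ELB_FAILURE_MARKER = "ELB system health check failure"


def elb_triggered_activities(activities_json: str) -> list[str]:
    """Return the ActivityId of each activity whose Cause mentions an ELB failure."""
    activities = json.loads(activities_json).get("Activities", [])
    return [
        activity.get("ActivityId", "")
        for activity in activities
        if ELB_FAILURE_MARKER in activity.get("Cause", "")
    ]


sample = """
{
    "Activities": [
        {
            "ActivityId": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
            "AutoScalingGroupName": "web-asg",
            "Description": "Terminating EC2 instance: i-0a1b2c3d4e5f",
            "Cause": "An instance was taken out of service in response to an ELB system health check failure.",
            "StatusCode": "InProgress",
            "Progress": 50
        }
    ]
}
"""

print(elb_triggered_activities(sample))  # ['a1b2c3d4-5678-90ab-cdef-EXAMPLE11111']
```

Counting the returned IDs over time gives a rough replacement rate for spotting flapping.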

Infrastructure as Code

CloudFormation makes the setting permanent and reviewable:

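WebAsg:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    AutoScalingGroupName: web-asg
    # The group needs a launch template (or equivalent). WebLaunchTemplate is a
    # placeholder for a resource assumed to be defined elsewhere in the template.
    LaunchTemplate:
      LaunchTemplateId: !Ref WebLaunchTemplate
      Version: !GetAtt WebLaunchTemplate.LatestVersionNumber
    MinSize: "2"
    MaxSize: "6"
    DesiredCapacity: "2"
    HealthCheckType: ELB          # EC2 checks stay active as well
    HealthCheckGracePeriod: 120   # seconds; set above your boot-to-ready time
    TargetGroupARNs:
      - !Ref WebTargetGroup       # target group assumed to be defined elsewhere
    VPCZoneIdentifier:
      - subnet-0a1b2c3d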

Tuning the underlying target-group probe

The ASG trusts whatever the target group reports, so the real sensitivity lives on the target group’s own health check (path, interval, healthy/unhealthy thresholds). A target group with HealthCheckIntervalSeconds: 30 and an unhealthy threshold of 2 means a crash can take roughly 60 seconds (2 failed probes at 30-second intervals) to be detected before the ASG even acts. Make that path cheap and honest: it should check that the app can actually serve, not just return 200 from a static file.
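
The detection-time arithmetic above is simple enough to script when comparing settings. This is a rough estimate only: the function name is hypothetical, and it ignores probe timeouts and the time the ASG takes to react once the target is unhealthy.

```python
def worst_case_detection_seconds(interval_seconds: int, unhealthy_threshold: int) -> int:
    """Approximate time for the target group to mark a crashed app unhealthy."""
    # The target must fail this many consecutive probes, one per interval.
    return interval_seconds * unhealthy_threshold


print(worst_case_detection_seconds(30, 2))  # 60
print(worst_case_detection_seconds(10, 3))  # 30
```

A shorter interval detects crashes sooner, while a higher threshold tolerates brief blips, so the two settings trade off against each other.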

Cost note

Health checks themselves are free. The cost is replacement churn: every replaced instance is a fresh EC2 launch, and you pay for both the dying and the new instance during the brief overlap. Where per-second billing applies, each launch still carries a 60-second minimum charge. Flapping caused by a too-short grace period can quietly multiply your instance-hours, so getting the grace period right is a cost control as well as a reliability one.

Best practices

Always enable ELB health checks for any ASG behind a load balancer — never rely on EC2 checks alone for app health.
Set the grace period comfortably above your real boot-to-ready time to stop newly launched instances from being killed prematurely.
Build a dedicated lightweight /healthz endpoint that verifies the app can serve requests (and key dependencies), not just that the web server is up.
Keep at least MinSize: 2 across multiple Availability Zones so a replacement never drops you to zero capacity.
Tune the target group’s interval and unhealthy threshold to balance fast detection against false positives from brief blips.
Use describe-scaling-activities and CloudWatch alarms to monitor replacement rate; a sudden spike means flapping or a bad deploy.