Monitoring EC2 with CloudWatch

When you run a server, you constantly need to answer two questions: is it healthy, and is it busy? Amazon CloudWatch (AWS’s built-in monitoring service) is how you answer them for EC2 (Elastic Compute Cloud, AWS’s virtual machine service). Every EC2 instance sends a set of metrics to CloudWatch automatically, and you can build alarms on top of them to get paged when something goes wrong. This page explains which metrics you get for free, which ones you don’t (the surprising part), and how to wire up an alarm.

What CloudWatch gives you for free

A metric is just a number measured over time — for example, CPU usage every minute. The moment you launch an EC2 instance, the hypervisor (the host software that runs your virtual machine) starts publishing metrics about that instance to CloudWatch with zero setup. These live in the AWS/EC2 namespace (a namespace is just a folder that groups related metrics).

The default metrics fall into a few groups:

Metric	What it measures	Notes
`CPUUtilization`	Percent of CPU in use	The single most-watched metric
`NetworkIn` / `NetworkOut`	Bytes received / sent	Useful for spotting traffic spikes
`NetworkPacketsIn` / `NetworkPacketsOut`	Packet counts	Basic monitoring only (5-minute periods)
`DiskReadBytes` / `DiskWriteBytes`	I/O on instance store disks	NOT your EBS volume — see below
`DiskReadOps` / `DiskWriteOps`	I/O operations on instance store	Same caveat
`StatusCheckFailed_System`	Is the AWS host healthy?	Network, power, hardware
`StatusCheckFailed_Instance`	Is your OS healthy?	Boot, network config, exhausted memory
`StatusCheckFailed`	Either of the above failed	Good single alarm target

Heads up on disk metrics: the default DiskReadBytes/DiskWriteBytes metrics only cover instance store volumes (temporary local disks). For an EBS (Elastic Block Store, AWS’s network-attached disk) volume, the disk metrics live in the AWS/EBS namespace and are keyed by the volume ID, not the instance ID. People debugging slow disks often watch the wrong graph.
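As a rough sketch of where to look instead, the function below queries the AWS/EBS namespace by volume ID. The function name ebs_read_bytes, the placeholder volume ID, and the one-hour window are illustrative assumptions, and the date arithmetic assumes GNU date (the default on Amazon Linux and Ubuntu).

```shell
# Hypothetical helper: bytes read from one EBS volume over the last hour,
# summed in 5-minute buckets. Note the AWS/EBS namespace and VolumeId dimension.
ebs_read_bytes() {
  local volume_id="$1"
  local end start
  end="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  start="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)"   # GNU date syntax
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS \
    --metric-name VolumeReadBytes \
    --dimensions "Name=VolumeId,Value=${volume_id}" \
    --start-time "$start" --end-time "$end" \
    --period 300 \
    --statistics Sum
}

# Example call (placeholder volume ID):
# ebs_read_bytes vol-0123456789abcdef0
```

Swapping the metric name for VolumeWriteBytes gives the write side of the same picture.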

Status checks are worth calling out. AWS runs two automated checks every minute. The system check fails when the underlying physical host has a problem (the fix is usually to stop and start the instance, which moves it to new hardware). The instance check fails when your operating system is unreachable (the fix is yours — bad config, kernel panic, or out of memory).
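To see the current result of both checks without opening the console, something like the following should work. The helper name instance_status is made up for this example; the underlying call is aws ec2 describe-instance-status.

```shell
# Hypothetical helper: print the system check and instance check status
# (for example "ok" or "impaired") for a single instance.
instance_status() {
  local instance_id="$1"
  aws ec2 describe-instance-status \
    --instance-ids "$instance_id" \
    --include-all-instances \
    --query 'InstanceStatuses[0].[SystemStatus.Status, InstanceStatus.Status]' \
    --output text
}

# Example call (instance ID reused from this page):
# instance_status i-0a1b2c3d4e5f
```

The --include-all-instances flag is there so that a stopped instance still returns a row instead of an empty result.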

The gotcha: memory and disk-used are not there

This trips up almost everyone. Memory utilization and disk space used are NOT in the default EC2 metrics. If you open CloudWatch expecting to debug a suspected memory leak, you will find a CPU graph and a network graph — and no memory graph at all.

The reason is architectural. CloudWatch’s default metrics come from the hypervisor, which sits outside your virtual machine. The hypervisor can see CPU and network because it controls them, but it has no visibility inside your operating system. How much RAM your app is using and how full the filesystem is are facts that only the OS knows. To export them, you must install software inside the instance: the CloudWatch agent.
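To make "only the OS knows" concrete, here is a minimal sketch of how a memory percentage can be derived from inside a Linux instance by reading /proc/meminfo. It assumes "used" means MemTotal minus MemAvailable; the CloudWatch agent's own calculation may differ, and the function name is hypothetical.

```shell
# Compute memory-used percent from /proc/meminfo-style text on stdin.
# Assumption: used = MemTotal - MemAvailable (not necessarily the agent's formula).
mem_used_percent() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", (total - avail) * 100 / total
      else exit 1
    }
  '
}

# On a Linux instance:
# mem_used_percent < /proc/meminfo
```

None of this information is visible from the host side, which is why it has to be collected by a process running in the guest.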

Installing the CloudWatch agent

The CloudWatch agent is a small program you run inside the instance that reads OS-level stats and pushes them to CloudWatch as custom metrics (metrics you define, in your own namespace, e.g. CWAgent).

When to use it: any time you care about memory, swap, disk-space-used, per-process stats, or want to ship application log files to CloudWatch Logs. When NOT to: if CPU/network/status checks are enough for your use case, skip it — custom metrics are billed per metric per month.

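Steps (shown for Amazon Linux 2023; on Ubuntu, install the agent’s .deb package instead of using dnf). Run them in a shell on the instance, for example an SSM (Systems Manager, AWS’s remote management service) session: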

Attach an IAM (Identity and Access Management) role with the CloudWatchAgentServerPolicy to the instance.
Install the package, then run the setup wizard:

# On Amazon Linux 2023
sudo dnf install -y amazon-cloudwatch-agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

Start the agent with the config the wizard wrote (by default it saves to bin/config.json under the agent directory):

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 \
  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s

Output:

****** processing amazon-cloudwatch-agent ******
Successfully fetched the config and saved in /opt/aws/.../amazon-cloudwatch-agent.toml.tmp
Configuration validation first phase succeeded
AmazonCloudWatchAgent has been started

After a minute, new metrics like mem_used_percent and disk_used_percent appear in the CWAgent namespace.
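If you would rather skip the wizard, a hand-written config can be passed to fetch-config instead. The sketch below writes a minimal one that collects only memory and root-filesystem usage. The helper name and the choice of metrics are assumptions for illustration; treat the JSON as a starting point rather than a complete configuration.

```shell
# Hypothetical helper: write a minimal CloudWatch agent config to the given path.
# The quoted 'EOF' keeps ${aws:InstanceId} literal so the agent, not the shell,
# expands it.
write_agent_config() {
  local path="$1"
  cat > "$path" <<'EOF'
{
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  }
}
EOF
}

# Example: write the file, then point fetch-config at it with -c file:<path>
# write_agent_config /tmp/cwagent-minimal.json
```

The disk section's used_percent measurement is what surfaces as disk_used_percent in the CWAgent namespace.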

Basic vs detailed monitoring

By default EC2 metrics are published at 5-minute intervals (called basic monitoring) at no extra charge. You can switch an instance to detailed monitoring, which publishes the same metrics every 1 minute.

	Basic monitoring	Detailed monitoring
Granularity	5 minutes	1 minute
Cost	Free	~$2.10 per instance / month
Good for	Steady workloads, dev/test	Fast-scaling fleets, tight alarms
Alarm reaction time	Slower (5-min data points)	Faster (1-min data points)

When to enable detailed monitoring: if you run an Auto Scaling group or want alarms to react quickly, the 5-minute lag of basic monitoring is too slow — a spike can come and go before a data point lands. For a single low-traffic box, basic is fine.
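A small worked example shows why a short spike can disappear at 5-minute granularity. The readings are invented, and the sketch assumes the 5-minute data point behaves like the plain mean of five 1-minute readings.

```shell
# Average the numbers passed as arguments, to one decimal place.
average() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR > 0) printf "%.1f\n", sum / NR }'
}

# Five hypothetical 1-minute CPU readings with a 2-minute spike to 100%.
# At 1-minute granularity two data points exceed an 80% threshold;
# averaged into one 5-minute data point, the spike is flattened.
average 10 10 100 100 10   # prints 46.0
```

With an 80 percent threshold, the 1-minute view breaches twice while the 5-minute average never comes close.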

Enable it via CLI:

aws ec2 monitor-instances --instance-ids i-0a1b2c3d4e5f

Output:

{
    "InstanceMonitorings": [
        {
            "InstanceId": "i-0a1b2c3d4e5f",
            "Monitoring": { "State": "pending" }
        }
    ]
}

Console: EC2 console -> Instances -> select the instance -> Actions -> Monitor and troubleshoot -> Manage detailed monitoring -> Enable.
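The state starts as pending, so it can be useful to check back afterwards. A possible way to do that, with a made-up helper name, is:

```shell
# Hypothetical helper: print the monitoring state of one instance
# (values such as "disabled", "pending", or "enabled").
detailed_monitoring_state() {
  local instance_id="$1"
  aws ec2 describe-instances \
    --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].Monitoring.State' \
    --output text
}

# Example call:
# detailed_monitoring_state i-0a1b2c3d4e5f
```

To go back to basic monitoring, the matching command is aws ec2 unmonitor-instances with the same --instance-ids argument.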

Creating a CPU alarm

An alarm watches one metric and changes state (OK / ALARM / INSUFFICIENT_DATA) when the metric crosses a threshold you set. You then attach actions — most commonly send an SNS (Simple Notification Service, AWS’s pub/sub messaging service) notification to email or Slack.

Console steps

Go to the CloudWatch console -> Alarms -> All alarms -> Create alarm.
Click Select metric -> EC2 -> Per-Instance Metrics, pick CPUUtilization for your instance, then Select metric.
Set Statistic to Average and Period to 5 minutes (or 1 minute if detailed monitoring is on).
Under Conditions, choose Greater than and enter 80 (percent).
Set Datapoints to alarm to 3 out of 3 so a single blip doesn’t page you.
On the next screen pick or create an SNS topic for the notification email.
Name the alarm high-cpu-i-0a1b2c3d4e5f and click Create alarm.
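The "3 out of 3" setting in the steps above can be illustrated with a toy evaluator. This is a simplified model, not CloudWatch's actual implementation: it ignores missing data and the INSUFFICIENT_DATA state, and the function name and sample values are invented.

```shell
# Print "ALARM" if at least NEED of the last WINDOW values exceed THRESHOLD,
# otherwise "OK". Usage: evaluate_alarm THRESHOLD NEED WINDOW value...
evaluate_alarm() {
  local threshold="$1" need="$2" window="$3"
  shift 3
  printf '%s\n' "$@" | tail -n "$window" | awk -v t="$threshold" -v n="$need" '
    $1 > t { breaching++ }
    END { if (breaching >= n) print "ALARM"; else print "OK" }
  '
}

evaluate_alarm 80 3 3 40 95 60 70   # a single blip in the window: prints OK
evaluate_alarm 80 3 3 50 85 90 97   # three breaching points in a row: prints ALARM
```

Lowering NEED to 1 in the first call would flip the result to ALARM, which is the flapping behaviour the 3 out of 3 setting avoids.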

CLI equivalent

aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-i-0a1b2c3d4e5f \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0a1b2c3d4e5f \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts

Output:

(no output on success; verify with describe-alarms)

aws cloudwatch describe-alarms --alarm-names high-cpu-i-0a1b2c3d4e5f \
  --query "MetricAlarms[0].StateValue" --output text

Output:

OK

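You can also have the alarm act on the instance automatically by passing an EC2 action ARN to --alarm-actions: arn:aws:automate:us-east-1:ec2:recover moves the instance to healthy hardware, and arn:aws:automate:us-east-1:ec2:reboot reboots it. The recover action works only on alarms that watch StatusCheckFailed_System, which makes it a natural fit for the system status-check alarm.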
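A system status-check alarm with a recover action might look like the sketch below. The helper name, alarm name pattern, and the choice of two 1-minute evaluation periods are assumptions to adapt, not recommended values from AWS.

```shell
# Hypothetical helper: create an alarm that recovers the instance when the
# system status check fails for two consecutive 1-minute periods.
create_recover_alarm() {
  local instance_id="$1" region="$2"
  aws cloudwatch put-metric-alarm \
    --region "$region" \
    --alarm-name "system-check-${instance_id}" \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_System \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "arn:aws:automate:${region}:ec2:recover"
}

# Example call:
# create_recover_alarm i-0a1b2c3d4e5f us-east-1
```

The metric is 0 when the check passes and 1 when it fails, so Maximum greater than 0 fires on any failure within the period.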

Cost note: the first 10 alarms and basic metrics are in the AWS Free Tier. Beyond that, standard alarms cost about $0.10 each per month and custom CloudWatch-agent metrics about $0.30 each per month — cheap, but a large fleet with many custom metrics adds up.
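To get a feel for how this adds up, here is a back-of-the-envelope calculator using only the approximate prices quoted above. It is a rough sketch: it ignores regional price differences, volume discounts, detailed monitoring, and any other CloudWatch charges, and the fleet numbers are invented.

```shell
# Rough monthly cost in USD for custom metrics and standard alarms.
# Usage: monthly_cost INSTANCES CUSTOM_METRICS_PER_INSTANCE ALARMS_PER_INSTANCE
monthly_cost() {
  awk -v n="$1" -v m="$2" -v a="$3" 'BEGIN {
    alarms = n * a - 10              # first 10 alarms assumed free
    if (alarms < 0) alarms = 0
    printf "%.2f\n", n * m * 0.30 + alarms * 0.10
  }'
}

# 200 instances, 5 custom metrics and 3 alarms each:
# metrics 200*5*0.30 = 300.00, alarms (600-10)*0.10 = 59.00
monthly_cost 200 5 3   # prints 359.00
```

For a single instance with a few alarms the result stays at or near zero, which matches the "cheap until the fleet grows" point.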

Best Practices

Always create at least a StatusCheckFailed_System alarm with an EC2 recover action; it auto-heals hardware failures with no human in the loop. (The recover action is not supported on the combined StatusCheckFailed metric.)
Install the CloudWatch agent on anything memory-sensitive before you have an incident, not during one.
Use Datapoints to alarm (e.g. 3 of 3) instead of a single data point to avoid flapping alerts on brief spikes.
Turn on detailed monitoring for Auto Scaling groups and anything that scales fast; leave basic monitoring on idle dev boxes to save money.
Tag your alarms and use consistent names (<metric>-<instance-id>) so you can find and clean them up later.
Send alarm actions to an SNS topic, not directly to one email — topics let you fan out to email, SMS, and chat without editing every alarm.
Watch EBS metrics in the AWS/EBS namespace, not AWS/EC2, when diagnosing slow disks.
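The SNS topic recommendation above can be sketched as follows. The helper name, topic name, and email address are placeholders; the function creates a topic (or returns the existing one with the same name) and adds one email subscriber.

```shell
# Hypothetical helper: create an SNS topic and subscribe one email address.
# The subscriber must confirm via the email SNS sends before delivery starts.
setup_alert_topic() {
  local topic_name="$1" email="$2" topic_arn
  topic_arn="$(aws sns create-topic --name "$topic_name" \
    --query TopicArn --output text)"
  aws sns subscribe \
    --topic-arn "$topic_arn" \
    --protocol email \
    --notification-endpoint "$email"
}

# Example call (placeholder address):
# setup_alert_topic ops-alerts oncall@example.com
```

Adding another destination later means another subscribe call on the same topic, with no change to any alarm.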
