Monitoring EC2 with CloudWatch
When you run a server, you need to answer two questions constantly: is it healthy, and is it busy? Amazon CloudWatch (AWS’s built-in monitoring service) is how you answer them for EC2 (Elastic Compute Cloud, AWS’s virtual machine service). Every EC2 instance sends a set of metrics to CloudWatch automatically, and you can build alarms on top of them to get notified when something goes wrong. This page explains which metrics you get for free, which ones you don’t (the surprising part), and how to wire up an alarm.
What CloudWatch gives you for free
A metric is just a number measured over time — for example, CPU usage every minute. The moment you launch an EC2 instance, the hypervisor (the host software that runs your virtual machine) starts publishing metrics about that instance to CloudWatch with zero setup. These live in the AWS/EC2 namespace (a namespace is just a folder that groups related metrics).
The default metrics fall into a few groups:
| Metric | What it measures | Notes |
|---|---|---|
| CPUUtilization | Percent of CPU in use | The single most-watched metric |
| NetworkIn / NetworkOut | Bytes received / sent | Useful for spotting traffic spikes |
| NetworkPacketsIn / NetworkPacketsOut | Packet counts | Available with detailed monitoring |
| DiskReadBytes / DiskWriteBytes | I/O on instance store disks | NOT your EBS volume (see below) |
| DiskReadOps / DiskWriteOps | I/O operations on instance store | Same caveat |
| StatusCheckFailed_System | Is the AWS host healthy? | Network, power, hardware |
| StatusCheckFailed_Instance | Is your OS healthy? | Boot, network config, exhausted memory |
| StatusCheckFailed | Either of the above failed | Good single alarm target |
Heads up on disk metrics: the default DiskReadBytes/DiskWriteBytes metrics only cover instance store volumes (temporary local disks). For an EBS (Elastic Block Store, AWS’s network-attached disk) volume, the disk metrics live in the AWS/EBS namespace and are keyed by the volume ID, not the instance ID. People debugging slow disks often watch the wrong graph.
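To look at the right graph from the command line, one possible approach is to list the volumes attached to the instance and then query the AWS/EBS namespace by volume ID. The sketch below is illustrative: the function names are made up for this example, and the relative-time syntax assumes GNU date (Linux).

```shell
#!/usr/bin/env bash
# Sketch: find the EBS volumes attached to an instance, then query the
# AWS/EBS namespace by VolumeId. Function names are illustrative only.

# Print the IDs of the EBS volumes attached to an instance.
attached_volumes() {
  local instance_id="$1"
  aws ec2 describe-volumes \
    --filters "Name=attachment.instance-id,Values=${instance_id}" \
    --query "Volumes[].VolumeId" --output text
}

# Sum of bytes read from one volume, in 5-minute buckets over the last hour.
# Assumes GNU date for the '1 hour ago' syntax.
ebs_read_bytes() {
  local volume_id="$1"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS \
    --metric-name VolumeReadBytes \
    --dimensions "Name=VolumeId,Value=${volume_id}" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Sum
}

# Usage (requires AWS credentials):
#   for vol in $(attached_volumes i-0a1b2c3d4e5f); do ebs_read_bytes "$vol"; done
```

Note that the dimension is VolumeId rather than InstanceId, which is exactly the difference described above.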
Status checks are worth calling out. AWS runs two automated checks every minute. The system check fails when the underlying physical host has a problem (the fix is usually to stop and start the instance, which moves it to new hardware). The instance check fails when your operating system is unreachable (the fix is yours — bad config, kernel panic, or out of memory).
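The two checks can also be read directly from the CLI. The following is a rough sketch, not an official procedure: instance_status wraps a real describe-instance-status call, while suggest_fix is a hypothetical helper that simply restates the advice above for a given pair of statuses.

```shell
#!/usr/bin/env bash
# Sketch: read both status checks for one instance and map the result to
# the fixes described in the text. Function names are illustrative only.

# Print the system status and instance status, tab-separated (e.g. "ok ok").
# --include-all-instances also reports instances that are not running.
instance_status() {
  local instance_id="$1"
  aws ec2 describe-instance-status \
    --instance-ids "${instance_id}" \
    --include-all-instances \
    --query "InstanceStatuses[0].[SystemStatus.Status, InstanceStatus.Status]" \
    --output text
}

# Translate a (system, instance) status pair into a suggested next step.
suggest_fix() {
  local system_status="$1" instance_status="$2"
  if [ "${system_status}" = "impaired" ]; then
    echo "system check failed: stop and start the instance to move hosts"
  elif [ "${instance_status}" = "impaired" ]; then
    echo "instance check failed: inspect OS config, kernel logs, memory"
  else
    echo "no failed status check"
  fi
}

# Usage (requires AWS credentials):
#   suggest_fix $(instance_status i-0a1b2c3d4e5f)
```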
The gotcha: memory and disk-used are not there
This trips up almost everyone. Memory utilization and disk space used are NOT in the default EC2 metrics. If you open CloudWatch expecting to debug a suspected memory leak, you will find a CPU graph and a network graph — and no memory graph at all.
The reason is architectural. CloudWatch’s default metrics come from the hypervisor, which sits outside your virtual machine. The hypervisor can see CPU and network because it controls them, but it has no visibility inside your operating system. How much RAM your app is using and how full the filesystem is are facts that only the OS knows. To export them, you must install software inside the instance: the CloudWatch agent.
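To make the point concrete, here is a small sketch of the kind of information that exists only inside the guest OS. It derives a memory-used percentage from /proc/meminfo on Linux. The function name is hypothetical and the formula is a simple approximation; the CloudWatch agent's own calculation may differ.

```shell
#!/usr/bin/env bash
# Sketch: compute memory-used percent from /proc/meminfo-style input on
# stdin. This data lives inside the OS, so the hypervisor cannot report it.

mem_used_percent() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", (total - avail) * 100 / total }'
}

# Usage on a Linux instance:
#   mem_used_percent < /proc/meminfo
```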
Installing the CloudWatch agent
The CloudWatch agent is a small program you run inside the instance that reads OS-level stats and pushes them to CloudWatch as custom metrics (metrics you define, in your own namespace, e.g. CWAgent).
When to use it: any time you care about memory, swap, disk-space-used, per-process stats, or want to ship application log files to CloudWatch Logs. When NOT to: if CPU/network/status checks are enough for your use case, skip it — custom metrics are billed per metric per month.
Steps (Amazon Linux 2023 / Ubuntu, using SSM — Systems Manager, AWS’s remote management service):
- Attach an IAM (Identity and Access Management) role with the CloudWatchAgentServerPolicy to the instance.
- Install the package, then run the setup wizard:
# On Amazon Linux 2023
sudo dnf install -y amazon-cloudwatch-agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
- Start the agent with the config the wizard wrote:
# The wizard saves its output to bin/config.json
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 \
  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
Output:
****** processing amazon-cloudwatch-agent ******
Successfully fetched the config and saved in /opt/aws/.../amazon-cloudwatch-agent.toml.tmp
Configuration validation first phase succeeded
AmazonCloudWatchAgent has been started
After a minute, new metrics like mem_used_percent and disk_used_percent appear in the CWAgent namespace.
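If you would rather not run the wizard, a hand-written config is an alternative. The sketch below writes a minimal config that collects only memory and root-filesystem usage, and adds a helper to confirm the metrics have arrived. The function names and the output path you pass in are assumptions for illustration; the resulting file would be passed to fetch-config with -c file:<path>.

```shell
#!/usr/bin/env bash
# Sketch: a minimal hand-written agent config plus a check that the
# metrics show up. Function names are illustrative only.

# Write a config that collects mem_used_percent and disk used_percent for "/".
write_agent_config() {
  local path="$1"
  # Quoting 'EOF' stops the shell from expanding ${aws:InstanceId};
  # the agent substitutes that placeholder itself at runtime.
  cat > "${path}" <<'EOF'
{
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  }
}
EOF
}

# List the memory metrics the agent has published so far.
list_agent_memory_metrics() {
  aws cloudwatch list-metrics --namespace CWAgent --metric-name mem_used_percent
}

# Usage:
#   write_agent_config /tmp/cwagent.json
#   list_agent_memory_metrics   # requires AWS credentials
```

Collecting only the two measurements you need keeps the custom-metric count, and therefore the bill, small.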
Basic vs detailed monitoring
By default EC2 metrics are published at 5-minute intervals (called basic monitoring) at no extra charge. You can switch an instance to detailed monitoring, which publishes the same metrics every 1 minute.
| | Basic monitoring | Detailed monitoring |
|---|---|---|
| Granularity | 5 minutes | 1 minute |
| Cost | Free | ~$2.10 per instance / month |
| Good for | Steady workloads, dev/test | Fast-scaling fleets, tight alarms |
| Alarm reaction time | Slower (5-min data points) | Faster (1-min data points) |
When to enable detailed monitoring: if you run an Auto Scaling group or want alarms to react quickly, the 5-minute lag of basic monitoring is too slow — a spike can come and go before a data point lands. For a single low-traffic box, basic is fine.
Enable it via CLI:
aws ec2 monitor-instances --instance-ids i-0a1b2c3d4e5f
Output:
{
  "InstanceMonitorings": [
    {
      "InstanceId": "i-0a1b2c3d4e5f",
      "Monitoring": { "State": "pending" }
    }
  ]
}
Console: EC2 console -> Instances -> select the instance -> Actions -> Monitor and troubleshoot -> Manage detailed monitoring -> Enable.
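To confirm the switch took effect, or to turn it back off, something like the following could be used. This is a sketch with made-up function names; alarm_period_for is a hypothetical helper that picks an alarm period matching the monitoring mode, on the assumption that any state other than enabled should be treated as basic monitoring.

```shell
#!/usr/bin/env bash
# Sketch: inspect and toggle detailed monitoring, and choose an alarm
# period to match. Function names are illustrative only.

# Print the monitoring state: disabled, disabling, enabled, or pending.
monitoring_state() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query "Reservations[0].Instances[0].Monitoring.State" --output text
}

# Switch an instance back to basic (5-minute) monitoring.
disable_detailed_monitoring() {
  aws ec2 unmonitor-instances --instance-ids "$1"
}

# Pick an alarm period in seconds: 60 for detailed, 300 otherwise.
alarm_period_for() {
  if [ "$1" = "enabled" ]; then echo 60; else echo 300; fi
}

# Usage (requires AWS credentials):
#   alarm_period_for "$(monitoring_state i-0a1b2c3d4e5f)"
```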
Creating a CPU alarm
An alarm watches one metric and changes state (OK / ALARM / INSUFFICIENT_DATA) when the metric crosses a threshold you set. You then attach actions — most commonly send an SNS (Simple Notification Service, AWS’s pub/sub messaging service) notification to email or Slack.
Console steps
- Go to the CloudWatch console -> Alarms -> All alarms -> Create alarm.
- Click Select metric -> EC2 -> Per-Instance Metrics, pick CPUUtilization for your instance, then Select metric.
- Set Statistic to Average and Period to 5 minutes (or 1 minute if detailed monitoring is on).
- Under Conditions, choose Greater than and enter 80 (percent).
- Set Datapoints to alarm to 3 out of 3 so a single blip doesn’t page you.
- On the next screen pick or create an SNS topic for the notification email.
- Name the alarm high-cpu-i-0a1b2c3d4e5f and click Create alarm.
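To build intuition for what "3 out of 3" means, here is a simplified model of the evaluation in plain bash. It is only a sketch of the idea: the function name is hypothetical, and CloudWatch's real evaluation also deals with missing data, which this ignores.

```shell
#!/usr/bin/env bash
# Sketch: a simplified "M out of N datapoints" check.
# should_alarm THRESHOLD M VALUE...
# Succeeds (exit 0) when at least M of the given values are strictly
# greater than THRESHOLD, mirroring the "Greater than" condition.

should_alarm() {
  local threshold="$1" needed="$2"
  shift 2
  local breaching=0 value
  for value in "$@"; do
    # awk compares decimals such as 80.5, which [ -gt ] cannot handle.
    if awk -v v="${value}" -v t="${threshold}" 'BEGIN { exit !(v > t) }'; then
      breaching=$((breaching + 1))
    fi
  done
  [ "${breaching}" -ge "${needed}" ]
}

# Usage:
#   should_alarm 80 3 85 90 95 && echo "ALARM"   # all three breach
#   should_alarm 80 3 85 20 95 || echo "OK"      # one blip is not enough
```

With 3 out of 3, a single low data point between two spikes keeps the alarm in OK, which is the anti-flapping behaviour the step above is after.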
CLI equivalent
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu-i-0a1b2c3d4e5f \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0a1b2c3d4e5f \
--statistic Average \
--period 300 \
--evaluation-periods 3 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts
Output:
(no output on success; verify with describe-alarms)
aws cloudwatch describe-alarms --alarm-names high-cpu-i-0a1b2c3d4e5f \
--query "MetricAlarms[0].StateValue" --output text
Output:
OK
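You can also have an alarm act on the instance itself. Passing --alarm-actions arn:aws:automate:us-east-1:ec2:recover recovers the instance automatically, and arn:aws:automate:us-east-1:ec2:reboot reboots it. The recover action is intended for an alarm on StatusCheckFailed_System, because it addresses problems with the underlying host rather than with your OS.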
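A recover alarm on the system status check might be created along these lines. Treat it as a sketch: the function name, the alarm name pattern, and the choice of two consecutive 1-minute periods are assumptions for illustration, not settings taken from this page.

```shell
#!/usr/bin/env bash
# Sketch: an alarm that recovers the instance when the system status
# check fails. Function name and alarm name are illustrative only.

create_recover_alarm() {
  local instance_id="$1" region="$2"
  # StatusCheckFailed_System is 0 when healthy and 1 when the check fails,
  # so Maximum >= 1 for two consecutive minutes triggers the action.
  aws cloudwatch put-metric-alarm \
    --region "${region}" \
    --alarm-name "system-check-${instance_id}" \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_System \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions "arn:aws:automate:${region}:ec2:recover"
}

# Usage (requires AWS credentials):
#   create_recover_alarm i-0a1b2c3d4e5f us-east-1
```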
Cost note: the first 10 alarms and basic metrics are in the AWS Free Tier. Beyond that, standard alarms cost about $0.10 each per month and custom CloudWatch-agent metrics about $0.30 each per month — cheap, but a large fleet with many custom metrics adds up.
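As a rough worked example of how this adds up, the sketch below estimates a monthly bill from the approximate prices quoted on this page ($0.10 per alarm beyond the first 10, $0.30 per custom metric, about $2.10 per instance for detailed monitoring). The function name is hypothetical, and the model deliberately ignores other free-tier allowances and volume discounts.

```shell
#!/usr/bin/env bash
# Sketch: estimate monthly CloudWatch cost in cents from the approximate
# prices quoted in the text. Integer cents avoid floating-point math.
# monthly_cost_cents ALARMS CUSTOM_METRICS DETAILED_INSTANCES

monthly_cost_cents() {
  local alarms="$1" custom_metrics="$2" detailed_instances="$3"
  local free_alarms=10
  # Only alarms beyond the free allowance are billed.
  local billable_alarms=$(( alarms > free_alarms ? alarms - free_alarms : 0 ))
  echo $(( billable_alarms * 10 + custom_metrics * 30 + detailed_instances * 210 ))
}

# Usage: 100 instances, each with detailed monitoring and 2 custom metrics,
# plus 110 alarms in total:
#   monthly_cost_cents 110 200 100   # prints 28000, i.e. about $280/month
```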
Best Practices
- Always create at least a StatusCheckFailed_System alarm with an EC2 recover action. It auto-heals hardware failures with no human in the loop. (The recover action applies to the system check, not the instance check.)
- Install the CloudWatch agent on anything memory-sensitive before you have an incident, not during one.
- Use Datapoints to alarm (e.g. 3 of 3) instead of a single data point to avoid flapping alerts on brief spikes.
- Turn on detailed monitoring for Auto Scaling groups and anything that scales fast; leave basic monitoring on idle dev boxes to save money.
- Tag your alarms and use consistent names (<metric>-<instance-id>) so you can find and clean them up later.
- Send alarm actions to an SNS topic, not directly to one email. Topics let you fan out to email, SMS, and chat without editing every alarm.
- Watch EBS metrics in the AWS/EBS namespace, not AWS/EC2, when diagnosing slow disks.
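The SNS recommendation above can be sketched as a small helper that creates a topic, subscribes one email address, and prints the topic ARN for use with --alarm-actions. The function name and the example address are assumptions; the recipient still has to confirm the subscription from the email AWS sends.

```shell
#!/usr/bin/env bash
# Sketch: create an SNS topic for alarm notifications and subscribe one
# email address. Function name is illustrative only.

create_alert_topic() {
  local topic_name="$1" email="$2" topic_arn
  # create-topic returns the existing ARN if the topic already exists.
  topic_arn="$(aws sns create-topic --name "${topic_name}" \
    --query TopicArn --output text)"
  # The subscription stays pending until the recipient confirms by email.
  aws sns subscribe --topic-arn "${topic_arn}" \
    --protocol email --notification-endpoint "${email}" > /dev/null
  echo "${topic_arn}"
}

# Usage (requires AWS credentials):
#   topic_arn="$(create_alert_topic ops-alerts oncall@example.com)"
#   ...then pass --alarm-actions "${topic_arn}" to put-metric-alarm
```

Adding SMS or chat later means adding another subscription to the same topic; the alarms themselves stay unchanged.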