CloudWatch Alarms
CloudWatch is Amazon’s built-in monitoring service: it collects metrics, logs, and events from your AWS resources. A CloudWatch alarm watches a single metric (or the result of a metric math expression) and changes state when that value crosses a threshold you set. When the alarm fires, it can email you, page an on-call engineer, reboot an instance, or scale a group out. Alarms turn raw numbers into action, so you find out about problems before your users do.
How an alarm decides its state
Every alarm is always in one of three states:
| State | Meaning |
|---|---|
| OK | The metric is within the threshold you set. |
| ALARM | The metric breached the threshold for long enough. |
| INSUFFICIENT_DATA | The alarm just started, or there isn’t enough data to judge. |
You configure four things that decide when OK flips to ALARM:
- Threshold: the value you compare against (e.g. CPU > 80%).
- Period: how long each data point covers, in seconds (e.g. 300 = 5 minutes). Make it at least as long as the interval at which the metric is published; a shorter period leaves some periods with no data.
- Evaluation periods: how many recent periods to look at (the N in “M out of N”).
- Datapoints to alarm: how many of those periods must breach (the M). For example, 3 out of 5 means the alarm fires if 3 of the last 5 periods breached, consecutive or not. This stops a single noisy spike from waking you up at 3am.
Tip: Don’t alarm on a single period. A short spike in CPU or latency is normal. Use something like “3 datapoints out of 5” so the alarm reflects a real, sustained problem.
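The “M out of N” rule can be sketched in a few lines of Python. This is a simplified illustration, not CloudWatch's actual evaluation logic: the function name `breaches_m_of_n` and the sample values are hypothetical, and missing data points are ignored here.

```python
from typing import Sequence


def breaches_m_of_n(datapoints: Sequence[float], threshold: float, m: int, n: int) -> bool:
    """Return True if at least m of the last n datapoints exceed the threshold."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    window = datapoints[-n:]  # the N most recent periods
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= m


# Hypothetical CPU averages, one value per 5-minute period.
spike = [42.0, 95.0, 40.0, 38.0, 41.0]      # a single noisy period
sustained = [85.0, 60.0, 91.0, 88.0, 55.0]  # three periods above 80

print(breaches_m_of_n(spike, threshold=80, m=3, n=5))      # False
print(breaches_m_of_n(sustained, threshold=80, m=3, n=5))  # True
```

With `m=1, n=1` the same spike would fire the alarm, which is the behavior the tip warns against.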
Creating a metric alarm wired to SNS
The most common setup is: when a metric breaches, publish to an SNS topic. SNS (Simple Notification Service) is AWS’s pub/sub messaging service; an SNS topic can fan a single message out to email, SMS, a Lambda function, or a chat webhook. So the alarm pushes to the topic, and the topic delivers to whoever subscribed.
When to use this: almost always. Notification is the baseline. Use a direct EC2/Auto Scaling action (below) only when you also want the alarm to fix something automatically.
Console steps
- Open the CloudWatch console and choose Alarms > All alarms > Create alarm.
- Click Select metric, then pick a namespace (e.g. EC2 > Per-Instance Metrics) and a metric like CPUUtilization for instance `i-0a1b2c3d4e5f`.
- Set Statistic to `Average` and Period to `5 minutes`.
- Under Conditions, choose Static, Greater, and set the threshold to `80`.
- Expand Additional configuration and set Datapoints to alarm to `3 out of 5`. Set Missing data treatment deliberately (see below).
- Click Next. Under Notification, choose In alarm and select an existing SNS topic (or create one and confirm the email subscription).
- Give the alarm a name like `prod-web-cpu-high` and click Create alarm.
CLI equivalent
First create the topic and subscribe an email (you must click the confirmation link AWS emails you):
aws sns create-topic --name prod-alerts
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:prod-alerts \
--protocol email \
--notification-endpoint [email protected]
Then create the alarm:
aws cloudwatch put-metric-alarm \
--alarm-name prod-web-cpu-high \
--alarm-description "CPU > 80% in 3 of the last 5 five-minute periods" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0a1b2c3d4e5f \
--statistic Average \
--period 300 \
--evaluation-periods 5 \
--datapoints-to-alarm 3 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data missing \
--alarm-actions arn:aws:sns:us-east-1:123456789012:prod-alerts \
--ok-actions arn:aws:sns:us-east-1:123456789012:prod-alerts
Output:
(put-metric-alarm returns no output on success; verify with describe-alarms)
aws cloudwatch describe-alarms --alarm-names prod-web-cpu-high
{
"MetricAlarms": [
{
"AlarmName": "prod-web-cpu-high",
"StateValue": "INSUFFICIENT_DATA",
"ComparisonOperator": "GreaterThanThreshold",
"Threshold": 80.0,
"EvaluationPeriods": 5,
"DatapointsToAlarm": 3
}
]
}
The new alarm starts in INSUFFICIENT_DATA until it gathers enough periods.
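If you want to check the state from a script rather than by eye, the JSON above is easy to parse. The sketch below assumes the `describe-alarms` output has been captured as a string; `alarm_states` is a hypothetical helper name, and the sample payload simply repeats the trimmed output shown above.

```python
import json
from typing import Dict

# Trimmed describe-alarms output, as shown above.
SAMPLE_OUTPUT = """
{
  "MetricAlarms": [
    {
      "AlarmName": "prod-web-cpu-high",
      "StateValue": "INSUFFICIENT_DATA",
      "ComparisonOperator": "GreaterThanThreshold",
      "Threshold": 80.0,
      "EvaluationPeriods": 5,
      "DatapointsToAlarm": 3
    }
  ]
}
"""


def alarm_states(describe_alarms_json: str) -> Dict[str, str]:
    """Map each metric alarm's name to its current state."""
    payload = json.loads(describe_alarms_json)
    return {a["AlarmName"]: a["StateValue"] for a in payload.get("MetricAlarms", [])}


print(alarm_states(SAMPLE_OUTPUT))  # {'prod-web-cpu-high': 'INSUFFICIENT_DATA'}
```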
The “treat missing data” gotcha
This is the setting that bites everyone. A metric only has data points when something reports them. If an EC2 instance is terminated, or the agent publishing a metric crashes or loses its network, the data points simply stop, so there is no high value to breach your threshold. The alarm then has no data, not bad data. How it reacts depends on this one setting:
| treat-missing-data value | What missing data means | Risk |
|---|---|---|
| missing (default) | Gaps count neither as good nor as bad; if every data point in the evaluation range is missing, the alarm moves to INSUFFICIENT_DATA. | A dead instance never reaches ALARM, and nobody is notified unless you also set insufficient-data actions. |
| notBreaching | Treat gaps as good. | Outages never alarm; the alarm keeps reading OK. |
| breaching | Treat gaps as bad. | Missing data fires ALARM. Best for “is it alive?” checks. |
| ignore | Keep the current state. | A dead instance stays OK; an alarm already in ALARM can get stuck there. |
Warning: A common production incident is an instance that goes completely offline while its CPU alarm never reaches ALARM: it sits in `OK` (with `notBreaching` or `ignore`) or drifts to `INSUFFICIENT_DATA` (with the default `missing`), and neither state pages anyone. To catch the absence of a heartbeat, set `--treat-missing-data breaching` on at least one liveness alarm, or alarm on a metric that’s always present (like an ELB `HealthyHostCount` dropping to 0). Don’t let silence read as “everything’s fine.”
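To see the difference between the four settings, here is a deliberately simplified model of one evaluation window in which `None` stands for a missing data point. It is an illustration only: real CloudWatch evaluation has more nuance (for instance, it can look further back for data), and the function name `evaluate_window` is hypothetical.

```python
from typing import Optional, Sequence


def evaluate_window(
    window: Sequence[Optional[float]],
    threshold: float,
    m: int,
    treatment: str,
    current_state: str,
) -> str:
    """Simplified alarm state for one window; None marks a missing datapoint."""
    present = [v for v in window if v is not None]
    missing_count = len(window) - len(present)
    breaching = sum(1 for v in present if v > threshold)

    if treatment == "breaching":
        breaching += missing_count        # gaps count as bad
    elif treatment == "notBreaching":
        pass                              # gaps count as good
    elif treatment == "missing":
        if not present:
            return "INSUFFICIENT_DATA"    # nothing to judge at all
    elif treatment == "ignore":
        if not present:
            return current_state          # keep whatever state we were in
    else:
        raise ValueError(f"unknown treatment: {treatment!r}")
    return "ALARM" if breaching >= m else "OK"


# A dead instance: five periods with no data, alarm previously OK.
dead = [None] * 5
for treatment in ("missing", "notBreaching", "breaching", "ignore"):
    print(treatment, evaluate_window(dead, 80, 3, treatment, "OK"))
# missing INSUFFICIENT_DATA
# notBreaching OK
# breaching ALARM
# ignore OK
```

Only `breaching` turns silence into an ALARM in this model, which is why it suits liveness checks.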
Composite alarms
A composite alarm has no metric of its own. Instead its state is a boolean rule over other alarms. This cuts alarm noise: instead of five separate pages, you get one when the situation is genuinely bad.
When to use this: to reduce false pages (only alert when CPU is high and latency is high), or to roll many child alarms into a single “service is unhealthy” signal. When not to: for a single simple threshold — a plain metric alarm is enough.
aws cloudwatch put-composite-alarm \
--alarm-name prod-web-unhealthy \
--alarm-rule "ALARM(prod-web-cpu-high) AND ALARM(prod-web-latency-high)" \
--alarm-actions arn:aws:sns:us-east-1:123456789012:prod-alerts
You can use ALARM(), OK(), INSUFFICIENT_DATA(), and the operators AND, OR, NOT in the rule.
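To make the rule semantics concrete, the sketch below evaluates a rule string against a dictionary of child alarm states. It is a toy evaluator covering only the functions and operators listed above, not the full CloudWatch rule grammar; `evaluate_rule` and the `states` mapping are hypothetical names.

```python
import re
from typing import Dict

STATE_CALL = re.compile(r'\b(ALARM|OK|INSUFFICIENT_DATA)\(\s*"?([^()"]+?)"?\s*\)')
OPERATORS = {"AND": "and", "OR": "or", "NOT": "not"}
ALLOWED_TOKENS = {"True", "False", "and", "or", "not", "(", ")"}


def evaluate_rule(rule: str, states: Dict[str, str]) -> bool:
    """Evaluate a composite alarm rule against known child alarm states."""
    # Replace each state function with the boolean it evaluates to.
    expr = STATE_CALL.sub(lambda m: str(states[m.group(2)] == m.group(1)), rule)
    # Map the rule operators onto Python's boolean keywords.
    expr = re.sub(r"\b(AND|OR|NOT)\b", lambda m: OPERATORS[m.group(1)], expr)
    # Refuse anything left over, so eval only sees booleans and operators.
    tokens = re.findall(r"[()]|[^\s()]+", expr)
    if not set(tokens) <= ALLOWED_TOKENS:
        raise ValueError(f"unsupported rule syntax: {rule!r}")
    return bool(eval(expr, {"__builtins__": {}}, {}))


rule = "ALARM(prod-web-cpu-high) AND ALARM(prod-web-latency-high)"
print(evaluate_rule(rule, {"prod-web-cpu-high": "ALARM", "prod-web-latency-high": "OK"}))     # False
print(evaluate_rule(rule, {"prod-web-cpu-high": "ALARM", "prod-web-latency-high": "ALARM"}))  # True
```

High CPU alone does not trip the composite; both children must be in ALARM, which is what keeps the page count down.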
Alarms that take an action instead of just notifying
Alarm actions aren’t limited to SNS. You can attach EC2 actions (stop, terminate, reboot, recover) or an Auto Scaling policy ARN. For example, to auto-recover an instance with failed system status, point an alarm on the StatusCheckFailed_System metric at the arn:aws:automate:us-east-1:ec2:recover action.
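As a rough sketch of how such an alarm could be assembled, the code below builds the action ARN and a parameter dictionary for an auto-recover alarm. The helper names, the alarm name, and the statistic, period, and threshold values are illustrative assumptions rather than recommended settings; nothing here calls AWS.

```python
from typing import Dict

EC2_ACTIONS = ("stop", "terminate", "reboot", "recover")


def ec2_action_arn(region: str, action: str) -> str:
    """Build the ARN for a built-in EC2 alarm action."""
    if action not in EC2_ACTIONS:
        raise ValueError(f"unsupported EC2 alarm action: {action!r}")
    return f"arn:aws:automate:{region}:ec2:{action}"


def recover_alarm_params(instance_id: str, region: str) -> Dict:
    """Parameters for an alarm that recovers an instance on system check failure."""
    return {
        "AlarmName": f"{instance_id}-auto-recover",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",   # illustrative choice
        "Period": 60,             # illustrative choice
        "EvaluationPeriods": 2,   # illustrative choice
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [ec2_action_arn(region, "recover")],
    }


params = recover_alarm_params("i-0a1b2c3d4e5f", "us-east-1")
print(params["AlarmActions"][0])  # arn:aws:automate:us-east-1:ec2:recover
```

The keys mirror the parameter names of boto3's `put_metric_alarm`, so with boto3 installed and credentials configured the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.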
Cost note: Standard-resolution alarms cost about $0.10 per alarm per month; high-resolution (sub-60-second) alarms are about $0.30. Composite alarms are about $0.50 each. SNS email notifications are effectively free at low volume. A handful of alarms costs cents per month — never skip them to save money.
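For a quick budget check, the arithmetic is simple enough to script. The prices below are the approximate per-alarm figures quoted above, and the alarm counts are a made-up example; check current pricing for your region before relying on the result.

```python
from typing import Dict

# Approximate monthly price per alarm, in USD, as quoted above.
PRICES = {"standard": 0.10, "high_resolution": 0.30, "composite": 0.50}


def monthly_cost(counts: Dict[str, int]) -> float:
    """Estimated monthly alarm cost in USD for the given alarm counts."""
    return round(sum(PRICES[kind] * n for kind, n in counts.items()), 2)


# Hypothetical fleet: 20 standard, 2 high-resolution, 3 composite alarms.
print(monthly_cost({"standard": 20, "high_resolution": 2, "composite": 3}))  # 4.1
```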
Best Practices
- Set datapoints to alarm (M of N) so transient spikes don’t page you, but real sustained problems do.
- Always set `treat-missing-data` explicitly; the default `missing` can hide a dead resource.
- Add `--ok-actions` so you also get notified when the issue resolves; closing the loop matters.
- Add at least one liveness alarm with `breaching` missing-data handling so silence is treated as failure.
- Use composite alarms to combine signals and cut false pages instead of subscribing humans to every metric.
- Keep the alarm period at or above the metric’s publish interval (5 min for basic EC2 monitoring, 1 min for detailed monitoring); a shorter period leaves empty periods and can push the alarm into `INSUFFICIENT_DATA`.
- Define alarms in infrastructure as code (CloudFormation/Terraform) so every environment has the same coverage.
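As one possible way to keep environments in sync, the sketch below generates a CloudFormation-style template for the CPU alarm from a single function, so prod and staging differ only in their inputs. The function name, the logical ID `WebCpuHighAlarm`, and the staging instance ID are hypothetical; the property values mirror the CLI example earlier in this section.

```python
import json
from typing import Dict


def cpu_alarm_template(env: str, instance_id: str, topic_arn: str) -> Dict:
    """Build a CloudFormation-style template with one CPU alarm for `env`."""
    alarm = {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmName": f"{env}-web-cpu-high",
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Average",
            "Period": 300,
            "EvaluationPeriods": 5,
            "DatapointsToAlarm": 3,
            "Threshold": 80,
            "ComparisonOperator": "GreaterThanThreshold",
            "TreatMissingData": "missing",  # set explicitly, as recommended above
            "AlarmActions": [topic_arn],
            "OKActions": [topic_arn],
        },
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {"WebCpuHighAlarm": alarm},  # hypothetical logical ID
    }


TOPIC = "arn:aws:sns:us-east-1:123456789012:prod-alerts"
prod = cpu_alarm_template("prod", "i-0a1b2c3d4e5f", TOPIC)
staging = cpu_alarm_template("staging", "i-0f9e8d7c6b5a", TOPIC)  # hypothetical instance ID

print(json.dumps(staging, indent=2))
```

Because both templates come from the same function, a threshold change lands in every environment at once instead of drifting per console edit.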