Alerting on Metrics
Monitoring is only useful if someone finds out when things go wrong. Alerting is the part of your monitoring setup that watches your metrics and sends you a message (an email, a Slack ping, a phone notification) as soon as a problem is confirmed. The goal is simple: you want to know your disk is filling up before it is full, not at 3 a.m. when the database crashes. This page shows you how to define alert rules, route them, and send notifications using two widely used tools: Prometheus Alertmanager and Grafana alerting.
What an alert actually is
An alert is a rule that says “if this metric crosses this line for this long, tell someone.” It has three parts:
- A condition — a query against your metrics, like “CPU usage above 90%”.
- A duration — how long the condition must stay true before firing (e.g. “for 5 minutes”). This stops one brief spike from waking you up.
- A destination — where the notification goes (email, Slack, etc.).
The duration part matters more than beginners expect. A server hitting 100% CPU for two seconds is normal. A server stuck at 100% for ten minutes is a real problem. The for clause is what tells the two apart.
Alertmanager vs Grafana alerting — when to use which
Both tools turn metrics into notifications. The difference is mostly about where you already live.
| Feature | Prometheus Alertmanager | Grafana alerting |
|---|---|---|
| Where rules live | In Prometheus config files (text) | In the Grafana UI (clickable) |
| Best for | Teams comfortable editing YAML | Teams who prefer a dashboard |
| Grouping & deduplication | Excellent, built-in | Good |
| Data sources | Prometheus and Prometheus-compatible systems | Prometheus, Loki, MySQL, and more |
| Learning curve | Steeper | Gentler |
When to use Alertmanager: you already run Prometheus, you want your alert rules version-controlled in Git, and you need smart grouping (e.g. “20 servers went down, send one message, not 20”).
When to use Grafana alerting: you want to click rather than edit YAML, or you pull metrics from several different databases.
If you already have Prometheus and Grafana running, pick one alerting path and stick with it. Running both at once is the fastest way to get duplicate, confusing pages.
Writing alert rules in Prometheus
Prometheus (an open-source metrics database that scrapes numbers from your servers) reads alert rules from a YAML file. On Ubuntu 22.04/24.04 LTS, create the file:
sudo nano /etc/prometheus/alert.rules.yml
Here is a real, working set of three rules — high CPU, disk almost full, and a service that has gone down. These assume you have Node Exporter installed (the agent that exposes Linux server metrics to Prometheus).
groups:
  - name: server-health
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU has been above 90% for 5 minutes."
      - alert: DiskAlmostFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Less than 10% disk space left on root."
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target down: {{ $labels.instance }}"
          description: "Prometheus cannot reach this target."
Tell Prometheus to load this file by adding it under rule_files in /etc/prometheus/prometheus.yml:
rule_files:
  - "/etc/prometheus/alert.rules.yml"
Then check your config is valid and reload:
sudo promtool check rules /etc/prometheus/alert.rules.yml
sudo systemctl reload prometheus
Output:
Checking /etc/prometheus/alert.rules.yml
SUCCESS: 3 rules found
Open http://your-server:9090/alerts in a browser and you will see your three rules listed as Inactive, Pending (condition is true but the for timer is still counting), or Firing.
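If you want to check the Pending-to-Firing behaviour without taking a real server down, promtool can replay made-up samples against the rules file. The following is a minimal sketch of such a test for the ServiceDown rule. The file name alert.rules.test.yml, the job name node, and the target web1:9100 are assumptions chosen for illustration, and the test assumes the rules file sits in the same directory.

```yaml
# Hypothetical file: /etc/prometheus/alert.rules.test.yml
rule_files:
  - alert.rules.yml          # path is resolved relative to this test file

evaluation_interval: 1m

tests:
  - interval: 1m             # spacing between the sample values below
    input_series:
      # A made-up target that is down (up == 0) for five minutes.
      - series: 'up{job="node", instance="web1:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # At 0m the condition has only just become true, so the alert
      # is still Pending and nothing is expected to be firing.
      - eval_time: 0m
        alertname: ServiceDown
      # At 2m the 1m "for" timer has elapsed, so the alert should fire.
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: web1:9100
            exp_annotations:
              summary: "Target down: web1:9100"
              description: "Prometheus cannot reach this target."
```

Run it with sudo promtool test rules /etc/prometheus/alert.rules.test.yml. If the labels or annotations in the rule change, the expected values in the test need to change with them.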
Routing and notifications with Alertmanager
Prometheus only detects alerts. To actually send them, it hands firing alerts to Alertmanager, which decides where they go. Install it on Ubuntu:
sudo apt update
sudo apt install prometheus-alertmanager -y
Now configure a Slack and email destination in /etc/prometheus/alertmanager.yml:
route:
  receiver: team-slack
  group_by: ['alertname', 'instance']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: team-email
receivers:
  - name: team-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        title: "{{ .CommonAnnotations.summary }}"
  - name: team-email
    email_configs:
      - to: "[email protected]"
        from: "[email protected]"
        smarthost: "smtp.example.com:587"
        auth_username: "[email protected]"
        auth_password: "your-app-password"
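This sends alerts to Slack by default, while anything labelled severity: critical matches the child route and goes to email instead (a matching child route ends the search unless it sets continue: true). The group_by line bundles alerts that share the same alertname and instance into one message; to get a single message when many servers fail at once, remove instance from the list. group_wait holds the first notification for 30 seconds so alerts that arrive together are batched, and repeat_interval means an alert that is still firing is re-sent at most once every four hours.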
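Alertmanager walks child routes in order and stops at the first one that matches unless that route sets continue: true. If the intent is for critical alerts to reach both email and Slack, one possible variant of the route block is sketched below. It reuses the receiver names from the config above; grouping by alertname only is an assumption made here so that many servers failing with the same alert produce a single message.

```yaml
route:
  receiver: team-slack          # fallback, used only if no child route matches
  group_by: ['alertname']       # assumption: one message per alert name
  group_wait: 30s
  repeat_interval: 4h
  routes:
    # Critical alerts go to email, then keep evaluating later routes.
    - matchers:
        - severity = "critical"
      receiver: team-email
      continue: true
    # Catch-all route with no matchers: every alert, including the
    # critical ones that continued from above, goes to Slack.
    - receiver: team-slack
```

The catch-all sibling is needed because the top-level receiver applies only when no child route matched at all.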
Connect Prometheus to Alertmanager in prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
Restart both services:
sudo systemctl restart prometheus prometheus-alertmanager
sudo systemctl status prometheus-alertmanager --no-pager
Output:
● prometheus-alertmanager.service - Prometheus Alertmanager
Active: active (running) since Mon 2026-06-15 09:14:02 UTC
Routing in Grafana instead
If you prefer the UI, Grafana does the whole job in one place. In the Grafana web interface go to Alerting → Alert rules → New alert rule, pick your Prometheus data source, write the same query (e.g. CPU above 90%), set the for duration, then under Contact points add a Slack webhook URL or an email address. Notification policies are Grafana’s equivalent of Alertmanager routes — they decide which alerts reach which contact point based on labels like severity.
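Grafana can also load contact points and notification policies from provisioning files, which is one way to keep them in Git alongside everything else. The sketch below mirrors the Slack and email setup described above. The file path, the uid values, and the address oncall@example.com are placeholders, and the field names should be checked against the provisioning documentation for your Grafana version before relying on them.

```yaml
# Hypothetical file: /etc/grafana/provisioning/alerting/notifications.yml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: team-slack
    receivers:
      - uid: team-slack-webhook          # any unique ID you choose
        type: slack
        settings:
          url: https://hooks.slack.com/services/XXX/YYY/ZZZ
  - orgId: 1
    name: team-email
    receivers:
      - uid: team-email-oncall           # any unique ID you choose
        type: email
        settings:
          addresses: oncall@example.com  # placeholder address

policies:
  - orgId: 1
    receiver: team-slack                 # default contact point
    group_by: ['alertname']
    routes:
      # Alerts labelled severity=critical go to the email contact point.
      - receiver: team-email
        object_matchers:
          - ['severity', '=', 'critical']
```

Grafana reads these files at startup, so a restart of the Grafana service is the simplest way to apply changes.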
Actionable alerts, not noise
The number-one reason alerting fails is too many alerts. When every minor blip pages the team, people start ignoring them — and then miss the real outage. This is called alert fatigue, and it is dangerous.
A good rule of thumb: if an alert fires and there is nothing a human needs to do about it, it should not be an alert. Put that information on a dashboard instead.
Best Practices
- Alert on symptoms users feel (site down, slow responses, disk full), not on every internal metric.
- Always set a for duration so brief spikes do not page anyone.
- Use severity labels and route critical to a louder channel (phone/email) than warning (Slack).
- Write the description so the person woken up knows what to do, not just what happened.
- Group related alerts so one incident sends one message, not fifty.
- Store rule files in Git so changes are reviewed and reversible.
- Periodically delete alerts nobody acts on — fewer, sharper alerts beat many ignored ones.
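As an illustration of the advice about descriptions, here is one way the DiskAlmostFull rule from earlier could be made more actionable. The suggested command and the runbook URL are assumptions for the sake of the example, not part of any standard; replace them with whatever your team actually does.

```yaml
groups:
  - name: server-health
    rules:
      - alert: DiskAlmostFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          # $value is the measured percentage of free space that
          # triggered the rule, so the reader sees how bad it is.
          description: >-
            Root filesystem on {{ $labels.instance }} has
            {{ printf "%.1f" $value }}% free. Look for large logs with
            `sudo du -xh /var/log | sort -h | tail` and rotate or
            remove them.
          # Hypothetical runbook link; point this at your own docs.
          runbook_url: "https://wiki.example.com/runbooks/disk-almost-full"
```

The annotation name runbook_url is only a convention; Alertmanager passes any annotation through to the notification templates.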