Alerting on Metrics

Monitoring is only useful if someone finds out when things go wrong. Alerting is the part of your monitoring setup that watches your metrics and sends you a message — an email, a Slack ping, a phone notification — the moment a problem appears. The goal is simple: you want to know your disk is filling up before it fills up, not at 3 a.m. when the database crashes. This page shows you how to define alert rules, route them, and send notifications using the two most common tools: Prometheus Alertmanager and Grafana alerting.

What an alert actually is

An alert is a rule that says “if this metric crosses this line for this long, tell someone.” It has three parts:

A condition — a query against your metrics, like “CPU usage above 90%”.
A duration — how long the condition must stay true before firing (e.g. “for 5 minutes”). This stops one brief spike from waking you up.
A destination — where the notification goes (email, Slack, etc.).

The duration part matters more than beginners expect. A server hitting 100% CPU for two seconds is normal. A server stuck at 100% for ten minutes is a real problem. The for clause is what tells the two apart.
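
The effect of the for clause can be sketched in a few lines of Python. This is a toy model, not Prometheus code: the ForRule class, the 60-second evaluation step, and the sample values are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForRule:
    """Toy model of an alert rule with a threshold and a `for` duration."""
    threshold: float                    # condition is "value > threshold"
    for_seconds: int                    # how long the condition must hold
    active_since: Optional[int] = None  # when the condition first became true

    def evaluate(self, now: int, value: float) -> str:
        if value <= self.threshold:
            self.active_since = None    # condition false: the timer resets
            return "inactive"
        if self.active_since is None:
            self.active_since = now     # condition just became true
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


def run(rule: ForRule, samples: list[float], step: int = 60) -> list[str]:
    """Evaluate one sample every `step` seconds and collect the states."""
    return [rule.evaluate(i * step, v) for i, v in enumerate(samples)]


if __name__ == "__main__":
    print(run(ForRule(90, 300), [100, 50, 50]))  # brief spike: never fires
    print(run(ForRule(90, 300), [95] * 6))       # stuck high: fires after 5 minutes
```

The brief spike goes pending once and then drops back to inactive, while the sustained load stays pending until the five minutes are up and only then fires.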

Alertmanager vs Grafana alerting — when to use which

Both tools turn metrics into notifications. The difference is mostly about where you already live.

Feature	Prometheus Alertmanager	Grafana alerting
Where rules live	In Prometheus config files (text)	In the Grafana UI (clickable)
Best for	Teams comfortable editing YAML	Teams who prefer a dashboard
Grouping & deduplication	Excellent, built-in	Good
Data sources	Alerts sent by Prometheus (or compatible senders)	Prometheus, Loki, MySQL, and more
Learning curve	Steeper	Gentler

When to use Alertmanager: you already run Prometheus, you want your alert rules version-controlled in Git, and you need smart grouping (e.g. “20 servers went down, send one message, not 20”).

When to use Grafana alerting: you want to click rather than edit YAML, or you pull metrics from several different databases.

If you already have Prometheus and Grafana running, pick one alerting path and stick with it. Running both at once is the fastest way to get duplicate, confusing pages.
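
The grouping idea mentioned above ("20 servers went down, send one message, not 20") can be illustrated roughly as follows. This is a simplified sketch, not Alertmanager's implementation: the group_alerts function and the web0..web19 instance names are hypothetical, and the real tool also applies timers before sending.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[dict]]:
    """Bucket alerts by the values of the `group_by` labels."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    # 20 hypothetical servers all reporting the same problem
    alerts = [{"alertname": "ServiceDown", "instance": f"web{i}:9100"} for i in range(20)]
    print(len(group_alerts(alerts, ["alertname"])))              # groups when keyed by alert name
    print(len(group_alerts(alerts, ["alertname", "instance"])))  # groups when also keyed by server
```

Each group becomes one notification, so the choice of grouping labels decides whether an outage produces one message or one per server.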

Writing alert rules in Prometheus

Prometheus (an open-source metrics database that scrapes numbers from your servers) reads alert rules from a YAML file. On Ubuntu 22.04/24.04 LTS, create the file:

sudo nano /etc/prometheus/alert.rules.yml

Here is a real, working set of three rules — high CPU, disk almost full, and a service that has gone down. These assume you have Node Exporter installed (the agent that exposes Linux server metrics to Prometheus).

groups:
  - name: server-health
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU has been above 90% for 5 minutes."

      - alert: DiskAlmostFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Less than 10% disk space left on root."

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target down: {{ $labels.instance }}"
          description: "Prometheus cannot reach this target."
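
To make the two arithmetic expressions above less opaque, here is a rough Python sketch of what they compute. It assumes a single CPU core and invented sample values; the real rate() function also handles counter resets and extrapolation, which this ignores. The function names busy_percent and free_percent are illustrative only.

```python
def busy_percent(idle_start: float, idle_end: float, window_seconds: float) -> float:
    """Approximate 100 - rate(idle counter) * 100 for one CPU core."""
    idle_rate = (idle_end - idle_start) / window_seconds  # idle seconds per second
    return 100 - idle_rate * 100


def free_percent(avail_bytes: float, size_bytes: float) -> float:
    """Approximate (avail / size) * 100 for one filesystem."""
    return avail_bytes / size_bytes * 100


if __name__ == "__main__":
    # Hypothetical: the idle counter grew by 21 s during a 300 s window
    cpu = busy_percent(1000.0, 1021.0, 300.0)
    print(round(cpu, 2), cpu > 90)     # busy share and whether HighCpuUsage's condition holds

    # Hypothetical: 8 GB free on a 100 GB root filesystem
    disk = free_percent(8e9, 100e9)
    print(round(disk, 2), disk < 10)   # free share and whether DiskAlmostFull's condition holds
```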

Tell Prometheus to load this file by adding it under rule_files in /etc/prometheus/prometheus.yml:

rule_files:
  - "/etc/prometheus/alert.rules.yml"

Then check that the rules file is valid and reload Prometheus:

sudo promtool check rules /etc/prometheus/alert.rules.yml
sudo systemctl reload prometheus

Output:

Checking /etc/prometheus/alert.rules.yml
  SUCCESS: 3 rules found

Open http://your-server:9090/alerts in a browser and you will see your three rules listed as Inactive, Pending (condition is true but the for timer is still counting), or Firing.
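
If you would rather check alert states from a script than a browser, Prometheus also exposes active alerts as JSON at /api/v1/alerts. The sketch below is one possible way to summarize that response; the helper names are made up, and the localhost:9090 address in the usage comment is an assumption about your setup. Note that this endpoint lists only pending and firing alerts, not inactive rules.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_alerts(base_url: str) -> dict:
    """GET <base_url>/api/v1/alerts and decode the JSON body."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)


def count_by_state(payload: dict) -> Counter:
    """Count active alerts per state ("pending" or "firing")."""
    return Counter(alert["state"] for alert in payload["data"]["alerts"])


# Usage, assuming Prometheus is reachable on localhost:9090:
#   print(count_by_state(fetch_alerts("http://localhost:9090")))
```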

Routing and notifications with Alertmanager

Prometheus only detects alerts. To actually send them, it hands firing alerts to Alertmanager, which decides where they go. Install it on Ubuntu:

sudo apt update
sudo apt install prometheus-alertmanager -y

Now configure a Slack and email destination in /etc/prometheus/alertmanager.yml:

route:
  receiver: team-slack
  group_by: ['alertname', 'instance']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: team-email

receivers:
  - name: team-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        title: "{{ .CommonAnnotations.summary }}"

  - name: team-email
    email_configs:
      - to: "oncall@example.com"                   # placeholder address
        from: "alertmanager@example.com"           # placeholder address
        smarthost: "smtp.example.com:587"
        auth_username: "alertmanager@example.com"  # placeholder address
        auth_password: "your-app-password"

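With this tree, alerts go to Slack by default, while anything labeled severity: critical matches the child route and goes to email instead. Alertmanager stops at the first matching child route unless that route sets continue: true, so critical alerts do not also reach Slack; to notify both, add a slack_configs entry to the team-email receiver. The group_by line bundles alerts that share the listed label values into one message. Because instance is in that list, each server still gets its own notification; drop instance if you want one message per alert name. group_wait is how long Alertmanager waits to collect related alerts before sending the first message for a new group, and repeat_interval is how long it waits before re-sending a notification for a problem that is still open.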
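
How Alertmanager walks a routing tree can be sketched as below. This is a simplified model under stated assumptions: it supports only exact label equality (real matchers also allow !=, =~ and !~), and the Route class and receivers_for function are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label -> required value
    routes: list = field(default_factory=list)  # child routes, checked in order
    continue_: bool = False                     # mirrors `continue` in the YAML

    def matches(self, labels: dict) -> bool:
        return all(labels.get(k) == v for k, v in self.match.items())


def receivers_for(route: Route, labels: dict) -> list[str]:
    """Return the receivers an alert with these labels would reach."""
    if not route.matches(labels):
        return []
    found: list[str] = []
    for child in route.routes:
        hit = receivers_for(child, labels)
        found.extend(hit)
        if hit and not child.continue_:
            break                        # first matching child wins
    return found or [route.receiver]     # no child matched: use this route


if __name__ == "__main__":
    # Same shape as the config above: default Slack, critical to email
    tree = Route("team-slack", routes=[Route("team-email", match={"severity": "critical"})])
    print(receivers_for(tree, {"severity": "warning"}))
    print(receivers_for(tree, {"severity": "critical"}))
```

In this model a critical alert reaches only team-email. Setting continue_ on the critical route and adding a catch-all child route for team-slack after it would deliver critical alerts to both receivers.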

Connect Prometheus to Alertmanager in prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Restart both services:

sudo systemctl restart prometheus prometheus-alertmanager
sudo systemctl status prometheus-alertmanager --no-pager

Output:

● prometheus-alertmanager.service - Prometheus Alertmanager
     Active: active (running) since Mon 2026-06-15 09:14:02 UTC
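
Once Alertmanager is running, one way to confirm that notifications actually arrive is to push a hand-made alert into its HTTP API (POST /api/v2/alerts) rather than waiting for a real problem. The sketch below is a minimal example under assumptions: the TestAlert name, the test:0 instance label, and the localhost:9093 address are placeholders, and the send function is not called automatically.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request, urlopen


def build_test_alert(now: datetime, minutes: int = 5) -> list[dict]:
    """Build a one-alert payload that expires after `minutes`."""
    return [{
        "labels": {"alertname": "TestAlert", "severity": "warning", "instance": "test:0"},
        "annotations": {"summary": "Test alert, please ignore"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send(alerts: list[dict], base_url: str = "http://localhost:9093") -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=5) as resp:
        return resp.status


# Usage, assuming Alertmanager is reachable on localhost:9093:
#   send(build_test_alert(datetime.now(timezone.utc)))
```

With the routing shown earlier, a warning-level test alert like this one should show up in the Slack channel after the group_wait delay.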

Routing in Grafana instead

If you prefer the UI, Grafana does the whole job in one place. In the Grafana web interface go to Alerting → Alert rules → New alert rule, pick your Prometheus data source, and write the same query (e.g. CPU above 90%). Set the pending period, which is Grafana’s name for the for duration (labeled For in older versions), then under Contact points add a Slack webhook URL or an email address. Notification policies are Grafana’s equivalent of Alertmanager routes: they decide which alerts reach which contact point based on labels like severity.

Actionable alerts, not noise

The number-one reason alerting fails is too many alerts. When every minor blip pages the team, people start ignoring them — and then miss the real outage. This is called alert fatigue, and it is dangerous.

A good rule of thumb: if an alert fires and there is nothing a human needs to do about it, it should not be an alert. Put that information on a dashboard instead.

Best Practices

Alert on symptoms users feel (site down, slow responses, disk full), not on every internal metric.
Always set a for duration so brief spikes do not page anyone.
Use severity labels and route critical to a louder channel (phone/email) than warning (Slack).
Write the description so the person woken up knows what to do, not just what happened.
Group related alerts so one incident sends one message, not fifty.
Store rule files in Git so changes are reviewed and reversible.
Periodically delete alerts nobody acts on — fewer, sharper alerts beat many ignored ones.