Uptime & Health-Check Monitoring

Metrics tell you how your server is doing, but they do not always tell you the one thing your users care about most: is the site actually up? Uptime monitoring answers that simple question by repeatedly asking your service “are you alive?” from the outside, the same way a real visitor would. If the answer is “no”, you want to know in seconds, not when an angry email arrives. This page covers health-check endpoints, blackbox (outside-in) probing, and the tools that do it for you.

What is a health check?

A health check is a special URL (an “endpoint”) in your application that exists only to report whether the app is working. When a client requests it, the app runs a quick self-test and replies with an HTTP status code. HTTP status codes are three-digit numbers a web server returns: 200 means “OK”, and anything in the 500 range means the server failed to handle the request.

A healthy app returns 200. A broken app returns 503 (Service Unavailable) or simply fails to respond at all. By convention this endpoint lives at a path like /health, /healthz, or /-/healthy.
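As a rough illustration of the 200/503 convention, here is a minimal sketch of a health endpoint using only Python's standard library. The `self_test` function, the `/healthz` path choice, the response bodies, and port 8080 are assumptions made for this example; a real app would plug in its own check and framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def self_test() -> bool:
    """Hypothetical self-test; replace with a real check of your app."""
    return True


def health_response(healthy: bool) -> Tuple[int, bytes]:
    """Map the self-test result to an HTTP status code and body."""
    if healthy:
        return 200, b"ok\n"
    return 503, b"unavailable\n"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_response(self_test())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep probe hits out of the log


def serve(port: int = 8080) -> None:
    """Blocks forever; the port is an arbitrary choice for this sketch."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server; `curl -i http://127.0.0.1:8080/healthz` should then show the status code chosen by `health_response`.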

There are two flavours:

Check type	What it tests	When to use
Liveness	“Is the process running at all?” Returns 200 if the app can respond.	Detect crashed or frozen apps so they can be restarted.
Readiness	“Can the app serve real traffic?” Also checks the database, cache, etc.	Stop sending users to an app that started but cannot reach its database.

Gotcha: Do not make your readiness check too heavy. If it runs a slow database query on every probe and you probe every 5 seconds, the health check itself can overload your database. Keep it fast and lightweight.
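One common way to keep a readiness check cheap is to cache the result of the expensive dependency check for a few seconds, so frequent probes reuse it. The sketch below assumes a hypothetical `check` callable (for example, a quick database ping) and a 10-second lifetime; both are illustrative choices, not requirements.

```python
import time
from typing import Callable


class CachedCheck:
    """Wrap an expensive dependency check and reuse its result for ttl seconds."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._expires = float("-inf")  # forces a real check on first call
        self._result = False

    def __call__(self) -> bool:
        now = self._clock()
        if now >= self._expires:
            try:
                self._result = bool(self._check())
            except Exception:
                self._result = False  # a failing dependency means "not ready"
            self._expires = now + self._ttl
        return self._result
```

With this wrapper, probing every 5 seconds runs the underlying check at most once per `ttl` window. The trade-off is that a dependency failure may take up to `ttl` seconds to show up in the readiness result.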

Adding a simple health endpoint with Nginx

If you just want a static “I’m up” signal from your web server, Nginx (a popular web server and reverse proxy — a server that sits in front of your app and forwards requests to it) can serve one without touching your app code.

Edit your site config:

sudo nano /etc/nginx/sites-available/default

Add a location block inside the server { } section:

location = /healthz {
    access_log off;          # don't spam the log with probe hits
    default_type text/plain;   # sets Content-Type; add_header would send a second one
    return 200 "ok\n";
}

Test the config and reload Nginx:

sudo nginx -t
sudo systemctl reload nginx

Output:

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Now confirm it works:

curl -i http://localhost/healthz

Output:

HTTP/1.1 200 OK
Server: nginx/1.24.0 (Ubuntu)
Content-Type: text/plain
Content-Length: 3

ok

Blackbox monitoring vs whitebox monitoring

The tools from the metrics pages (like Node Exporter) do whitebox monitoring — they run inside your server and report internal details like CPU and memory. Uptime monitoring is blackbox monitoring: it probes your service from outside, knowing nothing about its internals, exactly like a real user. Both matter, and they catch different problems.

	Whitebox (metrics)	Blackbox (uptime)
Vantage point	Inside the server	Outside, over the network
Answers	“Why is it slow?”	“Is it reachable at all?”
Catches	High CPU, memory leaks	DNS failures, expired TLS certs, dead network
Tools	Node Exporter, Prometheus	Blackbox Exporter, Uptime Kuma

A server can report perfect internal metrics and still be unreachable because of a firewall mistake or an expired certificate. Only an outside probe catches that.

Prometheus Blackbox Exporter

The Blackbox Exporter is an official Prometheus tool that probes endpoints over HTTP, HTTPS, TCP, DNS, and ICMP (ping) and reports the results as metrics Prometheus can scrape and alert on. Use it when you already run Prometheus and want uptime data in the same dashboards and alerts as everything else.

Install it via Docker (the simplest path on Ubuntu):

sudo docker run -d \
  --name blackbox_exporter \
  -p 9115:9115 \
  prom/blackbox-exporter:latest

Probe a target manually to test:

curl "http://localhost:9115/probe?target=https://example.com&module=http_2xx"

Output:

# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_http_status_code 200
probe_http_ssl 1
probe_ssl_earliest_cert_expiry 1.78e+09

probe_success 1 means up; 0 means down. Notice probe_ssl_earliest_cert_expiry — you can alert before a TLS certificate expires.
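To make the expiry metric concrete: its value is a Unix timestamp in seconds, so subtracting the current time gives the remaining lifetime of the certificate. The sketch below parses the sample output shown above and applies the 14-day threshold from this page's best practices. The function names are made up for this example, and the parser only handles simple unlabelled `name value` lines.

```python
from typing import Dict

CERT_EXPIRY = "probe_ssl_earliest_cert_expiry"

SAMPLE = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_http_status_code 200
probe_http_ssl 1
probe_ssl_earliest_cert_expiry 1.78e+09
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse unlabelled 'name value' lines; skip comments, blanks, labelled series."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            metrics[parts[0]] = float(parts[1])
    return metrics


def days_until_expiry(metrics: Dict[str, float], now: float) -> float:
    """Days between 'now' (Unix seconds) and the earliest certificate expiry."""
    return (metrics[CERT_EXPIRY] - now) / 86400.0


def cert_expiring_soon(
    metrics: Dict[str, float], now: float, warn_days: float = 14.0
) -> bool:
    return days_until_expiry(metrics, now) < warn_days
```

In practice you would pass `time.time()` as `now`. If you run Prometheus, an alert rule comparing the same metric against the current time is usually a better fit than a script.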

Wire it into Prometheus by adding this job under the scrape_configs: section of /etc/prometheus/prometheus.yml:

  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]        # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/healthz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # the exporter does the probing

Reload Prometheus to apply:

sudo systemctl reload prometheus
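The relabel_configs rules in the job above can look cryptic. In effect they move each listed target into the `target` query parameter, keep it as the `instance` label, and point the actual scrape at the exporter. The following sketch builds the URL that results; `probe_url` is a hypothetical helper written for illustration, not part of Prometheus.

```python
from urllib.parse import urlencode


def probe_url(exporter: str, target: str, module: str = "http_2xx") -> str:
    """Build the exporter URL that a scrape of 'target' is rewritten to."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


print(probe_url("localhost:9115", "https://example.com"))
# prints http://localhost:9115/probe?module=http_2xx&target=https%3A%2F%2Fexample.com
```

This is the same request as the manual curl test earlier on this page, only with the target URL percent-encoded.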

Uptime Kuma — the easy all-in-one option

If you do not want to run Prometheus at all, Uptime Kuma is a free, self-hosted uptime monitor with a friendly web dashboard, built-in alerting (email, Slack, Telegram, Discord, and more), and public status pages. It is the fastest way to get “is it up?” monitoring on a small server.

sudo docker run -d \
  --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:1

Open http://your-server-ip:3001 in a browser, create your admin account, then click Add New Monitor, paste a URL, set the interval, and you are done. Remember to open the port if ufw (Ubuntu’s simple firewall) is active:

sudo ufw allow 3001/tcp

Security tip: Do not leave Uptime Kuma exposed to the whole internet on a raw port. Put it behind Nginx with HTTPS and a login, or restrict the port to your IP with ufw.

Probe from outside your own network

Here is the critical limitation: if the probe runs on the same server as the app and that whole server dies, the probe dies too, and you get no alert. To catch a total outage you need at least one probe running somewhere else: a second cheap VPS, or an external service like UptimeRobot, Better Stack, or Pingdom. Many offer a free tier, though check intervals and the number of probe regions vary by provider and plan.
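If you go the second-VPS route, the external probe can be very small. Below is a minimal sketch, assuming Python 3 on the probing machine; the function names and the 5-second timeout are illustrative. Scheduling (for example with cron) and alert delivery are left out.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None  # DNS failure, refused connection, TLS error, timeout


def is_healthy(status: Optional[int]) -> bool:
    """Treat any 2xx response as up; everything else as down."""
    return status is not None and 200 <= status < 300
```

A caller would run something like `is_healthy(probe("https://example.com/healthz"))` on a schedule and send a notification when it returns `False`. Note that `urlopen` follows redirects, so the status reported is that of the final response.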

Best Practices

Expose a lightweight /healthz (liveness) and a deeper /readyz (readiness) endpoint; turn off access logging for them.
Always run at least one probe from outside the server being monitored, so a full host failure still pages you.
Monitor TLS certificate expiry with Blackbox Exporter and alert at least 14 days before it expires.
Match the probe interval to how fast you must react — 30-60 seconds is sensible for most sites.
Probe the real user-facing URL (through DNS and the load balancer), not just localhost, so you catch DNS and routing failures.
Combine uptime checks with metrics: uptime tells you that it broke, metrics tell you why.
