Uptime & Health-Check Monitoring
Metrics tell you how your server is doing, but they do not always tell you the one thing your users care about most: is the site actually up? Uptime monitoring answers that simple question by repeatedly asking your service “are you alive?” from the outside, the same way a real visitor would. If the answer is “no”, you want to know in seconds, not when an angry email arrives. This page covers health-check endpoints, blackbox (outside-in) probing, and the tools that do it for you.
What is a health check?
A health check is a special URL (an “endpoint”) in your application that exists only to report whether the app is working. When you visit it, the app runs a quick self-test and replies with an HTTP status code. HTTP status codes are three-digit numbers a web server returns: 200 means “OK”, anything in the 500 range means “the server is broken”.
A healthy app returns 200. A broken app returns 503 (Service Unavailable) or simply fails to respond at all. By convention this endpoint lives at a path like /health, /healthz, or /-/healthy.
There are two flavours:
| Check type | What it tests | When to use |
|---|---|---|
| Liveness | “Is the process running at all?” Returns 200 if the app can respond. | Detect crashed or frozen apps so they can be restarted. |
| Readiness | “Can the app serve real traffic?” Also checks the database, cache, etc. | Stop sending users to an app that started but cannot reach its database. |
Gotcha: Do not make your readiness check too heavy. If it runs a slow database query on every probe and you probe every 5 seconds, the health check itself can overload your database. Keep it fast and lightweight.
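To make the two flavours concrete, here is a minimal sketch of an app exposing both, using only Python's standard library. The /readyz path, the 5-second cache window, port 8080, and the check_dependencies placeholder are all assumptions chosen for illustration, not part of any particular framework. The cache is one possible way to keep frequent probes from hitting the database every time.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

CACHE_SECONDS = 5.0  # assumption: reuse one readiness result for this long


def check_dependencies() -> bool:
    """Hypothetical placeholder: swap in a cheap check such as `SELECT 1`."""
    return True


class ReadinessCache:
    """Remembers the last dependency check so probes stay cheap."""

    def __init__(self, check, ttl=CACHE_SECONDS, clock=time.monotonic):
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._expires = float("-inf")  # forces a real check on first use
        self._ready = False

    def is_ready(self) -> bool:
        with self._lock:
            now = self._clock()
            if now >= self._expires:
                try:
                    self._ready = bool(self._check())
                except Exception:
                    self._ready = False  # a failing check means "not ready"
                self._expires = now + self._ttl
            return self._ready


readiness = ReadinessCache(check_dependencies)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: answering at all proves the process is running.
            self._reply(200, "ok\n")
        elif self.path == "/readyz":
            # Readiness: 503 tells the load balancer to stop sending traffic.
            ready = readiness.is_ready()
            self._reply(200 if ready else 503, "ready\n" if ready else "not ready\n")
        else:
            self._reply(404, "not found\n")

    def _reply(self, status, body):
        data = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep probe hits out of the log


def make_server(host="127.0.0.1", port=8080):
    """Build the server; call .serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Running `make_server().serve_forever()` starts the sketch locally. With a probe every 5 seconds and a 5-second cache, the dependency check runs at most about once per cache window, no matter how many probers hit /readyz.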
Adding a simple health endpoint with Nginx
If you just want a static “I’m up” signal from your web server, Nginx (a popular web server and reverse proxy — a server that sits in front of your app and forwards requests to it) can serve one without touching your app code.
Edit your site config:
sudo nano /etc/nginx/sites-available/default
Add a location block inside the server { } section:
location = /healthz {
    access_log off;           # don't spam the log with probe hits
    default_type text/plain;  # Content-Type for the body produced by "return"
    return 200 "ok\n";
}
Test the config and reload Nginx:
sudo nginx -t
sudo systemctl reload nginx
Output:
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
Now confirm it works:
curl -i http://localhost/healthz
Output:
HTTP/1.1 200 OK
Server: nginx/1.24.0 (Ubuntu)
Content-Type: text/plain
Content-Length: 3
ok
Blackbox monitoring vs whitebox monitoring
The tools from the metrics pages (like Node Exporter) do whitebox monitoring — they run inside your server and report internal details like CPU and memory. Uptime monitoring is blackbox monitoring: it probes your service from outside, knowing nothing about its internals, exactly like a real user. Both matter, and they catch different problems.
| | Whitebox (metrics) | Blackbox (uptime) |
|---|---|---|
| Vantage point | Inside the server | Outside, over the network |
| Answers | “Why is it slow?” | “Is it reachable at all?” |
| Catches | High CPU, memory leaks | DNS failures, expired TLS certs, dead network |
| Tools | Node Exporter, Prometheus | Blackbox Exporter, Uptime Kuma |
A server can report perfect internal metrics and still be unreachable because of a firewall mistake or an expired certificate. Only an outside probe catches that.
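As a rough illustration of what an outside probe does, the sketch below requests a URL with Python's standard library and classifies the result. The function names, the 5-second timeout, and the "only 2xx counts as up" rule are assumptions for this example; real tools such as Blackbox Exporter do considerably more.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_up(status) -> bool:
    """Treat only a 2xx answer as up; no answer at all (None) is down."""
    return status is not None and 200 <= status < 300


def probe(url: str, timeout: float = 5.0) -> dict:
    """Request `url` the way a visitor would and report what happened."""
    started = time.monotonic()
    status = None
    error = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        status = exc.code
        error = str(exc)
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # No usable answer: DNS failure, refused connection, TLS error, timeout.
        error = str(exc)
    return {
        "url": url,
        "up": is_up(status),
        "status": status,
        "error": error,
        "seconds": round(time.monotonic() - started, 3),
    }
```

Calling `probe("https://example.com")` from a machine outside your network exercises DNS, routing, the firewall, and TLS in one go, which is exactly the set of failures internal metrics cannot see.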
Prometheus Blackbox Exporter
The Blackbox Exporter is an official Prometheus tool that probes endpoints over HTTP, HTTPS, TCP, DNS, and ICMP (ping) and reports the results as metrics Prometheus can scrape and alert on. Use it when you already run Prometheus and want uptime data in the same dashboards and alerts as everything else.
Install it via Docker (the simplest path on Ubuntu):
sudo docker run -d \
  --name blackbox_exporter \
  -p 9115:9115 \
  prom/blackbox-exporter:latest
Probe a target manually to test:
curl "http://localhost:9115/probe?target=https://example.com&module=http_2xx"
Output:
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_http_status_code 200
probe_http_ssl 1
probe_ssl_earliest_cert_expiry 1.78e+09
probe_success 1 means up; 0 means down. Notice probe_ssl_earliest_cert_expiry — you can alert before a TLS certificate expires.
Wire it into Prometheus by adding this job under the scrape_configs: section of /etc/prometheus/prometheus.yml (indentation matters in YAML):
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]   # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/healthz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # the exporter does the probing
Reload Prometheus to apply:
sudo systemctl reload prometheus
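The relabel rules can look cryptic, so here is a small sketch of their net effect: each listed target becomes a `target` query parameter, and the request itself goes to the exporter. The function name and defaults below are illustrative assumptions; Prometheus does this internally, not via Python.

```python
from urllib.parse import urlencode


def scrape_url(target: str, module: str = "http_2xx",
               exporter: str = "localhost:9115") -> str:
    """Build the URL that ends up being scraped for one probe target."""
    # __param_target <- original address; __address__ <- the exporter.
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


for site in ["https://example.com", "https://api.example.com/healthz"]:
    print(scrape_url(site))
```

This is the same kind of request as the manual curl test earlier, with the target URL percent-encoded. The second relabel rule copies the target into the instance label, so dashboards show the probed site rather than localhost:9115.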
Uptime Kuma — the easy all-in-one option
If you do not want to run Prometheus at all, Uptime Kuma is a free, self-hosted uptime monitor with a friendly web dashboard, built-in alerting (email, Slack, Telegram, Discord, and more), and public status pages. It is the fastest way to get “is it up?” monitoring on a small server.
sudo docker run -d \
  --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:1
Open http://your-server-ip:3001 in a browser, create your admin account, then click Add New Monitor, paste a URL, set the interval, and you are done. Remember to open the port if ufw (Ubuntu’s simple firewall) is active:
sudo ufw allow 3001/tcp
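Security tip: Do not leave Uptime Kuma exposed to the whole internet on a raw port. Put it behind Nginx with HTTPS and a login, or restrict who can reach the port. Be aware that ports published by Docker are normally opened through Docker's own iptables rules and can stay reachable even when ufw denies them, so binding the container to 127.0.0.1 (for example -p 127.0.0.1:3001:3001) and proxying through Nginx is the more dependable option.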
Probe from outside your own network
Here is the critical limitation: if the probe runs on the same server as the app, and that whole server dies, the probe dies too, and you get no alert. To catch a total outage you need at least one probe running somewhere else: a second cheap VPS, or an external service like UptimeRobot, Better Stack, or Pingdom. Many offer a free tier, though check intervals and the number of probe regions vary by provider and plan.
Best Practices
- Expose a lightweight /healthz (liveness) and a deeper /readyz (readiness) endpoint; turn off access logging for them.
- Always run at least one probe from outside the server being monitored, so a full host failure still pages you.
- Monitor TLS certificate expiry with Blackbox Exporter and alert at least 14 days before it expires.
- Match the probe interval to how fast you must react — 30-60 seconds is sensible for most sites.
- Probe the real user-facing URL (through DNS and the load balancer), not just localhost, so you catch DNS and routing failures.
- Combine uptime checks with metrics: uptime tells you that it broke, metrics tell you why.
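To see how the probe interval translates into reaction time, the sketch below computes a rough upper bound on how long an outage can go unnoticed. It assumes probes start at a fixed interval, and it introduces two parameters that are not discussed above: a per-probe timeout, and a number of consecutive failures required before alerting (a setting many tools offer to avoid false alarms).

```python
def worst_case_detection_seconds(interval: float, timeout: float,
                                 failures_required: int = 1) -> float:
    """Upper bound on seconds from outage start to the alert condition.

    Worst case: the outage begins right after a successful probe starts,
    so the Nth failing probe starts N intervals later and may take up to
    `timeout` seconds to give up.
    """
    if failures_required < 1:
        raise ValueError("failures_required must be at least 1")
    return interval * failures_required + timeout


# A 60-second interval with a 10-second timeout, alerting on first failure.
single = worst_case_detection_seconds(60, 10)
# The same probe, but waiting for three failures in a row.
triple = worst_case_detection_seconds(60, 10, failures_required=3)
print(single, triple)
```

Under these assumptions the first case is 70 seconds and the second 190 seconds. If that is too slow for your site, shorten the interval before lowering the failure threshold.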