Monitoring vs Observability

“Monitoring” and “observability” get thrown around as if they mean the same thing, but they answer different questions. Monitoring tells you whether something is broken. Observability helps you figure out why it broke — especially when the failure is something you never expected. On a real server, you need both, and understanding the split helps you pick the right tool and stop staring at a dashboard that has no answer for the problem in front of you.

The simplest way to think about it

Monitoring watches things you already decided to watch. You pick some metrics (a metric is just a number measured over time, like CPU usage or requests per second), you draw a graph, and you set an alert (a rule that pings you when a number crosses a line). It is great for known unknowns — problems you can predict, like “the disk might fill up” or “the site might go down.”

Observability is about asking new questions of your system without shipping new code first. You collect rich, detailed data (telemetry — the signals your system emits about what it is doing), and when something weird happens, you slice and filter that data to chase down the cause. It is built for unknown unknowns — failures nobody thought to put on a dashboard, like “checkout is slow, but only for users in Germany paying with one specific card type.”

A quick rule of thumb: if you can answer the question by looking at an existing dashboard, that is monitoring. If answering it means digging through raw data in new ways, that is observability.

Known unknowns vs unknown unknowns

This is the heart of the difference, so it is worth being concrete.

Known unknown: “Will the server run out of memory?” You know this can happen, you just do not know when. So you graph memory and alert at 90%. That is monitoring.
Unknown unknown: “Why did 3% of requests return a 500 error for 90 seconds at 2am, then stop?” Nobody built a dashboard for “this exact thing,” because nobody knew it would happen. You need to explore the data to find out. That is observability.

Traditional monitoring struggles with the second case because it can only answer the questions you thought to ask in advance.
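The known-unknown case above (graph memory, alert at 90%) can be sketched as a small threshold check. This is a minimal illustration, not a replacement for a real alerting system: the function names, the 90% default, and the message format are assumptions made for the example, and it reads the MemTotal and MemAvailable fields from /proc/meminfo on Linux.

```shell
#!/usr/bin/env bash
# Sketch of a known-unknown check: alert when memory use crosses a fixed line.
# Function names, the 90% default, and the message format are illustrative.

# Print the used-memory percentage from a meminfo-style file
# (defaults to /proc/meminfo).
mem_used_percent() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1
      printf "%d\n", (total - avail) * 100 / total
    }
  ' "$meminfo"
}

# Compare a percentage against a limit; print one line and
# return non-zero when the limit is reached.
check_threshold() {
  local value="$1" limit="${2:-90}"
  if [ "$value" -ge "$limit" ]; then
    echo "ALERT: memory at ${value}% (limit ${limit}%)"
    return 1
  fi
  echo "OK: memory at ${value}% (limit ${limit}%)"
}

# Run the check only where /proc/meminfo is readable (Linux).
if [ -r /proc/meminfo ] && pct="$(mem_used_percent)"; then
  check_threshold "$pct" 90 || true
fi
```

Notice what the script cannot do: it answers exactly one question, chosen ahead of time. It has nothing to say about the 2am error burst.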

A side-by-side comparison

Aspect	Monitoring	Observability
Core question	“Is it working?”	“Why is it behaving this way?”
Best for	Known unknowns (predictable issues)	Unknown unknowns (surprises)
Data shape	Pre-aggregated metrics, fixed dashboards	High-detail metrics, logs, and traces you can query freely
Typical action	Get alerted, glance at a graph	Filter, group, and drill down to find a root cause
Example tools	Prometheus + Grafana alerts, uptime checks	The same metrics, plus structured logs and distributed traces
Where it falls short	The problem is one you never anticipated	The detailed data was never collected or has already been discarded

Notice the tools overlap. Observability is not a different product you buy instead of monitoring — it is a richer practice that often uses the same tools, just with more detailed data and the ability to query it.

The three signals (telemetry)

Observability usually rests on three kinds of telemetry. You will see these called the “three pillars.”

Metrics — numbers over time (CPU, request rate, error count). Cheap to store, great for dashboards and alerts.
Logs — timestamped text records of events (“user 42 logged in”, “payment failed: timeout”). Detailed, good for the why.
Traces — a record of one request as it travels through every service it touches, with timing for each hop. Great for “where is the slowness?”

Monitoring leans heavily on metrics. Observability uses all three together so you can pivot from “an alert fired” to “here is the exact request that broke and the line of log that explains it.”
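To make the trace idea concrete, here is a rough sketch of answering “where is the slowness?” for one request. The span records, trace IDs, and service names below are all made up, and the three-column text format is a stand-in for illustration only; real tracing systems store much richer data and have their own query tools.

```shell
# Hypothetical span records: trace_id, service, duration in milliseconds.
spans='t-1001 gateway 12
t-1001 checkout 48
t-1001 payments 1930
t-1001 inventory 22
t-2002 gateway 11
t-2002 checkout 51'

# Print the slowest hop of one trace as "service duration_ms".
slowest_hop() {
  local trace_id="$1"
  printf '%s\n' "$spans" \
    | awk -v id="$trace_id" '$1 == id { print $2, $3 }' \
    | sort -k2,2nr \
    | head -n 1
}

slowest_hop t-1001
```

For trace t-1001 this prints the payments hop, which is the kind of answer a metric like “average latency” cannot give you on its own.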

See both in action on Ubuntu

You do not need fancy tooling to feel the difference. Here is monitoring — a quick, repeatable health check using built-in tools on Ubuntu 22.04 / 24.04.

# Known-unknown check: is the disk filling up? Is a service up?
df -h /
systemctl is-active nginx

Output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        78G   61G   13G  83% /
active

That answers a pre-planned question: “disk and nginx — are they OK?” Clean and fast.

Now here is the observability style — exploring rich data to answer a question you did not plan for. Say errors spiked at 2am. You query the raw access log (a file Nginx writes for every request) to slice it in a new way.

# Unknown-unknown hunt: which URLs returned 500 errors, and how often?
# Assumes Nginx's default "combined" log format: field 9 is the status, field 7 the path.
sudo awk '$9 == 500 {print $7}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head

Output:

     47 /api/checkout
      3 /api/profile
      1 /health

You just discovered that almost all the errors hit /api/checkout — a question no dashboard was built to answer. That is observability in spirit: asking new questions of detailed telemetry. Real systems do this with structured logs and trace queries instead of awk, but the mindset is identical.
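A natural follow-up question is “when exactly did those errors happen?” The sketch below slices the same kind of log by minute instead of by URL. It assumes Nginx's default combined log format, and the sample lines (IP addresses, timestamps, user agent) are invented so the example runs without touching a real log.

```shell
# Sample lines in Nginx's default "combined" format (made-up data).
log="$(mktemp)"
cat > "$log" <<'EOF'
203.0.113.5 - - [10/Oct/2024:02:00:13 +0000] "POST /api/checkout HTTP/1.1" 500 162 "-" "curl/8.5.0"
203.0.113.9 - - [10/Oct/2024:02:00:41 +0000] "POST /api/checkout HTTP/1.1" 500 162 "-" "curl/8.5.0"
198.51.100.7 - - [10/Oct/2024:02:01:02 +0000] "GET /api/profile HTTP/1.1" 200 512 "-" "curl/8.5.0"
203.0.113.5 - - [10/Oct/2024:02:01:20 +0000] "POST /api/checkout HTTP/1.1" 500 162 "-" "curl/8.5.0"
198.51.100.7 - - [10/Oct/2024:02:02:05 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/8.5.0"
EOF

# Count 500 responses per minute. Field 9 is the status; field 4 is the
# timestamp, e.g. "[10/Oct/2024:02:00:13", trimmed here to day, hour, minute.
# Reads the file given as $1, or stdin when no file is given.
errors_per_minute() {
  awk '$9 == 500 { print substr($4, 2, 17) }' "${1:--}" | sort | uniq -c
}

errors_per_minute "$log"
rm -f "$log"
```

With the sample data, the counts show two errors in the 02:00 minute and one in 02:01. The plain sort is only chronological within a single day, which is fine for a short incident window. On a real server you would point the function at /var/log/nginx/access.log (with sudo if needed).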

Gotcha: do not delete or aggressively truncate raw logs to save space without keeping a copy. Aggregated metrics are cheap, but once you throw away the detailed logs and traces, you also throw away your ability to investigate the surprises. Configure log rotation (/etc/logrotate.d/) thoughtfully rather than rm-ing log files.
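As one way to act on that advice, the sketch below writes a draft rotation policy to a scratch file so you can review it before it goes anywhere near /etc. The 30-day retention, the draft file name, and the /run/nginx.pid path are assumptions for illustration; check them against your own setup.

```shell
# Write a draft logrotate policy to a scratch file for review (not to /etc).
# The 30-day retention and the pid file path are assumptions.
write_rotation_draft() {
  cat > "$1" <<'EOF'
/var/log/nginx/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        # Ask Nginx to reopen its log files after rotation.
        [ ! -s /run/nginx.pid ] || kill -USR1 "$(cat /run/nginx.pid)"
    endscript
}
EOF
}

draft="${TMPDIR:-/tmp}/nginx-rotation-draft.conf"
write_rotation_draft "$draft"
cat "$draft"
```

Ubuntu's nginx package typically installs its own /etc/logrotate.d/nginx already, so in practice you would adjust the retention in that file rather than add a second policy covering the same paths. logrotate's -d flag runs in debug mode without changing any logs, which is one way to preview a policy; logrotate can be strict about config file ownership and permissions, so a draft may need to be owned by root before it is accepted.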

When to use which

Use monitoring to answer health questions you can predict: uptime, disk, memory, error rate, latency thresholds. Wire these to alerts so a human gets paged before users notice.
Use observability when an alert fires and the dashboard does not explain why, or when a problem is intermittent, user-specific, or brand new.
Do not treat them as either/or. Monitoring tells you the house is on fire; observability helps you find which room.

Best practices

Start with solid monitoring (metrics + alerts) for the obvious failure modes. It is cheap and catches many common outages early.
Emit structured logs (key-value or JSON) instead of plain prose so you can filter and group them later.
Keep enough log and trace detail to investigate surprises; tune log rotation rather than discarding data outright.
Make alerts actionable — every alert should map to “a human can do something about this,” or it becomes noise.
Use consistent labels (service, environment, host) across metrics, logs, and traces so you can pivot between them.
Practice asking new questions of your data before an incident, so you are not learning the query syntax at 2am.
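To show why structured logs and consistent labels pay off, here is a small sketch that groups key=value log lines by any two fields. The log lines, field names (service, env, country, card, status), and the count_by helper are all hypothetical; the point is only that structured fields let you ask a new question, such as “which country and card type see the 500s?”, without writing a custom parser first.

```shell
# Hypothetical structured app log: one event per line, key=value fields.
events='ts=2024-10-10T02:00:13Z service=checkout env=prod country=DE card=amex status=500
ts=2024-10-10T02:00:20Z service=checkout env=prod country=DE card=visa status=200
ts=2024-10-10T02:00:41Z service=checkout env=prod country=FR card=amex status=200
ts=2024-10-10T02:01:20Z service=checkout env=prod country=DE card=amex status=500'

# Count events with a given status, grouped by two keys.
# Usage: count_by <status> <key1> <key2>   (reads key=value lines on stdin)
count_by() {
  awk -v want="$1" -v k1="$2" -v k2="$3" '
    {
      split("", kv)                      # reset the per-line map
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq > 0) kv[substr($i, 1, eq - 1)] = substr($i, eq + 1)
      }
      if (kv["status"] == want) count[kv[k1] " " kv[k2]]++
    }
    END { for (group in count) print count[group], group }
  ' | sort -k1,1nr
}

printf '%s\n' "$events" | count_by 500 country card
```

With the sample data this prints a single group: two errors for DE with amex. Swapping the keys (for example, service and env) asks a different question of the same data, which is the habit worth practicing before an incident.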