What is Observability?

When you run software on a server, things break. A website slows down, a database stops responding, or a server runs out of memory at 3 a.m. Observability is the practice of being able to look at your running systems and understand exactly what they are doing — and why they are doing it — from the outside. The core idea: you cannot operate, fix, or improve a system you cannot see. This page defines observability in plain terms, explains why it matters, and introduces the three pillars (metrics, logs, and traces) that the rest of this section builds on.

What observability actually means

The word comes from control theory (a branch of engineering about steering systems). A system is “observable” if you can figure out its internal state just by looking at the outputs it produces. For a Linux server, those outputs are signals like CPU usage, error messages in log files, and timing data from requests.

In everyday terms: observability is having enough information coming out of your systems that, when something goes wrong, you can answer questions you did not think to ask in advance. That last part is the key. Old-school monitoring answers “Is the server up?” — a question you wrote down ahead of time. Observability lets you answer “Why did checkout requests from European users get slow for ten minutes last Tuesday?” — a question nobody predicted.

A quick distinction: monitoring is watching things you already decided to watch. Observability is the property of your system that makes deep investigation possible at all. You monitor using the data an observable system emits. The next page covers this difference in detail.

Why you can’t run systems you can’t see

Imagine driving a car with the windshield painted over and no dashboard. You have no speedometer, no fuel gauge, no warning lights. You could drive for a while, but the first time something went wrong you would have no idea what or where. Running a production server without observability is exactly that.

Without visibility into your systems, you end up:

Finding out about outages from your users, not your tools — the worst way to learn your site is down.
Guessing during incidents. Restarting random services and hoping. This is slow and often makes things worse.
Unable to spot trends. Disk slowly filling up over weeks, memory leaking, traffic growing — all invisible until they cause a crash.
Flying blind after deploys. You ship new code and have no way to tell if it made things better or worse.

With good observability you get the opposite: early warnings before users notice, fast root-cause analysis during incidents, and data to prove whether a change helped.

The three pillars

Observability is commonly built from three kinds of data, often called the three pillars. Each answers a different type of question. You usually want all three, because they complement each other.

Pillar	What it is	Best at answering	Example
Metrics	Numbers measured over time (time-series data)	“How much / how many / how fast?” and “Is it getting worse?”	CPU at 87%, 1,200 requests/sec, 2% error rate
Logs	Timestamped text records of discrete events	“What exactly happened, and when?”	`ERROR 14:02:11 payment failed: timeout connecting to db`
Traces	The path of a single request across services	“Where in the request did the time go?”	Checkout took 900ms: 50ms app, 850ms in the database call

Metrics

A metric is a number sampled at regular intervals, for example CPU usage recorded every 15 seconds. Because metrics are just numbers, they are cheap to store and fast to chart, which makes them excellent for dashboards and alerts (“page me if error rate goes above 5%”). Their weakness: a metric tells you that something changed, but not why. Tools like Prometheus (a metrics collector with its own time-series database) gather these; Grafana (a dashboard tool) graphs them.
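As a minimal sketch of the idea, the snippet below treats a metric as a series of samples taken every 15 seconds and evaluates the alert rule quoted above. The sample data, the `Sample` class, and the `error_rate` and `should_page` helpers are illustrative assumptions; in a real setup a tool such as Prometheus would do the collecting and the rule evaluation.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Counts observed during one 15-second collection interval."""
    timestamp: int   # seconds since the start of the window
    requests: int    # total requests in the interval
    errors: int      # failed requests in the interval


def error_rate(sample: Sample) -> float:
    """Return the fraction of requests that failed (0.0 when idle)."""
    if sample.requests == 0:
        return 0.0
    return sample.errors / sample.requests


def should_page(samples: list[Sample], threshold: float = 0.05) -> bool:
    """Page only if every sample in the window is above the threshold.

    Requiring the whole window to be bad avoids paging on a single spike.
    """
    return bool(samples) and all(error_rate(s) > threshold for s in samples)


# Hypothetical data: four samples taken 15 seconds apart.
window = [
    Sample(0, 1200, 24),    # 2% errors
    Sample(15, 1180, 71),   # about 6%
    Sample(30, 1210, 85),   # about 7%
    Sample(45, 1195, 90),   # about 7.5%
]

print(should_page(window))       # False: the first sample is below 5%
print(should_page(window[1:]))   # True: the last three are all above 5%
```

Note what the numbers do not say: they show that the error rate climbed, but nothing in the series explains which requests failed or why.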

Logs

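A log is a line of text your application or the operating system writes whenever something notable happens. On Ubuntu, system and service logs are collected by journald (the systemd logging service) and read with journalctl, while traditional plain-text log files, including those many applications write themselves, live under /var/log. You read recent system logs like this: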

sudo journalctl -xe --since "10 minutes ago"

Output:

Jun 15 14:02:11 web01 nginx[812]: 502 Bad Gateway upstream timed out
Jun 15 14:02:11 web01 myapp[940]: ERROR payment failed: db connection refused
Jun 15 14:02:12 web01 systemd[1]: myapp.service: Main process exited, code=killed

Logs give you rich detail and context, but they are bulky and harder to summarize at a glance. They are how you confirm the exact sequence of events during an incident.
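One way to make lines like the ones above easier to summarize is to parse them into fields. The sketch below assumes the short line format shown in the example output; the regular expression and the `parse_line` helper are hypothetical and written for this page, not part of any tool.

```python
import re
from collections import Counter
from typing import Optional

# "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"(?P<process>[^\[\s:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)

# The three example lines from the output above.
SAMPLE = """\
Jun 15 14:02:11 web01 nginx[812]: 502 Bad Gateway upstream timed out
Jun 15 14:02:11 web01 myapp[940]: ERROR payment failed: db connection refused
Jun 15 14:02:12 web01 systemd[1]: myapp.service: Main process exited, code=killed
"""


def parse_line(line: str) -> Optional[dict]:
    """Split one log line into named fields, or return None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


# Keep only the lines that matched the expected format.
events = [e for e in map(parse_line, SAMPLE.splitlines()) if e]
per_process = Counter(e["process"] for e in events)
errors = [e for e in events if "ERROR" in e["message"]]

# Print the sequence of events, then two quick summaries.
for e in events:
    print(e["time"], e["process"], "->", e["message"])
print(dict(per_process))
print(len(errors), "line(s) marked ERROR")
```

If you would rather not maintain a pattern by hand, journalctl -o json prints each entry as a JSON object with the same kind of fields.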

Traces

A trace follows one single request as it travels through your system. If a user clicks “Buy”, that request might hit a web server, then your app, then a database. A trace breaks that journey into timed segments (called spans) so you can see precisely which step was slow. Traces shine in systems with many moving parts; for a simple single-server app, metrics and logs are often enough to start.
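To make spans concrete, here is a minimal sketch that models the checkout example from the table (900ms total, 50ms in the app, 850ms in the database call) as plain data. The `Span` class and the `slowest_child` helper are hypothetical; real tracing libraries record spans for you and attach far more context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed segment of a single request."""
    name: str
    parent: Optional[str]   # name of the enclosing span, None for the root
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_child(spans, parent_name):
    """Return the direct child of parent_name that took the longest, if any."""
    children = [s for s in spans if s.parent == parent_name]
    return max(children, key=lambda s: s.duration_ms) if children else None


# Hypothetical trace for one "Buy" click.
trace = [
    Span("checkout", None, 0, 900),
    Span("app logic", "checkout", 0, 50),
    Span("database call", "checkout", 50, 900),
]

# Print the root span, then its children indented beneath it.
for span in trace:
    indent = "  " if span.parent else ""
    print(f"{indent}{span.name}: {span.duration_ms} ms")

culprit = slowest_child(trace, "checkout")
print("slowest step:", culprit.name)
```

The point of the structure is the parent link: it lets you attribute the total time of a request to the individual steps inside it.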

When to invest in observability (and how much)

You do not need a full tracing pipeline on day one. Start small and grow with your needs.

Stage	What you have	What to add
Single hobby server	journald logs by default	Add metrics (Prometheus + Node Exporter) and uptime checks
Small production app	Metrics + logs + dashboards + alerts	A status/uptime monitor so you hear about outages first
Multiple services / microservices	All of the above	Distributed tracing to follow requests across services

The honest rule: add observability before you need it, because you cannot debug an outage with data you forgot to collect. But do not over-engineer — a tiny blog does not need distributed tracing.

Gotcha: Logs and metrics consume disk space. An unbounded log can fill /var and crash the very server you are trying to watch. Configure retention from the start: for journald, set SystemMaxUse=500M under the [Journal] section of /etc/systemd/journald.conf, then run sudo systemctl restart systemd-journald to apply it.
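Retention limits are the fix, but it also helps to watch the disk itself. The following is a small sketch of such a check, using Python's standard `shutil.disk_usage`; the `percent_used` and `disk_at_risk` helpers and the 80% limit are assumptions chosen for illustration, not recommended values.

```python
import shutil


def percent_used(path: str = "/var") -> float:
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100 * usage.used / usage.total


def disk_at_risk(path: str = "/var", limit_percent: float = 80.0) -> bool:
    """Return True when the filesystem is at or above the given limit."""
    return percent_used(path) >= limit_percent


if __name__ == "__main__":
    print(f"/var is {percent_used('/var'):.1f}% full")
    if disk_at_risk("/var"):
        print("warning: /var is filling up, check log and metric retention")
```

Run from cron or a systemd timer, a check like this turns "the disk filled up overnight" into a warning you see in advance.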

Best practices

Collect all three pillars where practical, but start with metrics and logs — they cover most everyday incidents on a single Ubuntu server.
Instrument before you launch. Add monitoring while building, not after your first outage.
Set retention limits on logs and metrics so observability never fills the disk and takes the system down.
Alert on symptoms users feel (slow responses, errors) rather than on every internal metric — too many alerts get ignored.
Keep dashboards focused. A few clear graphs of the “golden signals” (latency, traffic, errors, saturation) beat fifty noisy ones.
Centralize where it makes sense. As you add servers, ship metrics and logs to one place so you investigate in a single tool, not by SSH-ing into ten machines.
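The advice on symptom-based alerts and golden signals can be sketched in a few lines. The example below derives latency, traffic, and errors from a hypothetical list of (latency in ms, HTTP status) records; the data, the function names, and the 5-second window are all assumptions. Saturation is left out because it comes from resource metrics such as CPU or memory, not from request records.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Integer form of ceil(pct / 100 * n), which avoids float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Compute traffic, error rate, and latency percentiles for one window."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = [latency for latency, _ in requests]
    failures = sum(1 for _, status in requests if status >= 500)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failures / len(requests),
        "p50_ms": nearest_rank_percentile(latencies, 50),
        "p95_ms": nearest_rank_percentile(latencies, 95),
    }


# Hypothetical records: ten requests seen during a 5-second window.
REQUESTS = [
    (120, 200), (95, 200), (110, 200), (2300, 502), (105, 200),
    (130, 200), (98, 200), (101, 200), (115, 200), (99, 200),
]

summary = summarize(REQUESTS, window_seconds=5)
print(summary)
```

In this data the median latency looks healthy while the 95th percentile exposes the one slow, failing request, which is why dashboards usually chart a high percentile alongside the median.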