
The Three Pillars: Metrics, Logs, Traces

When something goes wrong on a server, you need answers fast: Is it slow? Is it broken? Why? Observability (the ability to understand what is happening inside a system just by looking at the data it emits) rests on three kinds of data, often called the three pillars: metrics, logs, and traces. Each one answers a different question, and you usually need all three to see the full picture. This page explains what each is, shows real examples on Ubuntu, and tells you when to reach for which.

The big idea

Think of debugging a production problem like a doctor diagnosing a patient:

  • Metrics are the vital signs — heart rate, temperature, blood pressure. Numbers measured over time. They tell you that something is wrong.
  • Logs are the patient’s diary — a record of individual events as they happened. They tell you what happened.
  • Traces are the X-ray that follows the path of one specific thing through the whole body. They tell you where the problem is across many services.

Telemetry is the umbrella word for all three (data a system sends out about itself). Let’s look at each in detail.

Metrics — numbers over time

A metric is a number measured repeatedly at fixed intervals — for example “CPU usage every 15 seconds” or “requests per second”. Each measurement is a tiny data point: a timestamp, a value, and some labels (key/value tags like host="web-01"). Stored together, they form a time series you can graph and alert on.

Metrics are cheap to store and fast to query because each point is just a few bytes. That makes them perfect for dashboards and alerts.
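
As a rough illustration of the timestamp, value, and labels shape described above, the sketch below prints one data point as a single text line. The function name format_point, the metric name node_load1, and the line layout are assumptions made for this example, not a format this page prescribes.

```shell
# format_point NAME HOST VALUE TIMESTAMP
# Prints one data point: metric name, a host label, the value, and a Unix timestamp.
# The line layout is hypothetical and chosen only for readability.
format_point() {
  printf '%s{host="%s"} %s %s\n' "$1" "$2" "$3" "$4"
}

# Example: the 1-minute load average, read from /proc/loadavg where that file exists (Linux).
if [ -r /proc/loadavg ]; then
  format_point node_load1 "$(uname -n)" "$(cut -d' ' -f1 /proc/loadavg)" "$(date +%s)"
fi
```

Run at a fixed interval and appended to a file, lines like this would form a simple time series for one host.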

You already have metrics on any Ubuntu server. Try these:

# Instant load-average and memory snapshot
uptime
free -h

Output:

 14:32:09 up 6 days,  3:14,  2 users,  load average: 0.42, 0.55, 0.61
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       2.1Gi       3.9Gi       180Mi       1.8Gi       5.3Gi

To watch a metric change over time, install sysstat (the System Activity Report toolkit) and sample CPU every two seconds, five times:

sudo apt update
sudo apt install -y sysstat
sar -u 2 5

Output (trimmed to the first two samples):

14:35:01     CPU     %user     %nice   %system   %iowait    %steal     %idle
14:35:03     all      3.21      0.00      1.10      0.25      0.00     95.44
14:35:05     all      4.02      0.00      1.51      0.13      0.00     94.34
Average:     all      3.55      0.00      1.27      0.19      0.00     94.99
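
The %idle column is often easier to reason about when flipped into "CPU busy". The helper below is a minimal sketch that does this for sar output; the name sar_busy is made up for this example, and it assumes 24-hour timestamps as in the sample above, so that the second field of each data row is "all".

```shell
# sar_busy: read `sar -u` output on stdin and print CPU busy % (100 - %idle)
# for each per-interval "all" row, skipping the header and the Average row.
# Assumes 24-hour timestamps, so that field 2 is "all" on data rows.
sar_busy() {
  awk '$2 == "all" && $1 != "Average:" { printf "%.2f\n", 100 - $NF }'
}

# Example usage on a machine with sysstat installed:
#   sar -u 2 5 | sar_busy
```

Fed the two sample rows shown above, it would print 4.56 and 5.66.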

When to use metrics: for dashboards, capacity planning, and alerting (“page me if CPU > 90% for 5 minutes”). When NOT to: when you need to know why one specific request failed — metrics aggregate, so they hide individual events.
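
To make the alerting rule concrete, here is a hedged sketch of the "above a threshold for N samples in a row" logic. Real alerting systems evaluate rules like this for you; the function name should_page and the one-number-per-line input format are assumptions for illustration only.

```shell
# should_page LIMIT COUNT
# Reads one CPU-busy percentage per line on stdin and prints "page" if COUNT
# consecutive samples exceed LIMIT, otherwise "ok".
should_page() {
  awk -v limit="$1" -v need="$2" '
    { if ($1 + 0 > limit + 0) run++; else run = 0 }
    run >= need + 0 { hit = 1 }
    END { print (hit ? "page" : "ok") }'
}

# With one sample every 15 seconds, 5 minutes is 20 samples:
#   should_page 90 20 < cpu_busy_samples.txt   # hypothetical input file
```

Requiring consecutive samples, rather than a single spike, is what keeps a brief burst of CPU from waking someone up.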

Logs — a record of events

A log is a timestamped record of a discrete event: “user 42 logged in”, “connection refused”, “disk full”. Unlike metrics, a log entry carries rich detail — full messages, stack traces, IDs. That detail is also why logs cost more to store and search.

On Ubuntu, system services log through journald (the systemd journal). Read them with journalctl:

# Last 20 lines from the SSH service, follow new entries live
sudo journalctl -u ssh -n 20 -f

Output:

Jun 16 14:40:02 web-01 sshd[2231]: Accepted publickey for deploy from 203.0.113.7 port 51224
Jun 16 14:41:18 web-01 sshd[2240]: Failed password for invalid user admin from 198.51.100.9 port 40122

Many apps also write plain text logs under /var/log. For example, Nginx (a popular web server) writes here:

# Show the 5 most recent error lines from Nginx
sudo tail -n 5 /var/log/nginx/error.log

Output:

2026/06/16 14:42:51 [error] 911#911: *38 connect() failed (111: Connection refused) while connecting to upstream, client: 203.0.113.7, server: example.com

That single line tells you exactly what failed — something a CPU graph never could.

Gotcha: Logs can grow until they fill the disk and take the server down. Ubuntu rotates files under /var/log with logrotate, and journald enforces its own size cap, which you can set explicitly in /etc/systemd/journald.conf (for example SystemMaxUse=500M, then restart systemd-journald so the change takes effect). Always confirm rotation is active before sending real traffic.
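
One quick way to check whether an explicit journal cap is configured is to look for an uncommented SystemMaxUse line. The sketch below does only that; the helper name journald_cap is made up for this example, and it does not read drop-in files under journald.conf.d, which can also set the value.

```shell
# journald_cap FILE
# Prints the SystemMaxUse value set in FILE, or "unset" if no uncommented
# SystemMaxUse= line is present. Drop-in files are not consulted.
journald_cap() {
  cap=$(sed -n 's/^[[:space:]]*SystemMaxUse=//p' "$1" | tail -n 1)
  echo "${cap:-unset}"
}

# Example on Ubuntu:
#   journald_cap /etc/systemd/journald.conf
#   journalctl --disk-usage      # how much disk space the journal uses right now
```

If the helper prints "unset", journald falls back to its built-in defaults rather than running uncapped.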

When to use logs: debugging a specific failure, auditing security events, reading error messages. When NOT to: for trends or alerting thresholds — counting log lines to build a graph is slow and expensive; use metrics instead.
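
For a one-off security audit, a small filter over the SSH log lines is usually enough. The following is a sketch, assuming the "Failed password ... from ADDRESS port N" message layout shown in the sample output above; the function name failed_login_counts is hypothetical.

```shell
# failed_login_counts: read sshd log lines on stdin and print "COUNT IP" for
# every source address with at least one "Failed password" entry, busiest first.
failed_login_counts() {
  awk '/Failed password/ {
         for (i = 1; i < NF; i++)
           if ($i == "from") { count[$(i + 1)]++; break }
       }
       END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Example usage:
#   sudo journalctl -u ssh --since today | failed_login_counts
```

This is fine for an ad hoc investigation; for an ongoing graph or alert, a metric is still the better fit.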

Traces — a request’s journey

A trace follows one request as it travels across multiple services. Modern apps are often split into microservices (many small programs that call each other), so a single user click might touch a web server, an authentication service, and a database. A trace stitches all of those steps into one timeline.

Each step is a span (one timed operation with a start, an end, and a parent). Spans share a trace ID so you can see the whole journey and spot exactly which hop was slow.
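
To show how spans can share a trace ID across services, the sketch below builds identifiers in the shape of the W3C Trace Context traceparent header (version, trace ID, span ID, flags). That header is not mentioned elsewhere on this page and is brought in here as an assumption; in practice an OpenTelemetry SDK generates and propagates these values for you. The function names are made up for this example.

```shell
# rand_hex N: print N random bytes as 2*N lowercase hex characters.
rand_hex() {
  od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# new_traceparent: start a new trace (fresh 16-byte trace ID, 8-byte span ID).
new_traceparent() {
  printf '00-%s-%s-01\n' "$(rand_hex 16)" "$(rand_hex 8)"
}

# child_traceparent PARENT: keep the trace ID, mint a new span ID for the next hop.
child_traceparent() {
  trace_id=$(printf '%s' "$1" | cut -d- -f2)
  printf '00-%s-%s-01\n' "$trace_id" "$(rand_hex 8)"
}

parent=$(new_traceparent)
echo "web-api : $parent"
echo "auth-svc: $(child_traceparent "$parent")"
```

Both printed lines carry the same 32-character trace ID, which is what lets a backend stitch the two spans into one timeline.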

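Traces are emitted by your application code, usually via OpenTelemetry (the industry-standard, vendor-neutral telemetry framework, abbreviated OTel). You run a collector to receive them. A minimal collector config, which accepts OTLP over gRPC on port 4317 and prints each received span to the collector's own output through the debug exporter, looks like this: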

# /etc/otelcol/config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]

A trace, once collected, reads like this:

Trace 7f3a... (total 240ms)
└─ GET /checkout            web-api      0ms ─ 240ms
   ├─ verifyToken           auth-svc    12ms ─  35ms
   └─ SELECT * FROM orders  postgres    40ms ─ 232ms  ← slow

At a glance you see the database query is the bottleneck. Metrics could have told you the request was slow; only the trace shows where.
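
The same reading can be done mechanically: subtract each span's start from its end and pick the longest child. The sketch below does this for the three spans shown above. The pipe-separated input format and the function name slowest_child are assumptions for illustration, not something a tracing backend exports.

```shell
# slowest_child: read spans on stdin, one per line, as
#   name|service|start_ms|end_ms|parent_name   (empty parent = root span)
# and print the non-root span with the longest duration.
slowest_child() {
  awk -F'|' '$5 != "" {
               d = $4 - $3
               if (d > max) { max = d; best = $1 " (" $2 ") " d "ms" }
             }
             END { print best }'
}

slowest_child <<'EOF'
GET /checkout|web-api|0|240|
verifyToken|auth-svc|12|35|GET /checkout
SELECT * FROM orders|postgres|40|232|GET /checkout
EOF
# prints: SELECT * FROM orders (postgres) 192ms
```

The root span is skipped on purpose: it always covers the whole request, so it says nothing about which hop was slow.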

When to use traces: finding the slow service in a multi-service request, understanding dependencies. When NOT to: for a single-process script or a simple static site — there is no journey to trace, so the setup cost is not worth it.

When each pillar helps

Question you are asking                     Best pillar   Example tool
Is the system healthy right now?            Metrics       Prometheus, sar
How is usage trending over weeks?           Metrics       Grafana dashboards
Why did this exact request fail?            Logs          journalctl, /var/log
Who logged in and when?                     Logs          journalctl -u ssh
Which service made the request slow?        Traces        OpenTelemetry, Jaeger
How do my services depend on each other?    Traces        OpenTelemetry

Best Practices

  • Use all three together. A metric alert fires, logs explain the error, and a trace pinpoints the slow service. Each alone leaves a gap.
  • Add labels and IDs consistently. Tag metrics and logs with the same host, service, and request_id so you can jump between them.
  • Keep metrics for alerting, logs for forensics. Don’t build dashboards by counting log lines — it’s slow and costly.
  • Set retention deliberately. Keep high-resolution metrics short-term, logs medium-term, and downsample old data to save disk.
  • Cap your storage. Configure logrotate and journald limits so telemetry never fills the disk and takes the server down.
  • Adopt OpenTelemetry early. Standardising on OTel means you can swap backends later without re-instrumenting your code.
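
As a small illustration of the consistent labels and IDs advice above, the sketch below writes log lines that always carry host, service, and request_id. The key=value layout, the field names, and the function name log_event are assumptions for this example rather than a standard.

```shell
# log_event SERVICE REQUEST_ID MESSAGE
# Prints one key=value log line with a UTC timestamp and the local host name.
# The field names are illustrative; use whatever your metrics and traces use.
log_event() {
  printf 'ts=%s host=%s service=%s request_id=%s msg="%s"\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "$1" "$2" "$3"
}

log_event web-api req-7f3a "upstream connection refused"
```

Because every line carries the same keys, a single search for request_id=req-7f3a would find the matching entries from every service that logged it.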
Last updated June 15, 2026