Monitoring System Resources

When a server feels slow, the first thing you do is look at what it is actually doing right now. Is the CPU (the processor, the chip that runs your programs) pegged at 100%? Is it out of memory (RAM, the fast short-term storage where running programs live)? Is the disk overloaded with read and write requests? This page teaches you the handful of built-in tools that answer those questions in seconds. Think of it as your first-response triage kit for a misbehaving Ubuntu server.

The four questions of triage

Before reaching for any tool, you are trying to answer four things, in order:

Is the CPU the bottleneck? Are processes fighting over the processor?
Is memory the bottleneck? Is the server swapping (pushing memory to disk because RAM is full)?
Is the disk the bottleneck? Are reads and writes piling up?
Which process is responsible?

Each tool below answers one or more of these.

top — the always-available live view

top shows a live, refreshing list of processes, sorted by CPU usage by default. It ships with Ubuntu and almost every other Linux distribution, so it is the one tool you can count on even on a bare server.

top

Output:

top - 14:22:07 up 9 days,  3:11,  2 users,  load average: 1.42, 0.98, 0.71
Tasks: 213 total,   1 running, 212 sleeping,   0 stopped,   0 zombie
%Cpu(s):  18.3 us,  4.1 sy,  0.0 ni, 76.9 id,  0.5 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   7951.4 total,    412.6 free,   3120.8 used,   4418.0 buff/cache
MiB Swap:   2048.0 total,   1980.0 free,     68.0 used.   4502.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1834 postgres  20   0  421560  98220  86440 S   9.6   1.2   2:14.08 postgres
   2201 www-data  20   0  712340 142880  18220 S   6.3   1.8   0:51.33 nginx

How to read the header line by line:

load average — three numbers for the last 1, 5, and 15 minutes (covered in detail below).
%Cpu(s) — the key fields are us (user, time spent running your applications), sy (system, time spent in the kernel), id (idle, doing nothing), and wa (I/O wait, time the CPU sat idle waiting for disk I/O to finish). High wa means storage, not the CPU, is your problem.
MiB Mem — total, free, used, and buff/cache. The avail Mem figure (printed at the end of the MiB Swap line) is what matters: it estimates how much memory apps can still claim without swapping. Linux deliberately uses spare RAM for cache, so a low free number is normal and healthy.
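For scripts, the same header can be read without the interactive screen. The sketch below is one possible way to pull a single field out of the %Cpu(s) line; cpu_field is a hypothetical helper name, and it assumes the header layout shown above with a dot as the decimal separator (LC_ALL=C).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one field (us, sy, id, wa, ...) from top's %Cpu(s) line.
# Reads top's text output on stdin; the wanted field name is the first argument.
cpu_field() {
  awk -v want="$1" '
    /^%Cpu/ {
      sub(/^[^:]*:/, "")            # drop the "%Cpu(s):" label
      n = split($0, parts, ",")     # each part looks like " 18.3 us"
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, " ")    # kv[1] = value, kv[2] = field name
        if (kv[2] == want) { print kv[1]; exit }
      }
    }'
}

# Example use (batch mode, one iteration):
#   LC_ALL=C top -b -n 1 | cpu_field wa
```

A single batch sample is only a rough reading, so take a few before drawing conclusions.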

Useful keys while top is running:

Key	What it does
`M`	Sort by memory usage
`P`	Sort by CPU usage (the default)
`k`	Kill a process (it asks for the PID)
`1`	Show each CPU core separately
`q`	Quit

When to use it: any time, on any box. When not to: when you want a friendlier, more readable display and htop is available.

htop — the friendly upgrade

htop is top with colour bars, mouse support, and easy scrolling. It is not installed by default, so add it:

sudo apt update
sudo apt install htop
htop

It shows a coloured bar per CPU core, a memory bar, and a swap bar at the top, then a scrollable process list. You can click a column to sort, press F6 to change the sort field, and F9 to kill a process by selecting it instead of typing a PID.

Tip: The green portion of the CPU bar is user time, red is kernel/system time, and blue is low-priority. If the bar is mostly red, something is hammering the kernel — often heavy disk or network I/O.

When to use it: interactive investigation when you have a terminal and can install a package. For scripts or minimal containers, stick with top, which can also run non-interactively in batch mode (top -b -n 1).

Understanding load average

Load average is the single most misread number in Linux. It is not a percentage. It is the average number of processes that are either running or waiting to run (or waiting on disk). You read it relative to your CPU core count.

Find your core count first:

nproc

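This prints a single number, the count of processing units available to you (for example, 4 on a four-core server).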

Now interpret the three load numbers:

Load = number of cores → fully busy, no queue. On a 4-core box, a load of 4.0 means perfectly saturated.
Load < cores → spare capacity.
Load > cores → processes are queueing for the CPU or stuck waiting on disk; the server is overloaded.

So a load average of 8.0 is a disaster on a 2-core server but barely warm on a 16-core one. Compare the three numbers to see the trend: if the 1-minute figure is far above the 15-minute figure, load is climbing right now. If the 1-minute is lower, the spike is passing.
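The comparison against core count can be sketched as a small shell function. load_status is a hypothetical helper name; it simply applies the three rules above to a load value and a core count that you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a load average against a core count.
# Arguments: load (e.g. 1.42) and cores (e.g. the output of nproc).
load_status() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    ratio = load / cores
    if (ratio < 1)       state = "spare capacity"
    else if (ratio == 1) state = "fully busy"
    else                 state = "overloaded"
    printf "%.2f per core: %s\n", ratio, state
  }'
}

# Example use with the live 1-minute load (first field of /proc/loadavg):
#   load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Dividing by the core count gives a number you can compare across machines of different sizes.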

vmstat — CPU, memory, and swap over time

vmstat (virtual memory statistics) prints a one-line snapshot of system health. Pass a number to print a new line every N seconds; the first line of numbers is an average since boot, so ignore it and read the second line onward.

vmstat 2

Output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  69632 422040 102400 4500120    0    0    11    34  210  398 18  4 77  1  0
 2  1  69632 418900 102400 4500140    0   12  1840   420 1120 2240 22  6 60 12  0

The columns that matter for triage:

r — processes running or waiting for CPU time. Consistently above your core count means processes are queueing for the CPU.
b — processes blocked waiting on I/O. A non-zero b points at the disk.
si / so — swap in / swap out, the amount of memory moved from and to disk per second. Sustained non-zero values mean RAM is too small for the workload and the server is swapping, which murders performance. This is the clearest “buy more memory” signal.
wa — CPU I/O wait, same as in top.
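To watch for swapping without staring at the columns, vmstat output can be piped through a small filter. This is a minimal sketch: swap_alerts is a hypothetical name, and the field positions (si = 7, so = 8) assume the default vmstat layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read vmstat output on stdin and report samples with swap activity.
# Field positions (si = 7, so = 8) assume the default vmstat column layout.
swap_alerts() {
  awk '
    $1 ~ /^[0-9]+$/ {               # only data rows; header rows start with text
      sample++
      if (sample == 1) next         # first sample is the since-boot average
      if ($7 > 0 || $8 > 0)
        printf "sample %d: si=%s so=%s\n", sample, $7, $8
    }'
}

# Example use (ten samples, two seconds apart):
#   vmstat 2 10 | swap_alerts
```

No output means none of the samples after the first showed swap traffic.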

iostat — is the disk the problem?

iostat (input/output statistics) shows per-disk activity. It lives in the sysstat package:

sudo apt install sysstat
iostat -x 2

The -x flag adds extended columns. Watch these:

Column	Meaning	Red flag
`%util`	How busy the disk is	Near 100% = disk saturated
`r_await`, `w_await` (a single `await` in older sysstat releases)	Average wait per read or write request, in ms	High and rising = disk too slow
`r/s`, `w/s`	Reads and writes per second	Context for the above

If %util sits near 100 and await is high while CPU id (idle) is also high, the CPU is fine and the disk is your bottleneck.
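The %util check can be automated with a filter along these lines. It is a sketch, not a monitoring tool: busy_disks is a hypothetical name, the default threshold of 90 is an arbitrary choice, and the code finds the %util column by its header so it does not depend on a particular sysstat column order.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read `iostat -x` output on stdin and list devices whose
# %util is at or above a threshold (first argument, default 90).
busy_disks() {
  awk -v limit="${1:-90}" '
    /%util/ {                       # header row: find which column holds %util
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 >= limit {
      printf "%s %s\n", $1, $col    # device name and its %util value
    }'
}

# Example use (-y skips the first, since-boot report):
#   LC_ALL=C iostat -x -y 2 5 | busy_disks 90
```

On SSDs and RAID arrays that serve many requests in parallel, a high %util does not always mean the device is at its limit, so read it together with the wait times.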

free — a quick memory check

For a fast, one-shot memory read without the live refresh, use free with -h (human-readable units):

free -h

Output:

               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       3.0Gi       402Mi        88Mi       4.3Gi       4.4Gi
Swap:          2.0Gi        68Mi       1.9Gi

Read the available column, not free. As with top, Linux uses idle RAM for buff/cache and hands it back to apps on demand, so a small free value is expected.
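For an alert or a cron job, the available column can be turned into a percentage. The sketch below uses a hypothetical helper, avail_percent, and assumes the column order shown above (total is field 2 and available is field 7 on the Mem: line) with English labels (LC_ALL=C). It expects free -m rather than free -h so the values are plain numbers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `free -m` output on stdin and print available memory
# as a whole-number percentage of total.
# Assumes total is field 2 and available is field 7 on the "Mem:" line.
avail_percent() {
  awk '/^Mem:/ { printf "%.0f\n", $7 * 100 / $2 }'
}

# Example use:
#   LC_ALL=C free -m | avail_percent
```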

Best Practices

Start triage with load average and top/htop, then drill into the specific resource that looks wrong.
Always read load average against nproc — the raw number means nothing on its own.
Treat sustained non-zero si/so in vmstat as a memory-pressure alarm, not a minor detail.
Read available memory, never free — cached RAM is reclaimable and counts as available.
High I/O wait (wa) with idle CPU means the disk is the bottleneck; confirm it with iostat -x.
Install htop and sysstat on every server during setup so the tools are ready before an incident.
Use vmstat 2 or iostat 2 to watch trends over a few seconds; a single snapshot can mislead you.
