
Capacity Planning

Capacity planning is the work of making sure your servers have enough resources (CPU, memory, disk, and network) to handle the traffic you get today and the traffic you expect tomorrow — without paying for a pile of machines that sit idle. It sits at the heart of Site Reliability Engineering (SRE), which is the practice of running services in a measured, data-driven way. Get it wrong on the low side and your service falls over during a busy moment. Get it wrong on the high side and you burn money. This page shows you how to measure what you have, forecast what you’ll need, and keep a sensible safety margin.

What capacity planning actually answers

Capacity planning answers three questions:

  1. How much am I using right now? (measurement)
  2. How much will I need in 3, 6, or 12 months? (forecasting)
  3. How much spare room do I keep for surprises? (headroom)

You answer these with numbers, not gut feeling. A “metric” here just means a measured number over time — like “CPU is 40% busy” or “the app handles 800 requests per second”.

When to do this: Run a capacity review every quarter, and again before any event you know will bring traffic (a product launch, a sale, a marketing push). Do NOT skip it just because “things feel fine” — the moment things stop feeling fine is usually too late to add servers calmly.

Measuring utilization

Utilization is the percentage of a resource that is currently in use. Start on the server itself. On Ubuntu, install a couple of standard tools.

sudo apt update
sudo apt install -y sysstat htop
sudo systemctl enable --now sysstat

sysstat ships the sar command (System Activity Reporter), which records resource use over time so you can look back later instead of only seeing “right now”.

Sample CPU usage live, three times at one-second intervals (run sar -u with no numbers to see the history collected so far today):

sar -u 1 3

Output:

Linux 6.8.0-31-generic (web-01)   06/16/2026   _x86_64_   (4 CPU)

12:01:01 PM CPU   %user   %nice  %system  %iowait  %idle
12:01:02 PM all   31.20    0.00     6.40     0.50   61.90
12:01:03 PM all   34.10    0.00     5.90     0.40   59.60
12:01:04 PM all   29.80    0.00     6.10     0.60   63.50
Average:    all   31.70    0.00     6.13     0.50   61.67

Here the box is about 38% busy on average (100 minus the average %idle of 61.67). Check memory and disk too:

free -h
df -h /

Output:

               total        used        free      shared  buff/cache   available
Mem:           7.7Gi       4.1Gi       0.5Gi       0.2Gi       3.1Gi       3.3Gi
Swap:          2.0Gi       0.1Gi       1.9Gi

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        49G   31G   18G  64% /

The numbers that matter for a web service are usually CPU %, memory available (not just free — “available” includes reclaimable cache), disk Use%, and your application’s own throughput (requests per second) and latency (how long requests take). The system tools above cover the machine; your monitoring stack (such as Prometheus scraping the app) covers the application.
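
As a rough illustration of how the readings above turn into utilization percentages, here is a minimal JavaScript sketch. The function names are hypothetical, the inputs are copied from the sample sar and free output, and memory is counted as in use whenever it is not "available".

```javascript
// Hypothetical helpers: turn raw readings into utilization percentages.

// CPU busy % is everything that is not idle.
function cpuBusyPercent(idlePercent) {
  return 100 - idlePercent;
}

// Memory pressure: share of total memory that is NOT "available".
function memoryUsedPercent(totalGiB, availableGiB) {
  return ((totalGiB - availableGiB) / totalGiB) * 100;
}

// Values taken from the sample sar and free output.
const cpu = cpuBusyPercent(61.67);
const mem = memoryUsedPercent(7.7, 3.3);

console.log(`CPU busy: ${cpu.toFixed(1)}%`);
console.log(`Memory in use (not available): ${mem.toFixed(1)}%`);
```

Working from "available" rather than "free" keeps reclaimable cache from making the machine look fuller than it is.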

Forecasting growth

Forecasting means projecting today’s numbers into the future. You don’t need fancy math to start. Collect a few months of your peak daily traffic, then fit a simple trend.

Approach        | How it works                                              | When to use it
Linear growth   | Add the same amount each month (e.g. +50 req/s per month) | Steady, predictable products
Compound growth | Multiply by a percentage each month (e.g. +10%/month)     | Fast-growing or viral products
Event-driven    | Size for a known spike (launch, Black Friday)             | One-off big days

A quick rule for compound growth: divide 72 by your monthly growth percentage to estimate how many months until traffic doubles. At 10% per month, traffic doubles in roughly 7 months, so a server that’s 60% busy today would hit 100% in under six months if its load grows at the same rate.
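
To make the linear and compound approaches concrete, the following sketch projects traffic forward and compares the rule-of-72 estimate with the exact doubling time. The function names are hypothetical and the 10% monthly rate is the example figure from the text, not a measured value.

```javascript
// Linear growth: add a fixed amount each month.
function projectLinear(current, addPerMonth, months) {
  return current + addPerMonth * months;
}

// Compound growth: multiply by (1 + rate) each month.
function projectCompound(current, monthlyRate, months) {
  return current * Math.pow(1 + monthlyRate, months);
}

// Exact number of months for compound growth to multiply a value by `factor`.
function monthsToGrowBy(factor, monthlyRate) {
  return Math.log(factor) / Math.log(1 + monthlyRate);
}

const rate = 0.10; // 10% per month, the example rate

console.log('Rule of 72 estimate:', (72 / 10).toFixed(1), 'months to double');
console.log('Exact doubling time:', monthsToGrowBy(2, rate).toFixed(1), 'months');

// A server at 60% utilization reaches 100% when its load grows by 100/60.
console.log('60% -> 100% in:', monthsToGrowBy(100 / 60, rate).toFixed(1), 'months');
```

The rule of 72 is only an approximation, but at growth rates like this it lands close enough to the exact figure for planning purposes.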

Tie this back to a single, meaningful unit. Pick something like “requests per CPU core per second” that your service can sustain, then forecasting is just: expected traffic ÷ per-core capacity = cores needed.
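
The division above can be sketched as a small function. The traffic and per-core figures below are hypothetical placeholders, and the optional utilization target is an added assumption so that the result leaves some spare room.

```javascript
// Cores needed = expected traffic / per-core capacity.
// targetUtilization (0..1) scales capacity down so cores are not planned at 100%.
function coresNeeded(expectedRps, rpsPerCore, targetUtilization = 1) {
  return Math.ceil(expectedRps / (rpsPerCore * targetUtilization));
}

// Hypothetical inputs: 1600 req/s forecast, 50 req/s sustained per core.
console.log('Cores at 100% utilization:', coresNeeded(1600, 50));
console.log('Cores at a 65% target:', coresNeeded(1600, 50, 0.65));
```

Rounding up is deliberate: a fractional core still has to be provisioned as a whole one.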

Headroom for spikes

Headroom is the spare capacity you keep on purpose, above your normal peak. Real traffic is bumpy: a normal day might peak at 50% CPU, but a single burst can briefly double that. If you run at 90% on a calm day, a small spike pushes you to failure.

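A common target is to keep peak utilization at 60-70% of capacity, leaving 30-40% headroom. This buffer also has to cover losing a server: if one of three machines dies, the two survivors split its share, so their utilization rises by a factor of 3/2. At 60% that means 90% after the failure, which is tight but survivable; at 70% it means 105%, which is not. With only a few servers, aim for the low end of the range.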
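
To check whether a utilization target really survives the loss of a server, you can work out what the survivors would carry. This is a minimal sketch with hypothetical function names, assuming identical servers and evenly balanced load.

```javascript
// Utilization of each surviving server after `failed` of `total` equal
// servers are lost, assuming load is spread evenly across the survivors.
function utilizationAfterFailure(total, utilization, failed = 1) {
  return (utilization * total) / (total - failed);
}

// Highest pre-failure utilization that still fits on the survivors.
function maxSafeUtilization(total, failed = 1) {
  return (total - failed) / total;
}

for (const u of [0.6, 0.7]) {
  const after = utilizationAfterFailure(3, u);
  console.log(`3 servers at ${u * 100}% -> ${(after * 100).toFixed(0)}% after losing one`);
}
console.log(`Ceiling for 3 servers: ${(maxSafeUtilization(3) * 100).toFixed(1)}%`);
```

The more servers you have, the smaller the share each failure hands to the rest, so larger fleets can safely run closer to the top of the range.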

Gotcha: Headroom is not only about CPU. A service can have spare CPU but run out of memory, file descriptors, or database connections first. Find the resource that runs out first — that’s your real ceiling. Adding CPU won’t help if the bottleneck is database connections.
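
One way to find the resource that runs out first is to compare each one against its own limit. The snapshot below is invented for illustration (the names, usage figures, and limits are all hypothetical), and the growth estimate assumes usage scales linearly with load.

```javascript
// Hypothetical snapshot at peak: how much of each limit is in use.
const resources = [
  { name: 'cpu_percent', used: 55, limit: 100 },
  { name: 'memory_gib', used: 4.4, limit: 7.7 },
  { name: 'file_descriptors', used: 600, limit: 1024 },
  { name: 'db_connections', used: 90, limit: 100 },
];

// The bottleneck is the resource closest to its limit.
function findBottleneck(items) {
  return items
    .map((r) => ({ ...r, ratio: r.used / r.limit }))
    .sort((a, b) => b.ratio - a.ratio)[0];
}

// Factor by which load can grow before this resource hits its limit,
// assuming usage scales linearly with load.
function growthBeforeLimit(resource) {
  return resource.limit / resource.used;
}

const bottleneck = findBottleneck(resources);
console.log(`Bottleneck: ${bottleneck.name} at ${(bottleneck.ratio * 100).toFixed(0)}%`);
console.log(`Load can grow about ${growthBeforeLimit(bottleneck).toFixed(2)}x before it is exhausted`);
```

In this made-up snapshot CPU looks comfortable, yet database connections would run out after only a small increase in load.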

Load testing

Load testing means deliberately sending traffic at your service to find out where it breaks. This is the only way to know your true ceiling instead of guessing. Use k6, a modern, scriptable load tester.

sudo apt install -y gnupg
sudo gpg --no-default-keyring \
  --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
  --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
  | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update
sudo apt install -y k6

Write a tiny test that ramps from 0 to 200 virtual users:

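import http from 'k6/http';
import { check, sleep } from 'k6';

// Ramp up to 200 virtual users (VUs), then back down.
export const options = {
  stages: [
    { duration: '1m', target: 50 },  // warm up: 0 -> 50 VUs
    { duration: '2m', target: 200 }, // ramp: 50 -> 200 VUs
    { duration: '1m', target: 0 },   // ramp down to 0
  ],
};

// Each VU runs this function in a loop for the whole test.
export default function () {
  const res = http.get('https://staging.example.com/');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // 1 s think time, so a VU pauses between requests instead of sending them back to back
}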

Run it against a staging environment, never production:

k6 run load-test.js

Output:

     http_req_duration..............: avg=142ms min=31ms med=120ms p(95)=410ms
     http_req_failed................: 0.40%  ✓ 38  ✗ 9512
     http_reqs......................: 9550   159.1/s
     iteration_duration.............: avg=1.14s

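Watch for the point where p95 latency (the response time that 95% of requests come in under) climbs sharply or errors start to appear. The request rate just before that breakdown is your real capacity for the setup you tested; if staging runs a single server, that is your per-server ceiling.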
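
If you repeat the test at several load levels, picking out the last healthy step can be sketched like this. The result rows and the latency and error limits are hypothetical; substitute the figures from your own k6 runs and your own targets.

```javascript
// Hypothetical results from repeated runs at increasing load.
const steps = [
  { rps: 50, p95Ms: 180, errorRate: 0.0 },
  { rps: 100, p95Ms: 210, errorRate: 0.001 },
  { rps: 150, p95Ms: 400, errorRate: 0.004 },
  { rps: 200, p95Ms: 1900, errorRate: 0.06 },
];

// Return the last step that stays within both limits, scanning in order of
// increasing load and stopping at the first violation.
function lastHealthyStep(results, maxP95Ms, maxErrorRate) {
  let healthy = null;
  for (const step of results) {
    if (step.p95Ms > maxP95Ms || step.errorRate > maxErrorRate) break;
    healthy = step;
  }
  return healthy;
}

const ceiling = lastHealthyStep(steps, 500, 0.01);
console.log('Sustainable rate:', ceiling ? `${ceiling.rps} req/s` : 'none of the tested steps');
```

Stopping at the first violation matters: a later step that happens to pass does not prove the service is healthy beyond the point where it first broke down.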

Cost vs headroom

More headroom means higher reliability and higher cost. The goal is enough margin to be safe, not so much that you pay for idle machines.

Peak utilization target | Reliability | Cost     | Fit
50%                     | Very safe   | High     | Spiky, critical, hard-to-scale services
65%                     | Balanced    | Moderate | Most production web services
85%                     | Tight       | Low      | Predictable load with fast autoscaling
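
To see how the utilization target drives the bill, the sketch below counts the servers needed at each target in the table. The peak demand and per-server capacity are hypothetical figures chosen for illustration; use your own measured numbers.

```javascript
// Servers needed so that peak demand lands at the chosen utilization target.
function serversFor(peakRps, rpsPerServer, targetUtilization) {
  return Math.ceil(peakRps / (rpsPerServer * targetUtilization));
}

// Hypothetical inputs: 1000 req/s at peak, 160 req/s per server at its ceiling.
const peakRps = 1000;
const rpsPerServer = 160;

for (const target of [0.5, 0.65, 0.85]) {
  const n = serversFor(peakRps, rpsPerServer, target);
  console.log(`${target * 100}% target -> ${n} servers`);
}
```

With these inputs the 50% target needs 13 servers and the 85% target needs 8, which is the cost gap you pay for the extra safety margin.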

The cheapest way to keep headroom low and stay safe is autoscaling — automatically adding or removing servers based on a metric like CPU. With reliable autoscaling you can run a higher base utilization, because new capacity appears within minutes. Without it, you must buy peak capacity up front and let it sit idle most of the time.
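
As an illustration of how a metric-driven autoscaler decides on a server count, here is a simplified proportional rule of the kind documented for the Kubernetes Horizontal Pod Autoscaler. The article does not name a specific autoscaler, so treat this as an assumption; real autoscalers add tolerances and cooldown periods that are omitted here.

```javascript
// Proportional scaling: size the fleet so the metric lands on its target.
// min and max are hypothetical guard rails on the fleet size.
function desiredReplicas(current, currentUtilization, targetUtilization, min = 1, max = Infinity) {
  const raw = Math.ceil(current * (currentUtilization / targetUtilization));
  return Math.min(max, Math.max(min, raw));
}

// 4 servers running at 90% CPU with a 65% target: scale out.
console.log('Scale out to:', desiredReplicas(4, 90, 65));

// 4 servers running at 30% CPU with a 65% target: scale in, but keep at least 2.
console.log('Scale in to:', desiredReplicas(4, 30, 65, 2));
```

The minimum matters as much as the target: it is what keeps enough servers running to survive a failure during quiet hours.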

Best practices

  • Measure before you provision — base every decision on collected metrics (sar, Prometheus), not on guesses or feelings.
  • Find the resource that runs out first (CPU, memory, connections) and plan around that bottleneck, not the most visible one.
  • Keep peak utilization around 60-70% so one failed node or a sudden spike doesn’t take you down.
  • Load test in staging before launches; know your per-server ceiling instead of discovering it in production.
  • Use autoscaling tied to a real metric so you can run leaner while staying safe.
  • Re-forecast every quarter and before any known traffic event — capacity plans go stale fast.
  • Set monitoring alerts to fire on a trend (e.g. “70% busy for 30 minutes”), not just instant spikes, so you add capacity calmly.
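
The last point, alerting on a sustained level instead of a momentary spike, can be sketched as a simple window check. The function name and the one-sample-per-minute assumption are hypothetical; in practice your monitoring system's own alert rules would express this.

```javascript
// samples: utilization percentages, one per minute, oldest first.
// Fires only if every sample in the most recent window is at or above the threshold.
function sustainedAbove(samples, threshold, windowSize) {
  if (samples.length < windowSize) return false;
  return samples.slice(-windowSize).every((v) => v >= threshold);
}

// A single spike at the end of a calm half hour: no alert.
const spike = Array(29).fill(50).concat([95]);
console.log('Spike fires:', sustainedAbove(spike, 70, 30));

// Thirty straight minutes at 72%: alert.
const sustained = Array(30).fill(72);
console.log('Sustained load fires:', sustainedAbove(sustained, 70, 30));
```
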
Last updated June 15, 2026