Capacity Planning
Capacity planning is the work of making sure your servers have enough resources (CPU, memory, disk, and network) to handle the traffic you get today and the traffic you expect tomorrow — without paying for a pile of machines that sit idle. It sits at the heart of Site Reliability Engineering (SRE), which is the practice of running services in a measured, data-driven way. Get it wrong on the low side and your service falls over during a busy moment. Get it wrong on the high side and you burn money. This page shows you how to measure what you have, forecast what you’ll need, and keep a sensible safety margin.
What capacity planning actually answers
Capacity planning answers three questions:
- How much am I using right now? (measurement)
- How much will I need in 3, 6, or 12 months? (forecasting)
- How much spare room do I keep for surprises? (headroom)
You answer these with numbers, not gut feeling. A “metric” here just means a measured number over time — like “CPU is 40% busy” or “the app handles 800 requests per second”.
When to do this: Run a capacity review every quarter, and again before any event you know will bring traffic (a product launch, a sale, a marketing push). Do NOT skip it just because “things feel fine” — the moment things stop feeling fine is usually too late to add servers calmly.
Measuring utilization
Utilization is the percentage of a resource that is currently in use. Start on the server itself. On Ubuntu, install a couple of standard tools.
sudo apt update
sudo apt install -y sysstat htop
sudo systemctl enable --now sysstat
sysstat ships the sar command (System Activity Reporter), which records resource use over time so you can look back later instead of only seeing “right now”.
Take three live CPU samples, one second apart (run sar -u with no interval to read back the history recorded so far today):
sar -u 1 3
Output:
Linux 6.8.0-31-generic (web-01) 06/16/2026 _x86_64_ (4 CPU)
12:01:01 PM CPU %user %nice %system %iowait %idle
12:01:02 PM all 31.20 0.00 6.40 0.50 61.90
12:01:03 PM all 34.10 0.00 5.90 0.40 59.60
12:01:04 PM all 29.80 0.00 6.10 0.60 63.50
Average: all 31.70 0.00 6.13 0.50 61.67
Here the box is about 38% busy (100 minus the idle column). Check memory and disk too:
free -h
df -h /
Output:
total used free shared buff/cache available
Mem: 7.7Gi 4.1Gi 0.5Gi 0.2Gi 3.1Gi 3.3Gi
Swap: 2.0Gi 0.1Gi 1.9Gi
Filesystem Size Used Avail Use% Mounted on
/dev/root 49G 31G 18G 64% /
The numbers that matter for a web service are usually CPU %, memory available (not just free — “available” includes reclaimable cache), disk Use%, and your application’s own throughput (requests per second) and latency (how long requests take). The system tools above cover the machine; your monitoring stack (such as Prometheus scraping the app) covers the application.
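As a rough sketch, the readings above can be turned into the three machine-level percentages you will track over time. The function name and field names are made up for illustration; the input values are copied from the sample output.

```javascript
// Convert raw readings into utilization percentages.
// Field names are hypothetical; values mirror the sample sar/free/df output.
function summarizeUtilization({ cpuIdlePct, memTotalGiB, memAvailableGiB, diskUsedGiB, diskSizeGiB }) {
  const round1 = (x) => Math.round(x * 10) / 10;
  return {
    // CPU busy is everything that is not idle.
    cpuBusyPct: round1(100 - cpuIdlePct),
    // Memory pressure is based on "available", not "free".
    memUsedPct: round1((1 - memAvailableGiB / memTotalGiB) * 100),
    diskUsedPct: round1((diskUsedGiB / diskSizeGiB) * 100),
  };
}

const summary = summarizeUtilization({
  cpuIdlePct: 61.67,
  memTotalGiB: 7.7,
  memAvailableGiB: 3.3,
  diskUsedGiB: 31,
  diskSizeGiB: 49,
});
console.log(summary); // { cpuBusyPct: 38.3, memUsedPct: 57.1, diskUsedPct: 63.3 }
```

The disk figure comes out slightly lower than the 64% that df prints because df rounds its percentage up.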
Forecasting growth
Forecasting means projecting today’s numbers into the future. You don’t need fancy math to start. Collect a few months of your peak daily traffic, then fit a simple trend.
| Approach | How it works | When to use it |
|---|---|---|
| Linear growth | Add the same amount each month (e.g. +50 req/s per month) | Steady, predictable products |
| Compound growth | Multiply by a percentage each month (e.g. +10%/month) | Fast-growing or viral products |
| Event-driven | Size for a known spike (launch, Black Friday) | One-off big days |
A quick rule for compound growth: divide 72 by your monthly growth percentage to estimate how many months it takes traffic to double. At 10% per month, traffic doubles in roughly 7 months, and a server that is 60% busy today reaches 100% in a little over five months.
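If you want to check the shortcut against the exact math, a small sketch like this works. The function names are illustrative, and it assumes utilization grows at the same compound rate as traffic.

```javascript
// Exact doubling time under compound growth.
function doublingMonths(monthlyGrowthPct) {
  return Math.log(2) / Math.log(1 + monthlyGrowthPct / 100);
}

// The rule-of-72 shortcut, for comparison.
function ruleOf72(monthlyGrowthPct) {
  return 72 / monthlyGrowthPct;
}

// Months until utilization grows from currentPct to ceilingPct.
function monthsUntil(currentPct, ceilingPct, monthlyGrowthPct) {
  return Math.log(ceilingPct / currentPct) / Math.log(1 + monthlyGrowthPct / 100);
}

console.log(doublingMonths(10).toFixed(1));       // "7.3"
console.log(ruleOf72(10).toFixed(1));             // "7.2"
console.log(monthsUntil(60, 100, 10).toFixed(1)); // "5.4"
```

The shortcut is close enough for planning; use the exact form when you need a date for a purchase order.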
Tie this back to a single, meaningful unit. Pick a rate your service can sustain, such as requests per second per CPU core. The forecast then reduces to: expected peak traffic ÷ per-core capacity = cores needed, before you add any headroom.
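A minimal sketch of that division, with an optional target utilization so the result can include headroom. The traffic and per-core figures are invented for the example; measure your own.

```javascript
// Cores needed to serve expectedPeakRps when each core sustains perCoreRps.
// targetUtilization = 1 means no headroom; 0.65 means plan to peak at 65% busy.
function coresNeeded(expectedPeakRps, perCoreRps, targetUtilization = 1) {
  if (perCoreRps <= 0 || targetUtilization <= 0 || targetUtilization > 1) {
    throw new RangeError('perCoreRps must be > 0 and targetUtilization in (0, 1]');
  }
  return Math.ceil(expectedPeakRps / (perCoreRps * targetUtilization));
}

// Hypothetical numbers: 1600 req/s forecast, 100 req/s per core.
console.log(coresNeeded(1600, 100));       // 16
console.log(coresNeeded(1600, 100, 0.65)); // 25
```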
Headroom for spikes
Headroom is the spare capacity you keep on purpose, above your normal peak. Real traffic is bumpy: a normal day might peak at 50% CPU, but a single burst can briefly double that. If you run at 90% on a calm day, a small spike pushes you to failure.
A common target is to keep peak utilization at 60-70% of capacity, leaving 30-40% headroom. The same buffer has to cover losing a server: if one of three machines dies, the two survivors must absorb its share without falling over, which only works if the peak was at or below about 66%.
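The server-loss case is easy to check with arithmetic. This sketch assumes identical machines and a load balancer that spreads traffic evenly; the function names are illustrative.

```javascript
// Utilization on each survivor after `failed` of `nodes` equal nodes are lost.
function utilizationAfterLoss(peakPct, nodes, failed = 1) {
  if (failed >= nodes) throw new RangeError('at least one node must survive');
  return (peakPct * nodes) / (nodes - failed);
}

// Highest peak utilization that still fits on the survivors.
function maxSafePeakPct(nodes, failed = 1) {
  if (failed >= nodes) throw new RangeError('at least one node must survive');
  return (100 * (nodes - failed)) / nodes;
}

console.log(utilizationAfterLoss(60, 3));  // 90
console.log(utilizationAfterLoss(70, 3));  // 105, the survivors are overloaded
console.log(maxSafePeakPct(3).toFixed(1)); // "66.7"
```

Larger fleets tolerate a single loss at a higher peak: with ten machines the same function gives 90%.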
Gotcha: Headroom is not only about CPU. A service can have spare CPU but run out of memory, file descriptors, or database connections first. Find the resource that runs out first — that’s your real ceiling. Adding CPU won’t help if the bottleneck is database connections.
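One way to find that ceiling is to list every limited resource with its peak usage and its hard limit, then pick the one closest to full. The resource names and numbers below are hypothetical.

```javascript
// Peak usage and hard limit per resource (illustrative values only).
const resources = [
  { name: 'cpu_cores', used: 2.0, limit: 4 },
  { name: 'memory_gib', used: 4.4, limit: 7.7 },
  { name: 'db_connections', used: 85, limit: 100 },
  { name: 'file_descriptors', used: 300, limit: 1024 },
];

// Return the resource with the highest used/limit ratio.
function findBottleneck(list) {
  return list
    .map((r) => ({ ...r, utilization: r.used / r.limit }))
    .reduce((worst, r) => (r.utilization > worst.utilization ? r : worst));
}

const bottleneck = findBottleneck(resources);
console.log(bottleneck.name, bottleneck.utilization); // db_connections 0.85
```

In this made-up case CPU is only half used, yet the service can grow by less than 20% before the connection pool is exhausted.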
Load testing
Load testing means deliberately sending traffic at your service to find out where it breaks. This is the only way to know your true ceiling instead of guessing. Use k6, a modern, scriptable load tester.
sudo apt install -y gnupg
sudo gpg --no-default-keyring \
--keyring /usr/share/keyrings/k6-archive-keyring.gpg \
--keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
| sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update
sudo apt install -y k6
Write a tiny test, saved as load-test.js, that ramps from 0 to 200 virtual users and back down:
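import http from 'k6/http';
import { check, sleep } from 'k6';

// Ramp up to 200 virtual users over three minutes, then back down to 0.
export const options = {
  stages: [
    { duration: '1m', target: 50 },  // warm up: 0 -> 50 users
    { duration: '2m', target: 200 }, // push: 50 -> 200 users
    { duration: '1m', target: 0 },   // ramp down
  ],
};

// Each virtual user runs this function in a loop.
export default function () {
  const res = http.get('https://staging.example.com/');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time; consistent with the ~1.1s iteration_duration in the sample output
}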
Run it against a staging environment, never production:
k6 run load-test.js
Output:
http_req_duration..............: avg=142ms min=31ms med=120ms p(95)=410ms
http_req_failed................: 0.40% ✓ 38 ✗ 9512
http_reqs......................: 9550 159.1/s
iteration_duration.............: avg=1.14s
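Watch for the point where p95 latency (the time within which 95% of requests complete) climbs sharply or errors start to appear. The request rate just before that breakdown is the real capacity of the environment you tested; if several servers sit behind it, divide by their count to get a per-server figure.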
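If you record the results step by step (for example, one row per load level), picking out the ceiling can be automated. This is a sketch under assumptions: the step data is invented, and the 500 ms and 1% limits are placeholders for whatever your service promises.

```javascript
// Hypothetical results from a stepped load test, lowest load first.
const steps = [
  { rps: 50, p95Ms: 110, errorRate: 0 },
  { rps: 100, p95Ms: 130, errorRate: 0 },
  { rps: 150, p95Ms: 180, errorRate: 0.001 },
  { rps: 200, p95Ms: 410, errorRate: 0.004 },
  { rps: 250, p95Ms: 1900, errorRate: 0.08 },
];

// Return the last step before latency or errors cross the limits,
// or null if even the first step is already unhealthy.
function lastHealthyStep(results, { maxP95Ms, maxErrorRate }) {
  let healthy = null;
  for (const step of results) {
    if (step.p95Ms > maxP95Ms || step.errorRate > maxErrorRate) break;
    healthy = step;
  }
  return healthy;
}

const ceiling = lastHealthyStep(steps, { maxP95Ms: 500, maxErrorRate: 0.01 });
console.log(ceiling.rps); // 200
```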
Cost vs headroom
More headroom means higher reliability and higher cost. The goal is enough margin to be safe, not so much that you pay for idle machines.
| Peak utilization target | Reliability | Cost | Fit |
|---|---|---|---|
| 50% | Very safe | High | Spiky, critical, hard-to-scale services |
| 65% | Balanced | Moderate | Most production web services |
| 85% | Tight | Low | Predictable load with fast autoscaling |
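To put rough numbers on the table, you can compute the fleet size and monthly bill for each target. Everything here is an assumption for illustration: a 2000 req/s peak, 200 req/s per server, and a price of 100 per server per month.

```javascript
// Servers and monthly cost needed to keep peak utilization at the target.
function fleetForTarget(peakRps, perServerRps, targetUtilization, monthlyCostPerServer) {
  const servers = Math.ceil(peakRps / (perServerRps * targetUtilization));
  return { servers, monthlyCost: servers * monthlyCostPerServer };
}

for (const target of [0.5, 0.65, 0.85]) {
  const { servers, monthlyCost } = fleetForTarget(2000, 200, target, 100);
  console.log(`${target * 100}% target: ${servers} servers, ${monthlyCost}/month`);
}
// 50% target: 20 servers, 2000/month
// 65% target: 16 servers, 1600/month
// 85% target: 12 servers, 1200/month
```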
The cheapest way to keep headroom low and stay safe is autoscaling — automatically adding or removing servers based on a metric like CPU. With reliable autoscaling you can run a higher base utilization, because new capacity appears within minutes. Without it, you must buy peak capacity up front and let it sit idle most of the time.
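The core of most autoscalers is a proportional rule: scale the fleet by the ratio of observed to target utilization. The sketch below is a simplified illustration of that idea, not the algorithm of any particular product; real autoscalers add cooldowns and smoothing on top.

```javascript
// Proportional scaling: resize so that average utilization lands near the target.
// min and max are guard rails so a bad metric cannot scale to zero or to infinity.
function desiredServers(currentServers, currentUtilPct, targetUtilPct, { min = 1, max = Infinity } = {}) {
  const raw = Math.ceil((currentServers * currentUtilPct) / targetUtilPct);
  return Math.min(max, Math.max(min, raw));
}

console.log(desiredServers(4, 90, 65));             // 6, scale out
console.log(desiredServers(4, 65, 65));             // 4, no change
console.log(desiredServers(4, 30, 65, { min: 2 })); // 2, scale in
```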
Best practices
- Measure before you provision: base every decision on collected metrics (sar, Prometheus), not on guesses or feelings.
- Find the resource that runs out first (CPU, memory, connections) and plan around that bottleneck, not the most visible one.
- Keep peak utilization around 60-70% so one failed node or a sudden spike doesn’t take you down.
- Load test in staging before launches; know your per-server ceiling instead of discovering it in production.
- Use autoscaling tied to a real metric so you can run leaner while staying safe.
- Re-forecast every quarter and before any known traffic event — capacity plans go stale fast.
- Set monitoring alerts to fire on a trend (e.g. “70% busy for 30 minutes”), not just instant spikes, so you add capacity calmly.
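The trend-based alert in the last point comes down to a simple check: fire only when every reading in the window is at or above the threshold. A minimal sketch, assuming one utilization sample per minute; in practice your monitoring system would evaluate this for you.

```javascript
// samples: utilization percentages, one per minute, oldest first.
// Returns true only if the last windowMinutes readings are all >= thresholdPct.
function sustainedBreach(samples, thresholdPct, windowMinutes) {
  if (samples.length < windowMinutes) return false; // not enough history yet
  return samples.slice(-windowMinutes).every((pct) => pct >= thresholdPct);
}

const steadyHigh = new Array(30).fill(75);
const briefSpike = [...new Array(29).fill(40), 95];

console.log(sustainedBreach(steadyHigh, 70, 30)); // true
console.log(sustainedBreach(briefSpike, 70, 30)); // false
```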