SLAs, SLOs & SLIs

When people talk about how reliable a system is, they throw around three confusingly similar acronyms: SLI, SLO, and SLA. They sound alike but mean very different things, and mixing them up leads to bad engineering decisions and broken promises to customers. This page defines each one in plain English, shows how they build on each other, and gives you a worked example using the famous “99.9%” number so you know exactly what it costs you in real downtime.

The three terms in one sentence each

These three ideas stack on top of one another. You measure something (the SLI), you set a goal for that measurement (the SLO), and sometimes you sign a legal contract around it (the SLA).

Term	Full name	What it is	Audience
SLI	Service Level Indicator	A real, measured metric (a number)	Engineers
SLO	Service Level Objective	A target value for that metric	Engineers + product
SLA	Service Level Agreement	A contract with consequences if you miss the target	Lawyers + customers

A good way to remember the order: an SLI is what you measure, an SLO is what you aim for, and an SLA is what you promise (and pay for if you break it).

SLI — the thing you actually measure

An SLI (Service Level Indicator) is a number you collect from your running system. It is most often a ratio of “good events” to “total events”, expressed as a percentage, although some indicators (such as throughput) are plain rates. It is not a goal or a promise; it is just the current reading on the dial.

Common SLIs include:

Availability — the percent of requests that succeeded (did not return a 5xx server error).
Latency — the percent of requests served faster than some threshold, e.g. under 300 milliseconds.
Error rate — the percent of requests that failed.
Throughput — requests handled per second.
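The ratio-style SLIs in this list can be sketched in a few lines of Python. This is only an illustration: the `Request` record, its field names, and the sample data are made up, and the 300 ms threshold is borrowed from the latency example above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def ratio_sli(good: int, total: int) -> float:
    """Return good/total as a percentage."""
    if total == 0:
        return 100.0  # assumption: no traffic means nothing failed
    return 100.0 * good / total


def availability_sli(requests: list[Request]) -> float:
    """Percent of requests that did not return a 5xx status."""
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return ratio_sli(good, len(requests))


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Percent of requests served faster than threshold_ms."""
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return ratio_sli(good, len(requests))


# Made-up sample: four requests, one server error, one slow response.
sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 80.0),
    Request(404, 95.0),
]
print(f"Availability: {availability_sli(sample):.1f}%")
print(f"Latency (<300 ms): {latency_sli(sample):.1f}%")
```

Note that the 404 counts as a good event for availability: the server answered correctly, even though the client asked for something that does not exist.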

If you run Nginx (a popular web server and reverse proxy, meaning a server that sits in front of your app and forwards requests to it), the raw data for an availability SLI is already sitting in your access log on Ubuntu. Here is a quick way to compute it over everything in the current log file, which covers the traffic since the last log rotation:

# Count total requests vs. requests that did NOT return a 5xx error
# (the "if (total)" guard avoids a division by zero on an empty log)
sudo awk '{ total++ } $9 !~ /^5/ { good++ } END { if (total) printf "Availability: %.4f%%\n", (good/total)*100 }' /var/log/nginx/access.log

Output:

Availability: 99.9732%

The $9 field is the HTTP status code in the default Nginx combined log format. If you customised your log_format in /etc/nginx/nginx.conf, the status code may be in a different column — check before trusting the number.

That 99.9732% is your SLI for this period. It is a fact about the past, not a goal.
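If counting whitespace-separated columns feels fragile, the same calculation can be sketched in Python with a regular expression that anchors on the quoted request line instead. This assumes the default combined format; the pattern, the `availability_from_lines` helper, and the sample lines are illustrative and not part of Nginx.

```python
import re

# Matches: ... "<request line>" <3-digit status> ...
# Assumes the request is the first quoted field, as in the combined format.
STATUS_RE = re.compile(r'^[^"]*"[^"]*" (\d{3}) ')


def availability_from_lines(lines):
    """Return (good, total, percent) for an iterable of access-log lines."""
    good = total = 0
    for line in lines:
        match = STATUS_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like combined-format entries
        total += 1
        if not match.group(1).startswith("5"):
            good += 1
    percent = 100.0 * good / total if total else None
    return good, total, percent


# Made-up sample lines in the combined format, plus one line of junk.
sample_lines = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.5.0"',
    '203.0.113.9 - - [10/Oct/2024:13:55:37 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.5.0"',
    '203.0.113.7 - - [10/Oct/2024:13:55:38 +0000] "POST /login HTTP/1.1" 302 0 "-" "Mozilla/5.0"',
    'not a log line',
]
print(availability_from_lines(sample_lines))
```

To run it against a real log, pass an open file object (for example the result of `open("/var/log/nginx/access.log")`) instead of `sample_lines`; the script needs read permission on that file.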

SLO — the target you aim for

An SLO (Service Level Objective) is the goal you set for an SLI over a window of time, usually a rolling 28 or 30 days. It is an internal target that your team agrees on. It is intentionally less than 100%, because chasing 100% is impossibly expensive and leaves no room to ship new features or do maintenance.

A typical SLO reads like this:

“99.9% of HTTP requests over a rolling 30-day window will succeed (return a non-5xx status).”

The gap between 100% and your SLO is your error budget — the amount of failure you are allowed to spend before you must stop shipping risky changes and focus on stability. Error budgets get their own page (linked below), but the short version is: the SLO creates the budget.
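As a rough sketch of how the SLO creates the budget, the snippet below turns a request-based SLO into a number of allowed failures and reports how much of it has been spent. The `error_budget` function and the traffic figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def error_budget(slo_percent: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise a request-based error budget for one SLO window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1 - slo_percent / 100) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Made-up numbers: 2,000,000 requests in the window, 1,200 of them failed.
budget = error_budget(slo_percent=99.9, total_requests=2_000_000, failed_requests=1_200)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

With these made-up numbers the SLO allows about 2,000 failed requests, so 1,200 failures means roughly 60% of the budget is gone.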

When to set an SLO (and when not to)

Set an SLO for any user-facing service whose reliability customers notice — APIs, web apps, login systems, payment flows.
Do not set a strict SLO for internal batch jobs, experiments, or services where a few minutes of downtime causes no real harm. Over-engineering reliability for things nobody depends on wastes time and money.

SLA — the contract with teeth

An SLA (Service Level Agreement) is a legal contract between you and your customer. It promises a certain level of service and spells out the consequences — usually money back or service credits — if you fail to deliver. This is the only one of the three that involves lawyers and refunds.

The crucial rule: your SLA target should always be looser than your internal SLO. If your engineers aim for 99.9% (the SLO) but you only promise customers 99.5% (the SLA), you have a safety margin. You will breach your internal target and start fixing things long before you ever owe a customer money.

Concept	Example value	Consequence of missing
SLO (internal)	99.9%	Freeze risky deploys, focus on reliability
SLA (external)	99.5%	Pay service credits / refunds to the customer
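The safety margin between the two targets can be sketched as a simple check. The function name, the returned labels, and the default thresholds (taken from the example values in the table) are assumptions for illustration, not a standard API.

```python
def check_targets(sli_percent: float, slo_percent: float = 99.9, sla_percent: float = 99.5) -> str:
    """Classify a measured SLI against an internal SLO and an external SLA."""
    if sla_percent >= slo_percent:
        # The SLA must be the looser of the two, otherwise there is no margin.
        raise ValueError("the SLA target should be looser than the SLO")
    if sli_percent >= slo_percent:
        return "ok"
    if sli_percent >= sla_percent:
        return "slo_missed"  # internal alarm: freeze risky deploys
    return "sla_breached"    # contractual breach: credits or refunds owed


# Made-up readings to show the three possible outcomes.
for reading in (99.9732, 99.7, 99.2):
    print(f"{reading}% -> {check_targets(reading)}")
```

The middle outcome is the whole point of the margin: the team is already reacting while the customer contract is still intact.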

What the nines actually cost you

The phrase “three nines” means 99.9%. Each extra nine cuts the allowed downtime by a factor of ten and is usually much harder to achieve. Here is what each level allows in real downtime, which makes this the most important table on this page. The “three nines” row is worth memorising, since 99.9% is a very common target.

Availability	Nickname	Downtime per year	Downtime per month (30 days)	Downtime per day
99%	two nines	3.65 days	7.2 hours	14.4 minutes
99.9%	three nines	8.77 hours	43.2 minutes	1.44 minutes
99.95%	three-and-a-half nines	4.38 hours	21.6 minutes	43 seconds
99.99%	four nines	52.6 minutes	4.32 minutes	8.6 seconds
99.999%	five nines	5.26 minutes	25.9 seconds	0.86 seconds

So a 99.9% availability SLO allows about 43 minutes of full downtime per 30-day window. That is your error budget expressed as time. If a single bad deploy causes a 30-minute outage, you have spent roughly 70% of the window’s budget in one go.
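If you would rather derive these figures than memorise them, here is a minimal sketch that computes the allowed downtime for each target. The period lengths (a 365.25-day year and a 30-day month) and the helper names are assumptions made for this example.

```python
# Period lengths in seconds. The 365.25-day year and 30-day month are assumptions.
PERIODS = {
    "year": 365.25 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "day": 24 * 3600,
}


def allowed_downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Seconds of downtime a target allows over a period."""
    return (1 - availability_percent / 100) * period_seconds


def humanise(seconds: float) -> str:
    """Format a duration using the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.2f} seconds"


# One tab-separated row per target: year, month, day.
for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    cells = [humanise(allowed_downtime_seconds(target, p)) for p in PERIODS.values()]
    print(f"{target}%", *cells, sep="\t")

# Share of a 30-day, 99.9% budget used up by one 30-minute outage.
outage_share = (30 * 60) / allowed_downtime_seconds(99.9, PERIODS["month"])
print(f"30-minute outage uses {outage_share:.0%} of the budget")
```

Changing the period lengths shifts the results slightly, which is why published downtime tables do not always agree to the last digit.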

Doing the maths yourself

You can compute allowed downtime for any target on Ubuntu with a one-liner:

# Minutes of allowed downtime per 30-day month for a 99.9% SLO
python3 -c "slo=99.9; print(round((1 - slo/100) * 30*24*60, 1), 'minutes/month')"
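The same arithmetic also runs in reverse: given how long you were actually down, you can work out the availability you achieved. A small sketch follows, assuming a 30-day window by default; the function name is made up for this example.

```python
def achieved_availability(downtime_minutes: float, window_days: int = 30) -> float:
    """Availability (percent) achieved over a window with the given downtime."""
    window_minutes = window_days * 24 * 60
    if not 0 <= downtime_minutes <= window_minutes:
        raise ValueError("downtime must be between 0 and the window length")
    return 100.0 * (1 - downtime_minutes / window_minutes)


# A single 30-minute outage in a 30-day window.
print(f"{achieved_availability(30):.3f}%")
```

Comparing the result with your SLO tells you at a glance whether an outage has already pushed you past the target.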