
SLAs, SLOs & SLIs

When people talk about how reliable a system is, they throw around three confusingly similar acronyms: SLI, SLO, and SLA. They sound alike but mean very different things, and mixing them up leads to bad engineering decisions and broken promises to customers. This page defines each one in plain English, shows how they build on each other, and gives you a worked example using the famous “99.9%” number so you know exactly what it costs you in real downtime.

The three terms in one sentence each

These three ideas stack on top of one another. You measure something (the SLI), you set a goal for that measurement (the SLO), and sometimes you sign a legal contract around it (the SLA).

Term | Full name | What it is | Audience
SLI | Service Level Indicator | A real, measured metric (a number) | Engineers
SLO | Service Level Objective | A target value for that metric | Engineers + product
SLA | Service Level Agreement | A contract with consequences if you miss the target | Lawyers + customers

A good way to remember the order: an SLI is what you measure, an SLO is what you aim for, and an SLA is what you promise (and pay for if you break it).

SLI — the thing you actually measure

An SLI (Service Level Indicator) is a number you collect from your running system. It is most often a ratio of “good events” to “total events”, expressed as a percentage, although some indicators (such as throughput) are plain rates. It is not a goal or a promise; it is just the current reading on the dial.

Common SLIs include:

  • Availability — the percent of requests that succeeded (did not return a 5xx server error).
  • Latency — the percent of requests served faster than some threshold, e.g. under 300 milliseconds.
  • Error rate — the percent of requests that failed.
  • Throughput — requests handled per second.
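
The ratio-style SLIs in this list all reduce to the same calculation: count the good events and divide by the total. The sketch below illustrates that with a handful of made-up requests; the `Request` class, the `sli_percent` helper, and the sample values are assumptions for illustration, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def sli_percent(requests, is_good):
    """Return the percentage of requests for which is_good(request) is true."""
    total = len(requests)
    if total == 0:
        return None  # no traffic: the SLI is undefined rather than 100%
    good = sum(1 for r in requests if is_good(r))
    return 100.0 * good / total


# Hypothetical sample data, for illustration only.
sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 30.0),
    Request(404, 80.0),
    Request(200, 310.0),
]

# Availability: anything that is not a 5xx counts as good.
availability = sli_percent(sample, lambda r: r.status < 500)
# Latency: served faster than the 300 ms threshold counts as good.
latency = sli_percent(sample, lambda r: r.duration_ms < 300)

print(f"Availability SLI: {availability:.1f}%")
print(f"Latency SLI (<300 ms): {latency:.1f}%")
```

Only the definition of “good” changes between the two indicators; the counting logic stays the same.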

If you run Nginx (a popular web server and reverse proxy, meaning a server that sits in front of your app and forwards requests to it), the raw data for an availability SLI is already sitting in your access log on Ubuntu. Here is a quick way to compute it for every request in the current log file, that is, all traffic since the log was last rotated:

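# Count total requests vs. requests that did NOT return a 5xx error.
# The total == 0 check avoids a division-by-zero error on an empty log.
sudo awk '{ total++ } $9 !~ /^5/ { good++ } END { if (total == 0) { print "No requests found"; exit 1 } printf "Availability: %.4f%%\n", (good/total)*100 }' /var/log/nginx/access.log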

Output:

Availability: 99.9732%

The $9 field is the HTTP status code in the default Nginx combined log format. If you customised your log_format in /etc/nginx/nginx.conf, the status code may be in a different column — check before trusting the number.

That 99.9732% is your SLI for this period. It is a fact about the past, not a goal.
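
Counting whitespace-separated fields can pick the wrong column when a request line is malformed. One possible alternative, sketched here in Python, finds the status code by its position right after the quoted request string. The regular expression, the `availability_from_lines` name, and the sample lines are illustrative assumptions based on the combined log format.

```python
import re

# In the combined format the status code follows the quoted request line:
#   ... "GET /path HTTP/1.1" 200 512 ...
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')


def availability_from_lines(lines):
    """Return the percentage of parsed requests that did not return a 5xx."""
    total = good = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        total += 1
        if not match.group(1).startswith("5"):
            good += 1
    return None if total == 0 else 100.0 * good / total


# Hypothetical log lines, for illustration only.
sample_lines = [
    '203.0.113.5 - - [10/Oct/2025:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    '203.0.113.5 - - [10/Oct/2025:13:55:37 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2025:13:55:38 +0000] "-" 400 0 "-" "-"',
    '203.0.113.9 - - [10/Oct/2025:13:55:39 +0000] "GET /x HTTP/1.1" 404 153 "-" "curl/8.0"',
]

print(f"Availability: {availability_from_lines(sample_lines):.4f}%")
```

Because the function accepts any iterable of strings, an open file handle for the access log can be passed in directly in place of `sample_lines`.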

SLO — the target you aim for

An SLO (Service Level Objective) is the goal you set for an SLI over a window of time, usually a rolling 28 or 30 days. It is an internal target that your team agrees on. It is intentionally less than 100%, because chasing 100% is impossibly expensive and leaves no room to ship new features or do maintenance.

A typical SLO reads like this:

“99.9% of HTTP requests over a rolling 30-day window will succeed (return a non-5xx status).”

The gap between 100% and your SLO is your error budget — the amount of failure you are allowed to spend before you must stop shipping risky changes and focus on stability. Error budgets get their own page (linked below), but the short version is: the SLO creates the budget.
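
For a request-based SLO, the error budget can be expressed as a number of failed requests. The sketch below shows the arithmetic under that assumption; the `error_budget` helper and the traffic figures are hypothetical.

```python
def error_budget(slo_percent, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures, fraction_spent)."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    # The budget is the share of requests the SLO permits to fail.
    allowed = total_requests * (1 - slo_percent / 100)
    remaining = allowed - failed_requests
    # With no traffic there is no budget to spend, so report None.
    spent = failed_requests / allowed if allowed > 0 else None
    return allowed, remaining, spent


# Hypothetical month: 2,000,000 requests, 500 of which failed.
allowed, remaining, spent = error_budget(99.9, 2_000_000, 500)
print(f"Allowed failures:   {allowed:.0f}")
print(f"Remaining failures: {remaining:.0f}")
print(f"Budget spent:       {spent:.0%}")
```

A negative `remaining` value means the budget is exhausted and the SLO has been missed for the window.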

When to set an SLO (and when not to)

  • Set an SLO for any user-facing service whose reliability customers notice — APIs, web apps, login systems, payment flows.
  • Do not set a strict SLO for internal batch jobs, experiments, or services where a few minutes of downtime causes no real harm. Over-engineering reliability for things nobody depends on wastes time and money.

SLA — the contract with teeth

An SLA (Service Level Agreement) is a legal contract between you and your customer. It promises a certain level of service and spells out the consequences — usually money back or service credits — if you fail to deliver. This is the only one of the three that involves lawyers and refunds.

The crucial rule: your SLA target should always be looser than your internal SLO. If your engineers aim for 99.9% (the SLO) but you only promise customers 99.5% (the SLA), you have a safety margin. You will breach your internal target and start fixing things long before you ever owe a customer money.

Concept | Example value | Consequence of missing
SLO (internal) | 99.9% | Freeze risky deploys, focus on reliability
SLA (external) | 99.5% | Pay service credits / refunds to the customer
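
The safety margin between the two targets can be pictured as a three-way check on the measured SLI. This is a minimal sketch using the example values from the table; the `reliability_status` function and its return labels are assumptions made for illustration.

```python
def reliability_status(sli_percent, slo_percent=99.9, sla_percent=99.5):
    """Classify a measured SLI against the internal SLO and the external SLA."""
    if sla_percent >= slo_percent:
        raise ValueError("The SLA target should be looser (lower) than the SLO")
    if sli_percent >= slo_percent:
        return "ok"
    if sli_percent >= sla_percent:
        return "slo_missed"    # internal target missed: freeze risky deploys
    return "sla_breached"      # contract broken: service credits or refunds


for reading in (99.97, 99.7, 99.2):
    print(f"{reading}% -> {reliability_status(reading)}")
```

The middle band, between 99.5% and 99.9% in this example, is the buffer in which the team reacts before any money is owed.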

What the nines actually cost you

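The phrase “three nines” means 99.9%. Each extra nine cuts the allowed downtime by a factor of ten, and is usually much harder and more expensive to achieve than the one before it. Here is what each level allows in real downtime; this is the most important table on this page. Memorise the “three nines” row, since it is one of the most commonly quoted targets.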

Availability | Nickname | Downtime per year | Downtime per month (30 days) | Downtime per day
99% | two nines | 3.65 days | 7.2 hours | 14.4 minutes
99.9% | three nines | 8.77 hours | 43.2 minutes | 1.44 minutes
99.95% | three-and-a-half nines | 4.38 hours | 21.6 minutes | 43 seconds
99.99% | four nines | 52.6 minutes | 4.32 minutes | 8.6 seconds
99.999% | five nines | 5.26 minutes | 25.9 seconds | 0.86 seconds

So a 99.9% availability SLO means your service is allowed to be down for about 43 minutes in a 30-day month. That is your error budget. If a single bad deploy causes a 30-minute outage, you have already spent roughly 70% of the month’s budget in one go.
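
To see how much of the budget a single outage consumes, divide the outage length by the allowed downtime for the window. The helper names below are hypothetical, and a 30-day window is assumed.

```python
def budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for a time-based availability SLO."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def outage_share(outage_minutes, slo_percent, window_days=30):
    """Fraction of the window's error budget used by one outage."""
    return outage_minutes / budget_minutes(slo_percent, window_days)


share = outage_share(30, 99.9)
print(f"A 30-minute outage spends {share:.0%} of a 99.9% monthly budget")
```

The same 30-minute outage against a 99.99% target would be several times the entire monthly budget, which is why each extra nine changes how incidents have to be handled.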

Doing the maths yourself

You can compute allowed downtime for any target on Ubuntu with a one-liner:

# Minutes of allowed downtime per 30-day month for a 99.9% SLO
python3 -c "slo=99.9; print(round((1 - slo/100) * 30*24*60, 1), 'minutes/month')"

Output:

43.2 minutes/month
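
The same formula extends to any target and window. The following sketch prints allowed downtime per year, per 30-day month, and per day for several targets; the function names are illustrative, and a 365.25-day year is assumed.

```python
def allowed_downtime_seconds(slo_percent, window_days):
    """Seconds of downtime permitted by the SLO over the given window."""
    return (1 - slo_percent / 100) * window_days * 24 * 60 * 60


def humanize(seconds):
    """Format a duration using the largest unit that fits."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.2f} seconds"


for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year = humanize(allowed_downtime_seconds(slo, 365.25))
    per_month = humanize(allowed_downtime_seconds(slo, 30))
    per_day = humanize(allowed_downtime_seconds(slo, 1))
    print(f"{slo}%: {per_year} / year, {per_month} / month, {per_day} / day")
```

Changing the window length in one place keeps the yearly, monthly, and daily figures consistent with each other.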

Best Practices

  • Measure SLIs from the user’s perspective. A request the user never saw succeed is a failure, even if your server thinks it returned 200.
  • Keep the SLA looser than the SLO. Always leave a buffer so internal alarms fire before contractual penalties do.
  • Don’t aim for 100%. It is impossibly expensive and leaves no error budget for shipping features or doing maintenance.
  • Pick a clear measurement window. A rolling 28- or 30-day window is the common default; state it explicitly in the SLO.
  • Start with one or two SLIs. Availability and latency cover most user pain. Add more only when they reflect real user experience.
  • Review SLOs quarterly. As traffic and customer expectations change, your targets should too.
  • Automate the maths. Wire your SLI computation into Prometheus or a dashboard so nobody hand-runs awk during an incident.
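
A rolling measurement window can be sketched as a fixed-length queue of daily counts, where the oldest day drops out as each new day arrives. The `RollingSLI` class below is a hypothetical illustration of that idea, not a replacement for a monitoring system.

```python
from collections import deque


class RollingSLI:
    """Track a good/total SLI over the most recent `window_days` daily buckets."""

    def __init__(self, window_days=30):
        # deque with maxlen discards the oldest day automatically.
        self.days = deque(maxlen=window_days)

    def add_day(self, good, total):
        self.days.append((good, total))

    def percent(self):
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return None if total == 0 else 100.0 * good / total

    def meets(self, slo_percent):
        value = self.percent()
        return value is not None and value >= slo_percent


# Hypothetical daily counts with a short 3-day window to keep the example small.
window = RollingSLI(window_days=3)
for good, total in [(1000, 1000), (990, 1000), (1000, 1000), (1000, 1000)]:
    window.add_day(good, total)

print(f"Rolling SLI: {window.percent():.4f}%")
print("Meets 99.9% SLO:", window.meets(99.9))
```

Summing the raw counts before dividing weights each day by its traffic, so a quiet day with a few failures does not distort the window as much as averaging daily percentages would.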
Last updated June 15, 2026