Error Budgets

Every online service breaks sometimes. The clever idea at the heart of Site Reliability Engineering (SRE — the practice of running software systems reliably using engineering, not just manual operations) is that you should not try to make a service perfect. Instead you decide how much unreliability you can tolerate, and you treat that allowance as a budget you are allowed to spend. That allowance is called the error budget. It turns the endless argument between “ship faster” and “be more stable” into a single number that both sides can see.

What an error budget actually is

An error budget is simply 100% minus your SLO.

An SLO (Service Level Objective — the reliability target you promise to hit, like “99.9% of requests succeed”) sets the bar. Whatever is left under 100% is the amount of failure you are allowed to have. That leftover is the budget.

SLO target	Allowed failure (error budget)	Downtime per 30 days
99%	1%	~7 hours 12 min
99.9%	0.1%	~43 min
99.95%	0.05%	~22 min
99.99%	0.01%	~4 min 19 sec

So if your SLO is 99.9% over a 30-day window, your error budget is 0.1%. Out of every 1000 requests, 1 is allowed to fail before you have a problem. Across a month that is roughly 43 minutes of “down” you can spend however you like.
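The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative (they do not come from any SRE library), and it assumes the budget is measured as time against a fixed window length in days.

```python
def error_budget(slo_percent: float) -> float:
    """Return the error budget as a fraction, e.g. about 0.001 for a 99.9% SLO."""
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return how many minutes of full downtime the budget covers in the window."""
    window_minutes = window_days * 24 * 60
    return error_budget(slo_percent) * window_minutes


# Print the allowance for a few common targets over a 30-day window.
for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% SLO -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
```

For a 99.9% SLO this gives about 43.2 minutes, which is where the "roughly 43 minutes" figure comes from.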

The key shift in thinking: those failures are not a disaster to be avoided at all costs. They are a resource. You can spend them on risky deploys, experiments, and faster shipping. The only rule is you cannot spend more than you have.

A 100% reliability target is the wrong target for almost everything. Each extra fraction of a percent tends to cost far more than the one before it and slows your team to a crawl, and your users usually cannot even tell the difference because their phone, Wi-Fi, and ISP are less reliable than your service anyway.

Why this is so useful

Without an error budget, “is it stable enough to ship?” is a feelings argument. Developers want to release features. The operations team wants nothing to change because change is what breaks things. Both are partly right and there is no way to settle it.

The error budget settles it with data:

If the budget has room left, the team is free to ship, take risks, and move fast. Stability is fine, so go.
If the budget is spent, shipping risky changes stops until reliability recovers. The data says you are out of room.

Nobody has to win the argument because the number decides.

Measuring your budget burn

You measure the budget the same way you measure the SLO: with an SLI (Service Level Indicator — the actual measured number, like the real percentage of successful requests). If your SLI is request success rate, “burning” budget means failed requests.

Here is a simple way to read budget burn from logs on an Ubuntu server. Say Nginx (a popular web server and reverse proxy — a server that sits in front of your app and forwards requests to it) logs status codes for every request.

# Count requests per HTTP status code in the current access log
# (field 9 is the status code in Nginx's default "combined" log format)
sudo awk '{print $9}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head

The output lists each status code with its request count in front of it, most frequent first.

Now turn that into a burn calculation:

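sudo awk '
  { total++ }                  # every log line is one request
  $9 ~ /^5/ { errors++ }       # field 9 is the status code; any 5xx counts as a failure
  END {
    if (total == 0) { print "No requests found"; exit }   # avoid dividing by zero
    printf "Total:   %d\n", total
    printf "Errors:  %d\n", errors
    printf "Success: %.4f%%\n", (1 - errors/total) * 100
    # 0.001 is the allowed error ratio for a 99.9% SLO
    printf "Budget (SLO 99.9%%) spent: %.1f%%\n", (errors/total) / 0.001 * 100
  }' /var/log/nginx/access.log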

Output:

Total:   50042
Errors:  705
Success: 98.5912%
Budget (SLO 99.9%) spent: 1408.8%

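That last line is the alarm. A success rate of 98.59% is far below the 99.9% SLO: the error ratio is about 14 times what the SLO allows. If the whole 30-day window looked like this log, you would have spent about 1409% of your budget. You are deep in the red and risky changes should stop now.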
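The same calculation can be sketched in Python, using the totals from the output above. The first function mirrors the awk arithmetic. The second is an added assumption, not part of the original script: it takes a hypothetical `expected_window_requests` figure so that errors from a single log file can be compared against the budget for the whole window.

```python
def budget_spent_percent(total: int, errors: int, slo_percent: float = 99.9) -> float:
    """Error ratio of this sample as a percentage of the ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, nothing spent
    allowed_ratio = (100.0 - slo_percent) / 100.0
    return (errors / total) / allowed_ratio * 100.0


def window_budget_spent_percent(errors: int, expected_window_requests: int,
                                slo_percent: float = 99.9) -> float:
    """Share of the whole window's budget used by these errors.

    expected_window_requests is an assumed traffic estimate for the full window.
    """
    allowed_errors = (100.0 - slo_percent) / 100.0 * expected_window_requests
    return errors / allowed_errors * 100.0


sample = budget_spent_percent(total=50042, errors=705)
# Assumption: the log covers one day and traffic is the same on all 30 days.
window = window_budget_spent_percent(errors=705, expected_window_requests=30 * 50042)
print(f"Sample burn: {sample:.1f}% of the allowed error ratio")
print(f"Window budget used by this log alone: {window:.1f}%")
```

Under that one-day assumption, this single log would account for a little under half of the 30-day budget, which is why the distinction between "error ratio in this sample" and "budget spent in the window" matters.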

In production you would not use raw awk on log files. You would record these numbers in a metrics system like Prometheus (an open-source monitoring tool that stores time-series numbers). A typical Prometheus rule for budget burn looks like this:

# /etc/prometheus/rules/error-budget.yml
groups:
  - name: error-budget
    rules:
      - record: job:request_error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
          sum(rate(http_requests_total[1h]))

      - alert: ErrorBudgetBurningFast
        expr: job:request_error_ratio:rate1h > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Burning error budget 14x too fast"

The 14.4 is a standard “fast burn” multiplier: at that rate you would use a whole month of budget in about two days, so it pages a human immediately.
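The relationship between a burn-rate multiplier and the time left can be worked out directly. The sketch below assumes a 30-day window and a constant burn rate; the function names are illustrative.

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if errors continue at this burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed_percent(burn_rate: float, hours: float, window_days: int = 30) -> float:
    """Percentage of the window's budget consumed after burning for `hours`."""
    return burn_rate * hours / (window_days * 24) * 100.0


print(f"At 14.4x: budget lasts {hours_to_exhaustion(14.4):.1f} hours")
print(f"At 14.4x: one hour consumes {budget_consumed_percent(14.4, 1):.1f}% of the budget")
```

At 14.4x the budget lasts about 50 hours (roughly two days), and a single hour at that rate uses about 2% of the monthly budget.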

When the budget runs out

The most important part of an error budget is the policy you agree on before anything breaks. The classic SRE rule is a change freeze: when the budget is exhausted, you stop shipping risky changes and the team’s focus switches to reliability work until the budget recovers.

A clear written policy might say:

Error Budget Policy
-------------------
Window:        30 rolling days
SLO:           99.9% request success
Budget:        0.1% (~43 min/month)

If budget remaining > 25%:   ship freely, take risks
If budget remaining 0-25%:   slow down, extra review on risky changes
If budget exhausted (0%):    FEATURE FREEZE
                             - only reliability + safety fixes ship
                             - team works on hardening until budget recovers

The freeze is not a punishment. It is the system protecting itself. It forces the team to invest in stability exactly when stability is the problem, and it lifts automatically once the rolling window heals.
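A written policy like the one above is simple enough to encode, for example in a deploy gate or a dashboard. The following is a minimal sketch using the same thresholds; the function name and return strings are illustrative, and treating an overspent (negative) budget the same as 0% is an assumption.

```python
def policy_action(budget_remaining_percent: float) -> str:
    """Map remaining error budget (as a percentage) to the agreed policy tier."""
    if budget_remaining_percent <= 0:
        # Exhausted or overspent: only reliability and safety fixes ship.
        return "feature freeze"
    if budget_remaining_percent <= 25:
        # Running low: extra review on risky changes.
        return "slow down"
    return "ship freely"


for remaining in (80.0, 25.0, 10.0, 0.0, -40.0):
    print(f"{remaining:>6.1f}% remaining -> {policy_action(remaining)}")
```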

Best practices

Set the SLO from user pain, not vanity. Pick the lowest reliability your users genuinely need. A lower SLO means a bigger budget and a faster team.
Use a rolling window (usually 28-30 days), not a calendar month. A rolling window means an outage stays “spent” until it ages out, instead of resetting to full on the 1st.
Agree the freeze policy in writing, in advance. Decide what happens at 0% budget while everyone is calm, not during an outage.
Make the budget visible. Put remaining budget on a dashboard everyone watches. A hidden budget changes no behavior.
Alert on burn rate, not just on being out. A fast-burn alert (like the 14.4x rule) catches a fire while you still have time to act.
Let the budget reward good work. When teams ship safely and stay under budget, give them more freedom — that is the incentive that makes the whole system work.
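The rolling-window practice above can be illustrated with a small sketch. It assumes request and error counts are recorded once per day and kept in a fixed-length queue, so the oldest day drops out as each new day arrives. The `RollingBudget` class and its method names are hypothetical, not part of any monitoring tool.

```python
from collections import deque


class RollingBudget:
    """Track remaining error budget over a rolling window of daily counts."""

    def __init__(self, slo_percent: float = 99.9, window_days: int = 30):
        self.allowed_ratio = (100.0 - slo_percent) / 100.0
        # maxlen makes the deque discard the oldest day automatically.
        self.days = deque(maxlen=window_days)

    def record_day(self, total: int, errors: int) -> None:
        self.days.append((total, errors))

    def remaining_percent(self) -> float:
        total = sum(t for t, _ in self.days)
        errors = sum(e for _, e in self.days)
        if total == 0:
            return 100.0  # no traffic recorded, nothing spent
        allowed_errors = self.allowed_ratio * total
        return (1 - errors / allowed_errors) * 100.0


budget = RollingBudget()
budget.record_day(total=10_000, errors=50)   # one bad day
for _ in range(29):
    budget.record_day(total=10_000, errors=0)
print(f"Day 30: {budget.remaining_percent():.1f}% remaining")

budget.record_day(total=10_000, errors=0)    # the bad day ages out
print(f"Day 31: {budget.remaining_percent():.1f}% remaining")
```

The bad day keeps counting against the budget for the full 30 days and only stops counting once it falls out of the window, rather than disappearing at a calendar boundary.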