What is Site Reliability Engineering?

Site Reliability Engineering (SRE) is a way of running computer systems so they stay reliable, fast, and available — without burning out the people who keep them running. The big idea is simple: treat operations (the work of keeping servers and apps online) as a software problem, and solve it with code instead of manual labour. SRE was created at Google around 2003 and is now used by companies of every size. If you are learning DevOps and Linux server administration, SRE gives you the mindset and the numbers to decide how reliable a system needs to be — and how to get there without endless 3am pages.

Where SRE came from

In 2003 a Google engineer named Ben Treynor Sloss was asked to build a team to keep Google’s services running. Instead of hiring traditional system administrators (people who manage servers mostly by hand), he hired software engineers and told them to run operations the way they would build software. The result was the SRE discipline.

The famous one-line definition from Google is:

SRE is what happens when you ask a software engineer to design an operations team.

That single sentence is the whole philosophy. When an engineer who writes code is told “keep this service online,” they get bored doing the same manual task twice — so they automate it.

The core idea: operations as a software problem

In old-style operations, when something broke, a human logged into a server, ran some commands, and fixed it. That works for one server. It does not work for ten thousand servers. Manual fixes do not scale.

SRE flips this. Anything you do by hand more than once should become a script, a tool, or an automated system. Restarting a crashed app, rotating logs, deploying a new version, scaling up under load — all of these become software that runs on its own.

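A small, concrete example. Instead of manually restarting a service when it dies, you let systemd (the service manager on Ubuntu and most other Linux distributions, which starts and supervises background programs) do it automatically. Create a unit file with the contents shown here, then reload systemd and enable the service.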

sudo nano /etc/systemd/system/myapp.service

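[Unit]
Description=My web app
# Order this unit after the network stack has been started.
After=network.target

[Service]
# Adjust the node path and script location to match your server.
ExecStart=/usr/bin/node /srv/myapp/server.js
# Restart the process whenever it exits, waiting 5 seconds first.
Restart=always
RestartSec=5
# Run as an unprivileged user rather than root.
User=www-data

[Install]
# Once enabled, start the service as part of a normal multi-user boot.
WantedBy=multi-user.target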

sudo systemctl daemon-reload
sudo systemctl enable --now myapp.service
systemctl status myapp.service

Output:

● myapp.service - My web app
     Loaded: loaded (/etc/systemd/system/myapp.service; enabled; preset: enabled)
     Active: active (running) since Mon 2026-06-15 10:21:04 UTC; 3s ago
   Main PID: 4821 (node)
      Tasks: 11 (limit: 4915)
     Memory: 38.2M
        CPU: 240ms

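The Restart=always line tells systemd to start the app again whenever its process exits, whether it crashed or exited cleanly, and RestartSec=5 makes it wait 5 seconds before each attempt. A deliberate systemctl stop is the exception: systemd leaves the service stopped. That is SRE thinking: a machine handles the repetitive recovery, not a tired human.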
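To make the restart policy less magical, here is a minimal Python sketch of the loop that Restart=always and RestartSec describe. It is an illustration only, not how systemd is implemented, and it covers only the restart part of what systemd does for a service. The function name supervise and the max_restarts cap are assumptions added so the example terminates.

```python
import subprocess
import sys
import time


def supervise(argv, restart_sec=5.0, max_restarts=3):
    """Run argv, starting it again each time it exits (like Restart=always).

    max_restarts is an illustrative safety cap so this sketch terminates;
    it is not a systemd setting. Returns the exit code of every run, in order.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        result = subprocess.run(argv)
        exit_codes.append(result.returncode)
        if attempt < max_restarts:
            # Same idea as RestartSec: pause before the next start.
            time.sleep(restart_sec)
    return exit_codes


if __name__ == "__main__":
    # Hypothetical "service" that crashes immediately with exit status 1.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing, restart_sec=0.1, max_restarts=2))
```

Run as a script, this starts the crashing command three times (one initial run plus two restarts) and prints the three exit codes. In practice you would leave this job to systemd rather than write your own supervisor. The sketch is only there to show how simple the repetitive part is once a machine owns it.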

Toil: the enemy SRE fights

Toil is SRE’s name for the boring, repetitive, manual work that keeps a service running but adds no lasting value. Toil is work that is manual, repetitive, automatable, reactive (you only do it because something broke), and scales linearly — meaning if your traffic doubles, the work doubles too.

Examples of toil: manually restarting services, copying files by hand to deploy, clicking through a dashboard every morning to “check if things are okay,” or applying the same config to ten servers one at a time.

SRE teams measure toil and try to keep it under control. A well-known Google guideline is the 50% rule: an SRE should spend no more than half their time on toil. The other half goes to engineering work — building tools that remove toil. If toil grows past 50%, the team is drowning in operations and has no time to improve.
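Measuring toil can start as simple arithmetic over a time log. The sketch below assumes a made-up one-week log and a hand-picked set of tasks tagged as toil. The task names, the hours, and the helper name toil_fraction are all hypothetical.

```python
TOIL_CAP = 0.50  # the 50% guideline described above


def toil_fraction(hours_by_task, toil_tasks):
    """Return the share of logged hours spent on tasks tagged as toil."""
    total = sum(hours_by_task.values())
    if total == 0:
        return 0.0
    toil = sum(h for task, h in hours_by_task.items() if task in toil_tasks)
    return toil / total


# Hypothetical one-week time log, in hours.
week = {
    "manual restarts": 6,
    "hand deploys": 10,
    "morning dashboard check": 8,
    "build deploy tool": 12,
    "postmortem writing": 4,
}
toil_tasks = {"manual restarts", "hand deploys", "morning dashboard check"}

fraction = toil_fraction(week, toil_tasks)
print(f"toil: {fraction:.0%}, over cap: {fraction > TOIL_CAP}")
```

With these invented numbers, 24 of 40 hours are toil, so the team sits at 60% and is over the cap. The hard part in real life is not the division but agreeing on which tasks count as toil.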

Trait of toil	Why it’s bad	The SRE fix
Manual	Wastes human time	Write a script
Repetitive	Boring, error-prone	Automate it
Reactive	You’re always firefighting	Build self-healing systems
Scales with load	More users = more pain	Make the work O(1), not O(n)
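The last row of the table is about human effort, not machine effort. A rough sketch of the idea, assuming SSH access to a fleet: one command from you fans out to every server, so your work stays constant while the script does the per-host work. The hostnames and the helper names build_commands and run_everywhere are hypothetical, and the default is a dry run that only prints what would be executed.

```python
import subprocess


def build_commands(hosts, remote_cmd):
    """Return one ssh argument list per host for the same remote command."""
    return [["ssh", host, remote_cmd] for host in hosts]


def run_everywhere(hosts, remote_cmd, dry_run=True):
    """Run remote_cmd on every host and return the hosts where it failed.

    With dry_run=True (the default) the commands are only printed.
    """
    failed = []
    for argv in build_commands(hosts, remote_cmd):
        if dry_run:
            print(" ".join(argv))
            continue
        if subprocess.run(argv).returncode != 0:
            failed.append(argv[1])  # argv[1] is the hostname
    return failed


# Hypothetical hostnames; replace with your own inventory.
hosts = [f"web{n:02d}.example.internal" for n in range(1, 11)]
run_everywhere(hosts, "sudo systemctl restart myapp.service")
```

Going from ten servers to a thousand changes the length of the hosts list, not the amount of typing you do. This version visits hosts one at a time and does nothing clever about failures beyond reporting them, which is enough to show the principle.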

Gotcha: Not all manual work is toil. Designing a new system, writing a postmortem, or planning capacity is valuable engineering — even though it’s done by hand. Toil is specifically the automatable drudgery. Don’t try to “automate away” thinking.

How SRE relates to DevOps

DevOps and SRE are close cousins and often confused. DevOps is a broad culture and set of principles for breaking down the wall between developers (who write software) and operations (who run it). It says things like “automate everything” and “share responsibility” but doesn’t tell you exactly how.

SRE is one concrete implementation of those principles. A popular way to put it, borrowed from Google: “class SRE implements DevOps.” DevOps describes what you want; SRE is a specific, opinionated set of practices, backed by hard numbers, for achieving it.

	DevOps	SRE
What it is	A culture / philosophy	A specific job role and method
Origin	Community movement (~2009)	Google (~2003)
Reliability target	“Be reliable”	Measured with SLOs and error budgets
Key tool	Collaboration + automation	Error budgets, SLOs, toil limits
Question it answers	How should teams work?	How reliable, exactly, and how?

The reliability numbers SRE introduces

The reason SRE feels different from old operations is that it makes reliability measurable. Instead of vaguely aiming for “100% uptime” (which is impossible and wasteful), SRE sets a precise target and tracks it. The key terms — covered in depth on the next pages — are:

SLI (Service Level Indicator): a measurement, like “percentage of requests that succeed.”
SLO (Service Level Objective): your target for that measurement, like “99.9% of requests succeed.”
Error budget: the small amount of failure you are allowed. If your SLO is 99.9% of requests succeeding, up to 0.1% of requests may fail, which is 1,000 failures per million requests. This budget gives teams permission to take risks and ship features as long as the budget isn’t spent.

These three ideas turn reliability from a feeling into math, and they’re the foundation of everything else in SRE.
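Here is the math in a few lines of Python. This is a minimal sketch assuming a request-based SLI and invented traffic numbers. The function names are hypothetical, and the last helper applies the same percentage to wall-clock time, which is a common but separate way to express a budget.

```python
def sli(good_requests, total_requests):
    """SLI: fraction of requests that succeeded (assumes total_requests > 0)."""
    return good_requests / total_requests


def budget_remaining(slo, good_requests, total_requests):
    """Return the unspent share of the error budget (assumes slo < 1).

    1.0 means untouched, 0.0 means exactly spent, and a negative value
    means the SLO has been missed.
    """
    allowed_failures = (1 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


def allowed_downtime_minutes(slo, days=30):
    """The same budget expressed as minutes of full outage per window."""
    return (1 - slo) * days * 24 * 60


SLO = 0.999  # 99.9% of requests succeed
total, good = 1_000_000, 999_400  # made-up traffic for one window

remaining = budget_remaining(SLO, good, total)
print(f"SLI: {sli(good, total):.2%}")
print(f"error budget left: {remaining:.0%}")
print("ship features" if remaining > 0 else "pause and harden")
print(f"time-based budget: {allowed_downtime_minutes(SLO):.1f} minutes per 30 days")
```

With these numbers the service failed 600 requests out of an allowed 1,000, so the SLI is 99.94% and 40% of the budget is left, which means the team can keep shipping. Applied to time instead of requests, a 99.9% target works out to about 43.2 minutes of downtime in a 30-day window.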

Best Practices

Embrace risk, don’t chase perfection. 100% reliability is the wrong target; pick an SLO that matches what users actually need.
Measure toil and cap it. Aim to keep toil under 50% of an SRE’s time so there’s room to engineer improvements.
Automate the second time you do something. The first manual fix is fine; the second is a signal to write a tool.
Make systems self-healing. Use systemd, health checks, and auto-scaling so machines recover without paging a human.
Blame the system, not the person. When things break, fix the process and write a postmortem instead of punishing people.
Tie reliability to a budget. Use error budgets to decide whether to ship new features or pause and harden the system.