What is Site Reliability Engineering?
Site Reliability Engineering (SRE) is a way of running computer systems so they stay reliable, fast, and available — without burning out the people who keep them running. The big idea is simple: treat operations (the work of keeping servers and apps online) as a software problem, and solve it with code instead of manual labour. SRE was created at Google around 2003 and is now used by companies of every size. If you are learning DevOps and Linux server administration, SRE gives you the mindset and the numbers to decide how reliable a system needs to be — and how to get there without endless 3am pages.
Where SRE came from
In 2003 a Google engineer named Ben Treynor Sloss was asked to build a team to keep Google’s services running. Instead of hiring traditional system administrators (people who manage servers mostly by hand), he hired software engineers and told them to run operations the way they would build software. The result was the SRE discipline.
The famous one-line definition from Google is:
SRE is what happens when you ask a software engineer to design an operations team.
That single sentence is the whole philosophy. When an engineer who writes code is told “keep this service online,” they get bored doing the same manual task twice — so they automate it.
The core idea: operations as a software problem
In old-style operations, when something broke, a human logged into a server, ran some commands, and fixed it. That works for one server. It does not work for ten thousand servers. Manual fixes do not scale.
SRE flips this. Anything you do by hand more than once should become a script, a tool, or an automated system. Restarting a crashed app, rotating logs, deploying a new version, scaling up under load — all of these become software that runs on its own.
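A small, concrete example. Instead of manually restarting a service when it dies, you let systemd (the service manager on Ubuntu and most other modern Linux distributions, which starts and supervises background programs) do it automatically. Create the unit file below in an editor, then reload systemd, enable the service, and check its status.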
sudo nano /etc/systemd/system/myapp.service
[Unit]
Description=My web app
After=network.target
[Service]
ExecStart=/usr/bin/node /srv/myapp/server.js
Restart=always
RestartSec=5
User=www-data
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now myapp.service
systemctl status myapp.service
Output:
● myapp.service - My web app
Loaded: loaded (/etc/systemd/system/myapp.service; enabled; preset: enabled)
Active: active (running) since Mon 2026-06-15 10:21:04 UTC; 3s ago
Main PID: 4821 (node)
Tasks: 11 (limit: 4915)
Memory: 38.2M
CPU: 240ms
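The Restart=always line tells systemd to start the app again whenever its process exits, whether it crashed or exited cleanly, and RestartSec=5 makes it wait five seconds before each attempt. (A deliberate systemctl stop is not undone.) That is SRE thinking: a machine handles the repetitive recovery, not a tired human.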
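To make the restart behaviour concrete, here is a minimal Python sketch of the same loop: run a command, wait for it to exit, pause, and start it again. It is an illustration only and not how systemd is implemented. The `supervise` function, its parameters, and the `max_restarts` cap (added so the example terminates) are all hypothetical names invented for this sketch. In practice you would leave this job to systemd.

```python
import subprocess
import sys
import time


def supervise(command, restart_sec=5.0, max_restarts=3):
    """Run command, restarting it each time it exits.

    Returns the list of exit codes observed. max_restarts is a safety cap
    added for this sketch so the loop ends; it is not a systemd setting.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        result = subprocess.run(command)  # blocks until the process exits
        exit_codes.append(result.returncode)
        if attempt < max_restarts:
            time.sleep(restart_sec)  # plays the role of RestartSec=5
    return exit_codes


if __name__ == "__main__":
    # A stand-in "app" that crashes immediately with exit code 1.
    crashing_app = [sys.executable, "-c", "raise SystemExit(1)"]
    print(supervise(crashing_app, restart_sec=0.1, max_restarts=2))
```

Writing and maintaining a loop like this yourself is exactly the kind of work the unit file removes: two lines of configuration replace the whole function.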
Toil: the enemy SRE fights
Toil is SRE’s name for the boring, repetitive, manual work that keeps a service running but adds no lasting value. Toil is work that is manual, repetitive, automatable, reactive (you only do it because something broke), and scales linearly — meaning if your traffic doubles, the work doubles too.
Examples of toil: manually restarting services, copying files by hand to deploy, clicking through a dashboard every morning to “check if things are okay,” or applying the same config to ten servers one at a time.
SRE teams measure toil and try to keep it under control. A well-known Google guideline is the 50% rule: an SRE should spend no more than half their time on toil. The other half goes to engineering work — building tools that remove toil. If toil grows past 50%, the team is drowning in operations and has no time to improve.
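Measuring toil can be as simple as tagging logged hours. The sketch below assumes a hypothetical time log of `(hours, is_toil)` pairs; the `toil_fraction` function, the `TOIL_LIMIT` constant, and the sample week are made up for illustration and are not part of any standard tool.

```python
TOIL_LIMIT = 0.5  # the 50% guideline described above


def toil_fraction(entries):
    """Return the share of logged hours that were toil (0.0 to 1.0).

    entries is a list of (hours, is_toil) tuples.
    """
    total = sum(hours for hours, _ in entries)
    if total == 0:
        return 0.0
    toil = sum(hours for hours, is_toil in entries if is_toil)
    return toil / total


# Hypothetical week for one engineer: (hours, is_toil)
week = [
    (6.0, True),    # manual restarts and deploys
    (10.0, True),   # ticket-driven config changes
    (16.0, False),  # building a deploy tool
    (8.0, False),   # postmortem and capacity planning
]

fraction = toil_fraction(week)
print(f"toil: {fraction:.0%}, over limit: {fraction > TOIL_LIMIT}")
```

In this sample week, 16 of 40 hours are toil, so the engineer sits at 40% and stays under the limit.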
| Trait of toil | Why it’s bad | The SRE fix |
|---|---|---|
| Manual | Wastes human time | Write a script |
| Repetitive | Boring, error-prone | Automate it |
| Reactive | You’re always firefighting | Build self-healing systems |
| Scales with load | More users = more pain | Make the work O(1), not O(n) |
Gotcha: Not all manual work is toil. Designing a new system, writing a postmortem, or planning capacity is valuable engineering — even though it’s done by hand. Toil is specifically the automatable drudgery. Don’t try to “automate away” thinking.
How SRE relates to DevOps
DevOps and SRE are close cousins and often confused. DevOps is a broad culture and set of principles for breaking down the wall between developers (who write software) and operations (who run it). It says things like “automate everything” and “share responsibility” but doesn’t tell you exactly how.
SRE is one concrete implementation of those principles. A popular way to put it, borrowed from Google: “class SRE implements DevOps.” DevOps describes what you want; SRE is a specific, opinionated set of practices, with hard numbers, for achieving it.
| | DevOps | SRE |
|---|---|---|
| What it is | A culture / philosophy | A specific job role and method |
| Origin | Community movement (~2009) | Google (~2003) |
| Reliability target | “Be reliable” | Measured with SLOs and error budgets |
| Key tool | Collaboration + automation | Error budgets, SLOs, toil limits |
| Question it answers | How should teams work? | How reliable, exactly, and how? |
The reliability numbers SRE introduces
The reason SRE feels different from old operations is that it makes reliability measurable. Instead of vaguely aiming for “100% uptime” (which is impossible and wasteful), SRE sets a precise target and tracks it. The key terms — covered in depth on the next pages — are:
- SLI (Service Level Indicator): a measurement, like “percentage of requests that succeed.”
- SLO (Service Level Objective): your target for that measurement, like “99.9% of requests succeed.”
- Error budget: the small amount of failure you are allowed. If your SLO is 99.9% of requests succeeding, up to 0.1% of requests may fail. This budget gives teams permission to take risks and ship features as long as the budget isn’t spent.
These three ideas turn reliability from a feeling into math, and they’re the foundation of everything else in SRE.
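The arithmetic behind these three terms is short enough to sketch in Python. The function names and the sample traffic numbers below are hypothetical, chosen only for illustration. `allowed_downtime_minutes` additionally assumes a time-based SLO measured over a 30-day window, which is one common way to express the same budget.

```python
def sli(successful, total):
    """SLI: fraction of requests that succeeded (0.0 to 1.0)."""
    return successful / total


def budget_remaining(successful, total, slo):
    """Fraction of the error budget still unspent; negative if overspent.

    slo is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1 (100% leaves no budget)")
    allowed_failures = (1 - slo) * total
    actual_failures = total - successful
    return 1 - actual_failures / allowed_failures


def allowed_downtime_minutes(slo, window_days=30):
    """Downtime a time-based SLO permits over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


# Hypothetical month of traffic: 1,000,000 requests, 400 of them failed.
total_requests = 1_000_000
successful_requests = 999_600
slo_target = 0.999

remaining = budget_remaining(successful_requests, total_requests, slo_target)
print(f"SLI: {sli(successful_requests, total_requests):.3%}")
print(f"error budget remaining: {remaining:.0%}")
print("ship features" if remaining > 0 else "pause and harden")
print(f"downtime allowed in 30 days: {allowed_downtime_minutes(slo_target):.1f} min")
```

With these numbers the SLO allows about 1,000 failed requests and 400 have been used, so roughly 60% of the budget is left and the team can keep shipping. Expressed as time, a 99.9% target over 30 days works out to about 43.2 minutes of downtime.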
Best Practices
- Embrace risk, don’t chase perfection. 100% reliability is the wrong target; pick an SLO that matches what users actually need.
- Measure toil and cap it. Aim to keep toil under 50% of an SRE’s time so there’s room to engineer improvements.
- Automate the second time you do something. The first manual fix is fine; the second is a signal to write a tool.
- Make systems self-healing. Use systemd, health checks, and auto-scaling so machines recover without paging a human.
- Blame the system, not the person. When things break, fix the process and write a postmortem instead of punishing people.
- Tie reliability to a budget. Use error budgets to decide whether to ship new features or pause and harden the system.