Incident Response & On-Call

An incident is any unplanned event that breaks (or threatens to break) your service for users — a site that is down, a database that is slow, payments that fail. Incident response is the calm, repeatable process you follow to detect that problem, stop the bleeding, and get back to normal. The goal is not heroics; it is to shrink the time users feel pain and to make sure the same fire doesn’t restart tomorrow. If you are learning DevOps and Linux server administration, having a written incident process is the difference between a frantic 3am scramble and a controlled, ten-minute fix.

The incident lifecycle

Every incident moves through the same four stages. Knowing where you are stops panic and tells you what to do next.

Stage	What happens	The one question to ask
Detect	You find out something is wrong	“Is this real, and how bad?”
Triage	You assess impact and assign roles	“Who is affected and who owns this?”
Mitigate	You stop the user-facing pain now	“What is the fastest way to make it stop?”
Resolve	The system is healthy and verified	“Are we truly back to normal?”

The most important mindset shift for beginners: mitigate before you diagnose. If a bad deploy broke the site, roll it back first and figure out why it broke afterwards. Users do not care about the root cause while they are staring at an error page.

Detect

Detection should be automatic, not “a customer emailed us.” You set up monitoring and alerting (covered on the monitoring pages) so a machine pages a human when something crosses a threshold. A quick manual check on an Ubuntu server during an incident:

systemctl status nginx
journalctl -u myapp.service --since "10 min ago" --no-pager
curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://example.com/health

Output:

● nginx.service - A high performance web server
     Active: active (running) since Mon 2026-06-15 09:02:11 UTC; 4h ago
Jun 15 13:14:22 web-1 myapp[4821]: ERROR connection pool exhausted (max=20)
Jun 15 13:14:23 web-1 myapp[4821]: ERROR connection pool exhausted (max=20)
503 0.018s

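A 503 plus “connection pool exhausted” suggests the app process is running but cannot get a database connection. The database, or code that holds connections open too long, is the likely bottleneck, which is useful triage information within seconds.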
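
The manual check above can be turned into a quick classification. The following is a minimal sketch, not a replacement for real monitoring: the function name classify_health and the 1.0 second "slow" threshold are illustrative assumptions, and it expects the status code and response time as plain values (drop the trailing "s" from the curl format string).

```shell
# classify_health CODE SECONDS [SLOW_THRESHOLD]
# Prints OK, DEGRADED, or DOWN for one health-check sample.
# The 1.0s default threshold is an illustrative assumption; tune it.
classify_health() {
  local code="$1" seconds="$2" slow="${3:-1.0}"
  case "$code" in
    2??)
      # awk does the floating-point comparison the shell cannot do itself.
      if awk -v t="$seconds" -v s="$slow" 'BEGIN { exit !(t > s) }'; then
        echo "DEGRADED"
      else
        echo "OK"
      fi
      ;;
    *)
      # Any non-2xx code, including curl's 000 for "no response", is DOWN.
      echo "DOWN"
      ;;
  esac
}

# Sample values: the first pair is taken from the output above.
classify_health 503 0.018   # prints DOWN
classify_health 200 0.120   # prints OK
classify_health 200 2.500   # prints DEGRADED
```

To feed it live data, capture curl's output with -w "%{http_code} %{time_total}" and pass the two fields as arguments.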

Triage and severity levels

Triage means deciding how bad this is, because a slow image thumbnail and a total payments outage do not deserve the same response. Teams use severity levels (often “SEV” for short) so everyone instantly understands the stakes.

Severity	Meaning	Example	Response
SEV1	Critical, major user impact	Whole site down, data loss	All hands, page immediately, update status page
SEV2	Serious, partial impact	One key feature broken, login flaky	Page on-call, fix within hours
SEV3	Minor, limited impact	Slow page, cosmetic bug	Handle in business hours

When to use this: declare a SEV1 generously. It is far cheaper to wake up a second engineer for a false alarm than to let a real outage smoulder because nobody wanted to “overreact.” You can always downgrade.
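
One way to keep everyone on the same definitions is to encode the severity table in a small helper that runbooks and chat bots can call. This is only a sketch; the function name response_for is hypothetical, and the response strings are copied from the table above.

```shell
# response_for SEVERITY
# Prints the expected response for a severity level, as listed in the
# severity table. Returns 1 for anything that is not SEV1, SEV2, or SEV3.
response_for() {
  case "$1" in
    SEV1) echo "All hands, page immediately, update status page" ;;
    SEV2) echo "Page on-call, fix within hours" ;;
    SEV3) echo "Handle in business hours" ;;
    *)
      echo "Unknown severity: $1" >&2
      return 1
      ;;
  esac
}

response_for SEV1   # prints All hands, page immediately, update status page
```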

Mitigate

Mitigation is the fastest action that restores service, even if it is ugly. Common Ubuntu mitigations:

# Roll back a bad deploy by pointing the "current" symlink back at the previous release
sudo ln -sfn /srv/myapp/releases/2026-06-14 /srv/myapp/current
sudo systemctl restart myapp.service

# Restart a stuck service
sudo systemctl restart nginx

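# Temporarily block an abusive IP that is hammering you.
# ufw applies the first matching rule, so if an earlier allow rule already
# covers this traffic, use "sudo ufw insert 1 deny from 203.0.113.45" instead.
sudo ufw deny from 203.0.113.45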

Output:

Rule added
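
Typing a release date by hand under pressure is error-prone, so the symlink rollback above can be scripted. The sketch below assumes the same layout as the example (a releases directory of date-named folders plus a current symlink) and that lexical order of the folder names matches deploy order. The function names previous_release and rollback are illustrative.

```shell
# previous_release RELEASES_DIR CURRENT_LINK
# Prints the release directory that sorts immediately before the one
# CURRENT_LINK points at. Assumes date-named directories (YYYY-MM-DD)
# so that lexical order matches deploy order.
previous_release() {
  local releases_dir="$1" current_link="$2"
  local current prev="" path name
  current="$(basename "$(readlink "$current_link")")"
  for path in "$releases_dir"/*/; do      # globs expand in sorted order
    name="$(basename "$path")"
    if [ "$name" = "$current" ]; then
      [ -n "$prev" ] || return 1          # nothing older to roll back to
      echo "$releases_dir/$prev"
      return 0
    fi
    prev="$name"
  done
  return 1                                # current release not found
}

# rollback RELEASES_DIR CURRENT_LINK
# Repoints CURRENT_LINK at the previous release. Restarting the service
# afterwards is left to the operator.
rollback() {
  local releases_dir="$1" current_link="$2" target
  target="$(previous_release "$releases_dir" "$current_link")" || {
    echo "no previous release found" >&2
    return 1
  }
  ln -sfn "$target" "$current_link"
  echo "current -> $target"
}
```

With the paths from the example, this would be called as rollback /srv/myapp/releases /srv/myapp/current (with sudo if the directory requires it), followed by sudo systemctl restart myapp.service.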

Resolve

The incident is resolved only when you have verified the fix — re-run the health check, watch error rates drop, and confirm users are happy. Then you close the incident and move to the postmortem.
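
A single passing health check can be luck, so one reasonable way to verify a fix is to require several passes in a row. The helper below is a sketch under that assumption; the name verify_recovery and the example numbers are illustrative, not a standard.

```shell
# verify_recovery REQUIRED MAX_ATTEMPTS INTERVAL_SECONDS COMMAND...
# Succeeds once COMMAND has passed REQUIRED times in a row. A single
# failure resets the streak. Gives up after MAX_ATTEMPTS checks.
verify_recovery() {
  local required="$1" max_attempts="$2" interval="$3"
  shift 3
  local streak=0 attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$(( attempt + 1 ))
    if "$@" >/dev/null 2>&1; then
      streak=$(( streak + 1 ))
      if [ "$streak" -ge "$required" ]; then
        return 0
      fi
    else
      streak=0
    fi
    sleep "$interval"
  done
  return 1
}

# Example (not run here): five consecutive passes, six seconds apart.
#   verify_recovery 5 20 6 curl -fsS https://example.com/health \
#     && echo "verified healthy"
```

The -f flag makes curl exit non-zero on HTTP errors such as the 503 seen earlier, which is what lets the loop treat it as a failed check.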

Roles during an incident

When several people pile onto a problem, chaos grows unless someone is in charge. Clear roles fix this.

Incident Commander (IC): the single person in charge of coordinating — not necessarily the one typing fixes. The IC decides severity, assigns tasks, and keeps everyone aligned. There is always exactly one IC.
Operations / Responder: the engineer(s) actually investigating and applying fixes.
Communications Lead: posts updates to the status page and to stakeholders so the IC and responders can focus.
Scribe: writes a timestamped log of what happened and when, which becomes the backbone of the postmortem.

Gotcha: the Incident Commander should resist the urge to dive into the terminal. The moment the IC starts debugging, nobody is steering — and two people may apply conflicting fixes at once. The IC’s job is the helicopter view.
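
The Scribe's timestamped log does not need special tooling. A minimal sketch, assuming a plain text file per incident (the function name log_event and the file path are hypothetical):

```shell
# log_event LOGFILE MESSAGE...
# Appends one UTC-timestamped line to the incident log.
log_event() {
  local logfile="$1"
  shift
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$logfile"
}

# Example (not run here):
#   log_event /tmp/incident.log "Declared SEV2, IC assigned"
#   log_event /tmp/incident.log "Rolled back to previous release"
```

Using UTC keeps the timeline consistent with the journalctl output shown earlier and avoids confusion when responders sit in different time zones.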

Communication

During an outage, silence is the enemy. Users forgive downtime far more readily than being ignored. Post an update to your status page (such as a hosted page or a simple Statuspage/Instatus board) within minutes, even if it just says “we are investigating.” Inside the team, run a single dedicated channel (one Slack channel, one video call) so information doesn’t scatter. Update on a fixed cadence — every 30 minutes for a SEV1 — even when there is no news.
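
A fixed cadence is easy to forget mid-incident, so a small check can remind the Communications Lead. This sketch assumes times are tracked as Unix epoch seconds; the function name update_due is illustrative.

```shell
# update_due LAST_UPDATE_EPOCH NOW_EPOCH CADENCE_MINUTES
# Exit status 0 means the cadence has elapsed and an update is due;
# exit status 1 means it is not due yet.
update_due() {
  local last="$1" now="$2" cadence_min="$3"
  [ $(( now - last )) -ge $(( cadence_min * 60 )) ]
}

# Example (not run here), using the 30-minute SEV1 cadence:
#   update_due "$last_update" "$(date +%s)" 30 && echo "Post a status update now"
```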

Humane on-call and escalation

On-call means being the designated person who gets paged when something breaks outside business hours. Done badly, it burns people out and they quit. Done humanely, it is sustainable.

Rotate fairly: one week on, several weeks off. Never one hero who is always on-call.
Pay or comp it: on-call is real work; recognise it with pay or time off.
Keep alerts actionable: every page must be something a human needs to act on now. Delete or downgrade noisy alerts ruthlessly — alert fatigue (ignoring pages because there are too many) is dangerous.
Escalation policy: if the primary on-call doesn’t acknowledge a page within, say, 5 minutes, it automatically escalates to a secondary, then to a manager. Tools like PagerDuty, Opsgenie, or Grafana OnCall handle this.
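
To find the alerts worth deleting or downgrading, count how often each one paged. The sketch below assumes a hypothetical export with one line per page in the form "timestamp alert_name"; real paging tools have their own export formats, so adjust the field number to match.

```shell
# noisy_alerts PAGE_LOG [TOP_N]
# PAGE_LOG is a hypothetical file with one line per page:
#   <ISO timestamp> <alert name>
# Prints the TOP_N most frequent alert names with their counts.
noisy_alerts() {
  local page_log="$1" top_n="${2:-5}"
  awk '{ print $2 }' "$page_log" | sort | uniq -c | sort -rn | head -n "$top_n"
}

# Example (not run here):
#   noisy_alerts pages.log 3
```

An alert that tops this list week after week without leading to action is a strong candidate for removal.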

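A simple escalation chain looks like this (an illustrative sketch; each tool has its own configuration format):

# escalation-policy.yaml (illustrative only; check your tool's documentation for its actual format)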
name: payments-oncall
escalation_chain:
  - notify: primary_engineer
    wait: 5m
  - notify: secondary_engineer
    wait: 5m
  - notify: engineering_manager
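
To make the timing concrete, the chain above can be read as a function of how long a page has gone unacknowledged. This is a sketch of the logic only, using the names and 5-minute waits from the example; a paging tool would do this for you.

```shell
# escalation_target MINUTES_UNACKNOWLEDGED
# Prints who is currently being notified under the example chain:
# primary for the first 5 minutes, secondary for the next 5, then manager.
escalation_target() {
  local minutes="$1"
  if [ "$minutes" -lt 5 ]; then
    echo "primary_engineer"
  elif [ "$minutes" -lt 10 ]; then
    echo "secondary_engineer"
  else
    echo "engineering_manager"
  fi
}

escalation_target 0    # prints primary_engineer
escalation_target 7    # prints secondary_engineer
escalation_target 12   # prints engineering_manager
```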

Postmortems

Once the fire is out, you write a postmortem (a blameless write-up of what happened, why, and how to prevent it). “Blameless” means you fix the system and process, never punish the person who pushed the button — because a system that lets one human cause an outage is the real flaw. Every SEV1 and SEV2 should get a postmortem. Postmortems get their own dedicated page in these docs.

Best Practices

Mitigate first, diagnose later. Stop user pain before hunting the root cause.
Declare severity early and generously. Downgrading is easy; a missed outage is expensive.
Appoint one Incident Commander who coordinates and does not get lost in the terminal.
Over-communicate. Post status updates on a fixed cadence, even with no news.
Make on-call sustainable. Rotate fairly, pay for it, and kill noisy alerts.
Automate escalation so an unacknowledged page never goes unnoticed.
Always write a blameless postmortem for serious incidents and track its action items to completion.