Blameless Postmortems

A postmortem is a written document that explains what went wrong during an incident (an unplanned outage or degradation of your service), why it happened, and what you will change so it does not happen the same way again. A blameless postmortem is one written without pointing fingers at any single person. The goal is to learn from the failure, not to punish someone for it. This page shows you how to build a blameless postmortem culture and gives you a ready-to-use template.

Why blameless matters

When people fear being blamed, they hide information. An engineer who accidentally ran the wrong command on a production server will quietly fix it and say nothing if they think they will be fired. That silence is dangerous, because the rest of the team never learns that the command was easy to run by mistake.

Blameless postmortems flip this around. They assume that everyone acted with good intentions and the best information they had at the time. If a person made a mistake, the real question is: why did the system make that mistake so easy to make? Maybe there was no confirmation prompt. Maybe the staging and production terminals looked identical. Those are fixable system problems, not character flaws.

Blameless does not mean free of accountability. The team still owns the fix; what changes is that the system is held accountable rather than an individual. A culture of blame, by contrast, produces fewer incident reports and slower learning.

When to write one: any incident that breached an SLO (Service Level Objective — your reliability target), caused customer-facing downtime, or required emergency human action. When NOT to bother: a five-minute blip caught and auto-healed with zero customer impact does not need a full document, though you may still log it.
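
The trigger rule above can be written down as a small check so that the decision is applied the same way every time. This is a minimal sketch; the `Incident` record and its field names are hypothetical and simply mirror the three criteria listed.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record holding the three facts the rule depends on."""
    breached_slo: bool
    customer_facing_downtime: bool
    required_emergency_action: bool


def needs_postmortem(incident: Incident) -> bool:
    """Return True if any one of the three triggers applies."""
    return (
        incident.breached_slo
        or incident.customer_facing_downtime
        or incident.required_emergency_action
    )


# A self-healed blip with no customer impact: log it, skip the full document.
blip = Incident(False, False, False)
# An outage that breached an SLO and needed a manual fix in production.
outage = Incident(
    breached_slo=True,
    customer_facing_downtime=True,
    required_emergency_action=True,
)

print(needs_postmortem(blip))    # False
print(needs_postmortem(outage))  # True
```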

The structure of a postmortem

A good postmortem has a fixed shape so readers always know where to look. The core sections are below.

Section	What it answers
Summary	One paragraph: what broke, for how long, who was affected
Impact	How many users, how much revenue, which SLOs were breached
Timeline	Minute-by-minute log of what happened and when
Root cause	The deep reason, found by asking “why” repeatedly
Resolution	What action finally restored service
Action items	Concrete, owned, dated follow-up tasks
Lessons learned	What went well, what went badly, where we got lucky
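
For the Impact section, one common way to express an SLO breach is the share of the error budget that the incident consumed. The sketch below assumes an availability SLO measured over a rolling window; the 99.9% target, the 30-day window, and the 37-minute outage are illustrative numbers, not values taken from a real incident.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_burned_fraction(
    outage_minutes: float, slo_target: float, window_days: int
) -> float:
    """Fraction of the window's error budget consumed by one outage."""
    return outage_minutes / error_budget_minutes(slo_target, window_days)


# Illustrative inputs: 99.9% availability over 30 days, 37-minute outage.
budget = error_budget_minutes(0.999, 30)
burned = budget_burned_fraction(37, 0.999, 30)

print(f"{budget:.1f} min budget, {burned:.0%} burned")
# prints: 43.2 min budget, 86% burned
```

A fraction above 1.0 means the incident alone exhausted the budget for the window.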

The timeline is the most important part. Build it from real evidence: chat logs, alert timestamps, and your log files. On Ubuntu, your service and system logs live in predictable places, so pull exact times from there.

sudo journalctl -u nginx --since "2026-06-15 14:00" --until "2026-06-15 15:00"

Output:

Jun 15 14:03:11 web01 nginx[8421]: 2026/06/15 14:03:11 [crit] upstream timed out
Jun 15 14:03:12 web01 systemd[1]: nginx.service: Main process exited, code=killed
Jun 15 14:07:48 web01 systemd[1]: nginx.service: Scheduled restart job, restart counter is at 3

Use the same approach for app logs under /var/log and database logs under /var/log/postgresql/. Real timestamps stop the timeline from becoming guesswork.
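
Once you have the raw lines, a short script can turn them into ordered timeline entries. The following is a sketch under a few assumptions: the lines use the default journalctl short format shown above, the timestamps carry no year (so the caller supplies one), and month names are in English. The names `TimelineEntry` and `build_timeline` are made up for this example.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple

# Matches lines such as:
#   Jun 15 14:03:12 web01 systemd[1]: nginx.service: Main process exited
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<unit>[^\[:]+)(?:\[\d+\])?: (?P<message>.*)$"
)


class TimelineEntry(NamedTuple):
    when: datetime
    host: str
    unit: str
    message: str


def build_timeline(lines: Iterable[str], year: int) -> List[TimelineEntry]:
    """Parse short-format journal lines and return them in time order."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore lines that are not in the expected format
        when = datetime.strptime(
            f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S"
        )
        entries.append(
            TimelineEntry(when, match["host"], match["unit"], match["message"])
        )
    return sorted(entries, key=lambda entry: entry.when)


SAMPLE = [
    "Jun 15 14:07:48 web01 systemd[1]: nginx.service: Scheduled restart job, restart counter is at 3",
    "Jun 15 14:03:11 web01 nginx[8421]: 2026/06/15 14:03:11 [crit] upstream timed out",
    "Jun 15 14:03:12 web01 systemd[1]: nginx.service: Main process exited, code=killed",
]

timeline = build_timeline(SAMPLE, year=2026)
for entry in timeline:
    print(f"- {entry.when:%H:%M:%S} {entry.unit}: {entry.message}")
```

If you would rather not supply the year yourself, `journalctl -o short-iso` prints full ISO timestamps, and `journalctl --utc` reports times in UTC.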

Finding the root cause

The root cause is the underlying reason, not the first symptom you noticed. A common technique is the Five Whys: keep asking “why” until you reach a cause you can actually fix.

The site was down. Why? Nginx kept restarting.
Why? The app behind it was not responding.
Why? The database connection pool was exhausted.
Why? A slow query held connections open.
Why? A new feature shipped without an index on a large table.

The fixable root cause is the missing index plus the lack of a check for un-indexed queries before release. Notice the answer is never “Priya deployed it.” It is always a gap in the system.

Watch for contributing factors, not just one root cause. Real incidents usually have several. A missing index AND a missing alert AND a misleading dashboard can all line up at once.

Action items that actually get done

A postmortem with no follow-through is just a story. Every action item must be specific, owned, and dated, and it must be tracked in the same tool your team uses for normal work (Jira, GitHub Issues, Linear) so it cannot be forgotten.
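
The "specific, owned, and dated" rule is easy to check mechanically before an item is filed. The sketch below is one possible lint, not a feature of any issue tracker: the `ActionItem` type is hypothetical, and the list of vague opening words is a rough heuristic you would tune for your own team.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical heuristic: titles that open with these words tend to be vague.
VAGUE_OPENERS = ("improve", "better", "investigate", "look into")


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None


def problems(item: ActionItem) -> List[str]:
    """Return the reasons an item is not yet specific, owned, and dated."""
    found = []
    if item.title.strip().lower().startswith(VAGUE_OPENERS):
        found.append("vague title")
    if not item.owner:
        found.append("no owner")
    if item.due is None:
        found.append("no due date")
    return found


bad = ActionItem("Improve database performance")
good = ActionItem("Add index on orders.created_at", "sam", date(2026, 6, 20))

print(problems(bad))   # ['vague title', 'no owner', 'no due date']
print(problems(good))  # []
```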

Bad action item	Good action item
"Improve database performance"	"Add index on `orders.created_at`, owner: @sam, due 2026-06-20"
"Better monitoring"	"Alert when connection pool > 80% used, owner: @lee, due 2026-06-22"

Track completion as part of your weekly review. A simple way to list the items that are still open is a label search:

gh issue list --label postmortem --state open

Output (illustrative; the columns are simplified here to ticket, title, owner, and due date):

#412  Add index on orders.created_at        sam   due 2026-06-20
#413  Alert on connection pool saturation    lee   due 2026-06-22

Hold a recurring meeting to walk this list. If items keep slipping, that is a signal your team is over capacity, which is itself worth a discussion.
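
To prepare for that meeting, it helps to separate the items that are merely open from the ones that are already past due. This sketch assumes you have exported the open items, with their due dates, into a list; the `OpenItem` type and the review date are hypothetical.

```python
from datetime import date
from typing import List, NamedTuple


class OpenItem(NamedTuple):
    ticket: str
    title: str
    owner: str
    due: date


def overdue(items: List[OpenItem], today: date) -> List[OpenItem]:
    """Return items whose due date has passed, oldest first."""
    late = [item for item in items if item.due < today]
    return sorted(late, key=lambda item: item.due)


open_items = [
    OpenItem("#413", "Alert on connection pool saturation", "lee", date(2026, 6, 22)),
    OpenItem("#412", "Add index on orders.created_at", "sam", date(2026, 6, 20)),
]

review_day = date(2026, 6, 21)  # hypothetical date of the weekly review
for item in overdue(open_items, today=review_day):
    days_late = (review_day - item.due).days
    print(f"{item.ticket} {item.title} ({item.owner}) is {days_late} day(s) late")
# prints: #412 Add index on orders.created_at (sam) is 1 day(s) late
```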

A template you can copy

Save this as postmortem-template.md in your team’s repository so every incident starts from the same shape.

# Postmortem: <short title>

- Date: <YYYY-MM-DD>
- Authors: <names>
- Status: Draft | Reviewed | Closed
- Severity: SEV1 | SEV2 | SEV3

## Summary
One paragraph. What broke, duration, who was affected.

## Impact
Users affected, revenue lost, SLOs breached, error-budget burned.

## Timeline (all times UTC)
- 14:03 First alert fired
- 14:07 On-call acknowledged
- 14:25 Root cause identified
- 14:40 Service restored

## Root cause
The deep, fixable reason (use Five Whys).

## Resolution
What restored service.

## Action items
| Item | Owner | Due | Ticket |
| --- | --- | --- | --- |

## Lessons learned
- What went well
- What went badly
- Where we got lucky
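
Because every postmortem starts from the same template, a small script can confirm that a draft still contains each section before review. This is a sketch that assumes sections are marked with `## ` headings as in the template above; the function name `missing_sections` is made up for the example.

```python
from typing import List

REQUIRED_SECTIONS = [
    "Summary",
    "Impact",
    "Timeline",
    "Root cause",
    "Resolution",
    "Action items",
    "Lessons learned",
]


def missing_sections(markdown_text: str) -> List[str]:
    """Return the required sections that have no matching '## ' heading."""
    headings = [
        line[3:].strip().lower()
        for line in markdown_text.splitlines()
        if line.startswith("## ")
    ]
    # Prefix match so "Timeline (all times UTC)" counts as "Timeline".
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(h.startswith(section.lower()) for h in headings)
    ]


draft = """# Postmortem: checkout outage

## Summary
Checkout returned errors for 37 minutes.

## Impact
To be filled in.

## Timeline (all times UTC)
- 14:03 First alert fired
"""

print(missing_sections(draft))
# ['Root cause', 'Resolution', 'Action items', 'Lessons learned']
```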

Best Practices

Write the postmortem within a few days, while memory is fresh, then circulate it widely so the whole org learns.
Keep the language neutral: say “the deploy introduced a slow query,” never “Priya broke the database.”
Pull timeline entries from real logs (journalctl, /var/log) instead of from memory.
Make every action item specific, owned, and dated, and file it in your normal issue tracker.
Review open postmortem action items weekly so fixes do not rot.
Always capture a “where we got lucky” note — luck is a risk you have not paid for yet.
Store all postmortems in one searchable place so patterns across incidents become visible over time.