Post-Mortems

Post-mortems help your team learn from incidents. By documenting what happened, why it happened, and how to prevent it in the future, you turn every outage into an opportunity for improvement.

What is a Post-Mortem?

A post-mortem is a structured document created after an incident is resolved. It captures:

  • What happened — Timeline of the incident
  • Why it happened — Root cause analysis
  • Who was impacted — Scope and severity
  • How we fixed it — Resolution steps
  • How we prevent recurrence — Action items

Post-mortems are sometimes called "retrospectives," "incident reviews," or "after-action reports." Whatever you call them, the goal is the same: learn and improve.

Why Write Post-Mortems?

Build institutional knowledge — Document what went wrong so the team doesn't repeat mistakes.

Identify systemic issues — Individual incidents often reveal broader problems in systems or processes.

Improve response times — Reviewing what worked (and what didn't) makes future responses faster.

Create accountability — Action items ensure improvements actually happen.

Blameless culture — Focus on systems, not individuals, to encourage honest reporting.

Creating a Post-Mortem

1. Navigate to the incident

Go to Incidents and open the resolved incident you want to document.

2. Click Create Post-Mortem

Find the button on the incident details page. This creates a post-mortem linked to that incident.

3. Fill in the title

Give your post-mortem a clear, descriptive title:

  • "Database outage on March 15, 2024"
  • "API latency spike affecting checkout"
  • "DNS propagation failure"

4. Complete each section

Work through the post-mortem template, filling in details for each section.

5. Add action items

Define specific follow-up tasks to prevent recurrence.

6. Review and publish

Have the team review the post-mortem, then publish it for broader visibility.

Post-Mortem Sections

Summary

A brief overview of the incident for readers who need the highlights.

Include:

  • What service/system was affected
  • Duration of the incident
  • Number of users impacted
  • Severity level

Example:

On March 15, 2024, our primary API experienced a complete outage lasting 47 minutes. Approximately 12,000 users were unable to access the platform. The incident was classified as Severity 1 (Critical).

Root Cause

The underlying reason why the incident happened. Dig deep — the first answer is rarely the root cause.

Ask "why" repeatedly:

  • Why did the server go down? → Memory exhaustion
  • Why did memory run out? → Memory leak in background job
  • Why wasn't the leak caught? → No memory monitoring alerts
  • Root cause: Insufficient monitoring for memory usage patterns

The "5 Whys" technique helps find root causes. Keep asking "why" until you reach a systemic issue that can be addressed.

Good root cause examples:

  • "Deployment script didn't validate database schema compatibility"
  • "Auto-scaling policy had a 10-minute delay that couldn't handle traffic spike"
  • "Configuration change wasn't tested in staging environment"

Poor root cause examples:

  • "Server crashed" (symptom, not cause)
  • "Human error" (too vague, not actionable)
  • "Bad luck" (not a cause)
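The "why" chain above can be captured as a simple data structure, which makes it easy to record in a post-mortem template or script. This is an illustrative sketch only, not part of any PixoMonitor API:

```python
# Illustrative: recording a "5 Whys" chain as (question, answer) pairs.
# The last answer in the chain is the deepest (systemic) cause found.

whys = [
    ("Why did the server go down?", "Memory exhaustion"),
    ("Why did memory run out?", "Memory leak in background job"),
    ("Why wasn't the leak caught?", "No memory monitoring alerts"),
]

def root_cause(chain):
    """Return the final answer, i.e. the systemic issue to address."""
    return chain[-1][1]

for question, answer in whys:
    print(f"{question} -> {answer}")
print("Root cause:", root_cause(whys))
```

Keeping the full chain (not just the final answer) preserves the reasoning for readers who join the team later.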

Impact

Describe who and what was affected by the incident.

Quantify when possible:

  • Number of affected users
  • Revenue impact
  • Failed requests/transactions
  • Duration of degraded service

Example:

  • 12,000 users couldn't log in
  • 340 orders failed to process ($15,200 estimated revenue impact)
  • API error rate peaked at 78%
  • Customer support received 89 tickets
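Quantified impact is usually back-of-envelope arithmetic over incident data. A minimal sketch using the example figures above (the request volume and formulas are illustrative assumptions, not PixoMonitor output):

```python
# Back-of-envelope impact metrics from the example figures above.
failed_orders = 340
est_revenue_impact = 15_200.0   # USD, estimated
outage_minutes = 47             # from the incident summary

avg_order_value = est_revenue_impact / failed_orders
revenue_per_minute = est_revenue_impact / outage_minutes

print(f"${avg_order_value:.2f} per failed order, "
      f"${revenue_per_minute:.2f} lost per outage minute")
```

Even rough per-minute figures like these help justify action items when prioritizing follow-up work.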

Timeline

A chronological record of the incident from detection to resolution.

Include:

  • When the incident was detected
  • How it was detected (monitoring, customer report, etc.)
  • Key milestones in the response
  • When service was restored
  • When the incident was officially closed

Example timeline:

  Time (UTC) | Event
  14:32      | Monitoring alert: API response time > 5s
  14:35      | On-call engineer acknowledges alert
  14:38      | Investigation begins, database connections maxed
  14:45      | Decision to restart database connection pool
  14:48      | Connection pool restarted, service recovering
  14:52      | All monitors green, incident resolved
  15:30      | Post-mortem meeting scheduled

PixoMonitor automatically captures some timeline events from the incident, which you can incorporate into your post-mortem.
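A timeline like the one above also yields useful response metrics, such as time-to-acknowledge and time-to-resolve. A hedged sketch (timestamps are the UTC "HH:MM" strings from the example table; nothing here is a PixoMonitor API):

```python
from datetime import datetime

# Illustrative: deriving response metrics from the example timeline above.
timeline = {
    "detected":     "14:32",  # monitoring alert fired
    "acknowledged": "14:35",  # on-call engineer acknowledged
    "resolved":     "14:52",  # all monitors green
}

def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

time_to_ack = minutes_between(timeline["detected"], timeline["acknowledged"])
time_to_resolve = minutes_between(timeline["detected"], timeline["resolved"])
print(time_to_ack, time_to_resolve)  # 3 and 20 minutes
```

Tracking these numbers across post-mortems shows whether response times are actually improving.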

Action Items

Specific, assignable tasks that will prevent this incident from recurring.

Good action items are:

  • Specific — Clearly defined scope
  • Assignable — Someone is responsible
  • Time-bound — Has a deadline
  • Measurable — You can tell when it's done

Example action items:

  Action                                          | Owner  | Due Date | Status
  Add memory usage alerts at 80% threshold        | Alice  | March 22 | Pending
  Implement database connection pool monitoring   | Bob    | March 25 | Pending
  Update runbook with memory leak troubleshooting | Carol  | March 20 | Complete
  Load test new deployment before production      | DevOps | Ongoing  | In Progress

Lessons Learned

Insights gained from the incident that may apply beyond the immediate action items.

Categories:

  • What went well — Effective parts of the response
  • What went poorly — Areas needing improvement
  • Where we got lucky — Factors that could have made it worse

Example:

What went well:

  • On-call engineer responded within 3 minutes
  • Rollback procedure worked as documented
  • Customer communication was clear and timely

What went poorly:

  • Initial alert was too generic to pinpoint the issue
  • Had to wake up a second engineer who knew the system
  • No runbook existed for this failure mode

Where we got lucky:

  • Happened during business hours, not 3 AM
  • Database didn't corrupt any data
  • Backup engineer happened to be awake

Publishing Post-Mortems

Draft vs. Published

Post-mortems start as drafts so you can:

  • Collaborate with team members
  • Gather accurate information
  • Review before sharing widely

Once ready, publish the post-mortem to:

  • Make it visible to the broader team
  • Include it in reporting and metrics
  • Optionally share on your status page

Sharing Externally

You can optionally publish post-mortems to your public status page:

  • Toggle Public visibility in post-mortem settings
  • The post-mortem appears on your status page linked to the incident
  • Demonstrates transparency to customers

Review post-mortems carefully before making them public. Remove any sensitive information, internal system details, or customer data.

Post-Mortem Best Practices

Timeliness

Write post-mortems within 48-72 hours of incident resolution while details are fresh. Waiting too long leads to forgotten details and less accurate documentation.

Blameless Culture

Focus on systems, not individuals. Phrases to use:

  • "The deployment process didn't catch..."
  • "The monitoring gap allowed..."
  • "The runbook was missing..."

Phrases to avoid:

  • "John broke the database"
  • "The team should have known better"
  • "Someone forgot to..."

In a blameless post-mortem, people feel safe admitting mistakes, which leads to more honest analysis and better improvements.

Involve the Right People

Include everyone involved in the incident:

  • Engineers who responded
  • Subject matter experts
  • Team leads or managers
  • Anyone who can add context

Follow Up on Action Items

Post-mortems are only valuable if action items get completed. Regular practices:

  • Review open action items in team meetings
  • Set calendar reminders for due dates
  • Close the loop when items are complete
  • Track completion rate as a metric
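The completion-rate metric mentioned above is straightforward to compute from a list of action items. An illustrative sketch (the item list mirrors the earlier example table; this is not a PixoMonitor API):

```python
# Illustrative: computing action-item completion rate as a metric.
action_items = [
    {"action": "Add memory usage alerts at 80% threshold",        "status": "Pending"},
    {"action": "Implement database connection pool monitoring",   "status": "Pending"},
    {"action": "Update runbook with memory leak troubleshooting", "status": "Complete"},
    {"action": "Load test new deployment before production",      "status": "In Progress"},
]

def completion_rate(items):
    """Fraction of action items whose status is Complete."""
    done = sum(1 for item in items if item["status"] == "Complete")
    return done / len(items)

print(f"{completion_rate(action_items):.0%}")  # 25%
```

Reviewing this number in team meetings makes it obvious when post-mortem follow-up is stalling.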

Archive and Reference

Build a searchable library of post-mortems:

  • Similar incidents may have similar solutions
  • New team members can learn from past incidents
  • Identify patterns across multiple incidents

Viewing Post-Mortems

Post-Mortem List

Go to Incidents → Post-Mortems to see all post-mortems:

  • Filter by date range
  • Filter by severity
  • Search by title or content
  • Sort by newest or most impactful

Linking to Incidents

Each post-mortem links to its associated incident:

  • Click through to see original incident details
  • View monitor data from the incident time period
  • See the full incident timeline

Common Post-Mortem Mistakes

Too brief

A post-mortem that says "server crashed, we restarted it" provides no learning value. Include enough detail that someone unfamiliar with the incident can understand what happened.

No action items

A post-mortem without action items is just documentation. The goal is preventing recurrence, which requires specific follow-up tasks.

Too many action items

Don't create 20 action items you'll never complete. Focus on the 3-5 highest-impact improvements.

Blame-focused

If your post-mortem reads like an HR report about who messed up, you're doing it wrong. Focus on systems and processes.

Never reviewed

Writing post-mortems that nobody reads is wasted effort. Share them in team meetings, include them in onboarding, reference them when similar issues occur.

Next Steps