# Post-Mortems
Post-mortems help your team learn from incidents. By documenting what happened, why it happened, and how to prevent it in the future, you turn every outage into an opportunity for improvement.
## What is a Post-Mortem?
A post-mortem is a structured document created after an incident is resolved. It captures:
- **What happened** — Timeline of the incident
- **Why it happened** — Root cause analysis
- **Who was impacted** — Scope and severity
- **How we fixed it** — Resolution steps
- **How we prevent recurrence** — Action items
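The sections above can be sketched as a simple record type. This is only an illustration of the document's structure; the class and field names are hypothetical, not a PixoMonitor API:

```python
from dataclasses import dataclass, field

@dataclass
class PostMortem:
    # One record per resolved incident; fields mirror the sections above.
    title: str
    timeline: list[str] = field(default_factory=list)      # what happened
    root_cause: str = ""                                   # why it happened
    impact: str = ""                                       # who was impacted
    resolution: str = ""                                   # how we fixed it
    action_items: list[str] = field(default_factory=list)  # preventing recurrence

pm = PostMortem(title="Database outage on March 15, 2024")
pm.action_items.append("Add memory usage alerts at 80% threshold")
```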
Post-mortems are sometimes called "retrospectives," "incident reviews," or "after-action reports." Whatever you call them, the goal is the same: learn and improve.
## Why Write Post-Mortems?
- **Build institutional knowledge** — Document what went wrong so the team doesn't repeat mistakes.
- **Identify systemic issues** — Individual incidents often reveal broader problems in systems or processes.
- **Improve response times** — Reviewing what worked (and what didn't) makes future responses faster.
- **Create accountability** — Action items ensure improvements actually happen.
- **Blameless culture** — Focus on systems, not individuals, to encourage honest reporting.
## Creating a Post-Mortem
1. **Navigate to the incident.** Go to **Incidents** and open the resolved incident you want to document.
2. **Click Create Post-Mortem.** Find the **Create Post-Mortem** button on the incident details page. This creates a post-mortem linked to that incident.
3. **Fill in the title.** Give your post-mortem a clear, descriptive title:
   - "Database outage on March 15, 2024"
   - "API latency spike affecting checkout"
   - "DNS propagation failure"
4. **Complete each section.** Work through the post-mortem template, filling in details for each section.
5. **Add action items.** Define specific follow-up tasks to prevent recurrence.
6. **Review and publish.** Have the team review the post-mortem, then publish it for broader visibility.
## Post-Mortem Sections
### Summary
A brief overview of the incident for readers who need the highlights.
Include:
- What service/system was affected
- Duration of the incident
- Number of users impacted
- Severity level
Example:
> On March 15, 2024, our primary API experienced a complete outage lasting 47 minutes. Approximately 12,000 users were unable to access the platform. The incident was classified as Severity 1 (Critical).
### Root Cause
The underlying reason why the incident happened. Dig deep — the first answer is rarely the root cause.
Ask "why" repeatedly:
- Why did the server go down? → Memory exhaustion
- Why did memory run out? → Memory leak in background job
- Why wasn't the leak caught? → No memory monitoring alerts
- Root cause: Insufficient monitoring for memory usage patterns
The "5 Whys" technique helps find root causes. Keep asking "why" until you reach a systemic issue that can be addressed.
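The "5 Whys" chain above can be expressed as a small helper that walks from symptom to candidate root cause. A minimal sketch; the function name and the chain itself are taken from the example, not from any tool:

```python
def five_whys(symptom: str, answers: list[str]) -> str:
    # Print each "why" step and return the last answer as the candidate root cause.
    effect = symptom
    for cause in answers:
        print(f"Why: {effect} -> {cause}")
        effect = cause
    return effect

root = five_whys(
    "Server went down",
    ["Memory exhaustion",
     "Memory leak in background job",
     "No memory monitoring alerts"],
)
```

The returned value is only a candidate: stop asking "why" when you reach something systemic and actionable, not merely when the list runs out.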
Good root cause examples:
- "Deployment script didn't validate database schema compatibility"
- "Auto-scaling policy had a 10-minute delay that couldn't handle traffic spike"
- "Configuration change wasn't tested in staging environment"
Poor root cause examples:
- "Server crashed" (symptom, not cause)
- "Human error" (too vague, not actionable)
- "Bad luck" (not a cause)
### Impact
Describe who and what was affected by the incident.
Quantify when possible:
- Number of affected users
- Revenue impact
- Failed requests/transactions
- Duration of degraded service
Example:
- 12,000 users couldn't log in
- 340 orders failed to process ($15,200 estimated revenue impact)
- API error rate peaked at 78%
- Customer support received 89 tickets
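Quantifying impact is usually simple arithmetic over incident-window data. A sketch with illustrative numbers chosen to roughly match the example above (the request counts and average order value are hypothetical):

```python
# Hypothetical incident-window numbers.
total_requests = 45_000
failed_requests = 35_100
failed_orders = 340
avg_order_value = 44.70  # assumed average, for illustration

error_rate = failed_requests / total_requests     # peak error rate
revenue_impact = failed_orders * avg_order_value  # estimated revenue loss

print(f"Peak error rate: {error_rate:.0%}")
print(f"Estimated revenue impact: ${revenue_impact:,.2f}")
```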
### Timeline
A chronological record of the incident from detection to resolution.
Include:
- When the incident was detected
- How it was detected (monitoring, customer report, etc.)
- Key milestones in the response
- When service was restored
- When the incident was officially closed
Example timeline:
| Time (UTC) | Event |
|---|---|
| 14:32 | Monitoring alert: API response time > 5s |
| 14:35 | On-call engineer acknowledges alert |
| 14:38 | Investigation begins, database connections maxed |
| 14:45 | Decision to restart database connection pool |
| 14:48 | Connection pool restarted, service recovering |
| 14:52 | All monitors green, incident resolved |
| 15:30 | Post-mortem meeting scheduled |
PixoMonitor automatically captures some timeline events from the incident, which you can incorporate into your post-mortem.
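Given timeline entries like the table above, key response metrics such as time to acknowledge and time to resolve fall out directly. A minimal sketch using a few of the example events (same-day times assumed; a real implementation would use full timestamps):

```python
from datetime import datetime

# A few events from the example timeline (UTC, same day).
events = [
    ("14:32", "Monitoring alert: API response time > 5s"),
    ("14:35", "On-call engineer acknowledges alert"),
    ("14:52", "All monitors green, incident resolved"),
]

def minutes_between(start: str, end: str) -> int:
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

time_to_acknowledge = minutes_between(events[0][0], events[1][0])
time_to_resolve = minutes_between(events[0][0], events[-1][0])
print(f"Acknowledged in {time_to_acknowledge} min, resolved in {time_to_resolve} min")
```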
### Action Items
Specific, assignable tasks that will prevent this incident from recurring.
Good action items are:
- **Specific** — Clearly defined scope
- **Assignable** — Someone is responsible
- **Time-bound** — Has a deadline
- **Measurable** — You can tell when it's done
Example action items:
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Add memory usage alerts at 80% threshold | Alice | March 22 | Pending |
| Implement database connection pool monitoring | Bob | March 25 | Pending |
| Update runbook with memory leak troubleshooting | Carol | March 20 | Complete |
| Load test new deployment before production | DevOps | Ongoing | In Progress |
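The criteria above can be checked mechanically. A sketch using a hypothetical `ActionItem` record (this is not a PixoMonitor API); it flags items that lack a deadline, like the ongoing load-testing task in the example table:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    action: str
    owner: str
    due: Optional[date]  # None for open-ended/ongoing items

def is_well_formed(item: ActionItem) -> bool:
    # Specific (non-empty action), assignable (has an owner), time-bound (has a due date).
    return bool(item.action.strip()) and bool(item.owner.strip()) and item.due is not None

items = [
    ActionItem("Add memory usage alerts at 80% threshold", "Alice", date(2024, 3, 22)),
    ActionItem("Load test new deployment before production", "DevOps", None),
]
flagged = [i.action for i in items if not is_well_formed(i)]  # items missing a deadline
```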
### Lessons Learned
Insights gained from the incident that may apply beyond the immediate action items.
Categories:
- **What went well** — Effective parts of the response
- **What went poorly** — Areas needing improvement
- **Where we got lucky** — Factors that could have made it worse
Example:
**What went well:**

- On-call engineer responded within 3 minutes
- Rollback procedure worked as documented
- Customer communication was clear and timely

**What went poorly:**

- Initial alert was too generic to pinpoint the issue
- Had to wake up a second engineer who knew the system
- No runbook existed for this failure mode

**Where we got lucky:**

- Happened during business hours, not 3 AM
- Database didn't corrupt any data
- Backup engineer happened to be awake
## Publishing Post-Mortems
### Draft vs. Published
Post-mortems start as drafts so you can:
- Collaborate with team members
- Gather accurate information
- Review before sharing widely
Once ready, publish the post-mortem to:
- Make it visible to the broader team
- Include it in reporting and metrics
- Optionally share on your status page
### Sharing Externally
You can optionally publish post-mortems to your public status page:
- Toggle Public visibility in post-mortem settings
- The post-mortem appears on your status page linked to the incident
- Demonstrates transparency to customers
Review post-mortems carefully before making them public. Remove any sensitive information, internal system details, or customer data.
## Post-Mortem Best Practices
### Timeliness
Write post-mortems within 48-72 hours of incident resolution while details are fresh. Waiting too long leads to forgotten details and less accurate documentation.
### Blameless Culture
Focus on systems, not individuals. Phrases to use:
- "The deployment process didn't catch..."
- "The monitoring gap allowed..."
- "The runbook was missing..."
Phrases to avoid:
- "John broke the database"
- "The team should have known better"
- "Someone forgot to..."
In a blameless post-mortem, people feel safe admitting mistakes, which leads to more honest analysis and better improvements.
### Involve the Right People
Include everyone involved in the incident:
- Engineers who responded
- Subject matter experts
- Team leads or managers
- Anyone who can add context
### Follow Up on Action Items
Post-mortems are only valuable if action items get completed. Regular practices:
- Review open action items in team meetings
- Set calendar reminders for due dates
- Close the loop when items are complete
- Track completion rate as a metric
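Tracking completion rate as a metric is a one-line calculation over action-item statuses, here taken from the example table earlier in this page:

```python
# Statuses from the example action-item table.
statuses = ["Pending", "Pending", "Complete", "In Progress"]

completed = sum(1 for s in statuses if s == "Complete")
completion_rate = completed / len(statuses)
print(f"Action item completion rate: {completion_rate:.0%}")
```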
### Archive and Reference
Build a searchable library of post-mortems:
- Similar incidents may have similar solutions
- New team members can learn from past incidents
- Identify patterns across multiple incidents
## Viewing Post-Mortems
### Post-Mortem List
Go to Incidents → Post-Mortems to see all post-mortems:
- Filter by date range
- Filter by severity
- Search by title or content
- Sort by newest or most impactful
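A searchable library boils down to filtering records by severity and text. A minimal in-memory sketch with hypothetical records and field names; PixoMonitor's actual filters live in the UI, so this only illustrates the idea:

```python
# Hypothetical post-mortem records; fields are illustrative.
postmortems = [
    {"title": "Database outage on March 15, 2024", "severity": 1,
     "body": "Memory leak in background job exhausted RAM."},
    {"title": "DNS propagation failure", "severity": 3,
     "body": "Stale TTL delayed record updates."},
]

def search(records, text="", max_severity=None):
    # Filter by a severity threshold and a case-insensitive match on title or body.
    needle = text.lower()
    hits = []
    for r in records:
        if max_severity is not None and r["severity"] > max_severity:
            continue
        if needle and needle not in (r["title"] + " " + r["body"]).lower():
            continue
        hits.append(r["title"])
    return hits
```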
### Linking to Incidents
Each post-mortem links to its associated incident:
- Click through to see original incident details
- View monitor data from the incident time period
- See the full incident timeline
## Common Post-Mortem Mistakes

**Too brief.** A post-mortem that says "server crashed, we restarted it" provides no learning value. Include enough detail that someone unfamiliar with the incident can understand what happened.

**No action items.** A post-mortem without action items is just documentation. The goal is preventing recurrence, which requires specific follow-up tasks.

**Too many action items.** Don't create 20 action items you'll never complete. Focus on the three to five highest-impact improvements.

**Blame-focused.** If your post-mortem reads like an HR report about who messed up, you're doing it wrong. Focus on systems and processes.

**Never reviewed.** Writing post-mortems that nobody reads is wasted effort. Share them in team meetings, include them in onboarding, and reference them when similar issues occur.
