Post-Mortems

Post-mortems help your team learn from incidents. By documenting what happened, why it happened, and how to prevent it in the future, you turn every outage into an opportunity for improvement.

What is a Post-Mortem?

A post-mortem is a structured document created after an incident is resolved. It captures:

  • What happened — Timeline of the incident
  • Why it happened — Root cause analysis
  • Who was impacted — Scope and severity
  • How we fixed it — Resolution steps
  • How we prevent recurrence — Action items

Post-mortems are sometimes called "retrospectives," "incident reviews," or "after-action reports." Whatever you call them, the goal is the same: learn and improve.

Why Write Post-Mortems?

Build institutional knowledge — Document what went wrong so the team doesn't repeat mistakes.

Identify systemic issues — Individual incidents often reveal broader problems in systems or processes.

Improve response times — Reviewing what worked (and what didn't) makes future responses faster.

Create accountability — Action items ensure improvements actually happen.

Blameless culture — Focus on systems, not individuals, to encourage honest reporting.

Creating a Post-Mortem

1. Navigate to the incident

Go to Incidents and open the resolved incident you want to document.

2. Click Create Post-Mortem

Find the button on the incident details page. This creates a post-mortem linked to that incident.

3. Fill in the title

Give your post-mortem a clear, descriptive title:

  • "Database outage on March 15, 2024"
  • "API latency spike affecting checkout"
  • "DNS propagation failure"

4. Complete each section

Work through the post-mortem template, filling in details for each section.

5. Add action items

Define specific follow-up tasks to prevent recurrence.

6. Review and publish

Have the team review the post-mortem, then publish it for broader visibility.

Post-Mortem Sections

Summary

A brief overview of the incident for readers who need the highlights.

Include:

  • What service/system was affected
  • Duration of the incident
  • Number of users impacted
  • Severity level

Example:

On March 15, 2024, our primary API experienced a complete outage lasting 47 minutes. Approximately 12,000 users were unable to access the platform. The incident was classified as Severity 1 (Critical).

Root Cause

The underlying reason why the incident happened. Dig deep — the first answer is rarely the root cause.

Ask "why" repeatedly:

  • Why did the server go down? → Memory exhaustion
  • Why did memory run out? → Memory leak in background job
  • Why wasn't the leak caught? → No memory monitoring alerts
  • Root cause: Insufficient monitoring for memory usage patterns

The "5 Whys" technique helps find root causes. Keep asking "why" until you reach a systemic issue that can be addressed.

Good root cause examples:

  • "Deployment script didn't validate database schema compatibility"
  • "Auto-scaling policy had a 10-minute delay that couldn't handle traffic spike"
  • "Configuration change wasn't tested in staging environment"

Poor root cause examples:

  • "Server crashed" (symptom, not cause)
  • "Human error" (too vague, not actionable)
  • "Bad luck" (not a cause)
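The "why" chain above can be captured as a simple data structure, which makes it easy to record in a post-mortem template or script. This is an illustrative sketch only, not part of any PixoMonitor API:

```python
# Illustrative: recording a "5 Whys" chain as (question, answer) pairs.
# The last answer in the chain is the deepest (systemic) cause found.

whys = [
    ("Why did the server go down?", "Memory exhaustion"),
    ("Why did memory run out?", "Memory leak in background job"),
    ("Why wasn't the leak caught?", "No memory monitoring alerts"),
]

def root_cause(chain):
    """Return the final answer, i.e. the systemic issue to address."""
    return chain[-1][1]

for question, answer in whys:
    print(f"{question} -> {answer}")
print("Root cause:", root_cause(whys))
```

Keeping the full chain (not just the final answer) preserves the reasoning for readers who join the team later.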

Impact

Describe who and what was affected by the incident.

Quantify when possible:

  • Number of affected users
  • Revenue impact
  • Failed requests/transactions
  • Duration of degraded service

Example:

  • 12,000 users couldn't log in
  • 340 orders failed to process ($15,200 estimated revenue impact)
  • API error rate peaked at 78%
  • Customer support received 89 tickets
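Quantified impact is usually back-of-envelope arithmetic over incident data. A minimal sketch using the example figures above (the request volume and formulas are illustrative assumptions, not PixoMonitor output):

```python
# Back-of-envelope impact metrics from the example figures above.
failed_orders = 340
est_revenue_impact = 15_200.0   # USD, estimated
outage_minutes = 47             # from the incident summary

avg_order_value = est_revenue_impact / failed_orders
revenue_per_minute = est_revenue_impact / outage_minutes

print(f"${avg_order_value:.2f} per failed order, "
      f"${revenue_per_minute:.2f} lost per outage minute")
```

Even rough per-minute figures like these help justify action items when prioritizing follow-up work.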

Timeline

A chronological record of the incident from detection to resolution.

Include:

  • When the incident was detected
  • How it was detected (monitoring, customer report, etc.)
  • Key milestones in the response
  • When service was restored
  • When the incident was officially closed

Example timeline:

  Time (UTC) | Event
  14:32      | Monitoring alert: API response time > 5s
  14:35      | On-call engineer acknowledges alert
  14:38      | Investigation begins, database connections maxed
  14:45      | Decision to restart database connection pool
  14:48      | Connection pool restarted, service recovering
  14:52      | All monitors green, incident resolved
  15:30      | Post-mortem meeting scheduled

PixoMonitor automatically captures some timeline events from the incident, which you can incorporate into your post-mortem.
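A timeline like the one above also yields useful response metrics, such as time-to-acknowledge and time-to-resolve. A hedged sketch (timestamps are the UTC "HH:MM" strings from the example table; nothing here is a PixoMonitor API):

```python
from datetime import datetime

# Illustrative: deriving response metrics from the example timeline above.
timeline = {
    "detected":     "14:32",  # monitoring alert fired
    "acknowledged": "14:35",  # on-call engineer acknowledged
    "resolved":     "14:52",  # all monitors green
}

def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

time_to_ack = minutes_between(timeline["detected"], timeline["acknowledged"])
time_to_resolve = minutes_between(timeline["detected"], timeline["resolved"])
print(time_to_ack, time_to_resolve)  # 3 and 20 minutes
```

Tracking these numbers across post-mortems shows whether response times are actually improving.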

Action Items

Specific, assignable tasks that will prevent this incident from recurring.

Good action items are:

  • Specific — Clearly defined scope
  • Assignable — Someone is responsible
  • Time-bound — Has a deadline
  • Measurable — You can tell when it's done

Example action items:

  Action                                          | Owner  | Due Date | Status
  Add memory usage alerts at 80% threshold        | Alice  | March 22 | Pending
  Implement database connection pool monitoring   | Bob    | March 25 | Pending
  Update runbook with memory leak troubleshooting | Carol  | March 20 | Complete
  Load test new deployment before production      | DevOps | Ongoing  | In Progress

Lessons Learned

Insights gained from the incident that may apply beyond the immediate action items.

Categories:

  • What went well — Effective parts of the response
  • What went poorly — Areas needing improvement
  • Where we got lucky — Factors that could have made it worse

Example:

What went well:

  • On-call engineer responded within 3 minutes
  • Rollback procedure worked as documented
  • Customer communication was clear and timely

What went poorly:

  • Initial alert was too generic to pinpoint the issue
  • Had to wake up a second engineer who knew the system
  • No runbook existed for this failure mode

Where we got lucky:

  • Happened during business hours, not 3 AM
  • Database didn't corrupt any data
  • Backup engineer happened to be awake

Publishing Post-Mortems

Draft vs. Published

Post-mortems start as drafts so you can:

  • Collaborate with team members
  • Gather accurate information
  • Review before sharing widely

Once ready, publish the post-mortem to:

  • Make it visible to the broader team
  • Include it in reporting and metrics
  • Optionally share on your status page

Sharing Externally

You can optionally publish post-mortems to your public status page:

  • Toggle Public visibility in post-mortem settings
  • The post-mortem appears on your status page linked to the incident
  • Demonstrates transparency to customers

Review post-mortems carefully before making them public. Remove any sensitive information, internal system details, or customer data.

Post-Mortem Best Practices

Timeliness

Write post-mortems within 48-72 hours of incident resolution while details are fresh. Waiting too long leads to forgotten details and less accurate documentation.

Blameless Culture

Focus on systems, not individuals. Phrases to use:

  • "The deployment process didn't catch..."
  • "The monitoring gap allowed..."
  • "The runbook was missing..."

Phrases to avoid:

  • "John broke the database"
  • "The team should have known better"
  • "Someone forgot to..."

In a blameless post-mortem, people feel safe admitting mistakes, which leads to more honest analysis and better improvements.

Involve the Right People

Include everyone involved in the incident:

  • Engineers who responded
  • Subject matter experts
  • Team leads or managers
  • Anyone who can add context

Follow Up on Action Items

Post-mortems are only valuable if action items get completed. Regular practices:

  • Review open action items in team meetings
  • Set calendar reminders for due dates
  • Close the loop when items are complete
  • Track completion rate as a metric
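The completion-rate metric mentioned above is straightforward to compute from a list of action items. An illustrative sketch (the item list mirrors the earlier example table; this is not a PixoMonitor API):

```python
# Illustrative: computing action-item completion rate as a metric.
action_items = [
    {"action": "Add memory usage alerts at 80% threshold",        "status": "Pending"},
    {"action": "Implement database connection pool monitoring",   "status": "Pending"},
    {"action": "Update runbook with memory leak troubleshooting", "status": "Complete"},
    {"action": "Load test new deployment before production",      "status": "In Progress"},
]

def completion_rate(items):
    """Fraction of action items whose status is Complete."""
    done = sum(1 for item in items if item["status"] == "Complete")
    return done / len(items)

print(f"{completion_rate(action_items):.0%}")  # 25%
```

Reviewing this number in team meetings makes it obvious when post-mortem follow-up is stalling.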

Archive and Reference

Build a searchable library of post-mortems:

  • Similar incidents may have similar solutions
  • New team members can learn from past incidents
  • Identify patterns across multiple incidents

Viewing Post-Mortems

Post-Mortem List

Go to Incidents → Post-Mortems to see all post-mortems:

  • Filter by date range
  • Filter by severity
  • Search by title or content
  • Sort by newest or most impactful

Linking to Incidents

Each post-mortem links to its associated incident:

  • Click through to see original incident details
  • View monitor data from the incident time period
  • See the full incident timeline

Common Post-Mortem Mistakes

Too brief

A post-mortem that says "server crashed, we restarted it" provides no learning value. Include enough detail that someone unfamiliar with the incident can understand what happened.

No action items

A post-mortem without action items is just documentation. The goal is preventing recurrence, which requires specific follow-up tasks.

Too many action items

Don't create 20 action items you'll never complete. Focus on the 3-5 highest-impact improvements.

Blame-focused

If your post-mortem reads like an HR report about who messed up, you're doing it wrong. Focus on systems and processes.

Never reviewed

Writing post-mortems that nobody reads is wasted effort. Share them in team meetings, include them in onboarding, reference them when similar issues occur.

Next Steps