Post-Mortem Best Practices

How to run effective post-mortems that improve system reliability without creating a culture of blame.

Blameless Culture

A blameless post-mortem focuses on systems and processes, not individuals. The goal is to understand what happened and why, then fix the systems that allowed it. People who feel safe reporting mistakes produce more honest post-mortems, which lead to better fixes.

Assume good intent

Every person involved was doing what they thought was right given the information they had at the time. The post-mortem should ask "what information was missing or misleading?" not "who made a mistake?"

Focus on contributing factors

Complex systems fail for multiple reasons. A deploy that caused an outage is the trigger, but missing rollback procedures, inadequate monitoring, and insufficient testing are the contributing factors worth addressing.

Separate the post-mortem from performance reviews

If post-mortem findings feed directly into performance reviews, people will be less honest. Keep the two processes separate. Post-mortems are for organizational learning, not individual evaluation.

The 5 Whys Technique

The 5 Whys is a root cause analysis technique that works by repeatedly asking "why" to peel back layers of symptoms and reach the underlying cause. It is simple to use and effective for most incidents.

Example

Problem: Users could not log in for 23 minutes.
Why 1: The authentication service returned 503 errors.
Why 2: The auth service could not connect to the user database.
Why 3: The database connection pool was exhausted.
Why 4: A slow query from the reporting service was holding connections open for minutes instead of milliseconds.
Why 5: The reporting service had no query timeout configured, and a recent schema change caused a full table scan.
Root causes identified:
  • Missing query timeout on reporting service database connections
  • Schema migration did not include query performance validation
  • Auth service had no connection pool exhaustion alerting

The number 5 is a guideline, not a rule. Stop when you reach actionable root causes. Some incidents require 3 levels, others require 7.
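
The first root cause above often has a one-line fix. A minimal sketch of enforcing a query timeout at the connection pool, assuming the reporting service uses PostgreSQL via psycopg2 (the DSN, pool sizes, and 5-second limit are all illustrative):

```python
# Hedged sketch: a server-side statement timeout ensures a single slow
# query cannot hold a pooled connection open for minutes.
from psycopg2 import pool

REPORTING_DSN = "dbname=reports user=reporting host=db.internal"  # hypothetical

connection_pool = pool.ThreadedConnectionPool(
    minconn=2,
    maxconn=10,  # a bounded pool fails fast instead of starving other services
    dsn=REPORTING_DSN,
    # PostgreSQL aborts any statement running longer than 5000 ms, so a
    # full table scan frees its connection instead of exhausting the pool.
    options="-c statement_timeout=5000",
)
```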

Timeline Reconstruction

Build a complete timeline from multiple sources: monitoring alerts, Slack messages, deployment logs, customer reports, and participant recollections. A good timeline reveals gaps in detection and response.

| Time (UTC) | Event | Source | Gap Analysis |
|---|---|---|---|
| 14:02 | Deploy of reporting-service v2.14.0 | CI/CD log | -- |
| 14:08 | DB connection pool usage rises above 80% | Metrics | No alert configured for this metric |
| 14:15 | First customer reports login failure | Support ticket | 7-minute detection gap (customer before monitoring) |
| 14:18 | Auth service 5xx rate alert triggers | Monitoring | -- |
| 14:25 | Root cause identified; reporting service rolled back | Incident channel | -- |
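
Merging sources is mostly a matter of normalizing every timestamp to UTC and sorting. A minimal sketch, assuming events have already been pulled from each source (the date and the event structure are illustrative):

```python
# Hedged sketch: merging events from several sources into one UTC timeline.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEvent:
    when: datetime  # always stored and compared in UTC
    event: str
    source: str

events = [
    TimelineEvent(datetime(2024, 3, 7, 14, 18), "Auth service 5xx rate alert triggers", "Monitoring"),
    TimelineEvent(datetime(2024, 3, 7, 14, 2), "Deploy of reporting-service v2.14.0", "CI/CD log"),
    TimelineEvent(datetime(2024, 3, 7, 14, 15), "First customer reports login failure", "Support ticket"),
]

# Sorting interleaves the sources and makes detection gaps visible:
# here the customer report lands before the monitoring alert.
for e in sorted(events, key=lambda e: e.when):
    print(f"{e.when:%H:%M}  {e.event}  [{e.source}]")
```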

Action Item Tracking

Post-mortem action items are only valuable if they get completed. Use a structured format and track them in your issue tracker, not just in the post-mortem document.

| Action | Type | Priority | Owner | Due |
|---|---|---|---|---|
| Add query timeout to reporting service DB connections | Prevent | P0 | Backend team | This sprint |
| Add connection pool saturation alert | Detect | P1 | SRE team | This sprint |
| Add query plan validation to migration CI | Prevent | P2 | Platform team | Next sprint |
| Document DB connection pool sizing guidelines | Process | P3 | SRE team | This quarter |

Action types: Prevent (stop it from happening), Detect (find it faster), Mitigate (reduce impact), Process (improve response).
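
To keep items in the issue tracker rather than only in the document, it helps to make the structure machine-readable. A hedged sketch (the field names and priority scale are assumptions, not a standard schema):

```python
# Hedged sketch: a structured action item mirroring the table above.
from dataclasses import dataclass
from enum import Enum

class ActionType(Enum):
    PREVENT = "Prevent"    # stop it from happening
    DETECT = "Detect"      # find it faster
    MITIGATE = "Mitigate"  # reduce impact
    PROCESS = "Process"    # improve response

@dataclass
class ActionItem:
    action: str
    type: ActionType
    priority: str               # "P0" through "P3"
    owner: str
    due: str
    ticket: str | None = None   # issue tracker link, per the guidance above

# First row of the table above as a structured record.
item = ActionItem(
    action="Add query timeout to reporting service DB connections",
    type=ActionType.PREVENT,
    priority="P0",
    owner="Backend team",
    due="This sprint",
)
```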

Post-Mortem Document Structure

# Post-Mortem: [INCIDENT TITLE]

## Metadata
- Date: [INCIDENT DATE]
- Duration: [START] to [END] ([TOTAL DURATION])
- Severity: [SEV LEVEL]
- Incident Commander: [NAME]
- Post-Mortem Author: [NAME]
- Post-Mortem Review Date: [DATE]

## Executive Summary
[2-3 sentences: what happened, who was affected, how long]

## Impact
- Users affected: [NUMBER or PERCENTAGE]
- Revenue impact: [IF APPLICABLE]
- SLA impact: [ERROR BUDGET CONSUMED]
- Support tickets: [COUNT]

## Timeline
[CHRONOLOGICAL LIST of all significant events with timestamps]

## Root Cause Analysis
[DETAILED TECHNICAL EXPLANATION]
[5 WHYS or other analysis technique results]

## Contributing Factors
[LIST of system and process weaknesses that enabled the incident]

## What Went Well
[LIST of things that worked during the response]

## What Could Be Improved
[LIST of things that slowed detection, response, or recovery]

## Action Items
[TABLE: Action | Type | Priority | Owner | Due Date | Ticket Link]

## Lessons Learned
[KEY TAKEAWAYS for the broader organization]

## Follow-Up Schedule
- [DATE]: Review action item progress
- [DATE]: Verify fixes in production
- [DATE]: Close post-mortem

Follow-Up Tracking

Schedule review meetings

Set a follow-up review at 2 weeks and 4 weeks post-incident. Check that all P0 and P1 action items are complete or have a clear plan. Unfinished action items from post-mortems are one of the most common failure patterns in incident management.

Track completion rate

Measure the percentage of post-mortem action items completed within their target timeframe. A healthy organization completes 80%+ of P0/P1 items on time. If the completion rate is low, the problem is usually prioritization, not engineering capacity.
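
A minimal sketch of computing that metric (the item shape is illustrative):

```python
# Hedged sketch: on-time completion rate for P0/P1 post-mortem action
# items. The TrackedItem shape is illustrative, not a standard schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class TrackedItem:
    priority: str              # "P0" through "P3"
    due: date
    completed_on: date | None  # None while the item is still open

def on_time_rate(items: list[TrackedItem]) -> float:
    """Fraction of P0/P1 items completed on or before their due date."""
    critical = [i for i in items if i.priority in ("P0", "P1")]
    if not critical:
        return 1.0  # nothing critical to track
    on_time = sum(1 for i in critical
                  if i.completed_on is not None and i.completed_on <= i.due)
    return on_time / len(critical)

# Per the guidance above, a result below 0.8 usually points to a
# prioritization problem rather than missing engineering capacity.
```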

Share learnings broadly

Publish post-mortem summaries (at an appropriate level of detail) across the organization. Other teams may share the same vulnerabilities. A monthly digest of post-mortem learnings is an effective format.