Post-Mortem Best Practices
How to run effective post-mortems that improve system reliability without creating a culture of blame.
Blameless Culture
A blameless post-mortem focuses on systems and processes, not individuals. The goal is to understand what happened and why, then fix the systems that allowed it. People who feel safe reporting mistakes produce more honest post-mortems, which lead to better fixes.
Assume good intent
Every person involved was doing what they thought was right given the information they had at the time. The post-mortem should ask "what information was missing or misleading?" not "who made a mistake?"
Focus on contributing factors
Complex systems fail for multiple reasons. A deploy that caused an outage is the trigger, but missing rollback procedures, inadequate monitoring, and insufficient testing are the contributing factors worth addressing.
Separate the post-mortem from performance reviews
If post-mortem findings feed directly into performance reviews, people will be less honest. Keep the two processes separate. Post-mortems are for organizational learning, not individual evaluation.
The 5 Whys Technique
The 5 Whys is a root cause analysis technique that works by repeatedly asking "why" to peel back layers of symptoms and reach the underlying cause. It is simple to use and effective for most incidents.
Example
Incident: customers could not log in (see the timeline below).
1. Why? The auth service returned 5xx errors.
2. Why? Its database connection pool was exhausted.
3. Why? Reporting-service queries held connections open for minutes.
4. Why? A schema migration introduced a slow query with no timeout.
5. Why? The migration pipeline does not validate query performance.
Root causes reached:
- Missing query timeout on reporting service database connections
- Schema migration did not include query performance validation
- Auth service had no connection pool exhaustion alerting
The number 5 is a guideline, not a rule. Stop when you reach actionable root causes. Some incidents require 3 levels, others require 7.
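A 5 Whys chain is easier to review and reuse when it is recorded as structured data rather than buried in meeting notes. A minimal sketch, using an illustrative chain based on the login-failure incident in the timeline below (all details are hypothetical):

```python
# A 5 Whys analysis as an ordered chain of question/answer pairs.
# The incident details here are illustrative, not from a real system.
five_whys = [
    ("Why did users see login failures?",
     "The auth service returned 5xx errors."),
    ("Why did the auth service return 5xx errors?",
     "Its database connection pool was exhausted."),
    ("Why was the connection pool exhausted?",
     "Reporting-service queries held connections open for minutes."),
    ("Why did queries run for minutes?",
     "A schema migration introduced a slow query with no timeout."),
    ("Why was the slow query not caught before deploy?",
     "Migration CI performs no query performance validation."),
]

def render_chain(chain):
    """Format a whys chain for pasting into a post-mortem document."""
    lines = []
    for depth, (question, answer) in enumerate(chain, start=1):
        lines.append(f"{depth}. {question}\n   -> {answer}")
    return "\n".join(lines)

print(render_chain(five_whys))
```

Keeping the chain as data also makes it trivial to stop at 3 levels or extend to 7, in line with the guideline above.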
Timeline Reconstruction
Build a complete timeline from multiple sources: monitoring alerts, Slack messages, deployment logs, customer reports, and participant recollections. A good timeline reveals gaps in detection and response.
| Time (UTC) | Event | Source | Gap Analysis |
|---|---|---|---|
| 14:02 | Deploy of reporting-service v2.14.0 | CI/CD log | -- |
| 14:08 | DB connection pool usage rises above 80% | Metrics | No alert configured for this metric |
| 14:15 | First customer reports login failure | Support ticket | Customers detected first: 7 min after the unalerted pool breach, 3 min before the monitoring alert |
| 14:18 | Auth service 5xx rate alert triggers | Monitoring | -- |
| 14:25 | Root cause identified; reporting service rolled back | Incident channel | -- |
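Merging events from several sources into one sorted timeline is simple to automate, and the same pass can compute detection gaps. A minimal sketch mirroring the example table above (timestamps and event names are illustrative; the date is arbitrary):

```python
from datetime import datetime, timezone

def ts(hhmm):
    """Parse an HH:MM string into a UTC datetime (date is illustrative)."""
    return datetime.strptime(hhmm, "%H:%M").replace(
        year=2024, month=1, day=1, tzinfo=timezone.utc)

# Events collected from different sources, in arbitrary order.
events = [
    {"time": ts("14:18"), "source": "monitoring", "event": "Auth service 5xx alert"},
    {"time": ts("14:02"), "source": "ci",         "event": "Deploy reporting-service v2.14.0"},
    {"time": ts("14:15"), "source": "support",    "event": "Customer reports login failure"},
    {"time": ts("14:08"), "source": "metrics",    "event": "DB pool usage above 80%"},
]

# One chronological timeline across all sources.
timeline = sorted(events, key=lambda e: e["time"])

# Detection gap: did customers notice before monitoring alerted?
first_customer = min(e["time"] for e in timeline if e["source"] == "support")
first_alert = min(e["time"] for e in timeline if e["source"] == "monitoring")
gap_minutes = (first_alert - first_customer).total_seconds() / 60

for e in timeline:
    print(e["time"].strftime("%H:%M"), e["source"].ljust(10), e["event"])
print(f"Customers saw the issue {gap_minutes:.0f} min before monitoring alerted")
```

Exporting real alert and deploy logs into this shape is usually the hard part; once they share a timestamp format, the merge and gap analysis fall out for free.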
Action Item Tracking
Post-mortem action items are only valuable if they get completed. Use a structured format and track them in your issue tracker, not just in the post-mortem document.
| Action | Type | Priority | Owner | Due |
|---|---|---|---|---|
| Add query timeout to reporting service DB connections | Prevent | P0 | Backend team | This sprint |
| Add connection pool saturation alert | Detect | P1 | SRE team | This sprint |
| Add query plan validation to migration CI | Prevent | P2 | Platform team | Next sprint |
| Document DB connection pool sizing guidelines | Process | P3 | SRE team | This quarter |
Action types: Prevent (stop it from happening), Detect (find it faster), Mitigate (reduce impact), Process (improve response).
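When action items live in an issue tracker, a small validation pass can enforce that every item has a known type, a priority, and an owner before the post-mortem is closed. A sketch under the assumption that items are exported as plain records (field names are illustrative):

```python
from dataclasses import dataclass

# The four action types defined above.
ACTION_TYPES = {"Prevent", "Detect", "Mitigate", "Process"}
PRIORITIES = {"P0", "P1", "P2", "P3"}

@dataclass
class ActionItem:
    action: str
    type: str
    priority: str
    owner: str
    due: str

    def validate(self):
        """Return a list of problems; empty means the item is well-formed."""
        problems = []
        if self.type not in ACTION_TYPES:
            problems.append(f"unknown type {self.type!r}")
        if self.priority not in PRIORITIES:
            problems.append(f"bad priority {self.priority!r}")
        if not self.owner:
            problems.append("missing owner")
        return problems

items = [
    ActionItem("Add query timeout to reporting service DB connections",
               "Prevent", "P0", "Backend team", "This sprint"),
    ActionItem("Add connection pool saturation alert",
               "Detect", "P1", "SRE team", "This sprint"),
]

for item in items:
    assert item.validate() == [], f"{item.action}: {item.validate()}"
```

Running a check like this in CI against the tracker export catches ownerless or untyped items before they silently stall.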
Post-Mortem Document Structure
```markdown
# Post-Mortem: [INCIDENT TITLE]

## Metadata
- Date: [INCIDENT DATE]
- Duration: [START] to [END] ([TOTAL DURATION])
- Severity: [SEV LEVEL]
- Incident Commander: [NAME]
- Post-Mortem Author: [NAME]
- Post-Mortem Review Date: [DATE]

## Executive Summary
[2-3 sentences: what happened, who was affected, how long]

## Impact
- Users affected: [NUMBER or PERCENTAGE]
- Revenue impact: [IF APPLICABLE]
- SLA impact: [ERROR BUDGET CONSUMED]
- Support tickets: [COUNT]

## Timeline
[CHRONOLOGICAL LIST of all significant events with timestamps]

## Root Cause Analysis
[DETAILED TECHNICAL EXPLANATION]
[5 WHYS or other analysis technique results]

## Contributing Factors
[LIST of system and process weaknesses that enabled the incident]

## What Went Well
[LIST of things that worked during the response]

## What Could Be Improved
[LIST of things that slowed detection, response, or recovery]

## Action Items
[TABLE: Action | Type | Priority | Owner | Due Date | Ticket Link]

## Lessons Learned
[KEY TAKEAWAYS for the broader organization]

## Follow-Up Schedule
- [DATE]: Review action item progress
- [DATE]: Verify fixes in production
- [DATE]: Close post-mortem
```
Follow-Up Tracking
Schedule review meetings
Set a follow-up review at 2 weeks and 4 weeks post-incident. Check that all P0 and P1 action items are complete or have a clear plan. Unfinished action items from post-mortems are one of the most common failure patterns in incident management.
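The 2-week and 4-week review dates can be computed from the incident date when the post-mortem is filed, so the follow-ups get scheduled automatically rather than from memory. A trivial sketch (the incident date is illustrative):

```python
from datetime import date, timedelta

def follow_up_schedule(incident_date):
    """Return the 2-week and 4-week post-incident review dates."""
    return (incident_date + timedelta(weeks=2),
            incident_date + timedelta(weeks=4))

# Illustrative incident date.
two_week, four_week = follow_up_schedule(date(2024, 1, 1))
print("2-week review:", two_week)
print("4-week review:", four_week)
```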
Track completion rate
Measure the percentage of post-mortem action items completed within their target timeframe. A healthy organization completes 80%+ of P0/P1 items on time. If the completion rate is low, the bottleneck is usually prioritization, not engineering capacity.
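The on-time completion rate for critical items is a one-line aggregation once items carry a priority and an on-time flag. A minimal sketch with illustrative data:

```python
# Each item is (priority, completed_on_time); data is illustrative.
items = [
    ("P0", True), ("P1", True), ("P1", False), ("P2", True), ("P0", True),
]

# Keep only the critical (P0/P1) items and compute the on-time rate.
critical = [done for prio, done in items if prio in ("P0", "P1")]
rate = 100 * sum(critical) / len(critical)

print(f"P0/P1 on-time completion: {rate:.0f}%")
print("Below the 80% bar" if rate < 80 else "Healthy")
```

Tracking this number per quarter makes the prioritization problem visible long before the next incident does.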
Share learnings broadly
Publish post-mortem summaries (with appropriate detail level) across the organization. Other teams may have the same vulnerabilities. A monthly digest of post-mortem learnings is an effective format.