Incident Communication Templates
Copy-and-adapt templates for every stage of an incident lifecycle. Each template includes placeholders for service names, timestamps, and technical details.
Severity Levels and Response Expectations
| Severity | Definition | Initial Update | Update Frequency | Response Team |
|---|---|---|---|---|
| SEV1 | Complete service outage or data loss affecting all users | Within 5 minutes | Every 15 minutes | Incident Commander + full on-call rotation |
| SEV2 | Major feature degraded or unavailable; significant user impact | Within 15 minutes | Every 30 minutes | On-call engineer + team lead |
| SEV3 | Minor feature issue; workaround available; limited user impact | Within 30 minutes | Every 2 hours | On-call engineer |
| SEV4 | Cosmetic issue or minor bug; no functional impact | Within 4 hours | Daily or as needed | Assigned engineer during business hours |
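The cadences in the table can also be kept as data so that tooling can compute when the next update is due. The following is a minimal sketch in Python, not a prescribed implementation: the names `SEVERITY_POLICY` and `next_update_due` are hypothetical, and SEV4's "Daily or as needed" is assumed to mean 24 hours.

```python
from datetime import datetime, timedelta, timezone

# Cadences copied from the severity table. SEV4's "daily or as needed"
# is modeled as 24 hours, which is an assumption.
SEVERITY_POLICY = {
    "SEV1": {"initial": timedelta(minutes=5), "every": timedelta(minutes=15)},
    "SEV2": {"initial": timedelta(minutes=15), "every": timedelta(minutes=30)},
    "SEV3": {"initial": timedelta(minutes=30), "every": timedelta(hours=2)},
    "SEV4": {"initial": timedelta(hours=4), "every": timedelta(hours=24)},
}


def next_update_due(severity: str, last_event: datetime, first_update_sent: bool) -> datetime:
    """Return the deadline for the next public update.

    last_event is the incident confirmation time (before the first update)
    or the time of the most recent update (afterwards).
    """
    policy = SEVERITY_POLICY[severity]
    interval = policy["every"] if first_update_sent else policy["initial"]
    return last_event + interval


confirmed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_update_due("SEV1", confirmed, first_update_sent=False).isoformat())
# 2024-01-01T12:05:00+00:00
```

Keeping all timestamps timezone-aware and in UTC matches the templates, which ask for UTC start times.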
Initial Notification Template
Post this as soon as an incident is confirmed. Speed of acknowledgment matters more than completeness at this stage.
Title: [Investigating] Increased error rates on [SERVICE NAME]
We are investigating reports of [BRIEF DESCRIPTION OF SYMPTOMS].
Affected services: [LIST AFFECTED COMPONENTS]
Start time: [TIMESTAMP in UTC]
Current status: Investigating
We will provide an update within [TIMEFRAME based on severity level].
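If the initial notification is posted by a script or chat bot, the placeholders can be filled programmatically. This is one possible sketch using Python's standard `string.Template`; the field names (`service`, `symptoms`, `components`, `start_utc`, `timeframe`) and the example values are hypothetical stand-ins for the bracketed placeholders.

```python
from string import Template

# Hypothetical field names standing in for the bracketed placeholders.
INITIAL_NOTIFICATION = Template(
    "Title: [Investigating] Increased error rates on $service\n"
    "We are investigating reports of $symptoms.\n"
    "Affected services: $components\n"
    "Start time: $start_utc\n"
    "Current status: Investigating\n"
    "We will provide an update within $timeframe."
)


def render_initial_notification(**fields: str) -> str:
    """Fill the template; raises KeyError if any placeholder is missing."""
    return INITIAL_NOTIFICATION.substitute(**fields)


message = render_initial_notification(
    service="Checkout API",
    symptoms="failed payments",
    components="Checkout API, Payments",
    start_utc="2024-01-01 12:00 UTC",
    timeframe="15 minutes",
)
print(message)
```

Using `substitute` rather than `safe_substitute` is deliberate here: a missing field raises an error instead of publishing a message with an unfilled placeholder.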
Status Update: Investigating
Title: [Investigating] [SERVICE NAME] -- [BRIEF ISSUE]
We are continuing to investigate [ISSUE DESCRIPTION].
What we know so far:
- [OBSERVATION 1]
- [OBSERVATION 2]
Impact: [DESCRIBE USER-FACING IMPACT]
Workaround: [IF AVAILABLE, DESCRIBE WORKAROUND]
Next update: [TIMESTAMP or TIMEFRAME]
Status Update: Identified
Title: [Identified] [SERVICE NAME] -- [BRIEF ISSUE]
We have identified the root cause of [ISSUE DESCRIPTION].
Root cause: [BRIEF TECHNICAL EXPLANATION appropriate for your audience]
Remediation: [DESCRIBE THE FIX BEING APPLIED]
Expected resolution: [ESTIMATED TIME or "We will update as progress is made"]
Impact: [CURRENT USER-FACING IMPACT]
Next update: [TIMESTAMP or TIMEFRAME]
Status Update: Monitoring
Title: [Monitoring] [SERVICE NAME] -- [BRIEF ISSUE]
A fix has been implemented for [ISSUE DESCRIPTION]. We are monitoring the results.
Fix applied: [DESCRIBE WHAT WAS DONE]
Current metrics: [KEY METRICS showing recovery]
Monitoring period: [HOW LONG you will monitor before resolving]
If you continue to experience issues, please [CONTACT METHOD].
Next update: [TIMESTAMP or "when monitoring period completes"]
Status Update: Resolved
Title: [Resolved] [SERVICE NAME] -- [BRIEF ISSUE]
This incident has been resolved.
Duration: [START TIME] to [END TIME] ([TOTAL DURATION])
Root cause: [BRIEF SUMMARY]
Resolution: [WHAT FIXED IT]
A post-mortem will be published within [TIMEFRAME, typically 48-72 hours].
We apologize for the disruption and thank you for your patience.
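The Duration line is easy to get wrong when computed by hand under pressure. A small helper along these lines could derive it from the recorded start and end times; `format_duration` is a hypothetical name, the timestamps are made-up examples, and the "1h 47m" output style is only one reasonable choice.

```python
from datetime import datetime, timezone


def format_duration(start: datetime, end: datetime) -> str:
    """Format an incident duration as e.g. '1h 47m' (seconds are dropped)."""
    if end < start:
        raise ValueError("end time precedes start time")
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m" if hours else f"{minutes}m"


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 13, 47, tzinfo=timezone.utc)
print(f"Duration: {start:%H:%M} to {end:%H:%M} UTC ({format_duration(start, end)})")
# Duration: 12:00 to 13:47 UTC (1h 47m)
```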
Post-Mortem Summary Template
A condensed post-mortem summary suitable for publishing on your status page. The full internal post-mortem document typically contains more detail.
Post-Mortem: [INCIDENT TITLE]
Date: [DATE]
Duration: [TOTAL DURATION]
Severity: [SEV LEVEL]
Incident Commander: [NAME or ROLE]

Summary
-------
[2-3 sentence summary of what happened and the impact]

Timeline (all times UTC)
------------------------
[HH:MM] - [EVENT: e.g., Monitoring alert triggered]
[HH:MM] - [EVENT: e.g., On-call engineer paged]
[HH:MM] - [EVENT: e.g., Root cause identified]
[HH:MM] - [EVENT: e.g., Fix deployed]
[HH:MM] - [EVENT: e.g., Service fully recovered]

Root Cause
----------
[Technical explanation of what caused the incident]

Resolution
----------
[What was done to resolve the incident]

Lessons Learned
---------------
What went well:
- [ITEM]
What could be improved:
- [ITEM]

Action Items
------------
- [ACTION] -- Owner: [TEAM/PERSON] -- Due: [DATE]
- [ACTION] -- Owner: [TEAM/PERSON] -- Due: [DATE]
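Timeline entries are often collected from several sources in different time zones and out of order. As a rough sketch, assuming each event is recorded as a timezone-aware timestamp plus a description, the Timeline section could be generated like this (`render_timeline` and the sample events are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def render_timeline(events):
    """Return '[HH:MM] - event' style lines, sorted and converted to UTC.

    events is an iterable of (timezone-aware datetime, description) pairs.
    """
    ordered = sorted(events, key=lambda event: event[0])
    return [f"{ts.astimezone(timezone.utc):%H:%M} - {text}" for ts, text in ordered]


# One event was logged in UTC+2 to show the conversion.
cest = timezone(timedelta(hours=2))
events = [
    (datetime(2024, 1, 1, 12, 20, tzinfo=timezone.utc), "Fix deployed"),
    (datetime(2024, 1, 1, 14, 2, tzinfo=cest), "Monitoring alert triggered"),
]
for line in render_timeline(events):
    print(line)
# 12:02 - Monitoring alert triggered
# 12:20 - Fix deployed
```

For incidents that span midnight UTC, the `%H:%M` format would need a date component as well.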
Communication Best Practices
Acknowledge quickly
A brief acknowledgment within minutes is more valuable than a detailed update after 30 minutes of silence.
Use plain language
Describe user-visible symptoms, not internal system names. Write "Login is failing," not "Auth service 503s from pod-7."
Commit to next update time
Every update should include when the next update will come. This reduces "are you still working on it?" support tickets.
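A lightweight check before publishing can catch drafts that omit this commitment. The sketch below assumes drafts follow the templates in this document, where the commitment appears on a line starting with "Next update:" or "We will provide an update within"; the function name and pattern are illustrative only, and Resolved messages would be exempt from such a check.

```python
import re

# Matches the two next-update phrasings used by the templates, and requires
# some content after them on the same line.
NEXT_UPDATE_PATTERN = re.compile(
    r"^(?:next update:|we will provide an update within)[ \t]*\S",
    re.IGNORECASE | re.MULTILINE,
)


def commits_to_next_update(message: str) -> bool:
    """Return True if the draft states when the next update will come."""
    return NEXT_UPDATE_PATTERN.search(message) is not None


draft = "Impact: Login is failing for some users.\nNext update: 12:30 UTC"
print(commits_to_next_update(draft))  # True
```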
Separate internal and external comms
Your status page audience is customers and users. Internal war-room details belong in Slack or your incident management tool.