#Incident Postmortem
#Incident Summary
Incident ID: [INC-XXX]
Title: [Brief description]
Date/Time: [When it occurred]
Duration: [How long it lasted]
Severity: [Critical/High/Medium/Low]
Date of Postmortem: [When this was written]
#Timeline
| Time |
Event |
| [HH:MM] |
[What happened and who discovered it] |
| [HH:MM] |
[Response action taken] |
| [HH:MM] |
[Escalation/notification] |
| [HH:MM] |
[Mitigation steps] |
| [HH:MM] |
[Issue resolved] |
#Impact
#External Impact
- Number of affected users
- Features unavailable
- User experience degradation
- Business impact (revenue, customers, SLA violation, etc.)
#Internal Impact
- Teams affected
- Services impacted
- Data consistency issues (if any)
- Incident response overhead
#Metrics
- Traffic during incident
- Error rate
- Response time degradation
- Downtime duration
#Root Cause Analysis
#What Happened
Detailed narrative of the incident without blame. What occurred technically?
#Why It Happened
Root cause explanation. Why did this event occur? Consider multiple contributing factors:
- Immediate Cause: What directly triggered the incident?
- Contributing Factors: What conditions made this possible?
- Systemic Issues: What deeper issues allowed this to happen?
#Why It Wasn't Caught
- Monitoring/alert gaps
- Testing gaps
- Process gaps
- Assumptions that proved wrong
#Lessons Learned
#What Went Well
Positive aspects of the response and recovery:
- Fast detection
- Good coordination
- Clear communication
- Effective mitigation
- Good documentation
#What Could Improve
Areas for improvement:
- Detection speed
- Response procedures
- Communication channels
- Monitoring coverage
- Testing approach
- Process clarity
#Action Items
| ID |
Action |
Owner |
Due Date |
Priority |
| [A1] |
[Specific, measurable action] |
[Person/Team] |
[Date] |
[P0/P1/P2] |
| [A2] |
[Specific, measurable action] |
[Person/Team] |
[Date] |
[P0/P1/P2] |
| [A3] |
[Specific, measurable action] |
[Person/Team] |
[Date] |
[P0/P1/P2] |
#Example Action Items
- Improve monitoring alert for [metric]
- Add integration test for [scenario]
- Implement circuit breaker for [external dependency]
- Update runbook with new procedure
- Increase database connection pool limit
#Prevention Measures
- What can be done immediately to prevent recurrence?
#Long-term (systemic improvements)
- What systemic changes should be made?
- Which of these are included in action items above?
#Appendices
#Configuration/Code Snippets
- Relevant config at time of incident
- Code involved in the incident
#Monitoring Gaps Identified
- What alerts should have fired but didn't?
- Why didn't they?
#Documentation
- Links to related runbooks
- Links to alerting configuration
- Links to deployment procedures
#Postmortem Sign-off
- Facilitated by: [Name]
- Reviewed by: [Name]
- Date: [Date]
- Next review: [Date if follow-up needed]