# Incident Postmortem

## Incident Summary

**Incident ID:** [INC-XXX]
**Title:** [Brief description]
**Date/Time:** [When it occurred]
**Duration:** [How long it lasted]
**Severity:** [Critical/High/Medium/Low]
**Date of Postmortem:** [When this was written]

## Timeline

| Time | Event |
|------|-------|
| [HH:MM] | [What happened and who discovered it] |
| [HH:MM] | [Response action taken] |
| [HH:MM] | [Escalation/notification] |
| [HH:MM] | [Mitigation steps] |
| [HH:MM] | [Issue resolved] |

## Impact

### External Impact

- Number of affected users
- Features unavailable
- User experience degradation
- Business impact (revenue, customers, SLA violation, etc.)

### Internal Impact

- Teams affected
- Services impacted
- Data consistency issues (if any)
- Incident response overhead

### Metrics

- Traffic during incident
- Error rate
- Response time degradation
- Downtime duration

## Root Cause Analysis

### What Happened

Detailed narrative of the incident without blame. What occurred technically?

### Why It Happened

Root cause explanation. Why did this event occur? Consider multiple contributing factors:

1. **Immediate Cause:** What directly triggered the incident?
2. **Contributing Factors:** What conditions made this possible?
3. **Systemic Issues:** What deeper issues allowed this to happen?

### Why It Wasn't Caught

- Monitoring/alert gaps
- Testing gaps
- Process gaps
- Assumptions that proved wrong

## Lessons Learned

### What Went Well

Positive aspects of the response and recovery:

- Fast detection
- Good coordination
- Clear communication
- Effective mitigation
- Good documentation

### What Could Improve

Areas for improvement:

- Detection speed
- Response procedures
- Communication channels
- Monitoring coverage
- Testing approach
- Process clarity

## Action Items

| ID | Action | Owner | Due Date | Priority |
|----|--------|-------|----------|----------|
| [A1] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
| [A2] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
| [A3] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |

### Example Action Items

- Improve monitoring alert for [metric]
- Add integration test for [scenario]
- Implement circuit breaker for [external dependency]
- Update runbook with new procedure
- Increase database connection pool limit

## Prevention Measures

### Short-term (immediate fixes)

- What can be done immediately to prevent recurrence?

### Long-term (systemic improvements)

- What systemic changes should be made?
- Which of these are included in action items above?

## Appendices

### Configuration/Code Snippets

- Relevant config at time of incident
- Code involved in the incident

### Monitoring Gaps Identified

- What alerts should have fired but didn't?
- Why didn't they?

### Documentation

- Links to related runbooks
- Links to alerting configuration
- Links to deployment procedures

## Postmortem Sign-off

- Facilitated by: [Name]
- Reviewed by: [Name]
- Date: [Date]
- Next review: [Date if follow-up needed]