M README.md => README.md +2 -1
@@ 12,4 12,5 @@ Comprehensive collection of templates for solutions design, architecture decisio
* Runbook / Operations Guide
* Code Review Standards & Guidelines
* Technical Debt Registry
-* Capacity Planning & Scaling Document
\ No newline at end of file
+* Capacity Planning & Scaling Document
+* Incident Postmortem
\ No newline at end of file
A templates/incident-postmortem.md => templates/incident-postmortem.md +121 -0
@@ 0,0 1,121 @@
+# Incident Postmortem
+
+## Incident Summary
+**Incident ID:** [INC-XXX]
+**Title:** [Brief description]
+**Date/Time:** [When it started, including time zone]
+**Duration:** [How long it lasted, from start to resolution]
+**Severity:** [Critical/High/Medium/Low]
+**Date of Postmortem:** [When this was written]
+
+## Timeline
+
+| Time | Event |
+|------|-------|
+| [HH:MM] | [What happened and who discovered it] |
+| [HH:MM] | [Response action taken] |
+| [HH:MM] | [Escalation/notification] |
+| [HH:MM] | [Mitigation steps] |
+| [HH:MM] | [Issue resolved] |
+
+## Impact
+
+### External Impact
+- Number of affected users
+- Features unavailable
+- User experience degradation
+- Business impact (revenue, customers, SLA violation, etc.)
+
+### Internal Impact
+- Teams affected
+- Services impacted
+- Data consistency issues (if any)
+- Incident response overhead
+
+### Metrics
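+- Traffic during incident (e.g., requests per second, compared to baseline)
+- Error rate (e.g., percentage of failed requests)
+- Response time degradation (e.g., latency during the incident vs. baseline)
+- Downtime duration (total time the service was unavailable)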
+
+## Root Cause Analysis
+
+### What Happened
+A detailed, blameless narrative of the incident. What occurred technically, and in what order?
+
+### Why It Happened
+Root cause explanation. Why did this event occur? Consider multiple contributing factors:
+
+1. **Immediate Cause:** What directly triggered the incident?
+2. **Contributing Factors:** What conditions made this possible?
+3. **Systemic Issues:** What deeper issues allowed this to happen?
+
+### Why It Wasn't Caught
+- Monitoring/alert gaps
+- Testing gaps
+- Process gaps
+- Assumptions that proved wrong
+
+## Lessons Learned
+
+### What Went Well
+Positive aspects of the response and recovery:
+- Fast detection
+- Good coordination
+- Clear communication
+- Effective mitigation
+- Good documentation
+
+### What Could Improve
+Areas for improvement:
+- Detection speed
+- Response procedures
+- Communication channels
+- Monitoring coverage
+- Testing approach
+- Process clarity
+
+## Action Items
+
+| ID | Action | Owner | Due Date | Priority |
+|----|--------|-------|----------|----------|
+| [A1] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
+| [A2] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
+| [A3] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
+
+### Example Action Items
+- Add or tune the monitoring alert for [metric]
+- Add integration test for [scenario]
+- Implement circuit breaker for [external dependency]
+- Update runbook with new procedure
+- Increase database connection pool limit
+
+## Prevention Measures
+
+### Short-term (immediate fixes)
+- What can be done immediately to prevent recurrence?
+
+### Long-term (systemic improvements)
+- What systemic changes should be made?
+- Which of these are already tracked in the Action Items table above?
+
+## Appendices
+
+### Configuration/Code Snippets
+- Relevant config at time of incident
+- Code involved in the incident
+
+### Monitoring Gaps Identified
+- Which alerts should have fired but didn't?
+- Why didn't they fire?
+
+### Documentation
+- Links to related runbooks
+- Links to alerting configuration
+- Links to deployment procedures
+
+## Postmortem Sign-off
+- Facilitated by: [Name]
+- Reviewed by: [Name]
+- Date: [Date]
+- Next review: [Date if follow-up needed]
\ No newline at end of file