~linuxgoose/engineering-templates

d641bd1a558898166134c7ad5019f0ab00f2a353 — Jordan Robinson 3 months ago 0bf73b4
add incident postmortem template
2 files changed, 123 insertions(+), 1 deletion(-)

M README.md
A templates/incident-postmortem.md
M README.md => README.md +2 -1
@@ 12,4 12,5 @@ Comprehensive collection of templates for solutions design, architecture decisio
* Runbook / Operations Guide
* Code Review Standards & Guidelines
* Technical Debt Registry
* Capacity Planning & Scaling Document
\ No newline at end of file
* Capacity Planning & Scaling Document
* Incident Postmortem
\ No newline at end of file

A templates/incident-postmortem.md => templates/incident-postmortem.md +121 -0
@@ 0,0 1,121 @@
# Incident Postmortem

## Incident Summary
**Incident ID:** [INC-XXX]  
**Title:** [Brief description]  
**Date/Time:** [When it occurred]  
**Duration:** [How long it lasted]  
**Severity:** [Critical/High/Medium/Low]  
**Date of Postmortem:** [When this was written]

## Timeline

| Time | Event |
|------|-------|
| [HH:MM] | [What happened and who discovered it] |
| [HH:MM] | [Response action taken] |
| [HH:MM] | [Escalation/notification] |
| [HH:MM] | [Mitigation steps] |
| [HH:MM] | [Issue resolved] |

## Impact

### External Impact
- Number of affected users
- Features unavailable
- User experience degradation
- Business impact (lost revenue, affected customers, SLA violations, etc.)

### Internal Impact
- Teams affected
- Services impacted
- Data consistency issues (if any)
- Incident response overhead

### Metrics
- Traffic during incident
- Error rate
- Response time degradation
- Downtime duration (see the calculation sketch below)
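
A minimal sketch of how the headline numbers above can be derived, assuming the raw request counts and incident window are already available from your monitoring system (all values and names here are illustrative placeholders):

```python
from datetime import datetime

# Illustrative inputs; substitute the real values from your monitoring system.
total_requests = 1_200_000        # requests served during the incident window
failed_requests = 54_000          # requests that returned errors
incident_start = datetime(2024, 1, 1, 14, 5)
incident_end = datetime(2024, 1, 1, 15, 47)

downtime = incident_end - incident_start
error_rate = failed_requests / total_requests

print(f"Downtime duration: {downtime}")                 # 1:42:00
print(f"Error rate during incident: {error_rate:.1%}")  # 4.5%
```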

## Root Cause Analysis

### What Happened
A detailed, blameless narrative of the incident. What occurred technically?

### Why It Happened
Root cause explanation. Why did this event occur? Consider multiple contributing factors:

1. **Immediate Cause:** What directly triggered the incident?
2. **Contributing Factors:** What conditions made this possible?
3. **Systemic Issues:** What deeper issues allowed this to happen?

### Why It Wasn't Caught
- Monitoring/alert gaps
- Testing gaps
- Process gaps
- Assumptions that proved wrong

## Lessons Learned

### What Went Well
Positive aspects of the response and recovery:
- Fast detection
- Good coordination
- Clear communication
- Effective mitigation
- Good documentation

### What Could Improve
Areas for improvement:
- Detection speed
- Response procedures
- Communication channels
- Monitoring coverage
- Testing approach
- Process clarity

## Action Items

| ID | Action | Owner | Due Date | Priority |
|----|--------|-------|----------|----------|
| [A1] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
| [A2] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
| [A3] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |

### Example Action Items
- Improve monitoring alert for [metric]
- Add integration test for [scenario]
- Implement circuit breaker for [external dependency] (see the sketch after this list)
- Update runbook with new procedure
- Increase database connection pool limit
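
For the circuit-breaker item above, a minimal sketch of the pattern in Python. The class name, thresholds, and error handling are illustrative; a production implementation would normally use an established library and tune the values to the dependency's error budget:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down period has passed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While the circuit is open, fail fast instead of hammering the dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: calls to dependency suspended")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0  # a success closes the circuit again
            return result
```

Wrapping calls to the external dependency (for example, `breaker.call(fetch_inventory, order_id)` with a hypothetical client function) makes repeated failures trip the breaker, so an outage in the dependency stops cascading into the calling service.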

## Prevention Measures

### Short-term (immediate fixes)
- What can be done immediately to prevent recurrence?

### Long-term (systemic improvements)
- What systemic changes should be made?
- Which of these are already covered by the action items above?

## Appendices

### Configuration/Code Snippets
- Relevant config at time of incident (example format below)
- Code involved in the incident
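
An example of the level of detail worth capturing here, assuming a SQLAlchemy-based Python service; the engine URL and pool values are placeholders, not settings from any real incident:

```python
from sqlalchemy import create_engine

# Connection pool settings in effect at the time of the incident (placeholder values).
engine = create_engine(
    "postgresql://app@db-primary/orders",
    pool_size=5,       # connections kept open; candidate for the "increase pool limit" action item
    max_overflow=10,   # extra connections allowed under burst load
    pool_timeout=30,   # seconds a request waits for a free connection before erroring
)
```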

### Monitoring Gaps Identified
- What alerts should have fired but didn't?
- Why didn't they?

### Documentation
- Links to related runbooks
- Links to alerting configuration
- Links to deployment procedures

## Postmortem Sign-off
- Facilitated by: [Name]
- Reviewed by: [Name]
- Date: [Date]
- Next review: [Date if follow-up needed]
\ No newline at end of file