~linuxgoose/engineering-templates (f066b4677fc7a0877df86c3588c22c1221a4242d): templates/incident-postmortem.md

#Incident Postmortem

#Incident Summary

Incident ID: [INC-XXX]
Title: [Brief description]
Date/Time: [When it occurred]
Duration: [How long it lasted]
Severity: [Critical/High/Medium/Low]
Date of Postmortem: [When this was written]

#Timeline

Time	Event
[HH:MM]	[What happened and who discovered it]
[HH:MM]	[Response action taken]
[HH:MM]	[Escalation/notification]
[HH:MM]	[Mitigation steps]
[HH:MM]	[Issue resolved]

#Impact

#External Impact

Number of affected users
Features unavailable
User experience degradation
Business impact (revenue, customers, SLA violation, etc.)

#Internal Impact

Teams affected
Services impacted
Data consistency issues (if any)
Incident response overhead

#Metrics

Traffic during incident
Error rate
Response time degradation
Downtime duration

#Root Cause Analysis

#What Happened

Detailed narrative of the incident without blame. What occurred technically?

#Why It Happened

Root cause explanation. Why did this event occur? Consider multiple contributing factors:

Immediate Cause: What directly triggered the incident?
Contributing Factors: What conditions made this possible?
Systemic Issues: What deeper issues allowed this to happen?

#Why It Wasn't Caught

Monitoring/alert gaps
Testing gaps
Process gaps
Assumptions that proved wrong

#Lessons Learned

#What Went Well

Positive aspects of the response and recovery:

Fast detection
Good coordination
Clear communication
Effective mitigation
Good documentation

#What Could Improve

Areas for improvement:

Detection speed
Response procedures
Communication channels
Monitoring coverage
Testing approach
Process clarity

#Action Items

ID	Action	Owner	Due Date	Priority
[A1]	[Specific, measurable action]	[Person/Team]	[Date]	[P0/P1/P2]
[A2]	[Specific, measurable action]	[Person/Team]	[Date]	[P0/P1/P2]
[A3]	[Specific, measurable action]	[Person/Team]	[Date]	[P0/P1/P2]

#Example Action Items

Improve monitoring alert for [metric]
Add integration test for [scenario]
Implement circuit breaker for [external dependency]
Update runbook with new procedure
Increase database connection pool limit

#Prevention Measures

#Short-term (immediate fixes)

What can be done immediately to prevent recurrence?

#Long-term (systemic improvements)

What systemic changes should be made?
Which of these are included in action items above?

#Appendices

#Configuration/Code Snippets

Relevant config at time of incident
Code involved in the incident

#Monitoring Gaps Identified

What alerts should have fired but didn't?
Why didn't they?

#Documentation

Links to related runbooks
Links to alerting configuration
Links to deployment procedures

#Postmortem Sign-off

Facilitated by: [Name]
Reviewed by: [Name]
Date: [Date]
Next review: [Date if follow-up needed]