~linuxgoose/engineering-templates


# Incident Postmortem

## Incident Summary

Incident ID: [INC-XXX]
Title: [Brief description]
Date/Time: [When it occurred]
Duration: [How long it lasted]
Severity: [Critical/High/Medium/Low]
Date of Postmortem: [When this was written]

## Timeline

| Time | Event |
|------|-------|
| [HH:MM] | [What happened and who discovered it] |
| [HH:MM] | [Response action taken] |
| [HH:MM] | [Escalation/notification] |
| [HH:MM] | [Mitigation steps] |
| [HH:MM] | [Issue resolved] |
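
The Duration field in the summary can be derived from the timeline entries rather than estimated by hand. The following is a minimal sketch, assuming `HH:MM` timestamps on a 24-hour clock and an incident that crosses midnight at most once. The `elapsed` helper, the `timeline` keys, and the sample times are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two HH:MM entries.

    Assumption: the end entry is less than 24 hours after the start,
    so a negative difference means the incident crossed midnight.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta


# Hypothetical timeline entries (illustrative values only).
timeline = {
    "detected": "14:05",
    "mitigated": "14:40",
    "resolved": "15:20",
}

time_to_mitigate = elapsed(timeline["detected"], timeline["mitigated"])
duration = elapsed(timeline["detected"], timeline["resolved"])

print(f"Time to mitigate: {time_to_mitigate}")
print(f"Duration: {duration}")
```

If the incident spans more than one day, full dates should be recorded in the timeline instead of relying on the midnight adjustment.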

## Impact

### External Impact

  • Number of affected users
  • Features unavailable
  • User experience degradation
  • Business impact (revenue, customers, SLA violation, etc.)

### Internal Impact

  • Teams affected
  • Services impacted
  • Data consistency issues (if any)
  • Incident response overhead

### Metrics

  • Traffic during incident
  • Error rate
  • Response time degradation
  • Downtime duration
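
Stating these metrics as numbers makes incidents easier to compare. The sketch below shows one possible way to compute an error rate and a response-time degradation percentage. The function names, the choice of p95 latency, and all sample figures are assumptions for illustration, not values prescribed by this template.

```python
def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def degradation_pct(baseline_ms: float, incident_ms: float) -> float:
    """Percentage increase in response time relative to the baseline."""
    return (incident_ms - baseline_ms) / baseline_ms * 100


# Hypothetical figures for the incident window (illustrative only).
total_requests = 120_000
failed_requests = 4_800
baseline_p95_ms = 200.0
incident_p95_ms = 650.0

print(f"Error rate: {error_rate(failed_requests, total_requests):.1%}")
print(f"p95 degradation: {degradation_pct(baseline_p95_ms, incident_p95_ms):.0f}%")
```

With these sample figures the error rate is 4.0% and the p95 response time is 225% above baseline.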

## Root Cause Analysis

### What Happened

A detailed, blameless narrative of the incident. What occurred technically?

### Why It Happened

Explain the root cause: why did this event occur? Consider multiple contributing factors:

  1. Immediate Cause: What directly triggered the incident?
  2. Contributing Factors: What conditions made this possible?
  3. Systemic Issues: What deeper issues allowed this to happen?

### Why It Wasn't Caught

  • Monitoring/alert gaps
  • Testing gaps
  • Process gaps
  • Assumptions that proved wrong

## Lessons Learned

### What Went Well

Positive aspects of the response and recovery:

  • Fast detection
  • Good coordination
  • Clear communication
  • Effective mitigation
  • Good documentation

### What Could Improve

Areas for improvement:

  • Detection speed
  • Response procedures
  • Communication channels
  • Monitoring coverage
  • Testing approach
  • Process clarity

## Action Items

| ID | Action | Owner | Due Date | Priority |
|----|--------|-------|----------|----------|
| [A1] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
| [A2] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
| [A3] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |

### Example Action Items

  • Improve monitoring alert for [metric]
  • Add integration test for [scenario]
  • Implement circuit breaker for [external dependency]
  • Update runbook with new procedure
  • Increase database connection pool limit
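
For teams unfamiliar with the circuit breaker item above, here is a minimal sketch of the pattern: after a number of consecutive failures, calls to the dependency are rejected for a cool-down period, and then one trial call is allowed through. The `CircuitBreaker` class, its parameters, and the default thresholds are hypothetical. A production system would more likely use an established resilience library than this illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls are suspended")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the external dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or fail fast.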

## Prevention Measures

### Short-term (immediate fixes)

  • What can be done immediately to prevent recurrence?

### Long-term (systemic improvements)

  • What systemic changes should be made?
  • Which of these are included in action items above?

## Appendices

### Configuration/Code Snippets

  • Relevant config at time of incident
  • Code involved in the incident

### Monitoring Gaps Identified

  • What alerts should have fired but didn't?
  • Why didn't they?

### Documentation

  • Links to related runbooks
  • Links to alerting configuration
  • Links to deployment procedures

## Postmortem Sign-off

  • Facilitated by: [Name]
  • Reviewed by: [Name]
  • Date: [Date]
  • Next review: [Date if follow-up needed]