From d641bd1a558898166134c7ad5019f0ab00f2a353 Mon Sep 17 00:00:00 2001
From: Jordan Robinson
Date: Sun, 19 Oct 2025 15:40:35 +0100
Subject: [PATCH] add incident postmortem template

---
 README.md                        |   3 +-
 templates/incident-postmortem.md | 121 +++++++++++++++++++++++++++++++
 2 files changed, 123 insertions(+), 1 deletion(-)
 create mode 100644 templates/incident-postmortem.md

diff --git a/README.md b/README.md
index ff6c5acec6ba1a2d036a8e7bb6aa5cd249567768..5f5a718148dbdb2051a922c8721d2dc5da7e4331 100644
--- a/README.md
+++ b/README.md
@@ -12,4 +12,5 @@ Comprehensive collection of templates for solutions design, architecture decisio
 * Runbook / Operations Guide
 * Code Review Standards & Guidelines
 * Technical Debt Registry
-* Capacity Planning & Scaling Document
\ No newline at end of file
+* Capacity Planning & Scaling Document
+* Incident Postmortem
\ No newline at end of file
diff --git a/templates/incident-postmortem.md b/templates/incident-postmortem.md
new file mode 100644
index 0000000000000000000000000000000000000000..e35c0139c8f08fae65c5ead527e7d44a67a9eb99
--- /dev/null
+++ b/templates/incident-postmortem.md
@@ -0,0 +1,121 @@
+# Incident Postmortem
+
+## Incident Summary
+**Incident ID:** [INC-XXX]
+**Title:** [Brief description]
+**Date/Time:** [When it occurred]
+**Duration:** [How long it lasted]
+**Severity:** [Critical/High/Medium/Low]
+**Date of Postmortem:** [When this was written]
+
+## Timeline
+
+| Time | Event |
+|------|-------|
+| [HH:MM] | [What happened and who discovered it] |
+| [HH:MM] | [Response action taken] |
+| [HH:MM] | [Escalation/notification] |
+| [HH:MM] | [Mitigation steps] |
+| [HH:MM] | [Issue resolved] |
+
+## Impact
+
+### External Impact
+- Number of affected users
+- Features unavailable
+- User experience degradation
+- Business impact (revenue, customers, SLA violation, etc.)
+
+### Internal Impact
+- Teams affected
+- Services impacted
+- Data consistency issues (if any)
+- Incident response overhead
+
+### Metrics
+- Traffic during incident
+- Error rate
+- Response time degradation
+- Downtime duration
+
+## Root Cause Analysis
+
+### What Happened
+Detailed narrative of the incident without blame. What occurred technically?
+
+### Why It Happened
+Root cause explanation. Why did this event occur? Consider multiple contributing factors:
+
+1. **Immediate Cause:** What directly triggered the incident?
+2. **Contributing Factors:** What conditions made this possible?
+3. **Systemic Issues:** What deeper issues allowed this to happen?
+
+### Why It Wasn't Caught
+- Monitoring/alert gaps
+- Testing gaps
+- Process gaps
+- Assumptions that proved wrong
+
+## Lessons Learned
+
+### What Went Well
+Positive aspects of the response and recovery:
+- Fast detection
+- Good coordination
+- Clear communication
+- Effective mitigation
+- Good documentation
+
+### What Could Improve
+Areas for improvement:
+- Detection speed
+- Response procedures
+- Communication channels
+- Monitoring coverage
+- Testing approach
+- Process clarity
+
+## Action Items
+
+| ID | Action | Owner | Due Date | Priority |
+|----|--------|-------|----------|----------|
+| [A1] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
+| [A2] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
+| [A3] | [Specific, measurable action] | [Person/Team] | [Date] | [P0/P1/P2] |
+
+### Example Action Items
+- Improve monitoring alert for [metric]
+- Add integration test for [scenario]
+- Implement circuit breaker for [external dependency]
+- Update runbook with new procedure
+- Increase database connection pool limit
+
+## Prevention Measures
+
+### Short-term (immediate fixes)
+- What can be done immediately to prevent recurrence?
+
+### Long-term (systemic improvements)
+- What systemic changes should be made?
+- Which of these are included in action items above?
+
+## Appendices
+
+### Configuration/Code Snippets
+- Relevant config at time of incident
+- Code involved in the incident
+
+### Monitoring Gaps Identified
+- What alerts should have fired but didn't?
+- Why didn't they?
+
+### Documentation
+- Links to related runbooks
+- Links to alerting configuration
+- Links to deployment procedures
+
+## Postmortem Sign-off
+- Facilitated by: [Name]
+- Reviewed by: [Name]
+- Date: [Date]
+- Next review: [Date if follow-up needed]
\ No newline at end of file