M README.md => README.md +2 -1
@@ 8,4 8,5 @@ Comprehensive collection of templates for solutions design, architecture decisio
* Bug Fix
* Performance Optimisation
* Architecture Decision Record (ADR)
-* API Documentation
\ No newline at end of file
+* API Documentation
+* Runbook / Operations Guide
\ No newline at end of file
A templates/runbook-operations-guide.md => templates/runbook-operations-guide.md +166 -0
@@ 0,0 1,166 @@
+# Runbook / Operations Guide
+
+## Service Overview
+**Service Name:** [Name]
+**Owner Team:** [Team]
+**On-Call:** [How to reach]
+**PagerDuty Integration:** [Link if applicable]
+
+## Service Architecture
+- High-level system diagram
+- Key components and their purpose
+- Dependencies (internal and external services)
+- Data flow overview
+
+## Deployment
+
+### Environments
+- Production environment details
+- Staging environment details
+- Development environment details
+
+### How to Deploy
+- Deployment process step-by-step
+- CI/CD pipeline overview
+- Approval requirements
+- Rollback procedure
+- Deployment checklist
+
+### Monitoring Post-Deployment
+- Key metrics to verify
+- Health checks
+- Expected log messages
+- Alert verification
+
+## Common Operations
+
+### Starting the Service
+- Prerequisites
+- Startup commands
+- Verification steps
+- Startup time expectations
+
+### Stopping the Service
+- Graceful shutdown procedure
+- Connection draining
+- Backup considerations
+- Shutdown verification
+
+### Configuration Management
+- Configuration file locations
+- Environment variables
+- How to reload configuration without a restart
+- Configuration validation
+
+### Database Operations
+- Connection details
+- Backup/restore procedures
+- Migration process
+- Replication monitoring (if applicable)
+
+## Troubleshooting Guide
+
+### Symptom: [Issue 1]
+**Indicators:**
+- Log messages to look for
+- Metric anomalies
+- User-facing symptoms
+
+**Diagnosis:**
+- How to verify the issue
+- Relevant commands to run
+- Where to check logs
+
+**Resolution:**
+- Step-by-step fix procedure
+- Workarounds if a permanent fix is unavailable
+- Prevention measures
+
+### Symptom: [Issue 2]
+[Same structure as above]
+
+## Performance & Scaling
+
+### Current Capacity
+- Current load metrics
+- Resource utilization
+- Scaling limits
+
+### Scaling Procedures
+- How to scale horizontally
+- How to scale vertically
+- Load balancer configuration
+- Cache invalidation (if needed)
+
+### Performance Optimization
+- Known bottlenecks
+- Tuning parameters
+- Caching strategies
+
+## Monitoring & Alerts
+
+### Key Dashboards
+- Dashboard names and links
+- What each dashboard monitors
+- Expected ranges for key metrics
+
+### Critical Alerts
+| Alert Name | Threshold | Action | Escalation |
+|------------|-----------|--------|-----------|
+| [Alert 1] | [Condition] | [Steps] | [Next step] |
+| [Alert 2] | [Condition] | [Steps] | [Next step] |
+
+### Log Locations
+- Application logs path
+- Error logs path
+- Access logs path
+- How to tail logs remotely
+
+### Metrics & KPIs
+- Response time SLO
+- Error rate threshold
+- Throughput capacity
+- Custom metrics to monitor
+
+## Incident Response
+
+### Page Escalation
+- Who to page first
+- Escalation procedure
+- Communication channels
+- Status page updates
+
+### Common Incident Scenarios
+- Database connectivity loss
+- External dependency failure
+- Memory leak symptoms and response
+- High latency symptoms and response
+
+### Postmortem Process
+- How to trigger a postmortem
+- Documentation requirements
+- Blameless culture note
+
+## Maintenance & Upgrades
+
+### Scheduled Maintenance
+- Maintenance windows
+- User notification procedures
+- Expected downtime
+- Verification steps
+
+### Dependency Updates
+- Update frequency
+- Testing requirements
+- Rollback plan
+- Security patch procedure
+
+### Data Cleanup
+- Retention policies
+- How to run cleanup jobs
+- Impact of cleanup operations
+
+## Runbook Verification
+- Last verified: [Date]
+- Verified by: [Name]
+- Next review due: [Date]
\ No newline at end of file