Runbooks Index
Severity Classification
| Severity | Definition | Response Time |
|---|---|---|
| Critical | Production is down or data integrity is at risk | Immediate — drop everything |
| High | Production degraded, significant user impact | Within 15 minutes |
| Medium | Non-production environment affected, or low-impact production issue | Within 1 hour |
| Low | Cosmetic or minor operational issue | During business hours |
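
The table above can also be kept in machine-readable form so that tooling computes a response deadline consistently. The following is a minimal sketch, not an existing implementation: the names `Severity`, `RESPONSE_WINDOW`, and `response_deadline` are hypothetical. It assumes that "Immediate" maps to a zero-length window and that "During business hours" means no fixed deadline.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    """Severity levels, named as in the classification table."""
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Response windows taken from the table. None means "no fixed window":
# Low-severity issues are handled during business hours (assumption:
# business hours are not modelled here).
RESPONSE_WINDOW: Dict[Severity, Optional[timedelta]] = {
    Severity.CRITICAL: timedelta(0),        # immediate
    Severity.HIGH: timedelta(minutes=15),
    Severity.MEDIUM: timedelta(hours=1),
    Severity.LOW: None,
}


def response_deadline(severity: Severity, opened_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable first-response time, or None if unbounded."""
    window = RESPONSE_WINDOW[severity]
    return None if window is None else opened_at + window
```

For example, calling `response_deadline(Severity.HIGH, opened_at)` returns a timestamp 15 minutes after `opened_at`, while `Severity.LOW` returns `None`.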
Runbook Directory
| Runbook | Severity | Triggers |
|---|---|---|
| Production Outage | Critical | Error rate spike, service unreachable, health checks failing |
| Database Failure | Critical | Cloud SQL unavailable, data corruption, failover needed |
| Security Incident | Critical | Unauthorized access, credential exposure, anomalous behavior |
| Certificate Rotation | High | TLS certificate expired, cert-manager not renewing |
| Scaling Events | Medium | Node capacity exhausted, manual scaling needed |
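
For quick triage from a chat bot or CLI, the directory above could be mirrored as data and searched by symptom. This is only an illustrative sketch: `RUNBOOKS` and `runbooks_for` are hypothetical names, and the matching is a naive case-insensitive substring check against the trigger phrases in the table, not a real alert-routing rule.

```python
from typing import Dict, List, Tuple

# Runbook name -> (severity, trigger phrases), copied from the directory table.
RUNBOOKS: Dict[str, Tuple[str, List[str]]] = {
    "Production Outage": (
        "Critical",
        ["error rate spike", "service unreachable", "health checks failing"],
    ),
    "Database Failure": (
        "Critical",
        ["cloud sql unavailable", "data corruption", "failover needed"],
    ),
    "Security Incident": (
        "Critical",
        ["unauthorized access", "credential exposure", "anomalous behavior"],
    ),
    "Certificate Rotation": (
        "High",
        ["tls certificate expired", "cert-manager not renewing"],
    ),
    "Scaling Events": (
        "Medium",
        ["node capacity exhausted", "manual scaling needed"],
    ),
}


def runbooks_for(symptom: str) -> List[str]:
    """Return runbook names with a trigger phrase containing the symptom text."""
    needle = symptom.strip().lower()
    if not needle:
        return []  # an empty query would otherwise match every runbook
    return [
        name
        for name, (_severity, triggers) in RUNBOOKS.items()
        if any(needle in trigger for trigger in triggers)
    ]
```

A query such as `runbooks_for("failover")` points to the Database Failure runbook. A symptom that matches nothing returns an empty list, in which case the responder falls back to the table.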
Alerting & Escalation
[NEEDS TEAM INPUT: document the alerting pipeline: who gets paged, on what conditions, and via what tool (PagerDuty, Slack alerts, etc.). Include the escalation path:

- L1: On-call engineer
- L2: Platform team lead
- L3: Engineering manager + CTO]

Slack channels for alerts: [NEEDS TEAM INPUT: #alerts-prod, #alerts-staging, etc.]
Incident Tracking
[NEEDS TEAM INPUT: where do incidents get tracked? (Linear, Jira, Confluence post-mortems, etc.)]
See Also
- Common Issues — non-incident issues with self-service fixes
- Change Management — safe process for infrastructure changes