Runbooks Index

Severity Classification

| Severity | Definition | Response Time |
|----------|------------|---------------|
| Critical | Production is down or data integrity is at risk | Immediate (drop everything) |
| High | Production degraded, significant user impact | Within 15 minutes |
| Medium | Non-production affected, or low-impact production issue | Within 1 hour |
| Low | Cosmetic or minor operational issue | During business hours |
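
If the response times above need to be checked by tooling (for example, to flag an unacknowledged incident), they could be encoded roughly as follows. This is a minimal sketch, not an existing implementation: the names `Severity`, `RESPONSE_WINDOW`, and `response_deadline` are hypothetical, and treating "Immediate" as a zero-length window and "During business hours" as having no fixed deadline are assumptions.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Maximum time to first response, taken from the severity table.
# Assumption: "Immediate" is modelled as a zero-length window, and
# "During business hours" as None (no fixed deadline).
RESPONSE_WINDOW = {
    Severity.CRITICAL: timedelta(0),
    Severity.HIGH: timedelta(minutes=15),
    Severity.MEDIUM: timedelta(hours=1),
    Severity.LOW: None,
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable first-response time, or None if unbounded."""
    window = RESPONSE_WINDOW[severity]
    if window is None:
        return None
    return detected_at + window
```

For example, a High incident detected at 12:00 would have a response deadline of 12:15 under these assumptions.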

Runbook Directory

| Runbook | Severity | Triggers |
|---------|----------|----------|
| Production Outage | Critical | Error rate spike, service unreachable, health checks failing |
| Database Failure | Critical | Cloud SQL unavailable, data corruption, failover needed |
| Security Incident | Critical | Unauthorized access, credential exposure, anomalous behavior |
| Certificate Rotation | High | TLS certificate expired, cert-manager not renewing |
| Scaling Events | Medium | Node capacity exhausted, manual scaling needed |
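
The directory above can also be kept as data so that a script or chat bot can point responders at the right runbook. The sketch below is illustrative only: `RUNBOOKS`, `runbooks_for`, and `match_runbooks` are hypothetical names, and the simple substring match on trigger text is an assumption, not a description of any existing tooling.

```python
# Runbook name -> (severity, triggers), copied from the runbook directory.
RUNBOOKS = {
    "Production Outage": ("Critical", ["Error rate spike", "service unreachable", "health checks failing"]),
    "Database Failure": ("Critical", ["Cloud SQL unavailable", "data corruption", "failover needed"]),
    "Security Incident": ("Critical", ["Unauthorized access", "credential exposure", "anomalous behavior"]),
    "Certificate Rotation": ("High", ["TLS certificate expired", "cert-manager not renewing"]),
    "Scaling Events": ("Medium", ["Node capacity exhausted", "manual scaling needed"]),
}


def runbooks_for(severity: str) -> list[str]:
    """List runbook names with the given severity, in directory order."""
    return [name for name, (sev, _) in RUNBOOKS.items() if sev == severity]


def match_runbooks(keyword: str) -> list[str]:
    """List runbooks with a trigger containing the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        name
        for name, (_, triggers) in RUNBOOKS.items()
        if any(needle in trigger.lower() for trigger in triggers)
    ]
```

Calling `match_runbooks("failover")` would select the Database Failure runbook, and `runbooks_for("Critical")` would list the three Critical runbooks.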

Alerting & Escalation

[NEEDS TEAM INPUT: document the alerting pipeline: who gets paged, on what conditions, and via what tool (PagerDuty, Slack alerts, etc.). Include the escalation path:
- L1: On-call engineer
- L2: Platform team lead
- L3: Engineering manager + CTO]

Slack channels for alerts: [NEEDS TEAM INPUT: #alerts-prod, #alerts-staging, etc.]

Incident Tracking

[NEEDS TEAM INPUT: where are incidents tracked (Linear, Jira, etc.), and where are post-mortems published (e.g. Confluence)?]

See Also