Orofi Infrastructure Docs

oro-codebase/infra

Runbooks Index

Severity Classification

Severity	Definition	Response Time
Critical	Production is down or data integrity is at risk	Immediate — drop everything
High	Production degraded, significant user impact	Within 15 minutes
Medium	Non-production affected, or low-impact production issue	Within 1 hour
Low	Cosmetic or minor operational issue	During business hours
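
As a rough illustration, the table above could be encoded so that tooling can compute a response deadline from a detection time. This is only a sketch: the names `Severity`, `RESPONSE_WINDOW`, and `response_deadline` are hypothetical, "Immediate" is modeled as a zero-length window, and "During business hours" is treated as having no fixed deadline, which is an assumption rather than a documented policy.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Response windows copied from the severity table.
# Assumptions of this sketch: "Immediate" is a zero-length window, and
# "During business hours" has no fixed window, so it is stored as None.
RESPONSE_WINDOW: Dict[Severity, Optional[timedelta]] = {
    Severity.CRITICAL: timedelta(0),
    Severity.HIGH: timedelta(minutes=15),
    Severity.MEDIUM: timedelta(hours=1),
    Severity.LOW: None,
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None if no fixed deadline applies."""
    window = RESPONSE_WINDOW[severity]
    return None if window is None else detected_at + window
```

For example, `response_deadline(Severity.HIGH, detected_at)` returns a time 15 minutes after `detected_at`, while `Severity.LOW` returns `None`.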

Runbook Directory

Runbook	Severity	Triggers
Production Outage	Critical	Error rate spike, service unreachable, health checks failing
Database Failure	Critical	Cloud SQL unavailable, data corruption, failover needed
Security Incident	Critical	Unauthorized access, credential exposure, anomalous behavior
Certificate Rotation	High	TLS certificate expired, cert-manager not renewing
Scaling Events	Medium	Node capacity exhausted, manual scaling needed
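
The directory above can also be treated as data, for instance to suggest a runbook from a symptom seen in an alert. The following is a minimal sketch under that assumption: `Runbook`, `RUNBOOKS`, and `find_runbooks` are hypothetical names, and the simple substring match is illustrative, not how any existing alerting tool routes incidents.

```python
from typing import List, NamedTuple


class Runbook(NamedTuple):
    name: str
    severity: str
    triggers: List[str]


# Entries copied from the runbook directory table.
RUNBOOKS = [
    Runbook("Production Outage", "Critical",
            ["Error rate spike", "service unreachable", "health checks failing"]),
    Runbook("Database Failure", "Critical",
            ["Cloud SQL unavailable", "data corruption", "failover needed"]),
    Runbook("Security Incident", "Critical",
            ["Unauthorized access", "credential exposure", "anomalous behavior"]),
    Runbook("Certificate Rotation", "High",
            ["TLS certificate expired", "cert-manager not renewing"]),
    Runbook("Scaling Events", "Medium",
            ["Node capacity exhausted", "manual scaling needed"]),
]


def find_runbooks(symptom: str) -> List[Runbook]:
    """Return runbooks with a trigger containing the symptom text (case-insensitive)."""
    needle = symptom.strip().lower()
    if not needle:
        return []  # an empty query should not match every runbook
    return [rb for rb in RUNBOOKS
            if any(needle in trigger.lower() for trigger in rb.triggers)]
```

With these entries, `find_runbooks("failover")` returns only the Database Failure runbook, and `find_runbooks("cert-manager")` returns only Certificate Rotation.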

Alerting & Escalation

[NEEDS TEAM INPUT: document the alerting pipeline: who gets paged, on what conditions, and via what tool (PagerDuty, Slack alerts, etc.). Include the escalation path:
- L1: On-call engineer
- L2: Platform team lead
- L3: Engineering manager + CTO]

Slack channels for alerts: [NEEDS TEAM INPUT: #alerts-prod, #alerts-staging, etc.]

Incident Tracking

[NEEDS TEAM INPUT: where are incidents tracked? (Linear, Jira, Confluence post-mortems, etc.)]

See Also

Common Issues — non-incident issues with self-service fixes
Change Management — safe process for infrastructure changes