Operate
This section is for people running the platform day-to-day — responding to incidents, making infrastructure changes, and maintaining reliability.
Runbooks
Use these during incidents. Each runbook follows a standard structure: Severity → Symptoms → Impact → Steps → Verification → Escalation.
| Runbook | Severity | Use When |
|---|---|---|
| Production Outage | Critical | Services are down or severely degraded |
| Database Failure | Critical | Cloud SQL is unreachable or corrupted |
| Security Incident | Critical | A breach or unauthorized access is suspected |
| Certificate Rotation | High | TLS cert expired or failing to renew |
| Scaling Events | Medium | Manual scaling is needed or the cluster is running short of capacity |
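
The standard runbook structure described above lends itself to an automated check. The following is a minimal sketch, not part of the platform tooling: it assumes each runbook is a Markdown file whose sections begin with headings named exactly after the six parts of the structure. The function name `check_runbook_structure` and the heading convention are hypothetical.

```python
import re

# Section order taken from the standard structure:
# Severity -> Symptoms -> Impact -> Steps -> Verification -> Escalation
REQUIRED_SECTIONS = [
    "Severity", "Symptoms", "Impact", "Steps", "Verification", "Escalation",
]

# Assumption: sections are introduced by Markdown headings such as "## Steps".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def check_runbook_structure(runbook_text: str) -> list[str]:
    """Return a list of structural problems; an empty list means the runbook conforms."""
    headings = [m.group(1) for m in HEADING_RE.finditer(runbook_text)]

    # Keep only the required section names, in the order they first appear.
    found = list(dict.fromkeys(h for h in headings if h in REQUIRED_SECTIONS))

    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in found]

    # The sections that are present should appear in the standard order.
    expected_order = [s for s in REQUIRED_SECTIONS if s in found]
    if found != expected_order:
        problems.append("sections out of order: " + " -> ".join(found))

    return problems
```

A check like this could run in CI against the runbook directory so that a new runbook missing its Verification or Escalation section is caught before it is needed in an incident.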
Infrastructure Reference
Detailed reference for the infrastructure components.
| Page | What's Documented |
|---|---|
| Terraform Modules | Every module: inputs, outputs, dependencies |
| Terragrunt Structure | Project layout, variable inheritance chain |
| Cluster Configuration | Node pools, autoscaling, workload identity |
| Backup & Recovery | Backup schedules, restore procedures, RTOs |
| DNS & TLS | Domain management, cert-manager, Cloudflare config |
Change Management
Read this before making any infrastructure change: