Operate

This section is for people who run the platform day to day: responding to incidents, making infrastructure changes, and maintaining reliability.

Runbooks

Use these during incidents. Each runbook follows a standard structure: Severity → Symptoms → Impact → Steps → Verification → Escalation.

Runbook	Severity	Use When
Production Outage	Critical	Services are down or severely degraded
Database Failure	Critical	Cloud SQL is unreachable or corrupted
Security Incident	Critical	Suspected breach, unauthorized access
Certificate Rotation	High	TLS cert expired or failing to renew
Scaling Events	Medium	Manual scaling needed, cluster capacity issues
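
The standard runbook structure (Severity → Symptoms → Impact → Steps → Verification → Escalation) lends itself to an automated check. The following is a minimal sketch in Python, not part of the platform's tooling: the function name `missing_or_misordered` and the idea of passing a runbook's headings as a list of strings are assumptions made for illustration.

```python
# Section order taken from the runbook structure described above.
RUNBOOK_SECTIONS = [
    "Severity",
    "Symptoms",
    "Impact",
    "Steps",
    "Verification",
    "Escalation",
]


def missing_or_misordered(headings):
    """Return a list of problems found in a runbook's section headings.

    `headings` is a hypothetical input: the section titles of one runbook,
    in the order they appear in the document.
    """
    # Report every standard section that does not appear at all.
    problems = [
        f"missing section: {name}"
        for name in RUNBOOK_SECTIONS
        if name not in headings
    ]

    # Compare the order of the standard sections that are present
    # against the order the standard structure prescribes.
    present = [h for h in headings if h in RUNBOOK_SECTIONS]
    expected = [name for name in RUNBOOK_SECTIONS if name in present]
    if present != expected:
        problems.append("sections out of order")

    return problems
```

A check like this could run when a runbook is added or edited, so that every runbook reads the same way under incident pressure. How headings are extracted from the runbook files depends on the documentation format and is left out here.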

Detailed reference for the infrastructure components.

Page	What's Documented
Terraform Modules	Every module: inputs, outputs, dependencies
Terragrunt Structure	Project layout, variable inheritance chain
Cluster Configuration	Node pools, autoscaling, workload identity
Backup & Recovery	Backup schedules, restore procedures, RTOs
DNS & TLS	Domain management, cert-manager, Cloudflare config

Read this before making any infrastructure change: