Operate

This section is for people who run the platform day to day: responding to incidents, making infrastructure changes, and maintaining reliability.

Runbooks

Use these during incidents. Each runbook follows a standard structure: Severity → Symptoms → Impact → Steps → Verification → Escalation.

Runbook               Severity   Use When
Production Outage     Critical   Services are down or severely degraded
Database Failure      Critical   Cloud SQL is unreachable or corrupted
Security Incident     Critical   Suspected breach, unauthorized access
Certificate Rotation  High       TLS cert expired or failing to renew
Scaling Events        Medium     Manual scaling needed, cluster capacity issues

View all runbooks →
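
The standard runbook structure lends itself to an automated check. The following is a minimal sketch, not part of the platform tooling: the function name `check_runbook_sections` is hypothetical, and it assumes a runbook's section headings have already been extracted into a list of strings that match the names above exactly.

```python
# Hypothetical helper: the section names come from the standard structure
# described above; how headings are extracted from a runbook is left to the caller.
REQUIRED_SECTIONS = [
    "Severity",
    "Symptoms",
    "Impact",
    "Steps",
    "Verification",
    "Escalation",
]


def check_runbook_sections(headings):
    """Return a list of problems; an empty list means the runbook conforms."""
    problems = []

    # Keep only the standard headings, dropping repeats but preserving order.
    seen = []
    for heading in headings:
        if heading in REQUIRED_SECTIONS and heading not in seen:
            seen.append(heading)

    # Report every required section that never appears.
    for name in REQUIRED_SECTIONS:
        if name not in seen:
            problems.append(f"missing section: {name}")

    # The sections that are present must follow the standard order.
    expected_order = [name for name in REQUIRED_SECTIONS if name in seen]
    if seen != expected_order:
        problems.append("sections out of order")

    return problems
```

A check like this could run in review or CI so that a new runbook is flagged before it is needed during an incident; extra headings outside the standard six are ignored by this sketch.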

Infrastructure Reference

Detailed reference for the infrastructure components.

Page                   What's Documented
Terraform Modules      Every module: inputs, outputs, dependencies
Terragrunt Structure   Project layout, variable inheritance chain
Cluster Configuration  Node pools, autoscaling, workload identity
Backup & Recovery      Backup schedules, restore procedures, RTOs
DNS & TLS              Domain management, cert-manager, Cloudflare config

Change Management

Read this before making any infrastructure change:

Change Management →