Change Management

Principles

All changes go through Git — no manual console edits that aren't tracked
Test in dev, then staging, then production — never skip an environment
One change at a time — don't bundle unrelated infrastructure changes
Plan before applying — always review terraform plan output before terraform apply
Communicate before production changes — notify the team before making production infrastructure changes

Change Risk Classification

Risk Level	Examples	Process
Low	Adding a new DNS record, creating a new secret, updating a Helm chart value	Dev → Staging → standard PR process
Medium	Modifying firewall rules, changing Redis config, adding a new service account	Dev → Staging → team review → Production
High	Modifying Cloud SQL, changing Istio configuration, updating VPC CIDR	Dev → Staging → scheduled maintenance window → Production
Critical	Deleting resources, changing KMS keys, modifying IAM on production	Requires senior platform engineer + explicit team approval

Standard Change Process (Terraform)

Step 1: Make the Change in Dev First

cd infrastructure-management/projects/orofi-dev

# Authenticate as dev Terraform SA
gcloud auth activate-service-account terraform-mnl@orofi-dev-cloud.iam.gserviceaccount.com \
  --key-file=/path/to/key.json

# Review what will change
terraform plan

# If the plan looks correct, apply
terraform apply
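One way to make sure the plan that was reviewed is exactly the plan that gets applied is to save it to a file first. The following is a minimal sketch, not part of the documented process: the function name `plan_review_apply` and the default file name `tfplan` are hypothetical.

```shell
#!/bin/sh
# Sketch only: plan_review_apply and the "tfplan" file name are illustrative,
# not project conventions.
set -eu

plan_review_apply() {
  plan_file="${1:-tfplan}"

  # Save the plan so that the reviewed plan and the applied plan are identical
  terraform plan -out="$plan_file"

  # Print the saved plan in human-readable form for review
  terraform show "$plan_file"

  # Require an explicit "yes" before anything is applied
  printf 'Apply this plan? (yes/no) '
  read -r answer
  if [ "$answer" = "yes" ]; then
    terraform apply "$plan_file"
  else
    echo "Aborted; nothing applied."
  fi
}
```

Calling `plan_review_apply` from the project directory, after authenticating as shown above, runs the three steps in order. Applying a saved plan file makes Terraform execute what was reviewed rather than computing a fresh plan at apply time.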

Step 2: Verify in Dev

After applying, verify the change works as expected:

- Check the GCP console for the new/modified resource
- Test that dependent services are still working
- Check for any errors in pod logs after a configuration change
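The pod-level checks could be scripted roughly as follows. This is a sketch under assumptions: `kubectl` is already pointed at the dev cluster, the affected service runs as a Deployment, and the function name and both arguments are hypothetical.

```shell
#!/bin/sh
# Sketch only: check_after_change and its namespace/deployment arguments
# are hypothetical.
set -eu

check_after_change() {
  namespace="$1"
  deployment="$2"

  # List pods in the namespace that are not in the Running phase
  # (completed Job pods also show up here)
  echo "Pods not in Running phase:"
  kubectl get pods -n "$namespace" --field-selector=status.phase!=Running

  # Count recent log lines that mention "error" (case-insensitive);
  # kubectl picks one pod of the Deployment for the logs
  errors="$(kubectl logs "deployment/$deployment" -n "$namespace" --since=10m \
    | grep -ci 'error' || true)"
  echo "Log lines mentioning 'error' in the last 10 minutes: $errors"
}
```

A non-zero count is only a prompt to read the logs; it does not prove the change caused the errors.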

Step 3: Apply to Staging

cd infrastructure-management/projects/orofi-staging

gcloud auth activate-service-account orofi-mnl-sa-terraform@orofi-stage-cloud.iam.gserviceaccount.com \
  --key-file=/path/to/key.json

terraform plan
# Review carefully — staging has real data, HA components, and PSC connections

terraform apply

Step 4: Apply to Production

Production Changes

Production changes require:

- Successful apply to both dev and staging
- Team notification (Slack channel: [NEEDS TEAM INPUT])
- If High or Critical risk: a scheduled maintenance window
- A rollback plan documented before applying

cd infrastructure-management/projects/orofi-prod

# [NEEDS TEAM INPUT: document production Terraform SA and authentication]

terraform plan
# Someone else should review this plan output
terraform apply

Kubernetes Configuration Changes (ArgoCD)

For changes to Kubernetes manifests (Helm values, new resources):

Create a branch in infrastructure-configuration
Make the change
Open a PR — get it reviewed
Merge to the appropriate branch (dev → staging → production)
ArgoCD auto-syncs after merge

This is covered in detail in GitOps Workflow.

Risky Operations: Extra Care

Deleting Resources

Before deleting any Terraform resource:

1. Verify it's no longer in use (no services depending on it)
2. Check whether deletion_protection = true is set (Cloud SQL has this)
3. Remove the resource from Terraform first, NOT the actual GCP resource: delete its block from the configuration and drop it from state with terraform state rm
4. Verify nothing breaks
5. Only then delete the actual GCP resource
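Step 3 above can be sketched as a small helper. The function name `detach_from_state` is hypothetical, and the resource address is whatever `terraform state list` reports for the resource being retired.

```shell
#!/bin/sh
# Sketch only: detach_from_state is a hypothetical helper and the resource
# address is supplied by the caller.
set -eu

detach_from_state() {
  address="$1"

  # Refuse to continue if the address is not tracked in state
  if ! terraform state list | grep -qFx -- "$address"; then
    echo "Not found in state: $address" >&2
    return 1
  fi

  # Stop tracking the resource; the real GCP resource is left untouched
  terraform state rm "$address"

  # With the resource block also removed from the configuration,
  # the plan should no longer mention this address
  terraform plan
}
```

If the resource block is still present in the configuration, the final plan will propose creating the resource again, which is a sign that the code change was not made.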

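For Cloud SQL: deletion protection must be explicitly disabled, and that change applied, before deletion:

deletion_protection = false  # Terraform-level guard on the instance resource

settings {
  deletion_protection_enabled = false  # API-level guard; must also be off before destroy
}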

Changing Secrets

When rotating a secret:

1. Add a new version in GCP Secret Manager (don't immediately disable the old one)
2. Force ESO resync
3. Restart the dependent service
4. Verify the service is working with the new secret
5. Then disable the old secret version
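The rotation steps might look like this on the command line. This is a sketch under several assumptions: ESO refers to External Secrets Operator, the dependent service runs as a Kubernetes Deployment, and the function names and every argument are hypothetical.

```shell
#!/bin/sh
# Sketch only: the function names and all arguments are hypothetical, and the
# dependent service is assumed to run as a Kubernetes Deployment.
set -eu

rotate_secret() {
  secret="$1"
  external_secret="$2"
  deployment="$3"
  namespace="$4"

  # 1. Add a new version; the secret value is read from stdin
  gcloud secrets versions add "$secret" --data-file=-

  # 2. Force an ESO resync by changing an annotation on the ExternalSecret
  kubectl annotate externalsecret "$external_secret" -n "$namespace" \
    force-sync="$(date +%s)" --overwrite

  # 3. Restart the dependent service and wait for the rollout to finish
  kubectl rollout restart "deployment/$deployment" -n "$namespace"
  kubectl rollout status "deployment/$deployment" -n "$namespace"
}

# 5. Run only after step 4 (the service is verified against the new secret)
disable_old_version() {
  secret="$1"
  old_version="$2"
  gcloud secrets versions disable "$old_version" --secret="$secret"
}
```

Keeping the disable step in a separate function mirrors the procedure: the old version stays enabled until someone has confirmed the service works with the new one, so a rollback is still possible.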

Changing Firewall Rules

Zero-trust firewall changes (dev) can lock out legitimate access. Always:

1. Verify your own IP will be in the allowed list before applying
2. Have an alternative access path (e.g., GCP console, Cloud Shell) ready in case you get locked out
3. Apply during low-traffic periods

Changing KMS Keys

KMS key changes (identity service) require application-level coordination:

- If adding a new key version: update the application to use the new version before disabling the old one
- A disabled key version can be re-enabled, but key material cannot be recovered once a version has been destroyed
- [NEEDS TEAM INPUT: document KMS key rotation procedure for microservice-identity]

Modifying Istio

Istio changes can affect all traffic routing. When modifying modules/helm/:

1. Apply to dev and test all services thoroughly
2. Apply to staging and run load tests
3. Have the certificate rotation runbook ready (Istio restarts invalidate TLS state temporarily)
4. Schedule during low-traffic window for production

Planned: Atlantis Automation

Per atlantis-integration-plan.md, Atlantis will automate terraform plan on PR open and terraform apply on PR merge. When implemented:

- Dev and staging changes will be automated via PR workflow
- Production will remain manual-only
- The Atlantis server will run on a GCE VM (e2-medium) in us-east1

Until Atlantis is running, all Terraform changes are applied manually following this guide.

Emergency Changes

In a production incident, speed sometimes matters more than process. After an emergency change:

Document what was changed and why in the incident report
Ensure the change is reflected in Terraform/Git within 24 hours
If a manual console change was made, use terraform import to bring it into Terraform state
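Bringing a manually created resource under Terraform could look like the sketch below. The helper name `import_manual_change` is hypothetical; the address must match a resource block that has already been written in the configuration, and the ID format depends on the resource type.

```shell
#!/bin/sh
# Sketch only: import_manual_change is a hypothetical helper; the resource
# address and ID are supplied by the caller.
set -eu

import_manual_change() {
  address="$1"      # address of a resource block that already exists in the configuration
  resource_id="$2"  # provider-specific ID of the manually created resource

  # Attach the existing GCP resource to the Terraform address
  terraform import "$address" "$resource_id"

  # Any remaining diff shows where the code still disagrees with the
  # manual change and needs to be updated
  terraform plan
}
```

The goal is an empty plan after the import: that confirms the code in Git describes what is actually running.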

Drift risk

Manual GCP console changes to Terraform-managed resources that are not also made in the code will be reverted on the next terraform apply. Always codify emergency fixes immediately after the incident.
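A quick way to check whether the configuration and the real infrastructure still agree is the -detailed-exitcode flag of terraform plan. The sketch below wraps it in a hypothetical `check_drift` helper; it reports any difference, whether it comes from a console edit or from an unapplied code change.

```shell
#!/bin/sh
# Sketch only: check_drift is a hypothetical helper.
set -eu

check_drift() {
  # With -detailed-exitcode, terraform plan exits with:
  #   0 = no changes, 1 = error, 2 = changes present
  status=0
  terraform plan -detailed-exitcode -input=false >/dev/null || status=$?

  case "$status" in
    0) echo "No differences between configuration and infrastructure." ;;
    2) echo "Differences found; run terraform plan to inspect them."
       return 2 ;;
    *) echo "terraform plan failed." >&2
       return 1 ;;
  esac
}
```

Running this after an incident, once the emergency fix has been codified, gives a simple confirmation that nothing is left to be reverted by the next apply.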