Change Management

Principles

  1. All changes go through Git — no manual console edits that aren't tracked
  2. Test in dev, then staging, then production — never skip an environment
  3. One change at a time — don't bundle unrelated infrastructure changes
  4. Plan before applying — always review terraform plan output before terraform apply
  5. Communicate before production changes — notify the team before making production infrastructure changes

Change Risk Classification

| Risk Level | Examples | Process |
|---|---|---|
| Low | Adding a new DNS record, creating a new secret, updating a Helm chart value | Dev → Staging → standard PR process |
| Medium | Modifying firewall rules, changing Redis config, adding a new service account | Dev → Staging → team review → Production |
| High | Modifying Cloud SQL, changing Istio configuration, updating VPC CIDR | Dev → Staging → scheduled maintenance window → Production |
| Critical | Deleting resources, changing KMS keys, modifying IAM on production | Requires senior platform engineer + explicit team approval |

Standard Change Process (Terraform)

Step 1: Make the Change in Dev First

cd infrastructure-management/projects/orofi-dev

# Authenticate as dev Terraform SA
gcloud auth activate-service-account terraform-mnl@orofi-dev-cloud.iam.gserviceaccount.com \
  --key-file=/path/to/key.json

# Review what will change
terraform plan

# If the plan looks correct, apply
terraform apply
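
Running terraform plan and terraform apply as two separate commands leaves a gap in which the configuration or the live infrastructure can change between review and apply. One way to close that gap, sketched below, is to save the plan to a file and apply exactly that file. The function name `plan_then_apply` and the default file name `tfplan` are illustrative and not part of the existing workflow.

```shell
# Sketch: save the plan, show it for review, and apply only that saved plan.
# `plan_then_apply` and the default file name `tfplan` are illustrative names.
plan_then_apply() {
  local plan_file="${1:-tfplan}"
  local answer

  # Write the plan to a file so that apply cannot pick up later changes.
  terraform plan -out="$plan_file" || return 1

  # Print the saved plan in human-readable form for review.
  terraform show "$plan_file" || return 1

  printf 'Apply this plan? (yes/no) '
  read -r answer
  if [ "$answer" != "yes" ]; then
    echo "Aborted; nothing applied."
    return 1
  fi

  # Applying a saved plan file runs without a further approval prompt.
  terraform apply "$plan_file"
}
```

Run it from the environment directory after authenticating, for example `plan_then_apply tfplan`. If the state changes after the plan was saved, Terraform rejects the stale plan file and a fresh plan has to be reviewed.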

Step 2: Verify in Dev

After applying, verify that the change works as expected:

  - Check the GCP console for the new or modified resource
  - Test that dependent services are still working
  - Check pod logs for errors after a configuration change
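
The pod log check can be made a little more repeatable with a small filter. This is a minimal sketch: the helper name `count_error_lines` and the search pattern are assumptions, and the pattern is a starting point, not a complete list of failure signatures.

```shell
# Sketch: count log lines that look like errors. Reads log text on stdin.
# The pattern below is an assumption; extend it for your services.
count_error_lines() {
  # grep -c prints the count; "|| true" keeps a zero count from failing the caller.
  grep -ciE 'error|fatal|panic' || true
}

# Example (deployment and namespace names are placeholders):
#   kubectl logs deployment/my-service -n my-namespace --since=10m | count_error_lines
```

A non-zero count is a prompt to read the logs, not proof of a regression, since some services log benign lines containing the word "error".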

Step 3: Apply to Staging

cd infrastructure-management/projects/orofi-staging

gcloud auth activate-service-account orofi-mnl-sa-terraform@orofi-stage-cloud.iam.gserviceaccount.com \
  --key-file=/path/to/key.json

terraform plan
# Review carefully — staging has real data, HA components, and PSC connections

terraform apply

Step 4: Apply to Production

Production Changes

Production changes require:

  - Successful apply to both dev and staging
  - Team notification (Slack channel: [NEEDS TEAM INPUT])
  - If High or Critical risk: a scheduled maintenance window
  - A rollback plan documented before applying

cd infrastructure-management/projects/orofi-prod

# [NEEDS TEAM INPUT: document production Terraform SA and authentication]

terraform plan
# Someone else should review this plan output
terraform apply

Kubernetes Configuration Changes (ArgoCD)

For changes to Kubernetes manifests (Helm values, new resources):

  1. Create a branch in infrastructure-configuration
  2. Make the change
  3. Open a PR — get it reviewed
  4. Merge to the appropriate branch (dev → staging → production)
  5. ArgoCD auto-syncs after merge

This is covered in detail in GitOps Workflow.

Risky Operations — Extra Care

Deleting Resources

Before deleting any Terraform resource:

  1. Verify it's no longer in use (no services depending on it)
  2. Check whether deletion_protection = true is set (Cloud SQL has this)
  3. Remove the resource from Terraform state first with terraform state rm, leaving the actual GCP resource in place
  4. Verify nothing breaks
  5. Only then delete the actual GCP resource

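For Cloud SQL, deletion protection must be explicitly disabled, and that change applied, before the instance can be deleted. The Terraform-level deletion_protection argument on the instance resource must also be false, otherwise Terraform refuses to destroy it:

settings {
  deletion_protection_enabled = false  # API-level guard; apply this before destroy
}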
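
Step 3 of the deletion procedure above (dropping a resource from Terraform state while leaving it running in GCP) can be sketched as follows. The function name `detach_from_state` is illustrative, and the resource address is whatever terraform state list reports for the resource in question.

```shell
# Sketch: stop managing a resource without deleting it in GCP.
# `detach_from_state` is an illustrative name; pass a real resource address.
detach_from_state() {
  local address="$1"
  if [ -z "$address" ]; then
    echo "usage: detach_from_state <resource-address>" >&2
    return 2
  fi

  # Confirm the exact address exists in state before changing anything.
  terraform state list | grep -Fxq "$address" || {
    echo "not found in state: $address" >&2
    return 1
  }

  # Removes the state entry only; the GCP resource keeps running.
  terraform state rm "$address"
}
```

The matching resource block also has to be removed from the configuration in the same change. Otherwise the next terraform plan proposes to create the resource again.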

Changing Secrets

When rotating a secret:

  1. Add a new version in GCP Secret Manager (don't immediately disable the old one)
  2. Force ESO resync
  3. Restart the dependent service
  4. Verify the service is working with the new secret
  5. Then disable the old secret version
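
The rotation order above can be sketched as a single function. Every argument is a placeholder, and several things are assumptions: that the target project is already configured in gcloud, that the secret is synced by an ExternalSecret in the same namespace as a Deployment, and that a completed rollout is an acceptable first signal for step 4. A rollout finishing does not prove the service works with the new secret, so an application-level check should still follow.

```shell
# Sketch of the rotation order above. All argument values are placeholders;
# the ExternalSecret and Deployment names depend on your manifests.
rotate_secret() {
  local secret="$1" value_file="$2" old_version="$3"
  local namespace="$4" external_secret="$5" deployment="$6"

  # 1. Add a new version; the old version stays enabled for now.
  gcloud secrets versions add "$secret" --data-file="$value_file" || return 1

  # 2. Ask External Secrets Operator to re-fetch by changing an annotation.
  kubectl annotate externalsecret "$external_secret" -n "$namespace" \
    force-sync="$(date +%s)" --overwrite || return 1

  # 3. Restart the workload so it reads the refreshed Kubernetes Secret.
  kubectl rollout restart "deployment/$deployment" -n "$namespace" || return 1

  # 4. Wait for the rollout; stop here if the new pods do not become ready.
  kubectl rollout status "deployment/$deployment" -n "$namespace" \
    --timeout=300s || return 1

  # 5. Only now disable the previous version.
  gcloud secrets versions disable "$old_version" --secret="$secret"
}
```

Because each step returns early on failure, the old version is never disabled unless the earlier steps succeeded, which keeps the rollback path (restart against the old version) available.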

Changing Firewall Rules

Zero-trust firewall changes (dev) can lock out legitimate access. Always:

  1. Verify your own IP will be in the allowed list before applying
  2. Have an alternative access path (e.g., GCP console, Cloud Shell) ready in case you get locked out
  3. Apply during low-traffic periods
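
The first check above (confirming your own IP is covered by the planned allow list) is easy to get wrong by eye when the list contains CIDR ranges. The sketch below does the comparison with shell arithmetic. It handles IPv4 only and does no input validation, and the function names are illustrative.

```shell
# Sketch: check whether an IPv4 address falls inside any allowed CIDR range.
# Pure shell arithmetic; IPv4 only, no input validation.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  # A range given without "/n" is treated as a single address (/32).
  [ "$bits" = "$2" ] && bits=32
  local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  # Inside the range when the masked address equals the masked network.
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

ip_allowed() {
  local ip="$1" cidr
  shift
  for cidr in "$@"; do
    ip_in_cidr "$ip" "$cidr" && return 0
  done
  return 1
}
```

For example, `ip_allowed 203.0.113.7 10.0.0.0/8 203.0.113.0/24` succeeds, while the same call with 203.0.114.7 fails. Pass the source ranges from the rule as it will look after the change, not as it looks today.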

Changing KMS Keys

KMS key changes (identity service) require application-level coordination:

  - If adding a new key version: update the application to use the new version before disabling the old one
  - A disabled key version can be re-enabled, but a destroyed key version cannot be recovered once destruction has completed
  - [NEEDS TEAM INPUT: document KMS key rotation procedure for microservice-identity]

Modifying Istio

Istio changes can affect all traffic routing. When modifying modules/helm/:

  1. Apply to dev and test all services thoroughly
  2. Apply to staging and run load tests
  3. Have the certificate rotation runbook ready (Istio restarts invalidate TLS state temporarily)
  4. Schedule during low-traffic window for production

Planned: Atlantis Automation

Per atlantis-integration-plan.md, Atlantis will automate terraform plan on PR open and terraform apply on PR merge. When implemented:

  - Dev and staging changes will be automated via PR workflow
  - Production will remain manual-only
  - The Atlantis server will run on a GCE VM (e2-medium) in us-east1

Until Atlantis is running, all Terraform changes are applied manually following this guide.

Emergency Changes

In a production incident, speed sometimes matters more than process. After an emergency change:

  1. Document what was changed and why in the incident report
  2. Ensure the change is reflected in Terraform/Git within 24 hours
  3. If a manual console change was made, use terraform import to bring it into Terraform state
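
The import step can be sketched as below. It assumes a matching resource block has already been written in the configuration, because terraform import only records the resource in state and does not generate configuration. The function name `adopt_manual_change` is illustrative, and the import ID format differs per resource type, so check the provider documentation for the resource being imported.

```shell
# Sketch: bring a manually created resource under Terraform management.
# Assumes a resource block for <address> already exists in the .tf files.
adopt_manual_change() {
  local address="$1" resource_id="$2"
  if [ -z "$address" ] || [ -z "$resource_id" ]; then
    echo "usage: adopt_manual_change <resource-address> <import-id>" >&2
    return 2
  fi

  # Record the existing GCP resource in state under the given address.
  terraform import "$address" "$resource_id" || return 1

  # Any remaining diff means the configuration does not yet describe
  # what was created by hand; adjust the .tf files until the plan is empty.
  terraform plan
}
```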

Drift risk

Manual GCP console changes to resources that Terraform manages will be reverted on the next terraform apply, and resources created by hand stay invisible to Terraform until they are imported. Always codify emergency fixes immediately after the incident.
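
To find out whether an environment has drifted before the next apply reverts something, the plan exit code can be checked without applying anything. This is a minimal sketch with an illustrative function name. Note that a non-empty plan means either drift or configuration changes that have not been applied yet, so the plan output still needs to be read.

```shell
# Sketch: report whether the live infrastructure matches the configuration.
# Run from an environment directory after authenticating.
check_drift() {
  local status=0
  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1 || status=$?

  case "$status" in
    0) echo "no changes pending" ;;
    2) echo "changes pending: review terraform plan before applying"; return 2 ;;
    *) echo "terraform plan failed" >&2; return 1 ;;
  esac
}
```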

See Also