Change Management¶
Principles¶
- All changes go through Git: no manual console edits that aren't tracked
- Test in dev, then staging, then production: never skip an environment
- One change at a time: don't bundle unrelated infrastructure changes
- Plan before applying: always review `terraform plan` output before `terraform apply`
- Communicate before production changes: notify the team before making production infrastructure changes
Change Risk Classification¶
| Risk Level | Examples | Process |
|---|---|---|
| Low | Adding a new DNS record, creating a new secret, updating a Helm chart value | Dev → Staging → standard PR process |
| Medium | Modifying firewall rules, changing Redis config, adding a new service account | Dev → Staging → team review → Production |
| High | Modifying Cloud SQL, changing Istio configuration, updating VPC CIDR | Dev → Staging → scheduled maintenance window → Production |
| Critical | Deleting resources, changing KMS keys, modifying IAM on production | Requires senior platform engineer + explicit team approval |
Standard Change Process (Terraform)¶
Step 1: Make the Change in Dev First¶
```shell
cd infrastructure-management/projects/orofi-dev

# Authenticate as dev Terraform SA
gcloud auth activate-service-account terraform-mnl@orofi-dev-cloud.iam.gserviceaccount.com \
  --key-file=/path/to/key.json

# Review what will change
terraform plan

# If the plan looks correct, apply
terraform apply
```
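One way to make sure the apply matches what was reviewed is to save the plan to a file and apply that exact file. The sketch below is illustrative only: the helper name `plan_then_apply` and the default plan file name `tfplan` are assumptions, not part of this repository. It relies on `terraform plan -out`, `terraform show`, and applying a saved plan file, which are standard Terraform CLI features.

```shell
# Hypothetical helper: save the plan, show it, and apply only after confirmation.
# Run it from the environment directory (e.g. projects/orofi-dev).
plan_then_apply() {
  local plan_file="${1:-tfplan}"   # plan file name is an assumption
  local answer

  # Write the plan to a file so the apply uses exactly what was reviewed
  terraform plan -out="$plan_file" || return 1

  # Print the saved plan in human-readable form for review
  terraform show "$plan_file" || return 1

  # Ask for explicit confirmation before touching infrastructure
  printf 'Apply this plan? (yes/no) '
  read -r answer
  if [ "$answer" != "yes" ]; then
    echo "Aborted; nothing applied."
    return 1
  fi

  terraform apply "$plan_file"
}
```

Applying a saved plan file does not prompt again, so the confirmation step here is the only gate between review and apply.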
Step 2: Verify in Dev¶
After applying, verify the change works as expected:
- Check the GCP console for the new/modified resource
- Test that dependent services are still working
- Check for any errors in pod logs after a configuration change
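The pod log check can be scripted. This is a minimal sketch, assuming `kubectl` is already pointed at the dev cluster; the helper name, the namespace and deployment arguments, and the plain `error` pattern are all placeholders rather than names from this project.

```shell
# Hypothetical helper: count recent error lines in a deployment's logs.
# Namespace, deployment name, and the "error" pattern are assumptions.
count_recent_errors() {
  local namespace="$1" deployment="$2" window="${3:-10m}"

  # --since limits output to log lines newer than the given duration.
  # grep -c prints the number of matching lines (0 if there are none).
  kubectl logs "deployment/$deployment" -n "$namespace" --since="$window" \
    | grep -ci "error"
}
```

Note that `kubectl logs deployment/<name>` reads from a single pod of the deployment, so a per-pod loop or a label selector may be more appropriate for services with several replicas.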
Step 3: Apply to Staging¶
```shell
cd infrastructure-management/projects/orofi-staging

# Authenticate as staging Terraform SA
gcloud auth activate-service-account orofi-mnl-sa-terraform@orofi-stage-cloud.iam.gserviceaccount.com \
  --key-file=/path/to/key.json

terraform plan

# Review carefully: staging has real data, HA components, and PSC connections
terraform apply
```
Step 4: Apply to Production¶
Production Changes
Production changes require:
- Successful apply to both dev and staging
- Team notification (Slack channel: [NEEDS TEAM INPUT])
- If High or Critical risk: a scheduled maintenance window
- A rollback plan documented before applying
```shell
cd infrastructure-management/projects/orofi-prod

# [NEEDS TEAM INPUT: document production Terraform SA and authentication]

terraform plan

# Someone else should review this plan output
terraform apply
```
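To make second-person review of the production plan practical, the plan can be written to a text file that is shared before anything is applied. The sketch below is one possible approach, not an established team procedure; the helper name and both file names are assumptions, and production authentication is deliberately left out because it is not yet documented.

```shell
# Hypothetical helper: write a saved plan to a text file a second person can review.
export_plan_for_review() {
  local plan_file="${1:-tfplan}" review_file="${2:-plan-review.txt}"

  # Save the plan so the later apply uses exactly what was reviewed
  terraform plan -out="$plan_file" >/dev/null || return 1

  # -no-color strips terminal escape codes so the file reads cleanly in a PR or chat
  terraform show -no-color "$plan_file" > "$review_file" || return 1

  echo "Plan written to $review_file; apply with: terraform apply $plan_file"
}
```

Once the reviewer has signed off on the text file, applying the saved plan file guarantees that no new changes slipped in between review and apply.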
Kubernetes Configuration Changes (ArgoCD)¶
For changes to Kubernetes manifests (Helm values, new resources):
- Create a branch in `infrastructure-configuration`
- Make the change
- Open a PR and get it reviewed
- Merge to the appropriate branch (dev → staging → production)
- ArgoCD auto-syncs after merge
This is covered in detail in GitOps Workflow.
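For orientation, the branch-and-push part of that flow might look like the following. This is a sketch only: the helper name, branch name, file path, and commit message are placeholders supplied by the caller, and the remote is assumed to be called `origin`.

```shell
# Hypothetical helper: create a branch, commit one change, and push it for review.
push_config_change() {
  local branch="$1" file="$2" message="$3"

  git checkout -b "$branch" || return 1
  git add "$file" || return 1
  git commit -m "$message" || return 1

  # -u sets the upstream so later pushes on this branch need no arguments
  git push -u origin "$branch"
}
```

After the PR is merged, `argocd app get <app-name>` (the application name is an assumption here) reports whether ArgoCD has picked up and synced the change.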
Risky Operations — Extra Care¶
Deleting Resources¶
Before deleting any Terraform resource:
1. Verify it's no longer in use (no services depending on it)
2. Check if `deletion_protection = true` is set (Cloud SQL has this)
3. Remove the resource from Terraform state first, but NOT the actual GCP resource (using `terraform state rm`)
4. Verify nothing breaks
5. Only then delete the actual GCP resource
For Cloud SQL, deletion protection must be explicitly disabled before the instance can be deleted.
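The state-removal step can be sketched as follows. The helper name and the example resource address are hypothetical; `terraform state show` and `terraform state rm` are standard Terraform CLI subcommands.

```shell
# Hypothetical helper: stop managing a resource in Terraform without deleting it in GCP.
# The address (e.g. google_sql_database_instance.example) is a placeholder.
detach_from_state() {
  local address="$1"

  # Show the deletion_protection attribute recorded in state, if there is one
  terraform state show "$address" | grep "deletion_protection" \
    || echo "deletion_protection not set on $address"

  # Remove the resource from state only; the real GCP resource is left untouched
  terraform state rm "$address"
}
```

The matching resource block has to be removed from the configuration in the same change, otherwise the next plan will propose creating the resource again. If the instance is instead destroyed through Terraform, `deletion_protection` needs to be set to `false` and applied first, because the Google provider refuses to destroy an instance while that attribute is true.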
Changing Secrets¶
When rotating a secret:
1. Add a new version in GCP Secret Manager (don't immediately disable the old one)
2. Force an ESO resync
3. Restart the dependent service
4. Verify the service is working with the new secret
5. Then disable the old secret version
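The rotation steps above can be sketched as two helpers, split so that the old version is only disabled after manual verification. All names passed in (secret, ExternalSecret, deployment, namespace) are placeholders, and the `force-sync` annotation is assumed to be the way ESO is nudged to reconcile in this cluster.

```shell
# Hypothetical helper covering steps 1-3 of the rotation.
rotate_secret() {
  local secret="$1" value_file="$2" external_secret="$3" deployment="$4" namespace="$5"

  # 1. Add a new version; existing versions stay enabled
  gcloud secrets versions add "$secret" --data-file="$value_file" || return 1

  # 2. Changing an annotation on the ExternalSecret prompts ESO to reconcile it
  kubectl annotate externalsecret "$external_secret" -n "$namespace" \
    force-sync="$(date +%s)" --overwrite || return 1

  # 3. Restart the dependent service and wait for the rollout to finish
  kubectl rollout restart "deployment/$deployment" -n "$namespace" || return 1
  kubectl rollout status "deployment/$deployment" -n "$namespace"
}

# Step 5, run only after the service has been verified: disable the old version
disable_old_version() {
  local secret="$1" old_version="$2"
  gcloud secrets versions disable "$old_version" --secret="$secret"
}
```

Keeping step 5 in a separate function leaves room for the manual verification in step 4, and a disabled version can be re-enabled if the new secret turns out to be wrong.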
Changing Firewall Rules¶
Zero-trust firewall changes (dev) can lock out legitimate access. Always:
1. Verify your own IP will be in the allowed list before applying
2. Have an alternative access path (e.g., GCP console, Cloud Shell) ready in case you get locked out
3. Apply during low-traffic periods
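The first check is easy to get wrong by eye when the allowed list contains CIDR ranges. The following sketch tests whether an IPv4 address falls inside any of a list of ranges; the function names are hypothetical and it handles IPv4 only.

```shell
# Convert a dotted IPv4 address to a 32-bit integer
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeeds if the IPv4 address falls inside the CIDR range (a bare IP counts as /32)
ip_in_cidr() {
  local ip="$1" cidr="$2" net bits mask
  net="${cidr%/*}"
  case "$cidr" in
    */*) bits="${cidr#*/}" ;;
    *)   bits=32 ;;
  esac
  # /0 matches everything; otherwise keep only the top $bits bits
  if [ "$bits" -eq 0 ]; then
    mask=0
  else
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  fi
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Succeeds if the IP matches any range read from stdin (one range per line)
ip_allowed() {
  local ip="$1" range
  while read -r range; do
    [ -n "$range" ] && ip_in_cidr "$ip" "$range" && return 0
  done
  return 1
}
```

The ranges could be taken from the planned rule in the Terraform configuration, or from an existing rule with something like `gcloud compute firewall-rules describe <rule> --format='value(sourceRanges)' | tr ';,' '\n\n' | ip_allowed <your-ip>`, where the rule name and IP are placeholders and the `tr` step is there because the list separator in the output is an assumption.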
Changing KMS Keys¶
KMS key changes (identity service) require application-level coordination:
- When adding a new key version, update the application to use the new version before disabling the old one
- A disabled key version can be re-enabled, but key material cannot be recovered once a version has been destroyed
- [NEEDS TEAM INPUT: document KMS key rotation procedure for microservice-identity]
Modifying Istio¶
Istio changes can affect all traffic routing. When modifying `modules/helm/`:
1. Apply to dev and test all services thoroughly
2. Apply to staging and run load tests
3. Have the certificate rotation runbook ready (Istio restarts invalidate TLS state temporarily)
4. Schedule during low-traffic window for production
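As part of the testing in steps 1 and 2, a quick mesh health check after each apply can catch misconfiguration early. This sketch assumes `istioctl` is installed and pointed at the target cluster, which this guide does not state; the helper name is hypothetical.

```shell
# Hypothetical helper: basic mesh checks to run after an Istio change.
istio_post_change_checks() {
  # Analyze configuration across all namespaces for known problems
  istioctl analyze --all-namespaces || return 1

  # Show whether each sidecar proxy is in sync with the control plane
  istioctl proxy-status
}
```

These checks complement, rather than replace, testing the services themselves and the staging load tests.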
Planned: Atlantis Automation¶
Per `atlantis-integration-plan.md`, Atlantis will automate `terraform plan` on PR open and `terraform apply` on PR merge. When implemented:
- Dev and staging changes will be automated via PR workflow
- Production will remain manual-only
- The Atlantis server will run on a GCE VM (e2-medium) in us-east1
Until Atlantis is running, all Terraform changes are applied manually following this guide.
Emergency Changes¶
In a production incident, speed sometimes matters more than process. After an emergency change:
- Document what was changed and why in the incident report
- Ensure the change is reflected in Terraform/Git within 24 hours
- If a manual console change was made, use `terraform import` to bring it into Terraform state
Drift risk
Manual GCP console changes not backed by Terraform will be reverted on the next `terraform apply`. Always codify emergency fixes immediately after the incident.
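A quick way to see whether an emergency change has left the environment out of sync is to run a plan with `-detailed-exitcode`, which distinguishes "no changes" from "changes present". The helper below is a sketch with a hypothetical name; it is meant to be run from the affected environment directory after authenticating.

```shell
# Hypothetical helper: report whether live infrastructure differs from the configuration.
check_drift() {
  local rc=0

  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
  terraform plan -detailed-exitcode >/dev/null || rc=$?

  case "$rc" in
    0) echo "in sync" ;;
    2) echo "drift detected"; return 2 ;;
    *) echo "plan failed"; return 1 ;;
  esac
}
```

If drift is reported for a resource that was created by hand, `terraform import <address> <id>` (both values depend on the resource type) brings it into state once a matching resource block has been written; re-running the check should then report `in sync`.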