Backup & Recovery Reference¶
Backup Summary¶
| Data Store | Backup Type | Frequency | Retention | Environment |
|---|---|---|---|---|
| Cloud SQL MySQL | Automated | Daily | 30 backups | Staging |
| Cloud SQL MySQL | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Production |
| Cloud SQL MySQL | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Dev |
| Cloud SQL MySQL (PITR) | Binary log | Continuous | 7 days | Staging |
| Cloud Memorystore Redis | RDB snapshot | Every 6 hours | [NEEDS TEAM INPUT] | Staging |
| Cloud Memorystore Redis | RDB snapshot | Every 12 hours | [NEEDS TEAM INPUT] | Dev |
| MongoDB | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Staging |
| Kafka | Topics replicated | 3 replicas | [NEEDS TEAM INPUT] | Staging |
| GCS Buckets | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | All |
Cloud SQL Backups¶
Configuration (Staging)¶
Defined in infrastructure-management/modules/datastore/main.tf:
```hcl
backup_configuration {
  enabled            = true
  binary_log_enabled = true    # enables PITR
  start_time         = "02:00" # [NEEDS TEAM INPUT: confirm backup window]
  backup_retention_settings {
    retained_backups = 30
    retention_unit   = "COUNT"
  }
}
```
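To confirm the live instance matches this Terraform definition, the backup settings can be read back with gcloud. This is a sketch: the `command -v` guard simply skips the call on machines without gcloud installed.

```shell
#!/bin/sh
# Read back the live backup configuration for comparison with Terraform.
INSTANCE="orofi-stage-cloud-stage-oro-mysql-instance"
PROJECT="orofi-stage-cloud"

if command -v gcloud >/dev/null 2>&1; then
  gcloud sql instances describe "$INSTANCE" \
    --project="$PROJECT" \
    --format="yaml(settings.backupConfiguration)"
fi
```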
Viewing Available Backups¶
```shell
gcloud sql backups list \
  --instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="table(id,status,startTime,endTime,type)"
```
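Before a restore you usually want the ID of the most recent successful backup. A sketch that captures it into a variable (the `--sort-by`/`--filter` combination is an assumption about how you want to pick the backup; verify against your output before scripting a restore around it):

```shell
#!/bin/sh
INSTANCE="orofi-stage-cloud-stage-oro-mysql-instance"
PROJECT="orofi-stage-cloud"

if command -v gcloud >/dev/null 2>&1; then
  # Newest successful backup first; take only its ID.
  BACKUP_ID=$(gcloud sql backups list \
    --instance="$INSTANCE" \
    --project="$PROJECT" \
    --filter="status=SUCCESSFUL" \
    --sort-by="~startTime" \
    --limit=1 \
    --format="value(id)")
  echo "Latest successful backup: $BACKUP_ID"
fi
```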
Performing a Restore¶
See Database Failure Runbook for the full procedure.
Quick reference:
```shell
# Restore to the same instance (overwrites all current data)
gcloud sql backups restore {backup-id} \
  --restore-instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --backup-instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud

# PITR — restore to a point in time (clones to a new instance)
gcloud sql instances clone \
  orofi-stage-cloud-stage-oro-mysql-instance \
  orofi-stage-cloud-stage-oro-mysql-restore \
  --point-in-time "2026-04-02T09:00:00.000Z" \
  --project=orofi-stage-cloud
```
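`--point-in-time` takes an RFC 3339 UTC timestamp. Rather than typing it by hand during an incident, a small sketch that computes "N minutes ago" and feeds it to the clone (GNU `date` syntax; the clone itself only runs where gcloud is available):

```shell
#!/bin/sh
MINUTES_AGO=15
# GNU date; on macOS/BSD use: date -u -v-15M +"%Y-%m-%dT%H:%M:%S.000Z"
POINT_IN_TIME=$(date -u -d "$MINUTES_AGO minutes ago" +"%Y-%m-%dT%H:%M:%S.000Z")
echo "Cloning to state as of: $POINT_IN_TIME"

if command -v gcloud >/dev/null 2>&1; then
  gcloud sql instances clone \
    orofi-stage-cloud-stage-oro-mysql-instance \
    orofi-stage-cloud-stage-oro-mysql-restore \
    --point-in-time "$POINT_IN_TIME" \
    --project=orofi-stage-cloud
fi
```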
Recovery Time Objective (RTO)¶
| Restore Type | Estimated RTO |
|---|---|
| Automated backup restore (same instance) | [NEEDS TEAM INPUT: typically 15–60 min depending on size] |
| PITR clone | [NEEDS TEAM INPUT: typically 30–90 min] |
| Failover (REGIONAL instance) | 1–2 minutes (automatic) |
Recovery Point Objective (RPO)¶
| Restore Type | RPO |
|---|---|
| Daily backup restore | Up to 24 hours data loss |
| PITR | Up to ~5 minutes data loss (binary log flush interval) |
| Automatic failover (REGIONAL) | Near-zero data loss (synchronous replication) |
Redis Backups¶
Cloud Memorystore Redis uses RDB (Redis Database Backup) snapshots.
Staging Configuration¶
- Persistence mode: RDB
- Snapshot period: Every 6 hours
Dev Configuration¶
- Persistence mode: RDB
- Snapshot period: Every 12 hours
Redis Recovery¶
Redis data can be lost if the instance fails between snapshots. Redis is a cache — the application must tolerate cache misses and rebuild state from the source of truth (MySQL/MongoDB).
Redis is not a system of record
Redis holds cached data only. If Redis is lost entirely, services fall back to database queries. There is no need to restore Redis from backup — a fresh empty instance is fine.
If a Redis instance must be recreated (for example by re-applying the Terraform datastore module), no data restore is needed. Restart all services afterwards so they reconnect:
```shell
for ns in microservice-communication microservice-identity microservice-monolith microservice-analytics \
  api-gateway-public api-gateway-account api-gateway-oro api-gateway-admin-dashboard; do
  kubectl rollout restart deployment -n $ns
done
```
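After the restarts it is worth confirming each namespace actually converged. A sketch using `kubectl rollout status` over the same namespace list (skipped entirely where kubectl is unavailable):

```shell
#!/bin/sh
NAMESPACES="microservice-communication microservice-identity microservice-monolith microservice-analytics \
api-gateway-public api-gateway-account api-gateway-oro api-gateway-admin-dashboard"

if command -v kubectl >/dev/null 2>&1; then
  for ns in $NAMESPACES; do
    # Block until every deployment in the namespace finishes rolling out, or time out.
    for deploy in $(kubectl get deployment -n "$ns" -o name); do
      kubectl rollout status "$deploy" -n "$ns" --timeout=120s
    done
  done
fi
```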
MongoDB Backups¶
[NEEDS TEAM INPUT: the Percona PSMDB Operator has built-in backup capabilities. Document:
- Whether backups are configured (PBM — Percona Backup for MongoDB)
- Backup storage location (GCS bucket?)
- Backup schedule
- How to trigger a restore]
Kafka Data Durability¶
Kafka is not backed up in the traditional sense. Instead:
- Data is replicated across 3 brokers in staging (replication factor = 3, min ISR = 2)
- Topic retention is configured per-topic [NEEDS TEAM INPUT: what is the default retention time/bytes for each topic?]
If Kafka data is lost (e.g., all brokers fail and PVCs are deleted), messages not yet consumed are gone. Services must be designed to handle this scenario (idempotent consumers, replayable event sources).
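To verify that replication is actually in effect, topic descriptions can be pulled from inside a broker pod. The pod and namespace names below are placeholders, not the real deployment names — substitute your own before use:

```shell
#!/bin/sh
# Hypothetical names — replace with the actual broker pod and namespace.
KAFKA_NS="kafka"
BROKER_POD="kafka-0"

if command -v kubectl >/dev/null 2>&1; then
  # Show replication factor and in-sync replicas for every topic.
  kubectl exec -n "$KAFKA_NS" "$BROKER_POD" -- \
    bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe \
    | grep -E 'ReplicationFactor|Isr:'
fi
```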
GCS Bucket Backups¶
GCS buckets are used for import/export by microservices. Bucket contents are [NEEDS TEAM INPUT: versioned? Cross-region replicated? Lifecycle rules configured?].
Terraform State Backups¶
Terraform state files in GCS buckets are critical — losing them means Terraform can no longer manage existing resources. Protect them:
| Bucket | Versioning | Cross-region |
|---|---|---|
| oro-dev-infra | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |
| oro-infra-stag | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |
| oro-infra-production | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |
Recommendation: Enable GCS object versioning on all state buckets.
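This recommendation can be applied with `gsutil` in one pass over the three state buckets. A sketch — run only with credentials for the owning projects, and note the guard skips execution where gsutil is not installed:

```shell
#!/bin/sh
BUCKETS="oro-dev-infra oro-infra-stag oro-infra-production"

if command -v gsutil >/dev/null 2>&1; then
  for b in $BUCKETS; do
    gsutil versioning set on "gs://$b"
    gsutil versioning get "gs://$b"   # should report "Enabled"
  done
fi
```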
Tested Recovery Procedures¶
| Scenario | Last Tested | Result | Notes |
|---|---|---|---|
| Cloud SQL automated backup restore | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| Cloud SQL PITR clone | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| Redis instance recreation | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| MongoDB restore | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
Test your backups
Backups are only useful if they can be restored. Run restore drills in dev at least quarterly.