Backup & Recovery Reference

Backup Summary

| Data Store | Backup Type | Frequency | Retention | Environment |
|---|---|---|---|---|
| Cloud SQL MySQL | Automated | Daily | 30 backups | Staging |
| Cloud SQL MySQL | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Production |
| Cloud SQL MySQL | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Dev |
| Cloud SQL MySQL (PITR) | Binary log | Continuous | 7 days | Staging |
| Cloud Memorystore Redis | RDB snapshot | Every 6 hours | [NEEDS TEAM INPUT] | Staging |
| Cloud Memorystore Redis | RDB snapshot | Every 12 hours | [NEEDS TEAM INPUT] | Dev |
| MongoDB | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Staging |
| Kafka | Topics replicated | 3 replicas | [NEEDS TEAM INPUT] | Staging |
| GCS Buckets | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | All |

Cloud SQL Backups

Configuration (Staging)

Defined in infrastructure-management/modules/datastore/main.tf:

backup_configuration {
  enabled                        = true
  binary_log_enabled             = true   # enables PITR
  start_time                     = "02:00"  # [NEEDS TEAM INPUT: confirm backup window]
  backup_retention_settings {
    retained_backups              = 30
    retention_unit                = "COUNT"
  }
}
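To confirm the instance actually carries these settings after an apply, the live backup configuration can be read back; a sketch using the staging instance and project names from this page:

```shell
# Print the live backup configuration for the staging instance
gcloud sql instances describe orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="yaml(settings.backupConfiguration)"
```

The output should show `enabled: true`, `binaryLogEnabled: true`, and the configured start time and retention count.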

Viewing Available Backups

gcloud sql backups list \
  --instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="table(id,status,startTime,endTime,type)"
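A single backup can also be inspected in detail (window, size, status) before restoring from it; a sketch, using the same `{backup-id}` placeholder convention as the restore commands on this page:

```shell
# Show full details for one backup
gcloud sql backups describe {backup-id} \
  --instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud
```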

Performing a Restore

See Database Failure Runbook for the full procedure.

Quick reference:

# Restore to same instance (overwrites all current data)
gcloud sql backups restore {backup-id} \
  --restore-instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --backup-instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud

# PITR — restore to a point in time (clones to a new instance)
gcloud sql instances clone \
  orofi-stage-cloud-stage-oro-mysql-instance \
  orofi-stage-cloud-stage-oro-mysql-restore \
  --point-in-time "2026-04-02T09:00:00.000Z" \
  --project=orofi-stage-cloud
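The `--point-in-time` value must be a UTC timestamp in RFC 3339 format and must fall within the 7-day binary log retention window. A sketch for generating a valid value a few minutes in the past (GNU `date`, as found on most Linux hosts):

```shell
# UTC timestamp 15 minutes ago, in the format --point-in-time expects
PIT=$(date -u -d '15 minutes ago' '+%Y-%m-%dT%H:%M:%S.000Z')
echo "$PIT"
```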

Recovery Time Objective (RTO)

| Restore Type | Estimated RTO |
|---|---|
| Automated backup restore (same instance) | [NEEDS TEAM INPUT: typically 15–60 min depending on size] |
| PITR clone | [NEEDS TEAM INPUT: typically 30–90 min] |
| Failover (REGIONAL instance) | 1–2 minutes (automatic) |

Recovery Point Objective (RPO)

| Restore Type | RPO |
|---|---|
| Daily backup restore | Up to 24 hours of data loss |
| PITR | Up to ~5 minutes of data loss (binary log flush interval) |
| Automatic failover (REGIONAL) | Near-zero data loss (synchronous replication) |

Redis Backups

Cloud Memorystore Redis uses RDB (Redis Database Backup) snapshots.

Staging Configuration

  • Persistence mode: RDB
  • Snapshot period: Every 6 hours

Dev Configuration

  • Persistence mode: RDB
  • Snapshot period: Every 12 hours

Redis Recovery

Redis data can be lost if the instance fails between snapshots. Redis is a cache — the application must tolerate cache misses and rebuild state from the source of truth (MySQL/MongoDB).

Redis is not a system of record

Redis holds cached data only. If Redis is lost entirely, services fall back to database queries. There is no need to restore Redis from backup — a fresh empty instance is fine.

If a Redis instance must be recreated:

# Terraform will recreate it
terraform apply -target=module.redis -var="env=stage"

Then restart all services so they reconnect:

for ns in microservice-communication microservice-identity microservice-monolith microservice-analytics \
  api-gateway-public api-gateway-account api-gateway-oro api-gateway-admin-dashboard; do
  kubectl rollout restart deployment -n $ns
done
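To confirm the restarts actually completed, each deployment's rollout status can be polled; a sketch iterating over the same namespace list:

```shell
# Wait (up to 2 minutes each) for every restarted deployment to become ready
for ns in microservice-communication microservice-identity microservice-monolith microservice-analytics \
  api-gateway-public api-gateway-account api-gateway-oro api-gateway-admin-dashboard; do
  for deploy in $(kubectl get deployment -n "$ns" -o name); do
    kubectl rollout status "$deploy" -n "$ns" --timeout=120s
  done
done
```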

MongoDB Backups

[NEEDS TEAM INPUT: the Percona PSMDB Operator has built-in backup capabilities. Document:

  • Whether backups are configured (PBM, Percona Backup for MongoDB)
  • Backup storage location (GCS bucket?)
  • Backup schedule
  • How to trigger a restore]

Kafka Data Durability

Kafka is not backed up in the traditional sense. Instead:

  • Data is replicated across 3 brokers in staging (replication factor = 3, min ISR = 2)
  • Topic retention is configured per-topic [NEEDS TEAM INPUT: what is the default retention time/bytes for each topic?]

If Kafka data is lost (e.g., all brokers fail and PVCs are deleted), messages not yet consumed are gone. Services must be designed to handle this scenario (idempotent consumers, replayable event sources).
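Replication and retention can be verified against a live broker using the stock Kafka CLI tools; a sketch, run from inside a broker pod, where `{topic-name}` and the bootstrap address are placeholders:

```shell
# Replication factor and in-sync replicas for a topic
kafka-topics.sh --describe --topic {topic-name} \
  --bootstrap-server localhost:9092

# Per-topic retention overrides (retention.ms / retention.bytes)
kafka-configs.sh --describe --entity-type topics --entity-name {topic-name} \
  --bootstrap-server localhost:9092
```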

GCS Bucket Backups

GCS buckets are used for import/export by microservices. Bucket contents are [NEEDS TEAM INPUT: versioned? Cross-region replicated? Lifecycle rules configured?].
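Until the team fills this in, the current settings can be read directly from each bucket; a sketch with a placeholder bucket name:

```shell
# Shows versioning, lifecycle rules, and location for a bucket
gcloud storage buckets describe gs://{bucket-name}

# Or, versioning state only, via the older gsutil tool
gsutil versioning get gs://{bucket-name}
```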

Terraform State Backups

Terraform state files in GCS buckets are critical — losing them means Terraform can no longer manage existing resources. Protect them:

| Bucket | Versioning | Cross-region |
|---|---|---|
| oro-dev-infra | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |
| oro-infra-stag | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |
| oro-infra-production | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |

Recommendation: Enable GCS object versioning on all state buckets.
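A sketch of enabling versioning on one of the state buckets listed above (repeat per bucket; `gsutil versioning set on gs://{bucket-name}` is the older equivalent):

```shell
# Enable object versioning so overwritten or deleted state can be recovered
gcloud storage buckets update gs://oro-infra-stag --versioning
```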

Tested Recovery Procedures

| Scenario | Last Tested | Result | Notes |
|---|---|---|---|
| Cloud SQL automated backup restore | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| Cloud SQL PITR clone | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| Redis instance recreation | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| MongoDB restore | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |

Test your backups

Backups are only useful if they can be restored. Run restore drills in dev at least quarterly.
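A drill sketch for the Cloud SQL case: restore the latest dev backup into a pre-created throwaway instance, verify, and clean up. `{dev-instance}`, `{drill-instance}`, and `{backup-id}` are placeholders, and the target instance must already exist with a compatible configuration:

```shell
# 1. Find the most recent backup
gcloud sql backups list --instance={dev-instance} --limit=1

# 2. Restore it into the throwaway instance (overwrites that instance)
gcloud sql backups restore {backup-id} \
  --restore-instance={drill-instance} \
  --backup-instance={dev-instance}

# 3. Run application-level sanity checks against {drill-instance}, then clean up
gcloud sql instances delete {drill-instance}
```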

See Also