Backup & Recovery Reference

Backup Summary

| Data Store | Backup Type | Frequency | Retention | Environment |
|---|---|---|---|---|
| Cloud SQL MySQL | Automated | Daily | 30 backups | Staging |
| Cloud SQL MySQL | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Production |
| Cloud SQL MySQL | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Dev |
| Cloud SQL MySQL (PITR) | Binary log | Continuous | 7 days | Staging |
| Cloud Memorystore Redis | RDB snapshot | Every 6 hours | [NEEDS TEAM INPUT] | Staging |
| Cloud Memorystore Redis | RDB snapshot | Every 12 hours | [NEEDS TEAM INPUT] | Dev |
| MongoDB | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | Staging |
| Kafka | Topics replicated | 3 replicas | [NEEDS TEAM INPUT] | Staging |
| GCS Buckets | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | All |

Cloud SQL Backups

Configuration (Staging)

Defined in infrastructure-management/modules/datastore/main.tf:

backup_configuration {
  enabled                        = true
  binary_log_enabled             = true   # enables PITR
  start_time                     = "02:00"  # [NEEDS TEAM INPUT: confirm backup window]
  backup_retention_settings {
    retained_backups              = 30
    retention_unit                = "COUNT"
  }
}
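To confirm the instance actually carries these settings after an apply, the live backup configuration can be read back; a sketch using the staging instance and project names from this page:

```shell
# Print the live backup configuration for the staging instance
gcloud sql instances describe orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="yaml(settings.backupConfiguration)"
```

The output should show `enabled: true`, `binaryLogEnabled: true`, and the configured start time and retention count.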

Viewing Available Backups

gcloud sql backups list \
  --instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="table(id,status,startTime,endTime,type)"
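A single backup can also be inspected in detail (window, size, status) before restoring from it; a sketch, using the same `{backup-id}` placeholder convention as the restore commands on this page:

```shell
# Show full details for one backup
gcloud sql backups describe {backup-id} \
  --instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud
```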

Performing a Restore

See Database Failure Runbook for the full procedure.

Quick reference:

# Restore to same instance (overwrites all current data)
gcloud sql backups restore {backup-id} \
  --restore-instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --backup-instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud

# PITR — restore to a point in time (clones to a new instance)
gcloud sql instances clone \
  orofi-stage-cloud-stage-oro-mysql-instance \
  orofi-stage-cloud-stage-oro-mysql-restore \
  --point-in-time "2026-04-02T09:00:00.000Z" \
  --project=orofi-stage-cloud
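The `--point-in-time` value must be a UTC timestamp in RFC 3339 format and must fall within the 7-day binary log retention window. A sketch for generating a valid value a few minutes in the past (GNU `date`, as found on most Linux hosts):

```shell
# UTC timestamp 15 minutes ago, in the format --point-in-time expects
PIT=$(date -u -d '15 minutes ago' '+%Y-%m-%dT%H:%M:%S.000Z')
echo "$PIT"
```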

Recovery Time Objective (RTO)

| Restore Type | Estimated RTO |
|---|---|
| Automated backup restore (same instance) | [NEEDS TEAM INPUT: typically 15–60 min depending on size] |
| PITR clone | [NEEDS TEAM INPUT: typically 30–90 min] |
| Failover (REGIONAL instance) | 1–2 minutes (automatic) |

Recovery Point Objective (RPO)

| Restore Type | RPO |
|---|---|
| Daily backup restore | Up to 24 hours of data loss |
| PITR | Up to ~5 minutes of data loss (binary log flush interval) |
| Automatic failover (REGIONAL) | Near-zero data loss (synchronous replication) |

Redis Backups

Cloud Memorystore Redis uses RDB (Redis Database Backup) snapshots.

Staging Configuration

  • Persistence mode: RDB
  • Snapshot period: Every 6 hours

Dev Configuration

  • Persistence mode: RDB
  • Snapshot period: Every 12 hours

Redis Recovery

Redis data can be lost if the instance fails between snapshots. Redis is a cache — the application must tolerate cache misses and rebuild state from the source of truth (MySQL/MongoDB).

Redis is not a system of record

Redis holds cached data only. If Redis is lost entirely, services fall back to database queries. There is no need to restore Redis from backup — a fresh empty instance is fine.

If a Redis instance must be recreated:

# Terraform will recreate it
terraform apply -target=module.redis -var="env=stage"

Then restart all services so they reconnect:

for ns in microservice-communication microservice-identity microservice-monolith microservice-analytics \
  api-gateway-public api-gateway-account api-gateway-oro api-gateway-admin-dashboard; do
  kubectl rollout restart deployment -n $ns
done
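To confirm the restarts actually completed, each deployment's rollout status can be polled; a sketch iterating over the same namespace list:

```shell
# Wait (up to 2 minutes each) for every restarted deployment to become ready
for ns in microservice-communication microservice-identity microservice-monolith microservice-analytics \
  api-gateway-public api-gateway-account api-gateway-oro api-gateway-admin-dashboard; do
  for deploy in $(kubectl get deployment -n "$ns" -o name); do
    kubectl rollout status "$deploy" -n "$ns" --timeout=120s
  done
done
```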

MongoDB Backups

[NEEDS TEAM INPUT: the Percona PSMDB Operator has built-in backup capabilities. Document:

  • Whether backups are configured (PBM, Percona Backup for MongoDB)
  • Backup storage location (GCS bucket?)
  • Backup schedule
  • How to trigger a restore]

Kafka Data Durability

Kafka is not backed up in the traditional sense. Instead:

  • Data is replicated across 3 brokers in staging (replication factor = 3, min ISR = 2)
  • Topic retention is configured per-topic [NEEDS TEAM INPUT: what is the default retention time/bytes for each topic?]

If Kafka data is lost (e.g., all brokers fail and PVCs are deleted), messages not yet consumed are gone. Services must be designed to handle this scenario (idempotent consumers, replayable event sources).
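Replication and retention can be verified against a live broker using the stock Kafka CLI tools; a sketch, run from inside a broker pod, where `{topic-name}` and the bootstrap address are placeholders:

```shell
# Replication factor and in-sync replicas for a topic
kafka-topics.sh --describe --topic {topic-name} \
  --bootstrap-server localhost:9092

# Per-topic retention overrides (retention.ms / retention.bytes)
kafka-configs.sh --describe --entity-type topics --entity-name {topic-name} \
  --bootstrap-server localhost:9092
```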

GCS Bucket Backups

GCS buckets are used for import/export by microservices. Bucket contents are [NEEDS TEAM INPUT: versioned? Cross-region replicated? Lifecycle rules configured?].
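Until the team fills this in, the current settings can be read directly from each bucket; a sketch with a placeholder bucket name:

```shell
# Shows versioning, lifecycle rules, and location for a bucket
gcloud storage buckets describe gs://{bucket-name}

# Or, versioning state only, via the older gsutil tool
gsutil versioning get gs://{bucket-name}
```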

Terraform State Backups

Terraform state files in GCS buckets are critical — losing them means Terraform can no longer manage existing resources. Protect them:

| Bucket | Versioning | Cross-region |
|---|---|---|
| oro-dev-infra | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |
| oro-infra-stag | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |
| oro-infra-production | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] |

Recommendation: Enable GCS object versioning on all state buckets.
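A sketch of enabling versioning on one of the state buckets listed above (repeat per bucket; `gsutil versioning set on gs://{bucket-name}` is the older equivalent):

```shell
# Enable object versioning so overwritten or deleted state can be recovered
gcloud storage buckets update gs://oro-infra-stag --versioning
```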

Tested Recovery Procedures

| Scenario | Last Tested | Result | Notes |
|---|---|---|---|
| Cloud SQL automated backup restore | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| Cloud SQL PITR clone | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| Redis instance recreation | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |
| MongoDB restore | [NEEDS TEAM INPUT] | [NEEDS TEAM INPUT] | |

Test your backups

Backups are only useful if they can be restored. Run restore drills in dev at least quarterly.
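A drill sketch for the Cloud SQL case: restore the latest dev backup into a pre-created throwaway instance, verify, and clean up. `{dev-instance}`, `{drill-instance}`, and `{backup-id}` are placeholders, and the target instance must already exist with a compatible configuration:

```shell
# 1. Find the most recent backup
gcloud sql backups list --instance={dev-instance} --limit=1

# 2. Restore it into the throwaway instance (overwrites that instance)
gcloud sql backups restore {backup-id} \
  --restore-instance={drill-instance} \
  --backup-instance={dev-instance}

# 3. Run application-level sanity checks against {drill-instance}, then clean up
gcloud sql instances delete {drill-instance}
```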

See Also