Database Failure Runbook

Severity: Critical


Symptoms

  • App logs: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
  • App logs: Unable to acquire JDBC Connection
  • gcloud sql instances describe shows state FAILED or MAINTENANCE
  • All services writing to MySQL return 500 errors
  • Flyway migrations fail on CI/CD

Impact

  • All microservices that write to or read from their MySQL schema are unavailable
  • Services with in-memory caches may continue serving stale reads briefly
  • Kafka consumers may continue processing if they don't require DB writes

Prerequisites

  • gcloud CLI authenticated to the affected project
  • kubectl access to the cluster
  • Access to GCP Secret Manager (for {env}-cloudsql-root-password)
  • GCP project IAM with roles/cloudsql.admin

Steps

Phase 1: Confirm the Failure

1. Check Cloud SQL instance state

# Staging
gcloud sql instances describe \
  orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="table(name,state,backendType,databaseVersion,settings.availabilityType)"

# Production [NEEDS TEAM INPUT: prod instance name]
gcloud sql instances describe \
  {prod-instance-name} \
  --project=orofi-prod \
  --format="table(name,state,backendType,databaseVersion)"

Expected state: RUNNABLE. If FAILED or MAINTENANCE, proceed to Phase 2.
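The state check can be scripted so the decision is unambiguous. This is a sketch using a hypothetical `next_action` helper fed the output of the `--format="value(state)"` query:

```shell
#!/bin/sh
# Map a raw Cloud SQL state string to the next runbook step.
# (Hypothetical helper; feed it the output of
#  `gcloud sql instances describe ... --format="value(state)"`.)
next_action() {
  case "$1" in
    RUNNABLE)           echo "healthy: investigate app-side connectivity instead" ;;
    FAILED|MAINTENANCE) echo "proceed to Phase 2" ;;
    *)                  echo "unexpected state: $1; check the operations log" ;;
  esac
}

# Example wiring (requires authenticated gcloud):
# state=$(gcloud sql instances describe orofi-stage-cloud-stage-oro-mysql-instance \
#   --project=orofi-stage-cloud --format="value(state)")
# next_action "$state"
```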

2. Check Cloud SQL operations log

gcloud sql operations list \
  --instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --limit=10

3. Test connectivity from inside the cluster

# Get a pod in the affected namespace
kubectl exec -it -n microservice-identity \
  $(kubectl get pod -n microservice-identity -l app=microservice-identity -o name | head -1) \
  -- nc -zv microservice-identity-db.stage.orofi.xyz 3306

Phase 2: Automatic Failover (Staging/Prod — REGIONAL availability)

The staging and production instances use the REGIONAL availability type, so Cloud SQL automatically fails over to a standby replica in another zone if the primary zone fails. No manual intervention is required.

Check if automatic failover occurred:

gcloud sql instances describe \
  orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="value(failoverReplica)"

# Check recent operations for failover events
gcloud sql operations list \
  --instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  | grep -i "failover\|RESTART"

After automatic failover:

  • The private IP of the instance does not change (Cloud SQL manages this transparently)
  • Applications reconnect automatically on the next connection attempt (most connection pools retry)
  • If needed, restart application pods to force a connection pool reset:

for ns in microservice-communication microservice-identity microservice-monolith microservice-analytics; do
  kubectl rollout restart deployment/$ns -n $ns
done
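To confirm each restart actually completed before moving on, the rollouts can be watched. Sketch only, assuming (as in the loop above) that each deployment is named after its namespace:

```shell
# Block until each restarted deployment reports all replicas ready (or time out).
for ns in microservice-communication microservice-identity microservice-monolith microservice-analytics; do
  kubectl rollout status deployment/$ns -n $ns --timeout=120s
done
```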

Phase 3: Manual Failover (if automatic failover did not occur)

# Trigger manual failover
gcloud sql instances failover \
  orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud

# Wait for instance to become RUNNABLE again
watch gcloud sql instances describe \
  orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="value(state)"

Phase 4: Restore from Backup (data corruption or accidental deletion)

Production Impact

Restoring from a backup overwrites the instance's current data: every change made after the backup was taken is lost. Coordinate with the team before proceeding. Point-in-Time Recovery (PITR) instead restores the database to a specific moment in time, which narrows the window of lost changes.
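Before any restore, it can help to confirm the instance's backup and binary-logging configuration (binary logging is what PITR depends on). Field names below come from the Cloud SQL API; output shape may vary by gcloud version:

```shell
# Show the backup configuration, including whether automated backups
# and binary logging (required for PITR) are enabled.
gcloud sql instances describe \
  orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --format="yaml(settings.backupConfiguration)"
```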

4.1 List available backups

gcloud sql backups list \
  --instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud \
  --limit=10

4.2 Restore from a specific backup

# Note: restore will cause downtime while the instance restores
gcloud sql backups restore {backup-id} \
  --restore-instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --backup-instance=orofi-stage-cloud-stage-oro-mysql-instance \
  --project=orofi-stage-cloud

4.3 Point-in-Time Recovery (PITR)

# Restore to a specific timestamp (binary logging must be enabled — it is)
gcloud sql instances clone \
  orofi-stage-cloud-stage-oro-mysql-instance \
  orofi-stage-cloud-stage-oro-mysql-restore \
  --point-in-time "2026-04-02T09:00:00.000Z" \
  --project=orofi-stage-cloud

After cloning, verify the restored data, then either:

  • Promote the clone as the new primary (update DNS records)
  • Extract the specific data needed and import it into the original instance
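One way to spot-check the clone before switching traffic is to connect with the mysql client and inspect a few known tables. The schema and table names below are illustrative, not taken from this runbook; the root password comes from Secret Manager as noted in Prerequisites:

```shell
# Sanity-check the clone over its private IP.
# Get the IP with: gcloud sql instances describe <clone> --format="value(ipAddresses)"
# microservice_identity.users is an illustrative schema/table, not a confirmed name.
mysql -h {clone-private-ip} -u root -p \
  -e "SHOW DATABASES; SELECT COUNT(*) FROM microservice_identity.users;"
```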

4.4 Update application connections if using a restored instance

If the instance name or IP changed after restore, update the DNS record in Cloud DNS:

gcloud dns record-sets update microservice-identity-db.stage.orofi.xyz. \
  --type=A \
  --ttl=300 \
  --rrdatas={new-private-ip} \
  --zone=stage-orofi-xyz \
  --project=orofi-cloud

Then restart all affected services.


Phase 5: Dev Instance (ZONAL — no automatic failover)

The dev instance is ZONAL and has no standby replica. If the zone us-central1-a is unavailable, the instance will be down until GCP restores it.

Options:

  1. Wait for GCP zone recovery
  2. Clone the instance to a different zone (temporary):

gcloud sql instances clone \
  orofi-dev-cloud-dev-oro-mysql-instance \
  orofi-dev-cloud-dev-oro-mysql-instance-tmp \
  --project=orofi-dev-cloud

Verification

# Instance is RUNNABLE
gcloud sql instances describe {instance} --project={project} --format="value(state)"
# Expected: RUNNABLE

# TCP connectivity from cluster
kubectl exec -it -n microservice-identity \
  $(kubectl get pod -n microservice-identity -l app=microservice-identity -o name | head -1) \
  -- nc -zv microservice-identity-db.stage.orofi.xyz 3306
# Expected: Connection to ... 3306 port [tcp/mysql] succeeded!

# Application pods healthy
kubectl get pods -n microservice-identity
# Expected: all pods in Running state

Post-Incident

  1. Document the timeline and root cause
  2. Verify data integrity — compare record counts before/after
  3. Check PITR is still enabled after any operations that may have disabled it
  4. Review backup retention settings if restore revealed gaps
  5. Run Flyway migrations if schema is behind after restore: see Migrations Guide
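For step 2, an approximate per-table row count can be pulled from information_schema. Note that `TABLE_ROWS` is only an estimate for InnoDB, so run `SELECT COUNT(*)` on critical tables for exact figures; the schema name below is illustrative:

```shell
# Approximate row counts per table for one schema; capture this before and
# after a restore and diff the two outputs.
mysql -h {db-host} -u root -p -e \
  "SELECT TABLE_NAME, TABLE_ROWS
     FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'microservice_identity'
    ORDER BY TABLE_NAME;"
```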

Escalation

  • GCP Cloud SQL support: [NEEDS TEAM INPUT: GCP support ticket URL/contact]
  • Database owner: [NEEDS TEAM INPUT]

See Also