
Production Outage Runbook

Severity: Critical
Last Tested: [NEEDS TEAM INPUT]


Symptoms

  • Error rate in Grafana spikes above normal baseline
  • Health check endpoint returning non-200 status
  • Users reporting inability to access the application
  • kubectl get pods shows CrashLoopBackOff or Error state
  • IngressGateway returning 503 for all routes

Impact

  • All users affected if IngressGateway or identity service is down
  • Subset of users affected if a single microservice is down
  • Data writes may be lost if the outage includes database connectivity issues

Prerequisites

  • kubectl access to the production cluster
  • ArgoCD CLI or access to ArgoCD web UI
  • gcloud CLI authenticated to orofi-prod project
  • Access to Grafana production dashboards

Steps

Phase 1: Triage (< 5 minutes)

1. Determine scope

# Check all pod statuses across application namespaces
kubectl get pods -n api-gateway-public
kubectl get pods -n api-gateway-account
kubectl get pods -n api-gateway-oro
kubectl get pods -n api-gateway-admin-dashboard
kubectl get pods -n microservice-communication
kubectl get pods -n microservice-identity
kubectl get pods -n microservice-monolith
kubectl get pods -n microservice-analytics

# Quick overview of all namespaces
kubectl get pods -A | grep -v Running | grep -v Completed
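
The per-namespace checks above can be run in a single pass. A minimal sketch; the namespace list is copied from this runbook and may drift from the live cluster:

```shell
#!/usr/bin/env bash
# Triage sketch: loop the application namespaces listed above and flag
# any pod that is not Running/Completed.
NAMESPACES="api-gateway-public api-gateway-account api-gateway-oro \
api-gateway-admin-dashboard microservice-communication \
microservice-identity microservice-monolith microservice-analytics"

if command -v kubectl >/dev/null 2>&1; then
  for ns in $NAMESPACES; do
    echo "== ${ns} =="
    kubectl get pods -n "$ns" --no-headers 2>/dev/null \
      | grep -Ev 'Running|Completed' || echo "  all pods healthy"
  done
else
  echo "kubectl not found; run from a machine with cluster access" >&2
fi
```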

2. Check IngressGateway

kubectl get pods -n istio-system -l app=istio-ingressgateway
kubectl logs -n istio-system -l app=istio-ingressgateway --tail=50 | grep -v "200\|204"
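
To see whether 5xx responses dominate, the gateway logs can be tallied by status code. A sketch that assumes the default Envoy access-log format, where the status code follows the quoted request line:

```shell
# Helper: extract and tally HTTP status codes from Envoy-style access logs.
tally_codes() {
  grep -oE '" [0-9]{3} ' | awk '{print $2}' | sort | uniq -c | sort -rn
}

if command -v kubectl >/dev/null 2>&1; then
  kubectl logs -n istio-system -l app=istio-ingressgateway --tail=500 \
    | tally_codes
fi
```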

3. Check recent deployments in ArgoCD

argocd app list
# Look for recently changed sync timestamps

4. Open Grafana and check:

  • Error rate per service (last 30 minutes)
  • Request volume (drop = routing problem upstream)
  • P99 latency (spike = service or DB slowness)
  • Pod restarts (CrashLoopBackOff indicator)


Phase 2: Contain (< 15 minutes)

5. If caused by a recent deployment → Roll back immediately

# Identify the bad deployment
argocd app history {service-name}

# Roll back to previous known-good revision
argocd app rollback {service-name} {previous-revision}

# If auto-sync would re-apply bad version, disable it first
argocd app set {service-name} --sync-policy none
argocd app rollback {service-name} {revision}

See Rollbacks Guide for full rollback procedures.

6. If not deployment-related → Check infrastructure

# Check Cloud SQL status
gcloud sql instances describe \
  orofi-prod-cloud-prod-oro-mysql-instance \
  --project=orofi-prod \
  --format="value(state,backendType)"

# Check Redis
gcloud redis instances describe \
  orofi-prod-cloud-prod-redis-cache \
  --region=us-central1 \
  --project=orofi-prod \
  --format="value(state)"

# Check Kafka pods
kubectl get pods -n kafka

# Check MongoDB
kubectl get pods -n mongo-db

7. Restart unhealthy pods

# Restart a specific deployment
kubectl rollout restart deployment/{service-name} -n {namespace}

# Watch rollout progress
kubectl rollout status deployment/{service-name} -n {namespace}
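
When several deployments in one namespace are degraded, the restart can be scripted. A sketch, not a definitive procedure: the default namespace is an example from this runbook, and the jsonpath filter on unavailableReplicas is an assumption to adapt:

```shell
#!/usr/bin/env bash
# Sketch: restart only the deployments reporting unavailable replicas.
# NS defaults to an example namespace; pass another as the first argument.
NS="${1:-microservice-identity}"

if command -v kubectl >/dev/null 2>&1; then
  # Select deployments whose status carries an unavailableReplicas count.
  kubectl get deployments -n "$NS" \
    -o jsonpath='{range .items[?(@.status.unavailableReplicas)]}{.metadata.name}{"\n"}{end}' \
  | while read -r dep; do
      [ -n "$dep" ] || continue
      echo "restarting ${dep}"
      kubectl rollout restart "deployment/${dep}" -n "$NS"
      kubectl rollout status "deployment/${dep}" -n "$NS" --timeout=120s
    done
else
  echo "kubectl not found; run from a machine with cluster access" >&2
fi
```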

8. If nodes are unhealthy

# Check node status
kubectl get nodes

# If nodes are NotReady, check GKE node pool health
gcloud container node-pools list \
  --cluster=orofi-prod-cloud-prod-k8s-cluster \
  --zone=us-central1-a \
  --project=orofi-prod

Phase 3: Communicate

9. Notify stakeholders

[NEEDS TEAM INPUT: describe the incident communication process:
  • Internal Slack channel to notify
  • Status page (if any) to update
  • Customer communication if applicable
  • Who to page if you can't resolve in 15 minutes]


Phase 4: Resolve

10. Apply the fix

Depending on the root cause identified in Phases 1–2:

Root Cause            Action
Bad deployment        Roll back via ArgoCD (Step 5)
Cloud SQL down        Follow Database Failure Runbook
Certificate expired   Follow Certificate Rotation Runbook
Node capacity         Follow Scaling Events Runbook
Secrets out of sync   Force ESO resync + restart pods
Kafka issues          Restart Kafka pods, check consumer groups
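
For the "Secrets out of sync" row, a sketch of forcing an ESO resync: annotating an ExternalSecret with a changing force-sync value triggers a reconcile (documented External Secrets Operator behavior). {name}, {namespace}, and {service-name} are placeholders, as elsewhere in this runbook:

```shell
#!/usr/bin/env bash
# Sketch: force ESO to re-sync a secret, then restart its consumers.
TS="$(date +%s)"   # changing annotation value forces an immediate reconcile

if command -v kubectl >/dev/null 2>&1; then
  # Find the ExternalSecret resources first
  kubectl get externalsecrets -A

  kubectl annotate externalsecret {name} -n {namespace} \
    force-sync="${TS}" --overwrite

  # Pods typically read secrets at start-up, so restart the consumers
  kubectl rollout restart deployment/{service-name} -n {namespace}
else
  echo "kubectl not found; run from a machine with cluster access" >&2
fi
```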

Verification

After applying a fix:

# Verify pods are healthy
kubectl get pods -A | grep -v "Running\|Completed"

# Verify error rate returns to baseline in Grafana
# Check the "Error Rate" and "Request Success Rate" panels

# Run a synthetic health check against the service
curl -v https://api.{prod-domain}/health

# Confirm ArgoCD shows all apps as Synced + Healthy
argocd app list | grep -v "Synced.*Healthy"
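
The checks above can be rolled into one pass. A sketch; the health URL keeps the {prod-domain} placeholder used elsewhere in this runbook:

```shell
#!/usr/bin/env bash
# Verification sketch: pod health, ArgoCD sync state, and a reminder
# to run the synthetic health check against the real domain.

# Pure helper: does a `kubectl get pods` status line look healthy?
pod_line_healthy() {
  case "$1" in
    *Running*|*Completed*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v kubectl >/dev/null 2>&1; then
  unhealthy="$(kubectl get pods -A --no-headers | grep -cEv 'Running|Completed')"
  echo "unhealthy pods: ${unhealthy}"
fi

if command -v argocd >/dev/null 2>&1; then
  argocd app list | grep -v "Synced.*Healthy" || echo "all apps Synced + Healthy"
fi

# Placeholder domain: substitute the real production host first.
echo "run: curl -fsS https://api.{prod-domain}/health"
```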

Post-Incident

  1. Re-enable ArgoCD auto-sync if you disabled it
  2. Write up an incident report within 24 hours — [NEEDS TEAM INPUT: incident tracking location]
  3. Schedule a post-mortem for Critical incidents within 48 hours
  4. Identify and fix root cause before the incident recurs
  5. Update this runbook if the incident revealed a gap

Escalation

If the outage persists for more than 30 minutes and you can't identify the root cause:

  1. [NEEDS TEAM INPUT: name + contact of platform team lead]
  2. [NEEDS TEAM INPUT: name + contact of engineering manager]
  3. [NEEDS TEAM INPUT: GCP support contact if it's a GCP infrastructure issue]

See Also