Production Outage Runbook¶
Severity: Critical
Last Tested: [NEEDS TEAM INPUT]
Symptoms¶
- Error rate in Grafana spikes above the normal baseline
- Health check endpoint returning non-200 status
- Users reporting inability to access the application
- `kubectl get pods` shows pods in `CrashLoopBackOff` or `Error` state
- IngressGateway returning 503 for all routes
Impact¶
- All users affected if IngressGateway or identity service is down
- Subset of users affected if a single microservice is down
- Data writes may be lost if the outage includes database connectivity issues
Prerequisites¶
- kubectl access to the production cluster
- ArgoCD CLI or access to ArgoCD web UI
- gcloud CLI authenticated to the `orofi-prod` project
- Access to Grafana production dashboards
Steps¶
Phase 1: Triage (< 5 minutes)¶
1. Determine scope
# Check all pod statuses across application namespaces
kubectl get pods -n api-gateway-public
kubectl get pods -n api-gateway-account
kubectl get pods -n api-gateway-oro
kubectl get pods -n api-gateway-admin-dashboard
kubectl get pods -n microservice-communication
kubectl get pods -n microservice-identity
kubectl get pods -n microservice-monolith
kubectl get pods -n microservice-analytics
# Quick overview of all namespaces
kubectl get pods -A | grep -v Running | grep -v Completed
2. Check IngressGateway
kubectl get pods -n istio-system -l app=istio-ingressgateway
kubectl logs -n istio-system -l app=istio-ingressgateway --tail=50 | grep -v "200\|204"
3. Check recent deployments in ArgoCD
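If you prefer the CLI over the ArgoCD web UI, a quick way to spot a recently changed app (`{service-name}` is a placeholder, matching the conventions used elsewhere in this runbook):

```shell
# List all apps with sync status and target revision; anything OutOfSync
# or freshly auto-synced is a candidate trigger for the outage
argocd app list -o wide

# Inspect a suspect app's recent sync history
argocd app history {service-name}
```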
4. Open Grafana and check:
- Error rate per service (last 30 minutes)
- Request volume (a drop indicates a routing problem upstream)
- P99 latency (a spike indicates service or DB slowness)
- Pod restarts (CrashLoopBackOff indicator)
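If Grafana itself is unreachable during the outage, a rough CLI substitute for the pod-restarts check:

```shell
# Sort all pods by container restart count; frequent restarters surface last
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'

# Recent cluster events, oldest first, for crash/OOM/scheduling clues
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
```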
Phase 2: Contain (< 15 minutes)¶
5. If caused by a recent deployment → Roll back immediately
# Identify the bad deployment
argocd app history {service-name}
# Roll back to previous known-good revision
argocd app rollback {service-name} {previous-revision}
# If auto-sync would re-apply bad version, disable it first
argocd app set {service-name} --sync-policy none
argocd app rollback {service-name} {revision}
See the Rollbacks Guide for full rollback procedures.
6. If not deployment-related → Check infrastructure
# Check Cloud SQL status
gcloud sql instances describe \
orofi-prod-cloud-prod-oro-mysql-instance \
--project=orofi-prod \
--format="value(state,backendType)"
# Check Redis
gcloud redis instances describe \
orofi-prod-cloud-prod-redis-cache \
--region=us-central1 \
--project=orofi-prod \
--format="value(state)"
# Check Kafka pods
kubectl get pods -n kafka
# Check MongoDB
kubectl get pods -n mongo-db
7. Restart unhealthy pods
# Restart a specific deployment
kubectl rollout restart deployment/{service-name} -n {namespace}
# Watch rollout progress
kubectl rollout status deployment/{service-name} -n {namespace}
8. If nodes are unhealthy
# Check node status
kubectl get nodes
# If nodes are NotReady, check GKE node pool health
gcloud container node-pools list \
--cluster=orofi-prod-cloud-prod-k8s-cluster \
--zone=us-central1-a \
--project=orofi-prod
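If a single node is NotReady, one containment option is to cordon and drain it so workloads reschedule onto healthy nodes. This is a sketch, not a prescribed step — `{node-name}` is a placeholder, and you should confirm the cluster has spare capacity before draining:

```shell
# Stop new pods from landing on the bad node
kubectl cordon {node-name}

# Evict existing pods so they reschedule elsewhere
kubectl drain {node-name} --ignore-daemonsets --delete-emptydir-data
```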
Phase 3: Communicate¶
9. Notify stakeholders
[NEEDS TEAM INPUT: describe the incident communication process:
- Internal Slack channel to notify
- Status page (if any) to update
- Customer communication if applicable
- Who to page if you can't resolve in 15 minutes]
Phase 4: Resolve¶
10. Apply the fix
Depending on the root cause identified in Phase 1–2:
| Root Cause | Action |
|---|---|
| Bad deployment | Roll back via ArgoCD (Step 5) |
| Cloud SQL down | Follow Database Failure Runbook |
| Certificate expired | Follow Certificate Rotation Runbook |
| Node capacity | Follow Scaling Events Runbook |
| Secrets out of sync | Force ESO resync + restart pods |
| Kafka issues | Restart Kafka pods, check consumer groups |
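Sketches for the last two rows, assuming External Secrets Operator manages the secrets and Kafka brokers run in the `kafka` namespace (resource names in braces are placeholders):

```shell
# Force ESO to reconcile a secret by bumping the force-sync annotation,
# then restart the deployments that consume it
kubectl annotate externalsecret {secret-name} -n {namespace} \
  force-sync=$(date +%s) --overwrite
kubectl rollout restart deployment/{service-name} -n {namespace}

# Inspect Kafka consumer group lag from inside a broker pod
kubectl exec -n kafka {kafka-broker-pod} -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --all-groups
```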
Verification¶
After applying a fix:
# Verify pods are healthy
kubectl get pods -A | grep -v "Running\|Completed"
# Verify error rate returns to baseline in Grafana
# Check the "Error Rate" and "Request Success Rate" panels
# Run a synthetic health check against the service
curl -v https://api.{prod-domain}/health
# Confirm ArgoCD shows all apps as Synced + Healthy
argocd app list | grep -v "Synced.*Healthy"
Post-Incident¶
- Re-enable ArgoCD auto-sync if you disabled it
- Write up an incident report within 24 hours — [NEEDS TEAM INPUT: incident tracking location]
- Schedule a post-mortem for Critical incidents within 48 hours
- Identify and fix the root cause before the incident recurs
- Update this runbook if the incident revealed a gap
Escalation¶
If the outage persists for more than 30 minutes and you can't identify the root cause:
- [NEEDS TEAM INPUT: name + contact of platform team lead]
- [NEEDS TEAM INPUT: name + contact of engineering manager]
- [NEEDS TEAM INPUT: GCP support contact if it's a GCP infrastructure issue]