Production Outage Runbook¶
Severity: Critical
Last Tested: [NEEDS TEAM INPUT]
Symptoms¶
- Error rate in Grafana spikes above the normal baseline
- Health check endpoint returning non-200 status
- Users reporting inability to access the application
- `kubectl get pods` shows pods in `CrashLoopBackOff` or `Error` state
- IngressGateway returning 503 for all routes
Impact¶
- All users affected if IngressGateway or identity service is down
- Subset of users affected if a single microservice is down
- Data writes may be lost if the outage includes database connectivity issues
Prerequisites¶
- kubectl access to the production cluster
- ArgoCD CLI or access to ArgoCD web UI
- gcloud CLI authenticated to the `orofi-prod` project
- Access to Grafana production dashboards
Steps¶
Phase 1: Triage (< 5 minutes)¶
1. Determine scope
# Check all pod statuses across application namespaces
kubectl get pods -n api-gateway-public
kubectl get pods -n api-gateway-account
kubectl get pods -n api-gateway-oro
kubectl get pods -n api-gateway-admin-dashboard
kubectl get pods -n microservice-communication
kubectl get pods -n microservice-identity
kubectl get pods -n microservice-monolith
kubectl get pods -n microservice-analytics
# Quick overview of all namespaces
kubectl get pods -A | grep -v Running | grep -v Completed
2. Check IngressGateway
kubectl get pods -n istio-system -l app=istio-ingressgateway
kubectl logs -n istio-system -l app=istio-ingressgateway --tail=50 | grep -v "200\|204"
3. Check recent deployments in ArgoCD
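If you prefer the CLI over the ArgoCD web UI, a quick way to spot a recently changed app (`{service-name}` is a placeholder, matching the conventions used elsewhere in this runbook):

```shell
# List all apps with sync status and target revision; anything OutOfSync
# or freshly auto-synced is a candidate trigger for the outage
argocd app list -o wide

# Inspect a suspect app's recent sync history
argocd app history {service-name}
```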
4. Open Grafana and check:
- Error rate per service (last 30 minutes)
- Request volume (a drop indicates a routing problem upstream)
- P99 latency (a spike indicates service or DB slowness)
- Pod restarts (CrashLoopBackOff indicator)
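If Grafana itself is unreachable during the outage, a rough CLI substitute for the pod-restarts check:

```shell
# Sort all pods by container restart count; frequent restarters surface last
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'

# Recent cluster events, oldest first, for crash/OOM/scheduling clues
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
```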
Phase 2: Contain (< 15 minutes)¶
5. If caused by a recent deployment → Roll back immediately
# Identify the bad deployment
argocd app history {service-name}
# Roll back to previous known-good revision
argocd app rollback {service-name} {previous-revision}
# If auto-sync would re-apply bad version, disable it first
argocd app set {service-name} --sync-policy none
argocd app rollback {service-name} {revision}
See the Rollbacks Guide for full rollback procedures.
6. If not deployment-related → Check infrastructure
# Check Cloud SQL status
gcloud sql instances describe \
orofi-prod-cloud-prod-oro-mysql-instance \
--project=orofi-prod \
--format="value(state,backendType)"
# Check Redis
gcloud redis instances describe \
orofi-prod-cloud-prod-redis-cache \
--region=us-central1 \
--project=orofi-prod \
--format="value(state)"
# Check Kafka pods
kubectl get pods -n kafka
# Check MongoDB
kubectl get pods -n mongo-db
7. Restart unhealthy pods
# Restart a specific deployment
kubectl rollout restart deployment/{service-name} -n {namespace}
# Watch rollout progress
kubectl rollout status deployment/{service-name} -n {namespace}
8. If nodes are unhealthy
# Check node status
kubectl get nodes
# If nodes are NotReady, check GKE node pool health
gcloud container node-pools list \
--cluster=orofi-prod-cloud-prod-k8s-cluster \
--zone=us-central1-a \
--project=orofi-prod
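If a single node is NotReady, one containment option is to cordon and drain it so workloads reschedule onto healthy nodes. This is a sketch, not a prescribed step — `{node-name}` is a placeholder, and you should confirm the cluster has spare capacity before draining:

```shell
# Stop new pods from landing on the bad node
kubectl cordon {node-name}

# Evict existing pods so they reschedule elsewhere
kubectl drain {node-name} --ignore-daemonsets --delete-emptydir-data
```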
Phase 3: Communicate¶
9. Notify stakeholders
[NEEDS TEAM INPUT: describe the incident communication process:
- Internal Slack channel to notify
- Status page (if any) to update
- Customer communication if applicable
- Who to page if you can't resolve in 15 minutes]
Phase 4: Resolve¶
10. Apply the fix
Depending on the root cause identified in Phase 1–2:
| Root Cause | Action |
|---|---|
| Bad deployment | Roll back via ArgoCD (Step 5) |
| Cloud SQL down | Follow Database Failure Runbook |
| Certificate expired | Follow Certificate Rotation Runbook |
| Node capacity | Follow Scaling Events Runbook |
| Secrets out of sync | Force ESO resync + restart pods |
| Kafka issues | Restart Kafka pods, check consumer groups |
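Sketches for the last two rows, assuming External Secrets Operator manages the secrets and Kafka brokers run in the `kafka` namespace (resource names in braces are placeholders):

```shell
# Force ESO to reconcile a secret by bumping the force-sync annotation,
# then restart the deployments that consume it
kubectl annotate externalsecret {secret-name} -n {namespace} \
  force-sync=$(date +%s) --overwrite
kubectl rollout restart deployment/{service-name} -n {namespace}

# Inspect Kafka consumer group lag from inside a broker pod
kubectl exec -n kafka {kafka-broker-pod} -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --all-groups
```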
Verification¶
After applying a fix:
# Verify pods are healthy
kubectl get pods -A | grep -v "Running\|Completed"
# Verify error rate returns to baseline in Grafana
# Check the "Error Rate" and "Request Success Rate" panels
# Run a synthetic health check against the service
curl -v https://api.{prod-domain}/health
# Confirm ArgoCD shows all apps as Synced + Healthy
argocd app list | grep -v "Synced.*Healthy"
Post-Incident¶
- Re-enable ArgoCD auto-sync if you disabled it
- Write up an incident report within 24 hours — [NEEDS TEAM INPUT: incident tracking location]
- Schedule a post-mortem for Critical incidents within 48 hours
- Identify and fix the root cause before the incident recurs
- Update this runbook if the incident revealed a gap
Escalation¶
If the outage persists for more than 30 minutes and you can't identify the root cause:
- [NEEDS TEAM INPUT: name + contact of platform team lead]
- [NEEDS TEAM INPUT: name + contact of engineering manager]
- [NEEDS TEAM INPUT: GCP support contact if it's a GCP infrastructure issue]