Common Issues
Quick Navigation
| # | Issue | Severity |
|---|---|---|
| 1 | Pod stuck in CrashLoopBackOff | High |
| 2 | ExternalSecret not syncing | High |
| 3 | ArgoCD shows OutOfSync but won't sync | Medium |
| 4 | Service returns 503 via IngressGateway | High |
| 5 | Database connection refused | High |
| 6 | Redis AUTH failure | High |
| 7 | Kafka consumer lag building up | Medium |
| 8 | Certificate expired or not renewing | Critical |
| 9 | Pod stuck in Pending (resource pressure) | Medium |
| 10 | KEDA not scaling pods | Medium |
| 11 | Istio sidecar injection not happening | Medium |
| 12 | Workload Identity: permission denied | High |
| 13 | Dev cluster unreachable (zero-trust blocking) | Medium |
| 14 | MongoDB connection pool exhaustion | High |
| 15 | ArgoCD out of sync after Helm template change | Low |
1. Pod Stuck in CrashLoopBackOff
Symptoms: kubectl get pods shows CrashLoopBackOff. Pod restarts repeatedly.
Root cause: Application crashes on startup — usually a missing secret, misconfigured env var, or database connection failure.
Steps:
# Check logs from the crashing container (current instance)
kubectl logs -n {namespace} -l app={service} --tail=100
# Check logs from the PREVIOUS crash (often more useful)
kubectl logs -n {namespace} -l app={service} --previous --tail=100
# Check pod events for clues
kubectl describe -n {namespace} \
$(kubectl get pod -n {namespace} -l app={service} -o name | head -1)
What to look for in the logs:
- secret not found → Issue 2: ExternalSecret not syncing
- connection refused → Issue 5: Database connection refused
- OOMKilled → increase memory limit in Helm values
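To tell an OOMKill apart from an application crash, the previous termination reason can be read straight from pod status (a quick check using the same label selector as above):

```shell
# Print the last termination reason for each matching pod.
# "OOMKilled" means the container hit its memory limit;
# "Error" points at an application-level crash instead.
kubectl get pods -n {namespace} -l app={service} \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```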
2. ExternalSecret Not Syncing
Symptoms: Kubernetes Secret doesn't exist or has stale values. Pod logs show secret not found.
Steps:
# Check ExternalSecret status
kubectl get externalsecret -n {namespace}
# Check detailed status — look for error messages
kubectl describe externalsecret {secret-name} -n {namespace}
# Check ESO controller logs
kubectl logs -n external-secrets \
-l app.kubernetes.io/name=external-secrets \
--tail=100 | grep -i "error\|{secret-name}"
# Force a resync
kubectl annotate externalsecret {secret-name} -n {namespace} \
force-sync=$(date +%s) --overwrite
Common causes:
- {env}-ext-secrets-manager SA doesn't have access to the secret → add roles/secretmanager.secretAccessor via Terraform
- Secret doesn't exist in GCP → create it (see Secrets Management)
- ESO controller pod is crashing → restart it: kubectl rollout restart deployment -n external-secrets
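The first two causes can be confirmed from the GCP side (a sketch — the secret name is a placeholder and the project ID follows the orofi-{env}-cloud pattern used elsewhere in this runbook):

```shell
# Confirm the secret exists in GCP Secret Manager and has at least one version
gcloud secrets versions list {secret-name} --project=orofi-{env}-cloud

# Confirm the ESO service account can read it — look for
# roles/secretmanager.secretAccessor granted to {env}-ext-secrets-manager
gcloud secrets get-iam-policy {secret-name} --project=orofi-{env}-cloud
```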
3. ArgoCD Shows OutOfSync But Won't Sync
Symptoms: ArgoCD UI shows OutOfSync with a red status. Clicking Sync does nothing or fails.
Steps:
# Check ArgoCD app status
argocd app get {app-name}
# Check for sync errors
argocd app sync {app-name} --dry-run
# Check ArgoCD logs
kubectl logs -n argocd \
-l app.kubernetes.io/name=argocd-application-controller \
--tail=100 | grep -i "error\|{app-name}"
Common causes:
- Helm template error → argocd app sync --dry-run will show the Helm error
- CRD not installed → check if the required CRD exists: kubectl get crd | grep {crd-name}
- Git repo unreachable → check SSH key secret bitbucket-ssh-key in argocd namespace
- Resource conflict (someone applied something manually) → argocd app sync --force
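A Helm template error can also be reproduced locally before involving ArgoCD (a sketch — the chart path and values file name are assumptions about the infra repo layout; adjust to match):

```shell
# Render the chart locally with the same values ArgoCD uses.
# Any Helm template error prints here with a file and line reference.
helm template {app-name} ./charts/{service} \
  -f ./charts/{service}/values-{env}.yaml > /dev/null
```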
4. Service Returns 503 via IngressGateway
Symptoms: curl https://{hostname}.{env}.orofi.xyz/path returns 503 Service Unavailable.
Steps:
# Check if pods are running
kubectl get pods -n {namespace}
# Check IngressGateway logs
kubectl logs -n istio-system \
-l app=istio-ingressgateway --tail=50 | grep "503\|error"
# Check VirtualService is correctly configured
kubectl get virtualservice -n {namespace} -o yaml | grep -A10 "hosts\|route"
# Test service directly (bypassing ingress)
kubectl port-forward -n {namespace} svc/{service} 8080:80
curl http://localhost:8080/health
Common causes:
- No healthy pods → fix pod issue first
- VirtualService host mismatch → verify hostname matches DNS record
- Service selector mismatch → kubectl get endpoints -n {namespace} (should show pod IPs)
- Istio DestinationRule TLS misconfiguration
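For the last cause, the DestinationRule TLS settings can be inspected directly — a TLS mode that disagrees with the mesh's mTLS policy commonly surfaces as 503s at the gateway:

```shell
# Inspect TLS settings on DestinationRules in the namespace.
# ISTIO_MUTUAL vs DISABLE mismatches with the mesh policy cause 503s.
kubectl get destinationrule -n {namespace} -o yaml | grep -B2 -A4 "tls:"
```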
5. Database Connection Refused
Symptoms: App logs show connection refused or JDBC Connection Error for MySQL.
Steps:
# Test DNS resolution from inside the cluster
kubectl exec -it -n {namespace} \
$(kubectl get pod -n {namespace} -l app={service} -o name | head -1) \
-- nslookup microservice-{name}-db.{env}.orofi.xyz
# Test TCP connectivity
kubectl exec -it -n {namespace} \
$(kubectl get pod -n {namespace} -l app={service} -o name | head -1) \
-- nc -zv microservice-{name}-db.{env}.orofi.xyz 3306
# Check Cloud SQL instance status
gcloud sql instances describe orofi-{env}-cloud-{env}-oro-mysql-instance \
--project=orofi-{env}-cloud \
--format="value(state)"
Common causes:
- Cloud SQL instance is stopped → start it via GCP console or gcloud sql instances patch
- DNS record not created → check Terraform for the DNS record
- VPC peering misconfiguration (staging PSC issue) → check Private Service Access in GCP console
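If the instance is stopped, it can be started from the CLI rather than the console — setting the activation policy to ALWAYS is the gcloud equivalent of the Start button:

```shell
# Start a stopped Cloud SQL instance
gcloud sql instances patch orofi-{env}-cloud-{env}-oro-mysql-instance \
  --project=orofi-{env}-cloud \
  --activation-policy=ALWAYS
```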
6. Redis AUTH Failure
Symptoms: App logs show WRONGPASS invalid username-password pair or NOAUTH Authentication required.
Steps:
# Check if Redis secret exists and has correct format
kubectl get secret -n {namespace} | grep redis
# Verify the secret value (base64 decode)
kubectl get secret redis-auth -n {namespace} \
-o jsonpath='{.data.password}' | base64 -d
# Force ESO to re-sync the Redis secret
kubectl annotate externalsecret redis-auth-secret -n {namespace} \
force-sync=$(date +%s) --overwrite
# Restart the pod after secret update
kubectl rollout restart deployment/{service} -n {namespace}
Common cause: Redis auth password was rotated in GCP Secret Manager but the pod hasn't picked up the new value. ESO syncs on schedule (typically 1h) — force a resync or restart the pod.
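To confirm a stale value, compare what the pod sees against the current version in Secret Manager (the secret names here are assumptions — match them to the ExternalSecret spec):

```shell
# Value the pod is using (from the K8s Secret)
kubectl get secret redis-auth -n {namespace} \
  -o jsonpath='{.data.password}' | base64 -d; echo

# Current value in GCP Secret Manager — if these differ, force a resync
gcloud secrets versions access latest \
  --secret={redis-secret-name} --project=orofi-{env}-cloud
```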
7. Kafka Consumer Lag Building Up
Symptoms: Grafana shows increasing consumer lag for a service. Events are not being processed.
Steps:
# Check Kafka pod health
kubectl get pods -n kafka
# Open Kafka UI to inspect consumer groups
# https://kafka-ui.{env}.orofi.xyz
# Check the consumer service logs for errors
kubectl logs -n microservice-{name} -l app=microservice-{name} --tail=200 \
| grep -i "kafka\|consumer\|error"
# Check if topic exists
# (in Kafka UI → Topics → verify topic name)
Common causes:
- Consumer is crashing → fix pod issue
- Topic doesn't exist → create it (check topic configuration in tools/kafka-new/values-{env}.yaml)
- SASL credentials expired → rotate {env}-kafka-secrets
- Consumer stuck on a poison-pill message → [NEEDS TEAM INPUT: document DLQ or skip-offset procedure]
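If Kafka UI is unreachable, lag can also be inspected from the Kafka CLI inside a broker pod (a sketch — the pod name, tool path, and bootstrap address depend on the Kafka distribution deployed in the kafka namespace):

```shell
# Per-partition lag for a consumer group, straight from a broker pod.
# The LAG column shows messages not yet consumed.
kubectl exec -n kafka kafka-0 -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group {consumer-group}
```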
8. Certificate Expired or Not Renewing
Symptoms: Browser shows certificate error. curl returns SSL certificate problem.
See Certificate Rotation Runbook for the full procedure.
Quick check:
# Check certificate status
kubectl get certificate -n istio-system istio-tls-cert
# Check cert-manager logs
kubectl logs -n cert-manager \
-l app=cert-manager --tail=100 | grep -i "error\|certificate"
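To see what is actually served at the edge, inspect the live certificate and its expiry (standard openssl invocation; substitute the real hostname):

```shell
# Fetch the certificate presented by the IngressGateway and print
# its subject and expiry date
echo | openssl s_client -connect {hostname}.{env}.orofi.xyz:443 \
  -servername {hostname}.{env}.orofi.xyz 2>/dev/null \
  | openssl x509 -noout -subject -enddate
```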
9. Pod Stuck in Pending (Resource Pressure)
Symptoms: kubectl get pods shows Pending. kubectl describe pod shows Insufficient cpu or Insufficient memory.
Steps:
# See why a pod can't schedule
kubectl describe pod -n {namespace} {pod-name} | grep -A10 "Events\|Conditions"
# Check node resource usage
kubectl describe nodes | grep -A5 "Allocated resources"
# Check if cluster autoscaler is adding nodes
kubectl get events -A | grep -i "scale\|node"
Solutions:
- Wait for cluster autoscaler to add a node (can take 2–3 minutes)
- If autoscaler is not adding nodes, check autoscaler logs: kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100
- If the cluster is at max nodes (15) and still under pressure: scale up manually via the scale-up-{env} Bitbucket pipeline, or increase the max node count in Terraform
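A quick view of overall node pressure (kubectl top requires metrics-server, which GKE ships by default):

```shell
# Live CPU/memory usage per node — near-saturated numbers across all
# nodes confirm resource pressure rather than a scheduling constraint
kubectl top nodes

# Node count and readiness — compare against the cluster's max (15)
kubectl get nodes
```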
10. KEDA Not Scaling Pods
Symptoms: Consumer lag is high but pods aren't scaling up. KEDA ScaledObject exists.
Steps:
# Check ScaledObject status
kubectl get scaledobject -n {namespace}
kubectl describe scaledobject -n {namespace} {scaledobject-name}
# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100 | grep -i "error\|{namespace}"
# Check HPA generated by KEDA
kubectl get hpa -n {namespace}
Common causes:
- KEDA can't reach the Kafka/MongoDB metrics endpoint → verify connectivity
- ScaledObject minReplicas is set too high → check config
- KEDA operator is unhealthy → kubectl rollout restart deployment -n keda
11. Istio Sidecar Injection Not Happening
Symptoms: Pod has only 1 container (should have 2 — app + istio-proxy). mTLS failures.
Steps:
# Check if sidecar injection label is on the namespace
kubectl get namespace {namespace} --show-labels | grep istio-injection
# Should see: istio-injection=enabled
# If not, add it:
kubectl label namespace {namespace} istio-injection=enabled
# Restart pods to get sidecar injected
kubectl rollout restart deployment/{service} -n {namespace}
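To verify injection took effect after the restart, list each pod's containers — every pod should show two:

```shell
# Each line should list the app container plus istio-proxy
kubectl get pods -n {namespace} -l app={service} \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
```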
12. Workload Identity: Permission Denied
Symptoms: App logs show 403 PERMISSION_DENIED when accessing GCP Secret Manager or Cloud Storage.
Steps:
# Check if the K8s ServiceAccount has the correct annotation
kubectl get serviceaccount -n {namespace} {sa-name} \
-o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
# Should print: {gsa}@{project}.iam.gserviceaccount.com
# Verify the IAM binding exists in GCP
gcloud iam service-accounts get-iam-policy \
{gsa}@orofi-{env}-cloud.iam.gserviceaccount.com \
--project=orofi-{env}-cloud \
| grep {k8s-namespace}
# Check if the GCP SA has the right role on the resource
gcloud secrets get-iam-policy {secret-name} \
--project=orofi-{env}-cloud
Fix: The Terraform modules/service-accounts module creates both the IAM binding and the annotation. If this is missing, add the service account to the Terraform configuration and apply.
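A direct test from inside the pod shows which identity the workload actually has — the GKE metadata server should return the GSA email, not the node's default service account:

```shell
# Ask the metadata server for the effective GCP identity of this pod.
# Expected: {gsa}@orofi-{env}-cloud.iam.gserviceaccount.com
kubectl exec -it -n {namespace} \
  $(kubectl get pod -n {namespace} -l app={service} -o name | head -1) \
  -- curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
```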
13. Dev Cluster Unreachable (Zero-Trust Blocking)
Symptoms: Can't kubectl to dev cluster from your machine. gcloud container clusters get-credentials works but kubectl get nodes times out.
Root cause: The dev cluster has zero-trust firewall rules that only allow traffic from 35.226.57.140/32 (Bitbucket), 10.0.0.0/8, and 11.0.0.0/16. Your IP is not in these ranges.
Fix:
1. Use a VPN or bastion host that routes through the allowed IP ranges
2. [NEEDS TEAM INPUT: document the VPN or bastion host access procedure]
3. Or, temporarily add your IP to the firewall allowlist via Terraform (remember to remove it after)
14. MongoDB Connection Pool Exhaustion
Symptoms: App logs show MongoWaitQueueFullError or connection pool exhausted. KEDA may be scaling MongoDB.
Steps:
# Check current MongoDB connections (in Mongo Express or via mongosh)
# db.serverStatus().connections
# Check KEDA ScaledObject for MongoDB
kubectl get scaledobject -n mongo-db
# Check MongoDB pod count
kubectl get pods -n mongo-db
# Check MongoDB logs
kubectl logs -n mongo-db -l app=mongo-db --tail=100
Solutions:
- KEDA should auto-scale MongoDB replicas (up to 5) when connections exceed 50
- If KEDA is not scaling, see Issue 10
- If connections are legitimate load spike: check for connection leaks in application code
- Emergency: manually scale MongoDB: kubectl scale statefulset mongo-db -n mongo-db --replicas=3
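The connection check from the comments above can be run non-interactively (a sketch — the pod name and any auth flags are assumptions depending on how MongoDB is deployed):

```shell
# Live connection counts — compare "current" against the ~50-connection
# scaling threshold mentioned above
kubectl exec -n mongo-db mongo-db-0 -- \
  mongosh --quiet --eval 'db.serverStatus().connections'
```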
15. ArgoCD Out of Sync After Helm Template Change
Symptoms: After changing a Helm template in the infra repo, ArgoCD shows OutOfSync but the diff looks correct. Auto-sync is not applying.
Root cause: Helm generated a resource that differs in annotations or labels (e.g., helm.sh/chart label changed with chart version bump).
Steps:
# See the exact diff
argocd app diff {app-name}
# Force sync with pruning (removes resources no longer in Helm output)
argocd app sync {app-name} --prune
# If there are immutable fields (e.g., on a Job), you may need to delete and recreate
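When an immutable field does block the sync (most common with Jobs), one recovery sketch — resource names are placeholders:

```shell
# Delete the resource with the immutable field, then let ArgoCD
# recreate it from the current Helm output
kubectl delete job {job-name} -n {namespace}
argocd app sync {app-name}
```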