
Common Issues

Quick Navigation

| # | Issue | Severity |
|---|-------|----------|
| 1 | Pod stuck in CrashLoopBackOff | High |
| 2 | ExternalSecret not syncing | High |
| 3 | ArgoCD shows OutOfSync but won't sync | Medium |
| 4 | Service returns 503 via IngressGateway | High |
| 5 | Database connection refused | High |
| 6 | Redis AUTH failure | High |
| 7 | Kafka consumer lag building up | Medium |
| 8 | Certificate expired or not renewing | Critical |
| 9 | Pod stuck in Pending (resource pressure) | Medium |
| 10 | KEDA not scaling pods | Medium |
| 11 | Istio sidecar injection not happening | Medium |
| 12 | Workload Identity: permission denied | High |
| 13 | Dev cluster unreachable (zero-trust blocking) | Medium |
| 14 | MongoDB connection pool exhaustion | High |
| 15 | ArgoCD out of sync after Helm template change | Low |

1. Pod Stuck in CrashLoopBackOff

Symptoms: kubectl get pods shows CrashLoopBackOff. Pod restarts repeatedly.

Root cause: Application crashes on startup — usually a missing secret, misconfigured env var, or database connection failure.

Steps:

# Check logs from the crashing container (current instance)
kubectl logs -n {namespace} -l app={service} --tail=100

# Check logs from the PREVIOUS crash (often more useful)
kubectl logs -n {namespace} -l app={service} --previous --tail=100

# Check pod events for clues
kubectl describe pod -n {namespace} \
  $(kubectl get pod -n {namespace} -l app={service} -o name | head -1)

Look for in logs:
- secret not found → see Issue 2: ExternalSecret not syncing
- connection refused → see Issue 5: Database connection refused
- OOMKilled → increase memory limit in Helm values
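For the OOMKilled case, the fix is usually a higher memory limit in the service's Helm values. A hedged sketch (the key path shown is the common Kubernetes convention; the actual chart's values schema may differ):

```yaml
# Illustrative values-{env}.yaml fragment -- adapt to the chart's real schema
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 1Gi   # raised after repeated OOMKills; was e.g. 512Mi
```

Commit the change and let ArgoCD sync it rather than patching the Deployment by hand, or the next sync will revert it.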


2. ExternalSecret Not Syncing

Symptoms: Kubernetes Secret doesn't exist or has stale values. Pod logs show secret not found.

Steps:

# Check ExternalSecret status
kubectl get externalsecret -n {namespace}

# Check detailed status — look for error messages
kubectl describe externalsecret {secret-name} -n {namespace}

# Check ESO controller logs
kubectl logs -n external-secrets \
  -l app.kubernetes.io/name=external-secrets \
  --tail=100 | grep -i "error\|{secret-name}"

# Force a resync
kubectl annotate externalsecret {secret-name} -n {namespace} \
  force-sync=$(date +%s) --overwrite

Common causes:
- {env}-ext-secrets-manager SA doesn't have access to the secret → add roles/secretmanager.secretAccessor via Terraform
- Secret doesn't exist in GCP → create it (see Secrets Management)
- ESO controller pod is crashing → restart it: kubectl rollout restart deployment -n external-secrets
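For the missing-access case, the grant belongs in Terraform. A hedged sketch of what the binding might look like (resource label, project, and secret name here are illustrative, not the repo's actual module layout):

```hcl
# Illustrative only -- the real binding lives in the repo's Terraform modules.
resource "google_secret_manager_secret_iam_member" "eso_access" {
  project   = "orofi-dev-cloud"          # assumed project id
  secret_id = "my-service-db-password"   # assumed secret name
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:dev-ext-secrets-manager@orofi-dev-cloud.iam.gserviceaccount.com"
}
```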


3. ArgoCD Shows OutOfSync But Won't Sync

Symptoms: ArgoCD UI shows OutOfSync with a red status. Clicking Sync does nothing or fails.

Steps:

# Check ArgoCD app status
argocd app get {app-name}

# Check for sync errors
argocd app sync {app-name} --dry-run

# Check ArgoCD logs
kubectl logs -n argocd \
  -l app.kubernetes.io/name=argocd-application-controller \
  --tail=100 | grep -i "error\|{app-name}"

Common causes:
- Helm template error → argocd app sync --dry-run will show the Helm error
- CRD not installed → check if the required CRD exists: kubectl get crd | grep {crd-name}
- Git repo unreachable → check SSH key secret bitbucket-ssh-key in argocd namespace
- Resource conflict (someone applied something manually) → argocd app sync --force


4. Service Returns 503 via IngressGateway

Symptoms: curl https://{hostname}.{env}.orofi.xyz/path returns 503 Service Unavailable.

Steps:

# Check if pods are running
kubectl get pods -n {namespace}

# Check IngressGateway logs
kubectl logs -n istio-system \
  -l app=istio-ingressgateway --tail=50 | grep "503\|error"

# Check VirtualService is correctly configured
kubectl get virtualservice -n {namespace} -o yaml | grep -A10 "hosts\|route"

# Test service directly (bypassing ingress)
kubectl port-forward -n {namespace} svc/{service} 8080:80
curl http://localhost:8080/health

Common causes:
- No healthy pods → fix the pod issue first
- VirtualService host mismatch → verify the hostname matches the DNS record
- Service selector mismatch → kubectl get endpoints -n {namespace} (should show pod IPs)
- Istio DestinationRule TLS misconfiguration


5. Database Connection Refused

Symptoms: App logs show connection refused or JDBC Connection Error for MySQL.

Steps:

# Test DNS resolution from inside the cluster
kubectl exec -it -n {namespace} \
  $(kubectl get pod -n {namespace} -l app={service} -o name | head -1) \
  -- nslookup microservice-{name}-db.{env}.orofi.xyz

# Test TCP connectivity
kubectl exec -it -n {namespace} \
  $(kubectl get pod -n {namespace} -l app={service} -o name | head -1) \
  -- nc -zv microservice-{name}-db.{env}.orofi.xyz 3306

# Check Cloud SQL instance status
gcloud sql instances describe orofi-{env}-cloud-{env}-oro-mysql-instance \
  --project=orofi-{env}-cloud \
  --format="value(state)"

Common causes:
- Cloud SQL instance is stopped → start it via the GCP console or gcloud sql instances patch
- DNS record not created → check Terraform for the DNS record
- VPC peering misconfiguration (staging PSC issue) → check Private Service Access in the GCP console


6. Redis AUTH Failure

Symptoms: App logs show WRONGPASS invalid username-password pair or NOAUTH Authentication required.

Steps:

# Check if Redis secret exists and has correct format
kubectl get secret -n {namespace} | grep redis

# Verify the secret value (base64 decode)
kubectl get secret redis-auth -n {namespace} \
  -o jsonpath='{.data.password}' | base64 -d

# Force ESO to re-sync the Redis secret
kubectl annotate externalsecret redis-auth-secret -n {namespace} \
  force-sync=$(date +%s) --overwrite

# Restart the pod after secret update
kubectl rollout restart deployment/{service} -n {namespace}

Common cause: Redis auth password was rotated in GCP Secret Manager but the pod hasn't picked up the new value. ESO syncs on schedule (typically 1h) — force a resync or restart the pod.
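A related trap when creating or comparing secret values by hand: echo appends a trailing newline unless given -n, and Redis will reject the padded value even though the decoded secret "looks" right. A quick local illustration:

```shell
# Trailing-newline gotcha: the two encodings below differ by one byte ("\n").
echo -n 'hunter2' | base64   # aHVudGVyMg==  -> decodes to 7 bytes (correct)
echo 'hunter2' | base64      # aHVudGVyMgo=  -> decodes to 8 bytes (padded)
```

When in doubt, compare byte counts rather than eyeballing: pipe both the decoded Kubernetes secret and the output of gcloud secrets versions access latest --secret={secret-name} through wc -c and check they match.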


7. Kafka Consumer Lag Building Up

Symptoms: Grafana shows increasing consumer lag for a service. Events are not being processed.

Steps:

# Check Kafka pod health
kubectl get pods -n kafka

# Open Kafka UI to inspect consumer groups
# https://kafka-ui.{env}.orofi.xyz

# Check the consumer service logs for errors
kubectl logs -n microservice-{name} -l app=microservice-{name} --tail=200 \
  | grep -i "kafka\|consumer\|error"

# Check if topic exists
# (in Kafka UI → Topics → verify topic name)

Common causes:
- Consumer is crashing → fix the pod issue
- Topic doesn't exist → create it (check topic configuration in tools/kafka-new/values-{env}.yaml)
- SASL credentials expired → rotate {env}-kafka-secrets
- Consumer stuck on a poison-pill message → [NEEDS TEAM INPUT: document DLQ or skip-offset procedure]


8. Certificate Expired or Not Renewing

Symptoms: Browser shows certificate error. curl returns SSL certificate problem.

See Certificate Rotation Runbook for the full procedure.

Quick check:

# Check certificate status
kubectl get certificate -n istio-system istio-tls-cert

# Check cert-manager logs
kubectl logs -n cert-manager \
  -l app=cert-manager --tail=100 | grep -i "error\|certificate"
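To check how close the live certificate is to expiry without waiting on cert-manager, a small helper like the following can be run locally (the helper name is hypothetical, and it assumes GNU date; the openssl invocations are standard):

```shell
# Hypothetical helper: print days until a PEM certificate expires.
# Assumes GNU date (date -d); on macOS use "date -j -f" instead.
cert_days_left() {
  local end
  end=$(openssl x509 -noout -enddate -in "$1" | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Fetch the cert actually served by the ingress gateway, then check it:
#   echo | openssl s_client -connect {hostname}.{env}.orofi.xyz:443 \
#     -servername {hostname}.{env}.orofi.xyz 2>/dev/null \
#     | openssl x509 -out /tmp/live.pem
#   cert_days_left /tmp/live.pem
```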


9. Pod Stuck in Pending (Resource Pressure)

Symptoms: kubectl get pods shows Pending. kubectl describe pod shows Insufficient cpu or Insufficient memory.

Steps:

# See why a pod can't schedule
kubectl describe pod -n {namespace} {pod-name} | grep -A10 "Events\|Conditions"

# Check node resource usage
kubectl describe nodes | grep -A5 "Allocated resources"

# Check if cluster autoscaler is adding nodes
kubectl get events -A | grep -i "scale\|node"

Solutions:
- Wait for the cluster autoscaler to add a node (can take 2–3 minutes)
- If the autoscaler is not adding nodes, check its logs: kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100
- If the cluster is at max nodes (15) and still under pressure: scale up manually via the scale-up-{env} Bitbucket pipeline, or increase the max node count in Terraform


10. KEDA Not Scaling Pods

Symptoms: Consumer lag is high but pods aren't scaling up. KEDA ScaledObject exists.

Steps:

# Check ScaledObject status
kubectl get scaledobject -n {namespace}
kubectl describe scaledobject -n {namespace} {scaledobject-name}

# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100 | grep -i "error\|{namespace}"

# Check HPA generated by KEDA
kubectl get hpa -n {namespace}

Common causes:
- KEDA can't reach the Kafka/MongoDB metrics endpoint → verify connectivity
- ScaledObject minReplicas is set too high → check the config
- KEDA operator is unhealthy → kubectl rollout restart deployment -n keda


11. Istio Sidecar Injection Not Happening

Symptoms: Pod has only 1 container (should have 2 — app + istio-proxy). mTLS failures.

Steps:

# Check if sidecar injection label is on the namespace
kubectl get namespace {namespace} --show-labels | grep istio-injection

# Should see: istio-injection=enabled
# If not, add it:
kubectl label namespace {namespace} istio-injection=enabled

# Restart pods to get sidecar injected
kubectl rollout restart deployment/{service} -n {namespace}


12. Workload Identity: Permission Denied

Symptoms: App logs show 403 PERMISSION_DENIED when accessing GCP Secret Manager or Cloud Storage.

Steps:

# Check if the K8s ServiceAccount has the correct annotation
kubectl get serviceaccount -n {namespace} {sa-name} -o yaml | grep annotation

# Should see: iam.gke.io/gcp-service-account: {gsa}@{project}.iam.gserviceaccount.com

# Verify the IAM binding exists in GCP
gcloud iam service-accounts get-iam-policy \
  {gsa}@orofi-{env}-cloud.iam.gserviceaccount.com \
  --project=orofi-{env}-cloud \
  | grep {k8s-namespace}

# Check if the GCP SA has the right role on the resource
gcloud secrets get-iam-policy {secret-name} \
  --project=orofi-{env}-cloud

Fix: The Terraform modules/service-accounts module creates both the IAM binding and the annotation. If this is missing, add the service account to the Terraform configuration and apply.
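The Workload Identity principal that must appear in the GSA's IAM policy (under roles/iam.workloadIdentityUser) has a fixed shape, and a typo anywhere in it (project, namespace, or KSA name) produces exactly this 403. A small helper to build the expected string for grepping, with hypothetical example names:

```shell
# Build the Workload Identity member string for a given
# GCP project id ($1), K8s namespace ($2), and K8s ServiceAccount name ($3).
wi_member() {
  echo "serviceAccount:$1.svc.id.goog[$2/$3]"
}

# Example (assumed names) -- grep for this in the get-iam-policy output above:
wi_member orofi-dev-cloud microservice-foo default
# serviceAccount:orofi-dev-cloud.svc.id.goog[microservice-foo/default]
```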


13. Dev Cluster Unreachable (Zero-Trust Blocking)

Symptoms: Can't kubectl to dev cluster from your machine. gcloud container clusters get-credentials works but kubectl get nodes times out.

Root cause: The dev cluster has zero-trust firewall rules that only allow traffic from 35.226.57.140/32 (Bitbucket), 10.0.0.0/8, and 11.0.0.0/16. Your IP is not in these ranges.

Fix:
1. Use a VPN or bastion host that routes through the allowed IP ranges
2. [NEEDS TEAM INPUT: document the VPN or bastion host access procedure]
3. Or temporarily add your IP to the firewall allowlist via Terraform (remember to remove it afterwards)
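For the temporary-allowlist option, a hedged sketch of the kind of change involved (the resource name and structure are illustrative; find the actual firewall definition in the infra repo, and note the allowlist may instead live in a GKE master_authorized_networks_config block):

```hcl
# Illustrative only -- locate the real allowlist resource in Terraform.
resource "google_compute_firewall" "dev_zero_trust_allowlist" {
  # ...existing fields unchanged...
  source_ranges = [
    "35.226.57.140/32",  # Bitbucket
    "10.0.0.0/8",
    "11.0.0.0/16",
    "203.0.113.7/32",    # TEMPORARY: your IP -- remove after use
  ]
}
```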


14. MongoDB Connection Pool Exhaustion

Symptoms: App logs show MongoWaitQueueFullError or connection pool exhausted. KEDA may be scaling MongoDB.

Steps:

# Check current MongoDB connections (in Mongo Express or via mongosh)
# db.serverStatus().connections

# Check KEDA ScaledObject for MongoDB
kubectl get scaledobject -n mongo-db

# Check MongoDB pod count
kubectl get pods -n mongo-db

# Check MongoDB logs
kubectl logs -n mongo-db -l app=mongo-db --tail=100

Solutions:
- KEDA should auto-scale MongoDB replicas (up to 5) when connections exceed 50
- If KEDA is not scaling, see Issue 10
- If connections reflect a legitimate load spike: check for connection leaks in application code
- Emergency: manually scale MongoDB: kubectl scale statefulset mongo-db -n mongo-db --replicas=3


15. ArgoCD Out of Sync After Helm Template Change

Symptoms: After changing a Helm template in the infra repo, ArgoCD shows OutOfSync but the diff looks correct. Auto-sync is not applying.

Root cause: Helm generated a resource that differs in annotations or labels (e.g., helm.sh/chart label changed with chart version bump).

Steps:

# See the exact diff
argocd app diff {app-name}

# Force sync with pruning (removes resources no longer in Helm output)
argocd app sync {app-name} --prune

# If there are immutable fields (e.g., on a Job), you may need to delete and recreate


See Also