Common Issues
Quick Navigation
| # | Issue | Severity |
|---|---|---|
| 1 | Pod stuck in CrashLoopBackOff | High |
| 2 | ExternalSecret not syncing | High |
| 3 | ArgoCD shows OutOfSync but won't sync | Medium |
| 4 | Service returns 503 via IngressGateway | High |
| 5 | Database connection refused | High |
| 6 | Redis AUTH failure | High |
| 7 | Kafka consumer lag building up | Medium |
| 8 | Certificate expired or not renewing | Critical |
| 9 | Pod stuck in Pending (resource pressure) | Medium |
| 10 | KEDA not scaling pods | Medium |
| 11 | Istio sidecar injection not happening | Medium |
| 12 | Workload Identity: permission denied | High |
| 13 | Dev cluster unreachable (zero-trust blocking) | Medium |
| 14 | MongoDB connection pool exhaustion | High |
| 15 | ArgoCD out of sync after Helm template change | Low |
1. Pod Stuck in CrashLoopBackOff
Symptoms: kubectl get pods shows CrashLoopBackOff. Pod restarts repeatedly.
Root cause: Application crashes on startup — usually a missing secret, misconfigured env var, or database connection failure.
Steps:
# Check logs from the crashing container (current instance)
kubectl logs -n {namespace} -l app={service} --tail=100
# Check logs from the PREVIOUS crash (often more useful)
kubectl logs -n {namespace} -l app={service} --previous --tail=100
# Check pod events for clues
kubectl describe -n {namespace} \
$(kubectl get pod -n {namespace} -l app={service} -o name | head -1)
What to look for in the logs:
- secret not found → Issue 2: ExternalSecret not syncing
- connection refused → Issue 5: Database connection refused
- OOMKilled → increase memory limit in Helm values
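To tell an OOMKill apart from an application crash, the previous termination reason can be read straight from pod status (a quick check using the same label selector as above):

```shell
# Print the last termination reason for each matching pod.
# "OOMKilled" means the container hit its memory limit;
# "Error" points at an application-level crash instead.
kubectl get pods -n {namespace} -l app={service} \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```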
2. ExternalSecret Not Syncing
Symptoms: Kubernetes Secret doesn't exist or has stale values. Pod logs show secret not found.
Steps:
# Check ExternalSecret status
kubectl get externalsecret -n {namespace}
# Check detailed status — look for error messages
kubectl describe externalsecret {secret-name} -n {namespace}
# Check ESO controller logs
kubectl logs -n external-secrets \
-l app.kubernetes.io/name=external-secrets \
--tail=100 | grep -i "error\|{secret-name}"
# Force a resync
kubectl annotate externalsecret {secret-name} -n {namespace} \
force-sync=$(date +%s) --overwrite
Common causes:
- {env}-ext-secrets-manager SA doesn't have access to the secret → add roles/secretmanager.secretAccessor via Terraform
- Secret doesn't exist in GCP → create it (see Secrets Management)
- ESO controller pod is crashing → restart it: kubectl rollout restart deployment -n external-secrets
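The first two causes can be confirmed from the GCP side (a sketch — the secret name is a placeholder and the project ID follows the orofi-{env}-cloud pattern used elsewhere in this runbook):

```shell
# Confirm the secret exists in GCP Secret Manager and has at least one version
gcloud secrets versions list {secret-name} --project=orofi-{env}-cloud

# Confirm the ESO service account can read it — look for
# roles/secretmanager.secretAccessor granted to {env}-ext-secrets-manager
gcloud secrets get-iam-policy {secret-name} --project=orofi-{env}-cloud
```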
3. ArgoCD Shows OutOfSync But Won't Sync
Symptoms: ArgoCD UI shows OutOfSync with a red status. Clicking Sync does nothing or fails.
Steps:
# Check ArgoCD app status
argocd app get {app-name}
# Check for sync errors
argocd app sync {app-name} --dry-run
# Check ArgoCD logs
kubectl logs -n argocd \
-l app.kubernetes.io/name=argocd-application-controller \
--tail=100 | grep -i "error\|{app-name}"
Common causes:
- Helm template error → argocd app sync --dry-run will show the Helm error
- CRD not installed → check if the required CRD exists: kubectl get crd | grep {crd-name}
- Git repo unreachable → check SSH key secret bitbucket-ssh-key in argocd namespace
- Resource conflict (someone applied something manually) → argocd app sync --force
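A Helm template error can also be reproduced locally before involving ArgoCD (a sketch — the chart path and values file name are assumptions about the infra repo layout; adjust to match):

```shell
# Render the chart locally with the same values ArgoCD uses.
# Any Helm template error prints here with a file and line reference.
helm template {app-name} ./charts/{service} \
  -f ./charts/{service}/values-{env}.yaml > /dev/null
```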
4. Service Returns 503 via IngressGateway
Symptoms: curl https://{hostname}.{env}.orofi.xyz/path returns 503 Service Unavailable.
Steps:
# Check if pods are running
kubectl get pods -n {namespace}
# Check IngressGateway logs
kubectl logs -n istio-system \
-l app=istio-ingressgateway --tail=50 | grep "503\|error"
# Check VirtualService is correctly configured
kubectl get virtualservice -n {namespace} -o yaml | grep -A10 "hosts\|route"
# Test service directly (bypassing ingress)
kubectl port-forward -n {namespace} svc/{service} 8080:80
curl http://localhost:8080/health
Common causes:
- No healthy pods → fix pod issue first
- VirtualService host mismatch → verify hostname matches DNS record
- Service selector mismatch → kubectl get endpoints -n {namespace} (should show pod IPs)
- Istio DestinationRule TLS misconfiguration
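For the last cause, the DestinationRule TLS settings can be inspected directly — a TLS mode that disagrees with the mesh's mTLS policy commonly surfaces as 503s at the gateway:

```shell
# Inspect TLS settings on DestinationRules in the namespace.
# ISTIO_MUTUAL vs DISABLE mismatches with the mesh policy cause 503s.
kubectl get destinationrule -n {namespace} -o yaml | grep -B2 -A4 "tls:"
```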
5. Database Connection Refused
Symptoms: App logs show connection refused or JDBC Connection Error for MySQL.
Steps:
# Test DNS resolution from inside the cluster
kubectl exec -it -n {namespace} \
$(kubectl get pod -n {namespace} -l app={service} -o name | head -1) \
-- nslookup microservice-{name}-db.{env}.orofi.xyz
# Test TCP connectivity
kubectl exec -it -n {namespace} \
$(kubectl get pod -n {namespace} -l app={service} -o name | head -1) \
-- nc -zv microservice-{name}-db.{env}.orofi.xyz 3306
# Check Cloud SQL instance status
gcloud sql instances describe orofi-{env}-cloud-{env}-oro-mysql-instance \
--project=orofi-{env}-cloud \
--format="value(state)"
Common causes:
- Cloud SQL instance is stopped → start it via GCP console or gcloud sql instances patch
- DNS record not created → check Terraform for the DNS record
- VPC peering misconfiguration (staging PSC issue) → check Private Service Access in GCP console
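If the instance is stopped, it can be started from the CLI rather than the console — setting the activation policy to ALWAYS is the gcloud equivalent of the Start button:

```shell
# Start a stopped Cloud SQL instance
gcloud sql instances patch orofi-{env}-cloud-{env}-oro-mysql-instance \
  --project=orofi-{env}-cloud \
  --activation-policy=ALWAYS
```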
6. Redis AUTH Failure
Symptoms: App logs show WRONGPASS invalid username-password pair or NOAUTH Authentication required.
Steps:
# Check if Redis secret exists and has correct format
kubectl get secret -n {namespace} | grep redis
# Verify the secret value (base64 decode)
kubectl get secret redis-auth -n {namespace} \
-o jsonpath='{.data.password}' | base64 -d
# Force ESO to re-sync the Redis secret
kubectl annotate externalsecret redis-auth-secret -n {namespace} \
force-sync=$(date +%s) --overwrite
# Restart the pod after secret update
kubectl rollout restart deployment/{service} -n {namespace}
Common cause: Redis auth password was rotated in GCP Secret Manager but the pod hasn't picked up the new value. ESO syncs on schedule (typically 1h) — force a resync or restart the pod.
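To confirm a stale value, compare what the pod sees against the current version in Secret Manager (the secret names here are assumptions — match them to the ExternalSecret spec):

```shell
# Value the pod is using (from the K8s Secret)
kubectl get secret redis-auth -n {namespace} \
  -o jsonpath='{.data.password}' | base64 -d; echo

# Current value in GCP Secret Manager — if these differ, force a resync
gcloud secrets versions access latest \
  --secret={redis-secret-name} --project=orofi-{env}-cloud
```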
7. Kafka Consumer Lag Building Up
Symptoms: Grafana shows increasing consumer lag for a service. Events are not being processed.
Steps:
# Check Kafka pod health
kubectl get pods -n kafka
# Open Kafka UI to inspect consumer groups
# https://kafka-ui.{env}.orofi.xyz
# Check the consumer service logs for errors
kubectl logs -n microservice-{name} -l app=microservice-{name} --tail=200 \
| grep -i "kafka\|consumer\|error"
# Check if topic exists
# (in Kafka UI → Topics → verify topic name)
Common causes:
- Consumer is crashing → fix pod issue
- Topic doesn't exist → create it (check topic configuration in tools/kafka-new/values-{env}.yaml)
- SASL credentials expired → rotate {env}-kafka-secrets
- Consumer stuck on a poison-pill message → [NEEDS TEAM INPUT: document DLQ or skip-offset procedure]
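If Kafka UI is unreachable, lag can also be inspected from the Kafka CLI inside a broker pod (a sketch — the pod name, tool path, and bootstrap address depend on the Kafka distribution deployed in the kafka namespace):

```shell
# Per-partition lag for a consumer group, straight from a broker pod.
# The LAG column shows messages not yet consumed.
kubectl exec -n kafka kafka-0 -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group {consumer-group}
```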
8. Certificate Expired or Not Renewing
Symptoms: Browser shows certificate error. curl returns SSL certificate problem.
See Certificate Rotation Runbook for the full procedure.
Quick check:
# Check certificate status
kubectl get certificate -n istio-system istio-tls-cert
# Check cert-manager logs
kubectl logs -n cert-manager \
-l app=cert-manager --tail=100 | grep -i "error\|certificate"
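To see what is actually served at the edge, inspect the live certificate and its expiry (standard openssl invocation; substitute the real hostname):

```shell
# Fetch the certificate presented by the IngressGateway and print
# its subject and expiry date
echo | openssl s_client -connect {hostname}.{env}.orofi.xyz:443 \
  -servername {hostname}.{env}.orofi.xyz 2>/dev/null \
  | openssl x509 -noout -subject -enddate
```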
9. Pod Stuck in Pending (Resource Pressure)
Symptoms: kubectl get pods shows Pending. kubectl describe pod shows Insufficient cpu or Insufficient memory.
Steps:
# See why a pod can't schedule
kubectl describe pod -n {namespace} {pod-name} | grep -A10 "Events\|Conditions"
# Check node resource usage
kubectl describe nodes | grep -A5 "Allocated resources"
# Check if cluster autoscaler is adding nodes
kubectl get events -A | grep -i "scale\|node"
Solutions:
- Wait for cluster autoscaler to add a node (can take 2–3 minutes)
- If autoscaler is not adding nodes, check autoscaler logs: kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100
- If the cluster is at max nodes (15) and still under pressure: scale up manually via the scale-up-{env} Bitbucket pipeline, or increase the max node count in Terraform
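A quick view of overall node pressure (kubectl top requires metrics-server, which GKE ships by default):

```shell
# Live CPU/memory usage per node — near-saturated numbers across all
# nodes confirm resource pressure rather than a scheduling constraint
kubectl top nodes

# Node count and readiness — compare against the cluster's max (15)
kubectl get nodes
```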
10. KEDA Not Scaling Pods
Symptoms: Consumer lag is high but pods aren't scaling up. KEDA ScaledObject exists.
Steps:
# Check ScaledObject status
kubectl get scaledobject -n {namespace}
kubectl describe scaledobject -n {namespace} {scaledobject-name}
# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100 | grep -i "error\|{namespace}"
# Check HPA generated by KEDA
kubectl get hpa -n {namespace}
Common causes:
- KEDA can't reach the Kafka/MongoDB metrics endpoint → verify connectivity
- ScaledObject minReplicas is set too high → check config
- KEDA operator is unhealthy → kubectl rollout restart deployment -n keda
11. Istio Sidecar Injection Not Happening
Symptoms: Pod has only 1 container (should have 2 — app + istio-proxy). mTLS failures.
Steps:
# Check if sidecar injection label is on the namespace
kubectl get namespace {namespace} --show-labels | grep istio-injection
# Should see: istio-injection=enabled
# If not, add it:
kubectl label namespace {namespace} istio-injection=enabled
# Restart pods to get sidecar injected
kubectl rollout restart deployment/{service} -n {namespace}
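To verify injection took effect after the restart, list each pod's containers — every pod should show two:

```shell
# Each line should list the app container plus istio-proxy
kubectl get pods -n {namespace} -l app={service} \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
```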
12. Workload Identity: Permission Denied
Symptoms: App logs show 403 PERMISSION_DENIED when accessing GCP Secret Manager or Cloud Storage.
Steps:
# Check if the K8s ServiceAccount has the correct annotation
kubectl get serviceaccount -n {namespace} {sa-name} \
-o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
# Should print: {gsa}@{project}.iam.gserviceaccount.com
# Verify the IAM binding exists in GCP
gcloud iam service-accounts get-iam-policy \
{gsa}@orofi-{env}-cloud.iam.gserviceaccount.com \
--project=orofi-{env}-cloud \
| grep {k8s-namespace}
# Check if the GCP SA has the right role on the resource
gcloud secrets get-iam-policy {secret-name} \
--project=orofi-{env}-cloud
Fix: The Terraform modules/service-accounts module creates both the IAM binding and the annotation. If this is missing, add the service account to the Terraform configuration and apply.
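A direct test from inside the pod shows which identity the workload actually has — the GKE metadata server should return the GSA email, not the node's default service account:

```shell
# Ask the metadata server for the effective GCP identity of this pod.
# Expected: {gsa}@orofi-{env}-cloud.iam.gserviceaccount.com
kubectl exec -it -n {namespace} \
  $(kubectl get pod -n {namespace} -l app={service} -o name | head -1) \
  -- curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
```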
13. Dev Cluster Unreachable (Zero-Trust Blocking)
Symptoms: Can't kubectl to dev cluster from your machine. gcloud container clusters get-credentials works but kubectl get nodes times out.
Root cause: The dev cluster has zero-trust firewall rules that only allow traffic from 35.226.57.140/32 (Bitbucket), 10.0.0.0/8, and 11.0.0.0/16. Your IP is not in these ranges.
Fix:
1. Use a VPN or bastion host that routes through the allowed IP ranges
2. [NEEDS TEAM INPUT: document the VPN or bastion host access procedure]
3. Or, temporarily add your IP to the firewall allowlist via Terraform (remember to remove it after)
14. MongoDB Connection Pool Exhaustion
Symptoms: App logs show MongoWaitQueueFullError or connection pool exhausted. KEDA may be scaling MongoDB.
Steps:
# Check current MongoDB connections (in Mongo Express or via mongosh)
# db.serverStatus().connections
# Check KEDA ScaledObject for MongoDB
kubectl get scaledobject -n mongo-db
# Check MongoDB pod count
kubectl get pods -n mongo-db
# Check MongoDB logs
kubectl logs -n mongo-db -l app=mongo-db --tail=100
Solutions:
- KEDA should auto-scale MongoDB replicas (up to 5) when connections exceed 50
- If KEDA is not scaling, see Issue 10
- If connections are legitimate load spike: check for connection leaks in application code
- Emergency: manually scale MongoDB: kubectl scale statefulset mongo-db -n mongo-db --replicas=3
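The connection check from the comments above can be run non-interactively (a sketch — the pod name and any auth flags are assumptions depending on how MongoDB is deployed):

```shell
# Live connection counts — compare "current" against the ~50-connection
# scaling threshold mentioned above
kubectl exec -n mongo-db mongo-db-0 -- \
  mongosh --quiet --eval 'db.serverStatus().connections'
```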
15. ArgoCD Out of Sync After Helm Template Change
Symptoms: After changing a Helm template in the infra repo, ArgoCD shows OutOfSync but the diff looks correct. Auto-sync is not applying.
Root cause: Helm generated a resource that differs in annotations or labels (e.g., helm.sh/chart label changed with chart version bump).
Steps:
# See the exact diff
argocd app diff {app-name}
# Force sync with pruning (removes resources no longer in Helm output)
argocd app sync {app-name} --prune
# If there are immutable fields (e.g., on a Job), you may need to delete and recreate
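When an immutable field does block the sync (most common with Jobs), one recovery sketch — resource names are placeholders:

```shell
# Delete the resource with the immutable field, then let ArgoCD
# recreate it from the current Helm output
kubectl delete job {job-name} -n {namespace}
argocd app sync {app-name}
```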