# Scaling Events Runbook
Severity: Medium
## Symptoms

- Pods stuck in `Pending` due to `Insufficient cpu` or `Insufficient memory`
- Cluster autoscaler not adding nodes fast enough during a traffic spike
- Node count at maximum (15) and still under pressure
- After a scale-up pipeline, services take too long to recover
## Impact
- New pod replicas cannot schedule, limiting horizontal scaling
- Traffic spikes may cause service degradation if pods can't scale fast enough
- KEDA-triggered scale events may not complete
## Prerequisites
- kubectl access to the cluster
- gcloud CLI authenticated to the project
- Access to run the Bitbucket manual pipelines (`scale-up-{env}`, `scale-down-{env}`)
## Steps

### Checking Current Capacity
1. Check node utilization

```shell
# List nodes with capacity info
kubectl get nodes -o custom-columns=\
"NAME:.metadata.name,STATUS:.status.conditions[-1].type,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory"

# Check allocated vs available resources per node
kubectl describe nodes | grep -A5 "Allocated resources"

# Top nodes (requires metrics-server)
kubectl top nodes
```
2. Check pending pods

```shell
# See what's pending
kubectl get pods -A --field-selector=status.phase=Pending

# Understand why a pod is pending
kubectl describe pod -n {namespace} {pod-name} | grep -A10 "Events:"
# Look for: "0/N nodes are available: N Insufficient cpu"
```
3. Check cluster autoscaler activity
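A sketch of how to inspect autoscaler activity, reusing the selector this runbook uses in Verification (assumes the autoscaler runs in-cluster; on GKE the managed autoscaler's logs may only be visible in Cloud Logging):

```shell
# Recent autoscaler decisions (label selector assumed, per Verification section)
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50

# Scale-related cluster events, most recent last
kubectl get events -A --sort-by=.lastTimestamp | grep -iE "scale|TriggeredScaleUp"
```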
### Manual Scale-Up (Bitbucket Pipeline)
Use the predefined Bitbucket pipelines for controlled cluster scaling. These use the `k8s-scaler-cross` service account.
Via Bitbucket UI:
1. Go to the infra repository on Bitbucket
2. Pipelines → Run Pipeline
3. Select branch: main
4. Select pipeline: scale-up-dev or scale-up-staging
5. Click Run
Effect: The pipeline scales the GKE node pool to the configured target count.
[NEEDS TEAM INPUT: document what node count scale-up targets. Is it a fixed number or configurable via pipeline variable?]
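To confirm what the pipeline actually did, the cluster's node count can be read back with gcloud (a sketch; cluster and project names follow the pattern used elsewhere in this runbook):

```shell
# Read back the current node count after the pipeline completes
gcloud container clusters describe orofi-{env}-cloud-{env}-k8s-cluster \
  --zone us-central1-a \
  --project orofi-{env}-cloud \
  --format="value(currentNodeCount)"
```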
### Manual Scale-Down (Cost Saving)
Scale to zero nodes when the environment is not needed (e.g., nights/weekends):
Via Bitbucket UI:
1. Select pipeline: scale-down-dev or scale-down-staging
2. Click Run
**Scale-down impact:** All pods are evicted when nodes scale to zero. ArgoCD will redeploy everything when the cluster scales back up. Expect 5–10 minutes for full recovery after scale-up.

**Never scale production to zero:** The scale-down pipelines are for dev and staging only. Production must always have running nodes.
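After scaling back up, recovery can be watched through Argo CD's application status (a sketch, assuming Argo CD is installed in the `argocd` namespace):

```shell
# Watch Argo CD applications converge back to Synced/Healthy after scale-up
kubectl get applications.argoproj.io -n argocd \
  -o custom-columns="NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status"
```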
### Emergency Manual Scaling via gcloud
If the Bitbucket pipeline is unavailable:

```shell
# Scale node pool up
gcloud container clusters resize orofi-{env}-cloud-{env}-k8s-cluster \
  --node-pool default-pool \
  --num-nodes 5 \
  --zone us-central1-a \
  --project orofi-{env}-cloud

# Scale node pool down (use with caution)
gcloud container clusters resize orofi-{env}-cloud-{env}-k8s-cluster \
  --node-pool default-pool \
  --num-nodes 0 \
  --zone us-central1-a \
  --project orofi-{env}-cloud
```
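After either resize, it is worth confirming that the change took effect before moving on (standard kubectl, nothing cluster-specific assumed):

```shell
# Node count as seen by the API server
kubectl get nodes --no-headers | wc -l

# Block until all nodes report Ready, with a 5-minute timeout
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```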
### Increasing Autoscaler Max Nodes
If 15 nodes are not enough during sustained high load, increase the maximum via Terraform:

- Update `max_nodes` in `infrastructure-management/projects/orofi-{env}/k8s.tf`
- Apply: `terraform plan && terraform apply`
- The autoscaler will then be able to provision beyond the old limit
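For orientation, the relevant Terraform typically looks roughly like the following (a hypothetical sketch; the actual wiring in `k8s.tf` may route `max_nodes` through a module and differ in resource names):

```hcl
# Hypothetical sketch of the node pool autoscaling block;
# resource names and variable wiring are assumptions.
resource "google_container_node_pool" "default" {
  name     = "default-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-central1-a"

  autoscaling {
    min_node_count = 1
    max_node_count = var.max_nodes # raise beyond 15 for sustained load
  }
}
```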
### KEDA Autoscaling Troubleshooting
If KEDA is not scaling pods despite high metric values:

```shell
# Check KEDA ScaledObject status
kubectl get scaledobject -A
kubectl describe scaledobject {name} -n {namespace}

# Check the HPA generated by KEDA
kubectl get hpa -n {namespace}
kubectl describe hpa keda-hpa-{scaledobject-name} -n {namespace}

# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100
```
MongoDB KEDA scaling (`mongo-db` namespace):

- Scales replicas from 1 to 5 based on:
  - Connection count > 50
  - CPU > 70%
  - Global lock queue > 5
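Those triggers would typically be expressed in a ScaledObject roughly like this (a hypothetical sketch, assuming the connection-count metric is served via Prometheus; target names, metric names, and the Prometheus address are all assumptions, and the real object in `mongo-db` may differ):

```yaml
# Hypothetical sketch; trigger metadata and target names are assumptions
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mongodb
  namespace: mongo-db
spec:
  scaleTargetRef:
    kind: StatefulSet   # assumed workload kind
    name: mongodb       # assumed target name
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # assumed address
        query: mongodb_connections{state="current"}        # assumed metric
        threshold: "50"
```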
## Verification
```shell
# Pods are scheduling successfully
kubectl get pods -A | grep Pending
# Should be empty or reducing

# Nodes are healthy
kubectl get nodes
# All should show STATUS: Ready

# Cluster autoscaler is working
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=20 \
  | grep "Successfully added\|scaled up"
```
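To keep an eye on scheduling until the backlog clears, the Pending-pod check can be run on a loop (assumes `watch` is available on the workstation):

```shell
# Re-run the Pending-pod check every 10 seconds until the list is empty
watch -n 10 'kubectl get pods -A --field-selector=status.phase=Pending'
```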
## Post-Incident
- If the autoscaler was slow: review the `scan-interval` and `scale-up-delay` settings in the autoscaler config
- If hitting resource limits frequently: review resource requests/limits in the Helm charts; over-requesting is the most common cause
- Document why the scale event happened (traffic spike, deployment, load test, etc.)
## Escalation
- GCP GKE quota limits: raise a quota increase request via GCP console
- [NEEDS TEAM INPUT: platform team contact for sustained capacity issues]