# Scaling Events Runbook
Severity: Medium
## Symptoms

- Pods stuck in `Pending` due to `Insufficient cpu` or `Insufficient memory`
- Cluster autoscaler not adding nodes fast enough during a traffic spike
- Node count at maximum (15) and still under pressure
- After a scale-up pipeline, services take too long to recover
## Impact
- New pod replicas cannot schedule, limiting horizontal scaling
- Traffic spikes may cause service degradation if pods can't scale fast enough
- KEDA-triggered scale events may not complete
## Prerequisites
- kubectl access to the cluster
- gcloud CLI authenticated to the project
- Access to run the Bitbucket manual pipelines (`scale-up-{env}`, `scale-down-{env}`)
## Steps

### Checking Current Capacity
1. Check node utilization

```shell
# List nodes with capacity info
kubectl get nodes -o custom-columns=\
"NAME:.metadata.name,STATUS:.status.conditions[-1].type,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory"

# Check allocated vs available resources per node
kubectl describe nodes | grep -A5 "Allocated resources"

# Top nodes (requires metrics-server)
kubectl top nodes
```
2. Check pending pods

```shell
# See what's pending
kubectl get pods -A --field-selector=status.phase=Pending

# Understand why a pod is pending
kubectl describe pod -n {namespace} {pod-name} | grep -A10 "Events:"
# Look for: "0/N nodes are available: N Insufficient cpu"
```
3. Check cluster autoscaler activity
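A sketch of how to inspect autoscaler activity, reusing the selector this runbook uses in Verification (assumes the autoscaler runs in-cluster; on GKE the managed autoscaler's logs may only be visible in Cloud Logging):

```shell
# Recent autoscaler decisions (label selector assumed, per Verification section)
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50

# Scale-related cluster events, most recent last
kubectl get events -A --sort-by=.lastTimestamp | grep -iE "scale|TriggeredScaleUp"
```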
### Manual Scale-Up (Bitbucket Pipeline)
Use the predefined Bitbucket pipelines for controlled cluster scaling. These use the `k8s-scaler-cross` service account.
Via Bitbucket UI:
1. Go to the infra repository on Bitbucket
2. Pipelines → Run Pipeline
3. Select branch: main
4. Select pipeline: scale-up-dev or scale-up-staging
5. Click Run
Effect: The pipeline scales the GKE node pool to the configured target count.
[NEEDS TEAM INPUT: document what node count scale-up targets. Is it a fixed number or configurable via pipeline variable?]
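To confirm what the pipeline actually did, the cluster's node count can be read back with gcloud (a sketch; cluster and project names follow the pattern used elsewhere in this runbook):

```shell
# Read back the current node count after the pipeline completes
gcloud container clusters describe orofi-{env}-cloud-{env}-k8s-cluster \
  --zone us-central1-a \
  --project orofi-{env}-cloud \
  --format="value(currentNodeCount)"
```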
### Manual Scale-Down (Cost Saving)
Scale to zero nodes when the environment is not needed (e.g., nights/weekends):
Via Bitbucket UI:
1. Select pipeline: scale-down-dev or scale-down-staging
2. Click Run
**Scale-down impact:** All pods are evicted when nodes scale to zero. ArgoCD will redeploy everything when the cluster scales back up. Expect 5–10 minutes for full recovery after scale-up.

**Never scale production to zero:** The scale-down pipelines are for dev and staging only. Production must always have running nodes.
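After scaling back up, recovery can be watched through Argo CD's application status (a sketch, assuming Argo CD is installed in the `argocd` namespace):

```shell
# Watch Argo CD applications converge back to Synced/Healthy after scale-up
kubectl get applications.argoproj.io -n argocd \
  -o custom-columns="NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status"
```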
### Emergency Manual Scaling via gcloud
If the Bitbucket pipeline is unavailable:

```shell
# Scale node pool up
gcloud container clusters resize orofi-{env}-cloud-{env}-k8s-cluster \
  --node-pool default-pool \
  --num-nodes 5 \
  --zone us-central1-a \
  --project orofi-{env}-cloud

# Scale node pool down (use with caution)
gcloud container clusters resize orofi-{env}-cloud-{env}-k8s-cluster \
  --node-pool default-pool \
  --num-nodes 0 \
  --zone us-central1-a \
  --project orofi-{env}-cloud
```
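After either resize, it is worth confirming that the change took effect before moving on (standard kubectl, nothing cluster-specific assumed):

```shell
# Node count as seen by the API server
kubectl get nodes --no-headers | wc -l

# Block until all nodes report Ready, with a 5-minute timeout
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```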
### Increasing Autoscaler Max Nodes
If 15 nodes are not enough during sustained high load, increase the maximum via Terraform:

- Update `max_nodes` in `infrastructure-management/projects/orofi-{env}/k8s.tf`
- Apply: `terraform plan && terraform apply`
- The autoscaler will then be able to provision beyond the old limit
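For orientation, the relevant Terraform typically looks roughly like the following (a hypothetical sketch; the actual wiring in `k8s.tf` may route `max_nodes` through a module and differ in resource names):

```hcl
# Hypothetical sketch of the node pool autoscaling block;
# resource names and variable wiring are assumptions.
resource "google_container_node_pool" "default" {
  name     = "default-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-central1-a"

  autoscaling {
    min_node_count = 1
    max_node_count = var.max_nodes # raise beyond 15 for sustained load
  }
}
```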
### KEDA Autoscaling Troubleshooting
If KEDA is not scaling pods despite high metric values:

```shell
# Check KEDA ScaledObject status
kubectl get scaledobject -A
kubectl describe scaledobject {name} -n {namespace}

# Check the HPA generated by KEDA
kubectl get hpa -n {namespace}
kubectl describe hpa keda-hpa-{scaledobject-name} -n {namespace}

# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100
```
MongoDB KEDA scaling (`mongo-db` namespace):

- Scales replicas from 1 to 5 based on:
  - Connection count > 50
  - CPU > 70%
  - Global lock queue > 5
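Those triggers would typically be expressed in a ScaledObject roughly like this (a hypothetical sketch, assuming the connection-count metric is served via Prometheus; target names, metric names, and the Prometheus address are all assumptions, and the real object in `mongo-db` may differ):

```yaml
# Hypothetical sketch; trigger metadata and target names are assumptions
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mongodb
  namespace: mongo-db
spec:
  scaleTargetRef:
    kind: StatefulSet   # assumed workload kind
    name: mongodb       # assumed target name
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # assumed address
        query: mongodb_connections{state="current"}        # assumed metric
        threshold: "50"
```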
## Verification
```shell
# Pods are scheduling successfully
kubectl get pods -A | grep Pending
# Should be empty or reducing

# Nodes are healthy
kubectl get nodes
# All should show STATUS: Ready

# Cluster autoscaler is working
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=20 \
  | grep "Successfully added\|scaled up"
```
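To keep an eye on scheduling until the backlog clears, the Pending-pod check can be run on a loop (assumes `watch` is available on the workstation):

```shell
# Re-run the Pending-pod check every 10 seconds until the list is empty
watch -n 10 'kubectl get pods -A --field-selector=status.phase=Pending'
```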
## Post-Incident
- If the autoscaler was slow: review the `scan-interval` and `scale-up-delay` settings in the autoscaler config
- If hitting resource limits frequently: review resource requests/limits in the Helm charts; over-requesting is the most common cause
- Document why the scale event happened (traffic spike, deployment, load test, etc.)
## Escalation
- GCP GKE quota limits: raise a quota increase request via GCP console
- [NEEDS TEAM INPUT: platform team contact for sustained capacity issues]