
Rollbacks

Before You Read

Rollbacks should be fast. If you're in the middle of an incident, go straight to the steps. For production outages, also open the Production Outage Runbook.

When to Roll Back

Roll back immediately if:

- A deployment causes the error rate to spike above baseline
- A deployment causes P99 latency to degrade significantly
- Health checks start failing after a deployment
- The application is returning 5xx errors for a newly deployed service

Do not roll back if:

- The issue existed before the deployment (verify with Grafana)
- The deployment is not yet live (a sync is still in progress; wait for it to finish or abort the sync first)
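The error-rate criterion above can be turned into a rough pre-check. This is a minimal sketch, not part of the runbook: `should_rollback`, its arguments, and the default factor of 2 are all hypothetical, and the rates are assumed to be percentages read from Grafana by hand. Tune the factor to your own baseline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether an error-rate spike justifies a rollback.
# Usage: should_rollback <current_rate> <baseline_rate> [factor]
# Rates are percentages (e.g. 4.2); factor defaults to 2 (an assumed value).
should_rollback() {
  local current="$1" baseline="$2" factor="${3:-2}"
  # awk does the floating-point comparison; exit status 0 means "roll back".
  awk -v c="$current" -v b="$baseline" -v f="$factor" \
    'BEGIN { exit !(c > b * f) }'
}

if should_rollback 4.2 0.5; then
  echo "error rate is above threshold: roll back"
else
  echo "error rate is within threshold: keep investigating"
fi
```

The check only covers the error-rate signal. Whether the issue predates the deployment still has to be confirmed in Grafana.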

Rollback Options

There are three approaches, in order of preference:

Option           | Speed  | Complexity | When to Use
ArgoCD Rollback  | Fast   | Low        | Deployed version is in ArgoCD history
Git Revert       | Medium | Low        | Can open and merge a PR quickly
Helm Rollback    | Fast   | Medium     | ArgoCD rollback isn't available

Option 1: ArgoCD Rollback

The cleanest option. ArgoCD keeps a history of recent syncs.

Via ArgoCD UI

  1. Open https://argocd.{env}.orofi.xyz
  2. Find the application (e.g., microservice-identity)
  3. Click History tab
  4. Find the last known-good deployment
  5. Click Rollback → confirm

Via ArgoCD CLI

# List recent sync history
argocd app history microservice-identity

# Rollback to a specific revision (get revision number from history)
argocd app rollback microservice-identity {revision-number}

# Example — rollback to revision 42
argocd app rollback microservice-identity 42

Auto-sync conflict

If auto-sync is enabled, ArgoCD will either refuse the rollback or immediately re-sync to the version in Git, undoing it. To prevent this, disable auto-sync first:

argocd app set microservice-identity --sync-policy none
# ... perform rollback ...
# Re-enable after the situation is stable
argocd app set microservice-identity --sync-policy automated
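The disable-then-rollback sequence can be wrapped in one function so the order is never forgotten mid-incident. This is a sketch under stated assumptions: `rollback_app`, `run`, and the `DRY_RUN` variable are hypothetical names introduced here. Only the two `argocd` commands come from the steps above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print commands instead of running them when DRY_RUN=1 (assumed convention).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper: disable auto-sync, then roll back to the given history ID.
# Auto-sync is deliberately NOT re-enabled here; do that by hand once stable.
rollback_app() {
  local app="$1" revision="$2"
  run argocd app set "$app" --sync-policy none
  run argocd app rollback "$app" "$revision"
  echo "Auto-sync remains disabled for $app; re-enable it once the situation is stable."
}

# Dry-run example; drop DRY_RUN=1 to execute for real.
DRY_RUN=1 rollback_app microservice-identity 42
```

With `DRY_RUN=1` the function only prints the commands it would run, which makes it safe to rehearse outside an incident.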

Option 2: Git Revert

This is the safest long-term approach — it creates an audit trail and ensures the rolled-back state is committed to Git.

# In infrastructure-configuration repo
git log --oneline -- projects/orofi/{env}/{service}/helm/values.yaml

# Identify the commit that introduced the bad version (e.g., abc123);
# the commit before it is the last good state
git revert --no-commit abc123  # or, if it is the most recent commit: git revert --no-commit HEAD

# Edit the revert commit message to be descriptive
git commit -m "revert: rollback microservice-identity to v1.2.2 due to elevated error rate"

# Open a PR — get it merged quickly

After merge, ArgoCD auto-syncs with the reverted values.
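To see what a revert does to a values file, the following self-contained sketch builds a throwaway repository, commits a good and a bad version plus an unrelated change, and reverts only the bad commit. The file names, tags, and commit messages are illustrative and do not reflect the real layout of the infrastructure-configuration repo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a temporary repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q .
# Identity is set per command so the demo does not depend on global git config.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

echo 'tag: v1.2.2' > values.yaml
g add values.yaml
g commit -q -m "deploy v1.2.2"

echo 'tag: v1.2.3' > values.yaml
g commit -q -am "deploy v1.2.3"
bad="$(git rev-parse HEAD)"   # the commit that introduced the bad version

echo 'replicas: 3' > other-service.yaml
g add other-service.yaml
g commit -q -m "unrelated change to another service"

# Revert only the bad commit; later unrelated commits stay in place.
g revert --no-commit "$bad"
g commit -q -m "revert: rollback to v1.2.2 due to elevated error rate"

cat values.yaml
```

Reverting the specific bad commit, rather than everything after the last good one, leaves other services' changes in the shared repo untouched.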

Option 3: Helm Rollback

Use this if ArgoCD is unavailable or unhealthy.

# List Helm release history
helm history microservice-identity -n microservice-identity

# Rollback to the previous release
helm rollback microservice-identity -n microservice-identity

# Or rollback to a specific revision number
helm rollback microservice-identity {revision} -n microservice-identity

ArgoCD drift

After a manual Helm rollback, ArgoCD will detect drift between Git and the cluster and may re-apply the bad version on the next sync. Disable auto-sync (see "Auto-sync conflict" under Option 1), then follow up with a Git revert (Option 2) so that Git matches the cluster.

Argo Rollouts Rollback (Canary Services)

For services using Argo Rollouts (canary deployment):

# Abort the current rollout (reverts to stable version immediately)
kubectl argo rollouts abort microservice-identity -n microservice-identity

# Check rollout status
kubectl argo rollouts get rollout microservice-identity -n microservice-identity

Verifying the Rollback

After any rollback:

# Verify the image tag is the old version
kubectl get deployment microservice-identity -n microservice-identity \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Watch pods stabilize
kubectl rollout status deployment/microservice-identity -n microservice-identity

# Check error rate in Grafana
# https://grafana.{env}.orofi.xyz

Confirm in Grafana that error rates have returned to baseline before declaring success.
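The image-tag check above can be made pass/fail instead of eyeballed. This is a minimal sketch: `image_tag` and `verify_image_tag` are hypothetical helper names, and the registry host in the example is a placeholder. It assumes the image reference ends in `:tag` and does not use a digest.

```shell
#!/usr/bin/env bash
# Extract the tag from an image reference such as registry/app:v1.2.2.
# Assumes a tag is present and no digest (@sha256:...) is used.
image_tag() {
  local image="$1"
  echo "${image##*:}"
}

# Hypothetical check: succeed only if the running image carries the expected tag.
verify_image_tag() {
  local image="$1" expected="$2" actual
  actual="$(image_tag "$image")"
  if [ "$actual" = "$expected" ]; then
    echo "OK: running $actual"
  else
    echo "MISMATCH: running $actual, expected $expected" >&2
    return 1
  fi
}

# In a real check the first argument would come from the kubectl jsonpath query.
verify_image_tag "registry.example.com/microservice-identity:v1.2.2" "v1.2.2"
```

A matching tag only confirms what is deployed. The Grafana check is still what confirms that the service has recovered.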

Post-Rollback Actions

  1. Write up what happened in [NEEDS TEAM INPUT: incident tracking system]
  2. Identify root cause before re-deploying the problematic version
  3. If the bad version was in staging or production, notify affected stakeholders
  4. Re-enable ArgoCD auto-sync once the situation is stable and Git reflects the rolled-back version; otherwise the next sync redeploys the bad version

See Also