Certificate Rotation Runbook

Severity: High


Symptoms

  • Browser shows NET::ERR_CERT_DATE_INVALID or certificate warning
  • kubectl get certificate -n istio-system istio-tls-cert shows Ready: False
  • cert-manager logs show ACME challenge failed or rate limit exceeded
  • Monitoring alert for certificate expiry within N days

Impact

  • All HTTPS traffic to *.{env}.orofi.xyz fails if the certificate expires
  • External users cannot access any service
  • OAuth2 Proxy authentication fails (requires HTTPS redirect)

Prerequisites

  • kubectl access to the cluster
  • gcloud CLI authenticated (cert-manager uses DNS-01 challenge via Cloud DNS)
  • cert-manager is running in the cert-manager namespace

Steps

Phase 1: Diagnose

1. Check certificate status

# Check the primary TLS certificate
kubectl get certificate -n istio-system

# Get detailed status — look for condition messages
kubectl describe certificate istio-tls-cert -n istio-system

# Check certificate expiry date
kubectl get secret istio-tls-cert -n istio-system \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -dates

2. Check cert-manager controller logs

kubectl logs -n cert-manager \
  -l app=cert-manager \
  --tail=100 | grep -i "error\|istio-tls\|challenge\|order"

3. Check CertificateRequest and Order resources

kubectl get certificaterequest -n istio-system
kubectl get order -n istio-system
kubectl get challenge -n istio-system

# Get details on a failing challenge
kubectl describe challenge -n istio-system {challenge-name}
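
To see the state and failure reason of every challenge at once, rather than describing them one by one, a small wrapper like the one below may help. This is a sketch: the function name challenge_summary is hypothetical, and the column paths assume the acme.cert-manager.io/v1 Challenge schema (spec.dnsName, status.state, status.reason).

```shell
# Print one line per ACME challenge with its domain, state, and reason.
# challenge_summary is a hypothetical helper name, not part of kubectl or cert-manager.
# $1 is the namespace (defaults to istio-system).
challenge_summary() {
  kubectl get challenge -n "${1:-istio-system}" \
    -o custom-columns='NAME:.metadata.name,DOMAIN:.spec.dnsName,STATE:.status.state,REASON:.status.reason'
}
```

Running challenge_summary with no arguments lists the challenges in istio-system; an empty REASON column generally means the challenge has not reported an error yet.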

Phase 2: Fix Automatic Renewal (Preferred)

Case A: DNS-01 challenge failing — service account permissions

The cert-manager GCP service account in the orofi-cloud project needs the roles/dns.admin role on the Cloud DNS zone.

# Print the cert-manager service account annotations
# (if Workload Identity is used, look for iam.gke.io/gcp-service-account)
kubectl get serviceaccount -n cert-manager cert-manager \
  -o jsonpath='{.metadata.annotations}'

# Verify GCP service account has DNS permissions
gcloud projects get-iam-policy orofi-cloud \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/dns.admin"

If missing, add via Terraform in infrastructure-management/modules/helm/ (the cert-manager Helm deployment configuration).

Case B: Let's Encrypt rate limit hit

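Let's Encrypt limits issuance to 50 certificates per registered domain per week, and to 5 certificates per week for the same exact set of hostnames. Repeated re-issuance of the same wildcard certificate is likely to hit the second limit first.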

# Switch to Let's Encrypt staging issuer temporarily
kubectl patch certificate istio-tls-cert -n istio-system \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/issuerRef/name","value":"letsencrypt-staging"}]'

# Wait for the staging cert to issue (it is not browser-trusted, so clients still see
# warnings, but it confirms that the issuance pipeline works)
kubectl get certificate -n istio-system -w

When the rate limit resets (recent issuances are listed at https://crt.sh/?q={domain}), switch back:

kubectl patch certificate istio-tls-cert -n istio-system \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/issuerRef/name","value":"letsencrypt-prod"}]'
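
After switching back, it can be worth confirming that the secret no longer holds a staging certificate. The sketch below assumes that Let's Encrypt staging issuers carry the literal marker "(STAGING)" in their issuer name, which is how the staging hierarchy is currently named; cert_is_staging is a hypothetical helper name.

```shell
# Succeeds (exit 0) if the PEM certificate on stdin was issued by a
# Let's Encrypt staging CA. Assumption: staging issuer names contain "(STAGING)".
# cert_is_staging is a hypothetical helper name.
cert_is_staging() {
  openssl x509 -noout -issuer | grep -qF "(STAGING)"
}
```

For example, piping the decoded tls.crt from the istio-tls-cert secret (kubectl get secret ... | base64 -d) into cert_is_staging and getting a non-zero exit status suggests the production certificate is in place.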

Case C: Force certificate renewal

# Delete the current certificate secret (cert-manager will re-issue)
kubectl delete secret istio-tls-cert -n istio-system

# cert-manager will detect the missing secret and trigger a new certificate request
# Watch for the new cert to be issued
kubectl get certificate -n istio-system -w

Alternatively, trigger renewal without deleting the secret by using cmctl, the cert-manager CLI (it must be installed separately):

cmctl renew istio-tls-cert -n istio-system
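
Instead of watching interactively, a script can block until the Certificate reports Ready. The following is a minimal sketch built on kubectl wait; the function name wait_for_cert_ready and the 300s default timeout are assumptions for illustration.

```shell
# Block until the named Certificate has condition Ready=True, or time out.
# wait_for_cert_ready is a hypothetical helper name.
# $1 = Certificate name, $2 = namespace, $3 = timeout (default 300s, an arbitrary choice)
wait_for_cert_ready() {
  kubectl wait "certificate/$1" -n "$2" \
    --for=condition=Ready --timeout="${3:-300s}"
}
```

Usage: wait_for_cert_ready istio-tls-cert istio-system. Note that if the Certificate still reports Ready when the command starts (for example, before cert-manager has noticed the change), the wait returns immediately, so check the expiry date afterwards as well.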

Phase 3: Manual Certificate (Emergency)

If cert-manager automation is completely broken and the certificate is already expired, issue a certificate manually.

Temporary measure only

A manually issued certificate bypasses cert-manager and must be replaced with an automated certificate within the validity period. Set a calendar reminder.

Option A: Use certbot locally

# Install certbot with Google DNS plugin
pip install certbot certbot-dns-google

# Issue certificate (DNS-01 challenge via Cloud DNS)
certbot certonly \
  --dns-google \
  --dns-google-credentials /path/to/service-account.json \
  -d "*.stage.orofi.xyz" \
  --server https://acme-v02.api.letsencrypt.org/directory

# The certificate will be in /etc/letsencrypt/live/stage.orofi.xyz/
# (certbot drops the "*." prefix from the directory name)
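
Before importing a manually issued certificate, it may be worth confirming that the certificate and private key belong together, since a mismatched pair would break TLS at the gateway. The sketch below compares the public key embedded in the certificate with the one derived from the private key; cert_matches_key is a hypothetical helper name.

```shell
# Succeeds (exit 0) if the certificate in $1 and the private key in $2
# share the same public key. Returns 2 if either file cannot be parsed.
# cert_matches_key is a hypothetical helper name.
cert_matches_key() {
  cert_pub="$(openssl x509 -noout -pubkey -in "$1")" || return 2
  key_pub="$(openssl pkey -pubout -in "$2")" || return 2
  [ "$cert_pub" = "$key_pub" ]
}
```

For example, from inside the certbot live directory: cert_matches_key fullchain.pem privkey.pem && echo "pair OK". Reading that directory normally requires root.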

Option B: Import into Kubernetes manually

# Replace the TLS secret with the certificate issued in Option A
# (certbot stores a wildcard certificate under the base domain, without the "*." prefix)
kubectl create secret tls istio-tls-cert \
  -n istio-system \
  --cert=/etc/letsencrypt/live/stage.orofi.xyz/fullchain.pem \
  --key=/etc/letsencrypt/live/stage.orofi.xyz/privkey.pem \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Istio IngressGateway to pick up the new cert
kubectl rollout restart deployment/istio-ingressgateway -n istio-system

Verification

# Certificate is Ready
kubectl get certificate -n istio-system istio-tls-cert
# READY should be True

# Check expiry date
kubectl get secret istio-tls-cert -n istio-system \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -dates
# notAfter should be at least 60 days from now

# Test from outside
curl -v https://argocd.{env}.orofi.xyz 2>&1 | grep -i "SSL certificate"
# Expect "SSL certificate verify ok." (wording from OpenSSL-based curl builds);
# "SSL certificate problem: ..." means verification failed
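
The 60-day expiry check above can be turned into a pass/fail test instead of reading the dates by eye. This is a minimal sketch using openssl's -checkend option; cert_expires_within is a hypothetical helper name, and the threshold is whatever the caller passes in.

```shell
# Succeeds (exit 0) if the PEM certificate on stdin expires within $1 days.
# cert_expires_within is a hypothetical helper name.
# openssl -checkend exits non-zero when the certificate expires inside the
# given number of seconds, so the result is negated here.
cert_expires_within() {
  ! openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}
```

For example, piping the decoded tls.crt from the istio-tls-cert secret into cert_expires_within 60 and getting a non-zero exit status indicates at least 60 days of validity remain. An unreadable certificate also makes openssl fail, so the helper reports it as expiring, which errs on the side of caution.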

Post-Incident

  1. Investigate why automatic renewal failed (cert-manager logs, DNS permissions)
  2. Set up a certificate expiry alert in Grafana/Prometheus for certificates expiring within 30 days
  3. Test certificate renewal process in dev before it matters in production

Escalation

  • cert-manager GitHub issues for known bugs: https://github.com/cert-manager/cert-manager/issues
  • GCP Cloud DNS support if DNS-01 challenge is failing due to API errors

See Also