Certificate Rotation Runbook
Severity: High
Symptoms
- Browser shows NET::ERR_CERT_DATE_INVALID or a certificate warning
- kubectl get certificate -n istio-system istio-tls-cert shows Ready: False
- cert-manager logs show "ACME challenge failed" or "rate limit exceeded"
- Monitoring alert for certificate expiry within N days
Impact
- All HTTPS traffic to *.{env}.orofi.xyz fails if the certificate expires
- External users cannot access any service
- OAuth2 Proxy authentication fails (requires HTTPS redirect)
Prerequisites
- kubectl access to the cluster
- gcloud CLI authenticated (cert-manager uses DNS-01 challenge via Cloud DNS)
- cert-manager is running in the cert-manager namespace
Steps
Phase 1: Diagnose
1. Check certificate status
# Check the primary TLS certificate
kubectl get certificate -n istio-system
# Get detailed status — look for condition messages
kubectl describe certificate istio-tls-cert -n istio-system
# Check certificate expiry date
kubectl get secret istio-tls-cert -n istio-system \
-o jsonpath='{.data.tls\.crt}' | base64 -d | \
openssl x509 -noout -dates
2. Check cert-manager controller logs
kubectl logs -n cert-manager \
-l app=cert-manager \
--tail=100 | grep -i "error\|istio-tls\|challenge\|order"
3. Check CertificateRequest and Order resources
kubectl get certificaterequest -n istio-system
kubectl get order -n istio-system
kubectl get challenge -n istio-system
# Get details on a failing challenge
kubectl describe challenge -n istio-system {challenge-name}
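The log and challenge output gathered in Phase 1 can be mapped to one of the Phase 2 cases with a small helper. The following is a minimal sketch: the function name classify_cert_failure, its output labels, and the match patterns are illustrative assumptions, not exact cert-manager log formats, so treat the result as a hint rather than a diagnosis.

```shell
# classify_cert_failure: read log lines on stdin and print a suggested case.
# The patterns below are assumptions chosen for illustration; adjust them to
# the messages your cert-manager version actually emits.
classify_cert_failure() {
  local logs
  logs="$(cat)"
  if printf '%s\n' "$logs" | grep -qi 'rate limit'; then
    echo "case-b-rate-limit"
  elif printf '%s\n' "$logs" | grep -qiE 'permission|forbidden|403'; then
    echo "case-a-dns-permissions"
  elif printf '%s\n' "$logs" | grep -qi 'challenge'; then
    echo "inspect-challenge"
  else
    echo "unknown"
  fi
}

# Example usage (needs cluster access, so it is left commented out):
#   kubectl logs -n cert-manager -l app=cert-manager --tail=100 | classify_cert_failure
```

Rate-limit messages are checked first because they usually mention the challenge or order as well.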
Phase 2: Fix Automatic Renewal (Preferred)
Case A: DNS-01 challenge failing — service account permissions
The cert-manager service account in the orofi-cloud project needs the roles/dns.admin role so that it can create DNS-01 challenge records in the Cloud DNS zone.
# Show the annotations on the cert-manager service account
kubectl get serviceaccount -n cert-manager cert-manager -o jsonpath='{.metadata.annotations}'
# Verify GCP service account has DNS permissions
gcloud projects get-iam-policy orofi-cloud \
--flatten="bindings[].members" \
--filter="bindings.role:roles/dns.admin"
If the binding is missing, add it via Terraform in infrastructure-management/modules/helm/ (the cert-manager Helm deployment configuration).
Case B: Let's Encrypt rate limit hit
Let's Encrypt limits issuance to 50 certificates per registered domain per week.
# Switch to Let's Encrypt staging issuer temporarily
kubectl patch certificate istio-tls-cert -n istio-system \
--type='json' \
-p='[{"op":"replace","path":"/spec/issuerRef/name","value":"letsencrypt-staging"}]'
# Wait for staging cert to issue (it won't be browser-trusted but will work functionally)
kubectl get certificate -n istio-system -w
When the rate limit resets (check recent issuances at https://crt.sh/?q={domain}), switch back to the production issuer:
kubectl patch certificate istio-tls-cert -n istio-system \
--type='json' \
-p='[{"op":"replace","path":"/spec/issuerRef/name","value":"letsencrypt-prod"}]'
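To judge whether the weekly limit has likely reset, it can help to count how many certificates were issued in the last 7 days. The sketch below assumes you have already extracted issuance timestamps (for example from crt.sh) into one timestamp per line in a format GNU date can parse. The function name count_recent_issuances is hypothetical, and the 7-day window is only an approximation of how Let's Encrypt accounts for the limit.

```shell
# count_recent_issuances [NOW_EPOCH]
# Reads one timestamp per line on stdin and prints how many fall within the
# 7 days before NOW_EPOCH (defaults to the current time). Requires GNU date.
count_recent_issuances() {
  local now="${1:-$(date +%s)}"
  local cutoff=$(( now - 7 * 86400 ))
  local count=0
  local ts issued
  while IFS= read -r ts; do
    [ -n "$ts" ] || continue
    # Skip lines that GNU date cannot parse instead of aborting.
    issued="$(date -u -d "$ts" +%s 2>/dev/null)" || continue
    if [ "$issued" -ge "$cutoff" ] && [ "$issued" -le "$now" ]; then
      count=$(( count + 1 ))
    fi
  done
  echo "$count"
}

# Example usage (issuances.txt is a hypothetical file of timestamps):
#   count_recent_issuances < issuances.txt
```

A count close to 50 suggests staying on the staging issuer a little longer before switching back.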
Case C: Force certificate renewal
# Delete the current certificate secret (cert-manager will re-issue)
kubectl delete secret istio-tls-cert -n istio-system
# cert-manager will detect the missing secret and trigger a new certificate request
# Watch for the new cert to be issued
kubectl get certificate -n istio-system -w
Alternatively, trigger renewal without deleting the secret by using the cmctl CLI (requires cmctl to be installed):
cmctl renew istio-tls-cert -n istio-system
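Instead of watching the certificate interactively with -w, a script can block until the Ready condition is set. This is a minimal sketch built on kubectl wait. The function name wait_for_cert_ready and the 300s default timeout are assumptions for illustration.

```shell
# wait_for_cert_ready [NAME] [NAMESPACE] [TIMEOUT]
# Blocks until the Certificate reports condition Ready, or prints the
# describe output to stderr and returns 1 if the timeout is reached.
wait_for_cert_ready() {
  local name="${1:-istio-tls-cert}"
  local namespace="${2:-istio-system}"
  local timeout="${3:-300s}"   # assumed default, tune for your issuer
  if kubectl wait --for=condition=Ready "certificate/${name}" \
      -n "$namespace" --timeout="$timeout" >/dev/null; then
    echo "certificate ${name} is Ready"
  else
    echo "certificate ${name} not Ready after ${timeout}" >&2
    kubectl describe certificate "$name" -n "$namespace" >&2
    return 1
  fi
}

# Example usage (needs cluster access, so it is left commented out):
#   wait_for_cert_ready istio-tls-cert istio-system 600s
```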
Phase 3: Manual Certificate (Emergency)
If cert-manager automation is completely broken and the certificate is already expired, issue a certificate manually.
Temporary measure only
A manually issued certificate bypasses cert-manager and must be replaced with an automated certificate within the validity period. Set a calendar reminder.
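For the calendar reminder, the replacement deadline can be derived from the certificate's notAfter date. The sketch below assumes GNU date and a 30-day lead time. Both the function name replacement_deadline and the lead time are hypothetical choices, not values taken from this runbook.

```shell
# replacement_deadline NOT_AFTER [LEAD_DAYS]
# Prints the date (UTC, YYYY-MM-DD) that is LEAD_DAYS before the expiry.
# NOT_AFTER may be the raw "notAfter=..." line printed by openssl x509.
replacement_deadline() {
  local not_after="${1#notAfter=}"
  local lead_days="${2:-30}"   # assumed lead time
  local expiry
  expiry="$(date -u -d "$not_after" +%s)" || return 1
  date -u -d "@$(( expiry - lead_days * 86400 ))" +%Y-%m-%d
}

# Example usage (fullchain.pem is the manually issued certificate):
#   replacement_deadline "$(openssl x509 -noout -enddate -in fullchain.pem)"
```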
Option A: Use certbot locally
# Install certbot with Google DNS plugin
pip install certbot certbot-dns-google
# Issue certificate (DNS-01 challenge via Cloud DNS)
certbot certonly \
--dns-google \
--dns-google-credentials /path/to/service-account.json \
-d "*.stage.orofi.xyz" \
--server https://acme-v02.api.letsencrypt.org/directory
# certbot drops the "*." prefix when naming the directory, so the certificate
# will be in /etc/letsencrypt/live/stage.orofi.xyz/
Option B: Import into Kubernetes manually
# Replace the TLS secret with the certificate issued in Option A
kubectl create secret tls istio-tls-cert \
-n istio-system \
--cert=/etc/letsencrypt/live/stage.orofi.xyz/fullchain.pem \
--key=/etc/letsencrypt/live/stage.orofi.xyz/privkey.pem \
--dry-run=client -o yaml | kubectl apply -f -
# Restart Istio IngressGateway to pick up the new cert
kubectl rollout restart deployment/istio-ingressgateway -n istio-system
Verification
# Certificate is Ready
kubectl get certificate -n istio-system istio-tls-cert
# READY should be True
# Check expiry date
kubectl get secret istio-tls-cert -n istio-system \
-o jsonpath='{.data.tls\.crt}' | base64 -d | \
openssl x509 -noout -dates
# notAfter should be at least 60 days from now
# Test from outside
curl -v https://argocd.{env}.orofi.xyz 2>&1 | grep "SSL certificate"
# Expect "SSL certificate verify ok."; "SSL certificate problem" means verification failed
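The "at least 60 days" check on notAfter can be automated rather than read by eye. This is a minimal sketch assuming GNU date. The function name days_until_expiry is hypothetical, and the 60-day threshold is the one stated in the verification steps above.

```shell
# days_until_expiry NOT_AFTER [NOW_EPOCH]
# Prints the whole number of days between NOW_EPOCH (default: now) and the
# expiry date. NOT_AFTER may be the raw "notAfter=..." line from openssl.
# Partial days are truncated toward zero.
days_until_expiry() {
  local not_after="${1#notAfter=}"
  local now="${2:-$(date +%s)}"
  local expiry
  expiry="$(date -u -d "$not_after" +%s)" || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Example usage against the live secret (needs cluster access):
#   not_after="$(kubectl get secret istio-tls-cert -n istio-system \
#     -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate)"
#   [ "$(days_until_expiry "$not_after")" -ge 60 ] || echo "expires too soon" >&2
```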
Post-Incident
- Investigate why automatic renewal failed (cert-manager logs, DNS permissions)
- Set up a certificate expiry alert in Grafana/Prometheus for certificates expiring within 30 days
- Test certificate renewal process in dev before it matters in production
Escalation
- cert-manager GitHub issues for known bugs: https://github.com/cert-manager/cert-manager/issues
- GCP Cloud DNS support if DNS-01 challenge is failing due to API errors