Scenario 7: Bad Configuration Change

Symptoms

ArgoCD deployed broken configuration causing service outages
OpenTofu/Terraform apply destroyed or misconfigured infrastructure
Git commit introduced breaking changes to Kubernetes manifests
Cluster resources deleted or modified by automation
Applications failing to start after configuration update
Services unreachable after infrastructure change
Resource quotas exceeded due to misconfiguration

Impact Assessment

Recovery Time Objective (RTO): 30 minutes - 2 hours
Recovery Point Objective (RPO): Minimal (revert to previous git commit)
Data Loss Risk: Low (application data usually preserved, only configuration affected)
Service Availability: Partial or complete outage until reverted
Blast Radius: Can range from single application to entire cluster

Prerequisites

Git repository access with push permissions
kubectl access to the cluster
ArgoCD CLI or web UI access
OpenTofu/Terraform CLI installed
Knowledge of what changed (git history)
Backup access if data was affected

Recovery Procedure

Step 1: Identify the Bad Change

Determine what changed and when:

# Check recent git commits
cd /home/benjaminsanden/Dokument/Projects/homelab
git log --oneline --decorate --graph -20

# See what files were changed
git show HEAD
git diff HEAD~1

# Check ArgoCD application status
argocd app list
argocd app get <app-name>

# Check which apps are unhealthy
argocd app list --output json | jq '.[] | select(.status.health.status != "Healthy") | {name: .metadata.name, health: .status.health.status}'

# Check recent Kubernetes events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -30

# Check which resources were recently modified
kubectl get all -A -o json | jq -r '.items[] | select(.metadata.creationTimestamp > "'$(date -u -d '1 hour ago' --rfc-3339=seconds)'") | "\(.kind)/\(.metadata.name) in \(.metadata.namespace)"'

Step 2: Quick Assessment of Impact

Determine the severity and scope:

# Check cluster overall health
kubectl get nodes
kubectl get pods -A | grep -v Running
kubectl top nodes

# Check critical services
kubectl -n kube-system get pods
kubectl -n argocd get pods
kubectl -n monitoring get pods

# Check if databases are affected
kubectl get clusters.postgresql.cnpg.io -A
kubectl -n database get pods

# Check persistent volumes
kubectl get pv,pvc -A | grep -v Bound

Step 3: Immediate Mitigation

Choose the fastest recovery path:

Option A: Git Revert (Recommended for most cases)

If the bad change was deployed via GitOps:

cd /home/benjaminsanden/Dokument/Projects/homelab

# View the problematic commit
git log --oneline -5
git show <commit-hash>

# Option 1: Revert the last commit (creates new revert commit)
git revert HEAD
git push origin main

# Option 2: Revert specific commit (if not the latest)
git revert <commit-hash>
git push origin main

# Option 3: Hard reset to previous commit (DESTRUCTIVE - use with caution)
# Only if you're sure no one else pushed commits
git log --oneline -5
git reset --hard <good-commit-hash>
git push --force origin main  # CAUTION: This rewrites history

# After git revert/reset, sync ArgoCD
argocd app sync --all
# Or sync specific app
argocd app sync <app-name>

Option B: Manual Kubernetes Rollback

If the change was applied directly to Kubernetes:

# Rollback a deployment to previous revision
kubectl -n <namespace> rollout undo deployment/<deployment-name>

# Rollback to specific revision
kubectl -n <namespace> rollout history deployment/<deployment-name>
kubectl -n <namespace> rollout undo deployment/<deployment-name> --to-revision=<number>

# Check rollback status
kubectl -n <namespace> rollout status deployment/<deployment-name>

# Rollback statefulset
kubectl -n <namespace> rollout undo statefulset/<statefulset-name>

# View current and previous configurations
kubectl -n <namespace> get deployment <deployment-name> -o yaml > current.yaml
kubectl -n <namespace> rollout history deployment/<deployment-name> --revision=<prev-num> > previous.yaml
diff previous.yaml current.yaml

Option C: ArgoCD Manual Sync to Previous Version

If the app is out of sync:

# Sync to specific git revision
argocd app sync <app-name> --revision <good-commit-hash>

# Or via ArgoCD UI:
# 1. Navigate to application
# 2. Click "History and Rollback"
# 3. Select previous successful deployment
# 4. Click "Rollback"

# Verify sync status
argocd app wait <app-name> --health

Step 4: OpenTofu/Terraform Recovery

If infrastructure was misconfigured or destroyed:

cd /home/benjaminsanden/Dokument/Projects/homelab/tofu

# Check what Terraform thinks changed
tofu plan

# View Terraform state to see what exists
tofu state list
tofu show

# If resources were deleted, restore from state
# Option 1: Revert git changes first
cd ..
git revert HEAD
cd tofu

# Option 2: Import existing resources back into state
# If resources still exist but were removed from state:
tofu import <resource-type>.<name> <resource-id>

# Example: Import Kubernetes namespace
tofu import kubernetes_namespace.auth auth

# If resources were destroyed, recreate them
tofu apply

# Verify no unexpected changes
tofu plan  # Should show "No changes"

If Terraform destroyed critical resources:

# Check Terraform state backup
ls -lah /home/benjaminsanden/Dokument/Projects/homelab/tofu/terraform.tfstate.backup
cp terraform.tfstate.backup terraform.tfstate

# Or restore from git history
git checkout HEAD~1 -- terraform.tfstate
tofu refresh
tofu plan

# Reapply infrastructure
tofu apply

Step 5: Verify Service Recovery

# Check all pods are running
kubectl get pods -A | grep -v "Running\|Completed"

# Check services are accessible
kubectl get svc -A

# Test critical applications
kubectl -n <namespace> port-forward svc/<service> 8080:80
curl http://localhost:8080/health

# Check ingress/loadbalancer status
kubectl get ingress -A
kubectl get svc -l type=LoadBalancer -A

# Verify databases are healthy
kubectl get clusters.postgresql.cnpg.io -A
kubectl -n database exec -it <postgres-pod> -- psql -U postgres -c "SELECT 1;"

# Check persistent volumes are bound
kubectl get pvc -A | grep -v Bound

Step 6: ArgoCD Re-sync All Applications

Ensure all applications are in sync with git:

# Get list of out-of-sync applications
argocd app list --output json | jq -r '.[] | select(.status.sync.status != "Synced") | .metadata.name'

# Sync all applications
argocd app sync --all

# Watch sync progress
watch -n 2 'argocd app list'

# Check for any errors
argocd app list --output json | jq -r '.[] | select(.status.sync.status == "OutOfSync" or .status.health.status != "Healthy") | {name: .metadata.name, sync: .status.sync.status, health: .status.health.status}'

# View application details if issues persist
argocd app get <app-name> --show-operation

# Force sync if needed (will delete resources not in git)
argocd app sync <app-name> --force

Step 7: Data Integrity Check

Verify no data was lost:

# Check PVC status
kubectl get pvc -A

# For PostgreSQL databases, verify data
kubectl -n database exec -it <postgres-pod> -- bash
psql -U postgres

# Inside PostgreSQL:
\l  # List databases
\c <database>  # Connect to database
SELECT COUNT(*) FROM <critical-table>;
SELECT MAX(created_at) FROM <critical-table>;  # Check latest record

# Check application data via API
curl -H "Authorization: Bearer $TOKEN" https://app.example.com/api/health

If data was affected, restore from backup:

# List recent backups
velero backup get | head -10

# Restore only affected namespace's PVCs
velero restore create restore-config-fix-$(date +%Y%m%d-%H%M%S) \
  --from-backup <backup-name> \
  --include-namespaces <namespace> \
  --include-resources persistentvolumeclaims,persistentvolumes

# For database, use CNPG point-in-time recovery
# See: 01-accidental-deletion.md for detailed CNPG recovery

Common Scenarios and Solutions

Scenario 1: ArgoCD Deployed Broken Manifest

# Symptoms: App shows "OutOfSync" or "Degraded"
argocd app get <app-name>

# Solution:
cd /home/benjaminsanden/Dokument/Projects/homelab
git revert HEAD
git push origin main
argocd app sync <app-name>

# Alternative: Edit in place, then update git
kubectl -n <namespace> edit deployment <name>
# Fix the issue
# Then update git to match

Scenario 2: Resource Quotas Exceeded

# Symptoms: Pods show "FailedCreate" with quota errors
kubectl describe pod <pod-name>

# Check quotas
kubectl get resourcequota -A
kubectl describe resourcequota -n <namespace>

# Solution: Revert resource changes or increase quota
git revert HEAD  # If resource requests were increased
# Or increase quota
kubectl -n <namespace> edit resourcequota <quota-name>

Scenario 3: NetworkPolicy Broke Connectivity

# Symptoms: Pods can't communicate, "no route to host"
kubectl -n <namespace> get networkpolicy

# Temporary fix: Delete blocking policy
kubectl -n <namespace> delete networkpolicy <policy-name>

# Permanent fix: Revert git change
git revert HEAD
git push origin main
argocd app sync <app-name>

Scenario 4: ConfigMap/Secret Update Broke App

# Symptoms: App crashes after ConfigMap/Secret change
kubectl -n <namespace> get configmap,secret

# View current vs previous
kubectl -n <namespace> get configmap <name> -o yaml

# Rollback by editing in place
kubectl -n <namespace> edit configmap <name>
# Or restore from git
git show HEAD~1:k8s/apps/<app>/configmap.yaml | kubectl apply -f -

# Restart pods to pick up change
kubectl -n <namespace> rollout restart deployment/<name>

Scenario 5: Helm Chart Upgrade Failed

# Symptoms: Helm release in failed state
helm list -A | grep -i failed

# Check release history
helm history <release-name> -n <namespace>

# Rollback to previous release
helm rollback <release-name> <revision> -n <namespace>

# Or uninstall and reinstall from git
helm uninstall <release-name> -n <namespace>
argocd app sync <app-name>

Post-Recovery Tasks

1. Root Cause Analysis

# Document what went wrong
cat > /home/benjaminsanden/Dokument/Projects/homelab/docs/incidents/bad-config-$(date +%Y%m%d).md <<EOF
# Configuration Change Incident

**Date**: $(date)
**Affected Services**: <list>
**Downtime**: <duration>
**Git Commit**: $(git rev-parse HEAD)

## What Happened
<description of the bad change>

## Impact
- Services affected: <list>
- Users impacted: <number/description>
- Data lost: <none/description>

## Recovery Steps
1. Reverted git commit <hash>
2. Synced ArgoCD applications
3. Verified service restoration

## Root Cause
<why the bad change was introduced>

## Prevention
- [ ] Add pre-commit validation
- [ ] Implement staging environment testing
- [ ] Review change approval process
- [ ] Add monitoring alerts for this failure mode
EOF

2. Implement Change Controls

# Add pre-commit hooks for validation
cd /home/benjaminsanden/Dokument/Projects/homelab
cat > .pre-commit-config.yaml <<EOF
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace

  - repo: https://github.com/norwoodj/helm-docs
    rev: v1.11.0
    hooks:
      - id: helm-docs

  - repo: local
    hooks:
      - id: kubectl-validate
        name: Validate Kubernetes manifests
        entry: bash -c 'kubectl apply --dry-run=client -f'
        language: system
        files: \\.yaml$
        pass_filenames: true
EOF

pre-commit install

3. Add ArgoCD Sync Waves

Prevent cascading failures by controlling deployment order:

# In your Kubernetes manifests, add annotations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database
  annotations:
    argocd.argoproj.io/sync-wave: "1"  # Deploy first
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  annotations:
    argocd.argoproj.io/sync-wave: "2"  # Deploy after database

4. Enable ArgoCD Auto-Rollback

# In ArgoCD Application spec:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      # Rollback on failed deployment
    retry:
      limit: 2
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

5. Set up Staging Environment

# Create staging namespace/cluster
kubectl create namespace staging

# Deploy to staging first, then production
# Update ArgoCD ApplicationSet to deploy to staging first

Troubleshooting

Git Revert Doesn't Fix Issue

# Check if there are multiple bad commits
git log --oneline -10

# Revert multiple commits
git revert HEAD~3..HEAD
git push origin main

# Or reset to known-good commit
git log --oneline -20
# Find last known-good commit
git reset --hard <good-commit-hash>
git push --force origin main

ArgoCD Won't Sync

# Check ArgoCD application status
argocd app get <app-name>

# View sync errors
argocd app logs <app-name> --follow

# Check ArgoCD repo connection
argocd repo list
argocd repo get https://github.com/theepicsaxguy/homelab.git

# Force refresh repository
argocd app sync <app-name> --force --prune

# Check ArgoCD controller logs
kubectl -n argocd logs -l app.kubernetes.io/name=argocd-application-controller --tail=100

Terraform State is Corrupted

# Restore from backup
cd /home/benjaminsanden/Dokument/Projects/homelab/tofu
cp terraform.tfstate.backup terraform.tfstate

# Or restore from git
git log --all --full-history -- terraform.tfstate
git show <commit-hash>:tofu/terraform.tfstate > terraform.tfstate

# Refresh state from actual infrastructure
tofu refresh

# If completely broken, recreate state
# Import all resources one by one
tofu import <resource>.<name> <id>

Changes Reverted but Pods Still Failing

# Old pods may be running with old config
# Force restart all deployments
kubectl -n <namespace> rollout restart deployment --all

# Or delete pods to force recreation
kubectl -n <namespace> delete pods --all

# Check if ConfigMaps/Secrets need updating
kubectl -n <namespace> get configmap,secret -o yaml

# Check image pull errors
kubectl -n <namespace> describe pods | grep -A 5 "Failed"

Prevention Strategies

Immediate Actions

Enable pre-commit hooks: Validate YAML before commit
Require PR reviews: No direct commits to main branch
Add ArgoCD sync windows: Prevent syncs during critical times
Enable ArgoCD notifications: Alert on failed syncs
Document change procedures: Checklist for config changes

Long-term Improvements

# 1. Set up GitHub branch protection
# In GitHub repo settings:
# - Require pull request reviews
# - Require status checks (CI tests)
# - Require signed commits

# 2. Add CI validation pipeline
# .github/workflows/validate.yaml
cat > .github/workflows/validate.yaml <<EOF
name: Validate
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Kubernetes manifests
        run: |
          kubectl apply --dry-run=client -f k8s/ -R
      - name: Validate Terraform
        run: |
          cd tofu
          terraform init -backend=false
          terraform validate
EOF

# 3. Implement staging environment
# Deploy changes to staging first, verify, then promote to production

# 4. Add monitoring alerts
# Alert when ArgoCD apps become unhealthy
# Alert when pods crash loop

# 5. Regular backup testing
# Monthly: Test restoring from backup
# Verify rollback procedures work

Scenario 1: Accidental Deletion - If config change deleted resources
Scenario 2: Disk Failure - If bad change broke storage
Scenario 8: Data Corruption - If config change corrupted data

Symptoms​

Impact Assessment​

Prerequisites​

Recovery Procedure​

Step 1: Identify the Bad Change​

Step 2: Quick Assessment of Impact​

Step 3: Immediate Mitigation​

Option A: Git Revert (Recommended for most cases)​

Option B: Manual Kubernetes Rollback​

Option C: ArgoCD Manual Sync to Previous Version​

Step 4: OpenTofu/Terraform Recovery​

Step 5: Verify Service Recovery​

Step 6: ArgoCD Re-sync All Applications​

Step 7: Data Integrity Check​

Common Scenarios and Solutions​

Scenario 1: ArgoCD Deployed Broken Manifest​

Scenario 2: Resource Quotas Exceeded​

Scenario 3: NetworkPolicy Broke Connectivity​

Scenario 4: ConfigMap/Secret Update Broke App​

Scenario 5: Helm Chart Upgrade Failed​

Post-Recovery Tasks​

1. Root Cause Analysis​

2. Implement Change Controls​

3. Add ArgoCD Sync Waves​

4. Enable ArgoCD Auto-Rollback​

5. Set up Staging Environment​

Troubleshooting​

Git Revert Doesn't Fix Issue​

ArgoCD Won't Sync​

Terraform State is Corrupted​

Changes Reverted but Pods Still Failing​

Prevention Strategies​

Immediate Actions​

Long-term Improvements​

Related Scenarios​

Reference​