Velero Test Restore Procedure
Overview
Regular testing of backup restores is critical for disaster recovery readiness. This procedure provides a non-disruptive method to verify backups are restorable and complete.
Why Test Restores?
- Validate backup integrity - Ensure backups actually contain recoverable data
- Verify storage class migrations - Test restoring to different storage infrastructure
- Practice DR procedures - Keep team familiar with restore operations
- Catch issues early - Discover backup problems before a real disaster
- Measure RTO/RPO - Understand actual recovery time and data loss windows
Testing Schedule
Recommended frequency:
- Monthly: Test restore of critical applications (Home Assistant, databases, auth)
- Quarterly: Full cluster restore test to staging environment
- After major changes: Test after storage migrations, Velero upgrades, or infrastructure changes
Test Restore Procedure
Prerequisites
- velero CLI installed
- kubectl access to the cluster with admin privileges
- Recent backup to test (verify with velero backup get)
- Storage class mapping ConfigMap (if testing storage class migration)
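A quick sanity check of these prerequisites before starting (a minimal sketch; assumes cluster-admin-style access):
# Confirm the velero CLI can reach the server (client and server versions are printed)
velero version
# Confirm kubectl has admin-level access
kubectl auth can-i '*' '*' --all-namespaces
# Confirm a recent, Completed backup exists
velero backup get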
Step 1: Select Backup to Test
# List available backups
velero backup get
# Choose a recent completed backup
velero backup describe <backup-name>
# Verify backup includes Pod Volume Backups
velero backup describe <backup-name> --details | grep -A 5 "Pod Volume"
Look for:
- ✅ Phase: Completed
- ✅ Errors: 0
- ✅ Pod Volume Backups: Completed
Step 2: Prepare Test Namespace
Create an isolated test namespace to avoid impacting production:
# Create test namespace
kubectl create namespace <app>-test
# Label for easy identification
kubectl label namespace <app>-test test=disaster-recovery
# Add PodSecurity labels if application requires privileged mode
# (e.g., Home Assistant needs hostNetwork, NET_ADMIN capabilities)
kubectl label namespace <app>-test \
pod-security.kubernetes.io/enforce=privileged \
pod-security.kubernetes.io/audit=privileged \
pod-security.kubernetes.io/warn=privileged
When to add PodSecurity labels:
- Application uses hostNetwork: true
- Application requires special capabilities (NET_ADMIN, NET_RAW, etc.)
- Application uses hostPorts
- Check the production namespace labels: kubectl get namespace <app> --show-labels (see the sketch below to copy them automatically)
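To avoid guessing, the enforce level can be copied straight from the production namespace. A minimal sketch (assumes the label exists; repeat for audit and warn as needed):
# Mirror the production enforce level onto the test namespace
LEVEL=$(kubectl get namespace <app> -o jsonpath='{.metadata.labels.pod-security\.kubernetes\.io/enforce}')
[ -n "$LEVEL" ] && kubectl label namespace <app>-test \
  pod-security.kubernetes.io/enforce="$LEVEL" --overwrite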
Step 3: Configure Storage Class Mapping (Optional)
If testing storage class migration (e.g., longhorn → proxmox-csi):
# Create or verify storage class mapping ConfigMap
kubectl apply -f /path/to/homelab/k8s/infrastructure/controllers/velero/storage-class-mapping.yaml
# Verify ConfigMap exists
kubectl get configmap -n velero change-storage-class-config -o yaml
See Velero Storage Class Mapping for detailed configuration.
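For reference, a minimal sketch of what that ConfigMap typically looks like using Velero's change-storage-class plugin config format (the longhorn → proxmox-csi mapping is only an example; substitute your own class names):
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    # These labels tell Velero to treat the ConfigMap as RestoreItemAction plugin config
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  # <source-storage-class>: <target-storage-class>
  longhorn: proxmox-csi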
Step 4: Execute Test Restore
# Restore to test namespace with namespace mapping
velero restore create <app>-test-restore-$(date +%Y%m%d-%H%M%S) \
--from-backup <backup-name> \
--include-namespaces <source-namespace> \
--namespace-mappings <source-namespace>:<app>-test \
--wait
# Example for Home Assistant:
velero restore create ha-test-restore-$(date +%Y%m%d-%H%M%S) \
--from-backup home-assistant-manual-20251229-162851 \
--include-namespaces home-assistant \
--namespace-mappings home-assistant:home-assistant-test \
--wait
Command breakdown:
- <app>-test-restore-$(date +%Y%m%d-%H%M%S) - unique restore name with timestamp
- --from-backup <backup-name> - source backup to restore from
- --include-namespaces <source-namespace> - namespace to restore
- --namespace-mappings <source>:<target> - map to the test namespace (non-disruptive)
- --wait - block until the restore completes
Step 5: Monitor Restore Progress
# Check restore status
velero restore get
# Describe restore (watch for completion)
watch -n 5 'velero restore describe <restore-name> | head -30'
# Watch resources being created
kubectl get all,pvc -n <app>-test -w
# Check for errors
velero restore logs <restore-name>
Expected phases:
- New → InProgress (resources being created)
- PVC creation and binding
- Pod Volume Restore (Kopia data restore)
- Completed (all resources restored)
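As an alternative to watching velero restore describe, the Restore CR in the velero namespace can be polled directly. A minimal sketch (the restore name is the one created in Step 4):
# Poll the Restore CR until it reaches a terminal phase
while true; do
  PHASE=$(kubectl get restore -n velero <restore-name> -o jsonpath='{.status.phase}')
  echo "Phase: $PHASE"
  case "$PHASE" in
    Completed|PartiallyFailed|Failed) break ;;
  esac
  sleep 10
done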
Step 6: Verify Restore Success
6.1 Verify Restore Completion
# Check final status
velero restore describe <restore-name>
Expected output:
- ✅ Phase: Completed
- ✅ Errors: 0
- ✅ Warnings: 0 (or only minor warnings about existing resources)
- ✅ Items restored: X/X (all items)
- ✅ Kopia restores - Completed: X (matching the number of PVCs)
6.2 Verify Storage Class Transformation (If Testing Migration)
# Check PVC storage class
kubectl get pvc -n <app>-test
# Verify storage class in PVC spec
kubectl get pvc -n <app>-test -o yaml | grep storageClassName
# Expected: storageClassName: <target-storage-class> (e.g., proxmox-csi)
# Verify the PV was provisioned by the correct CSI driver
PV=$(kubectl get pvc -n <app>-test -o jsonpath='{.items[0].spec.volumeName}')
kubectl get pv "$PV" -o yaml | grep -E "provisioned-by|driver|storageClassName"
# Expected: CSI driver matches the target (e.g., csi.proxmox.sinextra.dev)
Critical checks:
- PVC uses target storage class (NOT source storage class)
- PV provisioned by correct CSI driver
- PVC status is Bound
- No errors in PVC events
6.3 Verify Resource Creation
# List all restored resources
kubectl get all,pvc,configmap,secret,serviceaccount -n <app>-test
# Check pod status
kubectl get pods -n <app>-test
# Wait for pod to be ready
kubectl wait --for=condition=ready pod/<pod-name> -n <app>-test --timeout=300s
Expected resources (varies by application):
- StatefulSet or Deployment
- Pods (Running status)
- Services
- PVCs (Bound status)
- ConfigMaps
- Secrets (may need ExternalSecret sync)
- ServiceAccounts
6.4 Verify Data Integrity
# Check pod is running
kubectl get pods -n <app>-test
# Exec into pod to verify data
kubectl exec -n <app>-test <pod-name> -- ls -la <data-path>
# Check data size matches expected size
kubectl exec -n <app>-test <pod-name> -- du -sh <data-path>
# For applications with databases, verify critical files exist
kubectl exec -n <app>-test <pod-name> -- ls -la <data-path>/.storage
kubectl exec -n <app>-test <pod-name> -- sh -c 'ls -la <data-path>/*.db'  # wrap in sh -c so the glob expands inside the container
Application-specific checks:
Home Assistant:
# Verify .storage directory (contains HA state)
kubectl exec -n home-assistant-test home-assistant-0 -- ls -la /config/.storage
# Check database exists
kubectl exec -n home-assistant-test home-assistant-0 -- ls -lh /config/home-assistant_v2.db
# Verify automations
kubectl exec -n home-assistant-test home-assistant-0 -- cat /config/automations.yaml | head -20
PostgreSQL (CNPG):
# Check cluster status
kubectl get cluster -n <namespace>-test
# Verify database is ready
kubectl wait --for=condition=ready cluster/<cluster-name> -n <namespace>-test --timeout=300s
# Connect and verify data
kubectl exec -n <namespace>-test <cluster-pod> -- psql -U postgres -c "SELECT COUNT(*) FROM <table>;"
6.5 Compare with Production (Optional)
# Compare resource counts
echo "Production:" && kubectl get all -n <source-namespace> | wc -l
echo "Test:" && kubectl get all -n <app>-test | wc -l
# Compare file counts (if applicable)
echo "Production files:" && kubectl exec -n <source-namespace> <pod> -- find <path> -type f | wc -l
echo "Test files:" && kubectl exec -n <app>-test <pod> -- find <path> -type f | wc -l
# Compare data sizes
echo "Production size:" && kubectl exec -n <source-namespace> <pod> -- du -sh <path>
echo "Test size:" && kubectl exec -n <app>-test <pod> -- du -sh <path>
Step 7: Document Test Results
# Save restore details
velero restore describe <restore-name> > ~/test-restore-results-$(date +%Y%m%d).txt
# Add verification results
cat >> ~/test-restore-results-$(date +%Y%m%d).txt <<EOF
## Verification Results
- Restore Phase: Completed
- Errors: 0
- Storage Class: <verified-storage-class>
- PVC Status: Bound
- Pod Status: Running
- Data Size: <verified-size>
- Duration: <total-time>
## Issues Encountered
<any issues or none>
## Conclusion
<success or issues to address>
EOF
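The Duration field can be filled from the Restore CR's status timestamps rather than a stopwatch. A minimal sketch:
# Capture the actual start/completion times of the restore (RTO input)
kubectl get restore -n velero <restore-name> \
  -o jsonpath='start={.status.startTimestamp} end={.status.completionTimestamp}{"\n"}'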
Step 8: Cleanup Test Resources
# Delete test namespace (removes all resources)
kubectl delete namespace <app>-test
# Verify PVs are cleaned up
kubectl get pv | grep <app>-test
# Should show Released or be deleted (depending on reclaim policy)
# Delete test restore (optional)
velero restore delete <restore-name> --confirm
When to keep test namespace:
- Issues found during testing (investigate before cleanup)
- Performance testing needed
- User wants to manually validate application functionality
Troubleshooting
Restore Stuck InProgress
Symptoms: Restore doesn't complete, stays in InProgress phase
Diagnosis:
# Check Velero server logs
kubectl logs -n velero deployment/velero --tail=100
# Check node-agent logs (handles Kopia restore)
kubectl logs -n velero -l name=node-agent --tail=100 | grep -i error
# Check PVC binding status
kubectl get pvc -n <app>-test
kubectl describe pvc -n <app>-test <pvc-name>
Common causes:
- PVC pending (storage provisioner issue)
- Node-agent pod not running
- Kopia repository connection issues
- Insufficient storage capacity
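Because the node-agent performs the Kopia data movement, it is also worth confirming the DaemonSet is healthy on every node. A minimal sketch (assumes the default node-agent name and label):
# One node-agent pod should be Running per node
kubectl get daemonset node-agent -n velero
kubectl get pods -n velero -l name=node-agent -o wide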
PVC Won't Bind
Symptoms: PVC shows Pending status after restore
Diagnosis:
# Check storage class exists
kubectl get storageclass <target-storage-class>
# Check storage provisioner pods
kubectl get pods -n csi-proxmox # or longhorn-system
# Check PVC events
kubectl describe pvc -n <app>-test <pvc-name>
Common causes:
- Target storage class doesn't exist
- Storage provisioner not running
- Insufficient storage capacity
- Storage class access mode not supported
Pod Won't Start
Symptoms: Pod stuck in Pending, CrashLoopBackOff, or Error state
Diagnosis:
# Check pod status and events
kubectl describe pod -n <app>-test <pod-name>
# Check pod logs
kubectl logs -n <app>-test <pod-name>
# Check PodSecurity violations
kubectl get events -n <app>-test --field-selector reason=FailedCreate
Common causes:
- Missing PodSecurity labels on namespace
- ExternalSecret not synced yet
- ConfigMap or Secret missing
- Resource quotas exceeded
- Image pull errors
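Since an unsynced ExternalSecret is a common cause, a quick check of the test namespace helps (a sketch; assumes the external-secrets operator is installed):
# ExternalSecrets should report Ready and the target Secrets should exist
kubectl get externalsecret -n <app>-test
kubectl get secret -n <app>-test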
Data Not Restored
Symptoms: PVC is bound but empty or incomplete
Diagnosis:
# Check Pod Volume Restore status
velero restore describe <restore-name> --details | grep -A 10 "Pod Volume Restores"
# Check Kopia logs
kubectl logs -n velero -l name=node-agent | grep -i <namespace>
# Verify backup included pod volume backups
velero backup describe <backup-name> --details | grep -A 5 "Pod Volume"
Common causes:
- Backup didn't include pod volume backups (defaultVolumesToFsBackup: false)
- Kopia restore failed (check logs)
- Pod started before data restore completed
- Volume mount path mismatch
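To confirm the first cause, the backup spec can be inspected directly. A minimal sketch (assumes Velero 1.10+ field names):
# Should print "true" if file-system (Kopia) backup of volumes was enabled
kubectl get backup -n velero <backup-name> -o jsonpath='{.spec.defaultVolumesToFsBackup}{"\n"}'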
Best Practices
- Test Regularly: Monthly for critical apps, quarterly for full cluster
- Use Namespace Mapping: Always restore to test namespace first
- Document Results: Keep records of test restore outcomes
- Measure Time: Track RTO (how long restores take)
- Rotate Backups: Test different backup dates (not always the latest)
- Test Failure Scenarios: Intentionally test partial restores and restores of specific resources
- Automate Testing: Create scripts or CronJobs for automated testing
- Verify Completely: Don't just check pod status - verify actual data
Automated Testing
For automated monthly testing:
# Example: CronJob for automated test restore
apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-restore-home-assistant
  namespace: velero
spec:
  schedule: "0 4 1 * *" # First day of month at 4 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: velero
          containers:
            - name: test-restore
              # NOTE: this script assumes an image that provides sh, jq, and kubectl alongside
              # the velero CLI; the stock velero/velero image may not, so build or choose one that does.
              image: velero/velero:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Pick a backup labelled for the app (items are not guaranteed newest-first)
                  BACKUP=$(velero backup get --selector=app=home-assistant -o json | jq -r '.items[0].metadata.name')
                  velero restore create ha-test-$(date +%Y%m%d) \
                    --from-backup $BACKUP \
                    --include-namespaces home-assistant \
                    --namespace-mappings home-assistant:home-assistant-test
                  # Crude verification window, then cleanup
                  sleep 300
                  kubectl delete namespace home-assistant-test
          restartPolicy: OnFailure
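After applying the CronJob, it can be triggered once by hand to validate it before the first scheduled run. A quick sketch (the manifest file name is a placeholder):
kubectl apply -f test-restore-cronjob.yaml
# Run the job immediately instead of waiting for the schedule
kubectl create job --from=cronjob/test-restore-home-assistant test-restore-manual -n velero
kubectl logs -n velero job/test-restore-manual -f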
Related Documentation
- Velero Storage Class Mapping
- Velero Backup Setup
- Disaster Recovery Scenarios
- Scenario 01: Accidental Deletion
Metrics and Reporting
Track these metrics for each test restore:
- RTO (Recovery Time Objective): How long did the restore take?
- RPO (Recovery Point Objective): How old was the backup?
- Data Integrity: Was all data restored correctly?
- Success Rate: What percentage of test restores succeed?
- Issues Found: What problems were discovered during testing?
Create a tracking spreadsheet or dashboard to monitor trends over time.
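RPO can be derived from the backup's own timestamp, i.e. how old the data was at the moment of restore. A minimal sketch:
# Age of the backup = restore time minus this timestamp
kubectl get backup -n velero <backup-name> -o jsonpath='{.status.startTimestamp}{"\n"}'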