Disaster Recovery: Talos + Longhorn
This document provides a general walkthrough for rebuilding a Talos cluster and restoring Longhorn volumes from S3 backups. For specific disaster scenarios with step-by-step procedures, use the scenario-based guides in the Disaster Recovery Scenarios documentation.
Prerequisites
Before starting the recovery process, ensure you have:
- S3 Access: Full access to the S3 bucket containing your Longhorn backups
- Administrative Access: Cluster admin privileges and `kubectl` configured
- GitOps Tools: Working knowledge of your GitOps setup (ArgoCD, etc.)
- Longhorn Access: Access to Longhorn UI or CLI/API tools
- Infrastructure Code: Your OpenTofu/Terraform configurations ready
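Assuming the AWS CLI and `kubectl` are available, a quick preflight along these lines can confirm the first two prerequisites (the bucket name is a placeholder, and the `kubectl` check only applies once a cluster is reachable):
# Confirm access to the Longhorn backup bucket (bucket name is a placeholder)
aws s3 ls s3://<longhorn-backup-bucket>/
# Confirm cluster-admin rights with the current kubeconfig
kubectl auth can-i '*' '*' --all-namespaces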
Recovery Workflow Overview
The disaster recovery process follows these sequential phases:
1. Infrastructure Rebuild - Recreate the Talos cluster and core components
2. Storage Preparation - Deploy and configure Longhorn with S3 connectivity
3. Application Deployment - Restore applications and create PVCs
4. Data Restoration - Restore volumes from S3 backups
5. Verification - Validate the complete recovery
Phase 1: Infrastructure Rebuild
1.1 Recreate the Talos Cluster
Rebuild your cluster infrastructure using OpenTofu:
# Clean up existing resources
tofu destroy
# Deploy the infrastructure
tofu apply
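If your OpenTofu configuration exposes the Talos client config as an output (the output name below is an assumption about your setup), you can pull fresh credentials for the new cluster before continuing:
# Export the Talos client config (output name is an assumption; adjust to your module)
tofu output -raw talosconfig > ~/.talos/config
# Fetch a kubeconfig for the new cluster via the Talos API
talosctl kubeconfig --nodes <control-plane-ip>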
1.2 Deploy Core Infrastructure Components
Deploy the essential infrastructure components in the correct order:
# Deploy networking (Cilium)
kustomize build --enable-helm infrastructure/network/ | kubectl apply -f -
# Deploy CRDs
kubectl apply -k infrastructure/crds
# Deploy External Secrets Operator
kustomize build --enable-helm infrastructure/controllers/external-secrets/ | kubectl apply -f -
# Deploy Cert Manager
kustomize build --enable-helm infrastructure/controllers/cert-manager/ | kubectl apply -f -
# Configure Bitwarden access token for External Secrets
kubectl create secret generic bitwarden-access-token \
--namespace external-secrets \
--from-literal=token=<your-token>
# Reapply networking to ensure complete configuration
kustomize build --enable-helm infrastructure/network/ | kubectl apply -f -
# Deploy Longhorn storage
kustomize build --enable-helm infrastructure/storage/longhorn/ | kubectl apply -f -
# Deploy remaining infrastructure components
kustomize build --enable-helm infrastructure/ | kubectl apply -f -
⚠️ Important: Don't deploy applications with persistent volumes yet. This phase only sets up the core infrastructure.
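Before moving on, confirm the core components are healthy. A minimal check might look like this (the last line assumes the Cilium CLI is installed):
# Nodes should be Ready once Cilium is up
kubectl get nodes -o wide
# Core infrastructure pods should be Running
kubectl get pods -n kube-system
kubectl get pods -n longhorn-system
# Optional: verify the CNI with the Cilium CLI, if installed
cilium status --wait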
Phase 2: Storage Preparation
2.1 Verify Longhorn Health and S3 Connectivity
1. Access Longhorn UI
   - Use the LoadBalancer IP or the configured domain (if the ACME certificate has been issued)
   - Navigate to the Longhorn dashboard
2. Prepare Nodes for Restore
   - Go to the Node section
   - Disable scheduling on all nodes to prevent automatic PVC creation during restore (a CLI alternative is sketched below)
3. Verify S3 Backup Connection
   - Navigate to Settings → Backup Target
   - Confirm the S3 connection is active and backups are visible
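If you prefer the CLI over the UI for step 2, scheduling can also be disabled by patching the Longhorn Node custom resources (field names follow the `longhorn.io` CRDs; verify against your Longhorn version):
# List Longhorn node objects
kubectl -n longhorn-system get nodes.longhorn.io
# Disable scheduling on a node (repeat per node)
kubectl -n longhorn-system patch nodes.longhorn.io <node-name> \
  --type merge -p '{"spec":{"allowScheduling":false}}'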
2.2 Prepare for Volume Restoration
- Review Available Backups
  - Go to the Backup section
  - Set page results to 100 to see all backups
  - Identify the backups you need to restore
Note: Keep scheduling disabled until after volume restoration is complete.
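The backups visible in the UI can also be listed from the CLI via the Longhorn custom resources, which is handy when comparing against the PVCs you expect to restore:
# Backup volumes and individual backups synced from the S3 backup target
kubectl -n longhorn-system get backupvolumes.longhorn.io
kubectl -n longhorn-system get backups.longhorn.io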
Phase 3: Application Deployment
3.1 Deploy Applications and Create PVCs
Redeploy your applications using your GitOps workflow. This creates the PVCs that will be bound to the restored volumes:
# Using ArgoCD
argocd app sync <your-app>
# Or direct kubectl application
kubectl apply -f k8s/applications/
Expected State: Applications will be in pending state, waiting for persistent volumes. This is normal at this stage.
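Listing the pending PVCs now gives you the exact names to use as volume names in the next phase:
# Namespace, PVC name, and phase for every claim (all should be Pending at this point)
kubectl get pvc -A -o custom-columns='NAMESPACE:.metadata.namespace,PVC:.metadata.name,STATUS:.status.phase'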
Phase 4: Data Restoration
4.1 Restore Volumes from S3 Backups
Via Longhorn UI (Recommended for small numbers of volumes):
- Navigate to Backup section
- For batch restoration:
- Select all required backups
- Click Restore Latest Backup
- For individual volumes:
- Go to Volumes → Create Volume from Backup
- Select the appropriate backup from S3
- Critical: Use the exact PVC name as the volume name
- Complete the restore process
4.2 Enable Node Scheduling
After all volumes are restored:
- Navigate to Node section
- Enable scheduling on all nodes
- Verify nodes show as schedulable
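The same can be done from the CLI by patching every Longhorn Node object in one go (same assumption about CRD field names as in Phase 2):
# Re-enable scheduling on all Longhorn nodes
for node in $(kubectl -n longhorn-system get nodes.longhorn.io -o name); do
  kubectl -n longhorn-system patch "$node" --type merge -p '{"spec":{"allowScheduling":true}}'
done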
4.3 Alternative: Automated Restoration
For large-scale deployments with many volumes, consider using automation scripts or the Longhorn API; a sketch using the Longhorn Volume CRD is shown below. Reference implementation details can be found in Longhorn Issue #1867.
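One way to script the restore is to create Longhorn Volume custom resources with `fromBackup` set, one per PVC. The sketch below is illustrative only: the field names and the backup URL format should be verified against your Longhorn version and backup target before use.
# Restore a single volume from S3 by creating a Volume CR (all values are placeholders)
kubectl apply -f - <<'EOF'
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: <pvc-name>              # must match the PVC name, as in the UI flow above
  namespace: longhorn-system
spec:
  fromBackup: "s3://<bucket>@<region>/<path>?backup=<backup-name>&volume=<backup-volume-name>"
  numberOfReplicas: 3
  frontend: blockdev
EOF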
Phase 5: Verification and Validation
5.1 Check Resource Status
Monitor the recovery progress:
# Check PVC and Pod status across all namespaces
kubectl get pvc,pods -A
# Watch for status changes
kubectl get pvc,pods -A -w
Expected Results:
- All PVCs should show `Bound` status
- Pods should transition from `Pending` to `Running`
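With a recent `kubectl` you can also block until every claim is bound instead of watching manually:
# Wait up to 10 minutes for all PVCs in all namespaces to reach Bound
kubectl wait pvc --all -A --for=jsonpath='{.status.phase}'=Bound --timeout=10m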
5.2 Verify Longhorn Volume Health
In the Longhorn UI:
- Navigate to Volume section
- Confirm all volumes show `Healthy` status
- Verify volumes are attached to appropriate nodes
- Check replica status and distribution
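The same information is available from the CLI via the Longhorn Volume custom resources (status field names follow the `v1beta2` CRD; verify for your version):
# Summary of Longhorn volumes with state and robustness
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns='NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness'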
5.3 Application-Level Verification
Test your applications to ensure data integrity:
# Check application logs
kubectl logs -n <namespace> <pod-name>
# Verify application functionality
kubectl exec -n <namespace> <pod-name> -- <verification-command>
5.4 Data Integrity Checks
Perform application-specific data validation:
- Database connectivity and data consistency
- File system integrity for file-based applications
- Application-specific health checks
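For a database-backed application, a simple consistency spot check might look like the following (namespace, pod, and query are hypothetical placeholders; adapt them to your workload):
# Confirm the database accepts connections and key tables contain data
kubectl exec -n <namespace> <postgres-pod> -- \
  psql -U <user> -d <database> -c 'SELECT count(*) FROM <important_table>;'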
Troubleshooting Common Issues
Real-World Issues Encountered
PVC Immutability Errors
When restoring volumes, you may encounter errors like:
The PersistentVolumeClaim "pinepods-downloads" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests and volumeAttributesClassName for bound claims
This means you cannot change accessModes or other immutable fields on an existing PVC. To fix, delete and recreate the PVC with the correct spec before restoring data.
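A minimal recovery path, using the `pinepods-downloads` claim from the error above as the example, is to capture the manifest, fix the immutable fields (ideally in Git), and recreate the claim:
# Save the current PVC definition for reference
kubectl get pvc pinepods-downloads -n <namespace> -o yaml > pinepods-downloads-pvc.yaml
# Edit accessModes/other immutable fields in the saved manifest (or fix the source in Git), then:
kubectl delete pvc pinepods-downloads -n <namespace>
kubectl apply -f pinepods-downloads-pvc.yaml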
Volume Health and Replica Issues
Longhorn volumes may be stuck in a degraded state after restore, preventing pods from starting. To resolve:
- Scale the volume's replica count to 1 (using the Longhorn UI or `kubectl patch`) to force the volume to use a healthy replica; see the sketch below.
- Wait for the volume status to show `Healthy` before proceeding.
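A possible `kubectl patch` for the first step, assuming the Longhorn `v1beta2` Volume CRD:
# Drop the restored volume to a single (healthy) replica
kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> \
  --type merge -p '{"spec":{"numberOfReplicas":1}}'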
CNPG PostgreSQL Recovery Pitfalls
When restoring a CNPG cluster from a VolumeSnapshot, you may see pods stuck in Init or CrashLoopBackOff due to:
- Data directory structure mismatches (e.g., leftover `pgroot` from Zalando, missing `pgdata`)
- Missing or extra files (e.g., `bg_mon` from Zalando, not needed by CNPG)
- Permission issues on restored files
Fixes include:
- Running a job to restructure the data directory (move `pgroot/data` to `pgdata`); a sketch of such a job follows this list
- Removing Zalando-specific files (`bg_mon`, `patroni.dynamic.json`) if not needed
- Ensuring correct ownership and permissions (typically UID/GID 101:26 for CNPG)
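The data-directory fix-up job referenced above might look roughly like this sketch, which mounts the restored PVC and applies the three fixes; the image, paths, and PVC name are placeholders to adapt to your cluster:
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: cnpg-pgdata-fixup
  namespace: <namespace>
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fixup
          image: busybox:1.36
          command:
            - sh
            - -c
            - |
              set -e
              # Move the Zalando-style layout to where CNPG expects it
              if [ -d /restore/pgroot/data ] && [ ! -d /restore/pgdata ]; then
                mv /restore/pgroot/data /restore/pgdata
              fi
              # Remove Zalando-specific leftovers not needed by CNPG
              rm -rf /restore/pgdata/bg_mon /restore/pgdata/patroni.dynamic.json
              # Ownership expected by CNPG (typically 101:26, per the note above)
              chown -R 101:26 /restore/pgdata
          volumeMounts:
            - name: data
              mountPath: /restore
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: <restored-pvc-name>
EOF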
Application Startup Problems After Restore
Applications may fail to start due to:
- Incorrect database credentials (e.g., referencing old secrets)
- Data corruption or missing migrations
- Persistent volume not yet healthy or attached
Troubleshooting steps:
- Review pod logs and events: `kubectl logs` and `kubectl describe pod`
- Check PVC and volume health in the Longhorn UI
- Verify application configuration and secrets are up to date
Post-Recovery Tasks
After successful recovery:
- Update Monitoring: Ensure all monitoring and alerting is functional
- Test Backups: Ensure Longhorn and Velero snapshots run successfully (a quick drill is sketched below)
- Document Changes: Record any configuration changes made during recovery
- Schedule DR Test: Plan the next disaster recovery test
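For the backup test, a Velero drill can be triggered from the CLI (assuming the Velero CLI is installed and pointed at the cluster); Longhorn recurring backups are easiest to confirm from the Backup section of the UI:
# Run a one-off Velero backup and confirm it completes
velero backup create dr-test --include-namespaces <namespace>
velero backup get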
Additional Resources
- Longhorn Backup and Restore Documentation
- CNCF GitOps Disaster Recovery Patterns
- Longhorn Community Automation Discussion
Quick Reference Commands
# Cluster rebuild
tofu destroy && tofu apply
# Infrastructure deployment
kustomize build --enable-helm infrastructure/network/ | kubectl apply -f -
# Status monitoring
kubectl get pvc,pods -A
# Application logs
kubectl logs -n <namespace> <pod-name>