# Scenario 4: Rack Fire
## Symptoms
- Complete physical destruction of server rack and all equipment
- Total loss of Proxmox host (host3.peekoff.com)
- All local storage destroyed (NVMe drives, TrueNAS)
- Network equipment destroyed
- No local infrastructure remains
- Local MinIO backups are gone (TrueNAS destroyed)
- Only offsite Backblaze B2 backups and GitHub repository remain intact
## Impact Assessment
- Recovery Time Objective (RTO): 8-24 hours
- Recovery Point Objective (RPO): Up to 1 week (last weekly B2 backup)
- Data Loss Risk: High; everything created since the last weekly B2 backup is lost
- Service Availability: Complete outage until new hardware is provisioned
- Financial Impact: Hardware replacement costs, potential insurance claim
- Rebuilding Requirements: New server hardware, network equipment, storage
## Prerequisites
### Critical Access Requirements
You MUST have access to:
- Bitwarden Account:
  - Master password
  - Contains all credentials needed for recovery
- GitHub Repository:
  - Account: `theepicsaxguy`
  - Repository: `homelab`
  - SSH key or Personal Access Token (stored in Bitwarden)
- Backblaze B2 Account:
  - Account credentials (in Bitwarden)
  - Application keys (in Bitwarden: "backblaze-b2-velero-offsite")
  - Buckets:
    - `homelab-velero-b2` (Velero backups)
    - `homelab-cnpg-b2` (PostgreSQL backups)
    - `homelab-terraform-state` (OpenTofu state)
### Required Hardware
Minimum Hardware Requirements:
- Server: 1x physical server or new Proxmox-capable host
  - CPU: 8+ cores
  - RAM: 64GB minimum (128GB recommended)
  - Storage: 2x NVMe drives (500GB+ each)
  - Network: Gigabit Ethernet
- Network Equipment:
  - Router with VLAN support
  - Managed switch (if needed)
  - Firewall (optional, depending on setup)
### Software Requirements
On your workstation (laptop/desktop):
- `tofu` or `terraform` CLI
- `talosctl` CLI
- `kubectl` CLI
- `velero` CLI
- `argocd` CLI (optional)
- `git` CLI
- SSH client
- Web browser
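Before proceeding, it's worth verifying the required CLIs are actually installed; a minimal sketch (swap `tofu` for `terraform` if that's your tool of choice):

```bash
# Report any missing recovery tools
for cmd in tofu talosctl kubectl velero git; do
  command -v "$cmd" >/dev/null 2>&1 || echo "MISSING: $cmd"
done
```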
## Recovery Procedure
### Phase 1: Acquire New Hardware
#### Step 1: Procure Replacement Hardware
Hardware Shopping List:
- [ ] Server/workstation with virtualization support
  - Check: Intel VT-x/VT-d or AMD-V/AMD-Vi enabled
  - Recommended: Dell PowerEdge, HP ProLiant, or custom build
- [ ] NVMe SSDs (2x minimum)
  - 500GB+ for VM storage
  - 1TB+ recommended for production use
- [ ] RAM (64GB minimum)
  - 4x 16GB or 8x 8GB modules
- [ ] Network equipment
  - Managed switch with VLAN support (if needed)
  - Router (if needed)
- [ ] Proxmox VE installation media
  - Download from: https://www.proxmox.com/en/downloads
  - Create a bootable USB (see the sketch below)
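On a Linux workstation, the bootable USB can be written with `dd`; a hedged sketch (the ISO filename is illustrative, and `/dev/sdX` must be replaced with the actual USB device, which will be wiped):

```bash
# Identify the USB device first; dd destroys everything on the target
lsblk
sudo dd if=proxmox-ve.iso of=/dev/sdX bs=4M status=progress oflag=sync
```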
#### Step 2: Verify Network Configuration
Ensure you can recreate the same network:
- Network: 10.25.150.0/24
- Gateway: 10.25.150.1
- VLAN: 150
- Proxmox Host IP: 10.25.150.3 (or your original IP)
If using a different network:
You'll need to update the OpenTofu configuration before applying.
### Phase 2: Install Base Infrastructure
#### Step 3: Install Proxmox VE
Installation:
1. Boot from the Proxmox installation media
2. Follow the installation wizard:
   - Hostname: host3.peekoff.com (or new hostname)
   - IP Address: 10.25.150.3
   - Netmask: 255.255.255.0
   - Gateway: 10.25.150.1
   - DNS Server: 10.25.150.1 (or 8.8.8.8)
3. Set the root password (store it in Bitwarden immediately!)
4. Complete the installation and reboot
Post-Installation:
# SSH to Proxmox host
ssh [email protected]
# Update system
apt update && apt dist-upgrade -y
# Install useful tools
apt install -y smartmontools lm-sensors ethtool
Configure Storage:
# Identify NVMe drives
lsblk
# Create ZFS pools (recommended)
zpool create -f Nvme1 /dev/nvme0n1
zpool create -f Nvme2 /dev/nvme1n1
# Verify
pvesm status
zpool list
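If the pools don't appear in `pvesm status`, they can be registered as Proxmox storage from the shell; a sketch that reuses the pool names above as storage IDs:

```bash
# Register the ZFS pools as Proxmox VE storage for VM disks
pvesm add zfspool Nvme1 --pool Nvme1 --content images,rootdir
pvesm add zfspool Nvme2 --pool Nvme2 --content images,rootdir
```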
Configure Network:
# Edit network configuration
nano /etc/network/interfaces
# Ensure bridge is configured:
auto vmbr0
iface vmbr0 inet static
address 10.25.150.3/24
gateway 10.25.150.1
bridge-ports eno1
bridge-stp off
bridge-fd 0
# For VLAN 150 support
bridge-vlan-aware yes
# Apply changes
ifreload -a
### Phase 3: Retrieve Infrastructure Code and State
#### Step 4: Clone GitHub Repository
On your workstation:
# Clone homelab repository
git clone [email protected]:theepicsaxguy/homelab.git
cd homelab
# Verify you have the latest code
git checkout main
git pull origin main
#### Step 5: Configure B2 Credentials
Set up Backblaze B2 access for OpenTofu state:
# Get B2 credentials from Bitwarden
# Item: "backblaze-b2-velero-offsite"
# Fields: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
export AWS_ACCESS_KEY_ID="<B2_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<B2_APPLICATION_KEY>"
# Verify credentials work
aws s3 ls s3://homelab-terraform-state \
--endpoint-url=https://s3.us-west-000.backblazeb2.com
#### Step 6: Enable OpenTofu Remote State Backend
Edit `tofu/backend.tf` in the cloned repository:
Uncomment the backend configuration:
terraform {
  backend "s3" {
    # B2 bucket configuration
    bucket = "homelab-terraform-state"
    key    = "proxmox/terraform.tfstate"
    region = "us-west-000"
    # Backblaze B2 S3-compatible endpoint
    endpoint = "https://s3.us-west-000.backblazeb2.com"
    # Required for B2 compatibility
    skip_credentials_validation = true
    skip_metadata_api_check     = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    use_path_style              = false
  }
}
### Phase 4: Deploy Infrastructure with OpenTofu
#### Step 7: Initialize OpenTofu with B2 State
cd /path/to/homelab/tofu
# Initialize OpenTofu with remote backend
tofu init
# Verify state was pulled from B2
tofu show
# You should see your previous infrastructure state
#### Step 8: Update Proxmox Provider Configuration
Create or update Proxmox credentials:
# Get Proxmox API token
# Either create new token in Proxmox UI:
# Datacenter → Permissions → API Tokens → Add
# Or create terraform.auto.tfvars with credentials
cat > terraform.auto.tfvars <<EOF
proxmox = {
name = "host3"
cluster_name = "host3"
endpoint = "https://host3.peekoff.com:8006"
insecure = true
username = "root@pam"
api_token = "<PROXMOX_API_TOKEN>"
}
EOF
chmod 600 terraform.auto.tfvars
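The token can also be created from the Proxmox host shell instead of the UI; a sketch (the token ID `terraform` is arbitrary, and `--privsep 0` grants the token the full permissions of root@pam):

```bash
# Create an API token for OpenTofu; the secret is displayed only once
pveum user token add root@pam terraform --privsep 0
```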
Update Network Configuration (if changed):
If your new network is different, update config.auto.tfvars:
network = {
  gateway     = "10.25.150.1"
  vip         = "10.25.150.10"
  api_lb_vip  = "10.25.150.9"
  cidr_prefix = 24
  dns_servers = ["10.25.150.1"]
  bridge      = "vmbr0"
  vlan_id     = 150
}
#### Step 9: Apply Infrastructure
Deploy VMs with OpenTofu:
# Review what will be created
tofu plan
# Apply infrastructure
tofu apply
# Type 'yes' to confirm
This creates:
- 3 control plane VMs: 10.25.150.11-13
- 3 worker VMs: 10.25.150.21-23
- 2 load balancer VMs: 10.25.150.5-6 (if enabled)
- Talos Linux installed on all VMs
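Before moving on, a quick reachability check against the IPs above saves time if a VM failed to boot:

```bash
# Ping each expected node once; investigate any DOWN entries before bootstrapping
for ip in 10.25.150.11 10.25.150.12 10.25.150.13 \
          10.25.150.21 10.25.150.22 10.25.150.23; do
  ping -c1 -W2 "$ip" >/dev/null 2>&1 && echo "$ip up" || echo "$ip DOWN"
done
```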
### Phase 5: Bootstrap Kubernetes Cluster
#### Step 10: Bootstrap Talos
# Navigate to tofu directory
cd /path/to/homelab/tofu
# Export talosconfig
export TALOSCONFIG=$(pwd)/outputs/talosconfig
# Bootstrap first control plane
talosctl bootstrap -n 10.25.150.11
# Wait for bootstrap (5-10 minutes)
talosctl -n 10.25.150.11 health --wait-timeout 10m
#### Step 11: Configure kubectl Access
# Generate kubeconfig
talosctl -n 10.25.150.11 kubeconfig ~/.kube/config --force
# Set context
kubectl config use-context talos
# Verify cluster
kubectl get nodes
Wait for all nodes to become Ready:
kubectl get nodes -w
# Expected output:
# NAME STATUS ROLES AGE VERSION
# ctrl-00 Ready control-plane 5m v1.34.3
# ctrl-01 Ready control-plane 5m v1.34.3
# ctrl-02 Ready control-plane 5m v1.34.3
# work-00 Ready <none> 5m v1.34.3
# work-01 Ready <none> 5m v1.34.3
# work-02 Ready <none> 5m v1.34.3
### Phase 6: Deploy Core Infrastructure
#### Step 12: Deploy Infrastructure Components
cd /path/to/homelab/k8s
# 1. Apply CRDs
kubectl apply -k infrastructure/crds/
# 2. Deploy External Secrets Operator
kustomize build --enable-helm infrastructure/controllers/external-secrets/ | kubectl apply -f -
# Wait for External Secrets
kubectl -n external-secrets wait --for=condition=available deployment/external-secrets --timeout=300s
kubectl -n external-secrets wait --for=condition=available deployment/external-secrets-cert-controller --timeout=300s
kubectl -n external-secrets wait --for=condition=available deployment/external-secrets-webhook --timeout=300s
# 3. Create Bitwarden access token
kubectl create secret generic bitwarden-access-token \
--namespace external-secrets \
--from-literal=token="<BITWARDEN_ACCESS_TOKEN>"
# Verify connection
kubectl -n external-secrets get clustersecretstore bitwarden-backend
# 4. Deploy Cert Manager
kustomize build --enable-helm infrastructure/controllers/cert-manager/ | kubectl apply -f -
kubectl -n cert-manager wait --for=condition=available deployment/cert-manager --timeout=300s
# 5. Deploy Longhorn
kustomize build --enable-helm infrastructure/storage/longhorn/ | kubectl apply -f -
kubectl -n longhorn-system wait --for=condition=available deployment/longhorn-driver-deployer --timeout=600s
# 6. Deploy all infrastructure
kustomize build --enable-helm infrastructure/ | kubectl apply -f -
#### Step 13: Configure Longhorn B2 Backup (Optional)
Since TrueNAS/MinIO is gone, update Longhorn to use B2 for backups:
# Update Longhorn backup target to B2
# The "longhorn" bucket below assumes a dedicated B2 bucket for Longhorn backups;
# substitute the bucket you actually created.
kubectl -n longhorn-system patch settings.longhorn.io backup-target \
  --type merge \
  --patch '{"value":"s3://longhorn@us-west-000/"}'
# Create Longhorn B2 credentials
kubectl -n longhorn-system create secret generic longhorn-b2-credentials \
--from-literal=AWS_ACCESS_KEY_ID="<B2_KEY_ID>" \
--from-literal=AWS_SECRET_ACCESS_KEY="<B2_APPLICATION_KEY>" \
--from-literal=AWS_ENDPOINTS="https://s3.us-west-000.backblazeb2.com"
# Update backup target credential
kubectl -n longhorn-system patch settings.longhorn.io backup-target-credential-secret \
--type merge \
--patch '{"value":"longhorn-b2-credentials"}'
### Phase 7: Deploy Applications and Restore Data
#### Step 14: Deploy Velero
# Velero should be part of infrastructure deployment
kubectl -n velero get pods
# Verify B2 backup storage location
kubectl -n velero get backupstoragelocations
# Should show: backblaze-b2 (ReadWrite)
#### Step 15: List Available Backups
# Install Velero CLI if not already installed
# Download from: https://github.com/vmware-tanzu/velero/releases
# List B2 backups
velero backup get --storage-location backblaze-b2
# Find latest weekly backup
velero backup get --storage-location backblaze-b2 \
--selector backup-type=weekly-offsite
# Describe latest backup
LATEST_BACKUP=$(velero backup get --storage-location backblaze-b2 \
--selector backup-type=weekly-offsite \
-o json | jq -r '.items | sort_by(.metadata.creationTimestamp) | .[-1].metadata.name')
echo "Latest backup: $LATEST_BACKUP"
velero backup describe $LATEST_BACKUP --details
#### Step 16: Restore Cluster State from B2
# Restore everything except infrastructure namespaces
velero restore create rack-fire-restore-$(date +%Y%m%d-%H%M%S) \
--from-backup $LATEST_BACKUP \
--storage-location backblaze-b2 \
--exclude-namespaces velero,cert-manager,external-secrets,longhorn-system,kube-system,argocd
# Monitor restore progress
velero restore get
velero restore logs rack-fire-restore-<timestamp> -f
# Watch pods come up
kubectl get pods -A -w
#### Step 17: Deploy ArgoCD and Sync Applications
# Verify ArgoCD is running
kubectl -n argocd get pods
# Get admin password
ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d)
echo "ArgoCD admin password: $ARGOCD_PASSWORD"
# Login to ArgoCD
argocd login localhost:8080 \
--username admin \
--password $ARGOCD_PASSWORD \
--port-forward \
--port-forward-namespace argocd
# Sync all applications
argocd app list
argocd app sync -l argocd.argoproj.io/instance=applications
#### Step 18: Restore PostgreSQL Databases from B2
For each CNPG PostgreSQL cluster, restore from B2:
Example for auth namespace (Authentik database):
# Create restore configuration
cat > restore-auth-db.yaml <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: authentik-postgres
  namespace: auth
spec:
  instances: 2
  bootstrap:
    recovery:
      source: b2-backup
      recoveryTarget:
        targetImmediate: true # Restore to latest backup
  externalClusters:
    - name: b2-backup
      barmanObjectStore:
        destinationPath: s3://homelab-cnpg-b2/auth/authentik-postgres
        endpointURL: https://s3.us-west-000.backblazeb2.com
        s3Credentials:
          accessKeyId:
            name: b2-cnpg-credentials
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: b2-cnpg-credentials
            key: AWS_SECRET_ACCESS_KEY
        wal:
          compression: gzip
          encryption: AES256
  storage:
    size: 20Gi
    storageClass: longhorn
EOF
# Apply restore
kubectl apply -f restore-auth-db.yaml
# Monitor recovery
kubectl -n auth get cluster authentik-postgres -w
kubectl -n auth logs -l cnpg.io/cluster=authentik-postgres -c postgres --tail=100 -f
Repeat for all PostgreSQL clusters in your environment.
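To track progress across namespaces, list every CNPG cluster and wait on the one just created; a sketch assuming CNPG's `Ready` condition (condition names can vary between CNPG versions):

```bash
# List all CNPG clusters, then block until the restored one reports Ready
kubectl get clusters.postgresql.cnpg.io -A
kubectl -n auth wait clusters.postgresql.cnpg.io/authentik-postgres \
  --for=condition=Ready --timeout=15m
```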
## Validation
### Check Infrastructure Health
# All nodes should be Ready
kubectl get nodes
# All infrastructure pods should be Running
kubectl get pods -n longhorn-system
kubectl get pods -n velero
kubectl get pods -n cert-manager
kubectl get pods -n external-secrets
kubectl get pods -n argocd
# Check Longhorn UI
kubectl -n longhorn-system get svc longhorn-frontend
# Access via LoadBalancer IP
### Check Application Status
# List all namespaces
kubectl get namespaces
# Check application pods
kubectl get pods -A | grep -v Running | grep -v Completed
# Check PVCs are bound
kubectl get pvc -A | grep -v Bound
# Check services
kubectl get svc -A
### Verify Data Integrity
PostgreSQL:
# Check all CNPG clusters
kubectl get clusters -A
# Verify database connectivity
kubectl -n auth exec -it authentik-postgres-1 -- psql -U postgres
# Check data
SELECT COUNT(*) FROM <critical_table>;
SELECT MAX(created_at) FROM <timestamped_table>;
Applications:
- Access application UIs (a quick HTTP spot-check sketch follows this list)
- Verify user data exists
- Test login/authentication
- Check critical workflows
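A minimal HTTP spot-check sketch; the hostname below is a hypothetical example, so substitute your real ingress hostnames (anything other than a 2xx/3xx status deserves a closer look):

```bash
# Print the HTTP status code for each application endpoint
for host in auth.peekoff.com; do   # hypothetical hostname; use your own
  code=$(curl -sk -o /dev/null -w '%{http_code}' "https://$host")
  echo "$host: $code"
done
```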
### Verify Backups Are Running
# Check Velero schedules
velero schedule get
# Check recent backups
velero backup get | head -10
# Check CNPG backup schedules
kubectl get scheduledbackups -A
# Check Longhorn recurring jobs
kubectl -n longhorn-system get recurringjobs
## Post-Recovery Tasks
### 1. Document the Incident
mkdir -p docs/incidents
cat > docs/incidents/rack-fire-$(date +%Y%m%d).md <<EOF
# Rack Fire Disaster Recovery
**Date**: $(date)
**Incident**: Complete destruction of server rack
**Cause**: <fire cause>
**Recovery Time**: <actual hours from incident to full recovery>
**Data Loss (RPO)**: <days since last B2 backup>
**Hardware Replaced**: <full hardware list>
**Total Cost**: <hardware + time>
## What Happened
<detailed description>
## Recovery Timeline
- T+0h: Incident occurred
- T+<X>h: New hardware acquired
- T+<X>h: Proxmox installed
- T+<X>h: Infrastructure deployed via OpenTofu
- T+<X>h: Cluster bootstrapped
- T+<X>h: Data restored from B2
- T+<X>h: All services operational
## Hardware Purchased
<itemized list with costs>
## Data Loss Assessment
- Last B2 backup: <date/time>
- Data lost: <description of data created after last backup>
- User impact: <description>
## Insurance Claim
- Claim filed: <date>
- Claim number: <number>
- Status: <pending/approved/denied>
## Lessons Learned
- What worked well
- What could be improved
- Documentation gaps
- Infrastructure improvements needed
## Follow-up Actions
- [ ] Submit insurance claim
- [ ] Order spare hardware
- [ ] Increase backup frequency
- [ ] Implement off-site hardware caching
- [ ] Review fire suppression systems
EOF
### 2. Increase Backup Frequency
Since you just experienced total loss, consider more frequent B2 backups:
# Update Velero schedule for more frequent B2 backups
kubectl -n velero edit schedule weekly-offsite-schedule
# Change from weekly to daily:
spec:
  schedule: "0 3 * * *" # Daily at 3 AM UTC instead of Sunday only
### 3. Set Up Local Backup Storage (When Possible)
When you acquire new NAS/storage:
# Deploy TrueNAS or MinIO
# Update Longhorn backup target
# Update Velero default storage location
# Keep B2 as secondary/offsite backup
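Once local storage is rebuilt, a second Velero storage location can sit alongside B2; a sketch assuming a future in-cluster MinIO (the bucket name and s3Url are placeholders):

```bash
# Register local MinIO as an additional Velero backup location
velero backup-location create local-minio \
  --provider aws \
  --bucket velero \
  --config region=minio,s3ForcePathStyle=true,s3Url=http://minio.storage.svc:9000
```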
### 4. Test Backup Restoration Monthly
Schedule regular restore tests:
# Monthly: Test single namespace restore
# Quarterly: Test full cluster restore to test environment
# Annually: Complete DR drill with new hardware
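For the monthly drill, restoring one namespace into a scratch namespace keeps the test non-destructive; a sketch using Velero's namespace mapping:

```bash
# Restore the auth namespace into auth-restore-test, verify, then clean up
velero restore create drill-$(date +%Y%m%d) \
  --from-backup <recent-backup-name> \
  --include-namespaces auth \
  --namespace-mappings auth:auth-restore-test
kubectl get pods -n auth-restore-test
kubectl delete namespace auth-restore-test  # after verifying the restore
```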
### 5. Update Emergency Documentation
# Create printed emergency runbook
# Include:
# - Bitwarden master password location
# - GitHub access method
# - B2 account details
# - Hardware vendor contact info
# - This disaster recovery documentation
# - Network configuration diagrams
# Store printed copy off-site (safe deposit box, friend's house, etc.)
### 6. Consider DR Improvements
Infrastructure Improvements:
- Geographic Redundancy: Consider cloud-based DR site (AWS, Azure, etc.)
- Backup Frequency: Increase to daily or hourly B2 backups
- Hardware Spares: Keep spare server/parts at different location
- Automated Testing: Implement automated backup validation
Financial Improvements:
- Insurance Review: Ensure adequate coverage for hardware
- Budget for Spares: Allocate budget for emergency hardware purchases
## Troubleshooting
### Cannot Access B2 Backups
# Verify credentials
aws s3 ls s3://homelab-velero-b2 \
--endpoint-url=https://s3.us-west-000.backblazeb2.com
# If credentials don't work:
# 1. Login to B2 web console
# 2. Check if keys are still valid
# 3. Generate new application keys if needed
# 4. Update credentials in External Secrets / Bitwarden
### Velero Restore Fails
# Check Velero logs
kubectl -n velero logs deployment/velero -f
# Common issues:
# - Storage class not available: Ensure Longhorn is deployed first
# - Namespace already exists: Use --existing-resource-policy update
# - PVC size conflicts: Delete and recreate PVCs manually
# Retry restore with more options
velero restore create <name> \
--from-backup $LATEST_BACKUP \
--existing-resource-policy update \
--preserve-nodeports
### PostgreSQL Recovery Fails
# Check CNPG cluster status
kubectl -n <namespace> get cluster <cluster-name> -o yaml
# Check pod logs
kubectl -n <namespace> logs <cluster-pod> -c postgres
# Common issues:
# - B2 credentials wrong: Check external secret
# - Backup path wrong: Verify destinationPath in cluster spec
# - Permissions: Check PVC ownership (UID 26 for CNPG)
# Delete and retry with corrected configuration
kubectl -n <namespace> delete cluster <cluster-name>
kubectl apply -f restore-<namespace>-db.yaml
### Network Configuration Different
If you can't use the same IPs (10.25.150.x):
# Update tofu/config.auto.tfvars with new network
# Update DNS/ingress configurations
# Regenerate certificates with new domains
# Update external-dns configurations
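If the subnet change is mechanical, a search-and-replace over the OpenTofu variables covers most of it; a sketch (10.25.160.x is an arbitrary example, and the diff should be reviewed before applying):

```bash
# Rewrite the old subnet in the OpenTofu variables, then review before tofu apply
sed -i 's/10\.25\.150\./10.25.160./g' tofu/config.auto.tfvars
git diff tofu/config.auto.tfvars
```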
## Related Scenarios
- Scenario 3: Host Failure - Similar but hardware survives
- Scenario 5: Total Site Loss - Same as rack fire (house destroyed)
- Scenario 6: Ransomware - If backups are encrypted/compromised
## Reference
- Backblaze B2 Documentation
- Velero Disaster Recovery Guide
- CNPG Backup and Recovery
- OpenTofu Backend Documentation
- Main disaster recovery guide: Disaster Recovery Overview
- B2 Setup Guide: BACKBLAZE_B2_SETUP.md (in the repository root)