
Scenario 2: Disk Failure

Symptoms

  • Disk errors appearing in Proxmox host system logs
  • Longhorn storage pool showing degraded status
  • Smart monitoring alerts indicating disk health issues
  • PVC mount failures or read/write errors
  • Longhorn UI showing volumes with degraded replicas
  • Node showing disk space issues or I/O errors
  • Worker or control plane node reporting storage problems
  • Applications experiencing persistent storage failures

Impact Assessment

  • Recovery Time Objective (RTO): 2-4 hours
  • Recovery Point Objective (RPO): 1-24 hours (depends on backup age)
  • Data Loss Risk: Minimal if healthy Longhorn replicas exist on other disks; up to the age of the last backup (see RPO) if all replicas were on the failed disk
  • Service Availability: Applications using affected volumes will be unavailable or degraded

Prerequisites

  • Physical access to Proxmox host (host3.peekoff.com) for disk replacement
  • Replacement disk of equal or greater capacity
  • kubectl access to the cluster with admin privileges
  • Access to Longhorn UI (LoadBalancer IP or longhorn.peekoff.com)
  • Access to backup storage locations (TrueNAS MinIO or B2)
  • Velero CLI installed for PVC restoration if needed

Recovery Procedure

Step 1: Assess Disk Failure Impact

First, determine the scope of the disk failure and which resources are affected.

Check Proxmox Host:

# SSH to Proxmox host
ssh root@host3.peekoff.com

# Check disk status with smartctl
smartctl -a /dev/sdX # Replace X with the failing disk

# Check dmesg for disk errors
dmesg | grep -i "error\|fail" | grep sdX

# Check ZFS pool status (if using ZFS)
zpool status

# Check filesystem health
df -h
lsblk

Check Longhorn Status:

# Check Longhorn volumes
kubectl -n longhorn-system get volumes

# Check node status
kubectl -n longhorn-system get nodes.longhorn.io

# Check replica health
kubectl -n longhorn-system get replicas

Access Longhorn UI:

Navigate to the Longhorn UI and check:

  • Dashboard: Overall system health status
  • Node: Which node has the failing disk
  • Volume: Which volumes have degraded replicas
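
If the UI is unreachable, roughly the same health summary can be pulled from the CLI; a minimal sketch (the field paths assume the current Longhorn volume CRD schema):

# Summarize volume state, robustness, and attached node
kubectl -n longhorn-system get volumes.longhorn.io \
-o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness,NODE:.status.currentNodeID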

Step 2: Identify Affected Volumes and Applications

Determine which volumes and applications are impacted:

# List all PVCs and their status
kubectl get pvc -A

# Check for pending or failed PVCs
kubectl get pvc -A | grep -v Bound

# Identify pods using affected PVCs
for ns in $(kubectl get pvc -A --no-headers | grep -v Bound | awk '{print $1}' | sort -u); do
  echo "=== Namespace: $ns ==="
  kubectl -n "$ns" get pods
done

# Check pod events for volume mount errors
kubectl get events -A --sort-by='.lastTimestamp' | grep -i volume
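
To map a degraded PVC back to its Longhorn volume (for dynamically provisioned volumes the Longhorn volume name normally matches the bound PV name), a quick sketch:

# List PVCs with their bound PV names
kubectl get pvc -A \
-o custom-columns=NAMESPACE:.metadata.namespace,PVC:.metadata.name,PV:.spec.volumeName,STATUS:.status.phase

# Cross-reference a PV name against the Longhorn volume
kubectl -n longhorn-system get volumes.longhorn.io <pv-name>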

Step 3: Enable Longhorn Node Evacuation (Optional)

If the disk is still partially functional, evacuate replicas to healthy nodes:

Via Longhorn UI:

  1. Navigate to Node section
  2. Select the node with the failing disk
  3. Click Edit Node
  4. Set Eviction Requested to true
  5. Set Scheduling to Disabled
  6. Click Save

Via kubectl:

# Get the node name
FAILING_NODE="work-00" # Replace with actual node name

# Disable scheduling on the node
kubectl -n longhorn-system patch nodes.longhorn.io $FAILING_NODE \
--type merge \
--patch '{"spec":{"allowScheduling":false}}'

# Request eviction
kubectl -n longhorn-system patch nodes.longhorn.io $FAILING_NODE \
--type merge \
--patch '{"spec":{"evictionRequested":true}}'

# Monitor replica migration
kubectl -n longhorn-system get replicas -w

Wait for replicas to migrate to healthy nodes. This may take 15-60 minutes depending on data size.
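
One way to tell when the eviction has finished is to count how many replicas still reference the failing node; a rough sketch, assuming the replica CRD records its placement in spec.nodeID:

# Count replicas still scheduled on the failing node (should reach 0)
kubectl -n longhorn-system get replicas.longhorn.io -o json | \
jq -r --arg node "$FAILING_NODE" '[.items[] | select(.spec.nodeID == $node)] | length'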

Step 4: Cordon and Drain the Kubernetes Node

Prevent new workloads from scheduling on the affected node:

# Cordon the node (prevent new pods)
kubectl cordon $FAILING_NODE

# Drain the node (evict existing pods)
kubectl drain $FAILING_NODE \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=300

# Verify node is drained
kubectl get nodes
kubectl get pods -A -o wide | grep $FAILING_NODE

Step 5: Replace the Failed Disk

Physical Disk Replacement:

  1. Shut down the affected Proxmox host (only needed if the drive bay does not support hot-swap):

    # On Proxmox host
    shutdown -h now
  2. Replace the failed disk with a new disk of equal or greater capacity

  3. Boot the Proxmox host

  4. Initialize the new disk:

    # Identify the new disk
    lsblk

    # If using ZFS, add disk to pool
    zpool replace <pool-name> /dev/sdX /dev/sdY

    # Or create new filesystem
    mkfs.ext4 /dev/sdY
    mount /dev/sdY /var/lib/longhorn
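
If the disk was mounted manually as above, make the mount persistent so a reboot does not take the Longhorn disk path offline again; a minimal sketch, assuming the new disk is /dev/sdY formatted as ext4:

# Persist the mount across reboots (UUID avoids device renaming issues)
UUID=$(blkid -s UUID -o value /dev/sdY)
echo "UUID=$UUID /var/lib/longhorn ext4 defaults 0 2" >> /etc/fstab
mount -a

# Confirm the mount
findmnt /var/lib/longhorn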

Step 6: Verify Longhorn Storage Recovery

Via Longhorn UI:

  1. Navigate to Node section
  2. Verify the node shows healthy disk status
  3. Re-enable scheduling on the node:
    • Click Edit Node
    • Set Scheduling to Enabled
    • Set Eviction Requested to false
    • Click Save

Via kubectl:

# Re-enable scheduling
kubectl -n longhorn-system patch nodes.longhorn.io $FAILING_NODE \
--type merge \
--patch '{"spec":{"allowScheduling":true,"evictionRequested":false}}'

# Uncordon the Kubernetes node
kubectl uncordon $FAILING_NODE

# Check Longhorn node status
kubectl -n longhorn-system get nodes.longhorn.io
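
To confirm the replaced disk is actually schedulable again from the CLI, inspect the node's per-disk status directly; a sketch (the diskStatus layout may vary slightly between Longhorn versions):

# Show per-disk status and conditions on the node
kubectl -n longhorn-system get nodes.longhorn.io $FAILING_NODE -o yaml | \
grep -A 20 diskStatus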

Step 7: Restore Affected PVCs (If Needed)

If volumes were completely lost (all replicas on failed disk), restore from backup:

List Available Backups:

# Check Longhorn backups
# Via Longhorn UI: Navigate to Backup section

# Or use Velero for PVC restore
velero backup get

# Check specific backup contents
velero backup describe <backup-name> --details

Restore from Longhorn Backup (via UI):

  1. Navigate to Backup section
  2. Find the backup for the lost volume
  3. Click Restore Latest Backup
  4. Use the exact PVC name as the volume name
  5. Click OK
  6. Monitor restoration in Volume section

Restore from Velero:

# Restore specific PVC from Velero backup
velero restore create restore-disk-failure-$(date +%Y%m%d-%H%M%S) \
--from-backup <backup-name> \
--include-resources persistentvolumeclaims,persistentvolumes \
--include-namespaces <namespace> \
--selector app=<app-name>

# Monitor restore progress
velero restore get
velero restore describe restore-disk-failure-<timestamp>
velero restore logs restore-disk-failure-<timestamp>

Step 8: Restart Affected Applications

After volumes are restored and healthy:

# Restart deployments
kubectl -n <namespace> rollout restart deployment/<deployment-name>

# Or restart statefulsets
kubectl -n <namespace> rollout restart statefulset/<statefulset-name>

# Monitor pod startup
kubectl -n <namespace> get pods -w
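
Waiting on the rollout explicitly makes it obvious whether pods actually came back with their volumes attached:

# Block until the rollout completes (or times out)
kubectl -n <namespace> rollout status deployment/<deployment-name> --timeout=10m

# If a pod is stuck in ContainerCreating, check its volume attach events
kubectl -n <namespace> describe pod <pod-name> | grep -A 10 Events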

Validation

Check Longhorn Health

Via Longhorn UI:

  1. Dashboard: All metrics should be green
  2. Node: All nodes should show Schedulable status
  3. Volume: All volumes should show Healthy status
  4. Replica: Replicas should be evenly distributed

Via kubectl:

# Check Longhorn system status
kubectl -n longhorn-system get pods
kubectl -n longhorn-system get nodes.longhorn.io
kubectl -n longhorn-system get volumes
kubectl -n longhorn-system get replicas

# Verify no degraded volumes
kubectl -n longhorn-system get volumes -o json | \
jq -r '.items[] | select(.status.robustness != "healthy") | .metadata.name'

Check PVC Status

# All PVCs should be Bound
kubectl get pvc -A

# Check for any pending PVCs
kubectl get pvc -A | grep -v Bound

# Verify volume attachments
kubectl get volumeattachments

Check Application Status

# Verify all pods are running
kubectl get pods -A | grep -v Running | grep -v Completed

# Check application logs for errors
kubectl -n <namespace> logs <pod-name> --tail=50

# Test application connectivity
kubectl -n <namespace> port-forward svc/<service-name> 8080:80
# Access http://localhost:8080
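
With the port-forward running, a quick request from another terminal confirms the service responds (the path is an assumption; use whatever health or login endpoint the application exposes):

# Basic reachability check through the port-forward
curl -fsS -o /dev/null -w "%{http_code}\n" http://localhost:8080/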

Verify Data Integrity

For critical applications, verify data is intact:

For PostgreSQL:

kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres

-- Check database size
SELECT pg_database_size('<database-name>');

-- Check table row counts
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;

-- Verify latest data
SELECT MAX(created_at) FROM <your_table>;

For Applications:

  • Login and verify functionality
  • Check critical data exists
  • Test CRUD operations
  • Verify file uploads/downloads work
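
For file-backed volumes, a simple write/read round trip inside the pod verifies the restored volume is actually writable (the /data mount path is a placeholder for the application's volume mount):

# Write, read back, and clean up a test file on the mounted volume
kubectl -n <namespace> exec <pod-name> -- sh -c \
'echo dr-test > /data/.dr-write-test && cat /data/.dr-write-test && rm /data/.dr-write-test'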

Post-Recovery Tasks

1. Document the Incident

# Create incident report
cat > docs/incidents/disk-failure-$(date +%Y%m%d).md <<EOF
# Disk Failure Incident

**Date**: $(date)
**Affected Node**: $FAILING_NODE
**Failed Disk**: /dev/sdX
**Volumes Affected**: <list>
**Recovery Time**: <duration>
**Data Loss**: <none/minimal/description>

## What Happened
<description of the failure>

## Recovery Steps Taken
1. Identified failing disk via smartctl and Longhorn UI
2. Evacuated replicas to healthy nodes
3. Replaced physical disk
4. Restored volumes from backup (if needed)
5. Verified application functionality

## Root Cause
<hardware failure details>

## Prevention Measures
<monitoring improvements, spare disk inventory, etc.>
EOF

2. Review Longhorn Replica Strategy

Ensure future failures don't cause data loss:

# Check current replica settings
kubectl -n longhorn-system get settings.longhorn.io default-replica-count -o yaml

# Update if needed (default should be 3)
kubectl -n longhorn-system patch settings.longhorn.io default-replica-count \
--type merge \
--patch '{"value":"3"}'

# For critical volumes, set higher replica count
# Via Longhorn UI: Volume → Edit → Number of Replicas
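
Per-volume replica counts can also be adjusted from the CLI by patching the Longhorn volume directly; a sketch, assuming the volume CRD's spec.numberOfReplicas field:

# Raise the replica count for a single critical volume
kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> \
--type merge \
--patch '{"spec":{"numberOfReplicas":3}}'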

3. Enable or Verify Monitoring

Ensure disk health monitoring is active:

# Check if Prometheus is scraping Longhorn metrics
kubectl -n monitoring get servicemonitor | grep longhorn

# Verify Longhorn alerts exist
kubectl -n monitoring get prometheusrule | grep longhorn

# Test alert (optional)
# Manually trigger a test alert to verify notification works
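
Longhorn's metrics can also be queried through Prometheus to confirm scraping works end to end; a sketch (the service name and metric name are assumptions and may differ in your monitoring stack):

# Port-forward Prometheus and query a Longhorn metric
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
sleep 5
curl -s 'http://localhost:9090/api/v1/query?query=longhorn_volume_robustness' | jq '.data.result | length'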

4. Schedule Regular Backup Verification

# Verify backup schedules are running
kubectl -n velero get schedules

# Check last backup times
velero backup get

# Verify Longhorn backups
# Via Longhorn UI: Settings → Backup Target
# Confirm connection to MinIO and B2 is active
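
An ad-hoc backup is a quick way to prove the whole backup path still works after the incident (the namespace is a placeholder):

# Run a one-off Velero backup and wait for it to complete
velero backup create post-disk-recovery-$(date +%Y%m%d-%H%M%S) \
--include-namespaces <namespace> \
--wait

velero backup describe post-disk-recovery-<timestamp> --details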

5. Update Hardware Inventory

Document the disk replacement:

  • Record new disk serial number
  • Update hardware inventory spreadsheet
  • Note disk warranty information
  • Update Proxmox notes with disk replacement date

Troubleshooting

Volume Stuck in Degraded State

# Check replica status
kubectl -n longhorn-system get replicas | grep <volume-name>

# Identify which replicas are unhealthy
kubectl -n longhorn-system describe replica <replica-name>

# Force rebuild of degraded replica (via Longhorn UI)
# Volume → <volume-name> → Salvage

# Or delete unhealthy replica to trigger rebuild
kubectl -n longhorn-system delete replica <replica-name>

Replicas Not Migrating During Eviction

# Check if there's enough space on other nodes
kubectl -n longhorn-system get nodes.longhorn.io -o json | \
jq -r '.items[] | "\(.metadata.name): \(.status.diskStatus)"'

# Check for scheduling issues
kubectl -n longhorn-system describe nodes.longhorn.io <node-name>

# Manually move replica (via Longhorn UI)
# Volume → <volume-name> → Update Replicas Count
# Then select specific nodes for placement

PVC Fails to Bind After Restore

# Check PVC and PV status
kubectl -n <namespace> get pvc <pvc-name> -o yaml
kubectl get pv

# Check if PV exists with matching claim
kubectl get pv -o yaml | grep -A 10 <pvc-name>

# Delete and recreate PVC if needed
kubectl -n <namespace> delete pvc <pvc-name>
# Recreate from manifest or Velero restore

Longhorn Node Shows "Scheduling Disabled"

# Check node status
kubectl -n longhorn-system get nodes.longhorn.io <node-name> -o yaml

# Re-enable scheduling
kubectl -n longhorn-system patch nodes.longhorn.io <node-name> \
--type merge \
--patch '{"spec":{"allowScheduling":true}}'

Disk Not Detected After Replacement

# On Proxmox host, rescan the SCSI bus (repeat for each /sys/class/scsi_host/hostN present)
echo "- - -" > /sys/class/scsi_host/host0/scan

# Check if disk appears
lsblk
dmesg | tail -50

# Verify disk is healthy
smartctl -a /dev/sdX

# If using hardware RAID, check RAID controller
# (Commands vary by controller type)
