Longhorn Upgrade Strategy
Overview
This document outlines the upgrade strategy for Longhorn distributed storage in the homelab Kubernetes cluster. The strategy focuses on automated engine upgrades and version management to minimize downtime and manual intervention.
Context
The homelab uses Longhorn as its distributed block storage system. Longhorn v1.1.1 introduced automatic engine upgrades, which reduce the manual maintenance overhead of version updates.
Automatic Engine Upgrades
Since Longhorn v1.1.1, the cluster is configured to automatically upgrade volume engines to the new default engine version after upgrading the Longhorn manager. This feature reduces manual work during upgrades and ensures volumes stay current with the latest engine improvements.
Configuration
The cluster uses the following automatic upgrade setting (a sample configuration follows the list):
- Concurrent Automatic Engine Upgrade Per Node Limit: 3 engines per node
  - Controls the maximum number of concurrent engine upgrades per node
  - A value of 3 provides good upgrade speed while preventing system overload
  - A value of 0 disables automatic engine upgrades
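This setting is managed through the Longhorn Helm chart. A minimal sketch, assuming the upstream chart's defaultSettings block is used for cluster-wide settings (adjust key placement to match the repository's values.yaml):

```yaml
# Sketch of the relevant Helm values; assumes the upstream Longhorn chart's
# defaultSettings block is used to manage cluster-wide settings.
defaultSettings:
  # Maximum number of engines upgraded concurrently on each node.
  # 0 disables automatic engine upgrades entirely.
  concurrentAutomaticEngineUpgradePerNodeLimit: 3
```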
Upgrade Behavior by Volume State
Attached Volumes
- Healthy attached volumes receive live upgrades without downtime
- Engine upgrade occurs while the volume remains in use
Detached Volumes
- Offline upgrades are performed automatically
- No impact on running applications
Disaster Recovery Volumes
- Not automatically upgraded to avoid triggering full restoration
- Manual upgrade recommended during maintenance windows
- Once a DR volume is activated, it is upgraded offline after it detaches
Failure Handling
If an engine upgrade fails:
- The volume spec retains the old engine image reference (see the sketch below)
- Longhorn continuously retries the upgrade
- If the number of failed upgrades on a node exceeds the concurrent limit, automatic upgrades pause on that node
- Failed upgrades don't affect volume availability
This ensures smooth Longhorn version upgrades with minimal operational overhead.
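When an upgrade fails or is still pending, the old image reference can be confirmed directly on the Longhorn Volume resource. A rough sketch of the relevant fields; field names and API versions vary between Longhorn releases, so treat it as orientation only:

```yaml
# Hypothetical excerpt of a Longhorn Volume resource after a failed engine upgrade.
# Field names differ across Longhorn versions (e.g. spec.engineImage vs spec.image).
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: pvc-0f3c2d1e            # example volume name
  namespace: longhorn-system
spec:
  engineImage: longhornio/longhorn-engine:v1.6.1   # still the old engine; the upgrade was not applied
status:
  currentImage: longhornio/longhorn-engine:v1.6.1  # engine the volume is actually running
  robustness: healthy                              # the volume stays available despite the failed upgrade
```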
Upgrade Process
Longhorn Manager Upgrades
- Preparation
  - Review Longhorn release notes for breaking changes
  - Ensure backup strategy is current and tested
  - Verify cluster has sufficient resources for upgrades
- Manager Upgrade
  - Update the Longhorn Helm chart version in `k8s/infrastructure/storage/longhorn/values.yaml`
  - Apply changes through GitOps (Argo CD will deploy automatically); see the Application sketch after this list
  - Monitor the Longhorn UI for upgrade progress
- Engine Upgrades
  - Automatic engine upgrades begin after manager deployment
  - Monitor upgrade progress in the Longhorn UI
  - Check for failed upgrades and investigate if needed
- Post-Upgrade Validation
  - Verify all volumes are healthy
  - Confirm backup jobs are running successfully
  - Test application functionality with upgraded volumes
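How the chart version is pinned depends on the repository layout; this homelab keeps its values in `k8s/infrastructure/storage/longhorn/values.yaml`. As one illustration, a hypothetical Argo CD Application tracking the upstream Longhorn chart might pin the version like this (names, version, and sync policy are assumptions, not the actual manifest):

```yaml
# Hypothetical Argo CD Application; the real repository wiring may differ
# (umbrella chart, kustomize, or a plain values.yaml referenced elsewhere).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: longhorn
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.longhorn.io
    chart: longhorn
    targetRevision: 1.6.2        # bumping this rolls out the new Longhorn release
  destination:
    server: https://kubernetes.default.svc
    namespace: longhorn-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```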
Rollback Procedures
If issues occur during upgrade:
- Pause Automatic Upgrades
  - Temporarily set `concurrentAutomaticEngineUpgradePerNodeLimit: 0` (see the Setting sketch after this list)
  - This stops automatic engine upgrades
- Manager Rollback
  - Revert the Helm chart version in Git
  - Argo CD will roll back the deployment
- Volume Recovery
  - Failed volumes will retain the old engine version
  - Manual intervention may be required for problematic volumes
  - Use the Longhorn UI to manually upgrade individual volumes if needed
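Besides changing the Helm values, the limit can be flipped at runtime through Longhorn's Setting resource. A minimal sketch, assuming the longhorn.io/v1beta2 API is available (the same key can also be changed in the Longhorn UI or via the chart's defaultSettings):

```yaml
# Sketch: pause automatic engine upgrades cluster-wide by setting the limit to 0.
# Assumes the Longhorn Setting CRD at longhorn.io/v1beta2 in the longhorn-system namespace.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: concurrent-automatic-engine-upgrade-per-node-limit
  namespace: longhorn-system
value: "0"
```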
Monitoring and Alerts
- Monitor Longhorn UI for upgrade status
- Check Kubernetes events for upgrade-related issues
- Set up alerts for failed engine upgrades (a sample alert rule follows this list)
- Review upgrade logs in Longhorn manager pods
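Longhorn exports Prometheus metrics, so one way to cover the alerting point above is a rule on volume robustness. A sketch, assuming the Prometheus Operator (kube-prometheus-stack) CRDs are installed and Longhorn's `longhorn_volume_robustness` metric (1 = healthy) is being scraped; the threshold and labels are illustrative:

```yaml
# Hypothetical alert: a volume that is not healthy (e.g. after a failed engine
# upgrade) for 15 minutes. Adjust severity and duration to local conventions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-upgrade-alerts
  namespace: longhorn-system
spec:
  groups:
    - name: longhorn-upgrades
      rules:
        - alert: LonghornVolumeNotHealthy
          expr: longhorn_volume_robustness != 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is not healthy"
            description: "Check engine upgrade status for this volume in the Longhorn UI."
```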
Best Practices
- Test Upgrades: Always test upgrades in a development environment first
- Backup First: Ensure recent backups exist before major upgrades (see the recurring backup sketch after this list)
- Monitor Resources: Watch for resource usage spikes during upgrades
- Staged Rollout: Consider upgrading nodes in stages for large clusters
- Documentation: Keep upgrade records for troubleshooting future issues
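To keep backups current ahead of upgrades, Longhorn's RecurringJob resource can schedule them automatically. A minimal sketch, assuming the longhorn.io/v1beta2 API; the schedule, group, and retention values are placeholders, not the homelab's actual policy:

```yaml
# Sketch of a nightly backup job; adjust cron, group, and retention as needed.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"   # every night at 02:00
  task: backup        # full backup rather than a local snapshot
  groups:
    - default         # volumes in the "default" recurring-job group
  retain: 7           # keep the last seven backups
  concurrency: 2      # back up at most two volumes at a time
```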