Longhorn Upgrade Strategy
Overview
This document outlines the upgrade strategy for Longhorn distributed storage in the homelab Kubernetes cluster. The strategy focuses on automated engine upgrades and version management to minimize downtime and manual intervention.
Context
The homelab uses Longhorn as its distributed block storage system. Longhorn v1.1.1 introduced automatic engine upgrades, which reduce the manual maintenance overhead of version updates.
Automatic Engine Upgrades
Since Longhorn v1.1.1, the cluster is configured to automatically upgrade volume engines to the new default engine version after upgrading the Longhorn manager. This feature reduces manual work during upgrades and ensures volumes stay current with the latest engine improvements.
Configuration
The cluster uses the following automatic upgrade setting (a sample configuration follows the list):
- Concurrent Automatic Engine Upgrade Per Node Limit: 3 engines per node
  - Controls the maximum number of concurrent engine upgrades per node
  - A value of 3 provides good upgrade speed while preventing system overload
  - A value of 0 disables automatic engine upgrades
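This setting is managed through the Longhorn Helm chart. A minimal sketch, assuming the upstream chart's defaultSettings block is used for cluster-wide settings (adjust key placement to match the repository's values.yaml):

```yaml
# Sketch of the relevant Helm values; assumes the upstream Longhorn chart's
# defaultSettings block is used to manage cluster-wide settings.
defaultSettings:
  # Maximum number of engines upgraded concurrently on each node.
  # 0 disables automatic engine upgrades entirely.
  concurrentAutomaticEngineUpgradePerNodeLimit: 3
```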
Upgrade Behavior by Volume State
Attached Volumes
- Healthy attached volumes receive live upgrades without downtime
- Engine upgrade occurs while the volume remains in use
Detached Volumes
- Offline upgrades are performed automatically
- No impact on running applications
Disaster Recovery Volumes
- Not automatically upgraded to avoid triggering full restoration
- Manual upgrade recommended during maintenance windows
- Once a DR volume is activated, it is upgraded offline after it detaches
Failure Handling
If an engine upgrade fails:
- The volume spec retains the old engine image reference (see the sketch below)
- Longhorn continuously retries the upgrade
- If the number of failed upgrades on a node exceeds the concurrent limit, automatic upgrades pause on that node
- Failed upgrades don't affect volume availability
This ensures smooth Longhorn version upgrades with minimal operational overhead.
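When an upgrade fails or is still pending, the old image reference can be confirmed directly on the Longhorn Volume resource. A rough sketch of the relevant fields; field names and API versions vary between Longhorn releases, so treat it as orientation only:

```yaml
# Hypothetical excerpt of a Longhorn Volume resource after a failed engine upgrade.
# Field names differ across Longhorn versions (e.g. spec.engineImage vs spec.image).
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: pvc-0f3c2d1e            # example volume name
  namespace: longhorn-system
spec:
  engineImage: longhornio/longhorn-engine:v1.6.1   # still the old engine; the upgrade was not applied
status:
  currentImage: longhornio/longhorn-engine:v1.6.1  # engine the volume is actually running
  robustness: healthy                              # the volume stays available despite the failed upgrade
```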
Upgrade Process
Longhorn Manager Upgrades
- Preparation
  - Review Longhorn release notes for breaking changes
  - Ensure backup strategy is current and tested
  - Verify cluster has sufficient resources for upgrades
- Manager Upgrade
  - Update the Longhorn Helm chart version in `k8s/infrastructure/storage/longhorn/values.yaml`
  - Apply changes through GitOps (Argo CD will deploy automatically); see the Application sketch after this list
  - Monitor the Longhorn UI for upgrade progress
- Engine Upgrades
  - Automatic engine upgrades begin after manager deployment
  - Monitor upgrade progress in the Longhorn UI
  - Check for failed upgrades and investigate if needed
- Post-Upgrade Validation
  - Verify all volumes are healthy
  - Confirm backup jobs are running successfully
  - Test application functionality with upgraded volumes
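How the chart version is pinned depends on the repository layout; this homelab keeps its values in `k8s/infrastructure/storage/longhorn/values.yaml`. As one illustration, a hypothetical Argo CD Application tracking the upstream Longhorn chart might pin the version like this (names, version, and sync policy are assumptions, not the actual manifest):

```yaml
# Hypothetical Argo CD Application; the real repository wiring may differ
# (umbrella chart, kustomize, or a plain values.yaml referenced elsewhere).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: longhorn
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.longhorn.io
    chart: longhorn
    targetRevision: 1.6.2        # bumping this rolls out the new Longhorn release
  destination:
    server: https://kubernetes.default.svc
    namespace: longhorn-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```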
Rollback Procedures
If issues occur during upgrade:
- Pause Automatic Upgrades
  - Temporarily set `concurrentAutomaticEngineUpgradePerNodeLimit: 0` (see the Setting sketch after this list)
  - This stops automatic engine upgrades
- Manager Rollback
  - Revert the Helm chart version in Git
  - Argo CD will roll back the deployment
- Volume Recovery
  - Failed volumes will retain the old engine version
  - Manual intervention may be required for problematic volumes
  - Use the Longhorn UI to manually upgrade individual volumes if needed
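Besides changing the Helm values, the limit can be flipped at runtime through Longhorn's Setting resource. A minimal sketch, assuming the longhorn.io/v1beta2 API is available (the same key can also be changed in the Longhorn UI or via the chart's defaultSettings):

```yaml
# Sketch: pause automatic engine upgrades cluster-wide by setting the limit to 0.
# Assumes the Longhorn Setting CRD at longhorn.io/v1beta2 in the longhorn-system namespace.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: concurrent-automatic-engine-upgrade-per-node-limit
  namespace: longhorn-system
value: "0"
```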
Monitoring and Alerts
- Monitor Longhorn UI for upgrade status
- Check Kubernetes events for upgrade-related issues
- Set up alerts for failed engine upgrades (a sample alert rule follows this list)
- Review upgrade logs in Longhorn manager pods
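Longhorn exports Prometheus metrics, so one way to cover the alerting point above is a rule on volume robustness. A sketch, assuming the Prometheus Operator (kube-prometheus-stack) CRDs are installed and Longhorn's `longhorn_volume_robustness` metric (1 = healthy) is being scraped; the threshold and labels are illustrative:

```yaml
# Hypothetical alert: a volume that is not healthy (e.g. after a failed engine
# upgrade) for 15 minutes. Adjust severity and duration to local conventions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-upgrade-alerts
  namespace: longhorn-system
spec:
  groups:
    - name: longhorn-upgrades
      rules:
        - alert: LonghornVolumeNotHealthy
          expr: longhorn_volume_robustness != 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is not healthy"
            description: "Check engine upgrade status for this volume in the Longhorn UI."
```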
Best Practices
- Test Upgrades: Always test upgrades in a development environment first
- Backup First: Ensure recent backups exist before major upgrades (see the recurring backup sketch after this list)
- Monitor Resources: Watch for resource usage spikes during upgrades
- Staged Rollout: Consider upgrading nodes in stages for large clusters
- Documentation: Keep upgrade records for troubleshooting future issues
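To keep backups current ahead of upgrades, Longhorn's RecurringJob resource can schedule them automatically. A minimal sketch, assuming the longhorn.io/v1beta2 API; the schedule, group, and retention values are placeholders, not the homelab's actual policy:

```yaml
# Sketch of a nightly backup job; adjust cron, group, and retention as needed.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"   # every night at 02:00
  task: backup        # full backup rather than a local snapshot
  groups:
    - default         # volumes in the "default" recurring-job group
  retain: 7           # keep the last seven backups
  concurrency: 2      # back up at most two volumes at a time
```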