Talos Upgrade Process
Overview
Talos upgrades are handled through OpenTofu using a marker-based system. Two versions are defined:
- talos_image.version - Current deployed version
- talos_image.update_version - Target version (updated by Renovate)
You control which nodes upgrade by setting upgrade = true in their node config.
How It Works
- Renovate updates talos_image.update_version to the new version
- You set upgrade = true on individual nodes in nodes_config
- Nodes with upgrade = true use update_version, others stay at version
- Run tofu apply - only marked nodes upgrade
- After all nodes are done, update version to match and set all upgrade = false
talos_image.version = "v1.10.3" # current base
talos_image.update_version = "v1.11.5" # target
# In nodes_config:
"ctrl-00" = { ..., upgrade = true } → uses v1.11.5 (upgrade/stay upgraded)
"ctrl-01" = { ..., upgrade = false } → uses v1.10.3 (stay at base)
Warning: Once you set upgrade = true on a node, keep it true until you update the base version to match. Setting upgrade = false before updating the base version triggers a downgrade!
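The selection rule described above can be sketched as an OpenTofu local that maps each node name to its effective Talos version. This is a minimal sketch, not the repository's actual module code: the variable type declarations and the local name node_versions are assumptions inferred from the tfvars examples in this document.

```hcl
# Assumed variable shapes, inferred from the tfvars examples in this document.
variable "talos_image" {
  type = object({
    version        = string # current base version
    update_version = string # target version (bumped by Renovate)
    schematic_path = string
  })
}

variable "nodes_config" {
  type = map(object({
    machine_type = string
    ip           = string
    # Other per-node attributes are omitted from this sketch.
    upgrade = optional(bool, false) # unmarked nodes stay at the base version
  }))
}

locals {
  # Hypothetical local: effective Talos version per node.
  # Marked nodes get update_version, all others get version.
  node_versions = {
    for name, node in var.nodes_config :
    name => node.upgrade ? var.talos_image.update_version : var.talos_image.version
  }
}
```

With the example values above, ctrl-00 would resolve to v1.11.5 and ctrl-01 to v1.10.3. It also shows why flipping upgrade back to false before changing version moves a node back to the old release.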
Upgrade Process
Step 1: Check Current Status
cd tofu
tofu plan
The upgrade_sequence output shows:
- recommended_order - Suggested upgrade sequence (control plane first, then workers)
- current_version - Base version
- target_version - Target version
- nodes_marked_for_upgrade - Nodes with upgrade = true
- node_versions - Effective version per node
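For orientation, an output with these fields could be assembled roughly as follows. This is a hedged sketch rather than the project's real definition: the variable declarations and the local names control_plane and others are assumptions, and any node whose machine_type is not "controlplane" is treated as a worker.

```hcl
# Assumed variable shapes, inferred from the tfvars examples in this document.
variable "talos_image" {
  type = object({ version = string, update_version = string, schematic_path = string })
}

variable "nodes_config" {
  type = map(object({ machine_type = string, ip = string, upgrade = optional(bool, false) }))
}

locals {
  # Hypothetical helpers: node names split by role, each sorted alphabetically.
  control_plane = sort([for name, node in var.nodes_config : name if node.machine_type == "controlplane"])
  others        = sort([for name, node in var.nodes_config : name if node.machine_type != "controlplane"])
}

output "upgrade_sequence" {
  value = {
    # Control plane nodes first, then everything else.
    recommended_order        = concat(local.control_plane, local.others)
    current_version          = var.talos_image.version
    target_version           = var.talos_image.update_version
    nodes_marked_for_upgrade = sort([for name, node in var.nodes_config : name if node.upgrade])
    node_versions = {
      for name, node in var.nodes_config :
      name => node.upgrade ? var.talos_image.update_version : var.talos_image.version
    }
  }
}
```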
Step 2: Mark First Node for Upgrade
Edit tofu/nodes.auto.tfvars:
nodes_config = {
"ctrl-00" = {
machine_type = "controlplane"
ip = "10.25.150.10"
# ... other config ...
upgrade = true # Mark for upgrade
}
"ctrl-01" = {
# ... config ...
upgrade = false # Keep at current version
}
# ...
}
Step 3: Apply First Upgrade
tofu apply
Only ctrl-00 will be replaced. Wait for it to become Ready:
kubectl get nodes -w
Step 4: Continue Sequentially
Set upgrade = true on the next node:
"ctrl-01" = {
# ... config ...
upgrade = true # Now mark ctrl-01
}
tofu apply
Repeat for all nodes in the recommended order:
- Control plane: ctrl-00, ctrl-01, ctrl-02
- Workers: work-00, work-01, work-02, work-03
Step 5: Finalize Upgrade
Warning: Keep upgrade = true on nodes that have been upgraded until you finalize. Setting upgrade = false before updating the base version will trigger a downgrade!
After all nodes are upgraded:
- First, update talos_image.version to match update_version:
# tofu/talos_image.auto.tfvars
talos_image = {
version = "v1.11.5" # Updated from v1.10.3
update_version = "v1.11.5"
schematic_path = "talos/image/schematic.yaml.tftpl"
}
- Then, set all nodes to upgrade = false:
# tofu/nodes.auto.tfvars
nodes_config = {
"ctrl-00" = { ..., upgrade = false }
"ctrl-01" = { ..., upgrade = false }
# ...
}
- Apply to clean up state:
tofu apply
This is safe because upgrade = false now means "use version" which is v1.11.5.
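If you want a reminder that finalization is incomplete, one option is a check block that reports a warning when version already equals update_version but some nodes are still marked. This is only a sketch under assumptions: the check name and the variable declarations are hypothetical and not taken from the repository. It cannot detect the downgrade case, because that depends on what is actually deployed.

```hcl
# Assumed variable shapes, inferred from the tfvars examples in this document.
variable "talos_image" {
  type = object({ version = string, update_version = string, schematic_path = string })
}

variable "nodes_config" {
  type = map(object({ machine_type = string, ip = string, upgrade = optional(bool, false) }))
}

# Hypothetical check: passes while an upgrade is in progress (versions differ)
# or once every upgrade flag has been reset to false.
check "upgrade_flags_cleared_after_finalize" {
  assert {
    condition = (
      var.talos_image.version != var.talos_image.update_version ||
      alltrue([for node in values(var.nodes_config) : !node.upgrade])
    )
    error_message = "version equals update_version but some nodes still have upgrade = true. Set them to false to finish finalization."
  }
}
```

A failed check is reported as a warning and does not block the apply, so it works as a reminder and not as a guard.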
Bulk Upgrades
To upgrade all nodes at once, set upgrade = true on all nodes, then run tofu apply.
Monitoring
During upgrades:
# Watch Kubernetes node status
kubectl get nodes -w
# Check Talos version on specific node
talosctl version --nodes <NODE_IP>
# Verify cluster health
kubectl get pods --all-namespaces
Rollback
If a node upgrade fails immediately (before other nodes are upgraded), you can set upgrade = false to revert to the base version:
"ctrl-01" = {
# ... config ...
upgrade = false # Revert to base version (v1.10.3)
}
tofu apply
Note: This only works while talos_image.version is still the old version. Once you've updated the base version, rolling back requires setting version back to the old value.
Configuration Reference
talos_image.auto.tfvars
talos_image = {
version = "v1.10.3" # Current deployed version
update_version = "v1.11.5" # Target version (updated by Renovate)
schematic_path = "talos/image/schematic.yaml.tftpl"
}
nodes.auto.tfvars
nodes_config = {
"ctrl-00" = {
machine_type = "controlplane"
ip = "10.25.150.10"
mac_address = "bc:24:11:6f:10:01"
vm_id = 8100
upgrade = false # Set to true to upgrade
}
# ...
}
Best Practices
- Follow the recommended order - Control plane nodes first, then workers
- One node at a time - Wait for each node to be Ready before continuing
- Monitor closely - Watch kubectl get nodes during each upgrade
- Keep backups current - Ensure Longhorn backups are up to date
- Keep upgrade = true - Don't set upgrade = false on upgraded nodes until finalization
- Update base version last - Only update talos_image.version after ALL nodes are upgraded
- Finalize in order - First update version, then set all upgrade = false
Troubleshooting
Node Stuck After Replacement
kubectl describe node <node-name>
talosctl --nodes <NODE_IP> services
talosctl --nodes <NODE_IP> dmesg
Check Effective Versions
tofu output -json | jq '.upgrade_sequence.value.node_versions'
Partial Upgrade State
tofu refresh
tofu plan