Talos Upgrade Process

Overview

Talos upgrades are handled through OpenTofu using a marker-based system. Two versions are defined:

  • talos_image.version - Current deployed version
  • talos_image.update_version - Target version (updated by Renovate)

You control which nodes upgrade by setting upgrade = true in their node config.

How It Works

  1. Renovate updates talos_image.update_version to the new version
  2. You set upgrade = true on individual nodes in nodes_config
  3. Nodes with upgrade = true use update_version; all other nodes stay at version
  4. Run tofu apply - only marked nodes upgrade
  5. After all nodes are done, update version to match and set all upgrade = false

talos_image.version        = "v1.10.3"  # current base
talos_image.update_version = "v1.11.5"  # target

# In nodes_config:
"ctrl-00" = { ..., upgrade = true }   # uses v1.11.5 (upgrades, or stays upgraded)
"ctrl-01" = { ..., upgrade = false }  # uses v1.10.3 (stays at base)

Key Rule: Once a node has upgrade = true, keep it true until you update the base version to match. Setting upgrade = false before updating the base version triggers a downgrade!
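The selection rule described above can be sketched in OpenTofu roughly as follows. This is a minimal illustration rather than the repository's actual code: the variable shapes are inferred from the tfvars examples in this document, only the attributes the sketch reads are declared, and the local name node_versions is borrowed from the upgrade_sequence output purely for illustration.

```hcl
# Hypothetical sketch: variable shapes are inferred from the tfvars examples
# in this document, not copied from the real module.
variable "talos_image" {
  type = object({
    version        = string # current base version
    update_version = string # target version
  })
}

variable "nodes_config" {
  type = map(object({
    upgrade = bool # marker: true means "use update_version"
  }))
}

locals {
  # Marked nodes take the target version; every other node stays on the base.
  node_versions = {
    for name, node in var.nodes_config :
    name => node.upgrade ? var.talos_image.update_version : var.talos_image.version
  }
}
```

Under this rule, a node's effective version flips back to the base the moment its marker becomes false, which is why the marker has to stay true until the base version has caught up.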

Upgrade Process

Step 1: Check Current Status

cd tofu
tofu plan

The upgrade_sequence output shows:

  • recommended_order - Suggested upgrade sequence (control plane first, then workers)
  • current_version - Base version
  • target_version - Target version
  • nodes_marked_for_upgrade - Nodes with upgrade = true
  • node_versions - Effective version per node
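For orientation, an output with these five fields could be assembled along the following lines. This is a sketch under assumptions, not the actual definition: the variable shapes are inferred from the tfvars examples, the locals control_plane and workers are hypothetical names, and the real module may compute the order differently.

```hcl
# Hypothetical sketch of an upgrade_sequence output; the real one may differ.
variable "talos_image" {
  type = object({
    version        = string
    update_version = string
  })
}

variable "nodes_config" {
  type = map(object({
    machine_type = string
    upgrade      = bool
  }))
}

locals {
  # A for expression over a map visits keys in lexical order,
  # so each group comes out sorted by node name.
  control_plane = [for name, node in var.nodes_config : name if node.machine_type == "controlplane"]
  workers       = [for name, node in var.nodes_config : name if node.machine_type != "controlplane"]
}

output "upgrade_sequence" {
  value = {
    # Control plane first, then workers.
    recommended_order        = concat(local.control_plane, local.workers)
    current_version          = var.talos_image.version
    target_version           = var.talos_image.update_version
    nodes_marked_for_upgrade = [for name, node in var.nodes_config : name if node.upgrade]
    node_versions = {
      for name, node in var.nodes_config :
      name => node.upgrade ? var.talos_image.update_version : var.talos_image.version
    }
  }
}
```

After an apply, `tofu output upgrade_sequence` prints the same structure without running a new plan.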

Step 2: Mark First Node for Upgrade

Edit tofu/nodes.auto.tfvars:

nodes_config = {
  "ctrl-00" = {
    machine_type = "controlplane"
    ip           = "10.25.150.10"
    # ... other config ...
    upgrade = true # Mark for upgrade
  }
  "ctrl-01" = {
    # ... config ...
    upgrade = false # Keep at current version
  }
  # ...
}

Step 3: Apply First Upgrade

tofu apply

Only ctrl-00 will be replaced. Wait for it to become Ready:

kubectl get nodes -w

Step 4: Continue Sequentially

Set upgrade = true on the next node:

"ctrl-01" = {
  # ... config ...
  upgrade = true # Now mark ctrl-01
}

tofu apply

Repeat for all nodes in the recommended order:

  1. Control plane: ctrl-00, ctrl-01, ctrl-02
  2. Workers: work-00, work-01, work-02, work-03

Step 5: Finalize Upgrade

Important: Keep upgrade = true on nodes that have been upgraded until you finalize. Setting upgrade = false before updating the base version will trigger a downgrade!

After all nodes are upgraded:

  1. First, update talos_image.version to match update_version:

# tofu/talos_image.auto.tfvars
talos_image = {
  version        = "v1.11.5" # Updated from v1.10.3
  update_version = "v1.11.5"
  schematic_path = "talos/image/schematic.yaml.tftpl"
}

  2. Then, set all nodes to upgrade = false:

# tofu/nodes.auto.tfvars
nodes_config = {
  "ctrl-00" = { ..., upgrade = false }
  "ctrl-01" = { ..., upgrade = false }
  # ...
}

  3. Apply to clean up state:

tofu apply

This is safe because upgrade = false now means "use version" which is v1.11.5.

Bulk Upgrades

To upgrade all nodes at once, set upgrade = true on all nodes, then run tofu apply.

Monitoring

During upgrades:

# Watch Kubernetes node status
kubectl get nodes -w

# Check Talos version on specific node
talosctl version --nodes <NODE_IP>

# Verify cluster health
kubectl get pods --all-namespaces

Rollback

If a node upgrade fails immediately (before other nodes are upgraded), you can set upgrade = false to revert to the base version:

"ctrl-01" = {
  # ... config ...
  upgrade = false # Revert to base version (v1.10.3)
}

tofu apply

This only works if talos_image.version is still the old version. Once you've updated the base version, rolling back requires setting version back to the old value.

Configuration Reference

talos_image.auto.tfvars

talos_image = {
  version        = "v1.10.3" # Current deployed version
  update_version = "v1.11.5" # Target version (updated by Renovate)
  schematic_path = "talos/image/schematic.yaml.tftpl"
}

nodes.auto.tfvars

nodes_config = {
  "ctrl-00" = {
    machine_type = "controlplane"
    ip           = "10.25.150.10"
    mac_address  = "bc:24:11:6f:10:01"
    vm_id        = 8100
    upgrade      = false # Set to true to upgrade
  }
  # ...
}
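The two tfvars files above imply variable declarations roughly like the ones sketched here. Everything in this block is an assumption made for illustration: the attribute types are guessed from the example values, the default of false for upgrade and the version-format validation are additions not described in this document, and the real declarations may include further attributes.

```hcl
# Hypothetical declarations matching the tfvars examples; types are inferred.
variable "talos_image" {
  type = object({
    version        = string # current deployed version
    update_version = string # target version (updated by Renovate)
    schematic_path = string
  })

  validation {
    # Assumed format check: accepts tags like v1.11.5, rejects pre-release tags.
    condition = alltrue([
      for v in [var.talos_image.version, var.talos_image.update_version] :
      can(regex("^v[0-9]+\\.[0-9]+\\.[0-9]+$", v))
    ])
    error_message = "Both version and update_version must look like v1.11.5."
  }
}

variable "nodes_config" {
  type = map(object({
    machine_type = string
    ip           = string
    mac_address  = string
    vm_id        = number
    # Assumed default: an omitted marker means "stay on the base version".
    upgrade = optional(bool, false)
  }))
}
```

A default of false keeps an unmarked node on talos_image.version, which matches the behaviour this document describes for upgrade = false.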

Best Practices

  1. Follow the recommended order - Control plane nodes first, then workers
  2. One node at a time - Wait for each node to be Ready before continuing
  3. Monitor closely - Watch kubectl get nodes during each upgrade
  4. Keep backups current - Ensure Longhorn backups are up to date
  5. Keep upgrade = true - Don't set upgrade = false on upgraded nodes until finalization
  6. Update base version last - Only update talos_image.version after ALL nodes are upgraded
  7. Finalize in order - First update version, then set all upgrade = false

Troubleshooting

Node Stuck After Replacement

kubectl describe node <node-name>
talosctl --nodes <NODE_IP> services
talosctl --nodes <NODE_IP> dmesg

Check Effective Versions

tofu output -json | jq '.upgrade_sequence.value.node_versions'

Partial Upgrade State

tofu refresh
tofu plan