Kubernetes Provisioning with OpenTofu

Deployment Process

Before you begin deployment, ensure your SSH key is loaded:

eval "$(ssh-agent -s)" && ssh-add ~/.ssh/id_rsa

Deployment Process

1. OpenTofu reads configurations
2. Downloads Talos images
3. Creates Proxmox VMs
4. Applies node configs
5. Bootstraps first control plane
6. Generates kubeconfig
7. Verifies cluster health

Maintenance Tasks

Version Upgrades

Update versions in main.tf or the related tfvars files. Note that versions can be specified in several places, depending on the module inputs:
- For the Talos image factory (e.g., module "talos" { talos_image = { version = "vX.Y.Z" } })
- For the machine configurations and cluster secrets (e.g., module "talos" { cluster = { talos_version = "vX.Y.Z" } })
- Kubernetes version (e.g., module "talos" { cluster = { kubernetes_version = "vA.B.C" } })
Example snippet from main.tf (actual structure can vary based on module inputs):
```
module "talos" {
  # ...
  versions = {
    talos      = "<see https://github.com/siderolabs/talos/releases>" # Target Talos version
    kubernetes = "<see https://github.com/kubernetes/kubernetes/releases>"  # Target Kubernetes version
  }
  # ...
}
```
Set update = true for affected nodes in tofu/nodes.auto.tfvars if your OpenTofu module supports this flag for triggering upgrades. Otherwise, tofu apply will handle changes to version properties.
Run:
```
tofu apply
```

Node Management

Add/Remove Nodes

1. Add or remove entries in the node map in tofu/nodes.auto.tfvars
2. Run tofu apply

Change Resources

1. Update node specs in tofu/nodes.auto.tfvars
2. Run tofu apply

Note: Resource changes can require VM restarts

Initial Setup

Prerequisites

- Proxmox VE 7.4 or later
- SSH key access to the Proxmox host configured
- DHCP and DNS available on the node network
- Storage pools configured
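
The commands in this guide also rely on a few CLI tools being installed on your workstation. A minimal sketch of a pre-flight check, assuming a POSIX shell; the helper name missing_tools is hypothetical, and the tool list is inferred from the commands used in this document rather than from an official requirements list:

```shell
# Print every tool from the argument list that is not found on PATH.
# missing_tools is an illustrative helper, not part of any CLI.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (uncomment to run): prints nothing when all tools are installed.
# missing_tools tofu kubectl talosctl ssh-add
```

An empty result means every listed tool was found; any printed name needs to be installed before continuing.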

Configuration

Create tofu/config.auto.tfvars with your environment settings. An example file, terraform.tfvars.Example, is provided.

// tofu/config.auto.tfvars example

cluster_name   = "talos"
cluster_domain = "kube.pc-tips.se"

# Network settings
# All nodes must be on the same L2 network
network = {
  gateway     = "10.25.150.1"
  vip         = "10.25.150.10" # Control plane Virtual IP
  cidr_prefix = 24
  dns_servers = ["10.25.150.1"]
  bridge      = "vmbr0"
  vlan_id     = 150
}

# Proxmox settings
proxmox_cluster = "host3"

# Software versions
versions = {
  talos      = "v1.10.3"
  kubernetes = "1.33.2"
}

# OIDC settings (optional)
oidc = {
  issuer_url = "https://sso.pc-tips.se/application/o/kubectl/"
  client_id  = "kubectl"
}
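
Since the gateway, the control plane VIP, and all nodes share one network, it can help to sanity-check addresses against cidr_prefix before applying. A minimal sketch for IPv4 only, assuming a POSIX shell; ip_to_int and same_subnet are hypothetical helper names, and the addresses in the usage comment come from the example above:

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed when two addresses fall in the same subnet for a given prefix length.
# Arguments: <address-a> <address-b> <prefix-length>
same_subnet() {
  mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# Usage (uncomment to run): gateway and VIP from the example config
# same_subnet 10.25.150.1 10.25.150.10 24 && echo "VIP is in the gateway's subnet"
```

This only checks addressing; it does not prove that the hosts are on the same L2 segment or VLAN.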

Deployment Steps

Load your SSH key for Proxmox access:

eval "$(ssh-agent -s)" && ssh-add ~/.ssh/id_rsa

Initialize your workspace:

tofu init

Review and apply the configuration:

# Review changes
tofu plan

# Deploy cluster
tofu apply

Set up cluster access:

# Copy kubeconfig to your config directory (this overwrites any existing ~/.kube/config)
cat output/kube-config.yaml > ~/.kube/config

# Verify cluster access
kubectl get nodes
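
If you already have a kubeconfig for other clusters, overwriting it may not be what you want. A minimal sketch that backs up the existing file first, assuming a POSIX shell; install_kubeconfig is a hypothetical helper, and the paths in the usage comment are the ones used above:

```shell
# Copy a kubeconfig into place, keeping a .bak copy of any file it replaces.
# Arguments: <source-file> <destination-file>
install_kubeconfig() {
  src=$1
  dest=$2
  [ -f "$src" ] || { echo "kubeconfig not found: $src" >&2; return 1; }
  mkdir -p "$(dirname "$dest")"
  if [ -f "$dest" ]; then
    cp "$dest" "$dest.bak"      # preserve the previous config
  fi
  cp "$src" "$dest"
  chmod 600 "$dest"             # the file contains cluster credentials
}

# Usage (uncomment to run):
# install_kubeconfig output/kube-config.yaml "$HOME/.kube/config"
```

Alternatively, leave ~/.kube/config untouched and point kubectl at the generated file for the current shell with export KUBECONFIG="$PWD/output/kube-config.yaml".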

Maintenance Operations

Node Operations

Applying Node Updates

To update a node, follow these steps:

Prepare the node for maintenance:

kubectl cordon node-name
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

Apply updates via OpenTofu:

tofu apply -target='module.talos.proxmox_virtual_environment_vm.this["node-name"]'

Return the node to service:

kubectl uncordon node-name
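
The three steps above can be chained so that a failure stops the sequence before the node is returned to service. A minimal sketch, assuming a POSIX shell; run and update_node are hypothetical helpers, DRY_RUN is an illustrative variable rather than a tofu or kubectl feature, and work-00 is a placeholder node name:

```shell
# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Cordon, drain, apply, and uncordon a single node; stops at the first failure.
update_node() {
  node=$1
  addr="module.talos.proxmox_virtual_environment_vm.this[\"$node\"]"
  run kubectl cordon "$node" &&
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
    run tofu apply "-target=$addr" &&
    run kubectl uncordon "$node"
}

# Usage (uncomment to run): preview the commands first, then run them.
# DRY_RUN=1 update_node work-00
# update_node work-00
```

If any step fails, the node stays cordoned, which makes the partial state easy to spot with kubectl get nodes.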

Version Upgrades

To upgrade Kubernetes and Talos versions, update the configuration:

cluster = {
  kubernetes_version = "<see https://github.com/kubernetes/kubernetes/releases>" # Target K8s version
  talos_version      = "<see https://github.com/siderolabs/talos/releases>"      # Target Talos version
}

Then review the plan and apply the changes:

# Plan changes
tofu plan -target=module.talos

# Apply updates
tofu apply -target=module.talos
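
Running plan and apply separately means the applied changes can differ from the reviewed ones if anything changed in between. One way to avoid that is a saved plan file. A minimal sketch, assuming a POSIX shell; plan_upgrade and apply_upgrade are hypothetical helpers, the TOFU variable exists only so the binary can be swapped out, and upgrade.tfplan is an arbitrary file name:

```shell
# Binary to invoke; override for experiments, e.g. TOFU=echo.
TOFU=${TOFU:-tofu}

# Write the reviewed plan for the talos module to a file.
plan_upgrade() {
  "$TOFU" plan -target=module.talos -out="${1:-upgrade.tfplan}"
}

# Apply exactly the saved plan. A saved plan is applied without a
# confirmation prompt, so review the plan output first.
apply_upgrade() {
  "$TOFU" apply "${1:-upgrade.tfplan}"
}

# Usage (uncomment to run):
# plan_upgrade && apply_upgrade
```

Plan files can contain sensitive values, so treat upgrade.tfplan like the state file and delete it once applied.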

Recovery Operations

State Recovery

If OpenTofu state is lost, follow these steps:

Import existing infrastructure. The import ID has the form <Proxmox node>/<VM ID>; repeat the command for each VM:

tofu import 'module.talos.proxmox_virtual_environment_vm.this["ctrl-00"]' host3/8101

Synchronize the state:

# Refresh state
tofu refresh

# Verify state
tofu plan
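
With several VMs to re-import, generating the commands from a short inventory reduces typing mistakes. A minimal sketch, assuming a POSIX shell; import_cmd and print_imports are hypothetical helpers, and the inventory format (node name, Proxmox host, VM ID per line) is an assumption. Only the ctrl-00 entry comes from this document:

```shell
# Build one import command for a node name, Proxmox host, and VM ID.
import_cmd() {
  printf "tofu import 'module.talos.proxmox_virtual_environment_vm.this[\"%s\"]' %s/%s\n" "$1" "$2" "$3"
}

# Read "name host vmid" lines from stdin and print an import command for each.
# Blank lines and lines starting with # are skipped.
print_imports() {
  while read -r name host vmid; do
    case $name in
      '' | '#'*) continue ;;
    esac
    import_cmd "$name" "$host" "$vmid"
  done
}

# Usage (uncomment to run): prints the commands for review; nothing is executed.
# print_imports <<'EOF'
# ctrl-00 host3 8101
# EOF
```

Printing rather than executing keeps a review step in place; once the list looks right, the output can be pasted into a shell.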

Node Recovery

To replace a failed node:

Cordon and drain it so its workloads are rescheduled elsewhere:

kubectl cordon failed-node
kubectl drain failed-node --ignore-daemonsets --delete-emptydir-data

Rebuild using OpenTofu:

# Mark the VM for recreation (taint does not remove it from state)
tofu taint 'module.talos.proxmox_virtual_environment_vm.this["failed-node"]'

# Recreate node
tofu apply -target='module.talos.proxmox_virtual_environment_vm.this["failed-node"]'
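
OpenTofu also accepts a -replace option on plan and apply, which forces recreation in one step without a separate taint. A minimal sketch of that alternative, assuming a POSIX shell; rebuild_node is a hypothetical helper, the TOFU variable exists only so the binary can be swapped out, and whether recreating the VM alone is enough to rejoin the node depends on how the module is written:

```shell
# Binary to invoke; override for experiments, e.g. TOFU=echo.
TOFU=${TOFU:-tofu}

# Force recreation of a single node's VM, limited to that resource.
rebuild_node() {
  addr="module.talos.proxmox_virtual_environment_vm.this[\"$1\"]"
  "$TOFU" apply "-replace=$addr" "-target=$addr"
}

# Usage (uncomment to run):
# rebuild_node failed-node
```

Because nothing is marked in state beforehand, cancelling at the confirmation prompt leaves the resource unchanged.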

Monitoring and Troubleshooting

Health Checks

To verify cluster health, check the following:

Node status:

kubectl get nodes -o wide

etcd cluster health:

talosctl -n node-ip etcd status

Control plane status:

kubectl get pods -n kube-system
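
For scripted checks, the node status output can be filtered down to the nodes that need attention. A minimal sketch, assuming a POSIX shell with awk; not_ready_nodes is a hypothetical helper that reads the output of kubectl get nodes --no-headers on stdin:

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Cordoned nodes report "Ready,SchedulingDisabled" and are treated as ready.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (uncomment to run): prints nothing when every node is Ready.
# kubectl get nodes --no-headers | not_ready_nodes
```

To block until every node is Ready instead, kubectl wait --for=condition=Ready nodes --all --timeout=300s is another option; the timeout value here is only an example.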

Common Issues

Node Join Problems

Common causes of node join failures:

- Network connectivity issues
- Machine configuration errors
- Bootstrap process failures

API Server Availability

When the API server is unreachable:

- Verify control plane VIP status
- Check etcd cluster health
- Review API server container logs

Resource Management

Monitor these aspects:

- VM resource utilization
- Storage availability and performance
- Network connectivity and throughput
