# Kubernetes Provisioning with OpenTofu

This guide explains how we use OpenTofu to provision a production-grade Kubernetes cluster running Talos OS on Proxmox.

## Deployment Process

Before you begin deployment, ensure your SSH key is loaded:

eval $(ssh-agent) && ssh-add ~/.ssh/id_rsa

## Infrastructure Overview

### Node Architecture

```yaml
Control Plane:
  count: 3
  type: 'talos'
  features:
    - API server (HA)
    - etcd cluster
    - controller-manager
    - scheduler

Worker Nodes:
  count: 2+
  type: 'talos'
  features:
    - container runtime
    - cilium networking
    - CSI support
    - GPU support (optional)
```

### Network Architecture

| Network | CIDR           | Purpose                    |
| ------- | -------------- | -------------------------- |
| Node    | 10.25.150.0/24 | Kubernetes node networking |
| Pod     | 10.25.0.0/16   | Container networking       |
| Service | 10.26.0.0/16   | Kubernetes services        |
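
When these ranges are adjusted for another environment, it can help to confirm that they do not collide. The following is a minimal Bash sketch, not something taken from the repository: the function names `ip_to_int`, `cidr_bounds`, and `cidrs_overlap` are made up for illustration, and only IPv4 CIDRs without leading zeros in the octets are handled.

```bash
#!/usr/bin/env bash

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  local -a o
  read -r -a o <<< "$1"
  echo "$(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))"
}

# Print the first and last address of a CIDR as two integers.
cidr_bounds() {
  local ip=${1%/*} prefix=${1#*/}
  local base mask
  base=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  echo "$(( base & mask )) $(( (base & mask) | (~mask & 0xFFFFFFFF) ))"
}

# Succeed (exit 0) when the two CIDRs share at least one address.
cidrs_overlap() {
  local a_lo a_hi b_lo b_hi
  read -r a_lo a_hi <<< "$(cidr_bounds "$1")"
  read -r b_lo b_hi <<< "$(cidr_bounds "$2")"
  (( a_lo <= b_hi && b_lo <= a_hi ))
}
```

With the values from the table, `cidrs_overlap 10.25.150.0/24 10.25.0.0/16` succeeds because the Node range lies inside the Pod range, while the Pod and Service ranges do not overlap. Whether the first case matters depends on how Cilium's IPAM is configured, which this guide does not specify.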

### Project Structure

/tofu/
├── main.tf           # Main configuration and node definitions
├── variables.tf      # Input variables
├── output.tf         # Generated outputs (kubeconfig, etc.)
├── providers.tf      # Provider configs (Proxmox, Talos)
├── upgrade-k8s.sh    # Kubernetes upgrade helper
├── terraform.tfvars  # Variable definitions
└── talos/            # Talos cluster module
    ├── config.tf     # Machine configs and bootstrap
    ├── image.tf      # OS image management
    ├── virtual-machines.tf  # Proxmox VM definitions
    ├── manifests/    # Kubernetes bootstrap manifests
    ├── machine-config/     # Config templates
    └── inline-manifests/   # Core component YAMLs

## Core Components

### Proxmox Provider

We use the bpg/proxmox provider to manage VMs declaratively. This enables:

- Version-controlled infrastructure
- Automated deployments
- Easy cluster rebuilds

### Talos OS

Talos is our node OS because it offers:

- Minimal attack surface
- Atomic upgrades
- API-driven management
- Kubernetes-specific design

## VM Configuration

### Node Specs

Define nodes in /tofu/main.tf, for example:

module "talos" {
  nodes = {
    "node1" = {
      host_node     = "proxmox1"
      machine_type  = "controlplane"
      ip            = "10.0.0.1" # Example IP
      cpu           = 4
      ram_dedicated = 8192
      disks = {
        # Additional disk for Longhorn. The OS disk is often handled by the image cloning.
        # The exact structure depends on your module's variables.tf; this example
        # assumes a structure like the one found in the repository.
        longhorn = {                       # Key for the disk, e.g. "longhorn" or "data"
          device     = "/dev/sdb"          # Or another available device
          size       = "180G"
          type       = "scsi"              # Or "virtio", "sata"
          mountpoint = "/var/lib/longhorn" # If applicable for the Talos config
        }
        # os_disk = { ... } # If explicitly defining the OS disk
      }
    }
  }
}

### Custom Images

- Built via Talos Image Factory
- Include core extensions:
  - QEMU guest agent
  - iSCSI tools
- Automatically downloaded to Proxmox

## Machine Configuration

### Template System

- Uses YAML templates for node configs
- Injects per-node settings:
  - Hostname
  - IP address
  - Cluster details

### Core Services

We embed essential services in the Talos config:

- Cilium (CNI)
- CoreDNS
- ConfigMaps for service configuration

## Deployment Process

1. OpenTofu reads the configuration
2. Downloads Talos images
3. Creates Proxmox VMs
4. Applies node configs
5. Bootstraps the first control plane node
6. Generates the kubeconfig
7. Verifies cluster health

## Maintenance Tasks

### Version Upgrades

Update versions in main.tf or the related tfvars files. Note that Talos versions might be specified in multiple places:

- For the Talos image factory (e.g., `module "talos" { image = { version = "vX.Y.Z" } }`)
- For the machine configurations and cluster secrets (e.g., `module "talos" { cluster = { talos_version = "vX.Y.Z" } }`)
- Kubernetes version (e.g., `module "talos" { cluster = { kubernetes_version = "vA.B.C" } }`)

Example snippet from main.tf (actual structure may vary based on module inputs):
```
module "talos" {
  # ...
  image = {
    version = "v1.9.5" # Target Talos version for OS images
    # ...
  }
  cluster = {
    talos_version      = "v1.9.5" # Target Talos version for machine configs
    kubernetes_version = "v1.29.3"  # Target Kubernetes version
    # ...
  }
  # ...
}
```

Set `update = true` for the affected nodes if your OpenTofu module supports this flag for triggering upgrades. Otherwise, `tofu apply` picks up the changed version properties directly.

Run:
```
tofu apply
```
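
Because the Talos version can appear in more than one place, a quick text search can catch a mismatch before applying. This is a rough sketch rather than a real HCL parser: it assumes each `version` or `talos_version` assignment sits on its own line with a quoted value, as in the snippet above, and the function names `talos_versions` and `check_talos_versions` are hypothetical.

```bash
#!/usr/bin/env bash

# Print every distinct value assigned to `version` or `talos_version` in a file.
# `kubernetes_version` is deliberately not matched.
talos_versions() {
  local file=$1
  grep -E '^[[:space:]]*(version|talos_version)[[:space:]]*=' "$file" |
    sed -E 's/^[^=]*=[[:space:]]*"([^"]*)".*/\1/' |
    sort -u
}

# Fail when more than one distinct Talos version is found.
check_talos_versions() {
  local count
  count=$(talos_versions "$1" | wc -l)
  if [ "$count" -gt 1 ]; then
    echo "Talos version mismatch in $1:" >&2
    talos_versions "$1" >&2
    return 1
  fi
}
```

Running `check_talos_versions main.tf` before `tofu apply` would then flag a file where the image version and the machine-config version have drifted apart.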

### Node Management

#### Add/Remove Nodes

1. Modify nodes in main.tf
2. Run `tofu apply`

#### Change Resources

1. Update node specs in main.tf
2. Run `tofu apply`

Note: Resource changes may require VM restarts.

## Initial Setup

### Prerequisites

- Proxmox server running 7.4+
- SSH key access configured
- Network DHCP/DNS ready
- Storage pools configured
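
The minimum-version prerequisite can be checked on the Proxmox host itself. The sketch below assumes GNU `sort -V` is available and that `pveversion` prints a line of the form `pve-manager/<version>/<hash> ...`; the helper names `version_ge` and `pve_version_from` are illustrative only.

```bash
#!/usr/bin/env bash

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract the version field from a string such as
# "pve-manager/8.1.4/abcdef (running kernel: ...)".
pve_version_from() {
  cut -d/ -f2 <<< "$1"
}
```

On the host, this could be used as `version_ge "$(pve_version_from "$(pveversion)")" 7.4 && echo "Proxmox version OK"`.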

### Configuration

Create terraform.tfvars with your environment settings:

proxmox = {
  name         = "host3"              # Your Proxmox host name
  cluster_name = "host3"             # Your Proxmox cluster name
  endpoint     = "https://pve:8006"   # Your Proxmox API endpoint
  insecure     = false                # Set to true if using self-signed certs
  username     = "root@pam"           # Your Proxmox username
  api_token    = "USER@pam!ID=TOKEN"  # Your Proxmox API token
}

cluster = {
  name               = "talos"                # Cluster name
  endpoint           = "api.kube.pc-tips.se"  # API endpoint
  kubernetes_version = "1.33.0"               # Kubernetes version
  talos_version      = "v1.10.1"              # Talos version
  gateway            = "10.25.150.1"          # Network gateway
  vip                = "10.25.150.10"         # Control plane VIP
}

nodes = {
  "ctrl-00" = {
    host_node     = "host3"
    machine_type  = "controlplane"
    ip            = "10.25.150.11"
    mac_address   = "bc:24:11:XX:XX:XX"
    vm_id         = 8101
    cpu           = 6
    ram_dedicated = 6144
    update        = false
    igpu          = false
  }
  # Additional nodes...
}

### Deployment Steps

Load your SSH key for Proxmox access:

eval $(ssh-agent) && ssh-add ~/.ssh/id_rsa

Initialize your workspace:

tofu init

Review and apply the configuration:

# Review changes
tofu plan

# Deploy cluster
tofu apply

Set up cluster access:

# Copy kubeconfig to your config directory (this overwrites any existing ~/.kube/config)
cat output/kube-config.yaml > ~/.kube/config

# Verify cluster access
kubectl get nodes
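
Writing straight into `~/.kube/config` replaces whatever was there before. As one possible safeguard, the sketch below keeps a backup of the previous file and tightens permissions on the new one. The function name `install_kubeconfig` and the `.bak` suffix are assumptions made for this example, not part of the repository.

```bash
#!/usr/bin/env bash

# Copy a generated kubeconfig into place, keeping a backup of any existing file.
install_kubeconfig() {
  local src=$1 dest=${2:-$HOME/.kube/config}
  [ -f "$src" ] || { echo "missing kubeconfig: $src" >&2; return 1; }
  mkdir -p "$(dirname "$dest")"
  if [ -f "$dest" ]; then
    cp "$dest" "$dest.bak"   # keep the previous config for other clusters
  fi
  cp "$src" "$dest"
  chmod 600 "$dest"          # readable by the current user only
}
```

A typical call would be `install_kubeconfig output/kube-config.yaml`. An alternative that leaves the existing file untouched is to point kubectl at the generated file with `export KUBECONFIG="$PWD/output/kube-config.yaml"`.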

## Maintenance Operations

### Node Operations

#### Applying Node Updates

To update a node, follow these steps:

Prepare the node for maintenance:

kubectl cordon node-name
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

Apply updates via OpenTofu:

tofu apply -target='module.talos.proxmox_virtual_environment_vm.this["node-name"]'

Return the node to service:

kubectl uncordon node-name
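
The three steps can be chained so that a failure at any point stops the sequence. This is only a sketch: the wrapper name `update_node` is hypothetical, and it assumes the resource address shown above matches your module.

```bash
#!/usr/bin/env bash

# Cordon, drain, apply, and uncordon a single node; stop at the first failure.
update_node() {
  local node=$1
  [ -n "$node" ] || { echo "usage: update_node <node-name>" >&2; return 2; }
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
    tofu apply -target="module.talos.proxmox_virtual_environment_vm.this[\"$node\"]" &&
    kubectl uncordon "$node"
}
```

If the drain or the apply fails, the node stays cordoned, which leaves it out of scheduling until the problem has been looked at. The quoting around the `-target` value keeps the shell from stripping the inner double quotes.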

### Version Upgrades

To upgrade Kubernetes and Talos versions, update the configuration:

cluster = {
  kubernetes_version = "1.33.0"  # Target K8s version
  talos_version     = "v1.10.1" # Target Talos version
}

Then apply the changes in stages:

# Plan changes
tofu plan -target=module.talos

# Apply updates
tofu apply -target=module.talos

## Recovery Operations

### State Recovery

If OpenTofu state is lost, follow these steps:

Import existing infrastructure:

tofu import 'module.talos.proxmox_virtual_environment_vm.this["ctrl-00"]' host3/8101

Synchronize the state:

# Refresh state
tofu refresh

# Verify state
tofu plan
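
With several nodes, the import has to be repeated for each VM. The sketch below generates the commands from a list of `name host vm_id` triples so they can be reviewed before anything is run. The function name `print_import_cmds` and the input format are assumptions for this example; the resource address and the `host/vm_id` ID format follow the import command shown above.

```bash
#!/usr/bin/env bash

# Read "name host vm_id" lines from stdin and print one import command per node.
print_import_cmds() {
  local name host vmid
  while read -r name host vmid; do
    [ -n "$name" ] || continue   # skip blank lines
    printf "tofu import 'module.talos.proxmox_virtual_environment_vm.this[\"%s\"]' %s/%s\n" \
      "$name" "$host" "$vmid"
  done
}
```

For example, `printf 'ctrl-00 host3 8101\n' | print_import_cmds` prints the same command as the one above. Piping the output to `sh` would execute the imports once they look right.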

### Node Recovery

To replace a failed node:

Cordon and drain it so its workloads are rescheduled elsewhere:

kubectl cordon failed-node
kubectl drain failed-node --ignore-daemonsets --delete-emptydir-data

Rebuild using OpenTofu:

# Mark the VM for recreation on the next apply
tofu taint 'module.talos.proxmox_virtual_environment_vm.this["failed-node"]'

# Recreate node
tofu apply -target='module.talos.proxmox_virtual_environment_vm.this["failed-node"]'

## Security Implementation

### Core Security Features

Our cluster implements several security measures:

- API server endpoint protection via Gateway API
- etcd encryption at rest enabled
- Node authentication via Talos PKI
- Network isolation with Cilium policies

### Sensitive Files

The deployment generates several sensitive files that must be secured:

output/:
  - kube-config.yaml             # Cluster access configuration
  - talos-config.yaml            # Talos management configuration
  - talos-machine-config-*.yaml  # Node configurations

Important: These files contain cluster access credentials and should be stored securely.
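
One basic measure is to restrict filesystem permissions on the generated files. The following sketch assumes the files live in an `output/` directory as listed above; the function name `secure_output_dir` is hypothetical, and this does not replace keeping the directory out of version control or using a proper secret store.

```bash
#!/usr/bin/env bash

# Restrict the output directory and its YAML files to the current user.
secure_output_dir() {
  local dir=${1:-output}
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  chmod 700 "$dir"
  find "$dir" -type f -name '*.yaml' -exec chmod 600 {} +
}
```

Calling `secure_output_dir` after each `tofu apply` would reset permissions on any regenerated files.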

## Monitoring and Troubleshooting

### Health Checks

To verify cluster health, check the following:

Node status:

kubectl get nodes -o wide

etcd cluster health:

talosctl -n node-ip etcd status

Control plane status:

kubectl get pods -n kube-system
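
For scripted checks, the node listing can be filtered down to the nodes that are not ready. This is a small sketch under the assumption that the second column of `kubectl get nodes --no-headers` is the status; the function name `not_ready_nodes` is made up for this example.

```bash
#!/usr/bin/env bash

# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose status does not start with "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") still count as ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}
```

It would be used as `kubectl get nodes --no-headers | not_ready_nodes`; empty output means every node reports Ready.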

### Common Issues

#### Node Join Problems

Common causes of node join failures:

- Network connectivity issues
- Machine configuration errors
- Bootstrap process failures

#### API Server Availability

When the API server is unreachable:

1. Verify control plane VIP status
2. Check etcd cluster health
3. Review API server container logs

### Resource Management

Monitor these aspects:

- VM resource utilization
- Storage availability and performance
- Network connectivity and throughput
