Infrastructure Controllers

This document outlines the core infrastructure controllers deployed in the Kubernetes cluster.

Core controllers

Node feature discovery

Purpose: Hardware feature and system configuration detection
Further Reading: GPU Support in Kubernetes

NVIDIA GPU operator

Purpose: Automates management of the NVIDIA software components required on GPU-enabled nodes
Further Reading: GPU Support in Kubernetes

ArgoCD

Purpose: GitOps deployment controller
Version: Latest stable (automated by Renovate)
Features:
- Application state synchronization
- Progressive delivery
- Resource health monitoring
- Multi-cluster management
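
As an illustration of application state synchronization, the following is a minimal sketch of an Argo CD Application with automated sync. The application name, repository URL, path, and target namespace are placeholders, not values taken from this cluster.

```yaml
# Hypothetical Application; name, repoURL, path, and namespaces are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/repo.git
    targetRevision: HEAD
    path: apps/example
  destination:
    # In-cluster API server; a registered remote cluster URL would go here
    # for multi-cluster management.
    server: https://kubernetes.default.svc
    namespace: example
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```

With `automated` sync enabled, Argo CD reconciles the live state against Git without a manual sync; omitting that block leaves the application in manual sync mode.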

cert-manager

Purpose: Certificate management
Version: Latest v1.x (automated by Renovate)
Features:
- Let's Encrypt integration
- Internal Public Key Infrastructure (PKI) support
- Automated renewal
- Container Storage Interface (CSI) driver integration
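
The Let's Encrypt integration and automated renewal can be sketched as a ClusterIssuer plus a Certificate. This is an assumption-laden example: the issuer name, email address, ingress class, hostname, and renewal window are all illustrative, and it points at the Let's Encrypt staging endpoint rather than production.

```yaml
# Hypothetical issuer; email, names, and ingress class are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            # Requires a cert-manager release that supports ingressClassName.
            ingressClassName: nginx
---
# Hypothetical certificate issued by the ClusterIssuer above.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-tls
  namespace: example
spec:
  secretName: example-tls
  dnsNames:
    - app.example.com
  duration: 2160h     # 90 days
  renewBefore: 720h   # start renewal 30 days before expiry
  issuerRef:
    name: letsencrypt-staging
    kind: ClusterIssuer
```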

External secrets

Purpose: Secrets management
Version: Latest stable (automated by Renovate)
Features:
- Bitwarden integration
- Secrets rotation
- Secure key distribution
- Cross-namespace secrets
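
A minimal sketch of how a secret might be pulled from a Bitwarden-backed store follows. The store name, item key, and namespaces are hypothetical, and the `apiVersion` depends on the installed External Secrets Operator release (newer releases serve `external-secrets.io/v1`).

```yaml
# Hypothetical ExternalSecret; store name and remote key are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-credentials
  namespace: example
spec:
  refreshInterval: 1h          # how often the secret is re-read from the provider
  secretStoreRef:
    kind: ClusterSecretStore   # cluster-scoped store, usable from any namespace
    name: bitwarden
  target:
    name: example-credentials  # Kubernetes Secret created by the operator
    creationPolicy: Owner
  data:
    - secretKey: password
      remoteRef:
        key: example-item
```

Referencing a ClusterSecretStore rather than a namespaced SecretStore is one way to share a single provider configuration across namespaces.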

Longhorn

Features:
- Distributed block storage
- Volume replication
- Backup capabilities
- Default reclaim policy set to Retain
- Storage overprovisioning
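
The Retain reclaim policy and volume replication can be expressed in a StorageClass roughly as follows. This is a sketch, not the cluster's actual manifest: the class name and the replica count of 3 are assumptions.

```yaml
# Hypothetical StorageClass; name and replica count are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-retain
provisioner: driver.longhorn.io
reclaimPolicy: Retain          # keep the PersistentVolume after its claim is deleted
allowVolumeExpansion: true
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"        # replicas kept on different nodes
  staleReplicaTimeout: "30"    # minutes before a failed replica is cleaned up
```

With `Retain`, volumes released by a deleted claim must be cleaned up or rebound manually, which trades convenience for protection against accidental data loss.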

Kubechecks

Purpose: Validate ArgoCD applications via policy checks
Version: Latest stable (automated by Renovate)
Features:
- Schema validation
- GitHub integration
- Custom webhook support
- Writable /tmp mount to support repo clones
- Optional OpenAI summaries (disabled)

Velero

Purpose: Cluster backup and restore
Version: Latest stable (automated by Renovate)
Features:
- Object storage backups via MinIO
- Volume snapshots through Longhorn
- Node-agent deployment for file system backup
- Prometheus metrics
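
A scheduled backup that combines object storage and volume snapshots might look like the sketch below. The schedule, retention period, and storage location name are assumptions chosen for illustration.

```yaml
# Hypothetical Schedule; cron expression, TTL, and location are assumptions.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"              # every day at 02:00
  template:
    includedNamespaces:
      - "*"
    snapshotVolumes: true            # take volume snapshots where supported
    defaultVolumesToFsBackup: false  # set to true to back up volumes via the node agent
    ttl: 168h0m0s                    # keep each backup for 7 days
    storageLocation: default         # BackupStorageLocation pointing at the object store
```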

Resource management

Base Resource Configuration

Standard Resource Profile:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi

High-Availability Profile:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 1Gi
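
To show where a profile ends up in a manifest, here is a minimal sketch that applies the standard profile to a container. The Deployment name, namespace, labels, and image are hypothetical.

```yaml
# Hypothetical Deployment using the standard resource profile.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-controller
  namespace: example
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: example-controller
  template:
    metadata:
      labels:
        app.kubernetes.io/name: example-controller
    spec:
      containers:
        - name: controller
          image: registry.example.com/example-controller:1.0.0
          resources:
            requests:        # used by the scheduler for placement
              cpu: 100m
              memory: 256Mi
            limits:          # enforced at runtime
              cpu: 1000m
              memory: 512Mi
```

Because requests are lower than limits, pods using this profile fall into the Burstable QoS class; setting requests equal to limits, as several rows in the controller table do, yields Guaranteed QoS.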

Controller-specific resources

Controller	CPU Request	Memory Request	CPU Limit	Memory Limit
ArgoCD Server	500m	512Mi	2000m	1Gi
cert-manager	75m	96Mi	75m	96Mi
External Secrets	80m	100Mi	80m	100Mi
Longhorn Manager	250m	256Mi	250m	256Mi
Kubechecks	200m	256Mi	500m	512Mi

High availability

Deployment strategy

Replication:
  argocd-server: autoscaled (min 2)
  argocd-repo-server: autoscaled (min 2)
  argocd-applicationset: 2
  cert-manager-controller: 2
  cert-manager-webhook: 2
  cert-manager-cainjector: 2
  external-secrets-operator: 2
  external-secrets-webhook: 2
  longhorn-manager: 3 # One per node

Pod Disruption Budget:
  minAvailable: 1
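
As a sketch of the disruption budget above, the following manifest targets the Argo CD server. The resource name is hypothetical, and the label selector assumes the pods carry the `app.kubernetes.io/name: argocd-server` label; adjust it to match the labels actually in use.

```yaml
# Hypothetical PodDisruptionBudget; the selector labels are an assumption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: argocd-server
  namespace: argocd
spec:
  minAvailable: 1   # voluntary evictions may never drop below one ready pod
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-server
```

With two replicas and `minAvailable: 1`, a node drain can evict one pod at a time while the other keeps serving.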

Node placement

Topology Spread:
  maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
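
In a pod template, the spread settings above would sit under `topologySpreadConstraints`. The fragment below is a sketch; the label selector is an assumption and must match the pods being spread.

```yaml
# Fragment of a Deployment spec; the label selector is an assumption.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                           # pod counts per node may differ by at most one
          topologyKey: kubernetes.io/hostname  # spread across individual nodes
          whenUnsatisfiable: DoNotSchedule     # leave the pod Pending rather than violate the skew
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: argocd-server
```

Note that `DoNotSchedule` is a hard constraint: if too few nodes are available, replicas stay Pending. `ScheduleAnyway` is the softer alternative.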

Monitoring Integration

Metrics Collection

All controllers expose Prometheus metrics:
- ArgoCD: Application sync status, performance metrics
- cert-manager: Certificate status, renewal metrics
- External Secrets: Sync status, error rates
- Longhorn: Volume health, performance metrics

Alert Rules

Critical Alerts:
  - ArgoCD application out of sync > 1h
  - Certificate renewal failure
  - Secrets sync failure
  - Volume degradation

Warning Alerts:
  - High resource usage
  - Slow sync operations
  - Certificate expiring soon
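
Two of the rules above could be written as Prometheus alerting rules along these lines. This sketch assumes the Prometheus Operator is installed (for the `PrometheusRule` resource) and that Argo CD and cert-manager metrics are being scraped; the rule names, namespace, and the 14-day expiry threshold are assumptions.

```yaml
# Hypothetical PrometheusRule; names, namespace, and thresholds are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-controllers
  namespace: monitoring
spec:
  groups:
    - name: infrastructure-controllers
      rules:
        - alert: ArgoCDApplicationOutOfSync
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "Argo CD application {{ $labels.name }} has been out of sync for more than 1h"
        - alert: CertificateExpiringSoon
          # Fires when a certificate expires in less than 14 days.
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in under 14 days"
```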

Version Management

Update Strategy

- Automated updates via Renovate
- Version constraints in Helm values
- Progressive rollout through environments
- Regular review of change logs

Current Versions

Controllers:
  argocd: <see https://github.com/argoproj/argo-cd/releases>
  cert-manager: <see https://github.com/cert-manager/cert-manager/releases>
  external-secrets: <see https://github.com/external-secrets/external-secrets/releases>
  longhorn: <see https://github.com/longhorn/longhorn/releases>

Security Configuration

Role-Based Access Control (RBAC) settings

- Minimal permissions model
- Namespace isolation
- Service account restrictions
- Security context constraints

Network Policies

Ingress Rules:
  - Allow metrics collection
  - Allow API access from authorized sources
  - Allow cross-controller communication
  - Deny all other traffic
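
The ingress rules above can be sketched as a default-deny policy plus an explicit allowance for metrics scraping. The example uses the cert-manager namespace and its controller metrics port (9402); the policy names and the `monitoring` namespace are assumptions.

```yaml
# Deny all ingress to pods in the namespace unless another policy allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: cert-manager
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow Prometheus in the (assumed) monitoring namespace to scrape metrics.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metrics-scrape
  namespace: cert-manager
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9402
```

Network policies are additive, so further allow rules (for example, API server access to admission webhooks) are added as separate policies alongside the default deny.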

Troubleshooting Guide

Common Issues

Controller Pod Crashes:
- Check resource limits
- Review recent changes
- Monitor system resources

Sync Failures:
- Verify Git repository access
- Check network connectivity
- Review controller logs

Performance Issues:
- Monitor resource usage
- Check node capacity
- Review scaling configuration

Debug Commands

# Check controller status
kubectl -n argocd get pods
kubectl -n cert-manager get pods
kubectl -n external-secrets get pods

# View controller logs (the label key varies by chart; try app.kubernetes.io/name if app matches nothing)
kubectl -n <namespace> logs -l app=<controller-name>

# Verify RBAC
kubectl auth can-i --as system:serviceaccount:<ns>:<sa> <verb> <resource>
