Infrastructure Controllers
This document outlines the core infrastructure controllers deployed in my Kubernetes cluster.
Core Controllers
Node Feature Discovery
- Purpose: Hardware feature and system configuration detection
- Further Reading: GPU Support in Kubernetes
NVIDIA GPU Operator
- Purpose: Automates management of NVIDIA software components for GPU-enabled nodes
- Further Reading: GPU Support in Kubernetes
ArgoCD
- Purpose: GitOps deployment controller
- Version: Latest stable (automated by Renovate)
- Features:
- Application state synchronization
- Progressive delivery
- Resource health monitoring
- Multi-cluster management
cert-manager
- Purpose: Certificate management
- Version: Latest v1.x (automated by Renovate)
- Features:
- Let's Encrypt integration
- Internal PKI support
- Automated renewal
- CSI driver integration
External Secrets
- Purpose: Secrets management
- Version: Latest stable (automated by Renovate)
- Features:
- Bitwarden integration
- Secrets rotation
- Secure key distribution
- Cross-namespace secrets
Longhorn
Features:
- Distributed block storage
- Volume replication
- Backup capabilities
- Default reclaim policy set to
Retain
- Storage overprovisioning
Kubechecks
- Purpose: Validate ArgoCD applications via policy checks
- Version: Latest stable (automated by Renovate)
- Features:
- Schema validation
- GitHub integration
- Custom webhook support
- Writable
/tmp
mount to support repo clones - Optional OpenAI summaries (currently disabled)
- Loads Cloudflare provider CRDs so kubeconform validates Crossplane resources
Velero
- Purpose: Cluster backup and restore
- Version: Latest stable (automated by Renovate)
- Features:
- Object storage backups via Minio
- Volume snapshots through Longhorn
- Node-agent deployment for filesystem backup
- Prometheus metrics
Resource Management
Base Resource Configuration
Standard Resource Profile:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
High-Availability Profile:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2000m
memory: 1Gi
Controller-Specific Resources
Controller | CPU Request | Memory Request | CPU Limit | Memory Limit |
---|---|---|---|---|
ArgoCD Server | 500m | 512Mi | 2000m | 1Gi |
cert-manager | 75m | 96Mi | 75m | 96Mi |
External Secrets | 80m | 100Mi | 80m | 100Mi |
Longhorn Manager | 250m | 256Mi | 250m | 256Mi |
Kubechecks | 200m | 256Mi | 500m | 512Mi |
High Availability
Deployment Strategy
Replication:
argocd-server: autoscaled (min 2)
argocd-repo-server: autoscaled (min 2)
argocd-applicationset: 2
cert-manager-controller: 2
cert-manager-webhook: 2
cert-manager-cainjector: 2
external-secrets-operator: 2
external-secrets-webhook: 2
longhorn-manager: 3 # One per node
Pod Disruption Budget:
minAvailable: 1
Node Placement
Topology Spread:
maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
Monitoring Integration
Metrics Collection
All controllers expose Prometheus metrics:
- ArgoCD: Application sync status, performance metrics
- cert-manager: Certificate status, renewal metrics
- External Secrets: Sync status, error rates
- Longhorn: Volume health, performance metrics
Alert Rules
Critical Alerts:
- ArgoCD application out of sync > 1h
- Certificate renewal failure
- Secrets sync failure
- Volume degradation
Warning Alerts:
- High resource usage
- Slow sync operations
- Certificate expiring soon
Version Management
Update Strategy
- Automated updates via Renovate
- Version constraints in Helm values
- Progressive rollout through environments
- Regular review of change logs
Current Versions
Controllers:
argocd: <see https://github.com/argoproj/argo-cd/releases>
cert-manager: <see https://github.com/cert-manager/cert-manager/releases>
external-secrets: <see https://github.com/external-secrets/external-secrets/releases>
longhorn: <see https://github.com/longhorn/longhorn/releases>
Security Configuration
RBAC Settings
- Minimal permissions model
- Namespace isolation
- Service account restrictions
- Security context constraints
Network Policies
Ingress Rules:
- Allow metrics collection
- Allow API access from authorized sources
- Allow cross-controller communication
- Deny all other traffic
Troubleshooting Guide
Common Issues
-
Controller Pod Crashes
- Check resource limits
- Review recent changes
- Monitor system resources
-
Sync Failures
- Verify Git repository access
- Check network connectivity
- Review controller logs
-
Performance Issues
- Monitor resource usage
- Check node capacity
- Review scaling configuration
Debug Commands
# Check controller status
kubectl -n argocd get pods
kubectl -n cert-manager get pods
kubectl -n external-secrets get pods
# View controller logs
kubectl -n <namespace> logs -l app=<controller-name>
# Verify RBAC
kubectl auth can-i --as system:serviceaccount:<ns>:<sa> <verb> <resource>