CloudNativePG Troubleshooting
This guide covers common issues when running CloudNativePG (CNPG) clusters with Barman backup integration.
WAL Volume Full - Disk Space Issues
Symptoms
- Pod stuck in CrashLoopBackOff with 1/2 containers ready
- Error logs showing: "Detected low-disk space condition, avoid starting the instance"
- Cluster status shows: "Not enough disk space" or "Insufficient disk space detected"
- Pod cannot start the PostgreSQL instance
Root Causes
- WAL files not being archived: files accumulate in /var/lib/postgresql/wal/pg_wal/ when Barman archiving fails
- Incorrect Barman configuration: a wrong destinationPath in the ObjectStore prevents proper archiving
- Timeline mismatch: old WAL files from previous timelines are not cleaned up after a failover (see the timeline check below)
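If you suspect a timeline mismatch, compare the timeline PostgreSQL reports with the 8-digit timeline prefix of the files sitting in pg_wal (a quick sketch, assuming psql is available in the postgres container with local auth as the postgres user):
kubectl exec -n <namespace> <pod-name> -c postgres -- \
  psql -U postgres -Atc "SELECT timeline_id FROM pg_control_checkpoint()"
# Compare against the first 8 hex digits of the WAL file names, e.g. 00000001 = timeline 1
kubectl exec -n <namespace> <pod-name> -c postgres -- ls /var/lib/postgresql/wal/pg_wal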
Diagnosis
Check the pod logs for disk space errors:
kubectl logs -n <namespace> <pod-name> -c postgres --tail=50
# Look for: "Detected low-disk space condition"
Check actual disk usage on a running pod:
kubectl exec -n <namespace> <pod-name> -c plugin-barman-cloud -- \
python3 -c "import os; st = os.statvfs('/var/lib/postgresql/wal'); \
print(f'{st.f_bavail / st.f_blocks * 100:.1f}% free')"
Check for stuck WAL files waiting to be archived:
kubectl exec -n <namespace> <pod-name> -c plugin-barman-cloud -- \
python3 -c "
import os
wal_dir = '/var/lib/postgresql/wal/pg_wal'
ready_files = len([f for f in os.listdir(f'{wal_dir}/archive_status') if f.endswith('.ready')])
print(f'Files waiting to archive: {ready_files}')
"
Check cluster status:
kubectl get cluster -n <namespace> <cluster-name> -o jsonpath='{.status.phase}'
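If the cnpg kubectl plugin is installed, its status view aggregates the phase, replication state, and archiving status in one place:
kubectl cnpg status <cluster-name> -n <namespace> --verbose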
Solution
1. Fix Barman ObjectStore Configuration
Ensure your MinIO/S3 ObjectStore uses the cluster name in its destinationPath, not a specific pod name:
Wrong:
apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: my-minio-store
spec:
  configuration:
    destinationPath: s3://bucket/namespace/my-cluster-1  # ❌ Pod-specific
Correct:
apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: my-minio-store
spec:
  configuration:
    destinationPath: s3://bucket/namespace/my-cluster  # ✅ Cluster name
Apply the fix:
kubectl apply -f your-database.yaml
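To confirm that archiving recovers after the fix, force a WAL switch on the primary and check pg_stat_archiver (a sketch, assuming psql access as the postgres user in the postgres container):
kubectl exec -n <namespace> <primary-pod> -c postgres -- \
  psql -U postgres -c "SELECT pg_switch_wal()"
kubectl exec -n <namespace> <primary-pod> -c postgres -- \
  psql -U postgres -c "SELECT last_archived_wal, last_failed_wal, failed_count FROM pg_stat_archiver"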
2. Clean Up Stuck WAL Files
If the replica is stuck with old timeline WAL files:
# Delete old WAL files from previous timeline (example: timeline 1)
kubectl exec -n <namespace> <pod-name> -c plugin-barman-cloud -- \
python3 -c "
import os
wal_dir = '/var/lib/postgresql/wal/pg_wal'
archive_status_dir = f'{wal_dir}/archive_status'
# Get files from the old timeline (adjust the timeline number as needed)
files = [(f, os.path.join(wal_dir, f)) for f in os.listdir(wal_dir)
         if os.path.isfile(os.path.join(wal_dir, f))
         and f.startswith('00000001')  # Timeline 1
         and not f.endswith('.history')]
files.sort(key=lambda x: os.path.getmtime(x[1]))
# Delete the oldest files to free space
deleted = 0
for filename, filepath in files[:200]:
    os.remove(filepath)
    deleted += 1
    # Also remove the matching archive_status files
    for ext in ['.ready', '.done']:
        status_file = os.path.join(archive_status_dir, filename + ext)
        if os.path.exists(status_file):
            os.remove(status_file)
print(f'Deleted {deleted} old WAL files')
"
3. Rebuild the Replica (if necessary)
If the replica has timeline mismatches or corruption, force a rebuild:
# Delete the pod
kubectl delete pod -n <namespace> <pod-name>
# If still failing, delete the PVCs to force full rebuild
kubectl delete pvc -n <namespace> <pod-name>
kubectl delete pvc -n <namespace> <pod-name>-wal
CNPG will automatically:
- Create new PVCs
- Run a join job to bootstrap from the primary
- Start the new replica pod
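You can watch the rebuild using the cluster label (the same label used in Useful Commands below); the join job should complete before the new pod reaches 2/2 containers ready:
kubectl get jobs,pods -n <namespace> -l cnpg.io/cluster=<cluster-name> -w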
Prevention
1. Properly Size WAL Volumes
In your Cluster spec:
spec:
  walStorage:
    size: 4Gi  # Increase if needed based on your WAL generation rate
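To pick a size, estimate the WAL generation rate by sampling the current WAL insert location twice on the primary (a rough sketch; assumes psql access as the postgres user and a representative 60-second window):
LSN1=$(kubectl exec -n <namespace> <primary-pod> -c postgres -- psql -U postgres -Atc "SELECT pg_current_wal_lsn()")
sleep 60
LSN2=$(kubectl exec -n <namespace> <primary-pod> -c postgres -- psql -U postgres -Atc "SELECT pg_current_wal_lsn()")
# Bytes of WAL generated per minute; size walStorage to hold several hours at this rate
kubectl exec -n <namespace> <primary-pod> -c postgres -- \
  psql -U postgres -Atc "SELECT pg_size_pretty(pg_wal_lsn_diff('$LSN2', '$LSN1'))"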
2. Monitor Continuous Archiving
Check the archiving status regularly:
kubectl get cluster -n <namespace> <cluster-name> \
-o jsonpath='{.status.conditions[?(@.type=="ContinuousArchiving")]}'
Should show:
{
  "status": "True",
  "message": "Continuous archiving is working"
}
3. Configure Alerts
Add Prometheus alerts for:
- WAL disk usage > 80%
- Failed WAL archiving
- Cluster not ready
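The disk-usage alert can be built on the standard kubelet volume metrics; for example, an ad-hoc query over the Prometheus HTTP API (the Prometheus URL and the -wal PVC name pattern are assumptions for your environment):
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' --data-urlencode \
  'query=kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*-wal"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*-wal"} > 0.8'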
4. Use Correct Retention Policies
spec:
  retentionPolicy: "30d"  # Adjust based on your requirements
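Retention only matters if backups are actually being taken; a quick sanity check is to list the Backup and ScheduledBackup objects in the namespace:
kubectl get backup -n <namespace>
kubectl get scheduledbackup -n <namespace>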
Replication Lag
Symptoms
- Cluster status shows high replication lag
- Queries show: "requested WAL segment has already been removed"
Diagnosis
kubectl get cluster -n <namespace> <cluster-name> -o yaml | grep -A 10 replication
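For a more detailed view, query pg_stat_replication on the primary (a sketch, assuming psql access as the postgres user):
kubectl exec -n <namespace> <primary-pod> -c postgres -- psql -U postgres -c \
  "SELECT application_name, state, sync_state, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag FROM pg_stat_replication"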
Solution
If a replica is too far behind and the primary has removed needed WAL segments:
- Delete the replica pod to trigger a rebuild
- Ensure wal_keep_size is appropriately configured in the cluster spec (see the sketch below)
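A minimal sketch of the second point, setting wal_keep_size through the cluster's PostgreSQL parameters (the 1GB value is only an example; tune it to your WAL generation rate and walStorage size):
kubectl patch cluster <cluster-name> -n <namespace> --type merge \
  -p '{"spec":{"postgresql":{"parameters":{"wal_keep_size":"1GB"}}}}'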
Failed Switchover/Failover
Symptoms
- Cluster stuck in: "Primary instance is being restarted without a switchover"
- Pods showing unhealthy readiness probes
Solution
# Force delete stuck primary
kubectl delete pod -n <namespace> <pod-name> --force --grace-period=0
# Wait for cluster to reconcile
kubectl wait --for=condition=Ready cluster/<cluster-name> -n <namespace> --timeout=5m
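Once the cluster reconciles, confirm which instance is now the primary:
kubectl get cluster -n <namespace> <cluster-name> -o jsonpath='{.status.currentPrimary}'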
Useful Commands
Check cluster health
kubectl get cluster -n <namespace> <cluster-name> -o json | \
jq -r '.status | "Phase: \(.phase)\nInstances: \(.instances)\nReady: \(.readyInstances)\nPrimary: \(.currentPrimary)"'
View all instances status
kubectl get pods -n <namespace> -l cnpg.io/cluster=<cluster-name>
Check WAL archiving on primary
kubectl logs -n <namespace> <primary-pod> -c plugin-barman-cloud | \
grep -i archive | tail -20
Verify ObjectStore configuration
kubectl get objectstore -n <namespace> <store-name> -o yaml