CloudNativePG Troubleshooting

This guide covers common issues when running CloudNativePG (CNPG) clusters with Barman backup integration.

WAL Volume Full - Disk Space Issues

Symptoms

  • Pod stuck in CrashLoopBackOff with 1/2 containers ready
  • Error logs showing: "Detected low-disk space condition, avoid starting the instance"
  • Cluster status shows: "Not enough disk space" or "Insufficient disk space detected"
  • Pod cannot start PostgreSQL instance

Root Causes

  1. WAL files not being archived: files accumulate in /var/lib/postgresql/wal/pg_wal/ when Barman archiving fails (see the quick check after this list)
  2. Incorrect Barman configuration: Wrong destinationPath in ObjectStore prevents proper archiving
  3. Timeline mismatch: Old WAL files from previous timelines not cleaned up after failover
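
If you suspect archiving is failing (cause 1 above), PostgreSQL's archiver statistics give quick confirmation. A minimal check, assuming you can exec into the postgres container and that local peer authentication allows psql as the postgres user:

kubectl exec -n <namespace> <primary-pod> -c postgres -- \
  psql -U postgres -c \
  "SELECT archived_count, last_archived_wal, failed_count, last_failed_wal FROM pg_stat_archiver;"

A growing failed_count with a stale last_archived_wal points at the ObjectStore configuration rather than at disk sizing.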

Diagnosis

Check the pod logs for disk space errors:

kubectl logs -n <namespace> <pod-name> -c postgres --tail=50
# Look for: "Detected low-disk space condition"

Check actual disk usage on a running pod:

kubectl exec -n <namespace> <pod-name> -c plugin-barman-cloud -- \
python3 -c "import os; st = os.statvfs('/var/lib/postgresql/wal'); \
print(f'{st.f_bavail / st.f_blocks * 100:.1f}% free')"

Check for stuck WAL files waiting to be archived:

kubectl exec -n <namespace> <pod-name> -c plugin-barman-cloud -- \
python3 -c "
import os
wal_dir = '/var/lib/postgresql/wal/pg_wal'
ready_files = len([f for f in os.listdir(f'{wal_dir}/archive_status') if f.endswith('.ready')])
print(f'Files waiting to archive: {ready_files}')
"

Check cluster status:

kubectl get cluster -n <namespace> <cluster-name> -o jsonpath='{.status.phase}'
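
A healthy cluster typically reports:

Cluster in healthy state

Any phase or message mentioning disk space confirms the symptoms above.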

Solution

1. Fix Barman ObjectStore Configuration

Ensure the destinationPath in your MinIO/S3 ObjectStore points at the cluster name, not a specific pod name:

Wrong:

apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: my-minio-store
spec:
  configuration:
    destinationPath: s3://bucket/namespace/my-cluster-1  # ❌ Pod-specific

Correct:

apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: my-minio-store
spec:
  configuration:
    destinationPath: s3://bucket/namespace/my-cluster  # ✅ Cluster name

Apply the fix:

kubectl apply -f your-database.yaml

2. Clean Up Stuck WAL Files

If the replica is stuck with old timeline WAL files:

# Delete old WAL files from previous timeline (example: timeline 1)
kubectl exec -n <namespace> <pod-name> -c plugin-barman-cloud -- \
python3 -c "
import os

wal_dir = '/var/lib/postgresql/wal/pg_wal'
archive_status_dir = f'{wal_dir}/archive_status'

# Get files from old timeline (adjust timeline number as needed)
files = [(f, os.path.join(wal_dir, f)) for f in os.listdir(wal_dir)
         if os.path.isfile(os.path.join(wal_dir, f))
         and f.startswith('00000001')  # Timeline 1
         and not f.endswith('.history')]
files.sort(key=lambda x: os.path.getmtime(x[1]))

# Delete oldest files to free space
deleted = 0
for filename, filepath in files[:200]:
    os.remove(filepath)
    deleted += 1
    # Also remove the matching archive_status files
    for ext in ['.ready', '.done']:
        status_file = os.path.join(archive_status_dir, filename + ext)
        if os.path.exists(status_file):
            os.remove(status_file)

print(f'Deleted {deleted} old WAL files')
"

3. Rebuild the Replica (if necessary)

If the replica has timeline mismatches or corruption, force a rebuild:

# Delete the pod
kubectl delete pod -n <namespace> <pod-name>

# If still failing, delete the PVCs to force full rebuild
kubectl delete pvc -n <namespace> <pod-name>
kubectl delete pvc -n <namespace> <pod-name>-wal

CNPG will automatically:

  1. Create new PVCs
  2. Run a join job to bootstrap from the primary
  3. Start the new replica pod
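
You can follow the rebuild as it happens; a sketch, assuming the default CNPG labels (job names and labels may vary slightly between versions):

# Watch the join job and the replacement pod come up
kubectl get jobs,pods -n <namespace> -l cnpg.io/cluster=<cluster-name> -w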

Prevention

1. Properly Size WAL Volumes

In your Cluster spec:

spec:
  walStorage:
    size: 4Gi  # Increase if needed based on your WAL generation rate
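
If the WAL volume already exists, it can usually be grown in place, provided the storage class has allowVolumeExpansion enabled. A hedged example (8Gi is only illustrative), patching the size through the Cluster spec so CNPG reconciles the PVCs:

kubectl patch cluster <cluster-name> -n <namespace> --type merge \
  -p '{"spec": {"walStorage": {"size": "8Gi"}}}'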

2. Monitor Continuous Archiving

Check the archiving status regularly:

kubectl get cluster -n <namespace> <cluster-name> \
-o jsonpath='{.status.conditions[?(@.type=="ContinuousArchiving")]}'

Should show:

{
  "status": "True",
  "message": "Continuous archiving is working"
}

3. Configure Alerts

Add Prometheus alerts for:

  • WAL disk usage > 80%
  • Failed WAL archiving
  • Cluster not ready
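
A minimal PrometheusRule sketch for the first alert, assuming the Prometheus Operator is installed and kubelet volume stats are scraped; the PVC name pattern relies on CNPG's default <pod-name>-wal naming:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-wal-disk-usage
spec:
  groups:
    - name: cnpg-wal
      rules:
        - alert: CNPGWalVolumeAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*-wal"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*-wal"}
            > 0.80
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "WAL volume usage above 80% on {{ $labels.persistentvolumeclaim }}"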

4. Use Correct Retention Policies

spec:
  retentionPolicy: "30d" # Adjust based on your requirements
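
When backups go through the Barman Cloud plugin, the retention policy shown above normally lives on the ObjectStore resource rather than on the Cluster. A sketch, assuming the same my-minio-store object used earlier:

apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: my-minio-store
spec:
  retentionPolicy: "30d"
  configuration:
    destinationPath: s3://bucket/namespace/my-cluster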

Replication Lag

Symptoms

  • Cluster status shows high replication lag
  • Replica logs show: "requested WAL segment has already been removed"

Diagnosis

kubectl get cluster -n <namespace> <cluster-name> -o yaml | grep -A 10 replication
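
The cluster status only gives a coarse view; for per-replica lag, query pg_stat_replication on the primary. A sketch, assuming psql works as the postgres user inside the postgres container:

kubectl exec -n <namespace> <primary-pod> -c postgres -- \
  psql -U postgres -c \
  "SELECT application_name, state, replay_lsn, replay_lag FROM pg_stat_replication;"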

Solution

If a replica is too far behind and the primary has removed needed WAL segments:

  1. Delete the replica pod to trigger a rebuild
  2. Ensure wal_keep_size is appropriately configured in the cluster spec (see the sketch below)
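
A minimal sketch of point 2, setting wal_keep_size through the managed PostgreSQL parameters in the Cluster spec (1GB is only an example; size it against how far your replicas are allowed to fall behind):

spec:
  postgresql:
    parameters:
      wal_keep_size: "1GB"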

Failed Switchover/Failover

Symptoms

  • Cluster stuck in: "Primary instance is being restarted without a switchover"
  • Pods showing unhealthy readiness probes

Solution

# Force delete stuck primary
kubectl delete pod -n <namespace> <pod-name> --force --grace-period=0

# Wait for cluster to reconcile
kubectl wait --for=condition=Ready cluster/<cluster-name> -n <namespace> --timeout=5m
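
Once the cluster reconciles, confirm which instance is now primary. A quick check, assuming the current CNPG instance-role label (older releases use the plain role label instead):

kubectl get pods -n <namespace> \
  -l cnpg.io/cluster=<cluster-name>,cnpg.io/instanceRole=primary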

Useful Commands

Check cluster health

kubectl get cluster -n <namespace> <cluster-name> -o json | \
jq -r '.status | "Phase: \(.phase)\nInstances: \(.instances)\nReady: \(.readyInstances)\nPrimary: \(.currentPrimary)"'

View all instances status

kubectl get pods -n <namespace> -l cnpg.io/cluster=<cluster-name>

Check WAL archiving on primary

kubectl logs -n <namespace> <primary-pod> -c plugin-barman-cloud | \
grep -i archive | tail -20

Verify ObjectStore configuration

kubectl get objectstore -n <namespace> <store-name> -o yaml
