
Scenario 8: Data Corruption

Symptoms

  • Database queries returning inconsistent or unexpected results
  • Application errors about data integrity violations or constraint failures
  • PostgreSQL logs showing "invalid page header" or "checksum failure"
  • Gradual degradation of service over days or weeks
  • Files or database records with garbled content
  • Backup verification failures
  • Silent corruption discovered during audit or reporting
  • Users reporting missing or incorrect historical data

Impact Assessment

  • Recovery Time Objective (RTO): 4-8 hours
  • Recovery Point Objective (RPO): Variable (depends on when corruption started)
  • Data Loss Risk: Moderate to High (may lose data from corruption point forward)
  • Service Availability: Can operate degraded during investigation, full outage during restore
  • Detection Difficulty: High (corruption may be subtle and gradual)

Prerequisites

  • Access to multiple backup generations (daily, weekly, monthly)
  • CNPG PostgreSQL cluster with point-in-time recovery (PITR) capability
  • Velero backups with multiple retention points
  • Access to B2 offsite backups with version history
  • Test environment to validate old backups
  • Knowledge of application data models and expected data patterns
  • Monitoring tools to identify when corruption started

Recovery Procedure

Step 1: Identify Corruption Scope and Timeline

CRITICAL: Do NOT restore immediately. First understand the problem.

# Document current corruption state
kubectl get all -A > /tmp/corruption-state-$(date +%Y%m%d).txt
kubectl get pvc -A >> /tmp/corruption-state-$(date +%Y%m%d).txt

# Check database health
kubectl get clusters.postgresql.cnpg.io -A

# For each PostgreSQL cluster, check for corruption
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Rebuild indexes; REINDEX fails loudly if it hits corrupt index pages
-- (it takes locks and can be slow on large databases)
REINDEX DATABASE <database_name>;

-- Verify table integrity
SELECT relname, pg_relation_size(oid) as size
FROM pg_class
WHERE relkind = 'r'
ORDER BY size DESC
LIMIT 20;

-- Check for invalid data
SELECT COUNT(*) FROM <critical_table> WHERE <validation_condition>;

-- Look for orphaned records
SELECT * FROM <table> WHERE foreign_key NOT IN (SELECT id FROM parent_table);
EOF

# Check application logs for when errors started
kubectl -n <namespace> logs <app-pod> --since=168h | grep -i "error\|corrupt\|invalid"

# Check PostgreSQL logs for corruption indicators
kubectl -n <namespace> logs <postgres-pod> --since=168h | grep -i "corrupt\|checksum\|invalid page"
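
If data checksums are enabled on the cluster, PostgreSQL's own counters and the amcheck tooling can confirm corruption directly. A minimal sketch, assuming checksums are on and (for pg_amcheck) PostgreSQL 14+ with the amcheck extension available in the image:

# Check checksum failure counters (non-zero means corrupt pages were read)
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres \
  -c "SELECT datname, checksum_failures, checksum_last_failure FROM pg_stat_database;"

# Deep-verify heap and index consistency (read-only, can be slow on large databases)
kubectl -n <namespace> exec -it <postgres-pod> -- \
  pg_amcheck -U postgres -d <database_name> --install-missing --heapallindexed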

Identify when corruption started:

# Method 1: Check application metrics/logs
# Look for when error rates increased
# Access Prometheus/Grafana:
kubectl -n monitoring port-forward svc/prometheus 9090:9090
# Query: rate(application_errors[1h])
# Find timestamp when errors started

# Method 2: Check database modification times
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Find when suspicious data appeared
SELECT MIN(created_at), MAX(created_at)
FROM <table>
WHERE <suspicious_condition>;

-- Check for gaps in sequential data
-- (window functions and their aliases cannot be referenced in WHERE, so wrap them in a subquery)
SELECT id, created_at, previous_time, gap
FROM (
  SELECT id, created_at,
         LAG(created_at) OVER (ORDER BY created_at) AS previous_time,
         created_at - LAG(created_at) OVER (ORDER BY created_at) AS gap
  FROM <table>
) t
WHERE gap > interval '1 day' -- Unusual gaps might indicate corruption point
ORDER BY created_at DESC
LIMIT 100;
EOF

# Method 3: Check Velero backup history
velero backup get --output custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,STATUS:.status.phase

# Method 4: Check git commit history for config changes
cd /home/benjaminsanden/Dokument/Projects/homelab
git log --since="30 days ago" --oneline --all -- k8s/
# Look for database config changes that might have introduced corruption

Step 2: Test Historical Backups

Create test namespace to verify old backups without affecting production:

# Create isolated test namespace
kubectl create namespace corruption-test

# Test restoring from different backup dates
# Start with most recent, work backwards

# Test 1: Yesterday's backup
velero restore create test-restore-yesterday \
--from-backup daily-$(date -d "yesterday" +%Y%m%d)-020000 \
--include-namespaces <affected-namespace> \
--namespace-mappings <affected-namespace>:corruption-test

# Wait for restore
velero restore describe test-restore-yesterday

# Verify data in test namespace
kubectl -n corruption-test get pods
kubectl -n corruption-test exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Run data validation queries
SELECT COUNT(*) FROM <critical_table>;
SELECT * FROM <table> WHERE <validation_condition>;
-- Check for corruption indicators
EOF

# If still corrupt, try older backup
kubectl delete namespace corruption-test
kubectl create namespace corruption-test

# Test 2: Last week's backup
velero restore create test-restore-lastweek \
--from-backup weekly-$(date -d "7 days ago" +%Y%m%d)-020000 \
--include-namespaces <affected-namespace> \
--namespace-mappings <affected-namespace>:corruption-test

# Repeat validation
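
To avoid re-typing the checks for every backup date, a small helper can run the same validation against whichever restore currently sits in corruption-test. This is a sketch; the validate_restore name, the label selector, and the query are placeholders for your own checks:

# Hypothetical helper: validate the currently restored test cluster
validate_restore() {
  local pod
  pod=$(kubectl -n corruption-test get pods -l cnpg.io/cluster=<cluster-name> \
        -o jsonpath='{.items[0].metadata.name}')
  kubectl -n corruption-test exec "$pod" -- \
    psql -U postgres -c "SELECT COUNT(*) FROM <critical_table> WHERE <validation_condition>;"
}

# After each test restore completes, record the result for that backup date
validate_restore && echo "query ran - inspect output for corruption" || echo "validation failed"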

Step 3: Identify Last Known Good Backup

Document findings from backup testing:

# Create investigation log
cat > /tmp/corruption-investigation-$(date +%Y%m%d).md <<EOF
# Data Corruption Investigation

**Date**: $(date)
**Affected Namespace**: <namespace>
**Affected Database**: <cluster-name>

## Corruption Timeline
- First detected: <date/time>
- Likely started: <date/time> (based on logs/data analysis)
- Last known good: <date/time>

## Backup Testing Results
| Backup Date | Status | Notes |
|-------------|--------|-------|
| $(date -d "yesterday" +%Y-%m-%d) | CORRUPT | <details> |
| $(date -d "7 days ago" +%Y-%m-%d) | CORRUPT | <details> |
| $(date -d "14 days ago" +%Y-%m-%d) | CLEAN | ✓ Data validates |
| $(date -d "30 days ago" +%Y-%m-%d) | CLEAN | ✓ Data validates |

## Recommended Recovery Point
**Backup**: weekly-$(date -d "14 days ago" +%Y%m%d)-020000
**Data Loss**: ~14 days of data
**Justification**: <reasoning>

## Recovery Plan
1. <steps>
EOF

cat /tmp/corruption-investigation-$(date +%Y%m%d).md

Step 4: Assess Data Loss Impact

Determine what data will be lost:

# Compare clean backup vs current state
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Get row counts for critical tables
SELECT
'<table_name>' as table_name,
COUNT(*) as current_count,
(SELECT COUNT(*) FROM <table_name> WHERE created_at <= '<recovery_point_date>') as recovery_count,
COUNT(*) - (SELECT COUNT(*) FROM <table_name> WHERE created_at <= '<recovery_point_date>') as data_loss
FROM <table_name>;

-- List data created after recovery point (will be lost)
SELECT * FROM <table_name>
WHERE created_at > '<recovery_point_date>'
ORDER BY created_at DESC
LIMIT 100;

-- Export critical data created after recovery point
\copy (SELECT * FROM <table> WHERE created_at > '<recovery_point_date>') TO '/tmp/data-after-recovery-point.csv' CSV HEADER
EOF

# Copy exported data out of pod
kubectl -n <namespace> cp <postgres-pod>:/tmp/data-after-recovery-point.csv /tmp/data-after-recovery-point.csv

# Analyze what will be lost
wc -l /tmp/data-after-recovery-point.csv
head /tmp/data-after-recovery-point.csv
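
Before committing to a recovery point, it can also help to compare row counts between production and the clean restore sitting in corruption-test. A rough sketch (table names, pod names, and namespaces are placeholders):

# Compare row counts for critical tables: production vs clean test restore
for table in <table_one> <table_two>; do
  prod=$(kubectl -n <namespace> exec <postgres-pod> -- \
         psql -U postgres -tAc "SELECT COUNT(*) FROM $table;")
  restored=$(kubectl -n corruption-test exec <postgres-pod> -- \
             psql -U postgres -tAc "SELECT COUNT(*) FROM $table;")
  echo "$table: production=$prod restore=$restored diff=$((prod - restored))"
done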

Step 5: CNPG Point-in-Time Recovery

Restore PostgreSQL to a specific point in time before the corruption started:

# First, check available backups in B2
kubectl -n <namespace> get cluster <cluster-name> -o yaml

# View backup list from CNPG
kubectl -n <namespace> exec -it <cluster-name>-1 -- bash
# Inside the pod -- barman-cloud-backup-list needs both the source URL and the server name
# (CNPG uses the cluster name as the server name by default):
barman-cloud-backup-list \
  --endpoint-url https://s3.us-west-000.backblazeb2.com \
  s3://homelab-cnpg-b2/<namespace>/<cluster-name> <cluster-name>

# Exit pod and create recovery cluster spec
cat > /tmp/recovery-cluster.yaml <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <cluster-name>-recovery
  namespace: <namespace>
spec:
  instances: 2

  bootstrap:
    recovery:
      source: original-cluster
      recoveryTarget:
        # Option 1: Recover to a specific timestamp (recommended; include the timezone)
        targetTime: "2024-12-10 23:59:59+00"   # Last known good time

        # Option 2: Recover to specific transaction ID (if known)
        # targetXID: "1234567"

        # Option 3: Recover to named restore point (if created)
        # targetName: "before_corruption"

      # CNPG automatically selects the closest base backup before targetTime.
      # Alternatively, bootstrap from a specific Backup object instead of "source:":
      # backup:
      #   name: <backup-id>

  externalClusters:
    - name: original-cluster
      barmanObjectStore:
        destinationPath: s3://homelab-cnpg-b2/<namespace>/<cluster-name>
        endpointURL: https://s3.us-west-000.backblazeb2.com
        s3Credentials:
          accessKeyId:
            name: b2-cnpg-credentials
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: b2-cnpg-credentials
            key: AWS_SECRET_ACCESS_KEY
        wal:
          compression: gzip
          encryption: AES256
          maxParallel: 8

  storage:
    size: 50Gi   # Match or increase original size
    storageClass: longhorn

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"
      # Match original cluster parameters

  monitoring:
    enablePodMonitor: true
EOF

Apply recovery (creates new cluster):

# Apply recovery cluster (don't delete original yet!)
kubectl apply -f /tmp/recovery-cluster.yaml

# Monitor recovery progress
kubectl -n <namespace> get cluster <cluster-name>-recovery -w

# Watch recovery logs
kubectl -n <namespace> logs -l cnpg.io/cluster=<cluster-name>-recovery -c postgres -f

# Check recovery status
kubectl -n <namespace> get cluster <cluster-name>-recovery -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
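
If the cnpg kubectl plugin is installed, its status view is usually the quickest way to confirm the recovery target was reached and the new primary was promoted; otherwise a blocking wait on the Ready condition works too:

# Requires the kubectl-cnpg plugin
kubectl cnpg status <cluster-name>-recovery -n <namespace>

# Or block until the recovery cluster reports Ready
kubectl -n <namespace> wait --for=condition=Ready \
  cluster/<cluster-name>-recovery --timeout=30m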

Step 6: Validate Recovered Data

Verify the recovered database is clean:

# Connect to recovered cluster
kubectl -n <namespace> exec -it <cluster-name>-recovery-1 -- psql -U postgres

# Inside psql:
-- Check database is at correct recovery point
SELECT pg_postmaster_start_time();
SELECT pg_last_xact_replay_timestamp(); -- Should be close to target recovery time

-- Validate data integrity
SELECT COUNT(*) FROM <critical_table>;
-- Compare to expected count from investigation

-- Run application-specific validation queries
SELECT * FROM <table> WHERE <validation_condition>;

-- Check for corruption indicators
SELECT * FROM <table> WHERE <suspicious_pattern>;

-- Verify foreign key integrity
SELECT COUNT(*)
FROM <table> t
LEFT JOIN <parent_table> p ON t.parent_id = p.id
WHERE p.id IS NULL AND t.parent_id IS NOT NULL;
-- Should return 0

-- Check for data consistency (e.g. a column with more distinct values than expected)
SELECT COUNT(DISTINCT <column>) AS distinct_values
FROM <table>;
-- Compare against the expected range for that column

Test application functionality:

# Temporarily point application to recovery cluster
# Option 1: Update service selector
kubectl -n <namespace> patch service <postgres-service> \
-p '{"spec":{"selector":{"cnpg.io/cluster":"<cluster-name>-recovery"}}}'

# Option 2: Create test application instance
kubectl -n corruption-test create deployment test-app \
--image=<your-app-image> \
-- --db-host=<cluster-name>-recovery-rw.<namespace>.svc.cluster.local

# Test application reads/writes
kubectl -n <namespace> port-forward svc/<app-service> 8080:80
# Access http://localhost:8080 and test functionality

# Check application logs for errors
kubectl -n <namespace> logs deployment/test-app

Step 7: Cutover to Recovered Database

Once validated, switch production to recovered cluster:

# Step 1: Put applications in maintenance mode
kubectl -n <namespace> scale deployment <app> --replicas=0

# Step 2: Rename clusters (swap old and new)
# Delete old corrupted cluster
kubectl -n <namespace> delete cluster <cluster-name>

# Rename recovery cluster to original name
kubectl -n <namespace> get cluster <cluster-name>-recovery -o yaml > /tmp/cluster-rename.yaml

# Edit /tmp/cluster-rename.yaml:
# - Change metadata.name from "<cluster-name>-recovery" to "<cluster-name>"
# - Strip status and server-managed fields (resourceVersion, uid, creationTimestamp)
sed -i 's/<cluster-name>-recovery/<cluster-name>/g' /tmp/cluster-rename.yaml

# Delete recovery cluster and recreate with correct name
# NOTE: recreating the cluster re-runs the recovery bootstrap from the object store,
# so the base backup and WALs must still be available in B2
kubectl -n <namespace> delete cluster <cluster-name>-recovery
kubectl apply -f /tmp/cluster-rename.yaml

# Step 3: Verify services are updated
kubectl -n <namespace> get endpoints <postgres-service>
# Should point to new cluster pods

# Step 4: Bring applications back online
kubectl -n <namespace> scale deployment <app> --replicas=<original-count>

# Step 5: Monitor for issues
kubectl -n <namespace> get pods -w
kubectl -n <namespace> logs deployment/<app> -f
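
After cutover, also confirm that the recreated cluster has resumed WAL archiving to B2, since future point-in-time recovery depends on it. One way to check (the ContinuousArchiving condition name matches recent CNPG releases; adjust if your version differs):

# Should print "True" once the first WAL segments have been archived
kubectl -n <namespace> get cluster <cluster-name> \
  -o jsonpath='{.status.conditions[?(@.type=="ContinuousArchiving")].status}'

# And confirm scheduled/on-demand backups run against the new cluster
kubectl -n <namespace> get backups.postgresql.cnpg.io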

Step 8: Attempt Data Recovery from Corrupt Period

Try to salvage data created after the recovery point:

# If you exported data earlier (Step 4), attempt to reinsert clean data
kubectl -n <namespace> cp /tmp/data-after-recovery-point.csv <postgres-pod>:/tmp/

kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Review exported data
\! head /tmp/data-after-recovery-point.csv

-- Direct import works only if the CSV contains no rows already present in the table
-- (COPY itself cannot skip duplicates):
-- COPY <table> FROM '/tmp/data-after-recovery-point.csv' CSV HEADER;

-- Preferred: import into a temp table, validate, then insert with ON CONFLICT
CREATE TEMP TABLE temp_import (LIKE <table>);
COPY temp_import FROM '/tmp/data-after-recovery-point.csv' CSV HEADER;

-- Validate temp data before inserting
SELECT * FROM temp_import WHERE <validation_checks>;

-- Insert only valid rows
INSERT INTO <table>
SELECT * FROM temp_import
WHERE <validation_conditions>
ON CONFLICT (id) DO NOTHING;

-- Verify
SELECT COUNT(*) FROM <table>;
EOF

Post-Recovery Tasks

1. Implement Corruption Detection

# Add database integrity checks as a CronJob (save as db-integrity-check.yaml)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-integrity-check
  namespace: <namespace>
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: integrity-check
              image: postgres:16
              env:
                - name: PGHOST
                  value: <cluster-name>-rw
                - name: PGDATABASE
                  value: <database>
                - name: PGUSER
                  valueFrom:
                    secretKeyRef:
                      name: <cluster-name>-app
                      key: username
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: <cluster-name>-app
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  # Run integrity checks
                  psql -c "SELECT * FROM pg_stat_database WHERE datname = '$PGDATABASE';"
                  psql -c "VACUUM ANALYZE;"

                  # Application-specific validation
                  psql -c "SELECT COUNT(*) FROM <critical_table>;"

                  # Check for orphaned records
                  psql -c "SELECT COUNT(*) FROM <table> t LEFT JOIN <parent> p ON t.parent_id = p.id WHERE p.id IS NULL;"

                  # Log results
                  echo "Integrity check completed at $(date)"
          restartPolicy: OnFailure

Apply the integrity check:

kubectl apply -f db-integrity-check.yaml

# Create alerting for check failures
# In Prometheus AlertManager:
cat >> /home/benjaminsanden/Dokument/Projects/homelab/k8s/monitoring/prometheus/alerts/database.yaml <<EOF
      # NOTE: indentation must match the existing rules list in this file
      - alert: DatabaseIntegrityCheckFailed
        expr: kube_job_status_failed{job_name=~"db-integrity-check.*"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database integrity check failed"
          description: "Database integrity check job failed in namespace {{ \$labels.namespace }}"
EOF
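
The check can also be run on demand (for example right after a recovery cutover) by creating a one-off Job from the CronJob:

# Trigger the integrity check immediately instead of waiting for the schedule
kubectl -n <namespace> create job --from=cronjob/db-integrity-check manual-integrity-check
kubectl -n <namespace> logs job/manual-integrity-check -f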

2. Enable PostgreSQL Checksums

# For new databases, enable checksums at creation
# Edit cluster spec:
cat > cluster-with-checksums.yaml <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <cluster-name>
  namespace: <namespace>
spec:
  instances: 2

  bootstrap:
    initdb:
      database: <database>
      owner: <owner>
      # Enable data checksums (this can only be done at initdb time)
      options:
        - "--data-checksums"

  postgresql:
    parameters:
      # Note: data_checksums is NOT a settable postgresql.conf parameter;
      # checksum failures are reported in the PostgreSQL log automatically
      log_checkpoints: "on"

  # ... rest of spec
EOF

# For existing databases, checksums can be enabled offline with pg_checksums --enable
# (the instance must be cleanly shut down) or by dump/restore into a new cluster
# initialized with --data-checksums. Note that a PITR recovery restores the original
# data files, so it does not change whether checksums are enabled.
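
To see whether a running cluster already has checksums enabled:

# data_checksums is a read-only setting; "on" means checksums are active
kubectl -n <namespace> exec -it <cluster-name>-1 -- psql -U postgres -c "SHOW data_checksums;"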

3. Implement Backup Verification

# Create backup validation script
cat > /home/benjaminsanden/Dokument/Projects/homelab/scripts/verify-backups.sh <<'EOF'
#!/bin/bash
set -e

NAMESPACE=$1
CLUSTER=$2

echo "Verifying backups for $NAMESPACE/$CLUSTER"

# Create test namespace (generate the name first so we don't have to grep it back)
TEST_NS=backup-verify-$(date +%s)
kubectl create namespace "$TEST_NS"

# Get latest backup
LATEST_BACKUP=$(kubectl -n $NAMESPACE get backup -l cnpg.io/cluster=$CLUSTER -o jsonpath='{.items[-1:].metadata.name}')

echo "Testing restore of backup: $LATEST_BACKUP"

# Create recovery cluster in test namespace
# (the b2-cnpg-credentials secret must also exist in $TEST_NS for the restore to work)
cat <<YAML | kubectl apply -f -
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: verify-$CLUSTER
  namespace: $TEST_NS
spec:
  instances: 1
  bootstrap:
    recovery:
      # Recovery from "source" restores the most recent base backup in the object
      # store; "backup:" and "source:" are alternative bootstrap methods, so the
      # LATEST_BACKUP name above is logged for reference only
      source: original
  externalClusters:
    - name: original
      barmanObjectStore:
        destinationPath: s3://homelab-cnpg-b2/$NAMESPACE/$CLUSTER
        endpointURL: https://s3.us-west-000.backblazeb2.com
        s3Credentials:
          accessKeyId:
            name: b2-cnpg-credentials
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: b2-cnpg-credentials
            key: AWS_SECRET_ACCESS_KEY
  storage:
    size: 10Gi
YAML

# Wait for recovery
kubectl -n $TEST_NS wait --for=condition=Ready cluster/verify-$CLUSTER --timeout=600s

# Validate data
echo "Running validation queries..."
kubectl -n $TEST_NS exec verify-$CLUSTER-1 -- psql -U postgres -c "SELECT COUNT(*) FROM pg_database;"

# Cleanup
kubectl delete namespace $TEST_NS

echo "Backup verification completed successfully!"
EOF

chmod +x /home/benjaminsanden/Dokument/Projects/homelab/scripts/verify-backups.sh

# Schedule monthly verification
# Add to crontab:
# 0 3 1 * * /home/benjaminsanden/Dokument/Projects/homelab/scripts/verify-backups.sh database postgres-cluster

4. Document Corruption Event

cat > /home/benjaminsanden/Dokument/Projects/homelab/docs/incidents/data-corruption-$(date +%Y%m%d).md <<EOF
# Data Corruption Incident

**Date Detected**: $(date)
**Corruption Start**: <estimated-date>
**Affected Database**: <cluster-name>
**Namespace**: <namespace>
**Recovery Completed**: $(date)

## Summary
<Brief description of what happened>

## Timeline
- **T-Xd**: Corruption likely started (based on logs)
- **T+0h**: Corruption detected - <how discovered>
- **T+2h**: Investigation completed, recovery point identified
- **T+4h**: Recovery cluster created
- **T+6h**: Data validation completed
- **T+8h**: Cutover to recovered database
- **T+10h**: Post-recovery verification completed

## Impact
- **Data Loss**: ~X days of data (from <date> to <date>)
- **Downtime**: X hours
- **Records Lost**: ~X records across Y tables
- **Users Affected**: <count/description>

## Root Cause
<What caused the corruption>
- Hardware issue (disk failure, memory corruption)?
- Software bug (application, database, Kubernetes)?
- Configuration error?
- Unknown

## Recovery Process
1. Identified corruption timeline through log analysis
2. Tested backups from multiple dates
3. Found clean backup from <date>
4. Performed CNPG point-in-time recovery to <timestamp>
5. Validated recovered data
6. Cutover production to recovered cluster
7. Attempted to salvage data from corrupt period

## Data Recovered
- <details on what data was successfully recovered>

## Data Lost
- <details on what data could not be recovered>

## Prevention Measures Implemented
- [ ] Enabled PostgreSQL checksums
- [ ] Implemented daily integrity checks
- [ ] Added backup verification automation
- [ ] Increased backup retention
- [ ] Enhanced monitoring and alerting

## Lessons Learned
<What we learned from this incident>

## Action Items
- [ ] <specific tasks to prevent recurrence>
EOF

Troubleshooting

Cannot Identify Corruption Start Time

# If logs are insufficient, use binary search approach
# Test backups at increasing intervals

# Test today, 1 week ago, 2 weeks ago, 1 month ago, etc.
for days_ago in 1 7 14 30 60 90; do
echo "Testing backup from $days_ago days ago"
# Create test restore and validate
# Document results
done

# Once you find a range (e.g., corrupt between 14-30 days ago)
# Test intermediate points to narrow down
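
A fleshed-out version of that loop, reusing the naming and namespace-mapping conventions from Step 2 (backup names and the validation query are placeholders for your setup):

for days_ago in 1 7 14 30 60 90; do
  backup="daily-$(date -d "$days_ago days ago" +%Y%m%d)-020000"
  echo "Testing backup $backup"

  kubectl create namespace corruption-test --dry-run=client -o yaml | kubectl apply -f -
  velero restore create "test-restore-$days_ago" \
    --from-backup "$backup" \
    --include-namespaces <affected-namespace> \
    --namespace-mappings <affected-namespace>:corruption-test \
    --wait

  # Run your validation query and record CLEAN / CORRUPT for this date
  kubectl -n corruption-test exec <postgres-pod> -- \
    psql -U postgres -tAc "SELECT COUNT(*) FROM <critical_table> WHERE <validation_condition>;"

  kubectl delete namespace corruption-test --wait
done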

All Recent Backups Are Corrupt

# Extend search to older backups
velero backup get | tail -50

# Check B2 for older backups with versioning
b2 ls --recursive --versions b2://homelab-cnpg-b2/<namespace>/<cluster>/

# Consider GFS retention: monthly backups may be clean
velero backup get --selector backup-type=monthly

# If all Velero backups corrupt, check CNPG backups
# They may have longer retention

Recovery Cluster Won't Start

# Check recovery logs
kubectl -n <namespace> logs <cluster-name>-recovery-1 -c postgres

# Common issues:
# 1. Target time is too far back (beyond WAL retention)
# Solution: Use older base backup or different target time

# 2. Backup files are also corrupt
# Solution: Try different backup or check B2 versions

# 3. Insufficient resources
kubectl -n <namespace> describe pod <cluster-name>-recovery-1
# Solution: Increase storage size or node resources

# 4. Configuration mismatch
kubectl -n <namespace> get cluster <cluster-name>-recovery -o yaml
# Solution: Match parameters with original cluster

Recovered Data Still Shows Corruption

# You may have chosen recovery point after corruption started
# Try earlier recovery point

# Update recovery cluster spec with earlier time
kubectl -n <namespace> edit cluster <cluster-name>-recovery
# Change targetTime to earlier timestamp

# Delete pods to trigger new recovery
kubectl -n <namespace> delete pod -l cnpg.io/cluster=<cluster-name>-recovery

Reference