
Scenario 8: Data Corruption

Symptoms

  • Database queries returning inconsistent or unexpected results
  • Application errors about data integrity violations or constraint failures
  • PostgreSQL logs showing "invalid page header" or "checksum failure"
  • Gradual degradation of service over days or weeks
  • Files or database records with garbled content
  • Backup verification failures
  • Silent corruption discovered during audit or reporting
  • Users reporting missing or incorrect historical data

Impact Assessment

  • Recovery Time Objective (RTO): 4-8 hours
  • Recovery Point Objective (RPO): Variable (depends on when corruption started)
  • Data Loss Risk: Moderate to High (may lose data from corruption point forward)
  • Service Availability: Can operate degraded during investigation, full outage during restore
  • Detection Difficulty: High (corruption may be subtle and gradual)

Prerequisites

  • Access to multiple backup generations (daily, weekly, monthly)
  • CNPG PostgreSQL cluster with point-in-time recovery (PITR) capability
  • Velero backups with multiple retention points
  • Access to B2 offsite backups with version history
  • Test environment to validate old backups
  • Knowledge of application data models and expected data patterns
  • Monitoring tools to identify when corruption started

Recovery Procedure

Step 1: Identify Corruption Scope and Timeline

CRITICAL: Do NOT restore immediately. First understand the problem.

# Document current corruption state
kubectl get all -A > /tmp/corruption-state-$(date +%Y%m%d).txt
kubectl get pvc -A >> /tmp/corruption-state-$(date +%Y%m%d).txt

# Check database health
kubectl get clusters.postgresql.cnpg.io -A

# For each PostgreSQL cluster, check for corruption
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Rebuild indexes; REINDEX fails loudly if it hits corrupt index pages
-- (it takes locks and can be slow on large databases)
REINDEX DATABASE <database_name>;

-- Verify table integrity
SELECT relname, pg_relation_size(oid) as size
FROM pg_class
WHERE relkind = 'r'
ORDER BY size DESC
LIMIT 20;

-- Check for invalid data
SELECT COUNT(*) FROM <critical_table> WHERE <validation_condition>;

-- Look for orphaned records
SELECT * FROM <table> WHERE foreign_key NOT IN (SELECT id FROM parent_table);
EOF

# Check application logs for when errors started
kubectl -n <namespace> logs <app-pod> --since=168h | grep -i "error\|corrupt\|invalid"

# Check PostgreSQL logs for corruption indicators
kubectl -n <namespace> logs <postgres-pod> --since=168h | grep -i "corrupt\|checksum\|invalid page"
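
If data checksums are enabled on the cluster, PostgreSQL's own counters and the amcheck tooling can confirm corruption directly. A minimal sketch, assuming checksums are on and (for pg_amcheck) PostgreSQL 14+ with the amcheck extension available in the image:

# Check checksum failure counters (non-zero means corrupt pages were read)
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres \
  -c "SELECT datname, checksum_failures, checksum_last_failure FROM pg_stat_database;"

# Deep-verify heap and index consistency (read-only, can be slow on large databases)
kubectl -n <namespace> exec -it <postgres-pod> -- \
  pg_amcheck -U postgres -d <database_name> --install-missing --heapallindexed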

Identify when corruption started:

# Method 1: Check application metrics/logs
# Look for when error rates increased
# Access Prometheus/Grafana:
kubectl -n monitoring port-forward svc/prometheus 9090:9090
# Query: rate(application_errors[1h])
# Find timestamp when errors started

# Method 2: Check database modification times
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Find when suspicious data appeared
SELECT MIN(created_at), MAX(created_at)
FROM <table>
WHERE <suspicious_condition>;

-- Check for gaps in sequential data
-- (window functions and their aliases cannot be referenced in WHERE, so wrap them in a subquery)
SELECT id, created_at, previous_time, gap
FROM (
  SELECT id, created_at,
         LAG(created_at) OVER (ORDER BY created_at) AS previous_time,
         created_at - LAG(created_at) OVER (ORDER BY created_at) AS gap
  FROM <table>
) t
WHERE gap > interval '1 day' -- Unusual gaps might indicate corruption point
ORDER BY created_at DESC
LIMIT 100;
EOF

# Method 3: Check Velero backup history
velero backup get --output custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,STATUS:.status.phase

# Method 4: Check git commit history for config changes
cd /home/benjaminsanden/Dokument/Projects/homelab
git log --since="30 days ago" --oneline --all -- k8s/
# Look for database config changes that might have introduced corruption

Step 2: Test Historical Backups

Create test namespace to verify old backups without affecting production:

# Create isolated test namespace
kubectl create namespace corruption-test

# Test restoring from different backup dates
# Start with most recent, work backwards

# Test 1: Yesterday's backup
velero restore create test-restore-yesterday \
--from-backup daily-$(date -d "yesterday" +%Y%m%d)-020000 \
--include-namespaces <affected-namespace> \
--namespace-mappings <affected-namespace>:corruption-test

# Wait for restore
velero restore describe test-restore-yesterday

# Verify data in test namespace
kubectl -n corruption-test get pods
kubectl -n corruption-test exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Run data validation queries
SELECT COUNT(*) FROM <critical_table>;
SELECT * FROM <table> WHERE <validation_condition>;
-- Check for corruption indicators
EOF

# If still corrupt, try older backup
kubectl delete namespace corruption-test
kubectl create namespace corruption-test

# Test 2: Last week's backup
velero restore create test-restore-lastweek \
--from-backup weekly-$(date -d "7 days ago" +%Y%m%d)-020000 \
--include-namespaces <affected-namespace> \
--namespace-mappings <affected-namespace>:corruption-test

# Repeat validation
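
To avoid re-typing the checks for every backup date, a small helper can run the same validation against whichever restore currently sits in corruption-test. This is a sketch; the validate_restore name, the label selector, and the query are placeholders for your own checks:

# Hypothetical helper: validate the currently restored test cluster
validate_restore() {
  local pod
  pod=$(kubectl -n corruption-test get pods -l cnpg.io/cluster=<cluster-name> \
        -o jsonpath='{.items[0].metadata.name}')
  kubectl -n corruption-test exec "$pod" -- \
    psql -U postgres -c "SELECT COUNT(*) FROM <critical_table> WHERE <validation_condition>;"
}

# After each test restore completes, record the result for that backup date
validate_restore && echo "query ran - inspect output for corruption" || echo "validation failed"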

Step 3: Identify Last Known Good Backup

Document findings from backup testing:

# Create investigation log
cat > /tmp/corruption-investigation-$(date +%Y%m%d).md <<EOF
# Data Corruption Investigation

**Date**: $(date)
**Affected Namespace**: <namespace>
**Affected Database**: <cluster-name>

## Corruption Timeline
- First detected: <date/time>
- Likely started: <date/time> (based on logs/data analysis)
- Last known good: <date/time>

## Backup Testing Results
| Backup Date | Status | Notes |
|-------------|--------|-------|
| $(date -d "yesterday" +%Y-%m-%d) | CORRUPT | <details> |
| $(date -d "7 days ago" +%Y-%m-%d) | CORRUPT | <details> |
| $(date -d "14 days ago" +%Y-%m-%d) | CLEAN | ✓ Data validates |
| $(date -d "30 days ago" +%Y-%m-%d) | CLEAN | ✓ Data validates |

## Recommended Recovery Point
**Backup**: weekly-$(date -d "14 days ago" +%Y%m%d)-020000
**Data Loss**: ~14 days of data
**Justification**: <reasoning>

## Recovery Plan
1. <steps>
EOF

cat /tmp/corruption-investigation-$(date +%Y%m%d).md

Step 4: Assess Data Loss Impact

Determine what data will be lost:

# Compare clean backup vs current state
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Get row counts for critical tables
SELECT
'<table_name>' as table_name,
COUNT(*) as current_count,
(SELECT COUNT(*) FROM <table_name> WHERE created_at <= '<recovery_point_date>') as recovery_count,
COUNT(*) - (SELECT COUNT(*) FROM <table_name> WHERE created_at <= '<recovery_point_date>') as data_loss
FROM <table_name>;

-- List data created after recovery point (will be lost)
SELECT * FROM <table_name>
WHERE created_at > '<recovery_point_date>'
ORDER BY created_at DESC
LIMIT 100;

-- Export critical data created after recovery point
\copy (SELECT * FROM <table> WHERE created_at > '<recovery_point_date>') TO '/tmp/data-after-recovery-point.csv' CSV HEADER
EOF

# Copy exported data out of pod
kubectl -n <namespace> cp <postgres-pod>:/tmp/data-after-recovery-point.csv /tmp/data-after-recovery-point.csv

# Analyze what will be lost
wc -l /tmp/data-after-recovery-point.csv
head /tmp/data-after-recovery-point.csv
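
Before committing to a recovery point, it can also help to compare row counts between production and the clean restore sitting in corruption-test. A rough sketch (table names, pod names, and namespaces are placeholders):

# Compare row counts for critical tables: production vs clean test restore
for table in <table_one> <table_two>; do
  prod=$(kubectl -n <namespace> exec <postgres-pod> -- \
         psql -U postgres -tAc "SELECT COUNT(*) FROM $table;")
  restored=$(kubectl -n corruption-test exec <postgres-pod> -- \
             psql -U postgres -tAc "SELECT COUNT(*) FROM $table;")
  echo "$table: production=$prod restore=$restored diff=$((prod - restored))"
done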

Step 5: CNPG Point-in-Time Recovery

Restore PostgreSQL to a specific point in time before the corruption started:

# First, check available backups in B2
kubectl -n <namespace> get cluster <cluster-name> -o yaml

# View backup list from CNPG
kubectl -n <namespace> exec -it <cluster-name>-1 -- bash
# Inside the pod -- barman-cloud-backup-list needs both the source URL and the server name
# (CNPG uses the cluster name as the server name by default):
barman-cloud-backup-list \
  --endpoint-url https://s3.us-west-000.backblazeb2.com \
  s3://homelab-cnpg-b2/<namespace>/<cluster-name> <cluster-name>

# Exit pod and create recovery cluster spec
cat > /tmp/recovery-cluster.yaml <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <cluster-name>-recovery
  namespace: <namespace>
spec:
  instances: 2

  bootstrap:
    recovery:
      source: original-cluster
      recoveryTarget:
        # Option 1: Recover to a specific timestamp (recommended; include the timezone)
        targetTime: "2024-12-10 23:59:59+00"   # Last known good time

        # Option 2: Recover to specific transaction ID (if known)
        # targetXID: "1234567"

        # Option 3: Recover to named restore point (if created)
        # targetName: "before_corruption"

      # CNPG automatically selects the closest base backup before targetTime.
      # Alternatively, bootstrap from a specific Backup object instead of "source:":
      # backup:
      #   name: <backup-id>

  externalClusters:
    - name: original-cluster
      barmanObjectStore:
        destinationPath: s3://homelab-cnpg-b2/<namespace>/<cluster-name>
        endpointURL: https://s3.us-west-000.backblazeb2.com
        s3Credentials:
          accessKeyId:
            name: b2-cnpg-credentials
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: b2-cnpg-credentials
            key: AWS_SECRET_ACCESS_KEY
        wal:
          compression: gzip
          encryption: AES256
          maxParallel: 8

  storage:
    size: 50Gi   # Match or increase original size
    storageClass: longhorn

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"
      # Match original cluster parameters

  monitoring:
    enablePodMonitor: true
EOF

Apply recovery (creates new cluster):

# Apply recovery cluster (don't delete original yet!)
kubectl apply -f /tmp/recovery-cluster.yaml

# Monitor recovery progress
kubectl -n <namespace> get cluster <cluster-name>-recovery -w

# Watch recovery logs
kubectl -n <namespace> logs -l cnpg.io/cluster=<cluster-name>-recovery -c postgres -f

# Check recovery status
kubectl -n <namespace> get cluster <cluster-name>-recovery -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
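
If the cnpg kubectl plugin is installed, its status view is usually the quickest way to confirm the recovery target was reached and the new primary was promoted; otherwise a blocking wait on the Ready condition works too:

# Requires the kubectl-cnpg plugin
kubectl cnpg status <cluster-name>-recovery -n <namespace>

# Or block until the recovery cluster reports Ready
kubectl -n <namespace> wait --for=condition=Ready \
  cluster/<cluster-name>-recovery --timeout=30m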

Step 6: Validate Recovered Data

Verify the recovered database is clean:

# Connect to recovered cluster
kubectl -n <namespace> exec -it <cluster-name>-recovery-1 -- psql -U postgres

# Inside psql:
-- Check database is at correct recovery point
SELECT pg_postmaster_start_time();
SELECT pg_last_xact_replay_timestamp(); -- Should be close to target recovery time

-- Validate data integrity
SELECT COUNT(*) FROM <critical_table>;
-- Compare to expected count from investigation

-- Run application-specific validation queries
SELECT * FROM <table> WHERE <validation_condition>;

-- Check for corruption indicators
SELECT * FROM <table> WHERE <suspicious_pattern>;

-- Verify foreign key integrity
SELECT COUNT(*)
FROM <table> t
LEFT JOIN <parent_table> p ON t.parent_id = p.id
WHERE p.id IS NULL AND t.parent_id IS NOT NULL;
-- Should return 0

-- Check for data consistency (e.g. a column with more distinct values than expected)
SELECT COUNT(DISTINCT <column>) AS distinct_values
FROM <table>;
-- Compare against the expected range for that column

Test application functionality:

# Temporarily point application to recovery cluster
# Option 1: Update service selector
kubectl -n <namespace> patch service <postgres-service> \
-p '{"spec":{"selector":{"cnpg.io/cluster":"<cluster-name>-recovery"}}}'

# Option 2: Create test application instance
kubectl -n corruption-test create deployment test-app \
--image=<your-app-image> \
-- --db-host=<cluster-name>-recovery-rw.<namespace>.svc.cluster.local

# Test application reads/writes
kubectl -n <namespace> port-forward svc/<app-service> 8080:80
# Access http://localhost:8080 and test functionality

# Check application logs for errors
kubectl -n <namespace> logs deployment/test-app

Step 7: Cutover to Recovered Database

Once validated, switch production to recovered cluster:

# Step 1: Put applications in maintenance mode
kubectl -n <namespace> scale deployment <app> --replicas=0

# Step 2: Rename clusters (swap old and new)
# Delete old corrupted cluster
kubectl -n <namespace> delete cluster <cluster-name>

# Rename recovery cluster to original name
kubectl -n <namespace> get cluster <cluster-name>-recovery -o yaml > /tmp/cluster-rename.yaml

# Edit /tmp/cluster-rename.yaml:
# - Change metadata.name from "<cluster-name>-recovery" to "<cluster-name>"
# - Strip status and server-managed fields (resourceVersion, uid, creationTimestamp)
sed -i 's/<cluster-name>-recovery/<cluster-name>/g' /tmp/cluster-rename.yaml

# Delete recovery cluster and recreate with correct name
# NOTE: recreating the cluster re-runs the recovery bootstrap from the object store,
# so the base backup and WALs must still be available in B2
kubectl -n <namespace> delete cluster <cluster-name>-recovery
kubectl apply -f /tmp/cluster-rename.yaml

# Step 3: Verify services are updated
kubectl -n <namespace> get endpoints <postgres-service>
# Should point to new cluster pods

# Step 4: Bring applications back online
kubectl -n <namespace> scale deployment <app> --replicas=<original-count>

# Step 5: Monitor for issues
kubectl -n <namespace> get pods -w
kubectl -n <namespace> logs deployment/<app> -f
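
After cutover, also confirm that the recreated cluster has resumed WAL archiving to B2, since future point-in-time recovery depends on it. One way to check (the ContinuousArchiving condition name matches recent CNPG releases; adjust if your version differs):

# Should print "True" once the first WAL segments have been archived
kubectl -n <namespace> get cluster <cluster-name> \
  -o jsonpath='{.status.conditions[?(@.type=="ContinuousArchiving")].status}'

# And confirm scheduled/on-demand backups run against the new cluster
kubectl -n <namespace> get backups.postgresql.cnpg.io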

Step 8: Attempt Data Recovery from Corrupt Period

Try to salvage data created after the recovery point:

# If you exported data earlier (Step 4), attempt to reinsert clean data
kubectl -n <namespace> cp /tmp/data-after-recovery-point.csv <postgres-pod>:/tmp/

kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
-- Review exported data
\! head /tmp/data-after-recovery-point.csv

-- Direct import works only if the CSV contains no rows already present in the table
-- (COPY itself cannot skip duplicates):
-- COPY <table> FROM '/tmp/data-after-recovery-point.csv' CSV HEADER;

-- Preferred: import into a temp table, validate, then insert with ON CONFLICT
CREATE TEMP TABLE temp_import (LIKE <table>);
COPY temp_import FROM '/tmp/data-after-recovery-point.csv' CSV HEADER;

-- Validate temp data before inserting
SELECT * FROM temp_import WHERE <validation_checks>;

-- Insert only valid rows
INSERT INTO <table>
SELECT * FROM temp_import
WHERE <validation_conditions>
ON CONFLICT (id) DO NOTHING;

-- Verify
SELECT COUNT(*) FROM <table>;
EOF

Post-Recovery Tasks

1. Implement Corruption Detection

# Add database integrity checks as a CronJob (save as db-integrity-check.yaml)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-integrity-check
  namespace: <namespace>
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: integrity-check
              image: postgres:16
              env:
                - name: PGHOST
                  value: <cluster-name>-rw
                - name: PGDATABASE
                  value: <database>
                - name: PGUSER
                  valueFrom:
                    secretKeyRef:
                      name: <cluster-name>-app
                      key: username
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: <cluster-name>-app
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  # Run integrity checks
                  psql -c "SELECT * FROM pg_stat_database WHERE datname = '$PGDATABASE';"
                  psql -c "VACUUM ANALYZE;"

                  # Application-specific validation
                  psql -c "SELECT COUNT(*) FROM <critical_table>;"

                  # Check for orphaned records
                  psql -c "SELECT COUNT(*) FROM <table> t LEFT JOIN <parent> p ON t.parent_id = p.id WHERE p.id IS NULL;"

                  # Log results
                  echo "Integrity check completed at $(date)"
          restartPolicy: OnFailure

Apply the integrity check:

kubectl apply -f db-integrity-check.yaml

# Create alerting for check failures
# In Prometheus AlertManager:
cat >> /home/benjaminsanden/Dokument/Projects/homelab/k8s/monitoring/prometheus/alerts/database.yaml <<EOF
      # NOTE: indentation must match the existing rules list in this file
      - alert: DatabaseIntegrityCheckFailed
        expr: kube_job_status_failed{job_name=~"db-integrity-check.*"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database integrity check failed"
          description: "Database integrity check job failed in namespace {{ \$labels.namespace }}"
EOF
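
The check can also be run on demand (for example right after a recovery cutover) by creating a one-off Job from the CronJob:

# Trigger the integrity check immediately instead of waiting for the schedule
kubectl -n <namespace> create job --from=cronjob/db-integrity-check manual-integrity-check
kubectl -n <namespace> logs job/manual-integrity-check -f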

2. Enable PostgreSQL Checksums

# For new databases, enable checksums at creation
# Edit cluster spec:
cat > cluster-with-checksums.yaml <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <cluster-name>
  namespace: <namespace>
spec:
  instances: 2

  bootstrap:
    initdb:
      database: <database>
      owner: <owner>
      # Enable data checksums (this can only be done at initdb time)
      options:
        - "--data-checksums"

  postgresql:
    parameters:
      # Note: data_checksums is NOT a settable postgresql.conf parameter;
      # checksum failures are reported in the PostgreSQL log automatically
      log_checkpoints: "on"

  # ... rest of spec
EOF

# For existing databases, checksums can be enabled offline with pg_checksums --enable
# (the instance must be cleanly shut down) or by dump/restore into a new cluster
# initialized with --data-checksums. Note that a PITR recovery restores the original
# data files, so it does not change whether checksums are enabled.
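
To see whether a running cluster already has checksums enabled:

# data_checksums is a read-only setting; "on" means checksums are active
kubectl -n <namespace> exec -it <cluster-name>-1 -- psql -U postgres -c "SHOW data_checksums;"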

3. Implement Backup Verification

# Create backup validation script
cat > /home/benjaminsanden/Dokument/Projects/homelab/scripts/verify-backups.sh <<'EOF'
#!/bin/bash
set -e

NAMESPACE=$1
CLUSTER=$2

echo "Verifying backups for $NAMESPACE/$CLUSTER"

# Create test namespace (generate the name first so we don't have to grep it back)
TEST_NS=backup-verify-$(date +%s)
kubectl create namespace "$TEST_NS"

# Get latest backup
LATEST_BACKUP=$(kubectl -n $NAMESPACE get backup -l cnpg.io/cluster=$CLUSTER -o jsonpath='{.items[-1:].metadata.name}')

echo "Testing restore of backup: $LATEST_BACKUP"

# Create recovery cluster in test namespace
# (the b2-cnpg-credentials secret must also exist in $TEST_NS for the restore to work)
cat <<YAML | kubectl apply -f -
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: verify-$CLUSTER
  namespace: $TEST_NS
spec:
  instances: 1
  bootstrap:
    recovery:
      # Recovery from "source" restores the most recent base backup in the object
      # store; "backup:" and "source:" are alternative bootstrap methods, so the
      # LATEST_BACKUP name above is logged for reference only
      source: original
  externalClusters:
    - name: original
      barmanObjectStore:
        destinationPath: s3://homelab-cnpg-b2/$NAMESPACE/$CLUSTER
        endpointURL: https://s3.us-west-000.backblazeb2.com
        s3Credentials:
          accessKeyId:
            name: b2-cnpg-credentials
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: b2-cnpg-credentials
            key: AWS_SECRET_ACCESS_KEY
  storage:
    size: 10Gi
YAML

# Wait for recovery
kubectl -n $TEST_NS wait --for=condition=Ready cluster/verify-$CLUSTER --timeout=600s

# Validate data
echo "Running validation queries..."
kubectl -n $TEST_NS exec verify-$CLUSTER-1 -- psql -U postgres -c "SELECT COUNT(*) FROM pg_database;"

# Cleanup
kubectl delete namespace $TEST_NS

echo "Backup verification completed successfully!"
EOF

chmod +x /home/benjaminsanden/Dokument/Projects/homelab/scripts/verify-backups.sh

# Schedule monthly verification
# Add to crontab:
# 0 3 1 * * /home/benjaminsanden/Dokument/Projects/homelab/scripts/verify-backups.sh database postgres-cluster

4. Document Corruption Event

cat > /home/benjaminsanden/Dokument/Projects/homelab/docs/incidents/data-corruption-$(date +%Y%m%d).md <<EOF
# Data Corruption Incident

**Date Detected**: $(date)
**Corruption Start**: <estimated-date>
**Affected Database**: <cluster-name>
**Namespace**: <namespace>
**Recovery Completed**: $(date)

## Summary
<Brief description of what happened>

## Timeline
- **T-Xd**: Corruption likely started (based on logs)
- **T+0h**: Corruption detected - <how discovered>
- **T+2h**: Investigation completed, recovery point identified
- **T+4h**: Recovery cluster created
- **T+6h**: Data validation completed
- **T+8h**: Cutover to recovered database
- **T+10h**: Post-recovery verification completed

## Impact
- **Data Loss**: ~X days of data (from <date> to <date>)
- **Downtime**: X hours
- **Records Lost**: ~X records across Y tables
- **Users Affected**: <count/description>

## Root Cause
<What caused the corruption>
- Hardware issue (disk failure, memory corruption)?
- Software bug (application, database, Kubernetes)?
- Configuration error?
- Unknown

## Recovery Process
1. Identified corruption timeline through log analysis
2. Tested backups from multiple dates
3. Found clean backup from <date>
4. Performed CNPG point-in-time recovery to <timestamp>
5. Validated recovered data
6. Cutover production to recovered cluster
7. Attempted to salvage data from corrupt period

## Data Recovered
- <details on what data was successfully recovered>

## Data Lost
- <details on what data could not be recovered>

## Prevention Measures Implemented
- [ ] Enabled PostgreSQL checksums
- [ ] Implemented daily integrity checks
- [ ] Added backup verification automation
- [ ] Increased backup retention
- [ ] Enhanced monitoring and alerting

## Lessons Learned
<What we learned from this incident>

## Action Items
- [ ] <specific tasks to prevent recurrence>
EOF

Troubleshooting

Cannot Identify Corruption Start Time

# If logs are insufficient, use binary search approach
# Test backups at increasing intervals

# Test today, 1 week ago, 2 weeks ago, 1 month ago, etc.
for days_ago in 1 7 14 30 60 90; do
echo "Testing backup from $days_ago days ago"
# Create test restore and validate
# Document results
done

# Once you find a range (e.g., corrupt between 14-30 days ago)
# Test intermediate points to narrow down
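
A fleshed-out version of that loop, reusing the naming and namespace-mapping conventions from Step 2 (backup names and the validation query are placeholders for your setup):

for days_ago in 1 7 14 30 60 90; do
  backup="daily-$(date -d "$days_ago days ago" +%Y%m%d)-020000"
  echo "Testing backup $backup"

  kubectl create namespace corruption-test --dry-run=client -o yaml | kubectl apply -f -
  velero restore create "test-restore-$days_ago" \
    --from-backup "$backup" \
    --include-namespaces <affected-namespace> \
    --namespace-mappings <affected-namespace>:corruption-test \
    --wait

  # Run your validation query and record CLEAN / CORRUPT for this date
  kubectl -n corruption-test exec <postgres-pod> -- \
    psql -U postgres -tAc "SELECT COUNT(*) FROM <critical_table> WHERE <validation_condition>;"

  kubectl delete namespace corruption-test --wait
done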

All Recent Backups Are Corrupt

# Extend search to older backups
velero backup get | tail -50

# Check B2 for older backups with versioning
b2 ls --recursive --versions b2://homelab-cnpg-b2/<namespace>/<cluster>/

# Consider GFS retention: monthly backups may be clean
velero backup get --selector backup-type=monthly

# If all Velero backups corrupt, check CNPG backups
# They may have longer retention

Recovery Cluster Won't Start

# Check recovery logs
kubectl -n <namespace> logs <cluster-name>-recovery-1 -c postgres

# Common issues:
# 1. Target time is too far back (beyond WAL retention)
# Solution: Use older base backup or different target time

# 2. Backup files are also corrupt
# Solution: Try different backup or check B2 versions

# 3. Insufficient resources
kubectl -n <namespace> describe pod <cluster-name>-recovery-1
# Solution: Increase storage size or node resources

# 4. Configuration mismatch
kubectl -n <namespace> get cluster <cluster-name>-recovery -o yaml
# Solution: Match parameters with original cluster

Recovered Data Still Shows Corruption

# You may have chosen recovery point after corruption started
# Try earlier recovery point

# Update recovery cluster spec with earlier time
kubectl -n <namespace> edit cluster <cluster-name>-recovery
# Change targetTime to earlier timestamp

# Delete pods to trigger new recovery
kubectl -n <namespace> delete pod -l cnpg.io/cluster=<cluster-name>-recovery

Reference