Longhorn Backup Strategy
Overview
This document outlines the comprehensive backup strategy for the homelab Kubernetes cluster using Longhorn distributed storage. The strategy implements a tiered backup approach with different frequencies and retention policies based on data criticality and recovery requirements.
Context
The homelab is a GitOps-managed Kubernetes cluster running on Talos Linux with Argo CD for continuous deployment. Storage is provided by Longhorn, a distributed block storage system that supports automated backups to S3-compatible storage. The backup system uses label-based recurring jobs to automatically back up PersistentVolumeClaims (PVCs) based on their assigned tier.
Key Components
- Longhorn: Distributed storage with built-in backup capabilities
- Recurring Jobs: Automated backup schedules triggered by PVC labels
- Backup Tiers: Three tiers (GFS, Daily, Weekly) with different frequencies and retention
- Storage Target: S3-compatible storage (MinIO) for backup retention
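For context, Longhorn sends backups to the destination configured by its `backup-target` setting (an S3 URL such as `s3://longhorn-backups@us-east-1/`; MinIO ignores the region, but the URL format requires one), with credentials supplied through the secret named in `backup-target-credential-secret`. Below is a minimal sketch of that secret; the secret name, bucket, and endpoint are placeholders, not values from this cluster:

```yaml
# Hypothetical credentials secret for an S3-compatible (MinIO) backup target.
# Longhorn reads these well-known keys from the secret referenced by the
# backup-target-credential-secret setting.
apiVersion: v1
kind: Secret
metadata:
  name: minio-backup-credentials      # placeholder name
  namespace: longhorn-system
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<access-key>"
  AWS_SECRET_ACCESS_KEY: "<secret-key>"
  AWS_ENDPOINTS: "https://minio.example.internal:9000"  # placeholder MinIO endpoint
```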
Backup Tiers
GFS (Grandfather-Father-Son)
- Frequency: Hourly + Daily + Weekly
- Retention: 48 hours (hourly), 14 days (daily), 8 weeks (weekly)
- Total Backups: ~70 per volume
- Storage Impact: High (frequent snapshots)
- Use Case: Critical databases requiring point-in-time recovery
Daily
- Frequency: Daily
- Retention: 14 days
- Total Backups: ~14 per volume
- Storage Impact: Medium
- Use Case: Important application data and configurations
Weekly
- Frequency: Weekly
- Retention: 4 weeks
- Total Backups: ~4 per volume
- Storage Impact: Low
- Use Case: Non-critical application data
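As a concrete sketch of how a tier maps onto Longhorn's API, the hourly leg of the GFS tier could be expressed as a RecurringJob like the one below. The job name, cron schedule, and concurrency are assumptions; the daily and weekly legs would follow the same pattern with their own `cron` and `retain` values, and the Daily and Weekly tiers would each need a single job.

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: gfs-hourly              # assumed name; one job per GFS leg
  namespace: longhorn-system
spec:
  cron: "0 * * * *"             # top of every hour (assumed schedule)
  task: backup                  # full backup to the S3 target, not just a local snapshot
  groups:
    - gfs                       # matches recurring-job-group.longhorn.io/gfs on PVCs
  retain: 48                    # keep 48 hourly backups = the 48-hour window above
  concurrency: 2                # how many volumes to back up in parallel
```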
Implementation
Labeling Strategy
PVCs and StatefulSet volumeClaimTemplates are labeled with two key-value pairs:
```yaml
metadata:
  labels:
    recurring-job.longhorn.io/source: enabled
    recurring-job-group.longhorn.io/<tier>: enabled
```
Why two labels?
Longhorn uses a two-label system for flexible backup configuration:
- `recurring-job.longhorn.io/source: enabled` marks the PVC as a backup source, telling Longhorn to include it in automated recurring backup jobs. Without this label, the PVC will not be backed up regardless of other labels.
- `recurring-job-group.longhorn.io/<tier>: enabled` specifies which recurring job group (backup schedule) applies. The `<tier>` value (`gfs`, `daily`, or `weekly`) determines the backup frequency and retention policy. Multiple groups can exist, and this label assigns the PVC to the appropriate one.
This separation allows for:
- Selective backup enabling (first label)
- Flexible scheduling assignment (second label)
- Easy management of different backup policies across the cluster
Where <tier> is one of: gfs, daily, or weekly.
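Putting the two labels together, a hypothetical PVC enrolled in the GFS tier might look like this (the claim name, storage class, and size are placeholders). For StatefulSets, the same two labels go under each entry's `metadata.labels` in `spec.volumeClaimTemplates`:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: authentik-postgres-data                   # hypothetical name
  labels:
    recurring-job.longhorn.io/source: enabled     # opt in to recurring jobs
    recurring-job-group.longhorn.io/gfs: enabled  # assign to the GFS schedule
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn                      # assumed storage class name
  resources:
    requests:
      storage: 10Gi                               # placeholder size
```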
Labeled Resources
GFS Tier
Critical Databases:
- Authentik PostgreSQL - SSO system, critical for authentication
- Immich PostgreSQL - Photo management database, contains all metadata
- OpenCode data volume - SQLite database with user code and configurations
Justification: These contain irreplaceable data where any loss would be catastrophic. Point-in-time recovery is essential for databases.
Daily Tier
PostgreSQL Databases:
- LiteLLM PostgreSQL - AI model routing, important but can tolerate some data loss
Application Data:
- Audiobookshelf library - User-uploaded media library data
- Immich library - Photo storage (metadata handled separately)
- Audiobookshelf metadata/podcasts - Media indexes and metadata
- Pipeline data - OpenWebUI pipelines, Pinepods downloads/backups, Audiobookrequest config
Justification: Important operational data that should be backed up regularly but doesn't require hourly granularity. Recovery within 24 hours is acceptable.
Weekly Tier
Application Configurations:
- Jellyfin cache - Media server cache (can be rebuilt)
- SABnzbd - Usenet downloader config and incomplete downloads
- Jellyseerr - Media request manager
- Sonarr/Radarr - Media management automation
- Baby Buddy - Baby tracking application
- MQTT - Message broker data
- Zigbee2MQTT - IoT device configuration
- UniFi Controller - Network management
- Home Assistant - Smart home configuration
- KaraKeep - Document management (MeiliSearch + web data)
- OpenWebUI web data - Chat interface data
Justification: Configuration and cache data that has value but can tolerate weekly backups. These applications can be reconfigured if lost, though it would be inconvenient.
Exclusions
Certain PVCs and applications are intentionally excluded from automated backups. Because backups are opt-in via the source label, exclusion requires no configuration: the labels are simply left off.
User-Excluded Applications
- PedroBot: Not critical for backup
- Qdrant: Vector database with replaceable embeddings
- OpenHands: Development/testing tool
- VLLM: AI model embeddings (can be regenerated)
- HeadlessX: Remote desktop (ephemeral sessions)
- Unrar: Temporary processing tool
- Media-share: Handled by separate snapshot strategy
Infrastructure Components
- Redis instances (minimal state, replaceable)
- ArgoCD, Cilium, and other infrastructure PVCs (can be redeployed)
Justification: These contain ephemeral, cache, or easily regeneratable data. Backing them up would consume storage without significant benefit.
Why This Strategy?
Risk-Based Approach
The tiered strategy balances backup frequency with storage costs and recovery requirements. Critical data gets maximum protection, while less important data gets appropriate coverage without over-protection.
Cost Optimization
- GFS: ~70 backups per volume (48 hourly + 14 daily + 8 weekly; high cost, high value)
- Daily: ~14 backups per volume (medium cost, medium value)
- Weekly: ~4 backups per volume (low cost, low value)
Recovery Considerations
- Recovery Window: GFS retains restore points spanning the last 8 weeks, with hourly granularity for the most recent 48 hours, then daily, then weekly
- RPO (Recovery Point Objective): Ranges from 1 hour (GFS) to 1 week (Weekly)
- Data Criticality: Matches backup frequency to business impact of data loss
Operational Benefits
- Automated: Label-based triggers require no manual intervention
- Scalable: New PVCs automatically inherit backup behavior
- GitOps: Backup configuration is declarative and version-controlled
- Cost-Effective: Avoids backing up ephemeral or replaceable data
Future Considerations
- Monitor backup storage usage and adjust retention if needed
- Consider adding a "monthly" tier for archival data if required
- Evaluate backup restore procedures periodically
- Review excluded applications annually for changes in criticality