
pvc-restore

Overview

pvc-restore is a Kubernetes controller that automates PVC restoration from volsync backups. Instead of manually orchestrating a complex multi-step restore process, you annotate a PVC and the controller handles everything: scaling down workloads, syncing backup data, and scaling back up.

The Problem

Restoring a PVC from a volsync backup traditionally requires the following steps:

  1. Create a new PVC via ReplicationDestination pointing to the specific backup snapshot
  2. Scale down all workloads that use the target PVC
  3. Create a temporary pod mounting both the backup PVC and target PVC
  4. Manually copy files between volumes
  5. Delete the temporary pod
  6. Scale workloads back up
  7. Clean up temporary PVC and ReplicationDestination

This process is error-prone, time-consuming, and requires careful orchestration. A single mistake, such as forgetting to scale down a workload or terminating mid-copy, can corrupt the PVC or leave resources dangling.

The controller eliminates this friction by automating the entire workflow through a single annotation.

How It Works

Architecture

pvc-restore uses three reconcilers working together:

  1. PersistentVolumeClaimReconciler: Watches PVCs for the restore annotation. When detected, it clears the annotation and creates a PVCRestore resource to trigger the restore process.

  2. PVCRestoreReconciler: Manages the restore workflow through a state machine. It orchestrates the phases of restore: acquiring locks, scaling down workloads, syncing data, and scaling back up.

  3. LockCleanupReconciler: Periodically cleans up stale locks from failed restores, preventing deadlock scenarios.
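The first reconciler's behavior can be sketched in Python (an illustrative sketch only; the actual controller is a Go program talking to the Kubernetes API, and the names below are hypothetical):

```python
import time

RESTORE_ANNOTATION = "pvc-restore.homelab-helper.benfiola.com/restore"

def reconcile_pvc(pvc: dict, created: list) -> None:
    """Sketch of PersistentVolumeClaimReconciler: on seeing the restore
    annotation, clear it and emit a PVCRestore resource."""
    annotations = pvc.setdefault("metadata", {}).setdefault("annotations", {})
    if RESTORE_ANNOTATION not in annotations:
        return  # nothing to do
    selector = annotations.pop(RESTORE_ANNOTATION)  # clear the trigger
    created.append({
        "kind": "PVCRestore",
        "metadata": {
            "name": f"{pvc['metadata']['name']}-{int(time.time())}",
            "namespace": pvc["metadata"]["namespace"],
        },
        # The real controller parses the selector into previous/restoreAsOf.
        "spec": {"pvc": pvc["metadata"]["name"], "selector": selector},
    })
```

Clearing the annotation before creating the PVCRestore is what makes the trigger one-shot: re-reconciling the same PVC is a no-op until the user annotates again.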

Workflow

  1. The user annotates the PVC with the restore trigger.
  2. PersistentVolumeClaimReconciler detects the annotation and creates a PVCRestore resource.
  3. PVCRestoreReconciler enters the Initialize phase and acquires a lock (preventing concurrent restores).
  4. The restore transitions to the ScaleDown phase: workloads mounting the PVC are found, their replica counts are recorded, and all are scaled to zero replicas.
  5. The restore transitions to the Restore phase: the ReplicationSource for the PVC is found, a ReplicationDestination pointing to the PVC is created, and the controller waits for the restore to complete.
  6. The restore transitions to the Finalize phase: workloads are scaled back to their original replica counts, the ReplicationDestination is cleaned up, and the lock is released.
  7. The restore transitions to Finished (terminal state).

Key Concepts

  • PVCRestore Resource: A custom resource that represents a single restore operation. One is created per annotation trigger. Tracks phase, owners, error state, and timing.

  • Phase-Based State Machine: The restore progresses through phases (Initialize → ScaleDown → Restore → Finalize → Finished). Each phase completes before moving to the next, enabling resumption after transient failures.

  • Locking: Prevents concurrent restores on the same PVC. Uses a ConfigMap to coordinate. Locks are automatically cleaned up after 5 minutes if abandoned (e.g., controller crash during restore).

  • Owner Tracking: Before scaling down, the controller finds all workloads mounting the PVC (Deployment, StatefulSet, DaemonSet, etc.) by walking pod ownership chains. Original replica counts are saved and restored exactly after the restore completes.

  • Timeout Protection: If restore exceeds 5 minutes, it fails with a timeout error. This prevents the controller from holding locks indefinitely.
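The locking and stale-lock cleanup described above can be illustrated with a minimal in-memory sketch (the real controller coordinates through a ConfigMap; class and method names here are invented for illustration):

```python
LOCK_TTL_SECONDS = 300  # locks abandoned for 5 minutes become reclaimable

class LockTable:
    """Sketch of the ConfigMap-based lock: one entry per PVC, stamped
    with its acquisition time so a cleanup pass can reclaim stale locks."""

    def __init__(self):
        self._locks: dict[str, float] = {}

    def acquire(self, pvc: str, now: float) -> bool:
        held = self._locks.get(pvc)
        if held is not None and now - held < LOCK_TTL_SECONDS:
            return False  # another restore is in progress
        self._locks[pvc] = now  # free, or stale enough to reclaim
        return True

    def release(self, pvc: str) -> None:
        self._locks.pop(pvc, None)

    def cleanup_stale(self, now: float) -> None:
        """What LockCleanupReconciler does periodically."""
        self._locks = {p: t for p, t in self._locks.items()
                       if now - t < LOCK_TTL_SECONDS}
```

The timestamp check is what prevents a controller crash mid-restore from deadlocking the PVC forever: after the TTL elapses, a new restore can take over the lock.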

Installation

Prerequisites

  • Kubernetes cluster with volsync deployed and configured
  • Helm 3.x

Deploy with Helm

Add the repository and install:

helm repo add homelab-helper https://benfiola.github.io/homelab-helper
helm repo update
helm install pvc-restore homelab-helper/pvc-restore \
  --namespace pvc-restore-system \
  --create-namespace

The chart deploys:

  • A Deployment running the controller
  • A ServiceAccount with necessary RBAC permissions
  • ClusterRole and ClusterRoleBinding for PVC, Pod, volsync resource access
  • Custom Resource Definition (PVCRestore)

Verify Installation

Check the deployment is running:

kubectl get deployment -n pvc-restore-system pvc-restore
kubectl logs -n pvc-restore-system -l app.kubernetes.io/name=pvc-restore

Usage

Restoring a PVC

To restore a PVC, annotate it with the restore trigger:

kubectl annotate pvc my-data pvc-restore.homelab-helper.benfiola.com/restore="<snapshot-selector>" --overwrite

The <snapshot-selector> controls which backup snapshot to restore:

  • Empty string (""): Restore from the latest snapshot.
  • Integer (e.g., "1", "2"): Restore from the N-th previous snapshot (1 = previous, 2 = two snapshots ago, etc.).
  • RFC3339 timestamp (e.g., "2024-01-15T10:30:00Z"): Restore data as it existed at that exact time.

Examples:

# Restore from the latest snapshot
kubectl annotate pvc my-data pvc-restore.homelab-helper.benfiola.com/restore="" --overwrite

# Restore from the previous snapshot
kubectl annotate pvc my-data pvc-restore.homelab-helper.benfiola.com/restore="1" --overwrite

# Restore to a specific point in time
kubectl annotate pvc my-data pvc-restore.homelab-helper.benfiola.com/restore="2024-01-15T10:30:00Z" --overwrite

ReplicationSource Required!

The PVC you're restoring must have a corresponding volsync ReplicationSource actively creating backups. The controller uses the ReplicationSource's configuration (mover settings, repository, etc.) to create a temporary ReplicationDestination for the restore.

Without an active ReplicationSource, the restore will fail.

Monitoring Restore Progress

Monitor active restores:

kubectl get pvcrestores

Check detailed status:

kubectl describe pvcrestore <name>

View the temporary ReplicationDestination created for the restore:

kubectl get replicationdestinations

The restore progresses through phases: Initialize → ScaleDown → Restore → Finalize → Finished. Once it reaches Finished, workloads are scaled back to their original replica counts and temporary resources are cleaned up.

Configuration

CLI Flags and Environment Variables

The controller is invoked as:

homelab-helper pvc-restore [flags]

Available flags and their environment variable equivalents:

  • --cache-storage-class (env: CACHE_STORAGE_CLASS, default: ""): Storage class for the ReplicationDestination cache PVC
  • --health-address (env: HEALTH_ADDRESS, default: :8081): Address for health/readiness probes (/healthz, /readyz)
  • --metrics-address (env: METRICS_ADDRESS, default: :8080): Address for the Prometheus metrics endpoint (/metrics)
  • --leader-election (env: LEADER_ELECTION, default: false): Enable leader election for HA deployments
  • --kubeconfig (env: KUBECONFIG, default: ""): Path to kubeconfig; uses in-cluster config if empty

Cache Storage Class

No Default Storage Class?

If the cluster has no default StorageClass and --cache-storage-class is unset, the generated volsync jobs will fail to start because the cache PVC will never be bound. The restore then fails with a timeout.

By default, volsync ReplicationDestinations create cache PVCs using the cluster's default StorageClass. If you want to use a different storage class for cache volumes (e.g., to avoid filling fast tier storage), specify it:

homelab-helper pvc-restore --cache-storage-class=slow-tier

Or via environment variable:

export CACHE_STORAGE_CLASS=slow-tier
homelab-helper pvc-restore

Helm Chart Values

config:
  # Storage class for ReplicationDestination cache volumes
  # Leave empty to use cluster default
  cacheStorageClass: ""

deployment:
  image:
    # Override image tag (defaults to chart version)
    tag: ""

  # Number of controller replicas (use >1 with --leader-election for HA)
  replicas: 1

  # Resource limits/requests (optional)
  resources: null
  # Example:
  # limits:
  #   cpu: 200m
  #   memory: 256Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

PVCRestore Resource Reference

Spec

The PVCRestore spec defines what PVC to restore and which backup snapshot to use.

apiVersion: pvc-restore.homelab-helper.benfiola.com/v1
kind: PVCRestore
metadata:
  name: my-data-1705316400
  namespace: default
spec:
  pvc: my-data
  previous: 1
  # OR (mutually exclusive with previous):
  # restoreAsOf: "2024-01-15T10:30:00Z"

Spec Fields:

  • pvc (string, required): Name of the PVC to restore into (must exist in the same namespace)
  • previous (integer, optional): Restore from the N-th previous snapshot. 1 = previous, 2 = two back, etc. Mutually exclusive with restoreAsOf
  • restoreAsOf (RFC3339 string, optional): Restore data as it existed at this timestamp. Mutually exclusive with previous
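The mapping from the annotation's snapshot selector to these spec fields can be illustrated with a short sketch (an illustration of the selector rules, not the controller's actual parsing code):

```python
from datetime import datetime

def parse_restore_selector(value: str) -> dict:
    """Map an annotation value onto PVCRestore spec fields:
    empty -> latest snapshot (neither field set),
    integer -> previous, RFC3339 timestamp -> restoreAsOf."""
    if value == "":
        return {}  # latest snapshot
    try:
        return {"previous": int(value)}
    except ValueError:
        pass
    # Validate the timestamp; raises if the value is neither form.
    datetime.fromisoformat(value.replace("Z", "+00:00"))
    return {"restoreAsOf": value}
```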

Status

The PVCRestore status communicates the restore state and progress.

status:
  phase: Restore
  pvcOwners:
    - apiVersion: apps/v1
      kind: Deployment
      name: my-deployment
      namespace: default
      replicas: 3
  replicationSource: my-data-replication-src
  replicationDestination: pvc-restore-my-data-1705316400
  error: null
  observedGeneration: 1
  lastReconciledTime: "2024-01-15T10:32:15Z"

Status Fields:

  • phase (string): Current phase: Initialize, ScaleDown, Restore, Finalize, or Finished
  • pvcOwners (array): Workloads (Deployments, StatefulSets, etc.) that mount the PVC, with their original replica counts
  • replicationSource (string): Name of the volsync ReplicationSource backing up the PVC
  • replicationDestination (string): Name of the temporary ReplicationDestination created for this restore
  • error (string or null): Error message if the restore failed, otherwise null
  • observedGeneration (integer): Tracks which PVCRestore spec generation was last processed
  • lastReconciledTime (RFC3339 string): Timestamp of the last successful reconciliation

PVCRestore Lifecycle

Once created, a PVCRestore transitions through phases:

  1. Initialize: Acquires a lock to prevent concurrent restores. Adds a finalizer to ensure cleanup on deletion.
  2. ScaleDown: Finds all workloads mounting the PVC, records their replica counts, then scales them to zero.
  3. Restore: Creates a ReplicationDestination pointing to the selected snapshot and waits for data sync to complete.
  4. Finalize: Scales workloads back to original replica counts, deletes the ReplicationDestination, and releases the lock.
  5. Finished: Terminal state. The restore is complete.

If any phase fails, the restore moves directly to Finalize (to clean up), then Finished.
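These transitions, including the failure short-circuit to Finalize, can be sketched as a tiny state machine (illustrative only; the function name is invented):

```python
PHASES = ["Initialize", "ScaleDown", "Restore", "Finalize", "Finished"]

def next_phase(current: str, failed: bool = False) -> str:
    """Linear progression on success; on failure, jump straight to
    Finalize so cleanup (scale-up, lock release) always runs."""
    if current == "Finished":
        return "Finished"  # terminal state
    if failed and current != "Finalize":
        return "Finalize"
    return PHASES[PHASES.index(current) + 1]
```

Routing failures through Finalize rather than straight to Finished is why a failed restore still scales workloads back up and releases its lock.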

Troubleshooting

Initial Diagnostics

When a restore is stuck or failing, start with these steps:

Check the PVCRestore status:

kubectl describe pvcrestore <name>

The status shows which phase the restore is in and any error message.

Check controller logs:

kubectl logs -n pvc-restore-system -l app.kubernetes.io/name=pvc-restore

Check ReplicationDestination and mover status:

kubectl get replicationdestinations

If a ReplicationDestination exists for your restore, check its mover status:

kubectl get replicationdestinations pvc-restore-<name> -o jsonpath='{.status.latestMoverStatus}'

The mover status field indicates:

  • Empty: Sync is still running
  • Succeeded: Sync completed
  • Failed: Sync failed

If the mover is in progress, check the volsync replication job logs:

Configuring Volsync Mover Pod Labels

By default, volsync does not attach any labels to mover pods; mover pod labels are defined on the ReplicationSource and ReplicationDestination resources.

Ensure your ReplicationSource defines mover pod labels, and substitute those labels into the command below:

kubectl logs -n <namespace> -l app.kubernetes.io/pod=volsync-mover

Restore Not Starting

Symptom: PVCRestore resource exists but remains in Initialize phase indefinitely.

Check lock contention:

kubectl get configmaps -n <namespace> | grep pvc-restore-lock

If a lock ConfigMap exists for your PVC, another restore may be in progress or deadlocked. Check for running PVCRestore resources:

kubectl get pvcrestores -n <namespace> | grep <pvc-name>

If you see stale locks and no corresponding PVCRestore, the lock cleanup reconciler should clean them up (runs periodically). To manually clean up:

kubectl delete configmap pvc-restore-lock-<pvc-name> -n <namespace>

Restore Stuck in Restore Phase

Symptom: PVCRestore has been in Restore phase for an extended time, or timed out.

Follow the Initial Diagnostics steps above. Common causes:

  • ReplicationDestination mover is slow (backup is large or network is slow)
  • ReplicationSource hasn't produced any snapshots yet
  • Mover job failed to start

Check the volsync mover logs to see if the sync is in progress, succeeded, or failed. If the mover succeeded but restore still times out, check that the ReplicationSource is healthy:

kubectl get replicationsources <name> -o jsonpath='{.status.latestMoverStatus}'

Ensure it's actually creating snapshots.

Enable Debug Logging

Run controller with debug logging:

kubectl set env -n pvc-restore-system deployment/pvc-restore LOG_LEVEL=debug

Then tail the logs:

kubectl logs -n pvc-restore-system -l app.kubernetes.io/name=pvc-restore -f

Limitations

  • One restore at a time per PVC: Concurrent restores on the same PVC are serialized via locking. A restore triggered while another is in progress waits for the lock to be released before starting.
  • Fixed timeout: Restore must complete within 5 minutes.
  • Scale subresource required: Workloads must support the .scale subresource (Deployment, StatefulSet, DaemonSet, etc.). Other workload types cannot be automatically scaled down.
  • PVC must exist: You're restoring into an existing PVC. The controller does not create the target PVC.
  • Single namespace: Restore only works for PVCs and workloads in the same namespace.

See Also