pvc-restore
Overview
pvc-restore is a Kubernetes controller that automates PVC restoration from volsync backups. Instead of manually orchestrating a complex multi-step restore process, you annotate a PVC and the controller handles everything: scaling down workloads, syncing backup data, and scaling back up.
The Problem
Restoring a PVC from a volsync backup traditionally requires the following steps:
- Create a new PVC via a `ReplicationDestination` pointing to the specific backup snapshot
- Scale down all workloads that use the target PVC
- Create a temporary pod mounting both the backup PVC and target PVC
- Manually copy files between volumes
- Delete the temporary pod
- Scale workloads back up
- Clean up the temporary PVC and `ReplicationDestination`
This process is error-prone, time-consuming, and requires careful orchestration. A single mistake, such as forgetting to scale down a workload or terminating mid-copy, can corrupt the PVC or leave resources dangling.
The controller eliminates this friction by automating the entire workflow through a single annotation.
How It Works
Architecture
pvc-restore uses three reconcilers working together:
- `PersistentVolumeClaimReconciler`: Watches PVCs for the restore annotation. When detected, it clears the annotation and creates a `PVCRestore` resource to trigger the restore process.
- `PVCRestoreReconciler`: Manages the restore workflow through a state machine. It orchestrates the phases of a restore: acquiring locks, scaling down workloads, syncing data, and scaling back up.
- `LockCleanupReconciler`: Periodically cleans up stale locks from failed restores, preventing deadlock scenarios.
Workflow
Annotate PVC
↓
PersistentVolumeClaimReconciler detects annotation
↓
Creates PVCRestore resource
↓
PVCRestoreReconciler enters Initialize phase
↓
Acquires lock (prevents concurrent restores)
↓
Transitions to ScaleDown phase
↓
Finds workloads mounting the PVC, records their replica counts
↓
Scales all workloads to zero replicas
↓
Transitions to Restore phase
↓
Finds ReplicationSource for the PVC
↓
Creates ReplicationDestination pointing to PVC
↓
Waits for restore to complete
↓
Transitions to Finalize phase
↓
Scales workloads back to original replica counts
↓
Cleans up ReplicationDestination
↓
Releases lock
↓
Transitions to Finished (terminal state)
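From the user's perspective, the whole workflow above collapses to two commands; the controller drives everything in between (the PVC name `my-data` is illustrative):

```shell
# Trigger a restore from the latest snapshot (empty selector)
kubectl annotate pvc my-data \
  pvc-restore.homelab-helper.benfiola.com/restore="" --overwrite

# Watch the generated PVCRestore progress through the phases above
kubectl get pvcrestores -w
```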
Key Concepts
- `PVCRestore` Resource: A custom resource representing a single restore operation. One is created per annotation trigger. It tracks phase, owners, error state, and timing.
- Phase-Based State Machine: The restore progresses through phases (Initialize → ScaleDown → Restore → Finalize → Finished). Each phase completes before the next begins, enabling resumption after transient failures.
- Locking: Prevents concurrent restores on the same PVC, using a `ConfigMap` to coordinate. Locks are automatically cleaned up after 5 minutes if abandoned (e.g., the controller crashed mid-restore).
- Owner Tracking: Before scaling down, the controller finds all workloads mounting the PVC (Deployment, StatefulSet, DaemonSet, etc.) by walking pod ownership chains. Original replica counts are saved and restored exactly after the restore completes.
- Timeout Protection: If a restore exceeds 5 minutes, it fails with a timeout error. This prevents the controller from holding locks indefinitely.
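As a rough sketch, the lock might be a ConfigMap shaped like the following. This is an assumption for illustration: the name pattern matches the `pvc-restore-lock-<pvc-name>` convention referenced in Troubleshooting, but the data keys are hypothetical, not the controller's actual schema.

```yaml
# Hypothetical lock ConfigMap; data keys are illustrative only
apiVersion: v1
kind: ConfigMap
metadata:
  name: pvc-restore-lock-my-data
  namespace: default
data:
  holder: my-data-1705316400          # PVCRestore currently holding the lock
  acquiredAt: "2024-01-15T10:30:00Z"  # basis for stale-lock cleanup
```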
Installation
Prerequisites
- Kubernetes cluster with volsync deployed and configured
- Helm 3.x
Deploy with Helm
Add the repository and install:
helm repo add homelab-helper https://benfiola.github.io/homelab-helper
helm repo update
helm install pvc-restore homelab-helper/pvc-restore \
--namespace pvc-restore-system \
--create-namespace
The chart deploys:
- A Deployment running the controller
- A ServiceAccount with necessary RBAC permissions
- ClusterRole and ClusterRoleBinding for PVC, Pod, volsync resource access
- Custom Resource Definition (PVCRestore)
Verify Installation
Check the deployment is running:
kubectl get deployment -n pvc-restore-system pvc-restore
kubectl logs -n pvc-restore-system -l app.kubernetes.io/name=pvc-restore
Usage
Restoring a PVC
To restore a PVC, annotate it with the restore trigger:
kubectl annotate pvc my-data pvc-restore.homelab-helper.benfiola.com/restore="<snapshot-selector>" --overwrite
The <snapshot-selector> controls which backup snapshot to restore:
- Empty string (`""`): Restore from the latest snapshot.
- Integer (e.g., `"1"`, `"2"`): Restore from the N-th previous snapshot (1 = previous, 2 = two snapshots ago, etc.).
- RFC3339 timestamp (e.g., `"2024-01-15T10:30:00Z"`): Restore data as it existed at that exact time.
Examples:
# Restore from the latest snapshot
kubectl annotate pvc my-data pvc-restore.homelab-helper.benfiola.com/restore="" --overwrite
# Restore from the previous snapshot
kubectl annotate pvc my-data pvc-restore.homelab-helper.benfiola.com/restore="1" --overwrite
# Restore to a specific point in time
kubectl annotate pvc my-data pvc-restore.homelab-helper.benfiola.com/restore="2024-01-15T10:30:00Z" --overwrite
The PVC you're restoring must have a corresponding volsync ReplicationSource actively creating backups. The controller uses the ReplicationSource's configuration (mover settings, repository, etc.) to create a temporary ReplicationDestination for the restore.
Without an active ReplicationSource, the restore will fail.
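For reference, a minimal restic-based ReplicationSource for the `my-data` PVC might look like the following. The repository secret name and schedule are illustrative; see the volsync documentation for the full mover configuration.

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: my-data-replication-src
  namespace: default
spec:
  sourcePVC: my-data
  trigger:
    schedule: "0 * * * *"  # hourly backups
  restic:
    # Secret containing the restic repository URL and password
    repository: my-data-restic-secret
    copyMethod: Snapshot
    retain:
      hourly: 6
      daily: 7
```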
Monitoring Restore Progress
Monitor active restores:
kubectl get pvcrestores
Check detailed status:
kubectl describe pvcrestore <name>
View the temporary ReplicationDestination created for the restore:
kubectl get replicationdestinations
The restore progresses through phases: Initialize → ScaleDown → Restore → Finalize → Finished. Once it reaches Finished, workloads are scaled back to their original replica counts and temporary resources are cleaned up.
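To poll just the phase of a specific restore, a jsonpath query works:

```shell
# Prints the current phase, e.g. Restore while syncing, Finished when done
kubectl get pvcrestore <name> -o jsonpath='{.status.phase}'
```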
Configuration
CLI Flags and Environment Variables
The controller is invoked as:
homelab-helper pvc-restore [flags]
Available flags and their environment variable equivalents:
| Flag | Environment Variable | Default | Description |
|---|---|---|---|
| `--cache-storage-class` | `CACHE_STORAGE_CLASS` | `""` | Storage class for ReplicationDestination cache PVC |
| `--health-address` | `HEALTH_ADDRESS` | `:8081` | Address for health/readiness probes (`/healthz`, `/readyz`) |
| `--metrics-address` | `METRICS_ADDRESS` | `:8080` | Address for Prometheus metrics endpoint (`/metrics`) |
| `--leader-election` | `LEADER_ELECTION` | `false` | Enable leader election for HA deployments |
| `--kubeconfig` | `KUBECONFIG` | `""` | Path to kubeconfig; uses in-cluster config if empty |
Cache Storage Class
By default, volsync ReplicationDestinations create cache PVCs using the cluster's default StorageClass. If the cluster has no default StorageClass and no cache storage class is configured, the generated volsync jobs never start because the cache PVC is never bound, and the restore eventually times out. To use a different storage class for cache volumes (e.g., to avoid filling fast-tier storage), specify it:
homelab-helper pvc-restore --cache-storage-class=slow-tier
Or via environment variable:
export CACHE_STORAGE_CLASS=slow-tier
homelab-helper pvc-restore
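Flags can be combined as needed; for example, a highly available deployment on a cluster without a default StorageClass might run (flag values illustrative):

```shell
homelab-helper pvc-restore \
  --cache-storage-class=slow-tier \
  --leader-election \
  --metrics-address=:9090
```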
Helm Chart Values
config:
  # Storage class for ReplicationDestination cache volumes
  # Leave empty to use cluster default
  cacheStorageClass: ""

deployment:
  image:
    # Override image tag (defaults to chart version)
    tag: ""
  # Number of controller replicas (use >1 with --leader-election for HA)
  replicas: 1
  # Resource limits/requests (optional)
  resources: null
  # Example:
  # resources:
  #   limits:
  #     cpu: 200m
  #     memory: 256Mi
  #   requests:
  #     cpu: 100m
  #     memory: 128Mi
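These values can also be set at install time; for example (values illustrative):

```shell
helm upgrade --install pvc-restore homelab-helper/pvc-restore \
  --namespace pvc-restore-system \
  --create-namespace \
  --set config.cacheStorageClass=slow-tier
```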
PVCRestore Resource Reference
Spec
The PVCRestore spec defines what PVC to restore and which backup snapshot to use.
apiVersion: pvc-restore.homelab-helper.benfiola.com/v1
kind: PVCRestore
metadata:
  name: my-data-1705316400
  namespace: default
spec:
  pvc: my-data
  previous: 1
  # OR (mutually exclusive with previous):
  # restoreAsOf: "2024-01-15T10:30:00Z"
Spec Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `pvc` | string | Yes | Name of the PVC to restore into (must exist in the same namespace) |
| `previous` | integer | No | Restore from the N-th previous snapshot (1 = previous, 2 = two back, etc.). Mutually exclusive with `restoreAsOf` |
| `restoreAsOf` | RFC3339 string | No | Restore data as it existed at this timestamp. Mutually exclusive with `previous` |
Status
The PVCRestore status communicates the restore state and progress.
status:
  phase: Restore
  pvcOwners:
    - apiVersion: apps/v1
      kind: Deployment
      name: my-deployment
      namespace: default
      replicas: 3
  replicationSource: my-data-replication-src
  replicationDestination: pvc-restore-my-data-1705316400
  error: null
  observedGeneration: 1
  lastReconciledTime: "2024-01-15T10:32:15Z"
Status Fields:
| Field | Type | Description |
|---|---|---|
| `phase` | string | Current phase: Initialize, ScaleDown, Restore, Finalize, or Finished |
| `pvcOwners` | array | Workloads (Deployments, StatefulSets, etc.) that mount the PVC, with their original replica counts |
| `replicationSource` | string | Name of the volsync ReplicationSource backing up the PVC |
| `replicationDestination` | string | Name of the temporary ReplicationDestination created for this restore |
| `error` | string or null | Error message if the restore failed, otherwise null |
| `observedGeneration` | integer | Tracks which PVCRestore spec generation was last processed |
| `lastReconciledTime` | RFC3339 string | Timestamp of the last successful reconciliation |
PVCRestore Lifecycle
Once created, a PVCRestore transitions through phases:
- Initialize: Acquires a lock to prevent concurrent restores. Adds a finalizer to ensure cleanup on deletion.
- ScaleDown: Finds all workloads mounting the PVC, records their replica counts, then scales them to zero.
- Restore: Creates a `ReplicationDestination` pointing to the selected snapshot and waits for the data sync to complete.
- Finalize: Scales workloads back to their original replica counts, deletes the `ReplicationDestination`, and releases the lock.
- Finished: Terminal state. The restore is complete.
If any phase fails, the status moves directly to Finalize (to clean up), then Finished.
Troubleshooting
Initial Diagnostics
When a restore is stuck or failing, start with these steps:
Check the PVCRestore status:
kubectl describe pvcrestore <name>
The status shows which phase the restore is in and any error message.
Check controller logs:
kubectl logs -n pvc-restore-system -l app.kubernetes.io/name=pvc-restore
Check ReplicationDestination and mover status:
kubectl get replicationdestinations
If a ReplicationDestination exists for your restore, check its mover status:
kubectl get replicationdestinations pvc-restore-<name> -o jsonpath='{.status.latestMoverStatus}'
The mover status field indicates:
- Empty: Sync is still running
- `Succeeded`: Sync completed
- `Failed`: Sync failed
If the mover is in progress, check the volsync replication job logs:
By default, volsync does not attach labels to mover pods; mover pod labels are configured on the ReplicationSource and ReplicationDestination resources. Ensure your ReplicationSource defines mover pod labels, and substitute the labels you configured into this command:
kubectl logs -n <namespace> -l app.kubernetes.io/pod=volsync-mover
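As a hedged sketch, recent volsync versions expose a `moverPodLabels` field on the mover spec; verify the field name against your volsync version's CRD before relying on it. The label value here matches the example selector used in the log command above:

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: my-data-replication-src
spec:
  sourcePVC: my-data
  restic:
    repository: my-data-restic-secret
    copyMethod: Snapshot
    # Labels applied to the mover pods (field availability varies by version)
    moverPodLabels:
      app.kubernetes.io/pod: volsync-mover
```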
Restore Not Starting
Symptom: PVCRestore resource exists but remains in Initialize phase indefinitely.
Check lock contention:
kubectl get configmaps -n <namespace> | grep pvc-restore-lock
If a lock ConfigMap exists for your PVC, another restore may be in progress or deadlocked. Check for running PVCRestore resources:
kubectl get pvcrestores -n <namespace> | grep <pvc-name>
If you see stale locks and no corresponding PVCRestore, the lock cleanup reconciler should clean them up (runs periodically). To manually clean up:
kubectl delete configmap pvc-restore-lock-<pvc-name> -n <namespace>
Restore Stuck in Restore Phase
Symptom: PVCRestore has been in Restore phase for an extended time, or timed out.
Follow the Initial Diagnostics steps above. Common causes:
- ReplicationDestination mover is slow (backup is large or network is slow)
- ReplicationSource hasn't produced any snapshots yet
- Mover job failed to start
Check the volsync mover logs to see if the sync is in progress, succeeded, or failed. If the mover succeeded but restore still times out, check that the ReplicationSource is healthy:
kubectl get replicationsources <name> -o jsonpath='{.status.latestMoverStatus}'
Ensure it's actually creating snapshots.
Enable Debug Logging
Run controller with debug logging:
kubectl set env -n pvc-restore-system deployment/pvc-restore LOG_LEVEL=debug
Then tail the logs:
kubectl logs -n pvc-restore-system -l app.kubernetes.io/name=pvc-restore -f
Limitations
- One restore at a time per PVC: Concurrent restores on the same PVC are serialized via locking; a second restore waits until the first completes.
- Fixed timeout: Restore must complete within 5 minutes.
- Scale subresource required: Workloads must support the `scale` subresource (Deployment, StatefulSet, DaemonSet, etc.). Other workload types cannot be automatically scaled down.
- PVC must exist: The restore targets an existing PVC; the controller does not create the target PVC.
- Single namespace: Restore only works for PVCs and workloads in the same namespace.
See Also
- volsync Documentation — Complete reference for ReplicationSource and ReplicationDestination
- Kubernetes Workload Scaling — How the controller scales workloads
- PVC and Storage — PVC fundamentals