From Disaster to Recovery: A Practical Case Study on Kubernetes etcd Backups


1. Backup Solution Approach

2. Detailed Implementation Steps

Preparation: Creating Storage Paths on the NFS Server

Copy etcd Certificates to the NFS Directory

Writing the PVC Manifest: Creating Persistent Volumes Based on the NFS StorageClass

Key Points:

Writing the Dockerfile


Building the Image

Writing the CronJob Manifest (Adapted for NFS Mounts)

Testing and Verification: Confirming NFS Backup Success

3. Data Recovery Steps

Pre-recovery Preparation

Perform Recovery

Back Up Existing Data on All etcd Nodes

Post-Recovery Verification and Service Startup

In Kubernetes (K8s) clusters, etcd functions as the "brain": it stores all state data for the entire cluster, from Pod configurations and service registrations to network policy definitions, so the cluster's stability depends entirely on etcd. Any loss or corruption of etcd data can paralyze the cluster and severely impact business operations. Real-world operations, however, are fraught with unexpected events such as human error, hardware failure, and network anomalies, all of which threaten data integrity. Building a reliable backup mechanism, specifically periodic automated backups, is therefore a critical link in ensuring K8s cluster stability and business continuity.

This article focuses on periodic etcd data backups in a K8s environment. Through a hands-on case study, we will demonstrate how to build an efficient and stable automated backup solution to help O&M personnel navigate data security challenges.

While Ceph RBD or object storage is often a more effective backend, the operational logic is similar, so this demonstration uses an NFS-backed PV. The core of the solution is to use the native Kubernetes CronJob controller for periodic task scheduling: the etcdctl utility performs the backup, and the backup target is a server-side NFS export. You must first create dedicated directories for certificates and backups on the NFS server and configure the appropriate permissions.

Verify the NFS Shared Directory

In this environment, the NFS server's shared directory is /data/nfs-server (verified via the exportfs command), and it is accessible from all K8s nodes.

Create Certificate and Backup Directories

Create a certificate directory (etcd-certs) and a backup directory (etcd-backup) under /data/nfs-server, and assign read/write permissions so the K8s Pods can access them. Also confirm that the worker nodes can mount and use the NFS storage.

Copy the etcd certificates to the etcd-certs directory on the NFS server (the source certificate path is /etc/kubernetes/pki/etcd/).

Next, create two PersistentVolumeClaims (PVCs): one for mounting the certificates (read-only) and one for mounting the backup directory (read-write).

Create the Certificate PVC (static-pvc-etcd-certs): create a new file named static-pv-pvc-etcd-certs.yaml. It mounts the NFS etcd-certs directory (certificates require ReadOnlyMany access).

Create the Backup PVC (static-pvc-etcd-backup): create a new file named etcd-backup-pv-pvc.yaml. It mounts the NFS etcd-backup directory (requires read-write access for storing backup files).

Create the PVCs and verify that their status is Bound.

Key points:

- Certificate directory uses ReadOnlyMany (ROX): Pods on multiple nodes can mount the volume read-only, preventing accidental modification of sensitive certificates.
- Backup directory uses ReadWriteMany (RWX): Pods on any node can perform read/write operations, which is essential since the CronJob Pod may be scheduled on different nodes across the cluster.
- StorageClass:

Must be consistent between the PV and PVC. In the static setup above it is intentionally empty (""), which disables dynamic provisioning and lets each PVC bind to its hand-written PV via volumeName; if you prefer dynamic provisioning, specify the existing nfs-csi class in the PVC instead so it can bind to dynamically provisioned NFS volumes.

The Dockerfile only defines the etcdctl utility and the backup command environment inside the container. It is independent of the storage type, so we can reuse the previous configuration directly. Download the etcdctl binary from the etcd project on GitHub and keep the executable. Note that the main branch may be in an unstable or even broken state during development; for stable versions, use the tagged releases.

As the directory listing below shows, preparation consists of placing the etcdctl binary in the same build context as the Dockerfile. This ensures that when the image is built, the tool is available inside the container to execute the snapshot commands.

This walkthrough manually distributes the backup image across the cluster nodes. In a production environment, it is generally recommended to push the image to a private container registry (such as Harbor or Azure Container Registry) and configure an imagePullSecret so the nodes can pull the image automatically.

Create a new file named cj-backup-etcd-nfs.yaml with the following content. After deploying the CronJob, the primary goal of this stage is to verify that backup files are being correctly generated and stored in the etcd-backup directory on the NFS server.
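One practical detail the CronJob does not cover is retention: with the every-minute test schedule, a new ~13 MB snapshot lands on the NFS share each run. A minimal pruning sketch is shown below; it is an addition of this edit, not part of the article's solution, and `BACKUP_DIR` and the 7-day window are illustrative choices.

```shell
#!/bin/sh
# Hypothetical cleanup step: delete snapshots older than RETENTION_DAYS.
# Could run as a second CronJob, or be appended to the backup container's CMD.
BACKUP_DIR=${BACKUP_DIR:-/data/nfs-server/etcd-backup}
RETENTION_DAYS=${RETENTION_DAYS:-7}
find "$BACKUP_DIR" -name 'etcd-*.backup' -type f -mtime +"$RETENTION_DAYS" -print -delete
```

Anything matching the snapshot naming pattern and older than the window is printed and removed; everything else on the share is left alone.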
Check the CronJob and Pod status, and confirm that snapshot files are appearing in the backup directory on the NFS server.

Before initiating a restore, it is critical to ensure the integrity of your backup file.

Verify Backup Validity

First, confirm that the backup file is complete and usable by checking the snapshot with the etcdctl utility.

When restoring etcd data, you must stop all control plane components that depend on etcd. This prevents data write conflicts and ensures the restoration process has exclusive access to the data directory.

Multi-node etcd Cluster Recovery (Production Environment)

In a high-availability cluster (typically 3 nodes), the restoration must be performed on all nodes. Each node must rebuild its data directory from the same snapshot to ensure consistency across the cluster.

1. Fix Directory Permissions: after performing the restoration, ensure that the etcd data directory has the correct ownership and permissions. If the permissions are incorrect (e.g., owned by root while the service expects the etcd user), the etcd service will likely fail to start.
2. Start Control Plane Components.
3. Verify Cluster Status and confirm the recovery succeeded.

⚠️ Important Considerations

- Restoration overwrites existing data: the restoration process wipes the current etcd data. Always verify your backup file before proceeding.
- Version compatibility: the etcdctl version must strictly match the etcd cluster version (e.g., use v3.5.0 etcdctl for a v3.5.0 etcd cluster); otherwise, the restoration may fail.

Production Environment Recommendations:

- Back up before you restore: always take a fresh snapshot of the current (even if corrupted) etcd data (etcdctl snapshot save) before starting the recovery. This provides a rollback point in case of operational errors.
- Plan for downtime: the restoration process causes brief cluster unavailability, so it is highly recommended to perform it during off-peak hours.
- Post-recovery sync check: after restoring a multi-node etcd cluster, verify that all nodes have successfully joined and are in sync using etcdctl member list.
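Because the per-node restore commands differ only in `--name` and the peer URL, it can help to generate them from a single member list and review them before running each one on its node. The sketch below assumes the article's three-node topology (etcd-1..etcd-3 on 10.0.0.6/7/8); it only prints the commands, it does not execute anything.

```shell
#!/bin/sh
# Emit the restore command for each etcd member (topology from the article).
SNAPSHOT=/tmp/etcd-backup.db
CLUSTER="etcd-1=https://10.0.0.6:2380,etcd-2=https://10.0.0.7:2380,etcd-3=https://10.0.0.8:2380"
for member in etcd-1:10.0.0.6 etcd-2:10.0.0.7 etcd-3:10.0.0.8; do
  name=${member%%:*}   # member name, e.g. etcd-1
  ip=${member#*:}      # node IP for the peer URL
  echo "etcdctl snapshot restore $SNAPSHOT --data-dir=/var/lib/etcd --name=$name --initial-cluster=$CLUSTER --initial-cluster-token=etcd-cluster-token --initial-advertise-peer-urls=https://$ip:2380"
done
```

Generating the commands this way keeps `--initial-cluster` identical everywhere, which is exactly the consistency the multi-node recovery section requires.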
By following these steps, you can leverage etcd snapshot backups to restore Kubernetes cluster data, ensuring rapid service recovery in the event of data anomalies.


```shell
root@k8s-master:~# mkdir -p /data/nfs-server/etcd-certs
root@k8s-master:~# mkdir -p /data/nfs-server/etcd-backup
root@k8s-master:~# ll /data/nfs-server/
total 32
drwxr-xr-x 8 root root 4096 Jul 30 16:21 ./
drwxr-xr-x 3 root root 4096 Jul 30 14:54 ../
-rw-r--r-- 1 root root    0 Jul 30 14:55 1.txt
drwxr-xr-x 2 root root 4096 Jul 30 16:21 etcd-backup/
drwxr-xr-x 2 root root 4096 Jul 30 16:20 etcd-certs/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv1/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv2/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv3/
drwxr-xr-x 3 root root 4096 Jul 30 15:22 sc/
root@k8s-master:~# chmod 755 /data/nfs-server/etcd-certs
root@k8s-master:~# chmod 755 /data/nfs-server/etcd-backup

# Confirm a worker node can mount and browse the share
root@k8s-node1:~# install -d /data/nfs-server/
root@k8s-node1:~# mount -t nfs 10.0.0.6:/data/nfs-server /data/nfs-server/
root@k8s-node1:~# ll /data/nfs-server/
total 32
drwxr-xr-x 8 root root 4096 Jul 30 16:21 ./
drwxr-xr-x 3 root root 4096 Jul 30 17:02 ../
-rw-r--r-- 1 root root    0 Jul 30 14:55 1.txt
drwxr-xr-x 2 root root 4096 Jul 30 16:21 etcd-backup/
drwxr-xr-x 2 root root 4096 Jul 30 16:22 etcd-certs/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv1/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv2/
drwxrwxrwx 2 root root 4096 Jul 30 15:03 pv3/
drwxr-xr-x 5 root root 4096 Jul 30 16:26 sc/
```
```shell
root@k8s-master:~# cp /etc/kubernetes/pki/etcd/{ca.crt,peer.crt,peer.key} /data/nfs-server/etcd-certs/
root@k8s-master:~# ll /data/nfs-server/etcd-certs/
total 20
drwxr-xr-x 2 root root 4096 Jul 30 16:22 ./
drwxr-xr-x 8 root root 4096 Jul 30 16:21 ../
-rw-r--r-- 1 root root 1094 Jul 30 16:22 ca.crt
-rw-r--r-- 1 root root 1204 Jul 30 16:22 peer.crt
-rw------- 1 root root 1675 Jul 30 16:22 peer.key
```

static-pv-pvc-etcd-certs.yaml:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: static-pv-etcd-certs            # PV name
spec:
  capacity:
    storage: 1Gi                        # For identification only; does not restrict actual storage
  accessModes:
    - ReadOnlyMany                      # Read-only, allowing access from multiple nodes
  persistentVolumeReclaimPolicy: Retain # Retain data; do not delete files when the PVC is deleted
  nfs:
    server: 10.0.0.6                    # NFS server IP (as specified in your environment)
    path: /data/nfs-server/etcd-certs   # Manually created directory for certificates
  storageClassName: ""                  # No StorageClass, to avoid dynamic provisioning
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-pvc-etcd-certs           # PVC name
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
  volumeName: static-pv-etcd-certs      # Manually bound to the PV defined above
  storageClassName: ""                  # Must match the PV configuration
```
etcd-backup-pv-pvc.yaml:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: static-pv-etcd-backup           # PV name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany                     # Read-write, allowing multi-node access
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.6                    # NFS server IP
    path: /data/nfs-server/etcd-backup  # Manually created backup directory
  storageClassName: ""
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-pvc-etcd-backup          # PVC name
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: static-pv-etcd-backup     # Manually bound to the PV above
  storageClassName: ""
```
```shell
# Create the static PVs and PVCs (each PVC binds to its PV via volumeName)
root@k8s-master:~# kubectl apply -f static-pv-pvc-etcd-certs.yaml
root@k8s-master:~# kubectl apply -f etcd-backup-pv-pvc.yaml
# Verify status (ensure STATUS is "Bound")
root@k8s-master:~# kubectl get pv,pvc
```

```shell
root@k8s-master:~/bak-etcd# cat Dockerfile
FROM alpine:latest
LABEL maintainer="NovaCaoFc" \
      role="bak" \
      project="etcd"
COPY etcdctl /usr/local/bin/
CMD ["/bin/sh","-c","etcdctl --endpoints=${ETCD_HOST}:${ETCD_PORT} --cacert=/certs/ca.crt --cert=/certs/peer.crt --key=/certs/peer.key snapshot save /backup/etcd-`date +%F-%T`.backup"]
```
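The CMD above timestamps each snapshot with `date +%F-%T`, which embeds colons in the filename (as the backup listing later shows). Colons are legal on Linux/NFS, but they trip up some tools, such as scp remote-path parsing or Windows clients reading the share. A colon-free variant is a one-format-change alternative, not the article's naming scheme:

```shell
#!/bin/sh
# %H%M%S instead of %T: same information, no colons in the name.
fname="etcd-$(date +%F-%H%M%S).backup"
echo "$fname"
```

If you adopt this, change only the `date` format string inside the Dockerfile CMD; everything else stays the same.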
```shell
root@k8s-master:~/bak-etcd# ll
total 16096
drwxr-xr-x  2 root root     4096 Jul 30 16:37 ./
drwx------ 13 root root     4096 Jul 30 16:37 ../
-rw-r--r--  1 root root      302 Jul 30 16:31 Dockerfile
-rwxr-xr-x  1 cao  cao  16466072 Jul 26 02:17 etcdctl*

root@k8s-master:~/bak-etcd# docker build -t etcd-bak:v1 .
[+] Building 0.1s (7/7) FINISHED                                 docker:default
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 353B                                       0.0s
 => [internal] load metadata for docker.io/library/alpine:latest           0.0s
 => [internal] load .dockerignore                                          0.0s
 => => transferring context: 2B                                            0.0s
 => [internal] load build context                                          0.0s
 => => transferring context: 31B                                           0.0s
 => [1/2] FROM docker.io/library/alpine:latest                             0.0s
 => CACHED [2/2] COPY etcdctl /usr/local/bin/                              0.0s
 => exporting to image                                                     0.0s
 => => exporting layers                                                    0.0s
 => => writing image sha256:8a29a144172a91e01eb81d8e540fb785e9749058be1d6336871036e9fb781adb  0.0s
 => => naming to docker.io/library/etcd-bak:v1

root@k8s-master:~/bak-etcd# docker images | grep bak
etcd-bak   v1   8a29a144172a   7 minutes ago   24.8MB
```
```shell
# Save the image as a tar archive
root@k8s-master:~# docker save -o etcd-bak.tar etcd-bak:v1
# Distribute the image to the other nodes via scp
root@k8s-master:~# scp etcd-bak.tar 10.0.0.7:/root/
etcd-bak.tar                                  100%   24MB  73.5MB/s   00:00
root@k8s-master:~# scp etcd-bak.tar 10.0.0.8:/root/
etcd-bak.tar
# Import the image on the other nodes
root@k8s-node1:~# docker load -i etcd-bak.tar
418dccb7d85a: Loading layer [==================================================>]  8.596MB/8.596MB
39e2b60cb098: Loading layer [==================================================>]  16.47MB/16.47MB
Loaded image: etcd-bak:v1
```

```shell
root@k8s-master:~# cat cj-backup-etcd-nfs.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-etcd
spec:
  schedule: "* * * * *"   # Use "every minute" for initial testing
  jobTemplate:
    spec:
      template:
        spec:
          volumes:
            - name: certs
              persistentVolumeClaim:
                claimName: static-pvc-etcd-certs    # The static certificate PVC
            - name: bak
              persistentVolumeClaim:
                claimName: static-pvc-etcd-backup   # The static backup PVC
          containers:
            - name: etcd-backup
              image: etcd-bak:v1
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - name: certs
                  mountPath: /certs
                  readOnly: true
                - name: bak
                  mountPath: /backup
              env:
                - name: ETCD_HOST
                  value: "10.0.0.6"   # Your etcd node IP
                - name: ETCD_PORT
                  value: "2379"
          restartPolicy: OnFailure
```
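Once the every-minute test runs look good, the schedule is usually relaxed; something like "0 2 * * *" (daily at 02:00) is a common choice, though the exact cadence is a choice of this edit, not the article's. A CronJob schedule uses the standard five-field cron syntax, which a tiny shell guard can sanity-check before you apply the manifest:

```shell
#!/bin/sh
# Count the fields of a candidate CronJob schedule (must be exactly five).
set -f                      # disable globbing so the *s are not expanded
schedule="0 2 * * *"        # illustrative production cadence: daily at 02:00
set -- $schedule            # split the schedule into positional parameters
echo "fields: $#"
```

A count other than five means the string will be rejected by the CronJob controller.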
```shell
root@k8s-master:~# kubectl apply -f cj-backup-etcd-nfs.yaml
cronjob.batch/backup-etcd created
root@k8s-master:~# kubectl get -f cj-backup-etcd-nfs.yaml
NAME          SCHEDULE    TIMEZONE   SUSPEND   ACTIVE   LAST SCHEDULE   AGE
backup-etcd   * * * * *   <none>     False     0        <none>          17s

root@k8s-master:~# kubectl get pods
NAME                         READY   STATUS             RESTARTS      AGE
backup-etcd-29231120-9s9dm   0/1     Completed          0             3m4s
backup-etcd-29231121-nfjl5   0/1     CrashLoopBackOff   1 (11s ago)   2m4s
backup-etcd-29231122-gl4df   0/1     Completed          1             64s
backup-etcd-29231123-tfw5f   0/1     Completed          0             4s

root@k8s-master:~# ll /data/nfs-server/etcd-backup/
total 50824
drwxr-xr-x 2 root root     4096 Jul 30 17:23 ./
drwxr-xr-x 8 root root     4096 Jul 30 16:21 ../
-rw------- 1 root root 13004832 Jul 30 17:22 etcd-2025-07-30-09:22:50.backup
-rw------- 1 root root 13004832 Jul 30 17:22 etcd-2025-07-30-09:22:53.backup
-rw------- 1 root root 13004832 Jul 30 17:23 etcd-2025-07-30-09:23:01.backup
```

```shell
# The backup file lives in the NFS directory; first copy it to the local node (e.g., the master)
cp /data/nfs-server/etcd-backup/etcd-2025-07-30-09:22:50.backup /tmp/etcd-backup.db
# Verify the backup file (use an etcdctl version that matches your cluster version)
etcdctl --write-out=table snapshot status /tmp/etcd-backup.db
```
-weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-apiserver*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-controller-manager*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-scheduler*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_etcd*) # Execute on the Master node (adjust based on your actual components) -weight: 500;">systemctl -weight: 500;">stop kubelet # Stop the control plane containers -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-apiserver*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-controller-manager*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-scheduler*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_etcd*) # Execute on the Master node (adjust based on your actual components) -weight: 500;">systemctl -weight: 500;">stop kubelet # Stop the control plane containers -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-apiserver*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-controller-manager*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_kube-scheduler*) -weight: 500;">docker -weight: 500;">stop $(-weight: 500;">docker ps -q --filter name=k8s_etcd*) mv /var/lib/etcd /var/lib/etcd.bak mv /var/lib/etcd /var/lib/etcd.bak mv /var/lib/etcd /var/lib/etcd.bak etcdctl snapshot restore /tmp/etcd-backup.db \ --data-dir=/var/lib/etcd \ --name=etcd-1 \ # Node name (e.g., etcd-1, etcd-2, or etcd-3) --initial-cluster=etcd-1=https://10.0.0.6:2380,etcd-2=https://10.0.0.7:2380,etcd-3=https://10.0.0.8:2380 \ --initial-cluster-token=etcd-cluster-token \ 
```shell
# Perform the recovery on the first node
# (--name is the node name, e.g., etcd-1, etcd-2, or etcd-3)
etcdctl snapshot restore /tmp/etcd-backup.db \
  --data-dir=/var/lib/etcd \
  --name=etcd-1 \
  --initial-cluster=etcd-1=https://10.0.0.6:2380,etcd-2=https://10.0.0.7:2380,etcd-3=https://10.0.0.8:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-advertise-peer-urls=https://10.0.0.6:2380   # Peer address of the current node
```

```shell
# Repeat on the remaining nodes, changing --name and --initial-advertise-peer-urls
# (example for etcd-2; use etcd-3 / https://10.0.0.8:2380 on the third node)
etcdctl snapshot restore /tmp/etcd-backup.db \
  --data-dir=/var/lib/etcd \
  --name=etcd-2 \
  --initial-cluster=etcd-1=https://10.0.0.6:2380,etcd-2=https://10.0.0.7:2380,etcd-3=https://10.0.0.8:2380 \
  --initial-cluster-token=etcd-cluster-token \
  --initial-advertise-peer-urls=https://10.0.0.7:2380
```

```shell
# Fix ownership of the restored data directory
chown -R 1000:1000 /var/lib/etcd   # etcd defaults to running as user 1000
```
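The per-node restore invocations differ only in `--name` and `--initial-advertise-peer-urls`, so a small loop can print the exact command for each member and avoid typos. This sketch only echoes the commands rather than running them; the member names and IPs are the ones used in this example cluster:

```shell
# Emit the restore command for every member; only --name and the peer
# URL vary. Names/IPs match the example cluster in this article.
CLUSTER="etcd-1=https://10.0.0.6:2380,etcd-2=https://10.0.0.7:2380,etcd-3=https://10.0.0.8:2380"
for member in etcd-1=10.0.0.6 etcd-2=10.0.0.7 etcd-3=10.0.0.8; do
    name=${member%%=*}   # e.g. etcd-1
    ip=${member#*=}      # e.g. 10.0.0.6
    cat <<EOF
# run on $name
etcdctl snapshot restore /tmp/etcd-backup.db \\
  --data-dir=/var/lib/etcd \\
  --name=$name \\
  --initial-cluster=$CLUSTER \\
  --initial-cluster-token=etcd-cluster-token \\
  --initial-advertise-peer-urls=https://$ip:2380
EOF
done
```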
```shell
systemctl start kubelet

# Wait for the containers to restart automatically, or start them manually
docker start $(docker ps -aq --filter "name=k8s_etcd*")
docker start $(docker ps -aq --filter "name=k8s_kube-apiserver*")
docker start $(docker ps -aq --filter "name=k8s_kube-controller-manager*")
docker start $(docker ps -aq --filter "name=k8s_kube-scheduler*")
```
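After kubelet and the containers come back, etcd and the API server need some time before health checks pass, so polling beats a single check. A generic retry helper, demonstrated with a placeholder command (`true`); in a real run, substitute the `etcdctl ... endpoint health` invocation from the verification step:

```shell
# Poll a command instead of checking once: retry <attempts> <delay> <cmd...>
retry() {
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then return 0; fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Placeholder command; substitute the real etcdctl endpoint health check
retry 5 1 true && echo "etcd is healthy"
```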
```shell
# View node status
kubectl get nodes

# View Pod status across all namespaces
kubectl get pods --all-namespaces

# Check etcd cluster health status
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  endpoint health
```

- Backup Tool: Utilizing the official etcdctl CLI to perform snapshot backups.
- Scheduling Control: Defining the backup cycle (e.g., daily at 2:00 AM) via a K8S CronJob to trigger tasks automatically.
- Certificate Management: Since etcd typically enables TLS encryption, the backup process must mount the CA certificate, client certificate, and private key to ensure etcdctl can authenticate with the cluster.
- Storage Solution: Certificates and backup files are mounted via a PVC dynamically provisioned by the csi-nfs-storageclass, utilizing NFS shared storage for persistence.
- Verify the NFS Shared Directory: In this environment, the NFS server's shared directory is /data/nfs-server (verified via the exportfs command), which is accessible by all K8S nodes.
- Create Certificate and Backup Directories: Create a certificate directory (etcd-certs) and a backup directory (etcd-backup) within /data/nfs-server.
Assign read/write permissions to ensure the K8S Pods can access them:

- Certificate directory using ReadOnlyMany (ROX): allows Pods on multiple nodes to mount the volume as read-only, preventing accidental modification of sensitive certificates.
- Backup directory using ReadWriteMany (RWX): allows Pods on any node to perform read/write operations, which is essential since the CronJob Pod may be scheduled on different nodes across the cluster.

The recovery procedure, in summary:

- Back up the original data on all etcd nodes (move /var/lib/etcd aside, as shown above).
- Perform the recovery on the first node.
- Execute the restoration on the other nodes, modifying --name and --initial-advertise-peer-urls to match the information for each specific node.
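One loose end worth closing: each CronJob run adds another ~13 MB snapshot to the NFS directory, so the backup job should be paired with a retention policy. A sketch that deletes snapshots older than seven days, demonstrated on a scratch directory (in production point `BACKUP_DIR` at `/data/nfs-server/etcd-backup`; `touch -d` and `find -delete` assume GNU tools):

```shell
# Retention sketch: delete snapshots older than KEEP_DAYS.
# The scratch directory stands in for /data/nfs-server/etcd-backup.
BACKUP_DIR=$(mktemp -d)
KEEP_DAYS=7

touch -d '10 days ago' "$BACKUP_DIR/etcd-old.backup"
touch "$BACKUP_DIR/etcd-new.backup"

# Remove snapshots whose mtime is past the retention window
find "$BACKUP_DIR" -name '*.backup' -type f -mtime +"$KEEP_DAYS" -delete
```

This can run as a final step inside the backup CronJob itself, or as a separate scheduled job on the NFS server.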