Bob Ensures High Availability and Disaster Recovery in Kubernetes on AlmaLinux

Let’s dive into Chapter 42, “Bob Ensures High Availability and Disaster Recovery in Kubernetes!” In this chapter, Bob focuses on strategies to make his Kubernetes cluster resilient against outages, ensuring minimal downtime and data loss during disasters.


1. Introduction: Why High Availability (HA) and Disaster Recovery (DR) Matter

Bob’s manager tasks him with making the Kubernetes cluster highly available and disaster-resilient. High availability ensures that services remain online during minor failures, while disaster recovery protects data and restores functionality after major incidents.

“A resilient cluster is a reliable cluster—time to prepare for the worst!” Bob says, ready to fortify his infrastructure.


2. Setting Up a Highly Available Kubernetes Control Plane

Bob begins by ensuring that the Kubernetes control plane is highly available.

  • Deploying Multi-Master Nodes:

    • Bob initializes the first node of a multi-master control plane behind a shared endpoint (kubeadm’s stacked etcd topology); the remaining masters join the same endpoint, as sketched after this list:

      kubeadm init --control-plane-endpoint "load-balancer-ip:6443" --upload-certs
      
  • Using a Load Balancer:

    • He configures a load balancer (HAProxy, in this example) to distribute API traffic among the control plane nodes:

      frontend kube-api-frontend
          bind *:6443
          mode tcp
          default_backend kube-api

      backend kube-api
          mode tcp
          balance roundrobin
          server master1 master1:6443 check
          server master2 master2:6443 check
          server master3 master3:6443 check
      
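As referenced in the first bullet, the remaining control-plane nodes join through the same load-balanced endpoint; the token, CA certificate hash, and certificate key below are placeholders taken from the kubeadm init output:

      kubeadm join load-balancer-ip:6443 --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash> \
        --control-plane --certificate-key <certificate-key>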

“With multiple masters and a load balancer, my control plane is ready for anything!” Bob says.


3. Ensuring Node Redundancy

Bob sets up worker nodes to handle application workloads across availability zones.

  • Spreading Nodes Across Zones:

    • Bob labels nodes by availability zone:

      kubectl label node worker1 topology.kubernetes.io/zone=us-east-1a
      kubectl label node worker2 topology.kubernetes.io/zone=us-east-1b
      
  • Using Pod Affinity and Anti-Affinity:

    • Bob ensures pods are spread across zones:

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: topology.kubernetes.io/zone
      
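As an alternative to anti-affinity, topology spread constraints express zone spreading more directly; a minimal sketch for the same pod template:

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app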

“Node redundancy ensures my apps can survive zone failures!” Bob says, reassured.


4. Implementing Persistent Data Replication

Bob ensures that persistent data is replicated across zones.

  • Using Multi-Zone Persistent Volumes:

    • Bob creates a storage class with the EBS CSI driver (the legacy in-tree provisioner is deprecated) that can place volumes in either zone. EBS volumes themselves are zonal, so cross-zone safety comes from running one replica, with its own volume, per zone:

      apiVersion: storage.k8s.io/v1
      kind: StorageClass
      metadata:
        name: multi-zone
      provisioner: ebs.csi.aws.com
      parameters:
        type: gp3
      volumeBindingMode: WaitForFirstConsumer
      allowedTopologies:
      - matchLabelExpressions:
        - key: topology.ebs.csi.aws.com/zone
          values:
          - us-east-1a
          - us-east-1b
      
  • Deploying StatefulSets with Replicated Storage:

    • He updates his StatefulSet’s volume claim template to use the multi-zone class (a fuller StatefulSet sketch follows this list):

      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: multi-zone
          resources:
            requests:
              storage: 10Gi
      
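For context, here is a minimal StatefulSet skeleton around that claim template, reusing the zone anti-affinity from section 3 so each replica, and therefore each volume, lands in a different zone; the image name is a placeholder:

      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        name: my-app
      spec:
        serviceName: my-app
        replicas: 3
        selector:
          matchLabels:
            app: my-app
        template:
          metadata:
            labels:
              app: my-app
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      app: my-app
                  topologyKey: topology.kubernetes.io/zone
            containers:
            - name: my-app
              image: registry.example.com/my-app:1.0  # placeholder image
              volumeMounts:
              - name: data
                mountPath: /var/lib/my-app
        volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: multi-zone
            resources:
              requests:
                storage: 10Gi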

“Replicated storage keeps my data safe, even if a zone goes down!” Bob says.


5. Implementing Automated Backups

Bob sets up backup solutions to protect against data loss.

  • Backing Up etcd:

    • Bob takes etcd snapshots with etcdctl; the endpoint and certificate paths below are kubeadm defaults:

      ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd/snapshot.db --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
      
    • He automates daily snapshots with a root cron job (the % signs are escaped because % is special in crontab):

      crontab -e
      0 0 * * * ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd/snapshot-$(date +\%Y\%m\%d).db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
      
  • Backing Up Persistent Volumes:

    • Bob uses Velero to back up volumes and resources:

      velero install --provider aws --plugins velero/velero-plugin-for-aws \
        --bucket my-backup-bucket --use-restic
      velero backup create cluster-backup --include-namespaces "*"
      
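Velero can also run backups on its own schedule, removing the dependency on an external cron job; for example, a daily backup at 01:00:

      velero schedule create daily-cluster-backup --schedule "0 1 * * *"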

“With regular backups, I’m prepared for worst-case scenarios!” Bob says.


6. Implementing Disaster Recovery

Bob tests recovery processes for various disaster scenarios.

  • Recovering from Control Plane Failures:

    • Bob restores etcd from a snapshot into a fresh data directory, then points the etcd static pod manifest (/etc/kubernetes/manifests/etcd.yaml on kubeadm clusters) at the new path:

      etcdctl snapshot restore /var/lib/etcd/snapshot.db --data-dir /var/lib/etcd-new
      
  • Recovering Applications:

    • Bob uses Velero to restore resources:

      velero restore create --from-backup cluster-backup
      
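Two quick checks make the recovery drill verifiable: inspecting the snapshot before restoring (newer etcd releases move this subcommand to etcdutl) and confirming the Velero restore completed:

      etcdctl snapshot status /var/lib/etcd/snapshot.db --write-out=table
      velero restore get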

“A tested recovery plan is the backbone of disaster resilience!” Bob notes.


7. Using Multi-Cluster Kubernetes for DR

Bob explores multi-cluster setups to improve redundancy.

  • Deploying Clusters in Multiple Regions:

    • Bob sets up clusters in different regions and synchronizes workloads using KubeFed:

      kubefedctl join cluster1 --cluster-context cluster1 --host-cluster-context cluster1
      kubefedctl join cluster2 --cluster-context cluster2 --host-cluster-context cluster1
      
  • Enabling Failover:

    • He configures DNS-based failover with ExternalDNS:

      helm repo add bitnami https://charts.bitnami.com/bitnami
      helm install external-dns bitnami/external-dns
      
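ExternalDNS then publishes DNS records for annotated Services; a minimal sketch, where app.example.com and the ports are placeholders:

      apiVersion: v1
      kind: Service
      metadata:
        name: my-app
        annotations:
          external-dns.alpha.kubernetes.io/hostname: app.example.com
      spec:
        type: LoadBalancer
        selector:
          app: my-app
        ports:
        - port: 80
          targetPort: 8080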

“Multi-cluster setups ensure my apps stay online, even during major outages!” Bob says.


8. Implementing Application-Level HA

Bob uses Kubernetes features to make individual applications highly available.

  • Using Horizontal Pod Autoscaling (HPA):

    • Bob scales pods based on CPU usage:

      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: app-hpa
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: my-app
        minReplicas: 3
        maxReplicas: 10
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 50
      
  • Configuring Pod Disruption Budgets (PDBs):

    • Bob ensures a minimum number of pods remain available during disruptions:

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: app-pdb
      spec:
        minAvailable: 2
        selector:
          matchLabels:
            app: my-app
      
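A quick way to confirm both safeguards are in effect (the names match the manifests above) is to query them directly:

      kubectl get hpa app-hpa
      kubectl get pdb app-pdb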

“Application-level HA ensures seamless user experiences!” Bob says.


9. Monitoring and Alerting for HA/DR

Bob integrates monitoring tools to detect and respond to failures.

  • Using Prometheus and Grafana:

    • Bob sets up alerts for critical metrics, such as node availability and pod health:

      groups:
      - name: ha-alerts
        rules:
        - alert: NodeDown
          expr: up{job="kubernetes-nodes"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node is down!"
      
  • Configuring Incident Response:

    • Bob integrates alerts with PagerDuty for on-call notifications.
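A minimal Alertmanager configuration for that integration might look like the following, where the routing key is a placeholder taken from the PagerDuty service:

      route:
        receiver: pagerduty
      receivers:
      - name: pagerduty
        pagerduty_configs:
        - routing_key: <pagerduty-integration-key>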

“Real-time monitoring helps me stay ahead of failures!” Bob says.


10. Conclusion: Bob’s HA and DR Mastery

With multi-master nodes, replicated storage, regular backups, and a tested recovery plan, Bob has created a Kubernetes cluster that’s both highly available and disaster-resilient. His systems can handle failures and recover quickly, keeping downtime to a minimum.

Next, Bob plans to explore Kubernetes for IoT Workloads, deploying and managing sensor data pipelines at scale.

Stay tuned for the next chapter: “Bob Deploys and Manages IoT Workloads in Kubernetes!”