High-Performance Computing with Kubernetes on AlmaLinux
Let’s dive into Chapter 48, “Bob Tackles High-Performance Computing with Kubernetes!”. In this chapter, Bob explores how to leverage Kubernetes for High-Performance Computing (HPC) workloads, including scientific simulations, machine learning training, and other compute-intensive tasks.
1. Introduction: Why Use Kubernetes for HPC?
Bob’s company needs a scalable and flexible platform for HPC workloads, including computational simulations, data analysis, and parallel processing. Kubernetes provides the orchestration capabilities to manage these workloads effectively.
“HPC meets Kubernetes—let’s unlock the power of parallel computing!” Bob says, ready to dive in.
2. Preparing a Kubernetes Cluster for HPC
Bob ensures his cluster is optimized for HPC workloads.
Configuring High-Performance Nodes:
Bob uses nodes with GPU or high-performance CPU support:
```bash
kubectl label nodes gpu-node hardware-type=gpu
kubectl label nodes hpc-node hardware-type=cpu
```
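To verify the labels landed, the `-L` flag prints them as a column:

```bash
kubectl get nodes -L hardware-type
```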
Setting Up a GPU Operator:
He installs the NVIDIA GPU Operator:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm install gpu-operator nvidia/gpu-operator
```
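A quick smoke test confirms the operator has rolled out the NVIDIA device plugin; the sketch below is a minimal test pod (the CUDA image tag is just an example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # example tag
      command: ["nvidia-smi"]                      # prints the GPUs the pod can see
      resources:
        limits:
          nvidia.com/gpu: 1
```

If the pod completes and its logs list a GPU, the operator is working.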
“High-performance nodes are the foundation of my HPC setup!” Bob says.
3. Deploying a Parallel Computing Framework
Bob deploys Apache Spark for distributed parallel computing.
Installing Spark on Kubernetes:
Bob uses Helm to deploy Spark:
```bash
helm repo add spark https://charts.bitnami.com/bitnami
helm install spark spark/spark
```
Running a Parallel Job:
Bob writes a Spark job for numerical simulations:
```python
import random

from pyspark import SparkContext

# The master is supplied by spark-submit, so it is not hardcoded here.
sc = SparkContext(appName="Monte Carlo Simulation")
num_samples = 1000000

def inside(p):
    # Sample a random point in the unit square; count it if it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(f"Estimated value of Pi: {pi}")
```
He submits the job to Spark:
```bash
./bin/spark-submit --master k8s://<kubernetes-api-url> --deploy-mode cluster pi.py
```
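In practice, Spark on Kubernetes also needs a container image for the driver and executors; a fuller invocation might look like this (the image name and executor count are placeholders):

```bash
./bin/spark-submit \
  --master k8s://<kubernetes-api-url> \
  --deploy-mode cluster \
  --name pi-estimate \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.container.image=<spark-image> \
  pi.py
```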
“Spark simplifies parallel computing for HPC!” Bob says.
4. Managing MPI Workloads
Bob sets up MPI (Message Passing Interface) for tightly coupled parallel applications.
Installing MPI Operator:
Bob deploys the MPI Operator for Kubernetes:
```bash
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1/mpi-operator.yaml
```
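Before submitting jobs, it is worth confirming the CRD and the operator pod exist (the `mpi-operator` namespace is what the upstream manifest typically creates; it may differ by version):

```bash
kubectl get crd mpijobs.kubeflow.org
kubectl get pods -n mpi-operator
```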
Submitting an MPI Job:
He writes an MPI job to run on multiple pods:
```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: mpi-job
spec:
  slotsPerWorker: 2
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: mpi-example   # image with mpirun and the application binary
              name: mpi
    Worker:
      replicas: 2                  # worker count is an example
      template:
        spec:
          containers:
            - image: mpi-example
              name: mpi
```
Bob applies the job:
```bash
kubectl apply -f mpi-job.yaml
```
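To check progress, he inspects the MPIJob's status (a minimal sketch; launcher pod label names vary across operator versions, so `describe` is the safer starting point):

```bash
kubectl get mpijobs
kubectl describe mpijob mpi-job
```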
“MPI is perfect for scientific simulations on Kubernetes!” Bob says.
5. Leveraging GPUs for Deep Learning
Bob sets up a deep learning workload using TensorFlow.
Deploying TensorFlow:
Bob uses Helm to deploy TensorFlow Serving:
```bash
helm repo add tensorflow https://charts.tensorflow.org
helm install tf-serving tensorflow/tensorflow-serving
```
Training a Model:
Bob writes a script to train a model on GPU nodes:
```python
import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs visible to the pod.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([...])  # model layers elided
    model.compile(optimizer='adam', loss='mse')

model.fit(dataset, epochs=10)  # `dataset` is prepared elsewhere
```
He deploys the training job:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  template:
    spec:
      restartPolicy: OnFailure   # required for Job pods
      containers:
        - name: train
          image: tensorflow/tensorflow:latest-gpu
          resources:
            limits:
              nvidia.com/gpu: 2
```
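He submits the Job and follows the training output (assuming the manifest is saved as train-model.yaml):

```bash
kubectl apply -f train-model.yaml
kubectl logs -f job/train-model
```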
“With TensorFlow and GPUs, deep learning on Kubernetes is seamless!” Bob says.
6. Optimizing Resource Utilization
Bob ensures efficient resource allocation for HPC workloads.
Using Node Affinity:
Bob assigns workloads to appropriate nodes:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: hardware-type
              operator: In
              values:
                - gpu
```
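GPU nodes are often tainted so that only GPU workloads schedule onto them; in that case the affinity rule is paired with a matching toleration. A common convention in NVIDIA setups uses the `nvidia.com/gpu` taint key (an assumption about how the cluster is configured):

```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```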
Tuning Pod Resource Limits:
He sets specific resource requests and limits:
```yaml
resources:
  requests:
    cpu: "4"
    memory: "8Gi"
  limits:
    cpu: "8"
    memory: "16Gi"
```
“Optimized resources ensure HPC workloads run efficiently!” Bob says.
7. Monitoring and Profiling HPC Workloads
Bob integrates monitoring tools to track HPC performance.
Using Prometheus and Grafana:
- Bob collects metrics from GPU nodes and Spark jobs (see the example query below).
- He creates dashboards to monitor job progress and node utilization.
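For instance, with the DCGM exporter that ships with the GPU Operator, a PromQL query like this charts average GPU utilization per node (metric and label names taken from dcgm-exporter's defaults; they may vary with exporter configuration):

```promql
avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```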
Profiling with NVIDIA Tools:
Bob uses NVIDIA DCGM to profile GPU performance:
```bash
dcgmi group -c my-group
dcgmi diag -g my-group
```
“Monitoring helps me fine-tune HPC workloads for maximum performance!” Bob says.
8. Ensuring Fault Tolerance
Bob sets up mechanisms to recover from HPC job failures.
Using Checkpointing in Spark:
Bob enables checkpointing to resume interrupted jobs:
```python
sc.setCheckpointDir("/checkpoints")
```
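Setting the directory alone persists nothing; each RDD to be protected must opt in explicitly. A minimal sketch, reusing `sc` and `num_samples` from the Spark job above (the checkpoint directory should live on shared storage such as HDFS or a persistent volume):

```python
samples = sc.parallelize(range(0, num_samples))
samples.checkpoint()   # lineage is truncated once the RDD is materialized
samples.count()        # an action triggers the actual checkpoint write
```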
Configuring Job Restart Policies:
He ensures failed jobs are retried:
```yaml
restartPolicy: OnFailure
```
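For batch Jobs, the retry budget can also be capped so a persistently crashing job doesn't loop forever; `backoffLimit` is part of the standard Job API (the value 4 is an arbitrary example):

```yaml
spec:
  backoffLimit: 4            # stop retrying after four failed attempts
  template:
    spec:
      restartPolicy: OnFailure
```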
“Fault tolerance is key for long-running HPC jobs!” Bob notes.
9. Securing HPC Workloads
Bob ensures security for sensitive HPC data.
Using RBAC for HPC Users:
Bob creates roles for HPC users:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hpc-user-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "list", "delete"]
```
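A Role grants nothing until it is bound to a user or group; a matching RoleBinding might look like this (the user name `alice` is hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hpc-user-binding
subjects:
  - kind: User
    name: alice              # hypothetical HPC user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: hpc-user-role
  apiGroup: rbac.authorization.k8s.io
```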
Encrypting Data at Rest:
He uses encrypted persistent volumes for sensitive data:
```yaml
parameters:
  encrypted: "true"
```
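That parameter lives on a StorageClass; a fuller sketch, assuming the AWS EBS CSI driver (other CSI drivers use different provisioner names and parameters):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-hpc
provisioner: ebs.csi.aws.com   # assumption: EBS CSI driver is installed
parameters:
  encrypted: "true"
```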
“Security is critical for sensitive HPC workloads!” Bob says.
10. Conclusion: Bob’s HPC Breakthrough
With GPU acceleration, parallel frameworks, and robust monitoring, Bob has built a Kubernetes-powered HPC environment capable of handling the most demanding computational workloads.
Next, Bob plans to explore Kubernetes for AR/VR Workloads, diving into the world of real-time rendering and immersive experiences.
Stay tuned for the next chapter: “Bob Explores AR/VR Workloads with Kubernetes!”