Bob Explores Kubernetes for AI/ML Workloads
Let’s dive into Chapter 31, “Bob Explores Kubernetes for AI/ML Workloads!” In this chapter, Bob will learn how to deploy and manage machine learning workloads on Kubernetes using Kubeflow, Jupyter notebooks, and specialized tools for AI/ML.
1. Introduction: AI/ML Meets Kubernetes
Bob’s company is venturing into AI and machine learning. His team wants to train and deploy ML models on Kubernetes, taking advantage of its scalability. Bob’s mission: understand the tools and workflows needed to integrate AI/ML workloads into his cluster.
“Kubernetes for AI? Sounds challenging, but also exciting—let’s make it happen!” Bob says.
2. Setting Up Kubeflow
Bob starts by installing Kubeflow, a machine learning platform designed for Kubernetes.
Deploying Kubeflow:
Bob uses the official deployment script to set up Kubeflow on his cluster:
curl -O https://raw.githubusercontent.com/kubeflow/manifests/v1.6-branch/kfctl_k8s_istio.yaml
kfctl apply -f kfctl_k8s_istio.yaml
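Once the manifests are applied, Bob checks that the Kubeflow components come up (they are deployed into the kubeflow namespace by default):
kubectl get pods -n kubeflow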
Accessing the Kubeflow Dashboard:
Bob retrieves the external IP of the Kubeflow dashboard:
kubectl get svc -n istio-system
He accesses it in his browser.
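If his cluster has no LoadBalancer to hand out an external IP, a port-forward to the Istio ingress gateway works just as well (assuming the default istio-ingressgateway service name):
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
The dashboard is then reachable at http://localhost:8080.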
“The Kubeflow dashboard is my new AI command center!” Bob says, impressed by the interface.
3. Running Jupyter Notebooks on Kubernetes
Bob sets up Jupyter notebooks for interactive ML development.
Creating a Jupyter Notebook Pod:
Bob writes a YAML file, jupyter-notebook.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: jupyter/minimal-notebook
          ports:
            - containerPort: 8888
Accessing the Notebook:
Bob exposes the notebook with a NodePort service and retrieves the access URL:
kubectl expose deployment jupyter --type=NodePort --name=jupyter-service
kubectl get svc jupyter-service
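The jupyter/minimal-notebook image prints a login URL containing an access token when it starts, so Bob fishes it out of the pod logs (a quick sketch; the exact log format can vary between image versions):
kubectl logs deployment/jupyter | grep token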
“Jupyter on Kubernetes makes ML development scalable!” Bob says.
4. Training a Machine Learning Model
Bob learns to train an ML model using distributed workloads.
Creating a TensorFlow Job:
Bob installs the Kubeflow TFJob Operator to manage TensorFlow training jobs:
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/manifests/v1beta1/tfjob/tfjob-crd.yaml
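He confirms the custom resource is registered before submitting any jobs:
kubectl get crd tfjobs.kubeflow.org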
Submitting a Training Job:
Bob writes tensorflow-job.yaml to train a simple model:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-training
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.4.0
              command: ["python", "/app/mnist.py"]
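He submits the job and lists the worker pods it spawns (the exact pod names are generated by the operator):
kubectl apply -f tensorflow-job.yaml
kubectl get pods | grep mnist-training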
Monitoring Training:
Bob follows the logs of one of the worker pods:
kubectl logs -f <pod-name>
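He can also check the job’s overall status straight from the TFJob resource:
kubectl get tfjob mnist-training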
“Distributed training is a breeze with Kubernetes!” Bob says, proud of the setup.
5. Deploying a Trained Model
Bob deploys a trained ML model as a REST API using KFServing.
Installing KFServing:
kubectl apply -f https://github.com/kubeflow/kfserving/releases/download/v0.7.0/kfserving.yaml
Creating an Inference Service:
Bob writes inference-service.yaml to serve the model:
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: mnist-service
spec:
  predictor:
    tensorflow:
      storageUri: "gs://my-models/mnist/"
Accessing the API:
Bob retrieves the external URL and tests the model with a curl command:
kubectl get inferenceservice mnist-service
curl -d '{"instances": [[0.5, 0.3, 0.1]]}' -H "Content-Type: application/json" -X POST http://<service-url>/v1/models/mnist-service:predict
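To script the test, Bob can also pull just the endpoint out of the resource’s status (assuming the v1beta1 InferenceService API, which reports it under .status.url):
kubectl get inferenceservice mnist-service -o jsonpath='{.status.url}'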
“Serving ML models is now as easy as deploying a Kubernetes service!” Bob says, amazed.
6. Using GPUs for AI Workloads
Bob learns to optimize AI workloads using GPUs.
Enabling GPU Support:
Bob installs NVIDIA’s GPU operator:
kubectl apply -f https://github.com/NVIDIA/gpu-operator/releases/download/v1.9.0/nvidia-gpu-operator.yaml
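Before scheduling GPU pods, Bob verifies that the operator has advertised the nvidia.com/gpu resource on his nodes (the node name is a placeholder):
kubectl describe node <node-name> | grep nvidia.com/gpu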
Deploying a GPU-Accelerated Pod:
He writes a YAML file, gpu-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: tensorflow-gpu
      image: tensorflow/tensorflow:2.4.0-gpu
      resources:
        limits:
          nvidia.com/gpu: 1
Verifying GPU Usage:
kubectl logs gpu-pod
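For a more direct check, he can run nvidia-smi inside the pod to confirm the GPU is visible to the container:
kubectl exec gpu-pod -- nvidia-smi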
“With GPUs, my ML models train faster than ever!” Bob says, thrilled.
7. Managing Data with Persistent Volumes
Bob integrates persistent storage for large datasets.
Creating a Persistent Volume:
Bob writes pv.yaml:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ml-data
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /mnt/data
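Pods don’t mount a PersistentVolume directly; they go through a PersistentVolumeClaim. A minimal claim to bind against this volume might look like the following (the name ml-data-claim is Bob’s choice, not anything mandated):
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-data-claim
spec:
  storageClassName: ""  # bind to the statically provisioned PV rather than a default StorageClass
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
EOF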
Mounting the Volume:
- He updates his TensorFlow job to mount the volume for training data, as sketched below.
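A sketch of the updated job, assuming the ml-data-claim claim from above and that the training script reads its data from /data:
kubectl apply -f - <<EOF
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-training
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.4.0
              command: ["python", "/app/mnist.py"]
              volumeMounts:
                - name: training-data
                  mountPath: /data
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: ml-data-claim
EOF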
“Persistent volumes simplify handling large datasets!” Bob says.
8. Automating AI Pipelines with Kubeflow Pipelines
Bob automates end-to-end ML workflows with Kubeflow Pipelines.
Creating a Pipeline:
Bob writes a Python script to define a pipeline using the Kubeflow Pipelines SDK:
from kfp import dsl

@dsl.pipeline(name="ML Pipeline")
def pipeline():
    preprocess = dsl.ContainerOp(
        name="Preprocess",
        image="my-preprocess-image",
        arguments=["--input", "/data/raw", "--output", "/data/processed"]
    )
    train = dsl.ContainerOp(
        name="Train",
        image="my-train-image",
        arguments=["--data", "/data/processed", "--model", "/data/model"]
    )
    preprocess >> train
Submitting the Pipeline:
kfp run --pipeline ml-pipeline.py
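The kfp CLI syntax has changed between SDK releases, so if the command above doesn’t match his installed version, Bob can compile the pipeline into a package and upload it through the Kubeflow Pipelines dashboard instead (a sketch assuming the v1 SDK’s dsl-compile entry point):
dsl-compile --py ml-pipeline.py --output ml-pipeline.yaml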
“Automating workflows saves so much time!” Bob says, appreciating the efficiency.
9. Monitoring AI Workloads
Bob ensures his AI workloads are running efficiently.
- Using Prometheus and Grafana:
- He adds GPU and memory metrics to his dashboards.
- Integrating MLflow for Experiment Tracking:
- Bob uses MLflow to log model training metrics and compare results.
“Observability is just as important for AI as it is for apps!” Bob notes.
10. Conclusion: Bob’s AI/ML Kubernetes Expertise
With Kubeflow, Jupyter, and GPU optimization, Bob has transformed his Kubernetes cluster into an AI powerhouse. He’s ready to tackle real-world ML workloads, from training to deployment, with ease.
Next, Bob plans to explore Edge Computing with Kubernetes, learning how to deploy workloads to edge devices for low-latency applications.
Stay tuned for the next chapter: “Bob Ventures into Edge Computing with Kubernetes!”