Bob Explores Kubernetes for AI/ML Workloads

Let’s dive into Chapter 31, “Bob Explores Kubernetes for AI/ML Workloads!” In this chapter, Bob will learn how to deploy and manage machine learning workloads on Kubernetes using Kubeflow, Jupyter notebooks, and specialized tools for AI/ML.

1. Introduction: AI/ML Meets Kubernetes

Bob’s company is venturing into AI and machine learning. His team wants to train and deploy ML models on Kubernetes, taking advantage of its scalability. Bob’s mission: understand the tools and workflows needed to integrate AI/ML workloads into his cluster.

“Kubernetes for AI? Sounds challenging, but also exciting—let’s make it happen!” Bob says.


2. Setting Up Kubeflow

Bob starts by installing Kubeflow, a machine learning platform designed for Kubernetes.

  • Deploying Kubeflow:

    • Bob follows the official deployment guide for his Kubeflow release. Older releases (up to about v1.2) used the kfctl CLI with a KfDef config:

      curl -O https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.2.yaml
      kfctl apply -V -f kfctl_k8s_istio.v1.0.2.yaml

    • Newer releases (v1.3+) dropped kfctl in favor of kustomize-based installs from the kubeflow/manifests repository, so Bob checks its README for the current steps.
      
  • Accessing the Kubeflow Dashboard:

    • Bob retrieves the external IP of the Kubeflow dashboard:

      kubectl get svc istio-ingressgateway -n istio-system
      
    • He opens the EXTERNAL-IP in his browser (falling back to kubectl port-forward if no LoadBalancer address is assigned).

“The Kubeflow dashboard is my new AI command center!” Bob says, impressed by the interface.


3. Running Jupyter Notebooks on Kubernetes

Bob sets up Jupyter notebooks for interactive ML development.

  • Creating a Jupyter Notebook Pod:

    • Bob writes a YAML file, jupyter-notebook.yaml:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: jupyter
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: jupyter
        template:
          metadata:
            labels:
              app: jupyter
          spec:
            containers:
            - name: jupyter
              image: jupyter/minimal-notebook
              ports:
              - containerPort: 8888
      
  • Accessing the Notebook:

    • Bob exposes the notebook with a NodePort service and retrieves the access URL:

      kubectl expose deployment jupyter --type=NodePort --name=jupyter-service
      kubectl get svc jupyter-service

    • He logs in with the token that jupyter/minimal-notebook prints at startup (kubectl logs deployment/jupyter).
      
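Since a NodePort service is reachable at any node’s IP on the allocated port, the access URL can be assembled by hand. A small illustrative helper (the IP, port, and token below are placeholders, not values from a real cluster):

```python
def notebook_url(node_ip: str, node_port: int, token: str = "") -> str:
    """Assemble the URL for a Jupyter service exposed via NodePort.

    Jupyter prints a login token at startup (visible via kubectl logs);
    passing it here yields a link that logs straight in.
    """
    url = f"http://{node_ip}:{node_port}/"
    return f"{url}?token={token}" if token else url

# Placeholder values for illustration:
print(notebook_url("192.0.2.10", 30888, token="abc123"))
# → http://192.0.2.10:30888/?token=abc123
```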

“Jupyter on Kubernetes makes ML development scalable!” Bob says.


4. Training a Machine Learning Model

Bob learns to train an ML model using distributed workloads.

  • Creating a TensorFlow Job:

    • Bob installs the Kubeflow Training Operator (the successor to the standalone TFJob operator) to manage TensorFlow training jobs; the ref should match a released version:

      kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
      
  • Submitting a Training Job:

    • Bob writes tensorflow-job.yaml to train a simple model:

      apiVersion: kubeflow.org/v1
      kind: TFJob
      metadata:
        name: mnist-training
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: tensorflow
                  image: tensorflow/tensorflow:2.4.0
                  command: ["python", "/app/mnist.py"]
      
  • Monitoring Training:

    • Bob lists the worker pods the operator created, then follows their logs:

      kubectl get pods -l training.kubeflow.org/job-name=mnist-training
      kubectl logs -f <pod-name>
    

“Distributed training is a breeze with Kubernetes!” Bob says, proud of the setup.


5. Deploying a Trained Model

Bob deploys a trained ML model as a REST API using KFServing (the project has since been renamed KServe).

  • Installing KFServing (the release manifest assumes Istio, Knative Serving, and cert-manager are already installed):

    kubectl apply -f https://github.com/kubeflow/kfserving/releases/download/v0.6.1/kfserving.yaml
    
  • Creating an Inference Service:

    • Bob writes inference-service.yaml to serve the model:

      apiVersion: serving.kubeflow.org/v1beta1
      kind: InferenceService
      metadata:
        name: mnist-service
      spec:
        predictor:
          tensorflow:
            storageUri: "gs://my-models/mnist/"
      
  • Accessing the API:

    • Bob retrieves the external URL and tests the model with a curl command:

      kubectl get inferenceservice mnist-service
      curl -d '{"instances": [[0.5, 0.3, 0.1]]}' -H "Content-Type: application/json" -X POST http://<service-url>/v1/models/mnist-service:predict
      
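The same request can be issued from Python. A minimal sketch, where the service URL is the placeholder from above and actually sending the request would need the third-party requests package:

```python
import json

# Placeholder -- substitute the URL from `kubectl get inferenceservice mnist-service`.
SERVICE_URL = "http://<service-url>/v1/models/mnist-service:predict"

def build_predict_request(instances):
    """Serialize inputs into the JSON body the TF Serving v1 predict API expects."""
    return json.dumps({"instances": instances})

payload = build_predict_request([[0.5, 0.3, 0.1]])
print(payload)  # {"instances": [[0.5, 0.3, 0.1]]}

# Sending it (uncomment with a reachable service):
#   import requests
#   resp = requests.post(SERVICE_URL, data=payload,
#                        headers={"Content-Type": "application/json"})
#   print(resp.json()["predictions"])
```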

“Serving ML models is now as easy as deploying a Kubernetes service!” Bob says, amazed.


6. Using GPUs for AI Workloads

Bob learns to optimize AI workloads using GPUs.

  • Enabling GPU Support:

    • Bob installs NVIDIA’s GPU Operator via its Helm chart (the documented install method), which deploys the drivers and the device plugin:

      helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
      helm repo update
      helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
      
  • Deploying a GPU-Accelerated Pod:

    • He writes a YAML file, gpu-pod.yaml:

      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-pod
      spec:
        containers:
        - name: tensorflow-gpu
          image: tensorflow/tensorflow:2.4.0-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
      
  • Verifying GPU Usage:

    • Bob confirms the container can see the GPU, then checks the logs to verify TensorFlow picked it up:

      kubectl exec gpu-pod -- nvidia-smi
      kubectl logs gpu-pod
    

“With GPUs, my ML models train faster than ever!” Bob says, thrilled.


7. Managing Data with Persistent Volumes

Bob integrates persistent storage for large datasets.

  • Creating a Persistent Volume:

    • Bob writes pv.yaml (hostPath suits a single-node demo; a production cluster would use a network-backed StorageClass):

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: ml-data
      spec:
        capacity:
          storage: 50Gi
        accessModes:
          - ReadWriteMany
        hostPath:
          path: /mnt/data
      
  • Mounting the Volume:

    • He updates his TensorFlow job to use the volume for training data.
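
    • Concretely, that update could look like the following sketch (the names are illustrative). First, a PersistentVolumeClaim bound to the volume:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: ml-data-claim
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 50Gi
      
    • Then, in the TFJob’s pod template, a volume referencing the claim:

      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: ml-data-claim
      
      and a mount in the container spec:

      volumeMounts:
      - name: training-data
        mountPath: /data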

“Persistent volumes simplify handling large datasets!” Bob says.


8. Automating AI Pipelines with Kubeflow Pipelines

Bob automates end-to-end ML workflows with Kubeflow Pipelines.

  • Creating a Pipeline:

    • Bob writes a Python script, ml-pipeline.py, defining a pipeline with the Kubeflow Pipelines SDK (v1-style dsl.ContainerOp):

      from kfp import dsl
      
      @dsl.pipeline(name="ML Pipeline")
      def pipeline():
          preprocess = dsl.ContainerOp(
              name="Preprocess",
              image="my-preprocess-image",
              arguments=["--input", "/data/raw", "--output", "/data/processed"]
          )
          train = dsl.ContainerOp(
              name="Train",
              image="my-train-image",
              arguments=["--data", "/data/processed", "--model", "/data/model"]
          )
          preprocess >> train
      
  • Compiling and Submitting the Pipeline:

    • Bob compiles the script into a workflow spec with the SDK’s dsl-compile tool, then uploads it through the Kubeflow Pipelines UI:

      dsl-compile --py ml-pipeline.py --output ml-pipeline.yaml
    

“Automating workflows saves so much time!” Bob says, appreciating the efficiency.


9. Monitoring AI Workloads

Bob ensures his AI workloads are running efficiently.

  • Using Prometheus and Grafana:
    • He adds GPU and memory metrics to his dashboards.
  • Integrating MLflow for Experiment Tracking:
    • Bob uses MLflow to log model training metrics and compare results across runs.

“Observability is just as important for AI as it is for apps!” Bob notes.


10. Conclusion: Bob’s AI/ML Kubernetes Expertise

With Kubeflow, Jupyter, and GPU optimization, Bob has transformed his Kubernetes cluster into an AI powerhouse. He’s ready to tackle real-world ML workloads, from training to deployment, with ease.

Next, Bob plans to explore Edge Computing with Kubernetes, learning how to deploy workloads to edge devices for low-latency applications.

Stay tuned for the next chapter: “Bob Ventures into Edge Computing with Kubernetes!”