Bob Tackles Machine Learning at Scale on AlmaLinux

Explore machine learning (ML) at scale using AlmaLinux.

Bob’s next adventure was to explore machine learning (ML) at scale using AlmaLinux. By leveraging distributed computing frameworks and efficient resource management, Bob aimed to train complex models and process massive datasets.

“Scaling machine learning means making smarter decisions, faster—let’s get started!” Bob said with determination.


Chapter Outline: “Bob Tackles Machine Learning at Scale”

  1. Introduction: Why Scale Machine Learning?

    • The challenges of large-scale ML workloads.
    • Benefits of distributed computing and parallel processing.
  2. Preparing AlmaLinux for Distributed ML

    • Installing Python ML libraries and frameworks.
    • Setting up GPUs and multi-node configurations.
  3. Building Distributed ML Pipelines

    • Using TensorFlow’s distributed training.
    • Setting up PyTorch Distributed Data Parallel (DDP).
  4. Managing Data for Scaled ML Workloads

    • Leveraging HDFS and object storage for large datasets.
    • Using Apache Kafka for data streaming.
  5. Scaling ML Workloads with Kubernetes

    • Deploying TensorFlow Serving and PyTorch on Kubernetes.
    • Auto-scaling ML tasks with Kubernetes.
  6. Monitoring and Optimizing ML Performance

    • Using Prometheus and Grafana to monitor GPU and CPU usage.
    • Tuning hyperparameters and resource allocation.
  7. Conclusion: Bob Reflects on Scaled ML Mastery


Part 1: Why Scale Machine Learning?

Bob discovered that traditional ML setups struggle with:

  • Large Datasets: Datasets can be terabytes or more, requiring distributed storage and processing.
  • Complex Models: Deep learning models with millions of parameters need significant compute power.
  • Real-Time Requirements: Applications like recommendation systems demand fast inference.

Benefits of Scaling ML

  • Faster model training.
  • Handling massive datasets efficiently.
  • Real-time inference for high-demand applications.

“Scaling ML lets us solve bigger problems, faster!” Bob said.


Part 2: Preparing AlmaLinux for Distributed ML

Step 1: Installing ML Libraries and Frameworks

  • Install Python and common ML libraries:

    sudo dnf install -y python3 python3-pip
    pip3 install numpy pandas matplotlib tensorflow torch scikit-learn
    
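  • Optionally confirm the libraries import cleanly with a quick sanity check (a minimal sketch; save it to a file and run it with python3):

    import numpy
    import pandas
    import sklearn
    import tensorflow as tf
    import torch

    # Print versions to confirm each library is importable
    print("numpy", numpy.__version__)
    print("pandas", pandas.__version__)
    print("scikit-learn", sklearn.__version__)
    print("tensorflow", tf.__version__)
    print("torch", torch.__version__)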

Step 2: Setting Up GPUs

  • Install NVIDIA drivers and CUDA (these come from NVIDIA's repository for RHEL-compatible distributions rather than the default AlmaLinux repos; the repo URL below targets AlmaLinux 9, and package names can vary by driver generation):

    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
    sudo dnf install -y nvidia-driver cuda
    
    
  • Verify GPU availability:

    nvidia-smi
    
  • Install TensorFlow and PyTorch with GPU support (since TensorFlow 2.x the plain tensorflow package includes GPU support, so the deprecated tensorflow-gpu package is no longer needed):

    pip3 install tensorflow torch torchvision
    
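  • Beyond nvidia-smi, each framework can report whether it sees the GPU (a minimal check using each framework's standard device query; both return empty results on a CPU-only machine):

    import tensorflow as tf
    import torch

    # An empty list / False means the framework cannot see a GPU
    print("TensorFlow GPUs:", tf.config.list_physical_devices('GPU'))
    print("PyTorch CUDA available:", torch.cuda.is_available())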

Step 3: Configuring Multi-Node Clusters

  • Set up passwordless SSH between the nodes so training processes can communicate without prompts:

    ssh-keygen -t rsa
    ssh-copy-id user@node2
    
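  • Before launching a distributed job, a short script can confirm passwordless SSH works to every worker (a sketch; the node names match the two-node cluster used in this chapter):

    import subprocess

    # BatchMode=yes makes ssh fail fast instead of prompting for a password
    for node in ["node1", "node2"]:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, "hostname"],
            capture_output=True, text=True,
        )
        status = "ok" if result.returncode == 0 else "FAILED"
        print(f"{node}: {status} {result.stdout.strip()}")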

“With GPUs and multi-node setups, I’m ready to scale ML tasks!” Bob said.


Part 3: Building Distributed ML Pipelines

Step 1: TensorFlow Distributed Training

  • Write a simple distributed training script and save it as distributed_training.py:

    import tensorflow as tf
    
    # The strategy reads TF_CONFIG to discover the other workers
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    
    # Variables must be created inside the strategy scope so they are
    # mirrored across workers
    with strategy.scope():
        model = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),  # flatten images for the dense layers
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(10, activation='softmax')
        ])
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    
    def dataset_fn():
        (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
        x_train = x_train / 255.0  # scale pixel values to [0, 1]
        return tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)
    
    model.fit(dataset_fn(), epochs=5)
    
  • Run the script on each node, giving every worker its own index in TF_CONFIG (index 0 on node1, index 1 on node2):

    TF_CONFIG='{"cluster": {"worker": ["node1:12345", "node2:12345"]}, "task": {"type": "worker", "index": 0}}' python3 distributed_training.py
    
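  • Typing TF_CONFIG by hand is error-prone; a small launcher can build it programmatically (a sketch, with the worker list from above and the worker index taken from the command line):

    import json
    import os
    import subprocess
    import sys

    workers = ["node1:12345", "node2:12345"]  # cluster layout from above
    index = int(sys.argv[1])                  # pass 0 on node1, 1 on node2

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    })
    subprocess.run(["python3", "distributed_training.py"], check=True)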

Step 2: PyTorch Distributed Data Parallel

  • Modify a PyTorch script for distributed training and save it as ddp_training.py (launch it with torchrun, which sets the rank and world-size environment variables that init_process_group reads):

    import torch
    import torch.nn as nn
    import torch.distributed as dist
    
    def setup():
        # torchrun supplies MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE
        dist.init_process_group("gloo")  # use "nccl" for multi-GPU training
    
    def train():
        setup()
        model = nn.Linear(10, 1)
        # On CPU (gloo) wrap the model directly; with GPUs, move the model
        # to its device first and pass device_ids=[local_rank]
        ddp_model = nn.parallel.DistributedDataParallel(model)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    
        # Simulate training; DDP averages gradients across ranks
        for epoch in range(5):
            optimizer.zero_grad()
            outputs = ddp_model(torch.randn(20, 10))
            loss = outputs.sum()
            loss.backward()
            optimizer.step()
    
        dist.destroy_process_group()
    
    if __name__ == "__main__":
        # Example launch: torchrun --nproc_per_node=2 ddp_training.py
        train()
    
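  • In real training each rank should see a distinct shard of the data, which DistributedSampler handles. A minimal sketch, runnable as a single process for illustration (under torchrun the environment variables below are set for you and each rank gets its own shard):

    import os
    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # Single-process defaults so the example runs standalone
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group("gloo")

    dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
    sampler = DistributedSampler(dataset)  # partitions indices by rank
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffles the shards each epoch
        for features, target in loader:
            pass  # forward/backward would go here, as in train()

    dist.destroy_process_group()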

“Distributed training lets me train models faster than ever!” Bob said.


Part 4: Managing Data for Scaled ML Workloads

Step 1: Leveraging HDFS and Object Storage

  • Install Hadoop for HDFS. Hadoop is not packaged in the default AlmaLinux repositories, so download a release tarball from hadoop.apache.org and unpack it (the version below is illustrative; pick the current stable release):

    curl -LO https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
    tar -xzf hadoop-3.3.6.tar.gz
    sudo mv hadoop-3.3.6 /opt/hadoop
    
  • Configure the core-site.xml file (under etc/hadoop in the Hadoop install) so clients know where the NameNode runs:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://node1:9000</value>
        </property>
    </configuration>
    
  • Test HDFS:

    hdfs dfs -mkdir /ml-data
    hdfs dfs -put local-data.csv /ml-data
    
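  • The CSV staged in HDFS can be read straight into Python for preprocessing. A minimal sketch using pyarrow (not installed above, so add it with pip3 install pyarrow; it also needs the libhdfs native library that ships with Hadoop, with HADOOP_HOME and CLASSPATH set):

    import pyarrow.csv as pacsv
    import pyarrow.fs as pafs

    # Connect to the NameNode configured in core-site.xml
    hdfs = pafs.HadoopFileSystem(host="node1", port=9000)

    # Read the file uploaded with `hdfs dfs -put` into an Arrow table
    with hdfs.open_input_stream("/ml-data/local-data.csv") as stream:
        table = pacsv.read_csv(stream)

    print(table.num_rows, table.column_names)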

Step 2: Streaming Data with Apache Kafka

  • Install Kafka. Like Hadoop, Kafka is not in the default AlmaLinux repositories; download the latest release from kafka.apache.org and unpack it (the version shown is illustrative):

    curl -LO https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
    tar -xzf kafka_2.13-3.7.0.tgz
    
  • Create a Kafka topic:

    kafka-topics.sh --create --topic ml-stream --bootstrap-server localhost:9092
    
  • Stream data to the topic (the console producer turns each line you type into one record):

    kafka-console-producer.sh --topic ml-stream --bootstrap-server localhost:9092
    
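  • On the consuming side, a Python client can feed these records into an ML pipeline. A minimal sketch using the kafka-python package (an extra install: pip3 install kafka-python):

    from kafka import KafkaConsumer

    # Subscribe to the topic created above, reading from the beginning
    consumer = KafkaConsumer(
        "ml-stream",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value.decode("utf-8"))  # hand off to feature processing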

“With HDFS and Kafka, I can manage massive ML datasets seamlessly!” Bob noted.


Part 5: Scaling ML Workloads with Kubernetes

Step 1: Deploying TensorFlow Serving

  • Create a TensorFlow Serving deployment and save it as tf-serving.yaml (apps/v1 Deployments require a selector, and the exported model must be available inside the container at /models/mymodel, e.g. via a volume mount):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tf-serving
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: tf-serving
      template:
        metadata:
          labels:
            app: tf-serving
        spec:
          containers:
          - name: tf-serving
            image: tensorflow/serving
            args:
            - --model_name=mymodel
            - --model_base_path=/models/mymodel
            - --rest_api_port=8501
            ports:
            - containerPort: 8501
    
  • Apply the deployment:

    kubectl apply -f tf-serving.yaml
    
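  • Once the pods are running, the Serving REST API can be queried from Python. A minimal sketch (expose the deployment first, e.g. kubectl port-forward deployment/tf-serving 8501:8501; the model name matches --model_name above, and the 1x10 input vector is purely illustrative, so adjust it to your model's input shape):

    import requests  # pip3 install requests if needed

    # TensorFlow Serving's REST predict endpoint
    payload = {"instances": [[0.1] * 10]}
    resp = requests.post(
        "http://localhost:8501/v1/models/mymodel:predict",
        json=payload,
    )
    print(resp.json())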

Step 2: Auto-Scaling ML Tasks

  • Enable Kubernetes auto-scaling (the Horizontal Pod Autoscaler needs metrics-server running in the cluster to read CPU usage):

    kubectl autoscale deployment tf-serving --cpu-percent=50 --min=2 --max=10
    

“Kubernetes ensures my ML workloads scale effortlessly!” Bob said.


Part 6: Monitoring and Optimizing ML Performance

Step 1: Monitoring GPU and CPU Usage

  • Install Prometheus and Grafana (Grafana is packaged in AlmaLinux's AppStream repository; Prometheus may need to come from EPEL or the upstream release tarball, so adjust the command to your setup):

    sudo dnf install -y prometheus grafana
    
  • Configure Prometheus to scrape GPU metrics, for example from NVIDIA's dcgm-exporter, which exposes GPU utilization and memory metrics (on port 9400 by default); a query sketch follows below.
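  • Once GPU metrics are being scraped, they can be pulled programmatically through Prometheus's HTTP API. A minimal sketch (DCGM_FI_DEV_GPU_UTIL is the utilization metric dcgm-exporter exposes; the Prometheus address is illustrative):

    import requests

    # Instant query against the Prometheus HTTP API
    resp = requests.get(
        "http://localhost:9090/api/v1/query",
        params={"query": "DCGM_FI_DEV_GPU_UTIL"},
    )
    for result in resp.json()["data"]["result"]:
        gpu = result["metric"].get("gpu", "?")
        utilization = result["value"][1]
        print(f"GPU {gpu}: {utilization}% utilized")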

Step 2: Tuning Hyperparameters

  • Use grid search for automated hyperparameter tuning (the snippet generates a synthetic dataset so it runs standalone; substitute your real training data):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split
    
    # Synthetic data stands in for a real training set
    X, y = make_classification(n_samples=500, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
    params = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
    clf = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5, n_jobs=-1)
    clf.fit(X_train, y_train)
    print(clf.best_params_)  # best combination found
    

“Monitoring and tuning ensure I get the best performance from my ML setup!” Bob noted.


Conclusion: Bob Reflects on Scaled ML Mastery

Bob successfully scaled machine learning workloads on AlmaLinux, leveraging distributed training, Kubernetes, and advanced data management tools. With powerful monitoring and optimization strategies, he was ready to handle even the most demanding ML applications.

Next, Bob plans to explore Linux for Big Data Analytics, tackling massive datasets with advanced tools.