Bob Explores Kubernetes for Big Data and Analytics on AlmaLinux

Let’s dive into Chapter 37, “Bob Explores Kubernetes for Big Data and Analytics!” In this chapter, Bob will learn how to use Kubernetes to manage and process large-scale data workloads with tools like Apache Spark, Hadoop, and Presto, leveraging the scalability and resilience of Kubernetes for data analytics.


1. Introduction: Big Data Meets Kubernetes

Bob’s company is diving into big data analytics, processing terabytes of data daily. His team wants to use Kubernetes to manage distributed data processing frameworks for tasks like real-time analytics, ETL pipelines, and querying large datasets.

“Big data and Kubernetes? Sounds like a match made for scalability—let’s get started!” Bob says, rolling up his sleeves.


2. Deploying Apache Spark on Kubernetes

Bob begins with Apache Spark, a powerful engine for distributed data processing.

  • Installing Spark:

    • Bob downloads the standard Apache Spark distribution, which ships with built-in Kubernetes support:

      wget https://downloads.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
      tar -xvzf spark-3.4.0-bin-hadoop3.tgz
      cd spark-3.4.0-bin-hadoop3
      
  • Submitting a Spark Job:

    • Bob writes a simple Spark job to count words in a text file:

      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.appName("WordCount").getOrCreate()
      
      # spark.read.text() returns a DataFrame with a single "value" column
      data = spark.read.text("hdfs://data/words.txt")
      counts = data.rdd.flatMap(lambda row: row.value.split()).countByValue()
      for word, count in counts.items():
          print(f"{word}: {count}")
      
      spark.stop()
      
    • He submits the job with Kubernetes acting as Spark’s cluster manager (the service-account setup sketched after this list lets the driver create executor pods):

      ./bin/spark-submit \
        --master k8s://https://<k8s-api-server> \
        --deploy-mode cluster \
        --conf spark.executor.instances=3 \
        --conf spark.kubernetes.container.image=apache/spark:3.4.0 \
        local:///path/to/wordcount.py
      
  • Monitoring the Job:

    • Bob uses the Spark UI to track job progress; in cluster mode it is served by the driver pod on port 4040:

      kubectl port-forward <spark-driver-pod> 4040:4040
      
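One detail worth noting: in cluster mode the Spark driver needs a Kubernetes service account that is allowed to create and delete executor pods. A minimal sketch, assuming the default namespace (the names and role scope here are illustrative, not part of the original setup):

    # Service account for the Spark driver
    kubectl create serviceaccount spark
    
    # Allow it to manage pods; "edit" is broader than strictly required
    kubectl create clusterrolebinding spark-role \
      --clusterrole=edit \
      --serviceaccount=default:spark
    
    # Then pass it to spark-submit:
    #   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark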

“Spark on Kubernetes scales my jobs effortlessly!” Bob says, impressed by the integration.


3. Deploying a Hadoop Cluster

Bob sets up Apache Hadoop for distributed storage and processing.

  • Installing Hadoop on Kubernetes:

    • Bob uses a Helm chart to deploy Hadoop:

      helm repo add bitnami https://charts.bitnami.com/bitnami
      helm install hadoop bitnami/hadoop
      
  • Configuring HDFS:

    • Bob uploads a dataset to HDFS, running the commands from inside an HDFS client pod (for example via kubectl exec):

      hdfs dfs -mkdir /data
      hdfs dfs -put local-dataset.csv /data
      
  • Running a MapReduce Job:

    • Bob submits a MapReduce job to process the data:

      hadoop jar hadoop-mapreduce-examples.jar wordcount /data /output
      
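Once the job finishes, Bob can sanity-check the results directly in HDFS. A quick sketch (the client pod name depends on the chart, and the part-file name assumes the default output format):

    # Open a shell in an HDFS client pod (name varies by chart and release)
    kubectl exec -it <hdfs-client-pod> -- bash
    
    # Inside the pod: list and preview the MapReduce output
    hdfs dfs -ls /output
    hdfs dfs -cat /output/part-r-00000 | head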

“Hadoop’s distributed storage is perfect for managing massive datasets!” Bob says.


4. Using Presto for Interactive Queries

Next, Bob deploys Presto, a distributed SQL query engine for big data.

  • Installing Presto:

    • Bob uses Helm to deploy Presto:

      helm repo add prestosql https://prestosql.github.io/presto-helm
      helm install presto prestosql/presto
      
  • Connecting to Data Sources:

    • Bob connects to the coordinator and selects the hive catalog, which is configured to read from HDFS and an S3 bucket (see the catalog sketch after this list):

      kubectl exec -it presto-coordinator -- presto --catalog hive
      
  • Running Queries:

    • Bob queries the dataset:

      SELECT COUNT(*) FROM hive.default.dataset WHERE column='value';
      
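The hive catalog itself is just a properties file on the coordinator and workers, usually supplied through the chart’s values. A minimal sketch of what it might contain; the metastore address and S3 credentials are placeholders, not values from this setup:

    # etc/catalog/hive.properties (mounted into the Presto pods)
    cat <<'EOF' > hive.properties
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://<hive-metastore-host>:9083
    # Optional S3 settings so the Hive connector can read bucket data
    hive.s3.aws-access-key=<access-key>
    hive.s3.aws-secret-key=<secret-key>
    EOF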

“Presto gives me lightning-fast queries on my big data!” Bob says, enjoying the speed.


5. Orchestrating Workflows with Apache Airflow

Bob learns to manage ETL pipelines using Apache Airflow.

  • Deploying Airflow:

    • Bob uses the official Helm chart:

      helm repo add apache-airflow https://airflow.apache.org
      helm install airflow apache-airflow/airflow
      
  • Creating a DAG (Directed Acyclic Graph):

    • Bob writes a Python DAG to automate data ingestion and processing:

      from airflow import DAG
      from airflow.operators.bash import BashOperator  # Airflow 2.x import path
      from datetime import datetime
      
      with DAG('data_pipeline', start_date=datetime(2023, 1, 1), schedule_interval='@daily') as dag:
          ingest = BashOperator(task_id='ingest', bash_command='python ingest_data.py')
          process = BashOperator(task_id='process', bash_command='python process_data.py')
      
          ingest >> process
      
  • Testing the Pipeline:

    • Bob schedules the DAG and monitors its execution in the Airflow UI.
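
To try the pipeline before relying on the daily schedule, Bob can exercise it from the scheduler pod and port-forward the web UI. A rough sketch (pod and service names follow the chart’s defaults and may differ per release):

    # Run a single task in isolation, then trigger the whole DAG
    kubectl exec -it <airflow-scheduler-pod> -- airflow tasks test data_pipeline ingest 2023-01-01
    kubectl exec -it <airflow-scheduler-pod> -- airflow dags trigger data_pipeline
    
    # Open the Airflow UI at http://localhost:8080
    kubectl port-forward svc/airflow-webserver 8080:8080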

“Airflow automates my pipelines beautifully!” Bob says, pleased with the results.


6. Leveraging Kubernetes-Native Tools for Big Data

Bob explores Kubernetes-native tools like Kubeflow Pipelines for machine learning workflows and data analytics.

  • Deploying Kubeflow Pipelines:

    kubectl apply -f https://github.com/kubeflow/pipelines/releases/download/v1.7.0/kubeflow-pipelines.yaml
    
  • Creating a Data Workflow:

    • Bob uses Kubeflow to preprocess data, train a machine learning model, and store results in a database.
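
To build and monitor these workflows, Bob reaches the Kubeflow Pipelines UI through a port-forward. A small sketch, assuming the standard kubeflow namespace and default service name:

    # The Pipelines UI is exposed by the ml-pipeline-ui service
    kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
    # Then browse to http://localhost:8080 to upload and run pipelines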

“Kubernetes-native solutions fit right into my big data stack!” Bob says.


7. Monitoring Big Data Workloads

Bob integrates monitoring tools to track his big data jobs.

  • Using Prometheus and Grafana:
    • Bob collects metrics from Spark and Hadoop using exporters and visualizes them in Grafana.
  • Tracking Job Logs:
    • Bob centralizes logs using the EFK stack (Elasticsearch, Fluentd, Kibana) for quick debugging.
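
As a starting point for the metrics side, Bob can install the community kube-prometheus-stack chart, which bundles Prometheus and Grafana; the Spark and Hadoop exporters are wired in separately. A minimal sketch:

    # Install Prometheus and Grafana from the community chart
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install monitoring prometheus-community/kube-prometheus-stack
    
    # Reach Grafana locally (the service name follows the release name)
    kubectl port-forward svc/monitoring-grafana 3000:80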

“Monitoring keeps my data processing pipelines running smoothly!” Bob notes.


8. Optimizing Big Data Costs

Bob reviews strategies to manage costs while handling massive datasets.

  • Using Spot Instances:
    • He runs non-critical Spark jobs on spot instances.
  • Autoscaling Data Processing Nodes:
    • Bob configures Kubernetes autoscaling for Hadoop and Spark clusters.
  • Data Tiering:
    • He moves infrequently accessed data to low-cost storage tiers like S3 Glacier.
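
For the autoscaling piece, one simple option is a Horizontal Pod Autoscaler on a stateless worker workload; Spark executors are usually better handled through Spark’s own dynamic allocation. A sketch with a placeholder resource name:

    # Scale a (hypothetical) worker deployment between 2 and 10 replicas based on CPU
    kubectl autoscale deployment <worker-deployment> --cpu-percent=70 --min=2 --max=10
    
    # Check the autoscaler's status
    kubectl get hpa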

“Big data doesn’t have to mean big costs!” Bob says, pleased with the savings.


9. Exploring Real-Time Data Processing

Bob dives into real-time analytics with tools like Apache Kafka and Flink.

  • Deploying Kafka:

    • Bob sets up Kafka for ingesting streaming data:

      helm repo add bitnami https://charts.bitnami.com/bitnami
      helm install kafka bitnami/kafka
      
  • Running a Flink Job:

    • Bob processes Kafka streams with Flink, submitting the job to a Flink session previously started on Kubernetes (for example with the kubernetes-session.sh script shipped with Flink):

      ./bin/flink run --target kubernetes-session -Dkubernetes.cluster-id=<cluster-id> -p 4 flink-job.jar
      
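Before pointing Flink at the stream, Bob can create a topic and push a few test messages using the tooling shipped inside the Kafka pods. A rough sketch (the broker pod name depends on the chart version and release):

    # Create a topic from inside a broker pod
    kubectl exec -it <kafka-broker-pod> -- \
      kafka-topics.sh --create --topic events --partitions 3 --bootstrap-server localhost:9092
    
    # Send a few test messages interactively
    kubectl exec -it <kafka-broker-pod> -- \
      kafka-console-producer.sh --topic events --bootstrap-server localhost:9092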

“Real-time processing brings my analytics to the next level!” Bob says.


10. Conclusion: Bob’s Big Data Breakthrough

With Spark, Hadoop, Presto, Airflow, and Kubernetes-native tools, Bob has mastered big data processing on Kubernetes. He’s ready to handle massive datasets and real-time analytics with confidence.

Next, Bob plans to explore multi-tenancy in Kubernetes, learning how to isolate and manage workloads for different teams or customers.

Stay tuned for the next chapter: “Bob Implements Multi-Tenancy in Kubernetes!”