
How to Build a Spark Cluster with Docker-Compose

2022-05-29 22:02:37

1. Introduction

In the previous article we used Docker-Compose to build an HDFS cluster. This article continues with Docker-Compose and builds a Spark cluster.

2. docker-compose.yml

The Spark cluster is built from one master node and two worker nodes, and each worker node is allocated 1 core and 1 GB of memory.

The Docker image is the open-source bitnami/spark image, with Spark version 2.4.3. The docker-compose configuration is as follows:

services:
  master:
    image: bitnami/spark:2.4.3
    container_name: master
    user: root
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./python:/python

  worker1:
    image: bitnami/spark:2.4.3
    container_name: worker1
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  worker2:
    image: bitnami/spark:2.4.3
    container_name: worker2
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

The master node also maps a /python directory, which holds the pyspark code so that it can be run conveniently.
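
As an illustration, a minimal pyspark script of the kind that could be dropped into ./python might look like the sketch below; the file name and the sample data are hypothetical, and the only thing taken from the compose file is the master service on port 7077.

from pyspark.sql import SparkSession

# connect to the master service defined in docker-compose.yml (port 7077)
spark = (SparkSession.builder
         .master("spark://master:7077")
         .appName("wordcount")
         .getOrCreate())

sc = spark.sparkContext
# a tiny in-memory word count, just to confirm that jobs reach the workers
counts = (sc.parallelize(["hello spark", "hello docker", "hello compose"])
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
            .collect())
print(counts)

spark.stop()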

The master node exposes port 7077 for connecting to Spark and port 8080 for viewing the Spark UI in a browser, where the cluster status can be checked once the cluster is up.

If needed, you can add more worker nodes yourself, and the resources allocated to a node can be changed by adjusting SPARK_WORKER_MEMORY and SPARK_WORKER_CORES.

By default, exec-ing into this image gives an unprivileged (non-root) user, so some install commands fail for lack of permissions. For example, running pyspark may require libraries such as numpy and pandas, which then cannot be installed with pip. Setting user: root makes root the default user and avoids this problem.

3. Starting the cluster

As in the previous article, run docker-compose up -d in the directory containing docker-compose.yml to bring the whole cluster up in one step (although if you need libraries such as numpy, you still have to install them inside each node yourself, as the sketch below illustrates).
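
The reason the libraries must be present on every node and not just the master is that the functions shipped with a job run inside the executor processes on worker1 and worker2. A hedged sketch of such a job (the numbers are only illustrative):

import numpy as np
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master:7077")
         .appName("numpy-on-workers")
         .getOrCreate())

# np.sqrt is evaluated inside the executors on the worker containers,
# so this job raises an ImportError if numpy is missing there
rdd = spark.sparkContext.parallelize(range(100), 4)
total = rdd.map(lambda x: float(np.sqrt(x))).sum()
print(total)

spark.stop()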

Enter the master node and run spark-shell; it starts successfully.

4. Using the cluster with HDFS

Combining the Hadoop docker-compose.yml from the previous article with the one above gives a new docker-compose.yml:

version: "3"
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    container_name: namenode
    ports:
      - 9870:9870
      - 9000:9000
    volumes:
      - ./hadoop/dfs/name:/hadoop/dfs/name
      - ./input:/input
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env

  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
    container_name: datanode
    depends_on:
      - namenode
    volumes:
      - ./hadoop/dfs/data:/hadoop/dfs/data
    environment:
      SERVICE_PRECONDITION: "namenode:9870"
    env_file:
      - ./hadoop.env
  
  resourcemanager:
    image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8
    container_name: resourcemanager
    environment:
      SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864"
    env_file:
      - ./hadoop.env

  nodemanager1:
    image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8
    container_name: nodemanager
    environment:
      SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088"
    env_file:
      - ./hadoop.env
  
  historyserver:
    image: bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8
    container_name: historyserver
    environment:
      SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088"
    volumes:
      - ./hadoop/yarn/timeline:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
    
  master:
    image: bitnami/spark:2.4.3-debian-9-r81
    container_name: master
    user: root
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./python:/python

  worker1:
    image: bitnami/spark:2.4.3-debian-9-r81
    container_name: worker1
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  worker2:
    image: bitnami/spark:2.4.3-debian-9-r81
    container_name: worker2
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

Start the combined cluster the same way (a hadoop.env file, described in the previous article, is also required).

Through Docker's volume mapping, the local directory is mapped to /python on the Spark master node, so pyspark code written locally stays in sync with the container and can be executed with docker exec.
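
For example, a script placed in ./python can read a file that has already been uploaded to HDFS. This is a minimal sketch, assuming a file was previously put at /input/data.txt in HDFS (the path is hypothetical); it addresses the namenode service from the compose file:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master:7077")
         .appName("hdfs-read")
         .getOrCreate())

# namenode:9000 is the HDFS RPC port exposed by the namenode service above;
# /input/data.txt is assumed to have been uploaded beforehand (e.g. with hdfs dfs -put)
lines = spark.sparkContext.textFile("hdfs://namenode:9000/input/data.txt")
print(lines.count())

spark.stop()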

A regression program was run, and the cluster functions normally.
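
The regression program itself is not reproduced here; a minimal sketch of a job in the same spirit, a linear regression with Spark MLlib on synthetic data generated on the driver, could look like this:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = (SparkSession.builder
         .master("spark://master:7077")
         .appName("regression-demo")
         .getOrCreate())

# synthetic data following y = 2x + 1, just to have something to fit
df = spark.createDataFrame([(float(x), 2.0 * x + 1.0) for x in range(100)], ["x", "y"])
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients, model.intercept)

spark.stop()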

This concludes the walkthrough of building a Spark cluster with Docker-Compose.

