1. Install Scala

Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.2.1 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).

from https://spark.apache.org/docs/3.2.1/

1.1 Download Scala

```bash
wget https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.tgz
tar xvf scala-2.12.8.tgz
sudo mv scala-2.12.8 /usr/local/scala
```

1.2 Configure environment variables

```bash
vim ~/.bashrc

# SCALA
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin

source ~/.bashrc
```

1.3 Test

```bash
[root@master ~]# scala
Welcome to Scala 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_312).
Type in expressions for evaluation. Or try :help.

scala> 1+1
res0: Int = 2

scala> :q
```

2. Install Spark

2.1 Download Spark

Visit the Download Apache Spark page to get the download link.

```bash
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar xvf spark-3.2.1-bin-hadoop3.2.tgz
sudo mv spark-3.2.1-bin-hadoop3.2 /usr/local/spark
```

2.2 Set environment variables

```bash
vim ~/.bashrc

# Spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

source ~/.bashrc
```

2.3 Install Python 3

  • The CentOS 8 ISO does not include Python 3, so it has to be installed here; if your system already has Python 3, skip straight to 2.4.
  • Python 3.6 support is deprecated as of Spark 3.2.0, so installing a version newer than Python 3.6 is recommended.
  • If the make command is missing, install it with yum install make.
```bash
yum install -y gcc openssl-devel bzip2-devel libffi-devel
wget https://www.python.org/ftp/python/3.7.12/Python-3.7.12.tgz
tar -zxvf Python-3.7.12.tgz
cd Python-3.7.12
./configure --prefix=/usr/local/python3 --enable-optimizations
make && make install
```

2.4 Test

```bash
pyspark
>>> exit()
```
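If the shell starts cleanly, a quick sanity check is to run a tiny job on a locally created RDD before calling exit() (a minimal sketch; sc is the SparkContext that pyspark creates automatically):

```python
>>> rdd = sc.parallelize(range(100))  # distribute the numbers 0..99 across local partitions
>>> rdd.count()                       # trigger a job and count the elements
100
```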

2.5 Configure pyspark to log only WARN messages

```bash
cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties
vim log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=INFO, console   # change INFO to WARN
```
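The log level can also be changed for a single session from inside pyspark, without editing the configuration file (a small sketch using SparkContext.setLogLevel):

```python
>>> sc.setLogLevel("WARN")  # affects only the current SparkContext
```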

3. Using pyspark with Hadoop

3.1 Start Hadoop

```bash
start-all.sh
```

3.2 Create a test file

```bash
mkdir -p ~/wordcount/input
cp /usr/local/hadoop/LICENSE.txt ~/wordcount/input
ll ~/wordcount/input
hadoop fs -mkdir -p /user/wordcount/input
cd ~/wordcount/input
hadoop fs -copyFromLocal LICENSE.txt /user/wordcount/input
```

3.3 Run a pyspark program locally

local[N] means run locally using N threads.

```bash
pyspark --master local[4]
>>> sc.master
'local[4]'
```
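You can also confirm that the four local threads show up as the default parallelism (a small sketch; the value follows the N in local[N]):

```python
>>> sc.defaultParallelism
4
```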

Read a local file

```python
>>> textFile=sc.textFile("file:/usr/local/spark/README.md")
>>> textFile.count()
109
```

Read an HDFS file

```python
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
>>> textFile.count()
270
```
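Since the input directory was set up for a word count, the classic word count can be expressed directly in the shell over the same RDD (a sketch; the output is omitted because the top words depend on the file contents):

```python
>>> counts = (textFile.flatMap(lambda line: line.split())  # split each line into words
...                   .map(lambda w: (w, 1))               # pair each word with a count of 1
...                   .reduceByKey(lambda a, b: a + b))    # sum the counts per word
>>> counts.takeOrdered(5, key=lambda kv: -kv[1])           # five most frequent words
```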

4. Run pyspark on Hadoop YARN

That is, Hadoop YARN allocates and manages the resources used by Spark.

4.1 Run pyspark under YARN

  • Hadoop configuration directory: HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
  • Program to run: pyspark
  • Run mode yarn-client: --master yarn --deploy-mode client
```bash
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
>>> sc.master
'yarn'
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
>>> textFile.count()
```
  • Note ⚠️: if HDFS errors occur in yarn mode, check whether Python is missing on the datanodes.
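The same count can also be packaged as a standalone script and submitted to YARN instead of being typed into the shell. A minimal sketch, assuming the script is saved as wordcount_yarn.py (a hypothetical filename) and launched with HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client wordcount_yarn.py:

```python
# wordcount_yarn.py -- hypothetical example script; the HDFS path matches the file uploaded in 3.2
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("license-line-count")
sc = SparkContext(conf=conf)

text_file = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
print("line count:", text_file.count())

sc.stop()
```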

4.2 Troubleshooting

1. `WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable`

```bash
cp /usr/local/spark/conf/spark-env.sh.template spark-env.sh
vim spark-env.sh

export LD_LIBRARY_PATH=$JAVA_LIBRARY_PATH
```

2. `WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.`

```bash
hdfs dfs -mkdir /spark_jars
hdfs dfs -put /usr/local/spark/jars/* /spark_jars/
cp /usr/local/spark/conf/spark-defaults.conf.template /usr/local/spark/conf/spark-defaults.conf
vim spark-defaults.conf

# add the following setting
spark.yarn.jars hdfs://master:9000/spark_jars/*
```

5. Run pyspark on a Spark Standalone Cluster

Standalone is Spark's own built-in resource management framework, while YARN is the resource management framework in Hadoop; in essence, both manage and allocate CPU cores and memory.
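As an illustration, the executor resources passed on the pyspark command line in 5.2 could equivalently be set in code through SparkConf (a sketch; spark.cores.max and spark.executor.memory are standard Spark configuration keys, and the app name is arbitrary):

```python
from pyspark import SparkConf, SparkContext

# hypothetical programmatic equivalent of:
#   --total-executor-cores 3 --executor-memory 512m
conf = (SparkConf()
        .setMaster("spark://master:7077")
        .setAppName("standalone-demo")
        .set("spark.cores.max", "3")
        .set("spark.executor.memory", "512m"))
sc = SparkContext(conf=conf)
```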

5.1 Environment configuration

1. Create spark-env.sh on the master server

```bash
# if conf/ does not yet contain spark-env.sh, create it from the template
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh

# then edit spark-env.sh
vim /usr/local/spark/conf/spark-env.sh

# IP address or hostname of the Spark master
export SPARK_MASTER_IP=master
# number of CPU cores each worker may use
export SPARK_WORKER_CORES=1
# amount of memory each worker may use
export SPARK_WORKER_MEMORY=512m
# number of worker instances
export SPARK_WORKER_INSTANCES=4
```

2. Configure workers on the master

```bash
cp /usr/local/spark/conf/workers.template /usr/local/spark/conf/workers
vim /usr/local/spark/conf/workers

# A Spark Worker will be started on each of the machines listed below.
data01
data02
```

3. Copy spark to data01 and data02

```bash
ssh data01
mkdir /usr/local/spark
exit
scp -r /usr/local/spark root@data01:/usr/local
ssh data02
mkdir /usr/local/spark
exit
scp -r /usr/local/spark root@data02:/usr/local
```

5.2 Run pyspark on the Spark Standalone Cluster

1. Start the Spark Standalone Cluster

```bash
/usr/local/spark/sbin/start-all.sh
# or start the components individually
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-workers.sh
```

2. Run pyspark against the Standalone cluster

```bash
pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
```

3. Test

```python
>>> sc.master
'spark://master:7077'
>>> textFile=sc.textFile("file:/usr/local/spark/README.md")
>>> textFile.count()
109
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
>>> textFile.count()
270
```

5.3 Spark Web UI

```bash
[root@master ~]# pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
Python 3.7.2 (default, Feb 10 2022, 21:53:48)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Python version 3.7.2 (default, Feb 10 2022 21:53:48)
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://master:7077, app id = app-20220212184932-0000).
SparkSession available as 'spark'.
```
http://master:4040 is the Spark Web UI address; open it in a browser to view it.