1. Install Scala
Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.2.1 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
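A quick way to double-check the version mix later, once the pyspark shell from section 2.4 is available (the Scala side can be verified with `scala -version` after section 1):

```python
# Run inside the pyspark shell (section 2.4) to confirm the versions in use;
# `sc` is the SparkContext the shell creates automatically.
import sys

print(sys.version)   # Python interpreter, should be 3.6+ (ideally 3.7+)
print(sc.version)    # Spark version, e.g. '3.2.1'
print(sc.pythonVer)  # Python major.minor as Spark sees it, e.g. '3.7'
```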
1.1 Download Scala
wget https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.tgz
tar xvf scala-2.12.8.tgz
sudo mv scala-2.12.8 /usr/local/scala
1.2 Configure environment variables
vim ~/.bashrc
# SCALA
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
source ~/.bashrc
1.3 Test
[root@master ~]# scala
Welcome to Scala 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_312).
Type in expressions for evaluation. Or try :help.
scala> 1+1
res0: Int = 2
scala> :q
2. Install Spark
2.1 Download Spark
Visit the Download Apache Spark page to get the download link:
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar xvf spark-3.2.1-bin-hadoop3.2.tgz
sudo mv spark-3.2.1-bin-hadoop3.2 /usr/local/spark
2.2 Set environment variables
vim ~/.bashrc
# Spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
source ~/.bashrc
2.3 Install Python 3
- The CentOS 8 ISO does not ship with Python 3, so this step builds it from source; if the system already has a suitable Python 3 (see the quick check after this list), skip straight to 2.4.
- Python 3.6 support is deprecated as of Spark 3.2.0, so install a version newer than 3.6.
- If the make command is missing, install it first:
yum install make
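Before building from source, it may be worth checking whether an existing interpreter already meets the requirement; a minimal check, run with whatever python3 the system already has:

```python
# Quick pre-check: Spark 3.2 needs Python 3.6+, and 3.6 support is deprecated,
# so 3.7+ is the safer target.
import sys

if sys.version_info >= (3, 7):
    print("existing interpreter is fine for Spark 3.2:", sys.version)
else:
    print("build a newer Python (3.7+); current is:", sys.version)
```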
yum install -y gcc openssl-devel bzip2-devel libffi-devel
wget https://www.python.org/ftp/python/3.7.12/Python-3.7.12.tgz
tar -zxvf Python-3.7.12.tgz
cd Python-3.7.12
./configure --prefix=/usr/local/python3 --enable-optimizations
make && make install
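The pyspark launcher looks for `python3` on the PATH by default, so with the `--prefix` above either add /usr/local/python3/bin to PATH or export PYSPARK_PYTHON=/usr/local/python3/bin/python3 in ~/.bashrc. A small verification sketch (the script name is illustrative):

```python
# check_python.py -- illustrative name; run with: spark-submit check_python.py
# Verifies which interpreter the driver is using after exporting
# PYSPARK_PYTHON=/usr/local/python3/bin/python3 (path from the --prefix above).
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python3-check").getOrCreate()
print("driver interpreter  :", sys.executable)
print("python seen by Spark:", spark.sparkContext.pythonVer)
spark.stop()
```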
2.4 Test
pyspark
>>> exit()
2.5 Configure pyspark to show only WARN-level output
cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties
vim log4j.properties
# Set everything to be logged to the console
log4j.rootCategory=WARN, console  # changed from INFO to WARN
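Alternatively, the level can be changed for a single session without touching log4j.properties, using the SparkContext that the pyspark shell creates:

```python
# Inside a running pyspark session: lower log verbosity for this app only,
# without editing log4j.properties.
sc.setLogLevel("WARN")
```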
3. Using pyspark with Hadoop
3.1 Start Hadoop
start-all.sh
3.2 Create a test file
mkdir -p ~/wordcount/input
cp /usr/local/hadoop/LICENSE.txt ~/wordcount/input
ll ~/wordcount/input
hadoop fs -mkdir -p /user/wordcount/input
cd ~/wordcount/input
hadoop fs -copyFromLocal LICENSE.txt /user/wordcount/input
3.3 Run pyspark locally
local[N] means run locally, using N threads.
pyspark --master local[4]
>>> sc.master
'local[4]'
Read a local file:
>>> textFile=sc.textFile("file:/usr/local/spark/README.md")
>>> textFile.count()
109
Read a file from HDFS:
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
>>> textFile.count()
270
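Since the input file was staged for a word count in 3.2, here is a minimal word-count sketch in the same pyspark session (paths and host:port as above):

```python
# Word count over the LICENSE.txt staged in 3.2; adjust host/port/path if yours differ.
text = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
counts = (
    text.flatMap(lambda line: line.split())  # split lines into words
        .map(lambda word: (word, 1))         # pair each word with 1
        .reduceByKey(lambda a, b: a + b)     # sum the counts per word
)
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # ten most frequent words
```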
4. Run pyspark on Hadoop YARN
That is, let Hadoop YARN allocate and manage the resources used by Spark.
4.1 Run pyspark under YARN
- Hadoop configuration directory: HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
- Program to run: pyspark
- Run mode (yarn-client): --master yarn --deploy-mode client
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
>>> sc.master
'yarn'
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
>>> textFile.count()
- Note ⚠️: if HDFS access fails in YARN mode, check whether Python is missing on the DataNodes.
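For non-interactive jobs, the same word-count logic can be packaged as a script and handed to YARN with spark-submit; a sketch under the paths used above (the script name and output directory are illustrative):

```python
# wordcount_yarn.py -- illustrative batch version of the word count above.
# Submit with, for example:
#   HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop \
#   spark-submit --master yarn --deploy-mode client wordcount_yarn.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-yarn").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
      .flatMap(lambda line: line.split())
      .map(lambda w: (w, 1))
      .reduceByKey(lambda a, b: a + b)
)
# The output directory must not already exist, or the job fails.
counts.saveAsTextFile("hdfs://master:9000/user/wordcount/output")
spark.stop()
```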
4.2 Troubleshooting
1. `WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable`
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
export LD_LIBRARY_PATH=$JAVA_LIBRARY_PATH
2. `WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.`
hdfs dfs -mkdir /spark_jars
hdfs dfs -put /usr/local/spark/jars/* /spark_jars/
cp /usr/local/spark/conf/spark-defaults.conf.template /usr/local/spark/conf/spark-defaults.conf
vim spark-defaults.conf
# add the following setting
spark.yarn.jars hdfs://master:9000/spark_jars/*
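After restarting pyspark in YARN mode, the setting can be confirmed from the session (`spark` is the SparkSession created by the shell; the second argument is just a fallback):

```python
# Check that spark-defaults.conf was picked up by the new session.
print(spark.conf.get("spark.yarn.jars", "not set"))
```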
5. Run pyspark on a Spark Standalone Cluster
Standalone is Spark's built-in resource management framework, while YARN is Hadoop's; both essentially manage and allocate CPU cores and memory.
5.1 Environment configuration
1. Create spark-env.sh on the master server
# if spark-env.sh does not exist yet, create it from the template
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
# then edit it
vim /usr/local/spark/conf/spark-env.sh
# IP or hostname of the Spark master
export SPARK_MASTER_IP=master
# CPU cores each worker may use
export SPARK_WORKER_CORES=1
# memory each worker may use
export SPARK_WORKER_MEMORY=512m
# number of worker instances per machine
export SPARK_WORKER_INSTANCES=4
2. Configure the workers file on the master
cp /usr/local/spark/conf/workers.template /usr/local/spark/conf/workers
vim /usr/local/spark/conf/workers
# A Spark Worker will be started on each of the machines listed below.
data01
data02
3. Copy Spark to data01 and data02
ssh data01
mkdir /usr/local/spark
exit
scp -r /usr/local/spark root@data01:/usr/local
ssh data02
mkdir /usr/local/spark
exit
scp -r /usr/local/spark root@data02:/usr/local
5.2 Run pyspark on the Spark Standalone Cluster
1. Start the Spark Standalone Cluster
/usr/local/spark/sbin/start-all.sh
# or start the components individually
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-workers.sh
2. Run pyspark against the standalone cluster
pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
3. Test
>>> sc.master
'spark://master:7077'
>>> textFile=sc.textFile("file:/usr/local/spark/README.md")
>>> textFile.count()
109
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
>>> textFile.count()
270
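A few extra sanity checks against the standalone cluster, still in the same session (all standard pyspark SparkContext attributes):

```python
# Sanity checks for the standalone session started above.
print(sc.master)              # 'spark://master:7077'
print(sc.defaultParallelism)  # roughly the total executor cores granted (3 here)
print(sc.uiWebUrl)            # this application's web UI, see 5.3
```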
5.3 Spark Web UI
[root@master ~]# pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
Python 3.7.2 (default, Feb 10 2022, 21:53:48)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Python version 3.7.2 (default, Feb 10 2022 21:53:48)
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://master:7077, app id = app-20220212184932-0000).
SparkSession available as 'spark'.
http://master:4040
This is the Spark Web UI address; open it in a browser to view the running application.