1. Install Scala
Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.2.1 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
1.1 Download Scala
```bash
wget https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.tgz
tar xvf scala-2.12.8.tgz
sudo mv scala-2.12.8 /usr/local/scala
```
1.2 Configure environment variables
```bash
vim ~/.bashrc

# SCALA
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin

source ~/.bashrc
```
1.3 Test
```bash
[root@master ~]# scala
Welcome to Scala 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_312).
Type in expressions for evaluation. Or try :help.

scala> 1+1
res0: Int = 2

scala> :q
```
2. Install Spark
2.1 Download Spark
Visit the Download Apache Spark page to get the download link.
```bash
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar xvf spark-3.2.1-bin-hadoop3.2.tgz
sudo mv spark-3.2.1-bin-hadoop3.2 /usr/local/spark
```
2.2 Set environment variables
```bash
vim ~/.bashrc

# Spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

source ~/.bashrc
```
2.3 Install Python 3
- The CentOS 8 ISO does not include Python 3, so this step is required; if your system already has Python 3, skip ahead to 2.4.
- Python 3.6 support is deprecated as of Spark 3.2.0, so install a version newer than Python 3.6.
- If the `make` command is missing, install it with `yum install make`.
```bash
yum install -y gcc openssl-devel bzip2-devel libffi-devel
wget https://www.python.org/ftp/python/3.7.12/Python-3.7.12.tgz
tar -zxvf Python-3.7.12.tgz
cd Python-3.7.12
./configure --prefix=/usr/local/python3 --enable-optimizations
make && make install
```
2.4 Test
```bash
pyspark
>>> exit()
```
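For a slightly stronger check than just starting the shell, the snippet below is a minimal sketch run inside the pyspark shell (`sc` is created for you there):

```python
# Run inside the pyspark shell; `sc` is the SparkContext the shell creates.
print(sc.version)                         # expect 3.2.1
print(sc.parallelize(range(100)).sum())   # a tiny distributed job; expect 4950
```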
2.5 Configure pyspark to show only WARN-level output
```bash
cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties
vim log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=INFO, console   # change INFO to WARN
```
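If you only want to quiet down a single session instead of editing log4j.properties, SparkContext also lets you change the log level at runtime:

```python
# Inside a pyspark session: lower log verbosity for this session only.
sc.setLogLevel("WARN")
```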
3. Using pyspark with Hadoop
3.1 Start Hadoop
```bash
start-all.sh
```
3.2 Create a test file
```bash
mkdir -p ~/wordcount/input
cp /usr/local/hadoop/LICENSE.txt ~/wordcount/input
ll ~/wordcount/input
hadoop fs -mkdir -p /user/wordcount/input
cd ~/wordcount/input
hadoop fs -copyFromLocal LICENSE.txt /user/wordcount/input
```
3.3 Run a pyspark program locally
local[N] means run locally using N threads.
```bash
pyspark --master local[4]
>>> sc.master
'local[4]'
```
Read a local file
>>> textFile=sc.textFile("file:/usr/local/spark/README.md")>>> textFile.count()109
Read an HDFS file
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")>>> textFile.count()270
4. Run pyspark on Hadoop YARN
That is, let Hadoop YARN allocate and manage the resources used by Spark.
4.1 Run pyspark under YARN
- `HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop`: the Hadoop configuration directory
- `pyspark`: the program to run
- `--master yarn --deploy-mode client`: run in yarn-client mode
```bash
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
>>> sc.master
'yarn'
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
>>> textFile.count()
```
- Note ⚠️: if HDFS errors occur in yarn mode, check whether Python is missing on the datanodes.
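Besides the interactive shell, the same count can be packaged as a script and submitted to YARN with spark-submit. The sketch below is illustrative (the file name `wordcount_yarn.py` is made up); it assumes the HDFS path used earlier:

```python
# wordcount_yarn.py — illustrative sketch; submit with:
#   HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client wordcount_yarn.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yarn-line-count").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
print("line count:", lines.count())

spark.stop()
```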
4.2 Troubleshooting
1. `WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable`
```bash
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
vim spark-env.sh
# add:
export LD_LIBRARY_PATH=$JAVA_LIBRARY_PATH
```
2. `WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.`
```bash
hdfs dfs -mkdir /spark_jars
hdfs dfs -put /usr/local/spark/jars/* /spark_jars/
cp /usr/local/spark/conf/spark-defaults.conf.template /usr/local/spark/conf/spark-defaults.conf
vim spark-defaults.conf
# add the following setting
spark.yarn.jars hdfs://master:9000/spark_jars/*
```
5. Run pyspark on a Spark Standalone Cluster
Standalone is Spark's built-in resource management framework, while YARN is Hadoop's; both essentially manage and allocate CPU cores and memory.
5.1 Environment configuration
1. Create spark-env.sh on the master server
```bash
# if spark-env.sh does not exist yet, create it from the template
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
# edit spark-env.sh
vim /usr/local/spark/conf/spark-env.sh

# IP or hostname of the Spark master
export SPARK_MASTER_IP=master
# number of CPU cores each worker uses
export SPARK_WORKER_CORES=1
# amount of memory each worker uses
export SPARK_WORKER_MEMORY=512m
# number of worker instances
export SPARK_WORKER_INSTANCES=4
```
2. Configure workers on the master
```bash
cp /usr/local/spark/conf/workers.template /usr/local/spark/conf/workers
vim /usr/local/spark/conf/workers

# A Spark Worker will be started on each of the machines listed below.
data01
data02
```
3. Copy Spark to data01 and data02
```bash
ssh data01
mkdir /usr/local/spark
exit
scp -r /usr/local/spark root@data01:/usr/local

ssh data02
mkdir /usr/local/spark
exit
scp -r /usr/local/spark root@data02:/usr/local
```
5.2 Run pyspark on the Spark Standalone Cluster
1. Start the Spark Standalone Cluster
```bash
/usr/local/spark/sbin/start-all.sh

# or start the components individually
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-workers.sh
```
2. Run pyspark against the Spark Standalone cluster
```bash
pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
```
3. Test
```bash
>>> sc.master
'spark://master:7077'
>>> textFile=sc.textFile("file:/usr/local/spark/README.md")
>>> textFile.count()
109
>>> textFile=sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
>>> textFile.count()
270
```
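The same test can also be scripted instead of typed into the shell. The sketch below is illustrative (the file name `standalone_count.py` is made up); it reuses the standalone master and the files above:

```python
# standalone_count.py — illustrative sketch; submit with:
#   spark-submit --master spark://master:7077 --total-executor-cores 3 --executor-memory 512m standalone_count.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("standalone-count").getOrCreate()
sc = spark.sparkContext

print(sc.textFile("file:/usr/local/spark/README.md").count())                      # expect 109
print(sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt").count())  # expect 270

spark.stop()
```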
5.3 Spark Web UI
```bash
[root@master ~]# pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
Python 3.7.2 (default, Feb 10 2022, 21:53:48)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Python version 3.7.2 (default, Feb 10 2022 21:53:48)
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://master:7077, app id = app-20220212184932-0000).
SparkSession available as 'spark'.
```
http://master:4040 is the Spark Web UI address; open it in a browser to view it.
