1. Resource Planning
| Component | bigdata-hk-node1 | bigdata-hk-node2 | bigdata-hk-node3 |
| --- | --- | --- | --- |
| OS | CentOS 7.6 | CentOS 7.6 | CentOS 7.6 |
| JDK | JVM | JVM | JVM |
| HDFS | NN/DN | DN | 2NN/DN |
| YARN | NM | RM/NM/JobHistoryServer | NM |
| Hive | Hive | N/A | N/A |
| Spark | Spark SQL | N/A | N/A |
2. Installation Media
Version: spark-3.0.0-bin-hadoop3.2.tgz
Download: http://spark.apache.org/downloads.html
3. Prerequisites
- Install JDK (v1.8+)
- Install Hadoop
- Install Hive
4. Spark On Hive
4.1. Extract
```bash
# Log in to the bigdata-hk-node1 node
cd /share
tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
# Create a symbolic link (remove it with rm -rf if needed)
ln -s /opt/module/spark-3.0.0-bin-hadoop3.2 /opt/module/spark
```
4.2. Configure Spark
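If the configuration files do not exist yet, they can be created from the templates that ship with the Spark binary distribution (a convenience step; vi would also create them as empty files):
```bash
cd /opt/module/spark/conf
# Create editable copies of the bundled templates; both files are edited below
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
```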
Configure spark-env.sh:
```bash
vi /opt/module/spark/conf/spark-env.sh
```
Add the following:
```bash
export JAVA_HOME=/opt/module/jdk1.8.0_221
export HADOOP_HOME=/opt/module/hadoop-3.2.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_WORKER_MEMORY=1g
export SPARK_DRIVER_MEMORY=1g
export SPARK_MASTER_IP=192.168.56.101
export SPARK_LIBRARY_PATH=/opt/module/spark/lib
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_WORKER_DIR=/opt/module/spark/work
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_LOG_DIR=/opt/module/spark/log
export SPARK_PID_DIR=/opt/module/spark/run
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://bigdata-hk-node1:8020/spark-log"
```
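spark-env.sh above points SPARK_WORKER_DIR, SPARK_LOG_DIR, and SPARK_PID_DIR at directories under /opt/module/spark. Spark normally creates them on first start; the sketch below pre-creates them to avoid permission surprises:
```bash
# Pre-create the local directories referenced in spark-env.sh
mkdir -p /opt/module/spark/work /opt/module/spark/log /opt/module/spark/run
```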
Configure spark-defaults.conf:
```bash
vi /opt/module/spark/conf/spark-defaults.conf
```
Add the following:
```properties
spark.master spark://bigdata-hk-node1:7077
spark.home /opt/module/spark
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata-hk-node1:8020/spark-log
spark.eventLog.compress true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 1g
spark.driver.memory 1g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
```
4.3. Add the Hive Integration JAR
Copy "mysql-connector-java-5.1.47.jar" into ${SPARK_HOME}/jars (e.g. /opt/module/spark/jars).
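A sketch of the copy, assuming the connector jar sits in /share alongside the other installation media (the /share source path is an assumption):
```bash
# /share is assumed to be the staging directory holding the downloaded installation media
cp /share/mysql-connector-java-5.1.47.jar /opt/module/spark/jars/
```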
4.4. 环境变量设置
sudo vi /etc/profile.d/bigdata_env.sh # :$(或G:`shift+g`)到达行尾添加
配置如下:
# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$SPARK_HOME/bin:$PATH
环境变量生效:
source /etc/profile
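A quick check that the variable is picked up (spark-submit --version prints the Spark banner, which should report version 3.0.0):
```bash
echo $SPARK_HOME         # expected: /opt/module/spark
spark-submit --version   # expected: version 3.0.0
```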
4.5. Add the Hive Integration Configuration
```bash
# bigdata-hk-node1
cp -rf /opt/module/apache-hive-3.1.2-bin/conf/hive-site.xml /opt/module/spark/conf/
vi /opt/module/spark/conf/hive-site.xml
```
Append the following:
```xml
<!-- Spark checks the Hive schema version on every startup; disable the verification -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
```
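An optional local smoke test of the Hive integration (a sketch: it overrides the standalone master from spark-defaults.conf with --master local and disables event logging, since the HDFS /spark-log directory is only created in section 5; it assumes the Hive metastore, or its MySQL backend, is reachable):
```bash
cd /opt/module/spark
# List the Hive databases through Spark SQL without needing the Spark cluster
bin/spark-sql --master local --conf spark.eventLog.enabled=false -e "show databases;"
```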
4.6. Configure YARN
```bash
vi /opt/module/hadoop-3.2.2/etc/hadoop/yarn-site.xml
```
Add the following (this should already be in place from the Hadoop installation; if not, update the file and distribute it to all nodes, as sketched after the configuration below):
```xml
<!-- Aggregate logs to HDFS so they can be viewed through the web UI -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Log server URL, used by the worker nodes -->
<property>
<name>yarn.log.server.url</name>
<value>http://bigdata-hk-node1:19888/jobhistory/logs/</value>
</property>
<!-- Log retention time, in seconds -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>86400</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- The following settings work around errors seen when running spark-shell in yarn client mode
     (spark-submit can hit the same problem). Disabling either one of the two checks is enough;
     disabling both also works. -->
<!-- Whether to run a thread that checks each container's virtual memory usage and kills
     containers that exceed their allocation; default is true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to run a thread that checks each container's physical memory usage and kills
     containers that exceed their allocation; default is true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Maximum amount of virtual memory a container may use per 1 MB of physical memory; default is 2.1 -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>
```
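A minimal distribution-and-restart sketch, assuming passwordless SSH between the nodes and the Hadoop paths used above (the ResourceManager runs on bigdata-hk-node2 per the resource plan):
```bash
# Run on bigdata-hk-node1: push the updated yarn-site.xml to the other nodes
for host in bigdata-hk-node2 bigdata-hk-node3; do
  scp /opt/module/hadoop-3.2.2/etc/hadoop/yarn-site.xml \
      ${host}:/opt/module/hadoop-3.2.2/etc/hadoop/
done

# Restart YARN so the new settings take effect
ssh bigdata-hk-node2 "/opt/module/hadoop-3.2.2/sbin/stop-yarn.sh && /opt/module/hadoop-3.2.2/sbin/start-yarn.sh"
```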
5. Start Spark (Optional)
Note: in standalone mode, make sure the Hadoop cluster and YARN are already running before starting the Spark cluster.
```bash
cd /opt/module/spark

# Start / stop the Spark cluster
sbin/start-all.sh
sbin/stop-all.sh

# Create the Spark event-log directory on HDFS (first run only)
/opt/module/hadoop-3.2.2/bin/hdfs dfs -mkdir -p /spark-log

# Start / stop the history server
sbin/start-history-server.sh
sbin/stop-history-server.sh
```
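A quick way to confirm the daemons came up is jps, which ships with the JDK (a sketch; Master, Worker, and HistoryServer are the standard Spark daemon process names):
```bash
# Run on bigdata-hk-node1 after start-all.sh and start-history-server.sh
jps | grep -E 'Master|Worker|HistoryServer'
```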
6. Verification
Note: in Spark on YARN mode, the Spark standalone cluster does not need to be started for this step.
6.1. Client Mode
- **spark-shell**
**Note: spark-shell only supports client mode (the driver runs locally), not cluster mode.**
```bash
cd /opt/module/spark
bin/spark-shell --master yarn --deploy-mode client
```
Validation script (run inside the shell):
```scala
sc.textFile("hdfs://bigdata-hk-node1:8020/tmp/input/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
```
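The word-count one-liner assumes core-site.xml has already been uploaded to /tmp/input on HDFS; if it has not, a sketch to put it there before launching spark-shell:
```bash
# Upload a sample file matching the path used in the validation script
/opt/module/hadoop-3.2.2/bin/hdfs dfs -mkdir -p /tmp/input
/opt/module/hadoop-3.2.2/bin/hdfs dfs -put /opt/module/hadoop-3.2.2/etc/hadoop/core-site.xml /tmp/input/
```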
- **spark-sql**
Note: spark-sql only supports client mode, not cluster mode.
```bash
# Note: if Hive is configured in remote (client) metastore mode, make sure the Hive Metastore server is running
cd /opt/module/spark
bin/spark-sql --master yarn --deploy-mode client
```
Validation SQL:
```sql
use default;
select sum(id) from stu;
select max(id) from stu;
select avg(id) from stu;
```
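The queries assume a table named stu already exists in the default database. If it does not, a hypothetical table can be created and seeded as below (the (id int, name string) schema is an assumption; only the id column is actually used by the queries):
```bash
# Create and seed a sample stu table through spark-sql
bin/spark-sql --master yarn --deploy-mode client -e "
create table if not exists default.stu (id int, name string);
insert into default.stu values (1, 'aa'), (2, 'bb'), (3, 'cc');
"
```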
- **spark-submit**
```bash
cd /opt/module/spark
bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/jars/spark-examples*.jar \
10
```
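If the submission is rejected because thequeue is not defined in the Fair Scheduler configuration, drop the --queue flag or point it at an existing queue. In client mode the result is printed on the driver's stdout, so a quick check looks like this (a sketch reusing the bundled example jar):
```bash
# Submit to the default queue and grep the driver output for the result line
bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode client \
  --driver-memory 1g --executor-memory 1g --executor-cores 1 \
  examples/jars/spark-examples*.jar 10 2>&1 | grep "Pi is roughly"
```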
6.2. Cluster Mode
- **spark-submit**
```bash
cd /opt/module/spark
bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/jars/spark-examples*.jar \
10
```
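In cluster mode the driver runs inside a YARN container, so the Pi result does not appear on the submitting terminal; with log aggregation enabled (see section 4.6) it can be pulled from the application logs (a sketch; application_XXXX_YYYY is a placeholder for the application ID reported by spark-submit or shown in the YARN web UI):
```bash
# Replace application_XXXX_YYYY with the actual application ID
/opt/module/hadoop-3.2.2/bin/yarn logs -applicationId application_XXXX_YYYY | grep "Pi is roughly"
```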
6.3. Ports
- **Process check**
```bash
# Install net-tools for netstat (skip if it is already installed)
sudo yum install net-tools

# bigdata-hk-node1
sudo netstat -anop | grep -E '9870|8081|18080|7077|7078'

# bigdata-hk-node1 (the SparkSubmit process starts after launching spark-shell, spark-sql,
# or spark-submit in client mode; port 4040 is released once it exits)
sudo netstat -anop | grep 4040

# bigdata-hk-node2
sudo netstat -anop | grep 8088
```
- **Browser check**

| Component | URL |
| --- | --- |
| Hadoop Web UI | http://bigdata-hk-node1:9870 |
| YARN Web UI | http://bigdata-hk-node2:8088 |
| Spark Web UI | http://bigdata-hk-node1:8081 |
| Application Web UI | http://bigdata-hk-node1:4040 -> http://bigdata-hk-node2:8088/proxy/redirect/application_* |
| Spark History Server | http://bigdata-hk-node1:18080 |
Note: port 4040 serves the UI of the currently running Spark application; once the application finishes, or when nothing is running, port 4040 is not reachable.