# 1. Resource Planning
| Component | bigdata-hk-node1 | bigdata-hk-node2 | bigdata-hk-node3 |
|---|---|---|---|
| OS | CentOS 7.6 | CentOS 7.6 | CentOS 7.6 |
| JDK | JDK 1.8 | JDK 1.8 | JDK 1.8 |
| HDFS | NN/DN | DN | 2NN/DN |
| YARN | NM | RM/NM/JobHistoryServer | NM |
| Hive | Hive | N.A. | N.A. |
| Spark | Spark SQL | N.A. | N.A. |
# 2. Installation Media
Version: spark-3.0.0-bin-hadoop3.2.tgz
Download: http://spark.apache.org/downloads.html
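If the node has internet access, the release can also be pulled directly from the Apache archive (a minimal sketch; the archive URL layout is an assumption, so verify it against the downloads page):
```bash
# Download Spark 3.0.0 pre-built for Hadoop 3.2 into /share (URL assumed from the Apache archive layout)
wget https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz -P /share
```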
# 3. Prerequisites
- Install JDK (v1.8+)
- Install Hadoop
- Install Hive
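Before continuing, it is worth confirming the prerequisites are installed and on the PATH (versions below follow the layout used elsewhere in this guide):
```bash
java -version     # expect 1.8.x
hadoop version    # expect 3.2.2
hive --version    # expect 3.1.2
```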
# 4. Spark On Hive
## 4.1. Extract
```bash
# Log in to the bigdata-hk-node1 node
cd /share
tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
# Create a symlink (remove it later with rm -rf if needed)
ln -s /opt/module/spark-3.0.0-bin-hadoop3.2 /opt/module/spark
```
## 4.2. Configure Spark
- Edit spark-env.sh:

```bash
vi /opt/module/spark/conf/spark-env.sh
```

Add the following:

```bash
export JAVA_HOME=/opt/module/jdk1.8.0_221
export HADOOP_HOME=/opt/module/hadoop-3.2.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_WORKER_MEMORY=1g
export SPARK_DRIVER_MEMORY=1g
export SPARK_MASTER_IP=192.168.56.101
export SPARK_LIBRARY_PATH=/opt/module/spark/lib
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_WORKER_DIR=/opt/module/spark/work
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_LOG_DIR=/opt/module/spark/log
export SPARK_PID_DIR=/opt/module/spark/run
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://bigdata-hk-node1:8020/spark-log"
```
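Optional: the worker, log, and PID directories referenced above are normally created by Spark's start scripts, but creating them up front avoids permission surprises (a minimal sketch):
```bash
# Pre-create the local directories referenced by SPARK_WORKER_DIR / SPARK_LOG_DIR / SPARK_PID_DIR
mkdir -p /opt/module/spark/{work,log,run}
```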
 
- Edit spark-defaults.conf:

```bash
vi /opt/module/spark/conf/spark-defaults.conf
```

Add the following (note: the HDFS directory /spark-log used by spark.eventLog.dir is created in Section 5):

```properties
spark.master                     spark://bigdata-hk-node1:7077
spark.home                       /opt/module/spark
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://bigdata-hk-node1:8020/spark-log
spark.eventLog.compress          true
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.memory            1g
spark.driver.memory              1g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
```
## 4.3. Add the Hive integration jar
Copy "mysql-connector-java-5.1.47.jar" into ${SPARK_HOME}/jars (e.g. /opt/module/spark/jars).
## 4.4. Set environment variables
```bash
# Append at the end of the file (jump to the end with :$ or G / shift+g)
sudo vi /etc/profile.d/bigdata_env.sh
```
Add the following:
```bash
# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$SPARK_HOME/bin:$PATH
```
Reload the environment:
```bash
source /etc/profile
```
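A quick sanity check that the new PATH entry resolves to the right installation (open a new shell or source the profile first):
```bash
spark-submit --version    # should report Spark 3.0.0
```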
## 4.5. Add the Hive integration configuration
```bash
# bigdata-hk-node1
cp -rf /opt/module/apache-hive-3.1.2-bin/conf/hive-site.xml /opt/module/spark/conf/
vi /opt/module/spark/conf/hive-site.xml
```
Append the following:
```xml
<!-- Spark checks the Hive schema version on every startup; disable strict verification -->
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
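At this point Spark should already be able to reach the Hive metastore, even before the YARN-side changes below. A quick local-mode check (a sketch, assuming the metastore database is reachable and, if Hive uses a remote metastore, that the Metastore service is running):
```bash
cd /opt/module/spark
# Run in local mode so YARN is not required yet; should list the Hive databases
bin/spark-sql --master local[2] -e "show databases;"
```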
## 4.6. Configure YARN
```bash
vi /opt/module/hadoop-3.2.2/etc/hadoop/yarn-site.xml
```
Configure as follows (this was already set up during the Hadoop installation; if not, add the following, then distribute the file and restart YARN as sketched after the block):
```xml
<!-- Aggregate logs to HDFS so they can be viewed from the web UI -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<!-- Log server address, used by the worker nodes -->
<property>
  <name>yarn.log.server.url</name>
  <value>http://bigdata-hk-node1:19888/jobhistory/logs/</value>
</property>
<!-- Log retention time, in seconds -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>86400</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- The following settings work around errors when running spark-shell in yarn-client mode
     (spark-submit likely hits the same issue). Setting either one of the two checks below
     is enough; setting both also works. -->
<!-- Whether a thread checks each task's virtual memory usage and kills tasks that exceed
     their allocation; default is true -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Whether a thread checks each task's physical memory usage and kills tasks that exceed
     their allocation; default is true -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Maximum amount of virtual memory allowed per 1 MB of physical memory used; default is 2.1 -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
```
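If yarn-site.xml did change, distribute it and restart YARN (a sketch; host names and paths follow the resource plan above, and the restart is expected to run on the ResourceManager node):
```bash
# Push the updated yarn-site.xml to the other nodes
scp /opt/module/hadoop-3.2.2/etc/hadoop/yarn-site.xml bigdata-hk-node2:/opt/module/hadoop-3.2.2/etc/hadoop/
scp /opt/module/hadoop-3.2.2/etc/hadoop/yarn-site.xml bigdata-hk-node3:/opt/module/hadoop-3.2.2/etc/hadoop/
# Then restart YARN on the ResourceManager node (bigdata-hk-node2 in this plan)
/opt/module/hadoop-3.2.2/sbin/stop-yarn.sh
/opt/module/hadoop-3.2.2/sbin/start-yarn.sh
```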
# 5. Start Spark (optional)
Note: in Standalone mode, make sure the Hadoop cluster and YARN are both running before starting the Spark cluster.
```bash
cd /opt/module/spark
# Start / stop the Spark cluster
sbin/start-all.sh
sbin/stop-all.sh
# Create the HDFS directory for Spark event logs (first run only)
/opt/module/hadoop-3.2.2/bin/hdfs dfs -mkdir -p /spark-log
# Start / stop the history server
sbin/start-history-server.sh
sbin/stop-history-server.sh
```
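To confirm the Standalone daemons came up, check the JVM processes on bigdata-hk-node1 (with the default conf/slaves a Worker also runs locally alongside the Master and HistoryServer):
```bash
# Run on bigdata-hk-node1 after start-all.sh / start-history-server.sh
jps | grep -E 'Master|Worker|HistoryServer'
```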
# 6. Verification
Note: in the Spark on YARN setup, there is no need to start the Spark (Standalone) cluster for this step.
## 6.1. Client Mode
- **spark-shell**

**Note: spark-shell only supports client mode (the driver runs locally); cluster mode is not supported.**
```bash
cd /opt/module/spark
bin/spark-shell --master yarn --deploy-mode client
```
Verification script (word count over a file on HDFS):
```scala
sc.textFile("hdfs://bigdata-hk-node1:8020/tmp/input/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
```
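The script above reads /tmp/input/core-site.xml from HDFS; if that file has not been uploaded yet, stage it first (a sketch, reusing Hadoop's own core-site.xml as sample input):
```bash
/opt/module/hadoop-3.2.2/bin/hdfs dfs -mkdir -p /tmp/input
/opt/module/hadoop-3.2.2/bin/hdfs dfs -put /opt/module/hadoop-3.2.2/etc/hadoop/core-site.xml /tmp/input/
```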
- **spark-sql**

**Note: spark-sql only supports client mode; cluster mode is not supported.**
```bash
# Note: if Hive is deployed in client (remote metastore) mode, make sure the Hive Metastore server is running
cd /opt/module/spark
bin/spark-sql --master yarn --deploy-mode client
```
Verification SQL:
```sql
use default;
select sum(id) from stu;
select max(id) from stu;
select avg(id) from stu;
```
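The queries assume a stu table already exists in the default database. If it does not, a minimal sample table can be created first (the id/name columns and sample rows are only an illustration, not part of the original setup):
```bash
cd /opt/module/spark
# Hypothetical sample table so the verification queries have data to work on
bin/spark-sql --master yarn --deploy-mode client -e "CREATE TABLE IF NOT EXISTS default.stu (id INT, name STRING)"
bin/spark-sql --master yarn --deploy-mode client -e "INSERT INTO default.stu VALUES (1, 'aa'), (2, 'bb'), (3, 'cc')"
```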
- **spark-submit**
```bash
cd /opt/module/spark
bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --queue thequeue \
  examples/jars/spark-examples*.jar \
  10
```
## 6.2. Cluster Mode
- **spark-submit**
```bash
cd /opt/module/spark
bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --queue thequeue \
  examples/jars/spark-examples*.jar \
  10
```
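In cluster mode the driver (and therefore the "Pi is roughly ..." output) runs inside YARN rather than in the local console. Once the job finishes, the result can be pulled from the aggregated logs (application_xxx is a placeholder for the real application ID shown by the client or the YARN web UI):
```bash
yarn logs -applicationId application_xxx | grep "Pi is roughly"
```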
## 6.3. Ports
- Process check
```bash
# Install netstat (skip if net-tools is already installed)
sudo yum install net-tools
# bigdata-hk-node1
sudo netstat -anop | grep -E '9870|8081|18080|7077|7078'
# bigdata-hk-node1: port 4040 is bound by the SparkSubmit process while spark-shell, spark-sql
# or spark-submit (client mode) is running, and released once it exits
sudo netstat -anop | grep 4040
# bigdata-hk-node2
sudo netstat -anop | grep 8088
```
- Browser check

| Component | URL |
|---|---|
| Hadoop Web UI | http://bigdata-hk-node1:9870 |
| Yarn Web UI | http://bigdata-hk-node2:8088 |
| Spark Web UI | http://bigdata-hk-node1:8081 |
| Application Web UI | http://bigdata-hk-node1:4040 (redirects to http://bigdata-hk-node2:8088/proxy/redirect/application_*) |
| Spark History Server | http://bigdata-hk-node1:18080 |

Note: port 4040 serves the UI of the currently running Spark application; once the job finishes or no application is running, port 4040 is no longer reachable.