Resource Planning
Component | bigdata-node1 | bigdata-node2 | bigdata-node3 |
---|---|---|---|
OS | centos7.6 | centos7.6 | centos7.6 |
JDK | jvm | jvm | jvm |
Scala | scala | scala | scala |
HDFS | NameNode/SecondaryNameNode/DataNode/JobHistoryServer/ApplicationHistoryServer | DataNode | DataNode |
YARN | ResourceManager | NodeManager | NodeManager |
Hive | HiveServer2/Metastore/CLI/Beeline | CLI/Beeline | CLI/Beeline |
MySQL | N.A | N.A | MySQL Server |
Maven | mvn | N.A | N.A |
Spark | master/HistoryServer | worker | worker |
Installation Media
Version: spark-2.0.0.tgz
Download: http://spark.apache.org/downloads.html
Environment Preparation
Install the JDK
Install Scala
Install Hadoop
Install Hive
Install Maven
Choose a Maven version that satisfies the requireMavenVersion rule in Spark's pom.xml. Reference: "CentOS7.6 - Install Maven 3.3.9".
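As a quick pre-check, you can read the versions the build expects straight out of the POM (a minimal sketch; the path assumes the Spark source has already been unpacked to ~/modules/spark-2.0.0 as in the build steps below):
# Show the Maven/Java/Scala versions the Spark POM expects
grep -E '<(maven|java|scala)\.version>' ~/modules/spark-2.0.0/pom.xml
# Compare against the locally installed Maven
mvn -version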
Compile Spark
(1). Extract
cd /share
tar -zxvf spark-2.0.0.tgz -C ~/modules/
(2). Configure the distribution build environment
make-distribution.sh wraps the Maven build and produces a binary distribution package in the source root, with a file name like spark-<version>-bin-<name>.tgz.
export SPARK_HOME=/home/vagrant/modules/spark-2.0.0
cd $SPARK_HOME/dev
./change-scala-version.sh 2.11
vi make-distribution.sh
Configure as follows:
# Replace the following parameter values
MVN="$MAVEN_HOME/bin/mvn"
# Remove the previous settings and replace them with the values below
VERSION=2.0.0
SCALA_VERSION=2.11.8
SPARK_HADOOP_VERSION=2.7.2
SPARK_HIVE=1 # 1 means bundle Hive; any other value means Hive is not bundled
(3). Build the distribution
cd $SPARK_HOME/dev
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
# Include Hive support for Spark SQL
./make-distribution.sh --name hadoop-2.7.2 --tgz -Phadoop-2.7 -Pyarn -Pscala-2.11 -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests
The build takes roughly one hour. On success, a new file (spark-*-bin-${MAKE_DIS_NAME}.tgz) appears in the source root directory.
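To confirm the artifact was produced (a minimal check; the file name assumes the --name hadoop-2.7.2 flag used above):
# The distribution tarball is written to the source root
ls -lh $SPARK_HOME/spark-2.0.0-bin-hadoop-2.7.2.tgz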
Spark On Hive
(1). Extract
cd ~/modules/spark-2.0.0
tar -zxvf spark-2.0.0-bin-hadoop-2.7.2.tgz -C ~/modules/
# Create a symlink (use rm -rf to remove it)
ln -s ~/modules/spark-2.0.0-bin-hadoop-2.7.2 ~/modules/spark
(2). Configure Spark
Configure spark-env.sh.
vi ~/modules/spark/conf/spark-env.sh
Configure as follows:
export JAVA_HOME=/home/vagrant/modules/jdk1.8.0_221
export SCALA_HOME=/home/vagrant/modules/scala-2.11.8
export HADOOP_HOME=/home/vagrant/modules/hadoop-2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_WORKER_MEMORY=1g
export SPARK_DRIVER_MEMORY=1g
export SPARK_MASTER_IP=192.168.0.101
export SPARK_LIBRARY_PATH=/home/vagrant/modules/spark/lib
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_WORKER_DIR=/home/vagrant/modules/spark/work
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_LOG_DIR=/home/vagrant/modules/spark/log
export SPARK_PID_DIR=/home/vagrant/modules/spark/run
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://bigdata-node1:9000/spark-log"
Configure slaves.
vi ~/modules/spark/conf/slaves
Configure as follows:
bigdata-node1
bigdata-node2
bigdata-node3
Configure spark-defaults.conf.
vi ~/modules/spark/conf/spark-defaults.conf
Configure as follows:
spark.master spark://bigdata-node1:7077
spark.home /home/vagrant/modules/spark
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata-node1:9000/spark-log
spark.eventLog.compress true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 1g
spark.driver.memory 1g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
For parameter descriptions, see Appendix A: spark-defaults.conf parameter notes.
(3). Add the Hive integration JAR
Copy mysql-connector-java-5.1.40-bin.jar into ${SPARK_HOME}/jars.
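For example (a minimal sketch; the /share source path is an assumption, so adjust it to wherever the connector JAR actually resides):
# Copy the MySQL JDBC driver into Spark's jars directory so Spark SQL can reach the Hive metastore database
cp /share/mysql-connector-java-5.1.40-bin.jar ~/modules/spark/jars/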
(4). Distribute
scp -r ~/modules/spark-2.0.0-bin-hadoop-2.7.2 vagrant@bigdata-node2:~/modules/
scp -r ~/modules/spark-2.0.0-bin-hadoop-2.7.2 vagrant@bigdata-node3:~/modules/
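The worker nodes also need the ~/modules/spark symlink created during extraction, because the environment variables below point at it. A minimal sketch, run from bigdata-node1 and assuming passwordless SSH is already configured:
ssh vagrant@bigdata-node2 "ln -s ~/modules/spark-2.0.0-bin-hadoop-2.7.2 ~/modules/spark"
ssh vagrant@bigdata-node3 "ln -s ~/modules/spark-2.0.0-bin-hadoop-2.7.2 ~/modules/spark"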
(5). Set environment variables
Note: only the worker nodes (bigdata-node2, bigdata-node3) need this change.
vi ~/.bashrc # in vi, :$ jumps to the last line; append there
Configure as follows:
export SPARK_HOME=/home/vagrant/modules/spark
export PATH=$SPARK_HOME/bin:$PATH
Apply the environment variables:
source ~/.bashrc
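A quick sanity check on each modified node (minimal sketch):
echo $SPARK_HOME # should print /home/vagrant/modules/spark
spark-submit --version # should report Spark 2.0.0 and Scala 2.11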
(6). Add the Hive integration configuration
# Hive is configured differently on each node (server vs. client), so run the commands below separately on each node
cp -rf ~/modules/apache-hive-2.3.4-bin/conf/hive-site.xml ~/modules/spark/conf/
vi ~/modules/spark/conf/hive-site.xml
Configure as follows (append):
<!-- Spark checks the Hive schema version on every startup -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
(7). Configure YARN
vi ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml
Configure as follows:
<!-- Aggregate logs to HDFS so they can be viewed from the web UI -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Log server address; used by the worker nodes -->
<property>
<name>yarn.log.server.url</name>
<value>http://bigdata-node1:19888/jobhistory/logs/</value>
</property>
<!-- Log retention time, in seconds -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>86400</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- The settings below were added to fix errors when running spark-shell in yarn-client mode; spark-submit is likely affected as well. Setting either one of the two checks is enough to fix the problem, and setting both is also fine. -->
<!-- Whether to run a thread that checks each task's virtual memory usage and kills any task exceeding its allocation; defaults to true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to run a thread that checks each task's physical memory usage and kills any task exceeding its allocation; defaults to true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Maximum amount of virtual memory a task may use per 1 MB of physical memory; defaults to 2.1 -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>
Distribute (the YARN cluster must be restarted):
scp -r ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml vagrant@bigdata-node2:~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml
scp -r ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml vagrant@bigdata-node3:~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml
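A minimal restart sketch (run on bigdata-node1, which hosts the ResourceManager in this plan):
~/modules/hadoop-2.7.2/sbin/stop-yarn.sh
~/modules/hadoop-2.7.2/sbin/start-yarn.sh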
Start Spark
Note: make sure the Hadoop cluster and YARN are both running before starting the Spark cluster.
cd ~/modules/spark
# Start / stop the Spark cluster
sbin/start-all.sh
sbin/stop-all.sh
# Create the Spark event-log directory in HDFS before the first run
~/modules/hadoop-2.7.2/bin/hdfs dfs -mkdir -p /spark-log
sbin/start-history-server.sh
sbin/stop-history-server.sh
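To confirm the event-log directory exists and applications are being recorded (minimal check):
~/modules/hadoop-2.7.2/bin/hdfs dfs -ls /spark-log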
Verification
Check the ports:
sudo netstat -anop |grep 50070
sudo netstat -anop |grep 8088
sudo netstat -anop |grep 8081
sudo netstat -anop |grep 18080
sudo netstat -anop |grep 7077
sudo netstat -anop |grep 7078
Browser entry points:
Component | URL |
---|---|
Hadoop Web UI | http://bigdata-node1:50070 |
Yarn Web UI | http://bigdata-node1:8088 |
Spark Web UI | http://bigdata-node1:8081 |
Spark History Server | http://bigdata-node1:18080 |
Standalone
(1). Client mode
- spark-shell
Note: spark-shell only supports client mode (the driver runs locally); cluster mode is not supported.
# Prepare test data
cd ~/modules/hadoop-2.7.2
bin/hdfs dfs -mkdir -p /tmp/input
## bin/hdfs dfs -rm -r /tmp/output
bin/hdfs dfs -put ~/modules/hadoop-2.7.2/etc/hadoop/core-site.xml /tmp/input
bin/hdfs dfs -text /tmp/input/core-site.xml
# spark-shell
cd ~/modules/spark
bin/spark-shell --master spark://bigdata-node1:7077
scala>> sc.textFile("hdfs://bigdata-node1:9000/tmp/input/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
- spark-sql
Note: spark-sql only supports client mode; cluster mode is not supported.
# Note: if Hive is used in client mode, make sure the Hive Metastore server is running
cd ~/modules/spark
bin/spark-sql --master spark://bigdata-node1:7077
spark-sql>> use default;
spark-sql>> select sum(id) from stu;
spark-sql>> select max(id) from stu;
spark-sql>> select avg(id) from stu;
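The queries above assume a table named stu with a numeric id column already exists in the default database. If it does not, here is a minimal sketch to create some test data from the same prompt (table name and rows are illustrative only):
spark-sql>> create table if not exists stu(id int, name string);
spark-sql>> insert into table stu select 1, 'tom';
spark-sql>> insert into table stu select 2, 'jerry';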
- spark-submit
cd ~/modules/spark
bin/spark-submit --master spark://bigdata-node1:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.0.jar 10
(2). Cluster mode
- spark-submit
cd ~/modules/spark
bin/spark-submit --master spark://bigdata-node1:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.0.jar 10
In standalone cluster mode the driver is launched on a random worker node in the cluster. If several applications are submitted, their drivers are spread across the worker nodes, which distributes the load. This mode is suited to production. The driver logs cannot be obtained from the console; read them through the web UI instead.
Spark on YARN
(1). Client mode
- spark-shell
Note: spark-shell only supports client mode (the driver runs locally); cluster mode is not supported.
cd ~/modules/spark
bin/spark-shell --master yarn --deploy-mode client
scala>> sc.textFile("hdfs://bigdata-node1:9000/tmp/input/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
- spark-sql
Note: spark-sql only supports client mode; cluster mode is not supported.
# Note: if Hive is used in client mode, make sure the Hive Metastore server is running
cd ~/modules/spark
bin/spark-sql --master yarn --deploy-mode client
spark-sql>> use default;
spark-sql>> select sum(id) from stu;
spark-sql>> select max(id) from stu;
spark-sql>> select avg(id) from stu;
- spark-submit
cd ~/modules/spark
bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/jars/spark-examples*.jar \
10
(2). Cluster mode
- spark-submit
cd ~/modules/spark
bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/jars/spark-examples*.jar \
10
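In yarn-cluster mode the driver runs inside the ApplicationMaster on the cluster, so its output does not appear on the submitting console. With log aggregation enabled as configured earlier, the logs can be fetched with the YARN CLI (a minimal sketch; substitute the application ID reported by spark-submit or shown in the ResourceManager UI):
~/modules/hadoop-2.7.2/bin/yarn logs -applicationId <applicationId>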
Issues
- Spark build error (Maven 3.2.5):
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.537 s (Wall Clock)
[INFO] Finished at: 2020-09-23T15:58:52+08:00
[INFO] Final Memory: 36M/1061M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (enforce-versions) on project spark-parent_2.11: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
- Fix: in the source root, change the Maven version in pom.xml to 3.2.5.
<maven.version>3.2.5</maven.version>
- Spark build error (Maven 3.2.5; this issue was not resolved, so build with Maven 3.3.9 or the version declared in the POM instead):
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-hive-thriftserver_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :spark-hive-thriftserver_2.11
- Spark build error: "[INFO] Spark Project Hive Thrift Server ................. FAILURE [ * s]".
Solution (edit pom.xml):
Add the following:
<parent>
<groupId>org.apache</groupId>
<artifactId>apache</artifactId>
<version>14</version>
<relativePath></relativePath>
</parent>
Add the following (create it if it does not already exist; the minor version must also match):
<profile>
<id>hadoop-2.7</id>
<properties>
<hadoop.version>2.7.2</hadoop.version>
</properties>
</profile>
Change the following:
<useZincServer>false</useZincServer>
References
【Spark Standalone and YARN modes】https://blog.csdn.net/qq_39327985/article/details/86513171
【Difference between Spark on Hive and Hive on Spark】https://www.jianshu.com/p/7236bcbcc657
【Running spark-shell and spark-sql on YARN】https://www.cnblogs.com/ggzone/p/5094469.html
【Coexistence of Hive on Spark and Spark SQL】https://blog.csdn.net/fzuzhanghao1993/article/details/90203357
【Spark 2.2.0 source build errors】https://blog.csdn.net/jiaotangX/article/details/78635133
Appendix
Appendix A: spark-defaults.conf parameter notes
Parameter | Description |
---|---|
spark.master | Spark run mode: yarn-client, yarn-cluster, spark://master:7077, etc. |
spark.home | The SPARK_HOME path |
spark.eventLog.enabled | Whether to log Spark events, used to reconstruct the web UI after an application finishes |
spark.eventLog.dir | Root directory for event logs when spark.eventLog.enabled is true. Spark creates a subdirectory per application and writes the application's events there. Point this at an HDFS directory so the History Server can read it |
spark.eventLog.compress | Whether to compress the event logs (requires spark.eventLog.enabled=true); snappy is used by default |
spark.serializer | Serializer |
spark.executor.memory | Executor memory |
spark.driver.memory | Driver memory |
spark.executor.extraJavaOptions | Extra JVM options for executors (GC settings in the example above) |