Resource Planning
| Component | bigdata-node1 | bigdata-node2 | bigdata-node3 |
|---|---|---|---|
| OS | CentOS 7.6 | CentOS 7.6 | CentOS 7.6 |
| JDK | jdk1.8.0_221 | jdk1.8.0_221 | jdk1.8.0_221 |
| Scala | scala-2.11.8 | scala-2.11.8 | scala-2.11.8 |
| HDFS | NameNode/SecondaryNameNode/DataNode/JobHistoryServer/ApplicationHistoryServer | DataNode | DataNode |
| YARN | ResourceManager | NodeManager | NodeManager |
| Hive | HiveServer2/Metastore/CLI/Beeline | CLI/Beeline | CLI/Beeline |
| MySQL | N/A | N/A | MySQL Server |
| Maven | mvn | N/A | N/A |
| Spark | Master/HistoryServer | Worker | Worker |
Installation Media
Version: spark-2.0.0.tgz
Download: http://spark.apache.org/downloads.html
Environment Preparation
Install JDK
Install Scala
Install Hadoop
Install Hive
Install Maven
Choose the Maven version according to the requireMavenVersion rule in the pom.xml of the Spark source. Reference: 《CentOS7.6-安装Maven-3.3.9》.
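Once the source has been extracted (next section), a quick way to see which Maven version the build enforces is to grep the property that backs the requireMavenVersion rule in Spark's pom:
# Print the Maven version required by Spark's enforcer rule (source root path follows this guide's layout)
grep -n '<maven.version>' ~/modules/spark-2.0.0/pom.xml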
Compiling Spark
(1). Extract
cd /share
tar -zxvf spark-2.0.0.tgz -C ~/modules/
(2). Configure the distribution package environment
make-distribution.sh wraps the Maven build and produces a deployable package in the source root directory, with a file name like spark-<version>-bin-<name>.tgz.
export SPARK_HOME=/home/vagrant/modules/spark-2.0.0
cd $SPARK_HOME/dev
./change-scala-version.sh 2.11
vi make-distribution.sh
Configuration:
# Replace the following parameter values
MVN="$MAVEN_HOME/bin/mvn"
# Delete the original settings and replace them with the values below
VERSION=2.0.0
SCALA_VERSION=2.11.8
SPARK_HADOOP_VERSION=2.7.2
SPARK_HIVE=1   # 1 packages Hive support; any other value leaves Hive out
(3). Build the distribution package
cd $SPARK_HOME/dev
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
# Include Hive support for Spark SQL
./make-distribution.sh --name hadoop-2.7.2 --tgz -Phadoop-2.7 -Pyarn -Pscala-2.11 -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests
The build takes roughly one hour. On success, a new file (spark-*-bin-${MAKE_DIS_NAME}.tgz) appears in the source root directory.
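To confirm the package was actually produced (the path follows the layout used in this guide):
# The distribution tarball is written to the source root directory
ls -lh $SPARK_HOME/spark-2.0.0-bin-hadoop-2.7.2.tgz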
Spark On Hive
(1). Extract
cd ~/modules/spark-2.0.0
tar -zxvf spark-2.0.0-bin-hadoop-2.7.2.tgz -C ~/modules/
# Create a symlink (remove it with rm -rf)
ln -s ~/modules/spark-2.0.0-bin-hadoop-2.7.2 ~/modules/spark
(2). Configure Spark
Configure spark-env.sh.
vi ~/modules/spark/conf/spark-env.sh
Configuration:
export JAVA_HOME=/home/vagrant/modules/jdk1.8.0_221
export SCALA_HOME=/home/vagrant/modules/scala-2.11.8
export HADOOP_HOME=/home/vagrant/modules/hadoop-2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_WORKER_MEMORY=1g
export SPARK_DRIVER_MEMORY=1g
export SPARK_MASTER_IP=192.168.0.101
export SPARK_LIBRARY_PATH=/home/vagrant/modules/spark/lib
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_WORKER_DIR=/home/vagrant/modules/spark/work
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_LOG_DIR=/home/vagrant/modules/spark/log
export SPARK_PID_DIR=/home/vagrant/modules/spark/run
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://bigdata-node1:9000/spark-log"
Configure slaves.
vi ~/modules/spark/conf/slaves
Configuration:
bigdata-node1
bigdata-node2
bigdata-node3
Configure spark-defaults.conf.
vi ~/modules/spark/conf/spark-defaults.conf
Configuration:
spark.master                     spark://bigdata-node1:7077
spark.home                       /home/vagrant/modules/spark
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://bigdata-node1:9000/spark-log
spark.eventLog.compress          true
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.memory            1g
spark.driver.memory              1g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
For parameter descriptions, see Appendix A: spark-defaults.conf parameter reference.
(3) Add the Hive integration JAR
Copy mysql-connector-java-5.1.40-bin.jar into ${SPARK_HOME}/jars.
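For example, if the MySQL JDBC driver already ships with the local Hive installation, it can simply be copied across (the source path is an assumption; adjust it to wherever the jar actually lives):
# Source location of the driver jar is an assumption
cp ~/modules/apache-hive-2.3.4-bin/lib/mysql-connector-java-5.1.40-bin.jar ~/modules/spark/jars/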
(4). Distribute
scp -r ~/modules/spark-2.0.0-bin-hadoop-2.7.2 vagrant@bigdata-node2:~/modules/
scp -r ~/modules/spark-2.0.0-bin-hadoop-2.7.2 vagrant@bigdata-node3:~/modules/
(5). Set environment variables
Note: only the worker nodes (bigdata-node2 and bigdata-node3) need this change.
vi ~/.bashrc   # use :$ to jump to the end of the file, then append
Configuration:
export SPARK_HOME=/home/vagrant/modules/spark
export PATH=$SPARK_HOME/bin:$PATH
Apply the environment variables:
source ~/.bashrc
(6) Add the Hive integration configuration
# Because the Hive configuration differs between nodes (server vs. client), run the commands below on each node separately
cp -rf ~/modules/apache-hive-2.3.4-bin/conf/hive-site.xml ~/modules/spark/conf/
vi ~/modules/spark/conf/hive-site.xml
Configuration (append):
<!-- Spark checks the Hive schema version on every startup -->
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
(7) Configure YARN
vi ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml
Configuration:
<!-- Aggregate logs to HDFS so they can be viewed from the web UI -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<!-- Log server address, used by the worker nodes -->
<property>
  <name>yarn.log.server.url</name>
  <value>http://bigdata-node1:19888/jobhistory/logs/</value>
</property>
<!-- Log retention time, in seconds -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>86400</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- The settings below were added to fix errors when running spark-shell in yarn-client mode; spark-submit likely hits the same problem.
     Setting either one of the two checks is enough; setting both is also fine. -->
<!-- Whether to run a thread that checks the virtual memory used by each task and kills tasks that exceed their allocation. Default: true. -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Whether to run a thread that checks the physical memory used by each task and kills tasks that exceed their allocation. Default: true. -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Maximum virtual memory allowed per 1 MB of physical memory used by a task. Default: 2.1. -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
Distribute (the YARN cluster must be restarted; see the restart sketch below):
scp -r ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml vagrant@bigdata-node2:~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml
scp -r ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml vagrant@bigdata-node3:~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml
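A minimal restart sketch, assuming the stock Hadoop sbin scripts and that YARN is started from bigdata-node1:
# Restart YARN so the new yarn-site.xml takes effect (run on the ResourceManager node)
~/modules/hadoop-2.7.2/sbin/stop-yarn.sh
~/modules/hadoop-2.7.2/sbin/start-yarn.sh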
Starting Spark
Note: make sure the Hadoop cluster and YARN are running before starting the Spark cluster.
cd ~/modules/spark
# Start/stop the Spark cluster
sbin/start-all.sh
sbin/stop-all.sh
# Create the Spark log directory on HDFS (required before the first run)
~/modules/hadoop-2.7.2/bin/hdfs dfs -mkdir -p /spark-log
# Start/stop the history server
sbin/start-history-server.sh
sbin/stop-history-server.sh
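After start-all.sh and start-history-server.sh, the daemons should be visible in jps. The expected processes below follow the resource plan above (note that the slaves file also lists bigdata-node1, so a Worker runs there as well):
# On bigdata-node1: expect Master and HistoryServer (plus a Worker, given the slaves file above)
jps
# On bigdata-node2 / bigdata-node3: expect Worker
jps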
Verification
Check the ports:
sudo netstat -anop | grep 50070
sudo netstat -anop | grep 8088
sudo netstat -anop | grep 8081
sudo netstat -anop | grep 18080
sudo netstat -anop | grep 7077
sudo netstat -anop | grep 7078
Browser entry points:
| Component | URL |
|---|---|
| Hadoop Web UI | http://bigdata-node1:50070 |
| Yarn Web UI | http://bigdata-node1:8088 |
| Spark Web UI | http://bigdata-node1:8081 |
| Spark History Server | http://bigdata-node1:18080 |
Standalone
(1) Client mode
- spark-shell
Note: spark-shell only supports client mode (the driver runs locally); cluster mode is not supported.
# Prepare test data
cd ~/modules/hadoop-2.7.2
bin/hdfs dfs -mkdir -p /tmp/input
## bin/hdfs dfs -rm -r /tmp/output
bin/hdfs dfs -put ~/modules/hadoop-2.7.2/etc/hadoop/core-site.xml /tmp/input
bin/hdfs dfs -text /tmp/input/core-site.xml
# spark-shell
cd ~/modules/spark
bin/spark-shell --master spark://bigdata-node1:7077
scala> sc.textFile("hdfs://bigdata-node1:9000/tmp/input/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
- spark-sql
Note: spark-sql only supports client mode; cluster mode is not supported.
# Note: if Hive is configured in client mode, make sure the Hive Metastore server is started
# The stu table is assumed to exist; see the note below
cd ~/modules/spark
bin/spark-sql --master spark://bigdata-node1:7077
spark-sql> use default;
spark-sql> select sum(id) from stu;
spark-sql> select max(id) from stu;
spark-sql> select avg(id) from stu;
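The queries above assume a Hive table named stu with a numeric id column already exists. If it does not, a throwaway table can be created first through the Hive CLI (the schema and sample rows below are placeholders, not part of the original setup):
# Create and populate a small test table in Hive (schema is an example only)
~/modules/apache-hive-2.3.4-bin/bin/hive -e "
CREATE TABLE IF NOT EXISTS default.stu (id INT, name STRING);
INSERT INTO TABLE default.stu VALUES (1, 'a'), (2, 'b'), (3, 'c');"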
- spark-submit
cd ~/modules/spark
bin/spark-submit --master spark://bigdata-node1:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.0.jar 10
(2) Cluster mode
- spark-submit
cd ~/modules/spark
bin/spark-submit --master spark://bigdata-node1:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.0.jar 10
In standalone cluster mode the driver starts on a random worker node. If multiple applications are submitted, their drivers are spread across the worker nodes, which helps distribute load, so this mode suits production use. The driver log cannot be read from the console; view it through the Web UI instead.
Spark on YARN
(1) Client mode
- spark-shell
Note: spark-shell only supports client mode (the driver runs locally); cluster mode is not supported.
cd ~/modules/spark
bin/spark-shell --master yarn --deploy-mode client
scala> sc.textFile("hdfs://bigdata-node1:9000/tmp/input/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
- spark-sql
Note: spark-sql only supports client mode; cluster mode is not supported.
# Note: if Hive is configured in client mode, make sure the Hive Metastore server is started
cd ~/modules/spark
bin/spark-sql --master yarn --deploy-mode client
spark-sql> use default;
spark-sql> select sum(id) from stu;
spark-sql> select max(id) from stu;
spark-sql> select avg(id) from stu;
- spark-submit
cd ~/modules/spark
bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --queue thequeue \
  examples/jars/spark-examples*.jar \
  10
(2) Cluster mode
- spark-submit
cd ~/modules/spark
bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --queue thequeue \
  examples/jars/spark-examples*.jar \
  10
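In yarn-cluster mode the driver output (including the "Pi is roughly ..." line) goes to the container logs rather than the console. With log aggregation enabled as configured above, it can be retrieved afterwards with the yarn CLI (replace the application ID with the one printed by spark-submit):
# Fetch the aggregated logs of a finished application
~/modules/hadoop-2.7.2/bin/yarn logs -applicationId <applicationId>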
Issues
- Spark build error (Maven 3.2.5):
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.537 s (Wall Clock)
[INFO] Finished at: 2020-09-23T15:58:52+08:00
[INFO] Final Memory: 36M/1061M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (enforce-versions) on project spark-parent_2.11: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
- Fix: in the source root, change the Maven version declared in pom.xml to 3.2.5 (a quick check of the Maven version actually in use is shown below).
<maven.version>3.2.5</maven.version>
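To double-check which Maven the build script actually invokes (make-distribution.sh uses $MAVEN_HOME/bin/mvn after the change above):
# Print the Maven version that make-distribution.sh will use
$MAVEN_HOME/bin/mvn -version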
- Spark build error (Maven 3.2.5; this one was not resolved, so build with Maven 3.3.9 or with the version declared in the pom file):
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-hive-thriftserver_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :spark-hive-thriftserver_2.11
- Spark build error: "[INFO] Spark Project Hive Thrift Server ................. FAILURE [ * s]".
Solution (modify pom.xml):
Add the following <parent> block:
<parent>
  <groupId>org.apache</groupId>
  <artifactId>apache</artifactId>
  <version>14</version>
  <relativePath></relativePath>
</parent>
Add the following profile (create it if it is missing; the minor version must also match):
<profile>
  <id>hadoop-2.7</id>
  <properties>
    <hadoop.version>2.7.2</hadoop.version>
  </properties>
</profile>
Change the following setting:
<useZincServer>false</useZincServer>
References
【Spark Standalone and YARN modes】https://blog.csdn.net/qq_39327985/article/details/86513171
【Difference between Spark on Hive and Hive on Spark】https://www.jianshu.com/p/7236bcbcc657
【Running spark-shell and spark-sql on YARN】https://www.cnblogs.com/ggzone/p/5094469.html
【Making Hive on Spark and Spark SQL coexist】https://blog.csdn.net/fzuzhanghao1993/article/details/90203357
【Spark 2.2.0 source build errors】https://blog.csdn.net/jiaotangX/article/details/78635133
Appendix
Appendix A: spark-defaults.conf parameter reference
| Setting | Description |
|---|---|
| spark.master | Spark run mode: yarn-client, yarn-cluster, spark://master:7077, etc. |
| spark.home | The SPARK_HOME path |
| spark.eventLog.enabled | Whether to log Spark events, used to reconstruct the web UI after an application finishes |
| spark.eventLog.dir | When spark.eventLog.enabled is true, the root directory for event logs. Spark creates one subdirectory per application and records that application's events there. It can be set to an HDFS directory so the History Server can read it |
| spark.eventLog.compress | Whether to compress the event log (requires spark.eventLog.enabled=true); snappy by default |
| spark.serializer | Serializer |
| spark.executor.memory | Executor memory |
| spark.driver.memory | Driver memory |
| spark.executor.extraJavaOptions | Extra JVM/GC options for executors |
