Resource Planning

| Component | bigdata-node1 | bigdata-node2 | bigdata-node3 |
| ---- | ---- | ---- | ---- |
| OS | CentOS 7.6 | CentOS 7.6 | CentOS 7.6 |
| JDK | jvm | jvm | jvm |
| Scala | scala | scala | scala |
| HDFS | NameNode/SecondaryNameNode/DataNode/JobHistoryServer/ApplicationHistoryServer | DataNode | DataNode |
| YARN | ResourceManager | NodeManager | NodeManager |
| Hive | HiveServer2/Metastore/CLI/Beeline | CLI/Beeline | CLI/Beeline |
| MySQL | N.A. | N.A. | MySQL Server |
| Maven | mvn | N.A. | N.A. |
| Spark | Master/HistoryServer | Worker | Worker |

Installation Media

Version: spark-2.0.0.tgz
Download: http://spark.apache.org/downloads.html (a command-line fetch is sketched below)
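
One way to fetch the source tarball non-interactively; the Apache archive URL below is an assumption, so adjust it if the archive layout differs:

  cd /share
  wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0.tgz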

Environment Preparation

Install JDK

Reference: "CentOS 7.6 - Install JDK 1.8.221"

Install Scala

Reference: "CentOS 7.6 - Install Scala 2.11.8"

Install Hadoop

Reference: "CentOS 7.6 - Install Hadoop 2.7.2"

Install Hive

Reference: "CentOS 7.6 - Install Hive 2.3.4"
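
A quick sanity check that these prerequisites are installed, assuming each tool is already on the PATH per the guides above:

  java -version
  scala -version
  hadoop version
  hive --version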

Install Maven

Choose the Maven version according to the requireMavenVersion rule in Spark's pom.xml (a quick check is sketched below). Reference: "CentOS 7.6 - Install Maven 3.3.9"
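
A way to see what the build will enforce, assuming Maven is on the PATH and the source will be unpacked to ~/modules/spark-2.0.0 as in the next section:

  # Installed Maven version
  mvn -v
  # Minimum Maven version enforced by the Spark build (run after unpacking the source)
  grep -B2 -A2 'requireMavenVersion' ~/modules/spark-2.0.0/pom.xml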

Compile Spark

(1) Extract

  cd /share
  tar -zxvf spark-2.0.0.tgz -C ~/modules/

(2) Configure the distribution build

make-distribution.sh already wraps the Maven build and produces a deployable package in the source root, with a file name like spark-<version>-bin-<name>.tgz.

  export SPARK_HOME=/home/vagrant/modules/spark-2.0.0
  cd $SPARK_HOME/dev
  ./change-scala-version.sh 2.11
  vi make-distribution.sh

Configure as follows:

  # Replace the following parameter value
  MVN="$MAVEN_HOME/bin/mvn"
  # Delete the original auto-detection of these values and hard-code them as below
  VERSION=2.0.0
  SCALA_VERSION=2.11.8
  SPARK_HADOOP_VERSION=2.7.2
  SPARK_HIVE=1   # 1 = package Hive support; any other value = do not package Hive

(3) Build the distribution

  cd $SPARK_HOME/dev
  export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
  # Include Hive support for Spark SQL
  ./make-distribution.sh --name hadoop-2.7.2 --tgz -Phadoop-2.7 -Pyarn -Pscala-2.11 -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests

The build takes roughly 1 hour. When it succeeds, an extra file (spark-*-bin-${MAKE_DIS_NAME}.tgz) appears in the source root directory.
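
A minimal check that the tarball was actually produced, assuming $SPARK_HOME still points at the source root:

  ls -lh $SPARK_HOME/spark-*-bin-*.tgz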

Spark On Hive

(1) Extract

  cd ~/modules/spark-2.0.0
  tar -zxvf spark-2.0.0-bin-hadoop-2.7.2.tgz -C ~/modules/
  # Create a symlink (remove it with rm -rf if needed)
  ln -s ~/modules/spark-2.0.0-bin-hadoop-2.7.2 ~/modules/spark

(2) Configure Spark

  1. Configure spark-env.sh.

    vi ~/modules/spark/conf/spark-env.sh

    Configure as follows:

    export JAVA_HOME=/home/vagrant/modules/jdk1.8.0_221
    export SCALA_HOME=/home/vagrant/modules/scala-2.11.8
    export HADOOP_HOME=/home/vagrant/modules/hadoop-2.7.2
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export SPARK_LAUNCH_WITH_SCALA=0
    export SPARK_WORKER_MEMORY=1g
    export SPARK_DRIVER_MEMORY=1g
    export SPARK_MASTER_IP=192.168.0.101
    export SPARK_LIBRARY_PATH=/home/vagrant/modules/spark/lib
    export SPARK_MASTER_WEBUI_PORT=8081
    export SPARK_WORKER_DIR=/home/vagrant/modules/spark/work
    export SPARK_MASTER_PORT=7077
    export SPARK_WORKER_PORT=7078
    export SPARK_LOG_DIR=/home/vagrant/modules/spark/log
    export SPARK_PID_DIR=/home/vagrant/modules/spark/run
    export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://bigdata-node1:9000/spark-log"
  2. Configure slaves.

    vi ~/modules/spark/conf/slaves

    Configure as follows:

    bigdata-node1
    bigdata-node2
    bigdata-node3
  3. Configure spark-defaults.conf.

    vi ~/modules/spark/conf/spark-defaults.conf

    Configure as follows:

    spark.master spark://bigdata-node1:7077
    spark.home /home/vagrant/modules/spark
    spark.eventLog.enabled true
    spark.eventLog.dir hdfs://bigdata-node1:9000/spark-log
    spark.eventLog.compress true
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.executor.memory 1g
    spark.driver.memory 1g
    spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

    The parameters are described in "Appendix A: spark-defaults.conf parameter reference".

(3) Add the Hive integration jar

Copy mysql-connector-java-5.1.40-bin.jar into ${SPARK_HOME}/jars, as sketched below.
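
A minimal sketch, assuming the same connector jar that Hive already uses is available under ~/modules/apache-hive-2.3.4-bin/lib (adjust the source path if the jar lives elsewhere):

  cp ~/modules/apache-hive-2.3.4-bin/lib/mysql-connector-java-5.1.40-bin.jar ~/modules/spark/jars/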

(4) Distribute

  scp -r ~/modules/spark-2.0.0-bin-hadoop-2.7.2 vagrant@bigdata-node2:~/modules/
  scp -r ~/modules/spark-2.0.0-bin-hadoop-2.7.2 vagrant@bigdata-node3:~/modules/

(5) Set environment variables

Note: only the worker nodes (bigdata-node2, bigdata-node3) need this change.

  vi ~/.bashrc   # type :$ to jump to the last line, then append

Configure as follows:

  export SPARK_HOME=/home/vagrant/modules/spark
  export PATH=$SPARK_HOME/bin:$PATH

Apply the environment variables:

  source ~/.bashrc

(6) Add the Hive integration configuration

  # Hive is configured differently on each node (server vs. client), so run the commands below on each node separately
  cp -rf ~/modules/apache-hive-2.3.4-bin/conf/hive-site.xml ~/modules/spark/conf/
  vi ~/modules/spark/conf/hive-site.xml

Configure as follows (append):

  <!-- Spark checks the Hive schema version every time it starts -->
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

(7) Configure YARN

  vi ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml

Configure as follows:

  <!-- Aggregate logs to HDFS so they can be viewed from the web UI -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Log server address, used by the worker nodes -->
  <property>
    <name>yarn.log.server.url</name>
    <value>http://bigdata-node1:19888/jobhistory/logs/</value>
  </property>
  <!-- Log retention time, in seconds -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>86400</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <!-- The settings below work around errors seen when running spark-shell in yarn-client mode; spark-submit likely hits the same issue. Configuring either one of the two checks is enough; configuring both also works. -->
  <!-- Whether to run a thread that checks the virtual memory used by each task and kills any task that exceeds its allocation; default is true. -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <!-- Whether to run a thread that checks the physical memory used by each task and kills any task that exceeds its allocation; default is true. -->
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <!-- Maximum virtual memory allowed per 1 MB of physical memory used; default is 2.1 -->
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
  </property>

Distribute (the YARN cluster must be restarted; a restart sketch follows the scp commands):

  scp -r ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml vagrant@bigdata-node2:~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml
  scp -r ~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml vagrant@bigdata-node3:~/modules/hadoop-2.7.2/etc/hadoop/yarn-site.xml
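
A minimal restart sketch, assuming YARN is controlled with the stock sbin scripts from the ResourceManager node (bigdata-node1):

  cd ~/modules/hadoop-2.7.2
  sbin/stop-yarn.sh
  sbin/start-yarn.sh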

Start Spark

Note: make sure the Hadoop (HDFS) and YARN clusters are already running before starting the Spark cluster (a quick check follows).
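
A quick way to confirm this, assuming jps is available on every node; the daemons to expect follow the resource planning table above:

  # On bigdata-node1, expect at least: NameNode, SecondaryNameNode, DataNode, ResourceManager
  # On bigdata-node2/3, expect: DataNode, NodeManager
  jps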

  cd ~/modules/spark
  # Start / stop the Spark cluster
  sbin/start-all.sh
  sbin/stop-all.sh
  # Create the Spark event-log directory on HDFS (do this before the first run)
  ~/modules/hadoop-2.7.2/bin/hdfs dfs -mkdir -p /spark-log
  sbin/start-history-server.sh
  sbin/stop-history-server.sh

Verification

Check the ports:

  sudo netstat -anop | grep 50070
  sudo netstat -anop | grep 8088
  sudo netstat -anop | grep 8081
  sudo netstat -anop | grep 18080
  sudo netstat -anop | grep 7077
  sudo netstat -anop | grep 7078

Web UIs:

| Component | URL |
| ---- | ---- |
| Hadoop Web UI | http://bigdata-node1:50070 |
| YARN Web UI | http://bigdata-node1:8088 |
| Spark Web UI | http://bigdata-node1:8081 |
| Spark History Server | http://bigdata-node1:18080 |

Standalone

(1) client mode

  • spark-shell

Note: spark-shell only supports client mode (the driver runs locally); cluster mode is not supported.

  # Prepare test data
  cd ~/modules/hadoop-2.7.2
  bin/hdfs dfs -mkdir -p /tmp/input
  ## bin/hdfs dfs -rm -r /tmp/output
  bin/hdfs dfs -put ~/modules/hadoop-2.7.2/etc/hadoop/core-site.xml /tmp/input
  bin/hdfs dfs -text /tmp/input/core-site.xml
  # spark-shell
  cd ~/modules/spark
  bin/spark-shell --master spark://bigdata-node1:7077
  scala> sc.textFile("hdfs://bigdata-node1:9000/tmp/input/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
  • spark-sql

Note: spark-sql only supports client mode; cluster mode is not supported.

  # Note: if Hive runs in client (remote metastore) mode, make sure the Hive Metastore service is started
  cd ~/modules/spark
  bin/spark-sql --master spark://bigdata-node1:7077
  spark-sql> use default;
  spark-sql> select sum(id) from stu;
  spark-sql> select max(id) from stu;
  spark-sql> select avg(id) from stu;

  • spark-submit

  cd ~/modules/spark
  bin/spark-submit --master spark://bigdata-node1:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.0.jar 10

(2) cluster mode

  • spark-submit

  cd ~/modules/spark
  bin/spark-submit --master spark://bigdata-node1:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.0.jar 10

In standalone cluster mode the driver is launched on a random worker node of the cluster. When several applications are submitted, their drivers are spread across the worker nodes, which helps balance the load. This mode is intended for production use. The driver log is not available on the submitting console; read it through the Web UI instead.

Spark on YARN

(1) client mode

  • spark-shell

Note: spark-shell only supports client mode (the driver runs locally); cluster mode is not supported.

  cd ~/modules/spark
  bin/spark-shell --master yarn --deploy-mode client
  scala> sc.textFile("hdfs://bigdata-node1:9000/tmp/input/core-site.xml").flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).take(10)
  • spark-sql

Note: spark-sql only supports client mode; cluster mode is not supported.

  # Note: if Hive runs in client (remote metastore) mode, make sure the Hive Metastore service is started
  cd ~/modules/spark
  bin/spark-sql --master yarn --deploy-mode client
  spark-sql> use default;
  spark-sql> select sum(id) from stu;
  spark-sql> select max(id) from stu;
  spark-sql> select avg(id) from stu;

  • spark-submit

  cd ~/modules/spark
  bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

(2) cluster mode

  • spark-submit

  cd ~/modules/spark
  bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

Problems

  1. Spark build error (with Maven 3.2.5):

    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 1.537 s (Wall Clock)
    [INFO] Finished at: 2020-09-23T15:58:52+08:00
    [INFO] Final Memory: 36M/1061M
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (enforce-versions) on project spark-parent_2.11: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

  • Fix: in the source root, change the required Maven version in pom.xml to 3.2.5.

    <maven.version>3.2.5</maven.version>
  2. Spark build error (with Maven 3.2.5; this issue was not resolved, so build with Maven 3.3.9 or with the version declared in the pom file instead):

    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-hive-thriftserver_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
    [ERROR]
    [ERROR] After correcting the problems, you can resume the build with the command
    [ERROR] mvn <goals> -rf :spark-hive-thriftserver_2.11
  3. Spark build error: "[INFO] Spark Project Hive Thrift Server ................ FAILURE [ * s]".

Solution (edit pom.xml):

  • Add the <parent> element:

    <parent>
      <groupId>org.apache</groupId>
      <artifactId>apache</artifactId>
      <version>14</version>
      <relativePath></relativePath>
    </parent>

  • Add the hadoop-2.7 profile (create it if it does not exist; the minor version must also match the target Hadoop version):

    <profile>
      <id>hadoop-2.7</id>
      <properties>
        <hadoop.version>2.7.2</hadoop.version>
      </properties>
    </profile>

  • Change the Zinc setting:

    <useZincServer>false</useZincServer>

References

[Spark Standalone and YARN modes] https://blog.csdn.net/qq_39327985/article/details/86513171
[Difference between Spark on Hive and Hive on Spark] https://www.jianshu.com/p/7236bcbcc657
[Running spark-shell and spark-sql on YARN] https://www.cnblogs.com/ggzone/p/5094469.html
[Coexistence of Hive on Spark and Spark SQL] https://blog.csdn.net/fzuzhanghao1993/article/details/90203357
[Spark 2.2.0 source build errors] https://blog.csdn.net/jiaotangX/article/details/78635133

Appendix

Appendix A: spark-defaults.conf parameter reference

| Parameter | Description |
| ---- | ---- |
| spark.master | Spark run mode / master URL, e.g. yarn-client, yarn-cluster, spark://master:7077, ... |
| spark.home | The SPARK_HOME path |
| spark.eventLog.enabled | Whether to log Spark events, used to reconstruct the web UI after an application finishes |
| spark.eventLog.dir | Root directory for event logs when spark.eventLog.enabled is true; Spark creates a sub-directory per application and writes that application's events there. Point this at an HDFS directory so the History Server can read it |
| spark.eventLog.compress | Whether to compress the event logs (requires spark.eventLog.enabled=true); snappy is used by default |
| spark.serializer | Serializer class |
| spark.executor.memory | Executor memory |
| spark.driver.memory | Driver memory |
| spark.executor.extraJavaOptions | Extra JVM options for executors (e.g. GC settings) |