wget https://archive.apache.org/dist/spark/spark-2.4.2/spark-2.4.2.tgz
Changes to the Spark 2.4 pom.xml:
<!-- Add vendor maven repositories -->
<!-- Cloudera -->
<repository>
  <id>cloudera-releases</id>
  <url>http://repository.cloudera.com/artifactory/cloudera-repos</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
<!-- Hortonworks -->
<repository>
  <id>HDPReleases</id>
  <name>HDP Releases</name>
  <url>http://repo.hortonworks.com/content/repositories/releases/</url>
  <snapshots><enabled>false</enabled></snapshots>
  <releases><enabled>true</enabled></releases>
</repository>
<repository>
  <id>HortonworksJettyHadoop</id>
  <name>HDP Jetty</name>
  <url>http://repo.hortonworks.com/content/repositories/jetty-hadoop</url>
  <snapshots><enabled>false</enabled></snapshots>
  <releases><enabled>true</enabled></releases>
</repository>
<!-- MapR -->
<repository>
  <id>mapr-releases</id>
  <url>https://repository.mapr.com/maven/</url>
  <snapshots><enabled>false</enabled></snapshots>
  <releases><enabled>true</enabled></releases>
</repository>
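Before starting a long build, it may help to confirm that the repository entries actually made it into pom.xml. The following is only a minimal sketch: `check_vendor_repos` is a hypothetical helper name, and it assumes each `<id>` element sits on a single line, as in the snippet above.

```shell
# Hypothetical helper: check that each vendor repository id appears in a pom.xml.
# $1 = path to the pom.xml file
# Prints one line per missing id and returns 1 if any id is missing, 0 otherwise.
check_vendor_repos() {
  local pom="$1" id missing=0
  for id in cloudera-releases HDPReleases HortonworksJettyHadoop mapr-releases; do
    if ! grep -q "<id>${id}</id>" "$pom"; then
      echo "missing repository: ${id}"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage from the Spark source directory:
#   check_vendor_repos pom.xml && echo "all vendor repositories present"
```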
[root@hdp2 spark-2.4.2]# pwd
/root/spark-2.4.2
[root@hdp2 spark-2.4.2]#
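Run the build command
Parameter details
-Phadoop-X.Y: the Hadoop major version profile (for example, -Phadoop-3.1)
-Dhadoop.version=2.6.0-cdh5.7.0: the exact Hadoop version
--pip: build the Python (pip) package
--r: build the R package
-Psparkr: SparkR (R) support
-Pkubernetes: Kubernetes support
-Phive-thriftserver: Hive Thrift server support
-Phive: Hive support
--tgz: package the distribution as a .tgz archive
--name: the name used in the generated package file name
-Phive -Phive-thriftserver: needed to connect to Hive
-Pyarn: needed to run on Hadoop YARN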
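As a rough illustration of how these parameters combine, the sketch below assembles the Maven flags for a given Hadoop profile and version and prints the resulting command instead of running it. `spark_build_flags` is a hypothetical helper name, and the values 3.1 and 3.1.1 are simply the ones used in this post.

```shell
# Hypothetical helper: assemble the Maven profile and property flags described above.
# $1 = Hadoop profile suffix (e.g. 3.1), $2 = exact Hadoop version (e.g. 3.1.1)
spark_build_flags() {
  local hadoop_profile="$1" hadoop_version="$2"
  echo "-Pyarn -Phive -Phive-thriftserver -Phadoop-${hadoop_profile} -Dhadoop.version=${hadoop_version}"
}

# Print the full command so it can be reviewed before it is run by hand.
echo "./build/mvn $(spark_build_flags 3.1 3.1.1) -DskipTests clean package"
```

Printing rather than executing keeps the sketch safe to paste anywhere; the printed line can then be run from the Spark source directory.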
# Compile the source only; after the build, the project can be imported into IDEA
[root@hdp2 spark-2.4.2]# ./build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-3.1 -Dhadoop.version=3.1.1 -DskipTests clean package
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 2.4.2:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.764 s]
[INFO] Spark Project Tags ................................. SUCCESS [  7.418 s]
[INFO] Spark Project Sketch ............................... SUCCESS [  8.415 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  4.486 s]
[INFO] Spark Project Networking ........................... SUCCESS [  8.177 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  4.475 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  8.950 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  6.291 s]
[INFO] Spark Project Core ................................. SUCCESS [03:21 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 10.382 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 14.782 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 43.863 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [02:34 min]
[INFO] Spark Project SQL .................................. SUCCESS [04:22 min]
[INFO] Spark Project ML Library ........................... SUCCESS [02:39 min]
[INFO] Spark Project Tools ................................ SUCCESS [  1.203 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:06 min]
[INFO] Spark Project REPL ................................. SUCCESS [  6.720 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 10.094 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 16.335 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [ 18.117 s]
[INFO] Spark Project Assembly ............................. SUCCESS [  5.043 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 10.603 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 16.859 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 23.006 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 10.769 s]
[INFO] Spark Avro ......................................... SUCCESS [  9.505 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18:15 min
[INFO] Finished at: 2020-07-08T14:54:06+08:00
[INFO] ------------------------------------------------------------------------
[root@hdp2 spark-2.4.2]#
Build and package; the resulting package can be deployed to a production environment
[root@hdp2 spark-2.4.2]# ./dev/make-distribution.sh --name 3.1.1 --tgz -Phadoop-3.1 -Dhadoop.version=3.1.1 -Phive -Phive-thriftserver -Pyarn
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 2.4.2:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.506 s]
[INFO] Spark Project Tags ................................. SUCCESS [  8.693 s]
[INFO] Spark Project Sketch ............................... SUCCESS [  9.827 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  8.492 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 13.254 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  5.543 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 14.309 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 11.668 s]
[INFO] Spark Project Core ................................. SUCCESS [03:21 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 25.752 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 20.328 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 48.529 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [02:39 min]
[INFO] Spark Project SQL .................................. SUCCESS [04:56 min]
[INFO] Spark Project ML Library ........................... SUCCESS [03:03 min]
[INFO] Spark Project Tools ................................ SUCCESS [  8.685 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:05 min]
[INFO] Spark Project REPL ................................. SUCCESS [  7.194 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 10.414 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 19.560 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [ 20.583 s]
[INFO] Spark Project Assembly ............................. SUCCESS [  4.442 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 10.688 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 19.058 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 24.540 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 10.721 s]
[INFO] Spark Avro ......................................... SUCCESS [ 11.792 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14:57 min (Wall Clock)
[INFO] Finished at: 2020-07-08T16:15:56+08:00
[INFO] ------------------------------------------------------------------------
spark-2.4.2-bin-3.1.1.tgz is the installation package produced by the build:
[root@hdp2 spark-2.4.2]# ls
appveyor.yml data launcher python sql
assembly dev LICENSE R streaming
bin dist licenses README.md target
build docs mllib repl tools
common examples mllib-local resource-managers
conf external NOTICE sbin
CONTRIBUTING.md graphx pom.xml scalastyle-config.xml
core hadoop-cloud project spark-2.4.2-bin-3.1.1.tgz
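The package then has to be unpacked on the target machine. A minimal sketch follows, assuming GNU or BSD tar; `extract_spark` is a hypothetical helper name, and /mnt/vdb is just the destination directory used in this post.

```shell
# Hypothetical helper: unpack a Spark distribution tarball into a destination directory.
# $1 = path to the .tgz file, $2 = destination directory (created if it does not exist)
extract_spark() {
  local tarball="$1" dest="$2"
  mkdir -p "$dest"
  tar -xzf "$tarball" -C "$dest"
}

# Example usage with the paths from this post:
#   extract_spark /root/spark-2.4.2/spark-2.4.2-bin-3.1.1.tgz /mnt/vdb
```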
After extracting spark-2.4.2-bin-3.1.1.tgz, the Hadoop dependency jars and their versions can be found in the extracted directory:
[root@hdp2 vdb]# cd spark-2.4.2-bin-3.1.1
[root@hdp2 spark-2.4.2-bin-3.1.1]# pwd
/mnt/vdb/spark-2.4.2-bin-3.1.1
[root@hdp2 spark-2.4.2-bin-3.1.1]# ls
bin conf data examples jars python README.md RELEASE sbin yarn
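To see which Hadoop version the distribution actually bundles, the jars directory can be filtered by file name. The sketch below assumes the Hadoop jars follow the usual `hadoop-*.jar` naming; `list_hadoop_jars` is a hypothetical helper name.

```shell
# Hypothetical helper: print the Hadoop jars bundled in a Spark distribution.
# $1 = path to the extracted distribution (assumed to contain a jars/ directory)
list_hadoop_jars() {
  local spark_home="$1"
  find "${spark_home}/jars" -maxdepth 1 -name 'hadoop-*.jar' -exec basename {} \; | sort
}

# Run the check only if the directory from this post exists on this machine.
if [ -d /mnt/vdb/spark-2.4.2-bin-3.1.1/jars ]; then
  list_hadoop_jars /mnt/vdb/spark-2.4.2-bin-3.1.1
fi
```

The version suffix in each printed file name should match the value passed to -Dhadoop.version at build time.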
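Directory layout after extraction
bin: client-side scripts such as beeline; the files ending in .cmd can be deleted
conf: configuration file templates; copy and modify them when needed
data: sample test data
examples: example code; it is very well written and strongly recommended reading
jars: all jar files in one directory; unlike 1.x, which shipped only a few jars, 2.x splits them into separate jars (best practice)
LICENSE, licenses, NOTICE, python, README.md, RELEASE, and similar files and folders can all be deleted
sbin: server-side scripts, such as the cluster start and stop commands
yarn: YARN-related jars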
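The note about .cmd files in bin can be turned into a small cleanup step. This is only a sketch, assuming a find that supports `-delete` (GNU and BSD find both do); `remove_cmd_scripts` is a hypothetical helper name, and it touches nothing outside bin/.

```shell
# Hypothetical helper: delete the Windows *.cmd launchers from a distribution's bin/ directory.
# $1 = path to the extracted distribution
# Each deleted file is printed so the cleanup can be audited.
remove_cmd_scripts() {
  local spark_home="$1"
  find "${spark_home}/bin" -maxdepth 1 -type f -name '*.cmd' -print -delete
}

# Example usage (destructive, so left commented out):
#   remove_cmd_scripts /mnt/vdb/spark-2.4.2-bin-3.1.1
```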
Other notes:
# Give Maven 2 GB of memory
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
# Install compression libraries and tools before building
yum install -y snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop openssl openssl-devel
