Abstract: This article records in detail how to compile the hadoop-2.6.0-cdh5.7.0 source with compression support and deploy it in pseudo-distributed mode; it can be used for study or as a production reference.

1. Requirements and Design

1.1 Requirements

A Hadoop cluster deployed directly from the stock hadoop-2.6.0-cdh5.7.0.tar.gz does not support file compression, which is unacceptable in production, so the Hadoop source must be downloaded and recompiled with compression support.

1.2 High-Level Design

Download the Hadoop source and build it with Maven so that it supports compression, then deploy a pseudo-distributed cluster and verify that compression works.

2. Environment Requirements and Deployment Plan

2.1 Hardware Environment

One CentOS 6.x virtual machine

2.2 Software Environment:

Component   Version                            Baidu Netdisk link
vm          vm10                               https://pan.baidu.com/s/1N5i8p8htXz9H_v__YNV1lA  (extraction code: yasn)
centos      centos6.7                          https://pan.baidu.com/s/1Z_6AcQ_WnvKz1ga_VCSI9Q  (extraction code: a24x)
Hadoop      Hadoop-2.6.0-cdh5.7.0-src.tar.gz   https://pan.baidu.com/s/1uRMGIhLSL9QHT-Ee4F16jw  (extraction code: jb1d)
jdk         jdk-7u80-linux-x64.tar.gz          https://pan.baidu.com/s/1xSCQ8rjABVI-zDFQS5nCPA  (extraction code: lfze)
maven       apache-maven-3.3.9-bin.tar.gz      https://pan.baidu.com/s/1ddkdkLW7r7ahFZmgACGkVw  (extraction code: fdfz)
protobuf    protobuf-2.5.0.tar.gz              https://pan.baidu.com/s/1RSNZGd_ThwknMB3vDkEfhQ  (extraction code: hvc2)

Note:
1. The JDK used for compilation must be version 1.7; JDK 1.8 will make the build fail (a pitfall I hit myself).

3. Install CentOS

Refer to the separate guide "Installing CentOS 6.x in a VM, plus host and network configuration".

4. Compile Hadoop

4.1 Install the required dependencies

  [root@hadoop001 ~]# yum install -y svn ncurses-devel
  [root@hadoop001 ~]# yum install -y gcc gcc-c++ make cmake
  [root@hadoop001 ~]# yum install -y openssl openssl-devel svn ncurses-devel zlib-devel libtool
  [root@hadoop001 ~]# yum install -y snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop autoconf automake cmake

4.2 Create a user and upload the software

  [root@hadoop001 ~]# yum install -y lrzsz
  [root@hadoop001 ~]# useradd hadoop
  [root@hadoop001 ~]# su - hadoop
  [hadoop@hadoop001 ~]$ mkdir app soft source lib data maven_repo shell mysql
  [hadoop@hadoop001 ~]$ cd soft/
  [hadoop@hadoop001 soft]$ rz
  [hadoop@hadoop001 soft]$ ll
  total 202192
  -rw-r--r--. 1 hadoop hadoop   8491533 Apr 7 11:25 apache-maven-3.3.9-bin.tar.gz
  -rw-r--r--. 1 hadoop hadoop  42610549 Apr 6 16:55 hadoop-2.6.0-cdh5.7.0-src.tar.gz
  -rw-r--r--. 1 hadoop hadoop 153530841 Apr 7 11:12 jdk-7u80-linux-x64.tar.gz
  -rw-r--r--. 1 hadoop hadoop   2401901 Apr 7 11:31 protobuf-2.5.0.tar.gz

4.3 Install the JDK

  • Extract the package. The installation directory must be /usr/java; remember to change the owner to root after extracting
  [hadoop@hadoop001 soft]$ exit
  [root@hadoop001 ~]# mkdir /usr/java
  [root@hadoop001 ~]# tar -zxvf /home/hadoop/soft/jdk-7u80-linux-x64.tar.gz -C /usr/java
  [root@hadoop001 ~]# cd /usr/java/
  [root@hadoop001 java]# chown -R root:root jdk1.7.0_80
  • Add environment variables
  [root@hadoop001 jdk1.7.0_80]# vim /etc/profile
  # append the following two lines
  export JAVA_HOME=/usr/java/jdk1.7.0_80
  export PATH=$JAVA_HOME/bin:$PATH
  [root@hadoop001 jdk1.7.0_80]# source /etc/profile
  # verify that Java installed successfully
  [root@hadoop001 jdk1.7.0_80]# java -version
  java version "1.7.0_80"
  Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
  Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
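
Since using the wrong JDK only surfaces much later as a build failure (see the note in section 2.2), it is worth checking the version string before starting the long build. A minimal sanity-check sketch, not part of the original steps:

  # java -version prints to stderr, hence the 2>&1
  java -version 2>&1 | grep -q '"1\.7' \
      && echo "JDK 1.7 detected, OK" \
      || echo "WARNING: not JDK 1.7, the Hadoop build will likely fail"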

4.4 Install Maven

  • Extract
  [root@hadoop001 ~]# su - hadoop
  [hadoop@hadoop001 ~]$ tar -zxvf ~/soft/apache-maven-3.3.9-bin.tar.gz -C ~/app/
  • Add environment variables
  # edit the hadoop user's environment variables
  [hadoop@hadoop001 ~]$ vim ~/.bash_profile
  # add or modify the following; note that MAVEN_OPTS sets the memory Maven runs with, preventing build failures caused by too little memory
  export MAVEN_HOME=/home/hadoop/app/apache-maven-3.3.9
  export MAVEN_OPTS="-Xms1024m -Xmx1024m"
  export PATH=$MAVEN_HOME/bin:$PATH
  [hadoop@hadoop001 ~]$ source ~/.bash_profile
  [hadoop@hadoop001 ~]$ which mvn
  ~/app/apache-maven-3.3.9/bin/mvn
  • Configure Maven
  [hadoop@hadoop001 protobuf-2.5.0]$ vim ~/app/apache-maven-3.3.9/conf/settings.xml
  # set the location of Maven's local repository
  <localRepository>/home/hadoop/maven_repo/repo</localRepository>
  # add the Aliyun central-repository mirror; it must go between <mirrors> and </mirrors>
  <mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
  </mirror>
  • (Optional) Pre-populate the local repository with jars; on a slow network the first mvn build can take an extremely long time to download dependencies, or even fail
  # link to the jar bundle
  Link: https://pan.baidu.com/s/1vq4iVFqqyJNkYzg90bVrfg
  Extraction code: vugv
  # after downloading, upload it with rz and extract; mind the directory layout
  [hadoop@hadoop001 maven_repo]$ rz
  [hadoop@hadoop001 maven_repo]$ tar -zxvf repo.tar.gz
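
Maven only finds these jars if they sit at <localRepository>/<groupId-as-path>/<artifactId>/<version>/. A quick way to confirm the layout, using the CDH root POM that section 4.6 needs as a probe:

  ls -l /home/hadoop/maven_repo/repo/com/cloudera/cdh/cdh-root/5.7.0/
  # if this path does not exist, the archive was extracted at the wrong depth -- fix it before running mvn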

4.5 Install protobuf

  • Extract
  [hadoop@hadoop001 ~]$ tar -zxvf ~/soft/protobuf-2.5.0.tar.gz -C ~/app/
  • Build the software
  [hadoop@hadoop001 protobuf-2.5.0]$ cd ~/app/protobuf-2.5.0/
  # --prefix= specifies the directory the compiled output will be installed into
  [hadoop@hadoop001 protobuf-2.5.0]$ ./configure --prefix=/home/hadoop/app/protobuf-2.5.0
  # compile and install
  [hadoop@hadoop001 protobuf-2.5.0]$ make
  [hadoop@hadoop001 protobuf-2.5.0]$ make install
  • Add environment variables
  [hadoop@hadoop001 protobuf-2.5.0]$ vim ~/.bash_profile
  # append the following two lines; note the bin directory does not exist until after make install
  export PROTOBUF_HOME=/home/hadoop/app/protobuf-2.5.0
  export PATH=$PROTOBUF_HOME/bin:$PATH
  [hadoop@hadoop001 protobuf-2.5.0]$ source ~/.bash_profile
  # verify it took effect; "libprotoc 2.5.0" means success
  [hadoop@hadoop001 protobuf-2.5.0]$ protoc --version
  libprotoc 2.5.0

4.6 Compile Hadoop

  • Extract
  [hadoop@hadoop001 protobuf-2.5.0]$ tar -zxvf ~/soft/hadoop-2.6.0-cdh5.7.0-src.tar.gz -C ~/source/
  • Compile Hadoop with compression support: mvn clean package -Pdist,native -DskipTests -Dtar
  # enter the Hadoop source directory
  [hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ cd ~/source/hadoop-2.6.0-cdh5.7.0/
  # start the build; the first run downloads many dependency jars, so its speed depends on your network -- be patient (my own run took about 37 minutes, see the build summary below)
  [hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ mvn clean package -Pdist,native -DskipTests -Dtar
  • If the build aborts, the key part of the error looks like this (skip this step if there was no error):
  [FATAL] Non-resolvable parent POM for org.apache.hadoop:hadoop-main:2.6.0-cdh5.7.0: Could not transfer artifact com.cloudera.cdh:cdh-root:pom:5.7.0 from/to cdh.repo (https://repository.cloudera.com/artifactory/cloudera-repos): Remote host closed connection
  # analysis: Maven cannot download https://repository.cloudera.com/artifactory/cloudera-repos/com/cloudera/cdh/cdh-root/5.7.0/cdh-root-5.7.0.pom, even though the VM can ping the remote repository -- puzzling why
  # fix: go to the matching directory inside the local repository and fetch the file with wget (see the sketch below), then re-run the build; alternatively, do the optional step in 4.4 and drop the needed jars straight into the local repository
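
A sketch of that wget workaround; the paths follow the <localRepository> configured in section 4.4:

  # recreate the groupId/artifactId/version layout inside the local repository
  mkdir -p /home/hadoop/maven_repo/repo/com/cloudera/cdh/cdh-root/5.7.0
  cd /home/hadoop/maven_repo/repo/com/cloudera/cdh/cdh-root/5.7.0
  wget https://repository.cloudera.com/artifactory/cloudera-repos/com/cloudera/cdh/cdh-root/5.7.0/cdh-root-5.7.0.pom
  # then re-run the build from the source directory
  cd ~/source/hadoop-2.6.0-cdh5.7.0
  mvn clean package -Pdist,native -DskipTests -Dtar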
  • Inspect the built package: hadoop-2.6.0-cdh5.7.0.tar.gz
  # a BUILD SUCCESS line means the build succeeded
  [INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 13.592 s]
  [INFO] Apache Hadoop Tools Dist ........................... SUCCESS [ 12.042 s]
  [INFO] Apache Hadoop Tools ................................ SUCCESS [  0.094 s]
  [INFO] Apache Hadoop Distribution ......................... SUCCESS [01:49 min]
  [INFO] ------------------------------------------------------------------------
  [INFO] BUILD SUCCESS
  [INFO] ------------------------------------------------------------------------
  [INFO] Total time: 37:39 min
  [INFO] Finished at: 2019-04-07T16:48:42+08:00
  [INFO] Final Memory: 200M/989M
  [INFO] ------------------------------------------------------------------------
  [hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ ll /home/hadoop/source/hadoop-2.6.0-cdh5.7.0/hadoop-dist/target/
  total 564036
  drwxrwxr-x. 2 hadoop hadoop      4096 Apr 7 16:46 antrun
  drwxrwxr-x. 3 hadoop hadoop      4096 Apr 7 16:46 classes
  -rw-rw-r--. 1 hadoop hadoop      1998 Apr 7 16:46 dist-layout-stitching.sh
  -rw-rw-r--. 1 hadoop hadoop       690 Apr 7 16:47 dist-tar-stitching.sh
  drwxrwxr-x. 9 hadoop hadoop      4096 Apr 7 16:47 hadoop-2.6.0-cdh5.7.0
  -rw-rw-r--. 1 hadoop hadoop 191880143 Apr 7 16:47 hadoop-2.6.0-cdh5.7.0.tar.gz
  -rw-rw-r--. 1 hadoop hadoop      7314 Apr 7 16:47 hadoop-dist-2.6.0-cdh5.7.0.jar
  -rw-rw-r--. 1 hadoop hadoop 385618309 Apr 7 16:48 hadoop-dist-2.6.0-cdh5.7.0-javadoc.jar
  -rw-rw-r--. 1 hadoop hadoop      4855 Apr 7 16:47 hadoop-dist-2.6.0-cdh5.7.0-sources.jar
  -rw-rw-r--. 1 hadoop hadoop      4855 Apr 7 16:47 hadoop-dist-2.6.0-cdh5.7.0-test-sources.jar
  drwxrwxr-x. 2 hadoop hadoop      4096 Apr 7 16:47 javadoc-bundle-options
  drwxrwxr-x. 2 hadoop hadoop      4096 Apr 7 16:47 maven-archiver
  drwxrwxr-x. 3 hadoop hadoop      4096 Apr 7 16:46 maven-shared-archive-resources
  drwxrwxr-x. 3 hadoop hadoop      4096 Apr 7 16:46 test-classes
  drwxrwxr-x. 2 hadoop hadoop      4096 Apr 7 16:46 test-dir

5. Pseudo-Distributed Deployment

5.1 Extract the package

  [hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ cp /home/hadoop/source/hadoop-2.6.0-cdh5.7.0/hadoop-dist/target/hadoop-2.6.0-cdh5.7.0.tar.gz /home/hadoop/soft/
  [hadoop@hadoop001 ~]$ cd ~
  [hadoop@hadoop001 ~]$ tar -zxvf ~/soft/hadoop-2.6.0-cdh5.7.0.tar.gz -C ~/app/

5.2 Configure environment variables

  [hadoop@hadoop001 ~]$ vim ~/.bash_profile
  export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0
  export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
  [hadoop@hadoop001 ~]$ source ~/.bash_profile
  [hadoop@hadoop001 ~]$ which hadoop
  ~/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop
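
As an extra sanity check that the shell now resolves the freshly built distribution, you can print the version (an optional sketch, not in the original steps):

  hadoop version
  # the first line should report: Hadoop 2.6.0-cdh5.7.0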

5.3 Configure SSH

  [hadoop@hadoop001 ~]$ rm -rf ~/.ssh
  [hadoop@hadoop001 ~]$ ssh-keygen
  [hadoop@hadoop001 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  [hadoop@hadoop001 ~]$ chmod 600 ~/.ssh/authorized_keys
  # test that ssh works; the first connection asks whether to continue -- answer yes; on success the date is printed
  [hadoop@hadoop001 ~]$ ssh hadoop001 date

5.4 Edit the configuration files

  • Edit hadoop-env.sh
  [hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/hadoop-env.sh
  # change JAVA_HOME to the absolute path of the JDK installation
  export JAVA_HOME=/usr/java/jdk1.7.0_80
  • Edit core-site.xml
  [hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/core-site.xml
  # add the following property
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop001:9000</value>
  </property>
  • Edit hdfs-site.xml
  [hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/hdfs-site.xml
  # add the following properties
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop001:50090</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.https-address</name>
    <value>hadoop001:50091</value>
  </property>
  • Add the slaves node list
  [hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/slaves
  # add a single line with the following content
  hadoop001
  • Edit yarn-site.xml
  [hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/yarn-site.xml
  # add the following property
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  • Edit mapred-site.xml
  [hadoop@hadoop001 ~]$ cp ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml.template ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml
  [hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml
  # add the following property
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
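
Once all the files are edited, a quick way to confirm Hadoop actually picks the values up is the standard hdfs getconf utility (an optional check, not in the original steps):

  # expect hdfs://hadoop001:9000
  hdfs getconf -confKey fs.defaultFS
  # expect 1
  hdfs getconf -confKey dfs.replication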

5.5 Format the NameNode

  [hadoop@hadoop001 ~]$ hdfs namenode -format
  # "has been successfully formatted" in the output means the format succeeded
  19/04/07 17:42:31 INFO namenode.FSImage: Allocated new BlockPoolId: BP-565897555-192.168.175.135-1554630151139
  19/04/07 17:42:31 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
  19/04/07 17:42:32 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
  19/04/07 17:42:32 INFO util.ExitUtil: Exiting with status 0
  19/04/07 17:42:32 INFO namenode.NameNode: SHUTDOWN_MSG:
  /************************************************************
  SHUTDOWN_MSG: Shutting down NameNode at hadoop001/192.168.175.135
  ************************************************************/
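
Note that the log above shows the NameNode metadata landing under /tmp/hadoop-hadoop, which most systems clear on reboot. If you want the data to survive reboots, one option is to set hadoop.tmp.dir in core-site.xml before formatting; the path below is an illustrative choice (reusing the data directory created in 4.2), not part of the original steps:

  <!-- optional: keep HDFS data out of /tmp; the path is an illustrative choice -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/data/tmp</value>
  </property>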

5.6 Start Hadoop

  [hadoop@hadoop001 ~]$ start-all.sh
  This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
  Starting namenodes on [hadoop001]
  hadoop001: starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/logs/hadoop-hadoop-namenode-hadoop001.out
  hadoop001: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/logs/hadoop-hadoop-datanode-hadoop001.out
  Starting secondary namenodes [hadoop001]
  hadoop001: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/logs/hadoop-hadoop-secondarynamenode-hadoop001.out
  starting yarn daemons
  starting resourcemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/logs/yarn-hadoop-resourcemanager-hadoop001.out
  hadoop001: starting nodemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/logs/yarn-hadoop-nodemanager-hadoop001.out
  # jps should list the five daemons (plus Jps itself)
  [hadoop@hadoop001 ~]$ jps
  56598 SecondaryNameNode
  56933 Jps
  56896 NodeManager
  56801 ResourceManager
  56470 DataNode
  56376 NameNode

6. Verify Hadoop

6.1 HDFS verification

  [hadoop@hadoop001 ~]$ hdfs dfs -put ~/app/hadoop-2.6.0-cdh5.7.0/README.txt hdfs://hadoop001:9000/
  [hadoop@hadoop001 ~]$ hdfs dfs -ls /
  Found 1 items
  -rw-r--r-- 1 hadoop supergroup 1366 2019-04-07 17:52 /README.txt

6.2 YARN verification

  # run the bundled MapReduce example job
  [hadoop@hadoop001 ~]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar pi 4 10

6.3 Check the compression codecs

  # true means the codec is supported
  [hadoop@hadoop001 ~]$ hadoop checknative
  19/04/07 17:50:08 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
  19/04/07 17:50:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
  Native library checking:
  hadoop:  true /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/libhadoop.so.1.0.0
  zlib:    true /lib64/libz.so.1
  snappy:  true /usr/lib64/libsnappy.so.1
  lz4:     true revision:99
  bzip2:   true /lib64/libbz2.so.1
  openssl: true /usr/lib64/libcrypto.so
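
checknative only proves the native libraries load; to exercise compression end to end you can run a job with compressed output. A sketch using the standard Hadoop 2.x output-compression properties and the README.txt uploaded in 6.1 (the output path /out_snappy is my own choice):

  hadoop jar ~/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar \
    wordcount \
    -Dmapreduce.output.fileoutputformat.compress=true \
    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    /README.txt /out_snappy
  # the part files should now carry a .snappy extension
  hdfs dfs -ls /out_snappy
  # -text decompresses transparently, so readable word counts confirm the codec works
  hdfs dfs -text /out_snappy/part-r-00000.snappy | head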

Extension 1: What is protobuf?

Protobuf is a lightweight, efficient data format, similar in spirit to JSON: platform-neutral, language-neutral, and extensible, usable for communication protocols, data storage, and more.
Advantages:
platform-neutral, language-neutral, extensible;
friendly runtime libraries that are simple to use;
fast parsing, roughly 20-100x faster than the equivalent XML;
very compact serialized output, roughly 1/3 to 1/10 the size of the equivalent XML.
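
For a concrete feel, here is a minimal illustrative message definition (the names are made up for this example; protobuf 2.5.0 uses the proto2 syntax shown):

  // greeting.proto -- illustrative only
  message Greeting {
    required string name  = 1;                // field numbers identify fields on the wire
    optional int32  times = 2 [default = 1];
  }

Compiling it with protoc --java_out=. greeting.proto generates the Java classes; this is essentially what the Hadoop build does with its own .proto files, which is why protobuf 2.5.0 from section 4.5 is a build prerequisite.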

Extension 2: mvn can also produce debug output, via the mvn -X option.
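
For example, to capture full debug output of the Hadoop build for troubleshooting (the log file name here is an arbitrary choice):

  mvn -X clean package -Pdist,native -DskipTests -Dtar > build_debug.log 2>&1
  # then search the log for the failure, e.g.:
  grep -n 'ERROR\|FATAL' build_debug.log | head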