Hadoop Overview

Hadoop 2.x Components

Hadoop consists of four modules: Common, HDFS, YARN, and MapReduce.

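Common: shared utilities
HDFS: data storage
- NameNode: stores file metadata (file names, directory tree, permissions, block list); think of it as the table of contents
- DataNode: stores the actual file blocks on its local file system
- SecondaryNameNode: helper daemon that periodically merges the NameNode's edit log into the fsimage (checkpointing); it is not a hot standby for the NameNode
YARN: resource scheduling
- ResourceManager (RM): handles client requests, monitors NodeManagers, monitors ApplicationMasters, and schedules and allocates resources
- NodeManager (NM): manages the resources of a single node and executes commands from the RM and ApplicationMasters
- ApplicationMaster (AM): manages one application: requests resources from the RM, hands them to its tasks, and monitors those tasks and handles their failures
- Container: resource abstraction that packages a node's memory, CPU, disk, and network for a task
MapReduce: computation
- Map phase: processes the input data in parallel
- Reduce phase: aggregates the Map results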
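
The two phases can be illustrated without Hadoop at all. The sketch below is only an analogy built from standard Unix tools, not how Hadoop is implemented: `tr` plays the map step (emit one word per line), `sort` stands in for the shuffle that groups equal keys, and `uniq -c` plays the reduce step. The function name `local_wordcount` is made up for this illustration.

```shell
#!/bin/bash
# Analogy only: word count as map -> shuffle -> reduce on a single machine.
local_wordcount() {
    tr -s '[:space:]' '\n' |        # "map": split the input into one word per line
        grep -v '^$' |              # drop the empty line left by leading whitespace
        LC_ALL=C sort |             # "shuffle": bring identical words together
        uniq -c |                   # "reduce": count each group
        awk '{ print $2 "\t" $1 }'  # print as word<TAB>count
}

printf 'hadoop yarn\nhadoop hdfs\n' | local_wordcount
```

In a real job the map and reduce steps run on many machines at once, which a single pipeline cannot show; only the data flow is the same.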

Virtual Machine Environment Setup

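On CentOS 7, choosing the GNOME desktop does not let you customize which packages are installed, so the system ships with both OpenJDK 1.7 and 1.8, with 1.8 as the default. Both need to be removed.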

First list the installed Java packages:

[root@hadoop101 ~]# rpm -qa|grep java

Then remove each listed package by its full name:

[root@hadoop101 ~]# rpm -e --nodeps java-1.8.0-openjdk-1.8.0.242.b08-1.el7.x86_64
[root@hadoop101 ~]# rpm -e --nodeps java-1.7.0-openjdk-1.7.0.251-2.6.21.1.el7.x86_64

Next, configure the network:

vim /etc/sysconfig/network-scripts/ifcfg-ens33

TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPV6_PRIVACY="no"
IPADDR=192.168.133.130
GATEWAY=192.168.133.2
NETMASK=255.255.255.0
DNS1=192.168.133.2

Stop the firewall:

[root@hadoop101 ~]# systemctl stop firewalld.service

Disable the firewall at boot:

[root@hadoop101 ~]# systemctl disable firewalld.service

Next, map the virtual machines' IP addresses to their hostnames by editing /etc/hosts:

[root@hadoop104 opt]# vim /etc/hosts

Add the following:

192.168.133.131 hadoop102
192.168.133.132 hadoop103
192.168.133.133 hadoop104
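
The three entries follow a simple pattern, so they can also be generated. This is a small sketch that assumes the numbering used in these notes (hadoopN gets 192.168.133.(N+29), which also matches hadoop101 at .130); the function name `hosts_entries` is hypothetical.

```shell
#!/bin/bash
# Print /etc/hosts lines for hadoop<first>..hadoop<last>.
# Assumption from these notes: hadoopN has the address 192.168.133.(N+29).
hosts_entries() {
    local first=$1 last=$2 n
    for ((n = first; n <= last; n++)); do
        echo "192.168.133.$((n + 29)) hadoop$n"
    done
}

hosts_entries 102 104
```

Appending the output with `hosts_entries 102 104 >> /etc/hosts` on each machine would give the same result as editing the file by hand.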

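The virtual machine can now be cloned.

After cloning, only the IP address and the hostname need to change; everything else stays the same.

To change the IP address, edit the interface file: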

[root@hadoop101 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33

To change the hostname, first check the current one; the clone is still called hadoop101, like the machine it was cloned from:

[root@hadoop101 ~]# hostname

Change it to hadoop102:

[root@hadoop101 ~]# hostnamectl set-hostname hadoop102

Then reboot the virtual machine.

Installing the JDK and Hadoop

Go to /opt/, create two directories, software and module, and delete the unused rh/ directory:

[root@hadoop102 /]# cd /opt/
[root@hadoop102 opt]# mkdir software
[root@hadoop102 opt]# mkdir module
[root@hadoop102 opt]# rm -rf rh/

software holds the installation packages; module is the install location.

Check the contents of /opt/:

[root@hadoop102 opt]# ll
总用量 0
drwxr-xr-x. 2 root root 6 10月 21 14:22 module
drwxr-xr-x. 2 root root 6 10月 21 14:21 software

Then put the JDK and Hadoop tarballs into the software directory.

Contents of software:

[root@hadoop102 opt]# cd software/
[root@hadoop102 software]# ll
总用量 377944
-rw-r--r--. 1 root root 243900138 10月 21 14:28 hadoop-2.8.2.tar.gz
-rw-r--r--. 1 root root 143111803 10月 21 14:28 jdk-8u261-linux-x64.tar.gz

Extract both tarballs into module:

[root@hadoop102 software]# tar -zxvf jdk-8u261-linux-x64.tar.gz -C /opt/module/
[root@hadoop102 software]# tar -zxvf hadoop-2.8.2.tar.gz -C /opt/module/

Check the module directory:

[root@hadoop102 software]# cd ..
[root@hadoop102 opt]# cd module/
[root@hadoop102 module]# ll
总用量 0
drwxr-xr-x. 9   502 dialout 149 10月 20 2017 hadoop-2.8.2
drwxr-xr-x. 8 10143   10143 273 6月  18 14:59 jdk1.8.0_261

Configure the environment variables by editing /etc/profile:

[root@hadoop102 jdk1.8.0_261]# vim /etc/profile

Append the following at the end of the file:

##JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_261
export PATH=$PATH:$JAVA_HOME/bin
##HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-2.8.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Then reload the file:

[root@hadoop102 hadoop-2.8.2]# source /etc/profile

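Java and Hadoop are now usable; run java -version and hadoop to verify the installation.

To set up the other machines, the installed directories can also be copied over with scp or rsync. scp always copies everything, while rsync transfers only the files that differ, so it is faster on repeated runs.

In addition, the xsync script written later in these notes loops over the nodes and copies a file to the same directory on each of them.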

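Related Virtual Machine Configuration

After installing Java and Hadoop and configuring the environment variables, create an input directory under /opt/module/hadoop-2.8.2 with mkdir input. As before, /opt/software/ holds the Java and Hadoop packages and /opt/module/ is where they are installed; the versions used here are Hadoop 2.8.2 and JDK 1.8.

Copy the .xml files into input: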

[root@hadoop-1 hadoop-2.8.2]# cp etc/hadoop/*.xml input/

Check the result:

[root@hadoop-1 hadoop-2.8.2]# ls input/
capacity-scheduler.xml  core-site.xml  hadoop-policy.xml  hdfs-site.xml  httpfs-site.xml  kms-acls.xml  kms-site.xml  yarn-site.xml

Run the example jar:

[root@hadoop-1 hadoop-2.8.2]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input/ output 'dfs[a-z.]+'

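Note: the job reads its input from input and writes its results to output. input is the directory created above; output must not exist beforehand, because the job creates it itself and fails if it is already there.

Check that output was created: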

[root@hadoop-1 hadoop-2.8.2]# ll
总用量 124
drwxr-xr-x. 2  502 dialout   194 10月 20 2017 bin
drwxr-xr-x. 3  502 dialout    20 10月 20 2017 etc
drwxr-xr-x. 2  502 dialout   106 10月 20 2017 include
drwxr-xr-x. 2 root root      187 10月 21 09:31 input
drwxr-xr-x. 3  502 dialout    20 10月 20 2017 lib
drwxr-xr-x. 2  502 dialout   239 10月 20 2017 libexec
-rw-r--r--. 1  502 dialout 99253 10月 20 2017 LICENSE.txt
-rw-r--r--. 1  502 dialout 15915 10月 20 2017 NOTICE.txt
drwxr-xr-x. 2 root root       88 10月 21 09:44 output
-rw-r--r--. 1  502 dialout  1366 10月 20 2017 README.txt
drwxr-xr-x. 2  502 dialout  4096 10月 20 2017 sbin
drwxr-xr-x. 4  502 dialout    31 10月 20 2017 share

Then go into output:

[root@hadoop-1 hadoop-2.8.2]# cd output/
[root@hadoop-1 output]# ll
总用量 4
-rw-r--r--. 1 root root 11 10月 21 09:44 part-r-00000
-rw-r--r--. 1 root root  0 10月 21 09:44 _SUCCESS

View the contents of part-r-00000:

[root@hadoop-1 output]# cat part-r-00000
1    dfsadmin

If the content looks like this, the job succeeded.
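
The result can be cross-checked locally, because the grep example counts how often each match of the regular expression occurs in the input files. The sketch below does the same counting with standard tools; `count_matches` is a made-up helper name, and the output layout (count, tab, match) is chosen to resemble part-r-00000 rather than taken from Hadoop's source.

```shell
#!/bin/bash
# Count every match of an extended regex across files, most frequent first.
# Output format: count<TAB>match
count_matches() {
    local pattern=$1
    shift
    grep -ohE -- "$pattern" "$@" |
        LC_ALL=C sort |
        uniq -c |
        sort -k1,1nr -k2,2 |
        awk '{ print $1 "\t" $2 }'
}

# Usage, from /opt/module/hadoop-2.8.2 once input/ has been filled:
#   count_matches 'dfs[a-z.]+' input/*.xml
```

If the local counts and the job output disagree, the input directory most likely changed between the two runs.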

Word Count Example

Under the Hadoop directory, create a wcinput directory, create a file wc.input in it, and write some arbitrary words into the file:

[root@hadoop-1 hadoop-2.8.2]# mkdir wcinput
[root@hadoop-1 hadoop-2.8.2]# cd wcinput/
[root@hadoop-1 wcinput]# ll
总用量 0
[root@hadoop-1 wcinput]# touch wc.input
[root@hadoop-1 wcinput]# vim wc.input

Then run the job from the Hadoop directory:

[root@hadoop-1 hadoop-2.8.2]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount wcinput/ wcoutput

As in the previous example, wcinput/ already exists and is the input; wcoutput does not exist yet and is where the output goes.

Go into wcoutput to check:

[root@hadoop-1 hadoop-2.8.2]# cd wcoutput/
[root@hadoop-1 wcoutput]# ll
总用量 4
-rw-r--r--. 1 root root 131 10月 21 10:27 part-r-00000
-rw-r--r--. 1 root root   0 10月 21 10:27 _SUCCESS

View part-r-00000, which holds the word counts:

[root@hadoop-1 wcoutput]# cat part-r-00000

Cluster Configuration

The cluster has three machines: hadoop102, hadoop103, and hadoop104.

	hadoop102	hadoop103	hadoop104
HDFS	NameNode, DataNode	DataNode	SecondaryNameNode, DataNode
YARN	NodeManager	ResourceManager, NodeManager	NodeManager

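The xsync Script

Create the file first and put it in a bin directory under the home directory:

[root@hadoop102 ~]# mkdir bin
[root@hadoop102 ~]# cd bin
[root@hadoop102 bin]# touch xsync

The file contents: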

[root@hadoop102 bin]# vim xsync

#!/bin/bash
# 1. Get the number of arguments; exit if there are none
pcount=$#
if ((pcount == 0)); then
    echo "no args"
    exit 1
fi
# 2. Get the file name
p1=$1
fname=$(basename "$p1")
echo "fname=$fname"
# 3. Resolve the parent directory to an absolute path
pdir=$(cd -P "$(dirname "$p1")" && pwd)
echo "pdir=$pdir"
# 4. Get the current user name
user=$(whoami)
# 5. Copy to hadoop103 and hadoop104
for ((host = 103; host < 105; host++)); do
    echo "------------------- hadoop$host --------------"
    rsync -rvl "$pdir/$fname" "$user@hadoop$host:$pdir"
done

Make it executable:

[root@hadoop102 bin]# chmod 777 xsync
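
Steps 2 and 3 of the script are what let xsync be called with a relative path from any directory. The sketch below isolates just that part so it can be tried without copying anything; `resolve_target` is a hypothetical name used only here.

```shell
#!/bin/bash
# Print "<absolute parent directory> <file name>" for a path,
# using the same basename / dirname / cd -P steps as xsync.
resolve_target() {
    local path=$1 fname pdir
    fname=$(basename "$path")
    pdir=$(cd -P "$(dirname "$path")" && pwd) || return 1
    echo "$pdir $fname"
}

# Example: run from /opt/module/hadoop-2.8.2/etc
#   resolve_target hadoop/
```

Because of cd -P, a symbolic link in the parent path is resolved to the real directory, so the destination on the remote machine is the physical path rather than the link.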

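Configuring hadoop102

First, every machine must be able to resolve each hostname to its IP address: add the IP addresses and hostnames to /etc/hosts on every machine.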

[root@hadoop102 ~]# vim /etc/hosts
192.168.133.131 hadoop102
192.168.133.132 hadoop103
192.168.133.133 hadoop104

Configure hadoop102 first. Go to the Hadoop configuration directory:

[root@hadoop102 /]# cd /opt/module/hadoop-2.8.2/
[root@hadoop102 hadoop-2.8.2]# cd etc/hadoop/

Configuring HDFS

Edit hadoop-env.sh to set JAVA_HOME. First look up its value:

[root@hadoop102 ~]# echo $JAVA_HOME 
/opt/module/jdk1.8.0_261

Then edit the file:

[root@hadoop102 hadoop]# vim hadoop-env.sh

export JAVA_HOME=/opt/module/jdk1.8.0_261

Next, edit core-site.xml:

[root@hadoop102 hadoop]# vim core-site.xml

Add:

<configuration>
        <!-- Address of the NameNode in HDFS -->
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop102:9000</value>
        </property>
        <!-- Directory for the files Hadoop generates at runtime -->
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/opt/module/hadoop-2.8.2/data/tmp</value>
        </property>
</configuration>

Edit the HDFS configuration file hdfs-site.xml:

[root@hadoop102 hadoop]# vim hdfs-site.xml

<configuration>
    <!--<property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>-->
    <!-- Host and port of the SecondaryNameNode -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:50090</value>
    </property>
</configuration>

Configuring YARN

Edit yarn-env.sh:

[root@hadoop102 hadoop]# vim yarn-env.sh

export JAVA_HOME=/opt/module/jdk1.8.0_261

Edit yarn-site.xml:

[root@hadoop102 hadoop]# vim yarn-site.xml

<configuration>
    <!-- How reducers fetch data -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Host of the YARN ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Optional, disabled here: enable log aggregation and keep logs for 7 days (604800 seconds).
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    -->
</configuration>

Configuring MapReduce

Edit mapred-env.sh:

[root@hadoop102 hadoop]# vim mapred-env.sh

export JAVA_HOME=/opt/module/jdk1.8.0_261

Edit mapred-site.xml. First copy mapred-site.xml.template to mapred-site.xml, then edit the copy:

[root@hadoop102 hadoop]# cp mapred-site.xml.template mapred-site.xml
[root@hadoop102 hadoop]# vim mapred-site.xml

<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Optional, disabled here: JobHistory server address and its web UI address.
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop102:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop102:19888</value>
    </property>
    -->
</configuration>

Sync the configuration above to hadoop103 and hadoop104 with the xsync script written earlier:

[root@hadoop102 hadoop]# cd ..
[root@hadoop102 etc]# xsync hadoop/

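Starting HDFS

Format the NameNode first. Before formatting, delete the data/ and logs/ directories on hadoop102, hadoop103, and hadoop104 (if any daemons are already running, stop them before deleting). Then run the format command on hadoop102: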

[root@hadoop102 hadoop-2.8.2]# rm -rf data/ logs/

[root@hadoop102 hadoop-2.8.2]# bin/hdfs namenode -format

Start the daemons on hadoop102:

[root@hadoop102 hadoop-2.8.2]# sbin/hadoop-daemon.sh start namenode
[root@hadoop102 hadoop-2.8.2]# sbin/hadoop-daemon.sh start datanode

Start the DataNode on hadoop103:

[root@hadoop103 hadoop-2.8.2]# sbin/hadoop-daemon.sh start datanode

Start the DataNode on hadoop104:

[root@hadoop104 hadoop-2.8.2]# sbin/hadoop-daemon.sh start datanode

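The NameNode web UI should now open in a browser on the host machine at http://hadoop102:50070 (the host machine must be able to resolve hadoop102; otherwise use the IP address).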

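Configuring SSH

hadoop102 runs the NameNode and has to reach the other machines to start their daemons, so it needs passwordless login to them.

On hadoop102, go to the .ssh directory: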

[root@hadoop102 ~]# ls -all
[root@hadoop102 ~]# cd .ssh/

Generate a key pair:

[root@hadoop102 .ssh]# ssh-keygen -t rsa

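After running the command, press Enter three times to accept the defaults.

.ssh/ now contains two new files: id_rsa holds the private key and id_rsa.pub holds the public key. known_hosts, which was already there, records the host keys of the servers this machine has connected to.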

[root@hadoop102 .ssh]# ll
总用量 12
-rw-------. 1 root root 1679 10月 22 08:57 id_rsa
-rw-r--r--. 1 root root  396 10月 22 08:57 id_rsa.pub
-rw-r--r--. 1 root root  374 10月 21 17:42 known_hosts

Copy the public key to hadoop103:

[root@hadoop102 .ssh]# ssh-copy-id hadoop103

The first time, you are prompted for hadoop103's password.

hadoop103's .ssh/ now has a new file, authorized_keys, which contains hadoop102's public key:

[root@hadoop103 .ssh]# ll
总用量 4
-rw-------. 1 root root 396 10月 22 09:02 authorized_keys

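Copy the key the same way to hadoop102 (the machine itself also needs it) and to hadoop104.

hadoop102 can now reach hadoop103 and hadoop104 without a password.

hadoop103 runs the ResourceManager, so it needs the same SSH setup as well.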
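
Since the same ssh-copy-id step repeats for every target on both hadoop102 and hadoop103, it helps to list the commands before running them. The following is a dry-run sketch: it only prints the commands, and the helper name `print_key_distribution` is invented for this note.

```shell
#!/bin/bash
# Print the ssh-copy-id commands one node has to run; nothing is executed.
print_key_distribution() {
    local host
    for host in "$@"; do
        echo "ssh-copy-id $host"
    done
}

# Run on hadoop102 and again on hadoop103, each after its own ssh-keygen:
print_key_distribution hadoop102 hadoop103 hadoop104
```

Each printed command still asks for the target's password once, so the list is meant to be run interactively rather than from an unattended script.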

Starting the Whole Cluster

Configuring slaves

On hadoop102, edit slaves:

[root@hadoop102 ~]# cd /opt/module/hadoop-2.8.2/etc/hadoop/
[root@hadoop102 hadoop]# vim slaves

hadoop102
hadoop103
hadoop104

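Delete everything else in the file and add the hostnames, one per line, with no extra spaces. Then sync the file to the other nodes with xsync, since hadoop103 reads its own copy when it starts YARN.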
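
Stray whitespace in slaves is easy to miss in an editor. The sketch below flags blank lines and lines with leading or trailing whitespace; `check_slaves_file` is a hypothetical helper, and which of these Hadoop actually rejects is not verified here, it simply enforces the rule stated above.

```shell
#!/bin/bash
# Print offending lines (with line numbers) of a slaves-style file:
# blank lines, or lines that start or end with whitespace.
# Returns 0 if the file is clean, 1 if problems were found, 2 if unreadable.
check_slaves_file() {
    local file=$1
    [ -r "$file" ] || return 2
    if grep -nE '^$|^[[:space:]]|[[:space:]]$' "$file"; then
        return 1
    fi
    return 0
}

# Usage: check_slaves_file /opt/module/hadoop-2.8.2/etc/hadoop/slaves && echo "slaves looks clean"
```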

Check the running processes first:

[root@hadoop102 hadoop-2.8.2]# jps
56437 NameNode
67051 Jps
56543 DataNode

Stop the DataNode and the NameNode:

[root@hadoop102 hadoop-2.8.2]# sbin/hadoop-daemon.sh stop datanode
[root@hadoop102 hadoop-2.8.2]# sbin/hadoop-daemon.sh stop namenode

On hadoop103 and hadoop104, stop the DataNode:

[root@hadoop103 hadoop-2.8.2]# sbin/hadoop-daemon.sh stop datanode

All daemons have now shut down cleanly.

Starting HDFS

Start HDFS for the whole cluster from hadoop102:

[root@hadoop102 hadoop-2.8.2]# sbin/start-dfs.sh

Starting YARN

Start YARN on hadoop103. Starting it from either of the other two machines fails, because the ResourceManager is configured on hadoop103.

[root@hadoop103 hadoop-2.8.2]# sbin/start-yarn.sh
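
After both start scripts have run, jps on each machine should show the daemons planned in the cluster table. The sketch below encodes that table and reports what is missing from a given jps output; the function names are hypothetical, and the jps text is passed in as a string so the check can be tried without a running cluster.

```shell
#!/bin/bash
# Daemons each host should run, taken from the cluster plan in these notes.
expected_daemons() {
    case "$1" in
        hadoop102) echo "NameNode DataNode NodeManager" ;;
        hadoop103) echo "DataNode ResourceManager NodeManager" ;;
        hadoop104) echo "SecondaryNameNode DataNode NodeManager" ;;
        *) return 1 ;;
    esac
}

# Print every expected daemon that does not appear in the given jps output.
missing_daemons() {
    local host=$1 jps_output=$2 daemon
    for daemon in $(expected_daemons "$host"); do
        # jps prints "<pid> <name>"; compare the whole name so that
        # SecondaryNameNode is not mistaken for NameNode.
        if ! awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' <<< "$jps_output"; then
            echo "$daemon"
        fi
    done
}

# Usage on a node: missing_daemons "$(hostname)" "$(jps)"
```

An empty result means the node matches the plan; any printed name is a daemon to look up in the logs/ directory.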

Testing the Cluster

Test a file upload by putting README.txt into the HDFS root directory:

[root@hadoop102 hadoop-2.8.2]# bin/hdfs dfs -put README.txt /

List the root directory:

[root@hadoop102 hadoop-2.8.2]# hadoop fs -ls /

The data is actually stored on the Linux file system, as the file blk_1073741825 in the following directory:

[root@hadoop102 subdir0]# pwd
/opt/module/hadoop-2.8.2/data/tmp/dfs/data/current/BP-1924141108-192.168.133.131-1603273870979/current/finalized/subdir0/subdir0

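Open up the permissions on the uploaded files (otherwise they cannot be deleted); this is only reasonable on a test cluster: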

[root@hadoop102 hadoop-2.8.2]# hadoop fs -chmod -R 777 /

Scheduled Tasks

Append a + sign to /opt/module/hadoop-2.8.2/bailong.txt every minute:

[root@hadoop102 hadoop-2.8.2]# crontab -e

*/1 * * * * /bin/echo "+" >> /opt/module/hadoop-2.8.2/bailong.txt

Restart the cron service:

[root@hadoop102 hadoop-2.8.2]# service crond restart

Check the file:

[root@hadoop102 hadoop-2.8.2]# tail bailong.txt

The current crontab entries can be listed as well:

[root@hadoop102 hadoop-2.8.2]# crontab -l
*/1 * * * * /bin/echo "+" >> /opt/module/hadoop-2.8.2/bailong.txt

When the task is no longer needed, remove it (crontab -r deletes the user's whole crontab):

[root@hadoop102 hadoop-2.8.2]# crontab -r
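
The first two crontab fields are the minute and the hour, and they are matched independently, which makes it easy to write a schedule that fires far more often than intended. The sketch below models only three field forms (`*`, `*/n`, and a plain number); real cron also accepts lists and ranges, and the function names are hypothetical.

```shell
#!/bin/bash
# Return 0 if one cron field ("*", "*/n", or a plain number) matches a value.
cron_field_matches() {
    local field=$1 value=$2
    case "$field" in
        '*') return 0 ;;
        '*/'*) (( 10#$value % 10#${field#*/} == 0 )) ;;
        *) (( 10#$field == 10#$value )) ;;
    esac
}

# Would an entry with these minute and hour fields (date fields all "*")
# fire at the given hour and minute?
fires_at() {
    local minute_field=$1 hour_field=$2 hour=$3 minute=$4
    cron_field_matches "$minute_field" "$minute" && cron_field_matches "$hour_field" "$hour"
}

fires_at '*' '*/1' 9 41 && echo "'* */1' fires at 09:41: a * minute field means every minute"
fires_at '0' '*/1' 9 41 || echo "'0 */1' does not fire at 09:41: it fires only on the hour"
```

So an hourly job needs a fixed minute such as 0 in the first field, and a job every 10 minutes is written `*/10` in the first field with `*` in the second.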

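Cluster Time Synchronization

Use hadoop102 as the time server and have all other machines synchronize with it, for example every 10 minutes. These steps require root.

Check whether ntp is installed. NTP is the Network Time Protocol, which is what keeps the clocks in sync: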

[root@hadoop102 hadoop-2.8.2]# rpm -qa | grep ntp

Edit the ntp configuration file:

[root@hadoop102 hadoop-2.8.2]# vim /etc/ntp.conf

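Make three changes: allow every machine on the subnet to query this server; comment out the public servers, since the cluster lives on a LAN; and append the local clock as a source, so that the node can still serve time when the network is down: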

restrict 192.168.133.0 mask 255.255.255.0 nomodify notrap
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
server 127.127.1.0
fudge 127.127.1.0 stratum 10

Edit /etc/sysconfig/ntpd so that the hardware clock is kept in sync with the system time:

[root@hadoop102 hadoop-2.8.2]# vim /etc/sysconfig/ntpd

SYNC_HWCLOCK=yes

Check the ntpd status, then start the service:

[root@hadoop102 hadoop-2.8.2]# service ntpd status
[root@hadoop102 hadoop-2.8.2]# service ntpd start

Enable ntpd at boot:

[root@hadoop102 hadoop-2.8.2]# chkconfig ntpd on

hadoop102 now allows the other machines on the subnet to synchronize with it.

Configuring the other machines also requires root:

[root@hadoop103 hadoop-2.8.2]# crontab -e

Add the following entry, which synchronizes with hadoop102 every 10 minutes:

*/10 * * * * /usr/sbin/ntpdate hadoop102

Adding a Whitelist

On hadoop102, create the dfs.hosts file:

[root@hadoop102 hadoop]# pwd
/opt/module/hadoop-2.8.2/etc/hadoop
[root@hadoop102 hadoop]# touch dfs.hosts
[root@hadoop102 hadoop]# vi dfs.hosts

Add the hostnames to whitelist:

hadoop102
hadoop103
hadoop104

Edit hdfs-site.xml to enable the whitelist, adding this property inside <configuration>:

[root@hadoop102 hadoop]# vi hdfs-site.xml

<property>
    <name>dfs.hosts</name>
    <value>/opt/module/hadoop-2.8.2/etc/hadoop/dfs.hosts</value>
</property>

Distribute the configuration:

[root@hadoop102 hadoop]# xsync hdfs-site.xml

Refresh the NameNode:

[root@hadoop102 hadoop]# hdfs dfsadmin -refreshNodes

Refresh the ResourceManager:

[root@hadoop102 hadoop]# yarn rmadmin -refreshNodes

If the data is unbalanced afterwards, rebalance the cluster:

[root@hadoop102 sbin]# start-balancer.sh
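
Once the whitelist is active, a host that appears in slaves but not in dfs.hosts is no longer accepted as a DataNode, so it is worth comparing the two files before refreshing. A minimal sketch, with the hypothetical helper name `hosts_not_whitelisted`:

```shell
#!/bin/bash
# Print hosts listed in the slaves file that are missing from the whitelist file.
hosts_not_whitelisted() {
    local slaves_file=$1 whitelist_file=$2
    # -x: match whole lines, -F: fixed strings, -f: read patterns from the whitelist
    grep -v '^$' "$slaves_file" | grep -vxFf "$whitelist_file" || true
}

# Usage, from /opt/module/hadoop-2.8.2/etc/hadoop:
#   hosts_not_whitelisted slaves dfs.hosts
```

No output means every worker in slaves is whitelisted.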

Blacklist

On hadoop102, create the dfs.hosts.exclude file and add the hostnames of the machines to retire:

[root@hadoop102 hadoop-2.8.2]# cd etc/hadoop/
[root@hadoop102 hadoop]# touch dfs.hosts.exclude
[root@hadoop102 hadoop]# vi dfs.hosts.exclude

hadoopexit

Edit hdfs-site.xml to enable the blacklist, adding this property inside <configuration>:

[root@hadoop102 hadoop]# vi hdfs-site.xml

<property>
    <name>dfs.hosts.exclude</name>
    <value>/opt/module/hadoop-2.8.2/etc/hadoop/dfs.hosts.exclude</value>
</property>

Refresh the NameNode and the ResourceManager, then shut down the daemons on the retired machine in the normal way.

Build Environment

hadoop-2.8.2-src.tar.gz
jdk-8u261-linux-x64.gz
apache-ant-1.9.9-bin.tar.gz
apache-maven-3.0.5-bin.tar.gz
protobuf-2.5.0.tar.gz

HDFS

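Common Commands

Start HDFS for the cluster: sbin/start-dfs.sh
Start YARN for the cluster: sbin/start-yarn.sh
Show help for a command: hadoop fs -help rm
List a directory: hadoop fs -ls /
Create a directory in HDFS: hadoop fs -mkdir -p /user
Move a local file into HDFS: hadoop fs -moveFromLocal x.txt /user
Append a local file to the end of an existing HDFS file: hadoop fs -appendToFile x.txt /user/xx.txt
Print a file: hadoop fs -cat x.txt
Move a file within HDFS: hadoop fs -mv x.txt /user
Download to the local file system: hadoop fs -copyToLocal x.txt /local or hadoop fs -get x.txt /local
Delete a file: hadoop fs -rm x.txt (add -r to delete a directory and its contents)
Delete an empty directory: hadoop fs -rmdir /user
Set the replication factor of an HDFS file: hadoop fs -setrep num x.txt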

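IDEA Integration

First configure the environment variables, then create a new Maven project.

Add the pom dependencies, making sure the versions match your Hadoop version: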

<dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.12.1</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.16.18</version>
        </dependency>
        <!--hadoop-->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.2</version>
        </dependency>
    </dependencies>

A logging configuration can be added too: create log4j.properties under resources.

### Settings ###
log4j.rootLogger = debug,stdout,D,E
### Log to the console ###
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n
### Log DEBUG and above to E://logs/debug.log ###
log4j.appender.D = org.apache.log4j.DailyRollingFileAppender
log4j.appender.D.File = E://logs/debug.log
log4j.appender.D.Append = true
log4j.appender.D.Threshold = DEBUG
log4j.appender.D.layout = org.apache.log4j.PatternLayout
log4j.appender.D.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss}  [ %t:%r ] - [ %p ]  %m%n
### Log ERROR and above to E://logs/error.log ###
log4j.appender.E = org.apache.log4j.DailyRollingFileAppender
log4j.appender.E.File =E://logs/error.log
log4j.appender.E.Append = true
log4j.appender.E.Threshold = ERROR
log4j.appender.E.layout = org.apache.log4j.PatternLayout
log4j.appender.E.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss}  [ %t:%r ] - [ %p ]  %m%n

Test creating a directory in HDFS with the class HdfsClient.java:

package hdfs;
import lombok.extern.slf4j.Slf4j;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
/**
 * @author Administrator
 */
@Slf4j
public class HdfsClient {
    public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException {
        Configuration conf = new Configuration();
        // Get the HDFS client object
        URI uri = new URI("hdfs://hadoop102:9000");
        FileSystem fileSystem = FileSystem.get(uri, conf, "root");
        // Create a path in HDFS
        Path path = new Path("/man");
        fileSystem.mkdirs(path);
        // Release resources
        fileSystem.close();
        log.info("over");
    }
}

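Other basic file operations follow. Besides the imports above, these methods need BlockLocation, FileStatus, LocatedFileStatus, and RemoteIterator from org.apache.hadoop.fs.

/**
 * Create a directory
 */
public static void createDir() throws URISyntaxException, IOException, InterruptedException {
    Configuration conf = new Configuration();
    // Get the HDFS client object
    URI uri = new URI("hdfs://hadoop102:9000");
    FileSystem fileSystem = FileSystem.get(uri, conf, "root");
    // Create a path in HDFS
    Path path = new Path("/man");
    fileSystem.mkdirs(path);
    // Release resources
    fileSystem.close();
}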
/**
 * Upload a file to HDFS
 */
public static void copyFromLocal() throws URISyntaxException, IOException, InterruptedException {
    Configuration conf = new Configuration();
    // Get the HDFS client object
    URI uri = new URI("hdfs://hadoop102:9000");
    FileSystem fileSystem = FileSystem.get(uri, conf, "root");
    Path localPath = new Path("C:/Users/Administrator/Desktop/songjiang.txt");
    Path hdfsPath = new Path("/man/songjiang.txt");
    fileSystem.copyFromLocalFile(localPath,hdfsPath);
    fileSystem.close();
}
/**
 * Copy a file from HDFS to the local file system
 */
public static void copyToLocal() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    // Get the HDFS client object
    URI uri = new URI("hdfs://hadoop102:9000");
    FileSystem fileSystem = FileSystem.get(uri, conf, "root");
    Path localPath = new Path("C:/Users/Administrator/Desktop/down.txt");
    Path hdfsPath = new Path("/man/songjiang.txt");
    fileSystem.copyToLocalFile(false,hdfsPath,localPath,true);
    fileSystem.close();
}
/**
 * Rename a file
 */
public static void reName() throws URISyntaxException, IOException, InterruptedException {
    Configuration conf = new Configuration();
    // Get the HDFS client object
    URI uri = new URI("hdfs://hadoop102:9000");
    FileSystem fileSystem = FileSystem.get(uri, conf, "root");
    Path hdfsOldPath = new Path("/man/songjiang.txt");
    Path hdfsNewPath = new Path("/man/shuihu.txt");
    fileSystem.rename(hdfsOldPath,hdfsNewPath);
    fileSystem.close();
}
/**
 * Show file details
 */
public static void listFile() throws URISyntaxException, IOException, InterruptedException {
    Configuration conf = new Configuration();
    // Get the HDFS client object
    URI uri = new URI("hdfs://hadoop102:9000");
    FileSystem fileSystem = FileSystem.get(uri, conf, "root");
    // Recursively list every file under /
    RemoteIterator<LocatedFileStatus> listFiles = fileSystem.listFiles(new Path("/"), true);
    while (listFiles.hasNext()) {
        LocatedFileStatus fileStatus = listFiles.next();
        System.out.println("=============" + fileStatus.getPath().getName() + "=============");
        System.out.println("Name: " + fileStatus.getPath().getName() + "\nPath: " + fileStatus.getPath() + "\nPermissions: " + fileStatus.getPermission() + "\nSize: " + fileStatus.getLen()
                           + "\nBlock size: " + fileStatus.getBlockSize() + "\nGroup: " + fileStatus.getGroup() + "\nOwner: " + fileStatus.getOwner());
        // Hosts that store each block of the file
        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        for (BlockLocation blockLocation : blockLocations) {
            String[] hosts = blockLocation.getHosts();
            System.out.print("Block hosts: ");
            for (String host : hosts) {
                System.out.print(host + "\t");
            }
            System.out.println();
        }
    }
    fileSystem.close();
}
/**
 * Tell files and directories apart
 */
public static void listStatus() throws URISyntaxException, IOException, InterruptedException {
    Configuration conf = new Configuration();
    // Get the HDFS client object
    URI uri = new URI("hdfs://hadoop102:9000");
    FileSystem fileSystem = FileSystem.get(uri, conf, "root");
    FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/man"));
    for (FileStatus fileStatus : fileStatuses) {
        if (fileStatus.isFile()) {
            System.out.println("File: " + fileStatus.getPath().getName());
        } else {
            System.out.println("Directory: " + fileStatus.getPath().getName());
        }
    }
    fileSystem.close();
}

MapReduce

A MapReduce program is split into three parts: the Mapper, the Reducer, and the Driver.