date: 2020-11-17
title: NameNode failure recovery
tags: hadoop
categories: Hadoop

After a NameNode failure, there are two ways to recover its data.

Method 1: copy the data from the secondarynamenode into the namenode's data directory

Simulate a namenode failure

    $ jps | grep -w NameNode | awk '{print $1}' | xargs kill -9  # stop the namenode
    $ rm -rf data/tmp/dfs/name/*  # delete the namenode's data
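The kill step above can be made slightly safer by isolating the PID extraction into a helper that can be checked before anything is killed. A minimal sketch; `find_nn_pid` is a hypothetical name, and it reads `jps`-style lines on stdin so the parsing works even without a running JVM:

```shell
# Hypothetical helper: extract the NameNode PID from jps-style output on stdin.
# Matching the second field exactly avoids also catching SecondaryNameNode.
find_nn_pid() {
  awk '$2 == "NameNode" {print $1}'
}

# On a real node you would use it as:
#   jps | find_nn_pid | xargs -r kill -9
```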

Copy the secondarynamenode's data

    # Copy the secondarynamenode's data into the namenode's name directory (run the following on the namenode)
    $ rsync -az 192.168.20.4:/apps/usr/hadoop-2.9.2/data/tmp/dfs/namesecondary/ /apps/usr/hadoop-2.9.2/data/tmp/dfs/name/
    $ hadoop-daemon.sh start namenode  # start just the namenode

Visit the namenode's web UI on port 50070; the data is intact, as shown below:

[Figure 1: the NameNode web UI after recovery]
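Instead of checking the web UI by hand, the NameNode's state can be read from its JMX endpoint on the same port. A sketch; the parser reads JSON on stdin so it can be exercised without a cluster, and the localhost URL is an assumption about where the UI is bound:

```shell
# Hypothetical helper: pull the "State" value out of NameNodeStatus JMX JSON.
nn_state() {
  sed -n 's/.*"State"[^"]*"\([a-z]*\)".*/\1/p'
}

# On the namenode you might pipe in the real endpoint:
#   curl -s 'http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus' | nn_state
```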

Method 2: start the namenode daemon with the -importCheckpoint option, which copies the secondarynamenode's data into the namenode directory

Edit hdfs-site.xml

    <!-- Add the following inside <configuration> -->
    <configuration>
        <!-- How often to check the operation count; keep it short, set to 60s -->
        <property>
            <name>dfs.namenode.checkpoint.check.period</name>
            <value>60</value>
        </property>
        <!-- The namenode's data directory -->
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/apps/usr/hadoop-2.9.2/data/tmp/dfs/name</value>
        </property>
    </configuration>
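A quick way to confirm the edit landed in the file is to grep the property out of hdfs-site.xml. A sketch; `check_prop` is a hypothetical helper (not a Hadoop tool), and it assumes each `<name>` is immediately followed by its `<value>` line, as in the snippet above:

```shell
# Hypothetical helper: print the value of a named property from an hdfs-site.xml-style file.
# usage: check_prop <property-name> <file>
check_prop() {
  grep -A1 "<name>$1</name>" "$2" | grep -o '<value>[^<]*</value>' | sed 's/<[^>]*>//g'
}

# On the namenode you might run:
#   check_prop dfs.namenode.checkpoint.check.period /apps/usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
```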

Simulate a namenode failure

    $ jps | grep -w NameNode | awk '{print $1}' | xargs kill -9  # stop the namenode
    $ rm -rf data/tmp/dfs/name/*  # delete the namenode's data

If the secondarynamenode is not on the same host as the namenode, copy the secondarynamenode's data directory to the directory at the same level as the namenode's data directory, and delete the in_use.lock file.

    # Run the following on the namenode
    $ rsync -az 192.168.20.4:/apps/usr/hadoop-2.9.2/data/tmp/dfs/namesecondary /apps/usr/hadoop-2.9.2/data/tmp/dfs/
    $ pwd  # confirm the current directory
    /apps/usr/hadoop-2.9.2/data/tmp/dfs
    $ ls  # confirm the following directories exist
    data  name  namesecondary
    $ rm -f namesecondary/in_use.lock  # delete the lock file in the secondarynamenode directory
    $ hdfs namenode -importCheckpoint  # run this command; it eventually prints output like the following (older versions may print nothing, so just wait a while)
    20/11/17 07:39:51 INFO hdfs.StateChange: STATE* Safe mode ON, in safe mode extension.
    The reported blocks 18 has reached the threshold 0.9990 of total blocks 18. The number of live datanodes 3 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 9 seconds.
    20/11/17 07:40:01 INFO hdfs.StateChange: STATE* Safe mode is OFF
    20/11/17 07:40:01 INFO hdfs.StateChange: STATE* Leaving safe mode after 30 secs
    20/11/17 07:40:01 INFO hdfs.StateChange: STATE* Network topology has 1 racks and 3 datanodes
    20/11/17 07:40:01 INFO hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
    # After running the command above, open another terminal and you will see the namenode is already listening on port 50070.
    # After about two minutes you can Ctrl+C the command, then start the namenode normally; the data has been recovered.
    $ hadoop-daemon.sh start namenode  # start the namenode
    # Test that hadoop works
    $ hadoop fs -mkdir /aaa  # create a directory
    $ hadoop fs -ls /  # list the root directory
    Found 4 items
    drwxr-xr-x - root supergroup 0 2020-11-17 06:25 /a
    drwxr-xr-x - root supergroup 0 2020-11-17 07:45 /aaa
    drwxrwx--- - root supergroup 0 2020-11-12 22:21 /tmp
    drwxr-xr-x - root supergroup 0 2020-11-12 22:18 /user
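The final `hadoop fs -ls /` check can also be scripted, so the recovery can be verified non-interactively. A minimal sketch; `check_dirs` is a hypothetical helper that reads the listing on stdin, which also makes it testable without a cluster:

```shell
# Hypothetical helper: verify that each given path appears in `hadoop fs -ls` output.
# usage: hadoop fs -ls / | check_dirs /tmp /user
check_dirs() {
  listing=$(cat)
  for d in "$@"; do
    printf '%s\n' "$listing" | grep -q " $d\$" || { echo "missing: $d"; return 1; }
  done
  echo "all paths present"
}
```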