Overview

Cluster startup

Non-HA

6 286 DatanodeProtocol.versionRequest

  final String buildVersion;
  String blockPoolID = ""; // id of the block pool
  String softwareVersion;
  long capabilities;
  HAServiceState state;

  // values of HAServiceState
  INITIALIZING("initializing"),
  ACTIVE("active"),
  STANDBY("standby"),
  STOPPING("stopping");

6 305 DatanodeProtocol.registerDatanode
321 sendHeartbeat
329 blockReport
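
The call order listed above can be sketched roughly as follows. This is a minimal illustration, not Hadoop code: `NameNodeRpc`, `RecordingNameNode`, and the `String`/`List<Long>` parameters are hypothetical simplifications, while the real `DatanodeProtocol` methods take and return richer types.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the DataNode startup call order, using hypothetical simplified types. */
final class StartupHandshake {

    /** Hypothetical stand-in for the NameNode side of DatanodeProtocol. */
    interface NameNodeRpc {
        String versionRequest();                       // simplified: returns only a block pool id
        void registerDatanode(String datanodeId);
        void sendHeartbeat(String datanodeId);
        void blockReport(String datanodeId, List<Long> blockIds);
    }

    /** Test double that records the name of each call in arrival order. */
    static final class RecordingNameNode implements NameNodeRpc {
        final List<String> calls = new ArrayList<>();

        @Override public String versionRequest() {
            calls.add("versionRequest");
            return "BP-example";                       // made-up block pool id
        }
        @Override public void registerDatanode(String datanodeId) {
            calls.add("registerDatanode");
        }
        @Override public void sendHeartbeat(String datanodeId) {
            calls.add("sendHeartbeat");
        }
        @Override public void blockReport(String datanodeId, List<Long> blockIds) {
            calls.add("blockReport");
        }
    }

    /** Issues the four calls in the order given in the notes. */
    static void start(NameNodeRpc nn, String datanodeId, List<Long> blockIds) {
        nn.versionRequest();                 // 1. handshake: fetch namespace info
        nn.registerDatanode(datanodeId);     // 2. register this DataNode
        nn.sendHeartbeat(datanodeId);        // 3. first heartbeat
        nn.blockReport(datanodeId, blockIds); // 4. report the blocks held locally
    }

    public static void main(String[] args) {
        RecordingNameNode nn = new RecordingNameNode();
        start(nn, "dn-1", List.of(1073748689L));
        System.out.println(nn.calls);
    }
}
```

In a running DataNode the heartbeat and block report repeat on their own schedules; the sketch only shows the first pass.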

HA

Runtime

Heartbeat

The DN sends the following calls to the NN at dfs.namenode.rpc-address (port 9000):
DatanodeProtocol.sendHeartbeat
DatanodeProtocol.blockReceivedAndDeleted

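These calls carry the DataNode's own status and information about the blocks it holds. The DataNode does not know which file a block belongs to.

Response to sendHeartbeat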

  /** Commands returned from the namenode to the datanode */
  private final DatanodeCommand[] commands;
  /** Information about the current HA-related state of the NN */
  private final NNHAStatusHeartbeat haStatus;
  private final RollingUpgradeStatus rollingUpdateStatus;
  private final long fullBlockReportLeaseId;

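DataNode death detection

timeout = 10 * heartbeat.interval + 2 * heartbeat.recheck.interval
Heartbeat interval: dfs.heartbeat.interval, default 3 s
Recheck interval: heartbeat.recheck.interval (set as dfs.namenode.heartbeat.recheck-interval in hdfs-site.xml), default 5 min
With the defaults this gives 10 * 3 s + 2 * 300 s = 630 s.
If the DN sends no heartbeat for 10 heartbeat intervals, its heartbeat thread may have died while the DN itself is still alive. The NN therefore rechecks, and declares the DN dead only if two consecutive rechecks also see no heartbeat.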
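
The timeout arithmetic above can be checked with a short sketch. The class and method names here (`DeadNodeTimeout`, `timeout`, `isDead`) are hypothetical and only mirror the formula in these notes; they are not the NameNode's actual implementation.

```java
import java.time.Duration;

/** Sketch of the dead-node timeout formula from the notes (hypothetical helper). */
final class DeadNodeTimeout {

    // Default values quoted in the notes.
    static final Duration DEFAULT_HEARTBEAT_INTERVAL = Duration.ofSeconds(3);
    static final Duration DEFAULT_RECHECK_INTERVAL = Duration.ofMinutes(5);

    /** timeout = 10 * heartbeat interval + 2 * recheck interval */
    static Duration timeout(Duration heartbeatInterval, Duration recheckInterval) {
        return heartbeatInterval.multipliedBy(10).plus(recheckInterval.multipliedBy(2));
    }

    /** True when the last heartbeat is older than the timeout window. */
    static boolean isDead(long lastHeartbeatMillis, long nowMillis, Duration timeout) {
        return nowMillis - lastHeartbeatMillis > timeout.toMillis();
    }

    public static void main(String[] args) {
        Duration t = timeout(DEFAULT_HEARTBEAT_INTERVAL, DEFAULT_RECHECK_INTERVAL);
        System.out.println("timeout = " + t.getSeconds() + " s"); // 10*3 + 2*300 = 630

        long lastHeartbeat = 0L;
        System.out.println(isDead(lastHeartbeat, 600_000L, t)); // 600 s of silence: still alive
        System.out.println(isDead(lastHeartbeat, 631_000L, t)); // 631 s of silence: dead
    }
}
```

Shortening either interval shrinks the window; for example, a 1 s heartbeat with a 1 min recheck would give 10 * 1 s + 2 * 60 s = 130 s under the same formula.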

Logs

Writing a file

2019-11-07 12:32:38,291 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073748689_7865, replicas=192.168.124.14:50010, 192.168.124.13:50010 for /flink/checkpoint/e43958d071536c9a941374d0d1ba80ac/chk-400/_metadata

2019-11-07 12:32:38,301 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /flink/checkpoint/e43958d071536c9a941374d0d1ba80ac/chk-400/_metadata for DFSClient_NONMAPREDUCE_-136599707_15
2019-11-07 12:32:38,305 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /flink/checkpoint/e43958d071536c9a941374d0d1ba80ac/chk-400/_metadata is closed by DFSClient_NONMAPREDUCE_-136599707_15