Overview
Cluster startup
Non-HA
6 286 DatanodeProtocol.versionRequest
final String buildVersion;
String blockPoolID = ""; // id of the block pool
String softwareVersion;
long capabilities;
HAServiceState state;
HAServiceState
INITIALIZING("initializing"),
ACTIVE("active"),
STANDBY("standby"),
STOPPING("stopping");
6 305 DatanodeProtocol.registerDatanode
321 sendHeartbeat
329 blockReport
HA
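Runtime
Heartbeat
The DN sends DatanodeProtocol.sendHeartbeat and DatanodeProtocol.blockReceivedAndDeleted
to the NN at dfs.namenode.rpc-address (port 9000).
These calls carry the DataNode's own state and information about the blocks it holds. A DataNode does not know which file a block belongs to.
Response to sendHeartbeat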
/** Commands returned from the namenode to the datanode */
private final DatanodeCommand[] commands;
/** Information about the current HA-related state of the NN */
private final NNHAStatusHeartbeat haStatus;
private final RollingUpgradeStatus rollingUpdateStatus;
private final long fullBlockReportLeaseId;
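Declaring a DataNode dead
timeout = 10 * heartbeat.interval + 2 * heartbeat.recheck.interval
Heartbeat interval: dfs.heartbeat.interval, default 3 s.
Recheck interval: heartbeat.recheck.interval (full property name dfs.namenode.heartbeat.recheck-interval), default 5 min.
With the defaults the timeout is 10 * 3 s + 2 * 300 s = 630 s.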
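If the DN sends no heartbeat for 10 heartbeat intervals, it is not necessarily dead: its heartbeat thread may have stalled while the DN process is still alive. The NN therefore also waits out two recheck intervals (its heartbeat monitor re-examines the DataNodes once per recheck interval) and declares the DN dead only if no heartbeat has arrived by then.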
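The timeout arithmetic can be checked with a few lines of Java. This is a minimal sketch, not NameNode code: DeadNodeTimeout and its method names are hypothetical, and the units (heartbeat interval in seconds, recheck interval in milliseconds) are an assumption made for the example.

```java
/** Hypothetical helper reproducing the dead-node timeout formula from the notes. */
final class DeadNodeTimeout {

    /**
     * timeout = 10 * heartbeat interval + 2 * recheck interval, in milliseconds.
     * Assumed units: heartbeat interval in seconds, recheck interval in milliseconds.
     */
    static long expireMillis(long heartbeatIntervalSeconds, long recheckIntervalMillis) {
        return 10 * 1000 * heartbeatIntervalSeconds + 2 * recheckIntervalMillis;
    }

    /** A node counts as dead once its last heartbeat is older than the timeout. */
    static boolean isDead(long lastHeartbeatMillis, long nowMillis, long expireMillis) {
        return nowMillis - lastHeartbeatMillis > expireMillis;
    }
}
```

With the defaults, DeadNodeTimeout.expireMillis(3, 300_000) returns 630000 ms, which is the 630 s given above.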
Logs
Writing a file
2019-11-07 12:32:38,291 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073748689_7865, replicas=192.168.124.14:50010, 192.168.124.13:50010 for /flink/checkpoint/e43958d071536c9a941374d0d1ba80ac/chk-400/_metadata
2019-11-07 12:32:38,301 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /flink/checkpoint/e43958d071536c9a941374d0d1ba80ac/chk-400/_metadata for DFSClient_NONMAPREDUCE_-136599707_15
2019-11-07 12:32:38,305 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /flink/checkpoint/e43958d071536c9a941374d0d1ba80ac/chk-400/_metadata is closed by DFSClient_NONMAPREDUCE_-136599707_15
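To pull the block id, generation stamp, replica addresses and file path out of a "BLOCK* allocate" line like the first one above, a regular expression is enough. The sketch below is an assumption-laden example: AllocateLogLine is a hypothetical class, and the pattern is written against the single log line shown here, so other Hadoop versions may format the line differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for a NameNode "BLOCK* allocate" log line. */
final class AllocateLogLine {
    // Block names have the form blk_<blockId>_<generationStamp>.
    private static final Pattern ALLOCATE = Pattern.compile(
            "BLOCK\\* allocate blk_(-?\\d+)_(\\d+), replicas=(.+?) for (\\S+)$");

    final long blockId;
    final long generationStamp;
    final List<String> replicas; // host:port of each DataNode chosen for the block
    final String path;           // file the block was allocated for

    private AllocateLogLine(long blockId, long generationStamp,
                            List<String> replicas, String path) {
        this.blockId = blockId;
        this.generationStamp = generationStamp;
        this.replicas = replicas;
        this.path = path;
    }

    /** Returns the parsed line, or null if the line is not an allocate entry. */
    static AllocateLogLine parse(String line) {
        Matcher m = ALLOCATE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new AllocateLogLine(
                Long.parseLong(m.group(1)),
                Long.parseLong(m.group(2)),
                Arrays.asList(m.group(3).split(",\\s*")),
                m.group(4));
    }
}
```

Applied to the allocate line above, parse yields block id 1073748689, generation stamp 7865, two replicas (192.168.124.14:50010 and 192.168.124.13:50010) and the path ending in chk-400/_metadata. The fsync and completeFile lines do not match and return null.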