Environment:

hadoop: 3.0.0-cdh6.3.2

Background:

We had only just wired CDH alerts into our DingTalk group, and it was an eye-opener: every so often a NameNode RPC latency alert would pop up and rattle everyone, so we decided to hunt down the culprit behind the latency.
[Figure: NameNode RPC latency alert pushed to the DingTalk group]

Problem diagnosis:

  • NameNode RPC latency

[Figure: NameNode RPC latency chart showing three latency spikes]

  • The chart above shows three windows of elevated latency. Let's check the Active NameNode log for the corresponding times (a short lock-contention sketch follows the log):

    2021-01-26 09:02:39,014 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.xx.xx.197
    2021-01-26 09:02:39,014 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
    2021-01-26 09:02:39,014 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 1252312052, 1252312467
    2021-01-26 09:02:39,949 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 420 Total time for transactions(ms): 21 Number of transactions batched in Syncs: 32 Number of syncs: 388 SyncTimes(ms): 445 15210
    2021-01-26 09:02:39,954 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data/dfs/nn/current/edits_inprogress_0000000001252312052 -> /data/dfs/nn/current/edits_0000000001252312052-0000000001252312471
    2021-01-26 09:02:39,955 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 1252312472
    2021-01-26 09:02:40,672 INFO org.apache.sentry.hdfs.SentryAuthorizationInfo: Received updates from Sentry Server. Size of PathUpdates 1 PermUpdates 0
    2021-01-26 09:02:40,672 INFO org.apache.sentry.hdfs.UpdateableAuthzPaths: Applying Path update. Adding path for authz object cloudera_manager_metastore_canary_test_db_hive_hivemetastore_140d02447c5fd9c8e3f975f0bc0f78e5.cm_test_table authz path [[user, hue, .cloudera_manager_hive_metastore_canary, hive_HIVEMETASTORE_140d02447c5fd9c8e3f975f0bc0f78e5, cm_test_table, p1=p0, p2=420]]
    2021-01-26 09:03:05,679 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 26663 ms via
        java.lang.Thread.getStackTrace(Thread.java:1559)
        org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
        org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
        org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1604)
        org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4606)
        org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1276)
        org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:146)
        org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12974)
        org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        java.security.AccessController.doPrivileged(Native Method)
        javax.security.auth.Subject.doAs(Subject.java:422)
        org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
        Number of suppressed write-lock reports: 0
        Longest write-lock held interval: 26663
    2021-01-26 09:03:05,683 WARN org.apache.hadoop.ipc.Server: IPC Server handler 44 on 8022, call Call#25095 Retry#0 org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.xx.xx.197:44732: output error
    2021-01-26 09:03:05,684 INFO org.apache.hadoop.ipc.Server: IPC Server handler 44 on 8022 caught an exception
    java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:3242)
        at org.apache.hadoop.ipc.Server.access$1700(Server.java:137)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1466)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1536)
        at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2586)
        at org.apache.hadoop.ipc.Server$Connection.access$300(Server.java:1608)
        at org.apache.hadoop.ipc.Server$RpcCall.doResponse(Server.java:933)
        at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:767)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
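
The key line above is "FSNamesystem write lock held for 26663 ms", and the stack trace shows the unlock happening at the end of FSNamesystem.rollEditLog(): the roll runs under the global namesystem write lock, so while it is stuck, every other RPC handler queues up behind it, which is exactly the latency the chart shows. The two figures in "SyncTimes(ms): 445 15210" are cumulative sync times reported per edit-log journal, and the 15210 ms entry already hints that one of the journal destinations is slow. As a minimal, self-contained illustration (plain JDK code, not Hadoop source; the thread names and the 26-second sleep are stand-ins), here is why one writer stalled on slow I/O holds up every reader of a coarse-grained read/write lock like the one behind FSNamesystemLock:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Self-contained illustration (plain JDK, not Hadoop source) of the effect in the
 * log above: the namesystem lock is in essence one big read/write lock, so a single
 * rollEditLog() call stuck on a slow edits disk while holding the write lock makes
 * every other RPC handler wait.
 */
public class WriteLockStallDemo {
  private static final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();

  public static void main(String[] args) throws InterruptedException {
    Thread roller = new Thread(() -> {
      nsLock.writeLock().lock();           // FSNamesystem.rollEditLog() takes the write lock
      try {
        TimeUnit.SECONDS.sleep(26);        // stand-in for syncing edits on a disk at 100% util
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      } finally {
        nsLock.writeLock().unlock();       // "FSNamesystem write lock held for 26663 ms"
      }
    }, "rollEditLog");

    Thread reader = new Thread(() -> {
      long start = System.nanoTime();
      nsLock.readLock().lock();            // an ordinary read RPC only needs the read lock...
      try {
        System.out.println("read RPC waited "
            + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) + " ms");
      } finally {
        nsLock.readLock().unlock();        // ...but still has to wait out the whole roll
      }
    }, "getFileInfo");

    roller.start();
    TimeUnit.MILLISECONDS.sleep(100);      // let the roll grab the lock first
    reader.start();
    roller.join();
    reader.join();
  }
}
```

Run it and the "read RPC" reports a wait of roughly 26 seconds even though it never touches the disk itself, which is the shape of every spike in the alert chart.
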
  • Now let's look at the Standby NameNode log:

    2021-01-26 09:02:39,011 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode
    2021-01-26 09:02:40,201 INFO org.apache.sentry.hdfs.SentryAuthorizationInfo: Received updates from Sentry Server. Size of PathUpdates 1 PermUpdates 0
    2021-01-26 09:02:40,201 INFO org.apache.sentry.hdfs.UpdateableAuthzPaths: Applying Path update. Adding path for authz object cloudera_manager_metastore_canary_test_db_hive_hivemetastore_140d02447c5fd9c8e3f975f0bc0f78e5.cm_test_table authz path [[user, hue, .cloudera_manager_hive_metastore_canary, hive_HIVEMETASTORE_140d02447c5fd9c8e3f975f0bc0f78e5, cm_test_table, p1=p0, p2=420]]
    2021-01-26 09:02:59,035 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Exception from remote name node RemoteNameNodeInfo [nnId=namenode106, ipcAddress=olap-hdfs-1.online.xx.com/10.xx.xx.11:8022, httpAddress=http://olap-hdfs-1.online.xx.com:9870], try next.
    java.net.SocketTimeoutException: Call From olap-hdfs-2.online.xx.com/10.xx.xx.197 to olap-hdfs-1.online.xx.com:8022 failed on socket timeout exception: java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.xx.xx.197:44732 remote=olap-hdfs-1.online.xx.com/10.xx.xx.11:8022]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.GeneratedConstructorAccessor23.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:774)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1503)
        at org.apache.hadoop.ipc.Client.call(Client.java:1445)
        at org.apache.hadoop.ipc.Client.call(Client.java:1355)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy23.rollEditLog(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:150)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:365)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:362)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:504)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.xx.xx.197:44732 remote=olap-hdfs-1.online.xx.com/10.xx.xx.11:8022]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:554)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1804)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1166)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1062)
  • Reading the logs: the Standby NameNode triggered an edit log roll on the Active NameNode. Let's look at the source code:

org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer

    /**
     * Trigger the active node to roll its logs.
     */
    @VisibleForTesting
    void triggerActiveLogRoll() {
      LOG.info("Triggering log roll on remote NameNode");
      Future<Void> future = null;
      try {
        // Submit getNameNodeProxy() to the roll-edits thread pool; see that method below
        future = rollEditsRpcExecutor.submit(getNameNodeProxy());
        future.get(rollEditsTimeoutMs, TimeUnit.MILLISECONDS);
        lastRollTriggerTxId = lastLoadedTxnId;
      } catch (ExecutionException e) {
        Throwable cause = e.getCause();
        if (cause instanceof RemoteException) {
          IOException ioe = ((RemoteException) cause).unwrapRemoteException();
          if (ioe instanceof StandbyException) {
            LOG.info("Skipping log roll. Remote node is not in Active state: " +
                ioe.getMessage().split("\n")[0]);
            return;
          }
        }
        LOG.warn("Unable to trigger a roll of the active NN", e);
      } catch (TimeoutException e) {
        if (future != null) {
          future.cancel(true);
        }
        LOG.warn(String.format(
            "Unable to finish rolling edits in %d ms", rollEditsTimeoutMs));
      } catch (InterruptedException e) {
        LOG.warn("Unable to trigger a roll of the active NN", e);
      }
    }

    /**
     * NameNodeProxy factory method.
     * @return a Callable to roll logs on remote NameNode.
     */
    @VisibleForTesting
    Callable<Void> getNameNodeProxy() {
      return new MultipleNameNodeProxy<Void>() {
        @Override
        protected Void doWork() throws IOException {
          // Invoke rollEditLog() remotely through the Active NameNode proxy
          cachedActiveProxy.rollEditLog();
          return null;
        }
      };
    }

It first obtains a proxy for the Active NameNode and calls its rollEditLog() method to roll the edit log, then assigns lastLoadedTxnId (the highest transaction ID the Standby NameNode has loaded so far) to lastRollTriggerTxId (the newest transaction ID at the time the roll was last triggered); this bookkeeping makes the later roll and tailing decisions easier.
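
As a side note, the Standby-side pattern above, submitting the roll RPC to a dedicated executor and bounding it with future.get(rollEditsTimeoutMs), can be boiled down to the following self-contained sketch (plain JDK code, not Hadoop source; the 2-second timeout and 10-second sleep are made-up demo values, not Hadoop defaults). In our incident the IPC client actually gave up first with its own 20000 ms socket timeout inside the proxy call (the "Exception from remote name node ... try next" warning above), but the bounding structure is the same:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Self-contained sketch (plain JDK, not Hadoop source) of the pattern used by
 * triggerActiveLogRoll() above: submit the roll RPC to a dedicated executor and
 * bound it with future.get(timeout), so a roll that hangs on the Active side
 * cannot wedge the tailer thread forever.
 */
public class BoundedRollDemo {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService rollEditsRpcExecutor = Executors.newSingleThreadExecutor();
    long rollEditsTimeoutMs = 2_000;   // demo value only

    Future<Void> future = rollEditsRpcExecutor.submit(() -> {
      TimeUnit.SECONDS.sleep(10);      // stand-in for cachedActiveProxy.rollEditLog() hanging
      return null;
    });

    try {
      future.get(rollEditsTimeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true);             // same handling as the TimeoutException branch above
      System.out.printf("Unable to finish rolling edits in %d ms%n", rollEditsTimeoutMs);
    } catch (ExecutionException e) {
      System.out.println("Roll failed: " + e.getCause());
    } finally {
      rollEditsRpcExecutor.shutdownNow();
    }
  }
}
```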

  • Let's draw a diagram to clarify the call chain:

[Figure: call chain from the Standby NameNode's EditLogTailer, through the rollEditLog RPC, to FSNamesystem.rollEditLog() on the Active NameNode]

  • Now all we need to figure out is why rollEditLog() on the Active NameNode times out. Machine load, CPU, memory, service GC and network IO all pleaded innocent. Since the NameNode's dfs.namenode.name.dir = /data/dfs/nn, we looked at the IO utilization of the disk backing /data, and sure enough, that is where the trouble was: every RPC latency window lines up with the disk sitting at 100% util (see the chart below and the small sampling sketch after it).

[Figure: IO utilization (%util) of the disk backing /data, pegged at 100% during the RPC latency windows]
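
For reference, the %util in the chart is the fraction of wall-clock time the device spent with at least one request in flight, which is what iostat -x reports. Below is a minimal, Linux-only sketch of sampling it yourself from /proc/diskstats; the default device name "sda" is only a placeholder for whatever disk actually backs /data/dfs/nn:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of how the disk %util figure is derived: sample the
 * "milliseconds spent doing I/O" counter in /proc/diskstats twice and divide
 * the delta by the wall-clock interval. Linux only; device name is a placeholder.
 */
public class DiskUtilProbe {
  private static long ioTicksMs(String device) throws Exception {
    for (String line : Files.readAllLines(Paths.get("/proc/diskstats"))) {
      String[] f = line.trim().split("\\s+");
      // f[2] = device name, f[12] = milliseconds spent doing I/O
      if (f.length > 12 && f[2].equals(device)) {
        return Long.parseLong(f[12]);
      }
    }
    throw new IllegalArgumentException("device not found: " + device);
  }

  public static void main(String[] args) throws Exception {
    String device = args.length > 0 ? args[0] : "sda";  // placeholder device name
    long intervalMs = 1000;
    long before = ioTicksMs(device);
    TimeUnit.MILLISECONDS.sleep(intervalMs);
    long after = ioTicksMs(device);
    // Approximate: the sleep is not exactly intervalMs, good enough for a spot check.
    double utilPercent = 100.0 * (after - before) / intervalMs;
    System.out.printf("%s util ~ %.1f%%%n", device, utilPercent);
  }
}
```

Sampling the edits disk this way while the alerts fire should print values near 100, matching the chart.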

Solution:

Move the NameNode's dfs.namenode.name.dir onto a disk whose IO utilization is healthy.

Follow-up thoughts:

This cluster co-locates several services on the same machines, so resource contention comes with the territory. If you co-locate workloads in production, be careful: when you truly have no other choice, at least map out each component's disk dependencies, and do not let every component share one disk the way we did.

References:

https://www.cnblogs.com/RunnyHao/articles/5400729.html (The HDFS RPC mechanism)
https://blog.csdn.net/dengxing1234/article/details/60956951 (How FSImage and edit logs are merged in Hadoop 2.x HA mode)
https://www.cnblogs.com/jirimutu01/p/5556232.html (HDFS source code analysis: EditLogTailer)
https://developer.ibm.com/zh/articles/os-cn-hadoop-name-node/ (How Hadoop NameNode High Availability is implemented)