1、Background

We have seen many incidents of overloaded HDFS NameNodes caused by 1) misconfigurations or 2) "bad" MR jobs or Hive queries that generate a large number of RPC requests in a short period of time. Quite a few features have been introduced in HDP 2.3/2.4 to protect the HDFS NameNode. This article summarizes the deployment steps for these features, along with an incomplete list of known issues and possible solutions.

2、Optimizations

- Enable Async Audit Logging (configuration covered in this article)
- Dedicated Service RPC Port (configuration covered in this article)
- Dedicated Lifeline RPC Port for HA (configuration covered in this article)
- Enable FairCallQueue on Client RPC Port (configuration covered in this article)
- Enable RPC Client Backoff on Client RPC Port
- Enable RPC Caller Context to track the "bad" jobs
- Enable response-time-based backoff with DecayRpcScheduler
- Check JMX for NameNode client RPC call queue length and average queue time
- Check JMX for NameNode DecayRpcScheduler when FairCallQueue is enabled
- NNTop (HDFS-6982)
- Tune the configuration when deleting a large directory is slow (configuration covered in this article)
- Apply patches to improve FBR (full block report) processing when the NameNode starts

3、Enable Async Audit Logging

Enable async audit logging by setting dfs.namenode.audit.log.async to true in hdfs-site.xml. This minimizes the impact of audit log I/O on NameNode performance.

<property>
  <name>dfs.namenode.audit.log.async</name>
  <value>true</value>
</property>

4、Dedicated Service RPC Port

Configuring a separate service RPC port can improve the responsiveness of the NameNode by allowing DataNode and client requests to be processed via separate RPC queues. DataNodes and all other internal services should connect to the new service RPC address, while clients continue to connect to the well-known address specified by dfs.namenode.rpc-address.

Adding a service RPC port to an HA cluster with automatic failover via ZKFCs (with or without Kerberos) requires the following additional steps:

1. Add the following settings to hdfs-site.xml

<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8040</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8040</value>
</property>

2. If the cluster is not Kerberos enabled, skip this step

If the cluster is Kerberos enabled, create two new hdfs_jaas.conf files for nn1 and nn2 and copy them to /etc/hadoop/conf/hdfs_jaas.conf on the respective NameNode hosts:

nn1:

Client { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=true 
storeKey=true 
useTicketCache=false 
keyTab="/etc/security/keytabs/nn.service.keytab"
principal="nn/c6401.ambari.apache.org@EXAMPLE.COM";
};

nn2:

Client { 
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=false
keyTab="/etc/security/keytabs/nn.service.keytab"
principal="nn/c6402.ambari.apache.org@EXAMPLE.COM";
};

Add the following to hadoop-env.sh

export HADOOP_NAMENODE_OPTS="-Dzookeeper.sasl.client=true \
  -Dzookeeper.sasl.client.username=zookeeper \
  -Djava.security.auth.login.config=/etc/hadoop/conf/hdfs_jaas.conf \
  -Dzookeeper.sasl.clientconfig=Client ${HADOOP_NAMENODE_OPTS}"

3. Restart NameNodes

4. Restart DataNodes

Restart DataNodes to connect to the new NameNode service RPC port instead of the NameNode client RPC port

5. Stop the ZKFC

Stop the ZKFC processes on both NameNodes

6. Reset the ZKFC state in ZooKeeper (-formatZK)

Run the following command to reset the ZKFC state in ZooKeeper
hdfs zkfc -formatZK

Known issues:

1. Without step 6, you will see the following exception after the ZKFC restarts:

    java.lang.RuntimeException: Mismatched address stored in ZK for NameNode
2. Without step 2 in a Kerberos-enabled HA cluster, you will see the following exception when running step 6:

    16/03/23 03:30:53 INFO ha.ActiveStandbyElector: Recursively deleting /hadoop-ha/hdp64ha from ZK...
    16/03/23 03:30:53 ERROR ha.ZKFailoverController: Unable to clear zk parent znode
    java.io.IOException: Couldn't clear parent znode /hadoop-ha/hdp64ha
        at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:380)
        at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:267)
        at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:212)
        at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
        at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:442)
        at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
        at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:183)
    Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /hadoop-ha/hdp64ha
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
        at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:54)
        at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:375)
        at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:372)
        at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1041)
        at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:372)
        ... 11 more

# 5、Dedicated Lifeline RPC Port for HA
HDFS-9311 allows using a separate RPC address to isolate health checks and liveness reporting from the client RPC port, which can be exhausted by "bad" jobs. Here is an example of configuring this feature in an HA cluster:

<property>
  <name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8050</value>
</property>
<property>
  <name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8050</value>
</property>

In other words, the full parameter configuration for the RPC port split patch looks like this:

dfs.namenode.servicerpc-address.gaofeng.nn1=gaofeng-nn-01:8022
dfs.namenode.servicerpc-address.gaofeng.nn2=gaofeng-nn-02:8022
dfs.namenode.lifeline.rpc-address.gaofeng.nn1=gaofeng-nn-01:8023
dfs.namenode.lifeline.rpc-address.gaofeng.nn2=gaofeng-nn-02:8023
dfs.namenode.service.handler.count=50
dfs.namenode.lifeline.handler.count=50

After applying the configuration above, restart the affected components and then run `hdfs zkfc -formatZK`.

Note that this queue split has a known bug in Hadoop 3: a sendLifeline NPE. The NameNode hits an NPE while processing the lifeline messages sent by DataNodes, which corrupts the maxLoad value the NameNode computes. Because DataNodes are then marked as busy in chooseDataNode and no available node can be allocated, the resulting retry loop causes high CPU usage and degrades cluster performance.

Fix: apply HDFS-15556 (HDFS-14042 is a duplicate of the same issue).
# 6、Enable FairCallQueue on Client RPC Port
References:

- [https://blog.csdn.net/Androidlushangderen/article/details/80860637](https://blog.csdn.net/Androidlushangderen/article/details/80860637)
- [https://issues.apache.org/jira/browse/HADOOP-9640](https://issues.apache.org/jira/browse/HADOOP-9640)
- [http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FairCallQueue.html](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FairCallQueue.html)
- [http://www.itpub.net/2019/06/25/2270/#comments](http://www.itpub.net/2019/06/25/2270/#comments)
- [https://tech.ebayinc.com/engineering/quality-of-service-in-hadoop/](https://tech.ebayinc.com/engineering/quality-of-service-in-hadoop/)
- [https://support.huawei.com/enterprise/en/doc/EDOC1100074552/ddc366b3/optimizing-hdfs-namenode-rpc-qos](https://support.huawei.com/enterprise/en/doc/EDOC1100074552/ddc366b3/optimizing-hdfs-namenode-rpc-qos)
- [https://www.infoq.cn/article/7o96tvjwnelq4xp-7puh](https://www.infoq.cn/article/7o96tvjwnelq4xp-7puh)
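For reference, here is a minimal sketch for turning on FairCallQueue together with client backoff and response-time-based backoff on the NameNode, assuming 8020 is the client RPC port left after the service/lifeline split. The property names follow the Apache FairCallQueue documentation linked above (they go in core-site.xml), and the priority levels and thresholds are illustrative values, not recommendations:

<!-- Sketch only: 8020 is assumed to be the NameNode client RPC port; values are illustrative -->
<property>
  <!-- Swap the default FIFO call queue for FairCallQueue on port 8020 -->
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <!-- DecayRpcScheduler ranks callers by recent call volume and demotes heavy users -->
  <name>ipc.8020.scheduler.impl</name>
  <value>org.apache.hadoop.ipc.DecayRpcScheduler</value>
</property>
<property>
  <!-- Number of priority sub-queues inside the FairCallQueue -->
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>4</value>
</property>
<property>
  <!-- Tell clients to back off and retry instead of piling up in a full queue -->
  <name>ipc.8020.backoff.enable</name>
  <value>true</value>
</property>
<property>
  <!-- Optional: also back off based on per-priority average response time -->
  <name>ipc.8020.decay-scheduler.backoff.responsetime.enable</name>
  <value>true</value>
</property>
<property>
  <!-- Illustrative thresholds, one per priority level -->
  <name>ipc.8020.decay-scheduler.backoff.responsetime.thresholds</name>
  <value>10s,20s,30s,40s</value>
</property>

After editing core-site.xml, restart the NameNode; on recent Hadoop versions the call queue can also be swapped in place with `hdfs dfsadmin -refreshCallQueue`.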
# 10、HDFS JMX & health check
For JMX, refer to the metric list in the official "Hadoop JMX Monitoring and Alerting" page. Useful checks include:

- Check JMX for NameNode client RPC call queue length and average queue time
- Check JMX for NameNode DecayRpcScheduler when FairCallQueue is enabled
- NNTop (HDFS-6982)

The HDFS monitoring commands I use most often in production are summarized below.
## 【NN audit cmd count】
Count the number of audit records per minute in the hdfs-audit log:

`cat /var/log/hadoop/ocdp/hdfs-audit.log | awk '{print $2}' | awk -F ':' '{print $1":"$2}' | sort | uniq -c`
## Health check from the command line

`hdfs dfsadmin -report | head`
## Block count with replication factor 3

`hdfs dfsadmin -report | grep 'Num of Blocks' | awk -F ':' '{print $2}' | awk '{sum +=$1};END {print sum/3}'` (approximate value)

`curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem | grep -i "blocktotal"` (the value reported by the NameNode web UI on port 50070)
## Check PendingDeletionBlocks

- (via Ambari)

curl -u admin:admin -X GET http://192.168.1.1:8080/api/v1/clusters/testqjcluster/hosts/host-192-168-1-1/host_components/NAMENODE?fields=metrics/rpc
(look for PendingDeletionBlocks in the response)

- (via Hadoop JMX)

curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem | grep -i "PendingDeletionBlocks"

## Check RPC metrics

- (via Ambari)

curl -u admin:admin -X GET http://192.168.1.1:8080/api/v1/clusters/cluster1/hosts/host-192-168-1-1/host_components/NAMENODE?fields=metrics/rpc

- (via Hadoop JMX)

Based on the port configuration above, monitor the following:

(client: client-to-NameNode RPC)
curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020

(service: DataNode and other internal service RPC)
curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8040

(lifeline: DataNode lifeline/heartbeat RPC)
curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8050

11、Delete optimization

https://blog.csdn.net/androidlushangderen/article/details/83472804

HDFS-13831 patch

Apply the HDFS-13831 patch and lower dfs.namenode.block.deletion.increment (default 1000) to 100.

FoldedTreeSet defragmentation threshold

dfs.namenode.storageinfo.defragment.ratio: 0.75 -> 0.9
ipc.8020.callqueue.impl=org.apache.hadoop.ipc.FairCallQueue

Following the committer's advice, raise the defragmentation-ratio threshold of FoldedTreeSet (the data structure Hadoop 3 uses to store block info).

ipc.server.read.threadpool.size

Number of RPC Reader threads: default 1 -> 100.

dfs.namenode.service.handler.count

Number of RPC Handler threads: default 10 -> 361.

ipc.server.handler.queue.size

Maximum call queue length per Handler: default 100 -> 1000.
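As a consolidated sketch of the tuning above (assuming the dfs.* properties go in hdfs-site.xml and the ipc.* properties in core-site.xml; the values simply restate the ones quoted above and should be validated against your own workload, and dfs.namenode.block.deletion.increment only exists once HDFS-13831 has been applied):

<!-- hdfs-site.xml (sketch; values mirror the tuning described above) -->
<property>
  <!-- Blocks removed per batch while holding the namesystem lock; requires HDFS-13831 -->
  <name>dfs.namenode.block.deletion.increment</name>
  <value>100</value>
</property>
<property>
  <!-- FoldedTreeSet defragmentation threshold, raised from the 0.75 default -->
  <name>dfs.namenode.storageinfo.defragment.ratio</name>
  <value>0.9</value>
</property>
<property>
  <!-- Service RPC handler threads -->
  <name>dfs.namenode.service.handler.count</name>
  <value>361</value>
</property>

<!-- core-site.xml -->
<property>
  <!-- RPC reader threads -->
  <name>ipc.server.read.threadpool.size</name>
  <value>100</value>
</property>
<property>
  <!-- Maximum call queue length per handler -->
  <name>ipc.server.handler.queue.size</name>
  <value>1000</value>
</property>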

12、Patches to improve FBR processing when the NameNode starts

On a production cluster of about 1.4k nodes, full block reports for more than 200 million blocks used to complete within an hour after the NameNode started. After upgrading from Hadoop 2.7 to 3.1, the reports took around 4 hours, which seriously affected the production environment.
Applying the following patches resolves this problem:
HDFS-14366
HDFS-14859
HDFS-14632
HDFS-14171