I. Symptoms

Every job that involves YARN fails with a NodeManager (NM) connection timeout.
The Hive log reports the following error:

  2020-07-14T15:22:22,075 ERROR [HiveServer2-Background-Pool: Thread-2925053]: SessionState (:()) - Vertex failed, vertexName=Map 1, vertexId=vertex_1594164369302_64662_1_00, diagnostics=[Task failed, taskId=task_1594164369302_64662_1_00_000036, diagnostics=[TaskAttempt 0 failed, info=[Container launch failed for container_e460_1594164369302_64662_01_000019 : java.io.IOException: DestHost:destPort oc-qj-hdp-30-156:45454 , LocalHost:localPort oc-qj-hdp-30-92/10.170.30.92:0. Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.170.30.92:47710 remote=oc-qj-hdp-30-156/10.170.30.156:45454]
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
      at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
      at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
      at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
      at org.apache.hadoop.ipc.Client.call(Client.java:1443)
      at org.apache.hadoop.ipc.Client.call(Client.java:1353)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
      at com.sun.proxy.$Proxy42.startContainers(Unknown Source)
      at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
      at com.sun.proxy.$Proxy43.startContainers(Unknown Source)
      at org.apache.tez.dag.app.launcher.TezContainerLauncherImpl$Container.launch(TezContainerLauncherImpl.java:166)
      at org.apache.tez.dag.app.launcher.TezContainerLauncherImpl$EventProcessor.run(TezContainerLauncherImpl.java:396)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:748)
  Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.170.30.92:47710 remote=oc-qj-hdp-30-156/10.170.30.156:45454]
      at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:757)
      at java.security.AccessController.doPrivileged(Native Method)

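II. Analysis

A port connection timeout can be investigated from several angles: the network, threads, memory, and abnormal jobs. On this cluster, checks ruled out both the network and memory.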
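As a rough illustration of the network check, the sketch below probes the NM port named in the Hive log. The function name check_port and the default timeout are assumptions made for this example; the host and port come from the log. Note that the trace already shows the socket as "connected" before the read timed out, so a probe like this would be expected to succeed, which is consistent with ruling out the network.

```shell
#!/usr/bin/env bash
# Probe a TCP port and report whether a connection can be established.
# check_port is a hypothetical helper name; it is not part of Hadoop.
check_port() {
  local host="$1" port="$2" timeout_s="${3:-3}"
  # /dev/tcp is a bash feature; `timeout` comes from GNU coreutils.
  if timeout "$timeout_s" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK ${host}:${port}"
  else
    echo "FAIL ${host}:${port}"
    return 1
  fi
}

# NM address taken from the Hive log above.
check_port oc-qj-hdp-30-156 45454 || true
```

An "OK" here only proves that the port accepts connections; it says nothing about whether the NM answers RPC calls in time.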

1. First, run one of the MapReduce examples shipped with Hadoop and inspect its logs

Enable debug logging:
export HADOOP_ROOT_LOGGER=DEBUG,console
Run the MR job:
cd /usr/hdp/3.1.0.0-78/hadoop-mapreduce
hadoop jar ./hadoop-mapreduce-examples-3.1.1.3.1.0.0-78.jar pi 10 10
The NM log shows the following:
YARN exception: YarnException: Failed while publishing entity

  2020-07-14 14:16:54,038 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,138 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,238 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,338 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,438 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,538 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,585 ERROR [Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Exception while publishing configs on JOB_SUBMITTED Event for the job : job_1594164369302_64100
  org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
      at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
      at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.publishConfigsOnJobSubmittedEvent(JobHistoryEventHandler.java:1254)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1414)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
      at java.lang.Thread.run(Thread.java:748)
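To see how often these two symptoms occur in a larger log, a small filter along the following lines can help. This is only a sketch: summarize_ats_symptoms is a name made up for this example, and the usage line assumes that log aggregation is enabled so that `yarn logs` can fetch the container logs.

```shell
#!/usr/bin/env bash
# Count the two ATS-related symptoms in a log read from stdin.
# summarize_ats_symptoms is a hypothetical helper, not a Hadoop tool.
summarize_ats_symptoms() {
  awk '
    /Waiting for AsyncDispatcher to drain/ { drain++ }
    /YarnException: (Failed|Interrupted) while publishing entity/ { publish++ }
    END { printf "drain_waits=%d publish_failures=%d\n", drain, publish }
  '
}

# Usage (the application ID corresponds to job_1594164369302_64100 in the log):
# yarn logs -applicationId application_1594164369302_64100 | summarize_ats_symptoms
```

A high drain_waits count together with at least one publish failure matches the pattern shown in the log excerpt.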

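2. Analysis of the YARN exception YarnException: Failed while publishing entity

When a MapReduce job is submitted, the job itself finishes, but its container cannot shut down and keeps waiting for five minutes: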

  2020-07-14 14:16:54,038 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,138 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,238 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,338 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,438 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
  2020-07-14 14:16:54,538 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING

A few minutes later, an exception is thrown:

  2020-07-14 14:19:11,245 ERROR [Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Failed to process Event JOB_FINISHED for the job : job_1594164369302_64100
  org.apache.hadoop.yarn.exceptions.YarnException: Interrupted while publishing entity
      at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:551)
      at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1405)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
      at java.lang.Thread.run(Thread.java:748)
  Caused by: java.lang.InterruptedException
      at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
      at java.util.concurrent.FutureTask.get(FutureTask.java:191)
      at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545)
      ……
  Caused by: com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out
      ……
  Caused by: java.net.SocketTimeoutException: Read timed out

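3. Following the hint in the log, find out what this ATS thing actually is

By default, Timeline Service v2 runs an embedded HBase (HMaster, HRegionServer) whose processes are started as user yarn-ats.
YARN Timeline Service v.2 uses a set of collectors (writers) to write data to the backend storage. The AM sends application-related data to these timeline collectors.
According to the material I found, this situation arises when the embedded HBase used by ATSv2 has crashed.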
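One way to look for the embedded HBase daemons on a node is to filter the processes owned by yarn-ats, as in the sketch below. The helper name filter_ats_hbase is an assumption for this example, and the check only covers the node it runs on; the HBase containers may be running on a different host.

```shell
#!/usr/bin/env bash
# Filter a process listing (stdin) down to the HBase daemons that ATSv2 embeds.
# filter_ats_hbase is a hypothetical helper name used only in this sketch.
filter_ats_hbase() {
  grep -E 'HMaster|HRegionServer' || echo "no embedded HBase process found"
}

# Processes owned by yarn-ats on this node (the user may not exist everywhere).
ps -u yarn-ats -o pid=,args= 2>/dev/null | filter_ats_hbase

# The YARN service itself can also be queried by name, for example:
# sudo -u yarn-ats yarn app -status ats-hbase
```

If neither daemon shows up on any node while YARN is running, that supports the suspicion that the ATSv2 backend is down.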

III. Resolution

Fixing this problem requires resetting the HBase database embedded in ATSv2.

1. Stop the YARN service

Ambari -> Yarn -> Actions -> Stop

2. Delete the ATSv2 znode from ZooKeeper

zookeeper-client -server zookeeper-quorum-servers
rmr /atsv2-hbase-unsecure
or, on a Kerberized cluster:
rmr /atsv2-hbase-secure

3. Move the Timeline Service's embedded HBase data out of the way on HDFS

hdfs dfs -mv /atsv2/hbase /tmp/
hadoop fs -mv /services/sync/ocdp/hbase.yarnfile /tmp/

4. Destroy the ATS HBase service

yarn app -destroy ats-hbase

5. Start the YARN service

Ambari -> Yarn -> Actions -> Start

6. Resubmit the job; it now runs normally and the problem is resolved
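Steps 2 to 4 can be collected into one script, sketched below with a dry-run mode that is on by default. Several things here are assumptions rather than part of the original procedure: the DRY_RUN switch, the run and reset_ats_hbase helper names, the placeholder ZooKeeper address zk1:2181, and the expectation that zookeeper-client accepts a one-shot command on its command line. Stopping and starting YARN in Ambari (steps 1 and 5) remains manual.

```shell
#!/usr/bin/env bash
set -eu

DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print the commands (default)
ZK_QUORUM="${ZK_QUORUM:-zk1:2181}"       # hypothetical ZooKeeper address
ZNODE="${ZNODE:-/atsv2-hbase-unsecure}"  # /atsv2-hbase-secure on Kerberized clusters

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run only after YARN has been stopped in Ambari (step 1).
reset_ats_hbase() {
  run zookeeper-client -server "$ZK_QUORUM" rmr "$ZNODE"   # step 2
  run hdfs dfs -mv /atsv2/hbase /tmp/                      # step 3
  run hadoop fs -mv /services/sync/ocdp/hbase.yarnfile /tmp/
  run yarn app -destroy ats-hbase                          # step 4
}

reset_ats_hbase
```

Running it as is prints the four commands for review; setting DRY_RUN=0 executes them. Moving the data to /tmp instead of deleting it keeps a copy in case the reset needs to be undone.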