一、问题现象
涉及到yarn的所有任务都会报nm连接超时问题,导致任务失败。
以下是hive日志报错
2020-07-14T15:22:22,075 ERROR [HiveServer2-Background-Pool: Thread-2925053]: SessionState (:()) - Vertex failed, vertexName=Map 1, vertexId=vertex_1594164369302_64662_1_00, diagnostics=[Task failed, taskId=task_1594164369302_64662_1_00_000036, diagnostics=[TaskAttempt 0 failed, info=[Container launch failed for container_e460_1594164369302_64662_01_000019 : java.io.IOException: DestHost:destPort oc-qj-hdp-30-156:45454 , LocalHost:localPort oc-qj-hdp-30-92/10.170.30.92:0. Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.170.30.92:47710 remote=oc-qj-hdp-30-156/10.170.30.156:45454]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
at org.apache.hadoop.ipc.Client.call(Client.java:1443)
at org.apache.hadoop.ipc.Client.call(Client.java:1353)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy42.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy43.startContainers(Unknown Source)
at org.apache.tez.dag.app.launcher.TezContainerLauncherImpl$Container.launch(TezContainerLauncherImpl.java:166)
at org.apache.tez.dag.app.launcher.TezContainerLauncherImpl$EventProcessor.run(TezContainerLauncherImpl.java:396)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.170.30.92:47710 remote=oc-qj-hdp-30-156/10.170.30.156:45454]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:757)
at java.security.AccessController.doPrivileged(Native Method)
二、问题分析
出现端口连接超时的问题,可以从网络,线程,内存,异常job等方面分析原因,此集群经过检查,排除了网络原因,内存原因。
1.先执行一个hadoop自带mr测试例子看看执行日志
开启debug:
export HADOOP_ROOT_LOGGER=DEBUG,console
执行mr任务
cd /usr/hdp/3.1.0.0-78/hadoop-mapreduce
hadoop jar ./hadoop-mapreduce-examples-3.1.1.3.1.0.0-78.jar pi 10 10
如下是nm的日志:
yarn异常:YarnException: Failed while publishing entity
2020-07-14 14:16:54,038 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,138 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,238 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,338 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,438 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,538 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,585 ERROR [Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Exception while publishing configs on JOB_SUBMITTED Event for the job : job_1594164369302_64100
org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.publishConfigsOnJobSubmittedEvent(JobHistoryEventHandler.java:1254)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1414)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:748)
2.YARN异常YarnException:Failed while publishing entity分析
mapreduce提交任务计算时,job已经结束,但是容器仍不能关闭持续等待五分钟
2020-07-14 14:16:54,038 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,138 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,238 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,338 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,438 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2020-07-14 14:16:54,538 INFO [Thread-126] org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain. Thread state is :WAITING
几分钟后抛出异常:
2020-07-14 14:19:11,245 ERROR [Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Failed to process Event JOB_FINISHED for the job : job_1594164369302_64100
org.apache.hadoop.yarn.exceptions.YarnException: Interrupted while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:551)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1405)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1795)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.handle(JobHistoryEventHandler.java:1791)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545)
……
Caused by: com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out
……
Caused by: java.net.SocketTimeoutException: Read timed out
3.根据日志提示,分析ATS这个东东是啥玩意
Timeline Service v2 默认集成嵌入HBase(HMaster、HRegionServer),进程启动 User: yarn-ats
YARN Timeline Service v.2 使用一系列collector(writers)去写数据到后端存储。collectors,AM会把跟应用相关的数据发送到timeline collectors。
根据查相关资料,发生这种情况是因为来自ATSv2的嵌入式HBASE崩溃。
三、问题解决方式
1.停止Yarn服务
Ambari -> Yarn-Actions -> Stop
2.删除Zookeeper上的ATSv2 Znode
zookeeper-client -server zookeeper-quorum-servers
rmr /atsv2-hbase-unsecure或rmr /atsv2-hbase-secure(如果是kerberized集群)
3.从HDFS移动Hbase时间线服务器Hbase嵌入式数据库
hdfs dfs -mv /atsv2/hbase /tmp/
hadoop fs -mv /services/sync/ocdp/hbase.yarnfile /tmp/
4.销毁ats在数据库中的服务
yarn app -destroy ats-hbase
5.开始使用yarn服务
Ambari - > Yarn-Actions- > Start