## Background
During the first round of system testing, we found that none of the data-preview functions on the data intelligence platform worked, and data-exploration SQL could not be executed either. The impact was fatal: the system was effectively unusable. After talking with the system testers, we found no pattern to when the failure occurred; the problem could not be reproduced on demand, which added enormous uncertainty and difficulty to the investigation.
## Problem Description

- None of the data intelligence platform's table-preview functions work;
- Data-exploration SQL on the platform cannot be executed;
- The SparkSQL thrift server has stopped;
- Port 10000 is not in LISTEN state;
- No corresponding job in RUNNING state is visible on the YARN web UI (port 18088);
- The SparkSQL thrift server log is no longer being updated and ends with an ERROR entry.
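The port and process checks behind the symptoms above can be sketched as a quick shell probe. This is a minimal sketch: the port number is from this incident, and the `netstat` output layout is assumed.

```shell
# Quick health probe for the thrift server port.
# check_port succeeds only if something is LISTENing on the given port.
check_port() {
  netstat -tunlp 2>/dev/null | grep -q ":${1} .*LISTEN"
}

if check_port 10000; then
  echo "port 10000: LISTEN (thrift server up)"
else
  echo "port 10000: not listening (thrift server likely down)"
fi
```

If the port is down, the next step is the one the report takes: read the thrift server's `.out` log from the end backwards.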
## Root Cause Analysis

### Analyzing the SparkSQL thrift server log

```basic
vim /mnt/disk1/log/cluster001/SERVICE-SPARK-admin/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hdh163.out
```

```basic
19/12/18 09:52:02 ERROR TransportClient: Failed to send RPC 7415692333163725305 to /10.66.113.163:47243: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/12/18 09:52:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
...
19/12/18 09:52:02 ERROR Client: Application application_1575963887809_1812 not found.
19/12/18 09:52:02 INFO Client: Deleted staging directory hdfs://SERVICE-HADOOP-admin/user/root/.sparkStaging/application_1575963887809_1812
19/12/18 09:52:02 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state KILLED!
19/12/18 09:52:02 INFO HiveServer2: Shutting down HiveServer2
19/12/18 09:52:02 INFO SparkUI: Stopped Spark web UI at http://10.66.113.163:4048
19/12/18 09:52:02 INFO AbstractService: Service:CLIService is stopped.
19/12/18 09:52:02 INFO AbstractService: Service:HiveServer2 is stopped.
```
The log yields the following information:

- The SparkSQL thrift server has indeed stopped;
- It stopped at around 19/12/18 09:52;
- The exception log alone does not conclusively explain the abnormal stop;
- The applicationId is application_1575963887809_1812;
- The key line is: `ERROR YarnClientSchedulerBackend: Yarn application has already exited with state KILLED!`
- Given the applicationId and that key line, the next step is to look at the YARN ResourceManager side for more clues.
- Next stage:
   - check this job on the YARN web UI (port 18088);
   - analyze the ResourceManager log entries for this job.
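The triage above can be scripted: pull the YARN applicationId out of the thrift server log, then grep the ResourceManager log for it. A minimal sketch, using an inline sample in place of the real `.out` file:

```shell
# Extract the applicationId from a thrift-server log excerpt.
# The sample below stands in for the real log file; point LOG at the
# actual .out file in practice.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
19/12/18 09:52:02 ERROR Client: Application application_1575963887809_1812 not found.
19/12/18 09:52:02 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state KILLED!
EOF

APP_ID=$(grep -o 'application_[0-9]*_[0-9]*' "$LOG" | head -n 1)
echo "applicationId: $APP_ID"
# On the real cluster, the follow-up step is:
# grep "$APP_ID" /mnt/disk1/log/cluster001/SERVICE-HADOOP-admin/yarn/yarn-root-resourcemanager-hdh163.log
rm -f "$LOG"
```

With the id in a variable, the same one-liner works for the web UI search and for any node's ResourceManager log.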
### YARN ResourceManager analysis

YARN web UI (port 18088):

- The web UI has no record of that applicationId;
- This is odd and seemingly contradictory;
- The earliest job shown on the web UI is from Wed Dec 18 09:59:32 [20191218-09:59:32].
```basic
vim /mnt/disk1/log/cluster001/SERVICE-HADOOP-admin/yarn/yarn-root-resourcemanager-hdh163.log
```

```basic
2019-12-18 09:52:03 ERROR ...unknown application application_1575963887809_0754 Skipping.
```

- No conclusion can be drawn from this yet; keep going.

```basic
2019-12-17 15:44:55,204 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
2019-12-18 09:52:01,610 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2019-12-18 09:52:01,611 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2019-12-18 09:52:01,611 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /yarn-leader-election/SERVICE-HADOOP-admin/ActiveBreadCrumb to indicate that the local node is the most recent active...
2019-12-18 09:52:01,612 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/etc/cluster001/SERVICE-HADOOP-admin/yarn-site.xml
2019-12-18 09:52:01,614 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=refreshAdminAcls TARGET=AdminService RESULT=SUCCESS
2019-12-18 09:52:01,614 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2019-12-18 09:52:01,623 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens
2019-12-18 09:52:01,623 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens
```
The log shows an obvious gap in timestamps (from 2019-12-17 15:44:55 to 2019-12-18 09:52:01), right around the time of the abnormal stop.

- A ResourceManager HA failover? This needs further confirmation.
- For HA (high availability), the ResourceManager runs as two services: one active, one standby.
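The timestamp-gap reading above can be automated rather than eyeballed. A minimal sketch (assumes GNU `date`; the sample lines stand in for the real ResourceManager log):

```shell
# Print adjacent log lines whose timestamps are more than an hour apart --
# such a gap marks the window in which this RM was not the active one.
find_gaps() {
  awk '{
    ts = $1 " " $2; sub(/,.*/, "", ts)                 # "2019-12-17 15:44:55"
    cmd = "date -d \"" ts "\" +%s"; cmd | getline e; close(cmd)
    if (prev != "" && e - prev > 3600) print prevts " -> " ts
    prev = e; prevts = ts
  }' "$1"
}

LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2019-12-17 15:44:55,204 INFO AbstractDelegationTokenSecretManager: Updating the current master key
2019-12-18 09:52:01,610 INFO ActiveStandbyElector: Checking for any old active which needs to be fenced...
EOF
find_gaps "$LOG"
rm -f "$LOG"
```

On the real log this prints the failover window directly, instead of requiring a manual scan through thousands of INFO lines.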
### The other ResourceManager's log

```basic
2019-12-18 09:52:12,468 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2019-12-18 09:52:12,468 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, interrupted : java.lang.InterruptedException
2019-12-18 09:52:12,469 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor thread interrupted
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2019-12-18 09:52:12,470 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at hdh163/10.66.113.163
************************************************************/
2019-12-18 09:53:12,137 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting ResourceManager
STARTUP_MSG: host = hdh163/10.66.113.163
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.6.0-cdh5.7.0
```
Conclusion: the ResourceManager did undergo an HA active/standby failover.

- **Does a ResourceManager HA failover kill the YARN job backing the SparkSQL thrift server? Run a reproduction test.**
- We are getting close to the root cause.
## Reproduction Test

- Environment: development environment 118 (a cluster, with ResourceManager HA enabled)
- Preconditions:
   - the ResourceManager services are healthy;
   - the SparkSQL thrift server is healthy;
   - port 10000 is present: `netstat -tunlp | grep 10000`;
   - a beeline connection works.
### Steps and results

Stop the ResourceManager on awaken120 to force an HA failover:

```basic
${HADOOP_HOME}/sbin/yarn-daemon.sh stop resourcemanager
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken120   # prints: standby
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken118   # prints: active
```
- Check whether the thrift server is still usable, via beeline:
   - `${SPARK_HOME}/bin/beeline -u jdbc:hive2://localhost:10000 -n root` (password empty, just press Enter)
   - it fails;
- Check whether the thrift server log matches the earlier failure log:
   - it is essentially identical.

### Analysis summary

- **A ResourceManager HA failover causes all RUNNING jobs on YARN to fail.**
- **A ResourceManager HA failover therefore kills the YARN job backing the SparkSQL thrift server.**

## Solution

After discussing with the components team, we confirmed that adding the following settings to the Hadoop configuration file yarn-site.xml fixes the problem:

- `vim yarn-site.xml` (in vim, type `:set paste` in command mode before entering insert mode, to avoid paste-indentation mangling)

```basic
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
```
- The original setting was:

```basic
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>false</value>
</property>
```
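After distributing the new file it is easy to lose track of which nodes actually carry the new settings. A small sketch to check (the path is this environment's; treat it as an assumption elsewhere):

```shell
# Report whether the two recovery settings appear in a given yarn-site.xml.
check_recovery_conf() {
  conf="$1"
  for key in yarn.resourcemanager.recovery.enabled \
             yarn.resourcemanager.work-preserving-recovery.enabled; do
    if grep -q "$key" "$conf" 2>/dev/null; then
      echo "$key: present"
    else
      echo "$key: MISSING"
    fi
  done
}

# Path used on this cluster (adjust per environment):
check_recovery_conf /etc/cluster001/SERVICE-HADOOP-admin/yarn-site.xml
```

Run it on every node (e.g. via ssh in a loop) before restarting the ResourceManagers, so a node with a stale file is caught early.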
## Verification Test

- Update yarn-site.xml and distribute it to every node in the cluster;
- Restart the ResourceManager on each node that runs one, so the change takes effect:

```basic
${HADOOP_HOME}/sbin/yarn-daemon.sh stop resourcemanager
${HADOOP_HOME}/sbin/yarn-daemon.sh start resourcemanager
```
### Steps and results

Stop the ResourceManager on awaken120 to force an HA failover:

```basic
${HADOOP_HOME}/sbin/yarn-daemon.sh stop resourcemanager
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken120   # prints: standby
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken118   # prints: active
```
- The YARN web UI still shows the job backing the thrift server;
- Check whether the thrift server is still usable, via beeline:

```basic
[root@awaken118 ~]# beeline -u jdbc:hive2://localhost:10000 -n root
Connecting to jdbc:hive2://localhost:10000
19/12/18 16:43:29 INFO Utils: Supplied authorities: localhost:10000
19/12/18 16:43:29 INFO Utils: Resolved authority: localhost:10000
19/12/18 16:43:29 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.1.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:10000> show databases;
+-------------------------+--+
|      databaseName       |
+-------------------------+--+
| awaken_graph            |
| awaken_result           |
| awaken_temp             |
| default                 |
| origin_algorithmengine  |
| origin_csvload          |
| origin_dag_experiment   |
| origin_dag_view         |
| origin_dataload         |
| origin_datamerge        |
| origin_kafkaload        |
+-------------------------+--+
11 rows selected (0.413 seconds)
0: jdbc:hive2://localhost:10000>
```
### Conclusion

- The solution above fixes the problem.

## Mitigation Strategy

- Enable a watchdog for the SparkSQL thrift server service.
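The document only names the watchdog idea, so here is a minimal sketch of one, meant to be run from cron; the restart command and port are assumptions to adapt to the actual deployment:

```shell
# Hypothetical thrift-server watchdog: if nothing listens on the service
# port, try to restart. START_CMD is an assumption -- point it at the
# real start script for this deployment.
PORT=10000
START_CMD="${SPARK_HOME:-/usr/lib/spark}/sbin/start-thriftserver.sh"

port_up() {
  netstat -tunlp 2>/dev/null | grep -q ":${1} .*LISTEN"
}

if port_up "$PORT"; then
  echo "thrift server port $PORT: OK"
else
  echo "thrift server port $PORT: down, restarting"
  "$START_CMD" || echo "restart failed, escalate to on-call"
fi
```

Scheduled once a minute from cron, this turns the multi-hour outage seen in testing into at most a minute of downtime; it does not remove the need for the yarn-site.xml fix, which keeps the job alive through failovers in the first place.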
## Summary

- This problem implies that every Hadoop release package currently shipped on the 多维 side likely carries this defect; every site and related environment should apply the fix in this document as soon as possible.
- When analysis of a single service (the Spark thrift server) hits a dead end, shift to analyzing its dependent services; moving the question to a related service can bring the problem closer to a solution.
- Establish a problem-resolution channel with the components team: from their perspective this problem was solved long ago, but our component configuration was never synchronized.
## Notes

Because the SparkSQL thrift server is a low-level service, there are many possible causes for an abnormal stop; this document only analyzes and resolves the specific failure signature above. If the error log does not look like this one, the problem needs fresh case-by-case analysis, but the method and approach here can serve as a reference.
## Appendix: reference for reproduction during system test

```basic
export HADOOP_HOME=/usr/lib/cluster001/SERVICE-HADOOP-admin/
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState hdh14
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState hdh19
application_1576735164547_0015 root Awaken Insight Thrift Server
[root@hdh19 ~]# ${HADOOP_HOME}/sbin/yarn-daemon.sh stop resourcemanager
stopping resourcemanager
STOPPED
```
## Email Summary

- A ResourceManager HA failover causes all RUNNING jobs on YARN to fail;
- for example, the SparkSQL thrift server job that backs data preview and SQL execution.
