Background

During the first round of system testing, every data-preview feature and all Data Exploration SQL execution on the data intelligence platform stopped working. The impact is critical: it leaves the system completely unusable. Discussion with the test team surfaced no pattern in the failure scenario; the problem could not be reproduced, which added great uncertainty and difficulty to the investigation.

Problem Description

  1. None of the data intelligence platform's table-record preview features work;
  2. Data Exploration SQL on the platform cannot be executed;
  3. The Spark SQL Thrift Server has stopped:
    1. Port 10000 is not in LISTEN state
    2. No corresponding job in RUNNING state appears on the 18088 web UI
  4. The Spark SQL Thrift Server log has stopped updating and ends with an Error entry;
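The port check in item 3 can be scripted for quick triage; a minimal sketch (the host and port 10000 are taken from this incident, everything else is Python standard library):

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, i.e. something is listening."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 10000 is the Thrift Server JDBC port observed in this incident.
print("thrift server port up:", port_is_listening("127.0.0.1", 10000))
```

This is only a liveness probe; it confirms the symptom, not the cause.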

Root Cause Analysis

Analyzing the Spark SQL Thrift Server log

  • vim /mnt/disk1/log/cluster001/SERVICE-SPARK-admin/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hdh163.out

```basic
19/12/18 09:52:02 ERROR TransportClient: Failed to send RPC 7415692333163725305 to /10.66.113.163:47243: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/12/18 09:52:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
...
19/12/18 09:52:02 ERROR Client: Application application_1575963887809_1812 not found.
19/12/18 09:52:02 INFO Client: Deleted staging directory hdfs://SERVICE-HADOOP-admin/user/root/.sparkStaging/application_1575963887809_1812
19/12/18 09:52:02 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state KILLED!
19/12/18 09:52:02 INFO HiveServer2: Shutting down HiveServer2
...
19/12/18 09:52:02 INFO SparkUI: Stopped Spark web UI at http://10.66.113.163:4048
19/12/18 09:52:02 INFO AbstractService: Service:CLIService is stopped.
19/12/18 09:52:02 INFO AbstractService: Service:HiveServer2 is stopped.
```

    The log yields the following information:

  • The Spark SQL Thrift Server has stopped;

  • Stop time: around 19/12/18 09:52
  • The exception log alone does not pinpoint why it stopped abnormally
  • The applicationId is application_1575963887809_1812
  • The key message is: ERROR YarnClientSchedulerBackend: Yarn application has already exited with state KILLED!
    • Given the applicationId and the key message, the YARN ResourceManager service log is the place to look for further causes
      • Next stage:
        • Check this job's details on the YARN web UI (port 18088)
        • Search the YARN ResourceManager log for entries about this job
    • Log line 63
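Hunting for a single applicationId across large service logs is just a line filter; a throwaway sketch of the kind of search described above (the sample lines are copied from the Thrift Server log excerpt):

```python
def lines_for_app(log_text, app_id):
    """Return only the log lines that mention the given YARN application id."""
    return [line for line in log_text.splitlines() if app_id in line]

sample = (
    "19/12/18 09:52:02 ERROR Client: Application application_1575963887809_1812 not found.\n"
    "19/12/18 09:52:02 INFO HiveServer2: Shutting down HiveServer2"
)
hits = lines_for_app(sample, "application_1575963887809_1812")
print(hits)  # only the line mentioning the application id survives
```

In practice `grep <applicationId> <logfile>` does the same job on the server.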

YARN ResourceManager Analysis

YARN web UI (port 18088)

  • The YARN web UI has no record of this applicationId
    • Very strange and contradictory
  • The earliest job shown on the YARN web UI is dated Wed Dec 18 09:59:32 [20191218-09:59:32]

    • Why?

Analyzing the YARN ResourceManager log

  • vim /mnt/disk1/log/cluster001/SERVICE-HADOOP-admin/yarn/yarn-root-resourcemanager-hdh163.log

```basic
2019-12-18 09:52:03 ERROR ...unknown application application_1575963887809_0754 Skipping.
```

    • Cannot determine the cause from this line alone; keep digging
```basic
2019-12-17 15:44:55,204 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
2019-12-18 09:52:01,610 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2019-12-18 09:52:01,611 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2019-12-18 09:52:01,611 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /yarn-leader-election/SERVICE-HADOOP-admin/ActiveBreadCrumb to indicate that the local node is the most recent active...
2019-12-18 09:52:01,612 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/etc/cluster001/SERVICE-HADOOP-admin/yarn-site.xml
2019-12-18 09:52:01,614 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=refreshAdminAcls TARGET=AdminService RESULT=SUCCESS
2019-12-18 09:52:01,614 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to active state
2019-12-18 09:52:01,623 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens
2019-12-18 09:52:01,623 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens
```
  • The log shows a clear time gap (2019-12-17 15:44:55 → 2019-12-18 09:52:01), right around the time of the abnormal stop;

    • A ResourceManager HA failover? Verify further
      • For high availability (HA), ResourceManager runs as two services: one active, one standby
  • Log of the other ResourceManager service

```basic
2019-12-18 09:52:12,468 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2019-12-18 09:52:12,468 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, interrupted : java.lang.InterruptedException
2019-12-18 09:52:12,469 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor thread interrupted
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2019-12-18 09:52:12,470 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state
2019-12-18 09:52:12,470 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at hdh163/10.66.113.163
************************************************************/
2019-12-18 09:53:12,137 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting ResourceManager
STARTUP_MSG: host = hdh163/10.66.113.163
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.6.0-cdh5.7.0
```
  • Conclusion: the ResourceManager did undergo an HA active/standby failover

  • Does a ResourceManager HA failover cause the YARN job backing the Spark SQL Thrift Server to fail? Run a reproduction test.
    • Getting close to the truth
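The time-gap observation above generalizes to any service log; a sketch that flags suspiciously large gaps between consecutive timestamps (the `YYYY-MM-DD HH:MM:SS,mmm` prefix is assumed from the ResourceManager log excerpts):

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(hours=1)):
    """Return (before, after) timestamp pairs that are spaced further apart than threshold."""
    fmt = "%Y-%m-%d %H:%M:%S"
    stamps = []
    for line in lines:
        try:
            # The first 19 characters carry the date and time, e.g. '2019-12-18 09:52:01'.
            stamps.append(datetime.strptime(line[:19], fmt))
        except ValueError:
            continue  # skip lines without a leading timestamp
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]

rm_log = [
    "2019-12-17 15:44:55,204 INFO ...: Updating the current master key",
    "2019-12-18 09:52:01,610 INFO ...: Checking for any old active which needs to be fenced...",
]
print(find_gaps(rm_log))  # reports the ~18 h gap seen around the failover
```

A gap like this usually means the process was idle, dead, or (as here) in standby until a failover woke it up.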

Reproduction Test

  • Environment: dev environment 118 (a cluster, so ResourceManager HA is in place)
  • Preparation
    • ResourceManager services healthy
    • Spark SQL Thrift Server healthy
      • Port 10000 listening
        • netstat -tunlp | grep 10000
      • beeline connection verified working
  • Steps and results
    • Manually trigger a ResourceManager HA failover

```basic
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken120
# prints: active

${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken118
# prints: standby

# Stop the ResourceManager on awaken120 to force the HA failover
${HADOOP_HOME}/sbin/yarn-daemon.sh stop resourcemanager

${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken120
# prints: standby

${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken118
# prints: active
```

    • Check whether the Thrift Server is still usable, via beeline
      • ${SPARK_HOME}/bin/beeline -u jdbc:hive2://localhost:10000 -n root  # password is empty, just press Enter
      • Failed
    • Compare the Thrift Server log with the earlier abnormal log
      • Essentially identical

Root Cause Summary

  • A ResourceManager HA failover causes every job RUNNING on YARN to fail
  • A ResourceManager HA failover therefore also kills the YARN job backing the Spark SQL Thrift Server

Solution

After discussion with the components team, it was confirmed that adding the following to the Hadoop configuration file yarn-site.xml resolves the issue:

  • vim yarn-site.xml  # in vim, type :set paste in command mode before entering insert mode to avoid mangled paste formatting

```basic
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
```
  • The original configuration was:

```basic
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>false</value>
</property>
```
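A quick way to confirm the fix actually landed on every node is to parse yarn-site.xml and diff it against the expected values; a sketch (the property names are the real YARN keys from the fix above, the checker itself is ad hoc):

```python
import xml.etree.ElementTree as ET

# The two properties added by the fix above.
REQUIRED = {
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.work-preserving-recovery.enabled": "true",
}

def missing_properties(yarn_site_xml):
    """Return the required properties that are absent or not set to the expected value."""
    root = ET.fromstring(yarn_site_xml)
    found = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    return {k: v for k, v in REQUIRED.items() if found.get(k) != v}

sample = """<configuration>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
</configuration>"""
print(missing_properties(sample))  # the work-preserving flag is still absent here
```

Running this against each node's /etc/.../yarn-site.xml catches the "config never synchronized" failure mode mentioned later in this report.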

Verification Test

  • Update the configuration file yarn-site.xml and distribute it to every cluster node
  • On each node that hosts a ResourceManager, restart the ResourceManager so the change takes effect

    • ${HADOOP_HOME}/sbin/yarn-daemon.sh stop resourcemanager
    • ${HADOOP_HOME}/sbin/yarn-daemon.sh start resourcemanager
  • Steps and results

    • Manually trigger a ResourceManager HA failover

```basic
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken120
# prints: active

${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken118
# prints: standby

# Stop the ResourceManager on awaken120 to force the HA failover
${HADOOP_HOME}/sbin/yarn-daemon.sh stop resourcemanager

${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken120
# prints: standby

${HADOOP_HOME}/bin/yarn rmadmin -getServiceState awaken118
# prints: active
```

    • The YARN web UI still shows the job backing the Thrift service
    • Check whether the Thrift Server is usable, via beeline

```basic
[root@awaken118 ~]# beeline -u jdbc:hive2://localhost:10000 -n root
Connecting to jdbc:hive2://localhost:10000
19/12/18 16:43:29 INFO Utils: Supplied authorities: localhost:10000
19/12/18 16:43:29 INFO Utils: Resolved authority: localhost:10000
19/12/18 16:43:29 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.1.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:10000> show databases;
+-------------------------+--+
| databaseName            |
+-------------------------+--+
| awaken_graph            |
| awaken_result           |
| awaken_temp             |
| default                 |
| origin_algorithmengine  |
| origin_csvload          |
| origin_dag_experiment   |
| origin_dag_view         |
| origin_dataload         |
| origin_datamerge        |
| origin_kafkaload        |
+-------------------------+--+
11 rows selected (0.413 seconds)
0: jdbc:hive2://localhost:10000>
```

Summary

  • The solution above resolves this problem

Mitigation

  1. Enable a watchdog for the Spark SQL Thrift Server service
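A watchdog for the Thrift Server can be as small as a port probe plus a restart hook; a sketch, where the restart command path is a placeholder, not a verified location:

```python
import socket
import subprocess

THRIFT_PORT = 10000  # the JDBC port from this incident
# Placeholder path -- substitute the real start-thriftserver.sh location for your deployment.
START_CMD = ["/path/to/spark/sbin/start-thriftserver.sh"]

def port_up(host="127.0.0.1", port=THRIFT_PORT, timeout=2.0):
    """Probe the Thrift Server's JDBC port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watchdog_pass(check=port_up, restart=lambda: subprocess.call(START_CMD)):
    """One watchdog iteration: restart the service when the probe fails."""
    if check():
        return "healthy"
    restart()
    return "restarted"

# Run watchdog_pass() from cron every minute, or wrap it in a sleep loop.
```

The `check` and `restart` hooks are injectable, so the policy (port probe, beeline ping, YARN job lookup) can be swapped without touching the loop.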

Lessons Learned

  1. This problem implies that the Hadoop release packages currently shipped on the 多维 side likely all carry this defect; every deployment site and related environment should apply the fix in this document as soon as possible.
  2. When analysis of a single component (the Spark Thrift Server) stalls, pivot to the services it depends on; shifting the problem to an adjacent service is often what moves it toward a solution.
  3. Establish a problem-sharing channel with the components team: from their perspective this issue was solved long ago, but our component configuration was never synchronized.

Notes

Because the Spark SQL Thrift Server is a low-level service, many things can cause it to stop abnormally; this document only analyzes and resolves the specific abnormal log above. A different abnormal log calls for a fresh case-by-case analysis, but the method and line of reasoning here can serve as a reference.

Appendix: System-Test Reproduction Reference

```basic
export HADOOP_HOME=/usr/lib/cluster001/SERVICE-HADOOP-admin/
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState hdh14
${HADOOP_HOME}/bin/yarn rmadmin -getServiceState hdh19
application_1576735164547_0015 root Awaken Insight Thrift Server
[root@hdh19 ~]# ${HADOOP_HOME}/sbin/yarn-daemon.sh stop resourcemanager
stopping resourcemanager
STOPPED
```

Email Summary

  1. A ResourceManager HA failover causes every job RUNNING on YARN to fail
  2. Concretely, this includes the Spark SQL Thrift Server job that backs data queries and SQL execution