Problem

Troubleshooting a failure to re-login with the keytab

After configuring Flink to log in with keytab authentication, we found that once the Kerberos ticket (TGT) expired a day later, the cluster could not re-authenticate from the keytab and reported the following error:

  2020-02-24 16:46:34,914 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedAction as:flink@HADOOP.COM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
  2020-02-24 16:46:34,919 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedActionException as:flink@HADOOP.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
  2020-02-24 16:46:34,921 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedAction as:flink@HADOOP.COM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
  2020-02-24 16:46:34,921 DEBUG org.apache.hadoop.security.UserGroupInformation - Initiating logout for flink@HADOOP.COM
  2020-02-24 16:46:34,922 DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop logout
  2020-02-24 16:46:34,923 DEBUG org.apache.hadoop.security.UserGroupInformation - Initiating re-login for flink@HADOOP.COM
  2020-02-24 16:46:34,924 DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop login
  2020-02-24 16:46:34,924 DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop login commit
  2020-02-24 16:46:34,924 DEBUG org.apache.hadoop.security.UserGroupInformation - using existing subject:[flink@HADOOP.COM, UnixPrincipal: flink, UnixNumericUserPrincipal: 1030, UnixNumericGroupPrincipal [Primary Group]: 1016]
  2020-02-24 16:46:35,534 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedAction as:flink@HADOOP.COM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
  2020-02-24 16:46:40,231 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedActionException as:flink@HADOOP.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
  2020-02-24 16:46:40,231 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedAction as:flink@HADOOP.COM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
  2020-02-24 16:46:40,231 WARN org.apache.hadoop.ipc.Client - Couldn't setup connection for flink@HADOOP.COM to hz-hadoop-test-199-141-26/10.199.141.26:8020
  2020-02-24 16:46:40,232 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedActionException as:flink@HADOOP.COM (auth:KERBEROS) cause:java.io.IOException: Couldn't setup connection for flink@HADOOP.COM to hz-hadoop-test-199-141-26/10.199.141.26:8020
  2020-02-24 16:46:40,234 INFO org.apache.hadoop.io.retry.RetryInvocationHandler - Exception while invoking mkdirs of class ClientNamenodeProtocolTranslatorPB over hz-hadoop-test-199-141-26/10.199.141.26:8020. Trying to fail over immediately.
  java.io.IOException: Failed on local exception: java.io.IOException: Couldn't setup connection for flink@HADOOP.COM to hz-hadoop-test-199-141-26/10.199.141.26:8020; Host Details : local host is: "hz-hadoop-test-199-141-32/10.199.141.32"; destination host is: "hz-hadoop-test-199-141-26":8020;
      at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
      at org.apache.hadoop.ipc.Client.call(Client.java:1472)
      at org.apache.hadoop.ipc.Client.call(Client.java:1399)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
      at com.sun.proxy.$Proxy42.mkdirs(Unknown Source)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
      at com.sun.proxy.$Proxy43.mkdirs(Unknown Source)
      at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2753)
      at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2724)
      at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
      at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
      at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
      at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1817)
      at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:170)
      at org.apache.flink.runtime.state.filesystem.FsCheckpointStorage.initializeBaseLocations(FsCheckpointStorage.java:111)
      at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.<init>(CheckpointCoordinator.java:280)
      at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.<init>(CheckpointCoordinator.java:205)
      at org.apache.flink.runtime.executiongraph.ExecutionGraph.enableCheckpointing(ExecutionGraph.java:486)
      at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:338)
      at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:255)
      at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:227)
      at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:215)
      at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120)
      at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
      at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
      at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
      at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
      at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
      at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
      at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
      at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379)
      at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
      at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
      at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
      at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  Caused by: java.io.IOException: Couldn't setup connection for flink@HADOOP.COM to hz-hadoop-test-199-141-26/10.199.141.26:8020
      at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:672)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
      at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
      at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730)
      at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
      at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
      at org.apache.hadoop.ipc.Client.call(Client.java:1438)
      ... 45 more
  Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
      at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
      at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:553)
      at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:368)
      at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:722)
      at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:718)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
      at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
      ... 48 more
  Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
      at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
      at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
      at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
      at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
      at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
      at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
      at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
      ... 57 more
  • After spotting the failure, we used Arthas in the test environment to inspect the loginUser held by UserGroupInformation. The principals in the loginUser's Subject differed before and after expiry: before expiry the Subject contained a KerberosPrincipal, while after expiry it held only the Unix principals (UnixPrincipal and the numeric user/group principals). From this change we inferred that the re-login after expiry had not used the configured keytab (the first sketch after this list shows the same check in plain Java).
  • Re-reading the authentication code in UserGroupInformation confirmed that the re-login after expiry went through the Kerberos ticket cache rather than the correct keytab mechanism. The only method in UserGroupInformation that re-authenticates from the ticket cache and is invoked automatically by a background thread is reloginFromTicketCache(). To confirm this, I changed the Flink cluster's log configuration to record method invocations, restarted the cluster, and checked the logs a day later once the Kerberos ticket had expired: the method called to refresh the Kerberos credentials was indeed reloginFromTicketCache(), not the correct reloginFromKeytab().
  • Tracing the call sites of these two methods showed that reloginFromKeytab() is used to re-authenticate when UserGroupInformation.isLoginKeytabBased() returns true, and reloginFromTicketCache() is used when isLoginKeytabBased() returns false and UserGroupInformation.isLoginTicketBased() returns true (see the second sketch below). Internally these two checks simply return the isKeytab and isKrbTkt fields of the current login user's UserGroupInformation instance, so on our cluster isKeytab had to be false and isKrbTkt true. To verify this, I dumped the heap of the Flink JobManager and opened it in JProfiler: every UserGroupInformation instance had isKeytab=false and isKrbTkt=true.
  • Reading the code further showed that isKeytab and isKrbTkt are both final fields, assigned when the UserGroupInformation instance is constructed. Comparing the UserGroupInformation source used by the old Flink 1.9.0 cluster against the one in the current cluster revealed that this part of the code differs. We then confirmed that the current Flink cluster runs Hadoop 2.6.0, whose UserGroupInformation has a known bug on JDK 1.8 that leaves isKeytab set incorrectly, whereas the Flink 1.9.0 cluster runs Hadoop 2.6.5, where this bug has already been fixed (the last sketch below paraphrases the difference).
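
A minimal sketch of the check we ran with Arthas, done here in plain Java via reflection instead. The private field name "subject" matches the Hadoop 2.x sources; the class name below is illustrative:

  import java.lang.reflect.Field;
  import javax.security.auth.Subject;
  import org.apache.hadoop.security.UserGroupInformation;

  public class PrintLoginPrincipals {
    public static void main(String[] args) throws Exception {
      UserGroupInformation ugi = UserGroupInformation.getLoginUser();
      // UGI keeps its Subject in a private field named "subject";
      // Arthas reads it in-process, here we get at it via reflection.
      Field f = UserGroupInformation.class.getDeclaredField("subject");
      f.setAccessible(true);
      Subject subject = (Subject) f.get(ugi);
      // A healthy keytab login lists a KerberosPrincipal here; after the
      // broken re-login only the Unix principals remain.
      subject.getPrincipals().forEach(p ->
          System.out.println(p.getClass().getName() + " -> " + p.getName()));
    }
  }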
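The relogin decision described in the second and third bullets, paraphrased as a hedged sketch. The UserGroupInformation methods shown are real Hadoop APIs; the surrounding class is illustrative, not a copy of the actual renewal thread:

  import java.io.IOException;
  import org.apache.hadoop.security.UserGroupInformation;

  public final class KerberosRelogin {
    private KerberosRelogin() {}

    // Re-login with whichever mechanism UGI believes it logged in with.
    // isLoginKeytabBased()/isLoginTicketBased() just surface the final
    // isKeytab/isKrbTkt fields set in the UGI constructor, so on Hadoop
    // 2.6.0 + JDK 8 a keytab login still lands in the ticket-cache branch.
    public static void relogin() throws IOException {
      if (UserGroupInformation.isLoginKeytabBased()) {
        UserGroupInformation.getLoginUser().reloginFromKeytab();      // expected path
      } else if (UserGroupInformation.isLoginTicketBased()) {
        UserGroupInformation.getLoginUser().reloginFromTicketCache(); // what we observed
      }
    }
  }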
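Finally, a hedged paraphrase of the version difference from the last bullet. On JDK 8, Krb5LoginModule stores the keytab in the Subject as a javax.security.auth.kerberos.KeyTab private credential rather than as KerberosKey entries, so Hadoop 2.6.0's check never sees it; as far as we can tell this is the behavior addressed upstream by HADOOP-10786 ("Fix UGI#reloginFromKeytab on Java 8"). The helper methods below are illustrative; the real code assigns the fields inline in the UserGroupInformation constructor, and the exact shape of the fix differs across branches:

  import javax.security.auth.Subject;
  import javax.security.auth.kerberos.KerberosKey;
  import javax.security.auth.kerberos.KerberosTicket;
  import javax.security.auth.kerberos.KeyTab;

  public final class KeytabDetection {
    private KeytabDetection() {}

    // Hadoop 2.6.0 (paraphrased): on JDK 8 the keytab is stored as a KeyTab
    // private credential, not as KerberosKey entries, so this returns false
    // even for a keytab login.
    static boolean isKeytabHadoop260(Subject subject) {
      return !subject.getPrivateCredentials(KerberosKey.class).isEmpty();
    }

    // Post-fix (paraphrased): look for the KeyTab credential that JDK 8
    // actually stores.
    static boolean isKeytabFixed(Subject subject) {
      return !subject.getPrivateCredentials(KeyTab.class).isEmpty();
    }

    // isKrbTkt is derived the same way in both versions.
    static boolean isKrbTkt(Subject subject) {
      return !subject.getPrivateCredentials(KerberosTicket.class).isEmpty();
    }
  }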