Problem
Troubleshooting: unable to re-login with the keytab
After Flink was configured with keytab-based Kerberos authentication, we found that once the Kerberos ticket expired about a day later, the job could no longer re-authenticate using the keytab and failed with the following error:
```
2020-02-24 16:46:34,914 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedAction as:flink@HADOOP.COM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
2020-02-24 16:46:34,919 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedActionException as:flink@HADOOP.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2020-02-24 16:46:34,921 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedAction as:flink@HADOOP.COM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
2020-02-24 16:46:34,921 DEBUG org.apache.hadoop.security.UserGroupInformation - Initiating logout for flink@HADOOP.COM
2020-02-24 16:46:34,922 DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop logout
2020-02-24 16:46:34,923 DEBUG org.apache.hadoop.security.UserGroupInformation - Initiating re-login for flink@HADOOP.COM
2020-02-24 16:46:34,924 DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop login
2020-02-24 16:46:34,924 DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop login commit
2020-02-24 16:46:34,924 DEBUG org.apache.hadoop.security.UserGroupInformation - using existing subject:[flink@HADOOP.COM, UnixPrincipal: flink, UnixNumericUserPrincipal: 1030, UnixNumericGroupPrincipal [Primary Group]: 1016]
2020-02-24 16:46:35,534 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedAction as:flink@HADOOP.COM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
2020-02-24 16:46:40,231 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedActionException as:flink@HADOOP.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2020-02-24 16:46:40,231 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedAction as:flink@HADOOP.COM (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
2020-02-24 16:46:40,231 WARN org.apache.hadoop.ipc.Client - Couldn't setup connection for flink@HADOOP.COM to hz-hadoop-test-199-141-26/10.199.141.26:8020
2020-02-24 16:46:40,232 DEBUG org.apache.hadoop.security.UserGroupInformation - PrivilegedActionException as:flink@HADOOP.COM (auth:KERBEROS) cause:java.io.IOException: Couldn't setup connection for flink@HADOOP.COM to hz-hadoop-test-199-141-26/10.199.141.26:8020
2020-02-24 16:46:40,234 INFO org.apache.hadoop.io.retry.RetryInvocationHandler - Exception while invoking mkdirs of class ClientNamenodeProtocolTranslatorPB over hz-hadoop-test-199-141-26/10.199.141.26:8020. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: Couldn't setup connection for flink@HADOOP.COM to hz-hadoop-test-199-141-26/10.199.141.26:8020; Host Details : local host is: "hz-hadoop-test-199-141-32/10.199.141.32"; destination host is: "hz-hadoop-test-199-141-26":8020;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
    at org.apache.hadoop.ipc.Client.call(Client.java:1472)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy42.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy43.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2753)
    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2724)
    at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
    at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1817)
    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:170)
    at org.apache.flink.runtime.state.filesystem.FsCheckpointStorage.initializeBaseLocations(FsCheckpointStorage.java:111)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.<init>(CheckpointCoordinator.java:280)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.<init>(CheckpointCoordinator.java:205)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.enableCheckpointing(ExecutionGraph.java:486)
    at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:338)
    at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:255)
    at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:227)
    at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:215)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120)
    at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
    at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
    at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
    at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
    at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
    at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
    at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379)
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: Couldn't setup connection for flink@HADOOP.COM to hz-hadoop-test-199-141-26/10.199.141.26:8020
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:672)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
    at org.apache.hadoop.ipc.Client.call(Client.java:1438)
    ... 45 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:553)
    at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:368)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:722)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:718)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
    ... 48 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
    at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
    at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
    at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
    at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
    ... 57 more
```
- Once the problem surfaced, I used Arthas in the test environment to inspect the loginUser held by UserGroupInformation, and found that the principals in loginUser's subject differ before and after expiry: before expiry the set contains a KerberosPrincipal, while after expiry it contains only a UnixPrincipal. From this change I inferred that after expiry the re-authentication was not performed with the configured keytab (an equivalent check in plain Java is sketched below).
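
  A minimal sketch of that same inspection in plain Java rather than Arthas (class name is just for illustration; assumes the Hadoop client classes are on the classpath; `subject` is a private field of UserGroupInformation, so it is read reflectively, the same way Arthas or a heap dump sees it):

  ```java
  import java.lang.reflect.Field;
  import javax.security.auth.Subject;
  import javax.security.auth.kerberos.KerberosPrincipal;
  import org.apache.hadoop.security.UserGroupInformation;

  public class UgiPrincipalProbe {
      public static void main(String[] args) throws Exception {
          UserGroupInformation ugi = UserGroupInformation.getLoginUser();
          // "subject" is private on UGI, so read it via reflection.
          Field f = UserGroupInformation.class.getDeclaredField("subject");
          f.setAccessible(true);
          Subject subject = (Subject) f.get(ugi);
          System.out.println("principals = " + subject.getPrincipals());
          // In our case this printed true before expiry, false after.
          System.out.println("has KerberosPrincipal = "
                  + !subject.getPrincipals(KerberosPrincipal.class).isEmpty());
      }
  }
  ```
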
- After re-reading the authentication-related code in UserGroupInformation, I confirmed that after expiry the re-authentication goes through the Kerberos ticket cache rather than the correct keytab path. Among UserGroupInformation's methods, the only one that uses this authentication approach and is invoked automatically by a background thread is reloginFromTicketCache(). To confirm this, I changed the Flink cluster's logging configuration so that these method invocations show up in the logs, restarted the cluster, and checked the logs a day later once the Kerberos ticket had expired: the method called to refresh the Kerberos credentials was indeed reloginFromTicketCache(), not the correct reloginFromKeytab().
- Examining the call sites of these two methods confirmed that when UserGroupInformation.isLoginKeytabBased() returns true, reloginFromKeytab() is used to re-authenticate, and when isLoginKeytabBased() returns false and UserGroupInformation.isLoginTicketBased() returns true, reloginFromTicketCache() is used instead (sketched below). Internally, both checks simply return the isKeytab and isKrbTkt fields of the current login user's UserGroupInformation instance, so on this cluster isKeytab had to be false and isKrbTkt true. To verify this, I dumped the JobManager's heap and opened it in JProfiler: every UserGroupInformation instance indeed had isKeytab=false and isKrbTkt=true.
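
  Roughly how that dispatch looks (a paraphrase of the relogin logic in Hadoop's RPC client around handleSaslConnectionFailure, visible in the stack trace above, not an exact quote of the 2.6.0 source; class and method names here are illustrative):

  ```java
  import java.io.IOException;
  import org.apache.hadoop.security.UserGroupInformation;

  public class ReloginDispatch {
      // With isKeytab=false and isKrbTkt=true on the login UGI, the first
      // branch can never be taken, so every relogin goes through the
      // ticket cache, which eventually holds no valid TGT.
      static void relogin() throws IOException {
          UserGroupInformation ugi = UserGroupInformation.getLoginUser();
          if (UserGroupInformation.isLoginKeytabBased()) {
              ugi.reloginFromKeytab();       // expected path for a keytab login
          } else if (UserGroupInformation.isLoginTicketBased()) {
              ugi.reloginFromTicketCache();  // the path the logs showed
          }
      }

      public static void main(String[] args) throws IOException {
          relogin();
      }
  }
  ```
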
- Reading further, isKeytab and isKrbTkt are both final and are assigned when the UserGroupInformation instance is constructed. Comparing the UserGroupInformation source used by the old Flink 1.9.0 cluster against the one on the current cluster showed that this part of the code differs. It turned out the current Flink cluster runs on Hadoop 2.6.0, whose code has a bug on JDK 1.8 that leaves isKeytab set incorrectly, while the Flink 1.9.0 cluster uses Hadoop 2.6.5, where this bug has already been fixed (the difference is paraphrased below).
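
  The gist of the difference, paraphrased from memory of the two versions rather than quoted (the fix appears to be HADOOP-10786, "Fix UserGroupInformation#reloginFromKeytab on Java 8"): on JDK 8, Krb5LoginModule stores a KeyTab object in the Subject's private credentials instead of KerberosKey entries, so the old detection always comes up empty:

  ```java
  import javax.security.auth.Subject;
  import javax.security.auth.kerberos.KerberosKey;
  import javax.security.auth.kerberos.KeyTab;

  public class KeytabDetection {
      // Hadoop 2.6.0 style: looks for KerberosKey entries, which JDK 8's
      // Krb5LoginModule no longer stores -> always false on JDK 8, so UGI
      // believes the login is ticket-based.
      static boolean isKeytabHadoop260(Subject subject) {
          return !subject.getPrivateCredentials(KerberosKey.class).isEmpty();
      }

      // Hadoop 2.6.5 style (with the fix): detects the KeyTab object that
      // JDK 8 actually stores, so isKeytab becomes true again.
      static boolean isKeytabHadoop265(Subject subject) {
          return !subject.getPrivateCredentials(KeyTab.class).isEmpty();
      }
  }
  ```
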
