Using Arthas

Starting Arthas

Use kubenav to open a console inside the container, then run the following commands to start Arthas:

  1. curl -sk https://arthas.aliyun.com/arthas-boot.jar -o ~/.arthas-boot.jar && echo "alias as.sh='java -jar ~/.arthas-boot.jar --repo-mirror aliyun --use-http 2>&1'" >> ~/.bashrc && source ~/.bashrc && echo "source ~/.bashrc" >> ~/.bash_profile && source ~/.bash_profile
  2. as.sh


Viewing the dashboard

Type dashboard and press Enter. It displays threads, memory, GC, and other JVM-related information; press Ctrl+C to interrupt it.

thread

thread — list all threads

thread <thread-id> — show that thread's stack trace
Here we take the recent xxl-job incident in the SST environment as an example.
Symptoms: no jobs could be scheduled. Triggering either failed outright with an error, or was reported as successful while the job did not actually run until much later.


"xxl-job, admin JobTriggerPoolHelper-fastTriggerPool-31419390" Id=7462 WAITING on io.netty.bootstrap.AbstractBootstrap$PendingRegistrationPromise@319e5ba0
    at java.lang.Object.wait(Native Method)
    -  waiting on io.netty.bootstrap.AbstractBootstrap$PendingRegistrationPromise@319e5ba0
    at java.lang.Object.wait(Object.java:502)
    at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:221)
    at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131)
    at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30)
    at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:328)
    at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119)
    at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30)
    at com.xxl.rpc.remoting.net.impl.netty_http.client.NettyHttpConnectClient.init(NettyHttpConnectClient.java:64)
    at com.xxl.rpc.remoting.net.common.ConnectClient.getPool(ConnectClient.java:111)
    -  locked java.lang.Object@5c5edffc <---- but blocks 137 other threads!
    at com.xxl.rpc.remoting.net.common.ConnectClient.asyncSend(ConnectClient.java:41)
    at com.xxl.rpc.remoting.net.impl.netty_http.client.NettyHttpClient.asyncSend(NettyHttpClient.java:18)
    at com.xxl.rpc.remoting.invoker.reference.XxlRpcReferenceBean$1.invoke(XxlRpcReferenceBean.java:216)
    at com.sun.proxy.$Proxy98.run(Unknown Source)
    at com.xxl.job.admin.core.trigger.XxlJobTrigger.runExecutor(XxlJobTrigger.java:196)
    at com.xxl.job.admin.core.trigger.XxlJobTrigger.processTrigger(XxlJobTrigger.java:149)
    at com.xxl.job.admin.core.trigger.XxlJobTrigger.trigger(XxlJobTrigger.java:74)
    at com.xxl.job.admin.core.thread.JobTriggerPoolHelper$3.run(JobTriggerPoolHelper.java:77)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@1df6bfe
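The failure pattern in this dump can be reproduced in miniature: one thread holds a monitor while it waits on a slow connection, and every other thread that needs the same monitor goes into the BLOCKED state. The sketch below is illustrative only (the thread names and the 5-second "connect" stand in for NettyHttpConnectClient.init() hanging on DefaultPromise.await(); this is not xxl-job code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LockBlockDemo {
    private static final Object clientLock = new Object();

    public static void main(String[] args) throws Exception {
        CountDownLatch holdingLock = new CountDownLatch(1);

        // Thread A: holds clientLock while "connecting" to an unreachable
        // executor (simulated with a long sleep).
        Thread connecting = new Thread(() -> {
            synchronized (clientLock) {
                holdingLock.countDown();
                try {
                    Thread.sleep(5_000);
                } catch (InterruptedException ignored) {
                    // interrupted by main() once the demo has made its point
                }
            }
        }, "fastTriggerPool-1");
        connecting.start();
        holdingLock.await();

        // Thread B: another trigger thread entering the same synchronized
        // block now parks on the monitor, regardless of target address.
        Thread trigger = new Thread(() -> {
            synchronized (clientLock) { /* only reached after A releases */ }
        }, "fastTriggerPool-2");
        trigger.start();

        TimeUnit.MILLISECONDS.sleep(200);
        System.out.println(trigger.getState()); // BLOCKED

        connecting.interrupt();
        connecting.join();
        trigger.join();
    }
}
```

This is exactly what the `thread` output above shows: one WAITING thread inside the lock, and 137 BLOCKED threads queued behind it.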

Tip: copy the stack trace into IDEA to jump straight to the corresponding source code.

// ---------------------- client pool map ----------------------
/**
 * async send
 */
public static void asyncSend(XxlRpcRequest xxlRpcRequest, String address,
                             Class<? extends ConnectClient> connectClientImpl,
                             final XxlRpcReferenceBean xxlRpcReferenceBean) throws Exception {

    // client pool [tips03 : may save 35ms/100invoke if move it to constructor, but it is necessary. cause by ConcurrentHashMap.get]
    ConnectClient clientPool = ConnectClient.getPool(address, connectClientImpl, xxlRpcReferenceBean);

    try {
        // do invoke
        clientPool.send(xxlRpcRequest);
    } catch (Exception e) {
        throw e;
    }
}
// remove-create new client
synchronized (clientLock) {

    // get-valid client, avlid repeat
    connectClient = connectClientMap.get(address);
    if (connectClient!=null && connectClient.isValidate()) {
        return connectClient;
    }

    // remove old
    if (connectClient != null) {
        connectClient.close();
        connectClientMap.remove(address);
    }

    // set pool
    ConnectClient connectClient_new = connectClientImpl.newInstance();
    try {
        connectClient_new.init(address, xxlRpcReferenceBean.getSerializer(), xxlRpcReferenceBean.getInvokerFactory());
        connectClientMap.put(address, connectClient_new);
    } catch (Exception e) {
        connectClient_new.close();
        throw e;
    }
    return connectClient_new;
}
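Note that the whole lookup is guarded by a single clientLock, so a hung connection to one address stalls lookups for every address. One common mitigation, shown here purely as a hedged sketch (this is not what xxl-rpc does; the Client class is a hypothetical stand-in for ConnectClient), is to scope the locking per key with ConcurrentHashMap.computeIfAbsent:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerAddressPool {
    // Hypothetical stand-in for ConnectClient, just for illustration.
    static class Client {
        final String address;
        Client(String address) { this.address = address; }
    }

    private static final Map<String, Client> clients = new ConcurrentHashMap<>();

    // computeIfAbsent only locks while creating the missing entry, so a slow
    // connection to one dead executor no longer serializes trigger threads
    // targeting other, healthy executors.
    static Client getPool(String address) {
        return clients.computeIfAbsent(address, Client::new);
    }

    public static void main(String[] args) {
        Client a = getPool("10.0.0.1:9999");
        Client b = getPool("10.0.0.1:9999");
        System.out.println(a == b);         // true: one cached client per address
        System.out.println(clients.size()); // 1
    }
}
```

Caveat: computeIfAbsent can still briefly block unrelated keys that hash to the same bin while an entry is being computed, so a long connect timeout inside the mapping function should itself be bounded.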

Reading the code, we can see that each xxl-job trigger thread fetches a client from the connection pool when it fires, and that pool lookup is guarded by a lock, so contending threads go straight into the BLOCKED state. The logs contained many scheduling connection-timeout entries: the admin clearly could not reach the executor, the connection attempt timed out while the lock was held, and every trigger aimed at that unreachable executor blocked one more trigger thread. The blocked threads queued up until the entire trigger thread pool was exhausted, at which point no job could be scheduled at all.
Further analysis found the root cause: a developer had registered a local machine into the SST environment. When an application initializes its xxl-job registration, it resolves the local IP only once, then sends a heartbeat every 30 seconds. If the local IP later changes, the heartbeat keeps reporting the IP captured at startup, and whenever that happens the problem above occurs.
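The cache-once pattern behind that root cause looks roughly like this. This is an illustrative sketch of the failure mode, not xxl-job's actual registry code:

```java
import java.net.InetAddress;

public class RegistryHeartbeat {
    // Resolved once at startup. If the machine's address changes afterwards
    // (VPN reconnect, DHCP renewal, a laptop moving networks), every later
    // heartbeat still reports this stale value.
    private final String registeredIp;

    RegistryHeartbeat() {
        String ip;
        try {
            ip = InetAddress.getLocalHost().getHostAddress();
        } catch (Exception e) {
            ip = "127.0.0.1"; // fallback so the demo stays runnable offline
        }
        this.registeredIp = ip;
    }

    String heartbeatPayload() {
        // A safer variant would re-resolve the address here on each beat
        // instead of reusing the value cached in the constructor.
        return "{\"ip\":\"" + registeredIp + "\"}";
    }

    public static void main(String[] args) {
        RegistryHeartbeat hb = new RegistryHeartbeat();
        System.out.println(hb.heartbeatPayload());
    }
}
```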
Improvement: going forward, consider filtering registrations through an nginx whitelist so that IPs outside the SST environment cannot register with xxl-job.
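As a rough sketch of that whitelist idea, an nginx allow/deny rule in front of the admin could look like the following. The location path, upstream name, and CIDR range are all assumptions to be adapted to the real deployment:

```nginx
# Hypothetical sketch: only accept executor registrations from the SST subnet.
location /xxl-job-admin/api/registry {
    allow 10.1.0.0/16;   # assumed SST-environment executor range
    deny  all;           # developer laptops etc. get 403
    proxy_pass http://xxl_job_admin;
}
```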

arthas-idea-plugin

search class

Invoking a static method

For more detailed usage, see the official Yuque documentation:
https://www.yuque.com/arthas-idea-plugin