1 简述

jstack是jdk自带的线程堆栈分析工具,用于生成java虚拟机当前时刻的线程快照。线程快照是当前java虚拟机内每一条线程正在执行的方法堆栈的集合,生成线程快照的主要目的是定位线程出现长时间停顿的原因,如线程间死锁、死循环、请求外部资源导致的长时间等待等。 线程出现停顿的时候通过jstack来查看各个线程的调用堆栈,就可以知道没有响应的线程到底在后台做什么事情,或者等待什么资源。

  1. [webuser@TESTENV hsperfdata_webuser]$ jstack -help
  2. Usage:
  3. jstack [-l] <pid>
  4. (to connect to running process) 连接活动线程
  5. jstack -F [-m] [-l] <pid>
  6. (to connect to a hung process) 连接阻塞线程
  7. jstack [-m] [-l] <executable> <core>
  8. (to connect to a core file) 连接dump的文件
  9. jstack [-m] [-l] [server_id@]<remote server IP or hostname>
  10. (to connect to a remote debug server) 连接远程服务器
  11. Options:
  12. -F to force a thread dump. Use when jstack <pid> does not respond (process is hung)
  13. -m to print both java and native frames (mixed mode)
  14. -l long listing. Prints additional information about locks
  15. -h or -help to print this help message
  • -F:当进程挂起了,此时 jstack [-l] pid 是没有相应的,这时候可使用此参数来强制打印堆栈信息,强制jstack,一般情况不需要使用。
  • m:打印java和native c/c++框架的所有栈信息。可以打印JVM的堆栈,以及Native的栈帧,一般应用排查不需要使用。
  • l: 打印关于锁的附加信息。

    2 简单使用示例:

    可以先通过jps获得Java进程
    [webuser@-TESTENV hsperfdata_webuser]$ jps
    23457 xx-test.jar
    23479 xx-like.jar
    
    然后执行 jstack pid 查看指定进程的当前堆栈情况 ```shell [webuser@-TESTENV hsperfdata_webuser]$ jstack 23479 2021-04-22 21:16:08 Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode):

“pool-2-thread-1” #45 prio=5 os_prio=0 tid=0x00007fdfc8b53000 nid=0x7a58 waiting on condition [0x00007fdfa7dfc000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at io.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:577) at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:476) at java.lang.Thread.run(Thread.java:748)

“qtp1383519982-259” #259 prio=5 os_prio=0 tid=0x00007fdfc0445000 nid=0x78c5 waiting on condition [0x00007fdfa1cd0000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)

- parking to wait for  <0x00000000c5afbbf0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:875)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:925)
at java.lang.Thread.run(Thread.java:748)

“qtp1383519982-258” #258 prio=5 os_prio=0 tid=0x00007fdfc0444000 nid=0x78c4 waiting on condition [0x00007fdfa1dd1000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)

- parking to wait for  <0x00000000c5afbbf0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:875)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:925)
at java.lang.Thread.run(Thread.java:748)

“qtp1383519982-257” #257 prio=5 os_prio=0 tid=0x00007fdfb8154800 nid=0x78c3 waiting on condition [0x00007fdfa1ed2000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method)

- parking to wait for  <0x00000000c5afbbf0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:875)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:925)
at java.lang.Thread.run(Thread.java:748)

“Attach Listener” #230 daemon prio=9 os_prio=0 tid=0x00007fdfcc00b800 nid=0x75c0 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

“logback-10” #84 daemon prio=5 os_prio=0 tid=0x00007fdfe0014800 nid=0x5dca waiting on condition [0x00007fdfa22d4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

- parking to wait for  <0x00000000c515fb90> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

截出了部分打印内容

也可以将指定进程的当前堆栈情况记录到某个文件中:
```shell
jstack 23479 > /data/logs/jstack_info.txt

3 伪案例分析

在几乎没有什么调用的情况下,服务器cpu使用率一直居高不下,非常有可能时程序中的某个线程出现了死循环。cpu占用率过高,除了死循环,常见的可能还有 内存泄漏,导致大量full GC、主机cpu超卖。

3.1 死循环

运行下述代码,死循环,造成CPU占用较高的情况


import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

/**
 * Created by haoll on 2021/4/22
 */
public class Test {

    private static Executor executor = Executors.newFixedThreadPool(2);


    public static Object block = new Object();


    public static void testMethod1(){
        Task task1 = new Task();
        executor.execute(task1);
    }


    static class Task implements Runnable{

        @Override
        public void run() {
            synchronized (block){
                add();
            }
        }

        public void add(){
            int i = 0;
            while (true) {
                i++;
            }
        }
    }
}

排查开始~
第一步:使用top指令,定位CPU占用较高的进程

[webuser@-TESTENV middleplatform]$ top

top - 21:32:13 up 194 days, 17:42,  1 user,  load average: 1.04, 1.91, 2.12
Tasks: 208 total,   2 running, 206 sleeping,   0 stopped,   0 zombie
%Cpu(s): 53.3 us,  1.3 sy,  0.0 ni, 45.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3883016 total,   432172 free,  2665676 used,   785168 buff/cache
KiB Swap:  4063228 total,  3602380 free,   460848 used.   989076 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                  
32687 webuser   20   0 3627128 604488  13824 S 103.3 15.6   7:13.00 java                                                                                     
32665 webuser   20   0 3631304 621724  13896 S   4.0 16.0   0:47.94 java                                                                                     
   12 root      20   0       0      0      0 R   0.3  0.0 139:51.84 rcuos/0                                                                                  
  871 root      20   0       0      0      0 S   0.3  0.0   0:00.16 kworker/1:1                                                                              
10413 root      20   0 3575824  25420    772 S   0.3  0.7 387:11.80 java                                                                                     
32673 webuser   20   0 3670260 583704  13892 S   0.3 15.0   0:45.29 java                                                                                     
    1 root      20   0  123508   2564   1392 S   0.0  0.1  72:25.10 systemd                                                                                  
    2 root      20   0       0      0      0 S   0.0  0.0   0:07.44 kthreadd                                                                                 
    3 root      20   0       0      0      0 S   0.0  0.0  10:36.92 ksoftirqd/0                                                                              
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                             
    7 root      rt   0       0      0      0 S   0.0  0.0   0:11.37 migration/0                                                                              
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh                                                                                   
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/0                                                                                  
   10 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/1
从上面的结果我们就可以看出来PID=32687的cpu占有率高的离谱<br />**第二步:使用top -Hp ${进程id}指令,查看指定进程下各个线程的cpu使用情况**
[webuser@CSDN-TESTENV middleplatform-interaction]$ top -Hp 32687

top - 21:34:36 up 194 days, 17:44,  1 user,  load average: 2.23, 1.95, 2.10
Threads:  66 total,   3 running,  63 sleeping,   0 stopped,   0 zombie
%Cpu(s): 53.5 us,  1.9 sy,  0.0 ni, 44.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3883016 total,   429100 free,  2667968 used,   785948 buff/cache
KiB Swap:  4063228 total,  3602576 free,   460652 used.   986784 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                   
  582 webuser   20   0 3627128 605856  13828 R 99.7 15.6   8:36.79 java                                                                                      
  574 webuser   20   0 3627128 605856  13828 R  2.3 15.6   0:10.19 java                                                                                      
32720 webuser   20   0 3627128 605856  13828 S  1.0 15.6   0:22.43 java                                                                                      
  572 webuser   20   0 3627128 605856  13828 S  0.7 15.6   0:02.07 java                                                                                      
  576 webuser   20   0 3627128 605856  13828 S  0.3 15.6   0:00.95 java                                                                                      
  611 webuser   20   0 3627128 605856  13828 S  0.3 15.6   0:00.27 java                                                                                      
  630 webuser   20   0 3627128 605856  13828 S  0.3 15.6   0:00.69 java                                                                                      
  632 webuser   20   0 3627128 605856  13828 S  0.3 15.6   0:00.76 java                                                                                      
  633 webuser   20   0 3627128 605856  13828 R  0.3 15.6   0:00.69 java                                                                                      
32687 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.00 java                                                                                      
32696 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:14.69 java                                                                                      
32702 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.41 java                                                                                      
32706 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.38 java                                                                                      
32708 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.54 java                                                                                      
32710 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.00 java                                                                                      
32711 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.01 java                                                                                      
32716 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.00 java                                                                                      
32724 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:05.37 java                                                                                      
32728 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.04 java                                                                                      
32729 webuser   20   0 3627128 605856  13828 S  0.0 15.6   0:00.30 java                                                                                      sh
发现线程ID为582的cpu占用率为99.7%,继续分析~<br />注:PID虽然名为进程控制符,但其用途不限于特指进程id。如上中的PID列指的就是线程的id。<br />**第三步:使用printf "%x" xxx指令,将582转换为16进制。**
[webuser@-TESTENV middleplatform]$ printf "%x" 582
246

第四步:使用jstack pid指令,查看当前的堆栈信息;并根据上一步得到的16进制的线程id,找到肇事线程。如果线程较多的话,我们可以使用grep指定来根据关键字定位行,如:jstack 32687 | grep -10 246:显示【|】前面的指令打印出的信息中的,含有关键字【246】的那一行以及其前后各10行的数据

[webuser@-TESTENV middleplatform]$ jstack 32687 | grep -10 246
    - parking to wait for  <0x00000000ff5d1ee0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"pool-3-thread-1" #64 prio=5 os_prio=0 tid=0x00007fec6094d800 nid=0x246 runnable [0x00007fec2b6ce000]
   java.lang.Thread.State: RUNNABLE
    at net.csdn.interaction.like.controller.Test$Task.add(Test.java:37)
    at net.csdn.interaction.like.controller.Test$Task.run(Test.java:30)
    - locked <0x00000000ff655cb0> (a java.lang.Object)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"sentinel-metric-aggregator-task-thread-1" #63 daemon prio=5 os_prio=0 tid=0x00007fec60822000 nid=0x244 waiting on condition [0x00007fec2b9cf000]
   java.lang.Thread.State: TIMED_WAITING (parking)
--
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:875)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:925)
    at java.lang.Thread.run(Thread.java:748)

第五步:分析肇事线程堆栈信息。
image.png

  • pool-3-thread-1: 线程名。
  • prio=5: 线程优先级。
  • nid=0x246: 线程十六进制id。
  • java.lang.Thread.State: RUNNABLE 线程状态。
  • at net.csdn.interaction.like.controller.Test$Task.add(Test.java:37)线程当前所处方法。

第六步:我们已经定义到了问题代码所在的位置,这个时候去查看、解决问题即可。

3.2 死锁

import lombok.SneakyThrows;

import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

/**
 * Created by haoll on 2021/4/23
 */
public class DeadlockDemo {

    private static Executor executor = Executors.newFixedThreadPool(2);

    private static final Object sync1 = new Object();

    private static final Object sync2 = new Object();


    static class MyRunnableOne implements Runnable{

        @SneakyThrows
        @Override
        public void run() {
            synchronized (sync1){
                System.out.println("one线程~获得到锁1~");
                Thread.sleep(1000);
                synchronized (sync2){
                    System.out.println("one线程~获取到锁2~");
                }
            }

        }
    }

    static class MyRunnableTwo implements Runnable{

        @SneakyThrows
        @Override
        public void run() {
            synchronized (sync2){
                System.out.println("two线程~获得到锁2~");
                Thread.sleep(1000);
                synchronized (sync1){
                    System.out.println("two线程~获得到锁1~");
                }
            }

        }

    }

    public static void deadLock() {
        MyRunnableOne myRunnableOne = new MyRunnableOne();
        MyRunnableTwo myRunnableTwo = new MyRunnableTwo();
        executor.execute(myRunnableOne);
        executor.execute(myRunnableTwo);
    }

}

上述代码是伪代码,应该不会有人这样去写代码~,一般线上出现死锁的时候,都会造成线上应用卡顿,但卡顿不一定就是死锁造成,要具体问题具体分析。
排查开始~
第一步:使用jps指令,查看所有java进程。
第二步:使用jstack pid指令,查看指定进程的堆栈信息,观察并定位到死锁线程

[webuser@-TESTENV interaction-like]$ jstack 12484 | grep -10 BLOCKED
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"pool-3-thread-2" #67 prio=5 os_prio=0 tid=0x00007fb7985c4800 nid=0x3210 waiting for monitor entry [0x00007fb75c4e3000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at net.csdn.interaction.like.controller.DeadlockDemo$MyRunnableTwo.run(DeadlockDemo.java:46)
    - waiting to lock <0x00000000c6e7e6e0> (a java.lang.Object)
    - locked <0x00000000c6e7e6f0> (a java.lang.Object)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"pool-3-thread-1" #66 prio=5 os_prio=0 tid=0x00007fb7985c3800 nid=0x320f waiting for monitor entry [0x00007fb75c5e4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at net.csdn.interaction.like.controller.DeadlockDemo$MyRunnableOne.run(DeadlockDemo.java:29)
    - waiting to lock <0x00000000c6e7e6f0> (a java.lang.Object)
    - locked <0x00000000c6e7e6e0> (a java.lang.Object)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"sentinel-metric-aggregator-task-thread-1" #65 daemon prio=5 os_prio=0 tid=0x00007fb7982df800 nid=0x320d waiting on condition [0x00007fb75c8e5000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)

这样发生死锁的线程就定位到了。我们可以通过:设置获取锁的超时时间、死锁检测等手段预防死锁发生。但是如果一旦发生了死锁就没法解开了,只能停掉程序,修复bug后再启动服务了。