1.场景:接口请求响应很慢。

UAT环境所有接口响应都特别慢,最少都要3秒,有的甚至10几秒。


2.初步排查问题。

通过排查,数据库查询正常,redis也正常,排除了数据库和redis问题。通过top命令定位到某台服务器(basic服务)CPU占用特别高,4核最高时候可达350+,所以问题出现在这台服务器上。


3.服务器问题出现在哪(哪些进程和线程导致问题发生)。

通过脚本定位到线程id和大概问题所在(Full GC次数过多/某个线程进入WAITING状态/死锁等)
0)top命令,然后按shift+p按照CPU排序,找到高CPU对应的进程PID。

1)新建脚本并将将下面内容复制到脚本:

  1. #!/bin/bash
  2. #
  3. # 当JVM占用CPU特别高时,查看CPU正在做什么
  4. # 可输入两个参数:1、pid Java进程ID,必须参数 2、打印线程ID上下文行数,可选参数,默认打印10行
  5. #
  6. pid=$1
  7. if test -z $pid
  8. then
  9. echo "pid can not be null!"
  10. exit
  11. else
  12. echo "checking pid($pid)"
  13. fi
  14. if test -z "$(jps -l | cut -d '' -f 1 | grep $pid)"
  15. then
  16. echo "process of $pid is not exists"
  17. exit
  18. fi
  19. lineNum=$2
  20. if test -z $lineNum
  21. then
  22. $lineNum=10
  23. fi
  24. jstack $pid >> "$pid".bak
  25. ps -mp $pid -o THREAD,tid,time | sort -k2r | awk '{if ($1 !="USER" && $2 != "0.0" && $8 !="-") print $8;}' | xargs printf "%x\n" >> "$pid".tmp
  26. tidArray="$( cat $pid.tmp)"
  27. for tid in $tidArray
  28. do
  29. echo "******************************************************************* ThreadId=$tid **************************************************************************"
  30. cat "$pid".bak | grep $tid -A $lineNum
  31. done
  32. rm -rf $pid.bak
  33. rm -rf $pid.tmp

补充说明:
上述脚本等价于:
A. top -H -p [进程id](或者ps -mp pid -o THREAD,tid,time):找到线程id,会有很多个。
B.printf “%x\n” [线程id] :将线程id转16进制,方便下面通过jstack排查。
C.jstack [进程id] |grep -A [行数] [线程id的16进制] :使用jstack命令,打印进程下全部的线程堆栈信息。

2)授权:
chmod 777 脚本名称,授权之后如下:
image.png

3)执行脚本:
./find.sh 18864 60
上面命令2个参数:第一个参数18864进程id,也就是top命令返回的PID; 第二个参数60是打印的行数,如果不写默认是10行。

补充说明:
执行到此步骤就拿到了线程堆栈信息,需要分析以下情况:
A.如果很多GC线程,说明需要排查GC和堆内存信息。
B.如果有类似found 1 deadlock相关,需要排查死锁情况。
C.如果执行多次,比如间隔5s打印一次,发现几次都有相同线程处于等待状态,则需要排查为什么某个线程会一直处于WAITING状态。
因为本次是排查GC相关,所以会有下面4、5、6步骤。

4)使用jstat工具(jdk自带)查看垃圾回收统计:
jstat -gcutil 18864 1000 200
上面命令3个参数:第一个参数18864进程id,也就是top命令返回的PID; 第二个参数1000是执行间隔,单位毫秒;第三个参数是打印次数。

5)通过保存的堆dump记录,服务器下载到本地,通过JProfiler分析dump文件(保存dump前提是要在启动的时候设置 JVM参数 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mydata/logs/)。

6)使用JProfiler分析dump文件,找到导致内存OOM的对象,并定位到代码。


4.实践。

程序中执行死循环创建对象,导致OOM从而引发频繁Full GC,最终导致CPU使用率特别高。
top找到高CPU的进程号PID18864 (top命令 shift+p按照CPU排序)
![P)8CFL79{U{7X8UL7ZOK{N.png
PID18864对应的就是basic服务(ps aux|grep 18864)
image.png
执行脚本(./find.sh 18864 60),返回结果如下:

  1. [root@dev-app11 sh]# ./find.sh 18864 60
  2. checking pid(18864)
  3. ******************************************************************* ThreadId=49b2 **************************************************************************
  4. "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f8d1401e800 nid=0x49b2 runnable
  5. "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f8d14020800 nid=0x49b3 runnable
  6. "VM Periodic Task Thread" os_prio=0 tid=0x00007f8d14170800 nid=0x49bc waiting on condition
  7. JNI global references: 292
  8. ******************************************************************* ThreadId=49b3 **************************************************************************
  9. "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f8d14020800 nid=0x49b3 runnable
  10. "VM Periodic Task Thread" os_prio=0 tid=0x00007f8d14170800 nid=0x49bc waiting on condition
  11. JNI global references: 292
  12. ******************************************************************* ThreadId=49b9 **************************************************************************
  13. "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f8d1413b800 nid=0x49b9 waiting on condition [0x0000000000000000]
  14. java.lang.Thread.State: RUNNABLE
  15. "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f8d14139800 nid=0x49b8 runnable [0x0000000000000000]
  16. java.lang.Thread.State: RUNNABLE
  17. "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f8d14106800 nid=0x49b6 in Object.wait() [0x00007f8d18390000]
  18. java.lang.Thread.State: WAITING (on object monitor)
  19. at java.lang.Object.wait(Native Method)
  20. at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
  21. - locked <0x00000000f00f1e78> (a java.lang.ref.ReferenceQueue$Lock)
  22. at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
  23. at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)
  24. "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f8d14104000 nid=0x49b5 in Object.wait() [0x00007f8d18491000]
  25. java.lang.Thread.State: WAITING (on object monitor)
  26. at java.lang.Object.wait(Native Method)
  27. at java.lang.Object.wait(Object.java:502)
  28. at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
  29. - locked <0x00000000f00f1f10> (a java.lang.ref.Reference$Lock)
  30. at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
  31. "VM Thread" os_prio=0 tid=0x00007f8d140fa800 nid=0x49b4 runnable
  32. "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f8d1401e800 nid=0x49b2 runnable
  33. "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f8d14020800 nid=0x49b3 runnable
  34. "VM Periodic Task Thread" os_prio=0 tid=0x00007f8d14170800 nid=0x49bc waiting on condition
  35. JNI global references: 292
  36. ******************************************************************* ThreadId=49b1 **************************************************************************
  37. "DestroyJavaVM" #86 prio=5 os_prio=0 tid=0x00007f8d14009800 nid=0x49b1 waiting on condition [0x0000000000000000]
  38. java.lang.Thread.State: RUNNABLE
  39. "http-nio-8011-AsyncTimeout" #84 daemon prio=5 os_prio=0 tid=0x00007f8d17d40000 nid=0x4a2e waiting on condition [0x00007f8cbcae4000]
  40. java.lang.Thread.State: TIMED_WAITING (sleeping)
  41. at java.lang.Thread.sleep(Native Method)
  42. at org.apache.coyote.AbstractProtocol$AsyncTimeout.run(AbstractProtocol.java:1200)
  43. at java.lang.Thread.run(Thread.java:748)
  44. "http-nio-8011-Acceptor-0" #83 daemon prio=5 os_prio=0 tid=0x00007f8d165ba000 nid=0x4a2d runnable [0x00007f8cbcbe5000]
  45. java.lang.Thread.State: RUNNABLE
  46. at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  47. at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
  48. at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
  49. - locked <0x00000000f369d3b0> (a java.lang.Object)
  50. at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:453)
  51. at java.lang.Thread.run(Thread.java:748)
  52. "http-nio-8011-ClientPoller-1" #82 daemon prio=5 os_prio=0 tid=0x00007f8d14f42800 nid=0x4a2c runnable [0x00007f8cbcce6000]
  53. java.lang.Thread.State: RUNNABLE
  54. at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
  55. at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
  56. at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
  57. at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
  58. - locked <0x00000000f36c43b8> (a sun.nio.ch.Util$3)
  59. - locked <0x00000000f36c43a8> (a java.util.Collections$UnmodifiableSet)
  60. - locked <0x00000000f36c4290> (a sun.nio.ch.EPollSelectorImpl)
  61. at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
  62. at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:785)
  63. at java.lang.Thread.run(Thread.java:748)
  64. "http-nio-8011-ClientPoller-0" #81 daemon prio=5 os_prio=0 tid=0x00007f8d15d99800 nid=0x4a2b runnable [0x00007f8cbcde7000]
  65. java.lang.Thread.State: RUNNABLE
  66. at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
  67. at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
  68. at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
  69. at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
  70. - locked <0x00000000f36b3b30> (a sun.nio.ch.Util$3)
  71. - locked <0x00000000f36b3b20> (a java.util.Collections$UnmodifiableSet)
  72. - locked <0x00000000f36b3a08> (a sun.nio.ch.EPollSelectorImpl)
  73. at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
  74. at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:785)
  75. at java.lang.Thread.run(Thread.java:748)
  76. "http-nio-8011-exec-10" #80 daemon prio=5 os_prio=0 tid=0x00007f8d1576b000 nid=0x4a2a waiting on condition [0x00007f8cbcee8000]
  77. java.lang.Thread.State: WAITING (parking)
  78. at sun.misc.Unsafe.park(Native Method)
  79. - parking to wait for <0x00000000f36af5e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  80. at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  81. at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
  82. at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
  83. at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:103)
  84. at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:31)
  85. at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
  86. at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
  87. at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  88. at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
  89. at java.lang.Thread.run(Thread.java:748)
  90. "http-nio-8011-exec-9" #79 daemon prio=5 os_prio=0 tid=0x00007f8d15769000 nid=0x4a29 waiting on condition [0x00007f8cbcfe9000]
  91. java.lang.Thread.State: WAITING (parking)
  92. ******************************************************************* ThreadId=49b4 **************************************************************************
  93. "VM Thread" os_prio=0 tid=0x00007f8d140fa800 nid=0x49b4 runnable
  94. "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f8d1401e800 nid=0x49b2 runnable
  95. "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f8d14020800 nid=0x49b3 runnable
  96. "VM Periodic Task Thread" os_prio=0 tid=0x00007f8d14170800 nid=0x49bc waiting on condition
  97. JNI global references: 292
  98. ...........
  99. [root@dev-app11 sh]#

脚本中大部分都是GC问题,使用jstat工具查看垃圾回收统计,发现Full GC频繁(不到1秒就会触发一次):
![P0Y1(4LNBS4QH%S4H8SP{K.png

之前启动的时候,已经设置了参数,将JVM的dump保存了下来:
image.png

下载到本地,通过JProfiler分析dump文件,找到导致内存OOM的对象,并定位到代码:
打开下载的dump文件,操作如下:
image.png
image.png
打开以后如下:
image.png
可以看到4069042共占用162M内存,通过图形化查看这个对象被谁引用:
image.png
image.png
点击“确定”进入下面界面:
image.png
点击“在图标中显示”:
image.png
点击上图中的 加号“+”:
可以找到上一个引用它的地方:
image.png
再次点击上图中的 加号“+”:
可以定位到对象对应代码的类和方法
image.png
鼠标放到上面,可以精确到第几行:
image.png
进入idea查看TestJss类的209行验证:
image.png
定位结束。


5.参考资料。