Go's built-in pprof is an extremely handy performance analysis tool: we sample a running program to obtain a profile, then inspect and analyze that profile to find the system's performance bottlenecks. The performance problems we run into most often are:

  1. Excessive CPU load
  2. Excessive memory usage
  3. Long blocking times
  4. GC running too frequently
  5. Goroutine leaks

What Is a Profile

A profile can be thought of as a snapshot of the system at a particular moment. Through this snapshot we can observe the state of the system at that instant, such as its CPU load and memory usage, and that information helps us analyze the system's performance.

Go ships with profiling support in its standard library. The aspects of a running application it mainly covers are the following (a small sketch of collecting these profiles programmatically follows the list):

  • CPU profile: reports the program's CPU usage, sampling data about the application on the CPU and its registers at a fixed frequency
  • Memory profile (heap profile): reports the program's memory usage
  • Block profile: reports where goroutines spend time not running, which helps analyze and locate bottlenecks such as deadlocks
  • Goroutine profile: reports goroutine usage: which goroutines exist and what their call relationships look like
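
These profiles can also be written directly to files with the runtime/pprof package, with no HTTP server involved; the rest of this article collects them over HTTP via net/http/pprof instead. A minimal sketch (the file names are arbitrary and error handling is omitted for brevity):

package main

import (
    "os"
    "runtime"
    "runtime/pprof"
)

func main() {
    // Block events are only recorded after a sampling rate is set.
    runtime.SetBlockProfileRate(1)

    // CPU profile: samples from StartCPUProfile until StopCPUProfile.
    cpuFile, _ := os.Create("cpu.pb.gz")
    pprof.StartCPUProfile(cpuFile)
    defer pprof.StopCPUProfile()

    // ... run the workload you want to measure ...

    // Heap profile: a snapshot of current memory usage.
    heapFile, _ := os.Create("heap.pb.gz")
    pprof.WriteHeapProfile(heapFile)

    // Goroutine and block profiles are looked up by name.
    goroutineFile, _ := os.Create("goroutine.pb.gz")
    pprof.Lookup("goroutine").WriteTo(goroutineFile, 0)
    blockFile, _ := os.Create("block.pb.gz")
    pprof.Lookup("block").WriteTo(blockFile, 0)
}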

Preparation Before Profiling

For hands-on practice, this article uses the code from the GitHub repository below and analyzes it with pprof. The Go version used here is 1.13:

root@GIH-L-3453:~# go version
go version go1.13.8 linux/amd64

First, clone the project:


git clone https://github.com/wolfogre/go-pprof-practice.git

Then put the project under Go module management; here we name the module ggg:

go mod init ggg

Build the binary:

go build .

Start the program and check that it is printing logs:

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ ./ggg
2021/08/01 17:32:43 dog.go:26: dog eat
2021/08/01 17:32:43 dog.go:30: dog drink
2021/08/01 17:32:43 dog.go:34: dog shit
2021/08/01 17:32:43 dog.go:38: dog pee
2021/08/01 17:32:43 dog.go:42: dog run
2021/08/01 17:32:43 dog.go:47: dog howl
2021/08/01 17:32:43 wolf.go:27: wolf eat
2021/08/01 17:32:43 wolf.go:31: wolf drink
2021/08/01 17:32:43 wolf.go:40: wolf shit
2021/08/01 17:32:43 wolf.go:44: wolf pee
2021/08/01 17:32:43 wolf.go:48: wolf run
2021/08/01 17:32:43 wolf.go:52: wolf howl
2021/08/01 17:32:44 cat.go:25: cat eat
2021/08/01 17:32:44 cat.go:29: cat drink
2021/08/01 17:32:44 cat.go:33: cat shit
2021/08/01 17:32:44 cat.go:37: cat pee

Looking at the system with top, two problems stand out:

  1. The ggg process is using far too much memory: 2.1 GB already.
  2. The ggg process's CPU usage is far too high, hovering around 50-70%.

top - 17:33:45 up  1:51,  0 users,  load average: 0.38, 0.39, 0.42
Tasks:  13 total,   1 running,  12 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.4 us,  0.1 sy,  0.0 ni, 93.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  12655.4 total,  10100.0 free,   2219.4 used,    336.0 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  10449.5 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2562 lijunshi  20   0 2647292   2.1g   7160 S  51.2  16.9   0:30.06 ggg
    1 root      20   0     896    576    512 S   0.0   0.0   0:00.01 init
    7 root      20   0     896     84     20 S   0.0   0.0   0:00.00 init
    9 root      20   0    2924   1976   1728 S   0.0   0.0   0:01.38 wslconnect
   10 root      20   0   10036   4992   3320 S   0.0   0.0   0:00.03 bash
  838 root      20   0    9972   3688   3268 S   0.0   0.0   0:00.00 su
  839 lijunshi  20   0   10132   5536   3688 S   0.0   0.0   0:00.12 bash
 1425 root      20   0     896     84     20 S   0.0   0.0   0:00.00 init
 1427 root      20   0    2924   2108   1852 S   0.0   0.0   0:00.13 wslconnect
 1428 root      20   0   10036   5084   3424 S   0.0   0.0   0:00.02 bash
 1447 root      20   0    9972   3492   3076 S   0.0   0.0   0:00.00 su
 1448 lijunshi  20   0   10132   5272   3452 S   0.0   0.0   0:00.07 bash
 2571 lijunshi  20   0   10880   3704   3132 R   0.0   0.0   0:00.00 top

Before analyzing this misbehaving program with pprof, install graphviz, a graph-rendering tool used to draw the flame graphs and call graphs that make performance problems much easier to see. On Ubuntu a single command is enough; installation instructions for other systems are here: https://graphviz.gitlab.io/download/

sudo apt install graphviz

Profiling with pprof does require modifying the code (it is intrusive), but in fact only three lines need to be added:

import (
...
    _ "net/http/pprof" // 引入pprof 模块
...
)

func main() {
...
    runtime.SetMutexProfileFraction(1) // enable mutex profiling; omit this line if you don't need the mutex profile
    runtime.SetBlockProfileRate(1)     // enable block profiling; omit this line if you don't need the block profile
...
}
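
Importing net/http/pprof registers its handlers on http.DefaultServeMux, which is what the demo below relies on. If your service uses its own ServeMux, the handlers can be registered explicitly instead; a sketch (the helper name is arbitrary, and this is not needed for this demo):

import (
    "net/http"
    "net/http/pprof"
)

func newDebugMux() *http.ServeMux {
    mux := http.NewServeMux()
    // The index page links to every available profile (heap, goroutine, block, mutex, ...).
    mux.HandleFunc("/debug/pprof/", pprof.Index)
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
    return mux
}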

For the project we are analyzing, the main function should be written as follows, and pprof is then enabled directly. Note that this program listens on port 6060; once it is running, http://localhost:6060/debug/pprof/ serves an index of all the available profiles.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof"
    "os"
    "runtime"
    "time"

    "github.com/wolfogre/go-pprof-practice/animal"
)

func main() {
    log.SetFlags(log.Lshortfile | log.LstdFlags)
    log.SetOutput(os.Stdout)

    runtime.GOMAXPROCS(1)
    runtime.SetMutexProfileFraction(1)
    runtime.SetBlockProfileRate(1)

    go func() {
        if err := http.ListenAndServe(":6060", nil); err != nil {
            log.Fatal(err)
        }
        os.Exit(0)
    }()

    for {
        for _, v := range animal.AllAnimals {
            v.Live()
        }
        time.Sleep(time.Second)
    }
}

The Analysis Process

Tracking Down High CPU Usage

A look at the process with top shows CPU usage at about 75%, so we need to find out why this program is burning so much CPU.

top - 18:08:48 up  2:26,  0 users,  load average: 0.00, 0.00, 0.03
Tasks:  13 total,   1 running,  12 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.4 us,  0.0 sy,  0.0 ni, 90.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  12655.4 total,  12225.7 free,     94.7 used,    334.9 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  12317.1 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2575 lijunshi  20   0  482620  12748   5876 S  75.1   0.1   0:02.31 ggg
    1 root      20   0     896    576    512 S   0.0   0.0   0:00.01 init
    7 root      20   0     896     84     20 S   0.0   0.0   0:00.00 init
    9 root      20   0    2924   1976   1728 S   0.0   0.0   0:01.45 wslconnect
   10 root      20   0   10036   4992   3320 S   0.0   0.0   0:00.03 bash
  838 root      20   0    9972   3688   3268 S   0.0   0.0   0:00.00 su
  839 lijunshi  20   0   10132   5536   3688 S   0.0   0.0   0:00.14 bash
 1425 root      20   0     896     84     20 S   0.0   0.0   0:00.00 init
 1427 root      20   0    2924   2108   1852 S   0.0   0.0   0:00.29 wslconnect
 1428 root      20   0   10036   5084   3424 S   0.0   0.0   0:00.02 bash
 1447 root      20   0    9972   3492   3076 S   0.0   0.0   0:00.00 su
 1448 lijunshi  20   0   10132   5272   3452 S   0.0   0.0   0:00.07 bash
 2571 lijunshi  20   0   10880   3704   3132 R   0.0   0.0   0:00.55 top

Analyze it with the go tool pprof command. Remember that the 6060 below is the port your process is listening on; it must match the port defined in main.

go tool pprof http://localhost:6060/debug/pprof/profile

This samples the CPU for 30 seconds by default and then drops into an interactive command line, where top shows CPU usage by function and list prints the annotated source of a chosen function.

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/profile
Fetching profile over HTTP from http://localhost:6060/debug/pprof/profile
Saved profile in /home/lijunshi/pprof/pprof.ggg.samples.cpu.002.pb.gz
File: ggg
Type: cpu
Time: Aug 1, 2021 at 6:11pm (CST)
Duration: 30s, Total samples = 13.15s (43.83%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 13.14s, 99.92% of 13.15s total
Dropped 4 nodes (cum <= 0.07s)
      flat  flat%   sum%        cum   cum%
    13.14s 99.92% 99.92%     13.14s 99.92%  github.com/wolfogre/go-pprof-practice/animal/felidae/tiger.(*Tiger).Eat
         0     0% 99.92%     13.14s 99.92%  github.com/wolfogre/go-pprof-practice/animal/felidae/tiger.(*Tiger).Live
         0     0% 99.92%     13.14s 99.92%  main.main
         0     0% 99.92%     13.14s 99.92%  runtime.main
(pprof) list Eat
Total: 13.15s
ROUTINE ======================== github.com/wolfogre/go-pprof-practice/animal/felidae/tiger.(*Tiger).Eat in /home/lijunshi/go/pkg/mod/github.com/wolfogre/go-pprof-practice@v0.0.0-20190402114113-8ce266a210ee/animal/felidae/tiger/tiger.go
    13.14s     13.14s (flat, cum) 99.92% of Total
         .          .     19:}
         .          .     20:
         .          .     21:func (t *Tiger) Eat() {
         .          .     22:   log.Println(t.Name(), "eat")
         .          .     23:   loop := 10000000000
    13.14s     13.14s     24:   for i := 0; i < loop; i++ {
         .          .     25:           // do nothing
         .          .     26:   }
         .          .     27:}
         .          .     28:
         .          .     29:func (t *Tiger) Drink() {

Here github.com/wolfogre/go-pprof-practice/animal/felidae/tiger.(*Tiger).Eat accounts for 99.92% of CPU time, and the listed source shows that a useless empty loop is spinning the CPU; that is why the CPU load stays so high.
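
Based on the listing above, a minimal sketch of the fix: the empty loop does nothing useful, so it can simply be deleted (or commented out):

func (t *Tiger) Eat() {
    log.Println(t.Name(), "eat")
    // The ten-billion-iteration empty loop is gone; it only spun the CPU.
}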

If graphviz was installed earlier, we can also render these profiles in the web UI, which makes the bottleneck even easier to see:

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof -http=:8000 http://localhost:6060/debug/pprof/profile
Fetching profile over HTTP from http://localhost:6060/debug/pprof/profile
Saved profile in /home/lijunshi/pprof/pprof.ggg.samples.cpu.003.pb.gz
Serving web UI on http://localhost:8000

Open http://localhost:8000 in a browser.

First, look at the function call graph:

[screenshot: 截屏2021-08-01 下午11.10.02.png]

Next, the flame graph:

[screenshot: 截屏2021-08-01 下午11.10.15.png]

The annotated source can also be viewed directly in the web UI:

[screenshot: 截屏2021-08-01 下午11.10.22.png]

Having found the problem, we comment out the offending code, rebuild, and run again. Watching top, CPU usage has come back down.

top - 19:22:46 up  3:40,  0 users,  load average: 0.07, 0.19, 0.30
Tasks:  18 total,   1 running,  17 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  12655.4 total,  10083.7 free,   2234.3 used,    337.3 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  10426.6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    1 root      20   0     896    576    512 S   0.0   0.0   0:00.01 init
    7 root      20   0     896     84     20 S   0.0   0.0   0:00.00 init
    9 root      20   0    2924   1976   1728 S   0.0   0.0   0:02.18 wslconnect
   10 root      20   0   10036   4992   3320 S   0.0   0.0   0:00.03 bash
  838 root      20   0    9972   3688   3268 S   0.0   0.0   0:00.00 su
  839 lijunshi  20   0   10132   5536   3688 S   0.0   0.0   0:00.21 bash
 2683 root      20   0     896     84     20 S   0.0   0.0   0:00.00 init
 2685 root      20   0    2924   2048   1796 S   0.0   0.0   0:00.07 wslconnect
 2686 root      20   0   10036   5036   3380 S   0.0   0.0   0:00.02 bash
 2705 root      20   0    9972   3748   3328 S   0.0   0.0   0:00.00 su
 2706 lijunshi  20   0   10132   5316   3500 S   0.0   0.0   0:00.03 bash
 3602 root      20   0     896     84     20 S   0.0   0.0   0:00.00 init
 3604 root      20   0    2924   1940   1684 S   0.0   0.0   0:00.05 wslconnect
 3605 root      20   0   10036   5124   3456 S   0.0   0.0   0:00.03 bash
 4602 lijunshi  20   0 2721276   2.1g   7820 S   0.0  16.9   0:01.12 ggg
 4611 lijunshi  20   0  483416  11816   7340 S   0.0   0.1   0:00.00 go
 4617 lijunshi  20   0  485164  16100   7536 S   0.0   0.1   0:00.01 pprof
 4636 root      20   0   10880   3788   3212 R   0.0   0.0   0:00.00 top

Tracking Down High Memory Usage

The process is still holding 2.1 GB of memory, which is clearly unreasonable, so the next step is to dig into the memory problem.

Use go tool pprof http://localhost:6060/debug/pprof/heap to inspect memory usage (by default it reports in-use memory, inuse_space), then use top and list to pinpoint the offending function.

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/lijunshi/pprof/pprof.ggg.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz
File: ggg
Type: inuse_space
Time: Aug 1, 2021 at 9:35pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 768MB, 100% of 768MB total
      flat  flat%   sum%        cum   cum%
     768MB   100%   100%      768MB   100%  ggg/animal/muridae/mouse.(*Mouse).Steal
         0     0%   100%      768MB   100%  ggg/animal/muridae/mouse.(*Mouse).Live
         0     0%   100%      768MB   100%  main.main
         0     0%   100%      768MB   100%  runtime.main
(pprof) list Steal
Total: 768MB
ROUTINE ======================== ggg/animal/muridae/mouse.(*Mouse).Steal in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/muridae/mouse/mouse.go
     768MB      768MB (flat, cum)   100% of Total
         .          .     46:func (m *Mouse) Steal() {
         .          .     47:   log.Println(m.Name(), "steal")
         .          .     48:
         .          .     49:   max := constant.Gi
         .          .     50:   for len(m.buffer) * constant.Mi < max {
     768MB      768MB     51:           m.buffer = append(m.buffer, [constant.Mi]byte{})
         .          .     52:   }
         .          .     53:}
(pprof)

Run go tool pprof -http=:8000 http://localhost:6060/debug/pprof/heap to get the visual charts:

[screenshot: 截屏2021-08-01 下午11.10.56.png]

The function ggg/animal/muridae/mouse.(*Mouse).Steal has allocated a large amount of memory (about 1.5 GB according to the web chart) and keeps holding it without ever releasing it; that is where the 2.1 GB RES we saw in top comes from. To cut memory usage, the first step is to comment out the unreasonable hoarding logic in Steal. After the change, top shows memory back to normal.
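
A sketch of the kind of change the text describes, based on the listing above (the exact fix is up to you; the point is to stop growing the buffer without bound and to let go of memory that is no longer needed):

func (m *Mouse) Steal() {
    log.Println(m.Name(), "steal")
    // The loop that appended 1 MiB blocks until the buffer reached 1 GiB is
    // removed. If the buffer were still used elsewhere, setting m.buffer = nil
    // once it is no longer needed would let the GC reclaim the memory.
}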

Tracking Down Overly Frequent GC

If Go's GC runs too often, program performance suffers: because of its stop-the-world phases, every GC cycle pauses the program for a short time, and the collector itself costs CPU. While investigating the memory problem we already noticed the program allocating useless memory at a high rate, which intuitively suggests the GC will be triggered more often than it should be.
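
Besides the GC trace used below, GC frequency can also be watched from inside the program with runtime.ReadMemStats; a minimal sketch (the one-second polling interval is arbitrary):

package main

import (
    "log"
    "runtime"
    "time"
)

func main() {
    var prev uint32
    for range time.Tick(time.Second) {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        // NumGC is the cumulative count of completed GC cycles, so the
        // difference from the previous reading is the number of cycles
        // that ran during the last second.
        log.Printf("GC cycles in the last second: %d (heap in use: %d MB)",
            m.NumGC-prev, m.HeapInuse/1024/1024)
        prev = m.NumGC
    }
}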

Watching the GC log shows that the program is collecting garbage abnormally often. The command is GODEBUG=gctrace=1 ./ggg | grep gc

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ GODEBUG=gctrace=1 ./ggg | grep gc
gc 1 @0.114s 0%: 0.002+0.37+0.002 ms clock, 0.002+0.19/0.076/0+0.002 ms cpu, 16->16->0 MB, 17 MB goal, 1 P
scvg: 0 MB released
scvg: inuse: 16, idle: 46, sys: 63, released: 46, consumed: 16 (MB)
scvg: 1 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 47, consumed: 15 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 49, consumed: 13 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 51, consumed: 11 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 53, consumed: 9 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 55, consumed: 7 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 57, consumed: 5 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 59, consumed: 3 (MB)
gc 2 @3.126s 0%: 0.021+6.7+0.003 ms clock, 0.021+0.23/6.3/0+0.003 ms cpu, 16->16->0 MB, 17 MB goal, 1 P
scvg: inuse: 16, idle: 46, sys: 63, released: 46, consumed: 16 (MB)
scvg: 1 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 47, consumed: 15 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 49, consumed: 13 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 51, consumed: 11 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 53, consumed: 9 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 55, consumed: 7 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 57, consumed: 5 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 59, consumed: 3 (MB)
gc 3 @6.137s 0%: 0.006+0.32+0.003 ms clock, 0.006+0.10/0.11/0+0.003 ms cpu, 16->16->0 MB, 17 MB goal, 1 P
scvg: inuse: 16, idle: 46, sys: 63, released: 46, consumed: 16 (MB)
scvg: 1 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 48, consumed: 15 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 50, consumed: 13 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 52, consumed: 11 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 54, consumed: 9 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 56, consumed: 7 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 58, consumed: 5 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 60, consumed: 3 (MB)
gc 4 @9.142s 0%: 0.005+0.37+0.002 ms clock, 0.005+0.13/0.13/0+0.002 ms cpu, 16->16->0 MB, 17 MB goal, 1 P

Pay attention to lines like the one below. The field 16->16->0 MB means the heap was 16 MB when the cycle started, 16 MB when it ended, and only 0 MB of that was still live; in other words, every GC cycle reclaims roughly 16 MB. That is odd, so it is reasonable to suspect that somewhere the program keeps allocating 16 MB that it never actually uses.

gc 6 @15.151s 0%: 0.007+0.43+0.002 ms clock, 0.007+0.14/0.14/0+0.002 ms cpu, 16->16->0 MB, 17 MB goal, 1

Use go tool pprof http://localhost:6060/debug/pprof/allocs to look at the program's cumulative allocations, and again use top and list to pinpoint the problem function.

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/allocs
Fetching profile over HTTP from http://localhost:6060/debug/pprof/allocs
Saved profile in /home/lijunshi/pprof/pprof.ggg.alloc_objects.alloc_space.inuse_objects.inuse_space.010.pb.gz
File: ggg
Type: alloc_space
Time: Aug 1, 2021 at 9:46pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.27GB, 100% of 1.27GB total
      flat  flat%   sum%        cum   cum%
    1.27GB   100%   100%     1.27GB   100%  ggg/animal/canidae/dog.(*Dog).Run
         0     0%   100%     1.27GB   100%  ggg/animal/canidae/dog.(*Dog).Live
         0     0%   100%     1.27GB   100%  main.main
         0     0%   100%     1.27GB   100%  runtime.main
(pprof) list Run
Total: 1.27GB
ROUTINE ======================== ggg/animal/canidae/dog.(*Dog).Run in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/canidae/dog/dog.go
    1.27GB     1.27GB (flat, cum)   100% of Total
         .          .     38:   log.Println(d.Name(), "pee")
         .          .     39:}
         .          .     40:
         .          .     41:func (d *Dog) Run() {
         .          .     42:   log.Println(d.Name(), "run")
    1.27GB     1.27GB     43:   _ = make([]byte, 16 * constant.Mi)
         .          .     44:}
         .          .     45:
         .          .     46:func (d *Dog) Howl() {
         .          .     47:   log.Println(d.Name(), "howl")
         .          .     48:}

go tool pprof -http=:8000 http://localhost:6060/debug/pprof/allocs renders the web charts:

[screenshot: 截屏2021-08-01 下午11.11.08.png]

The data above shows that Run in animal/canidae/dog/dog.go repeatedly allocates a 16 MB slice that is never actually used, and has already allocated well over 1 GB in total. That is what drives the frequent GC, so this logic needs to be optimized, for example by allocating something much smaller. After the change below, GC runs far less often: a goroutine's stack starts at only 2 KB but grows on demand, and because the smaller slice does not escape the function, the compiler can allocate it on the stack instead of the heap, so it no longer adds GC pressure.

func (d *Dog) Run() {
    log.Println(d.Name(), "run")
    _ = make([]byte, 16 * constant.Ki)
}
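
If a buffer of the original size really were needed on every call, another way to take pressure off the allocator and the GC would be to reuse it through a sync.Pool. A sketch under that assumption (it presumes sync is added to dog.go's imports):

var runBuf = sync.Pool{
    New: func() interface{} {
        // Allocated only when the pool has nothing to hand out.
        return make([]byte, 16*constant.Mi)
    },
}

func (d *Dog) Run() {
    log.Println(d.Name(), "run")
    buf := runBuf.Get().([]byte)
    // ... use buf ...
    runBuf.Put(buf) // return it so later calls reuse the same memory
}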

Tracking Down Too Many Goroutines

One of Go's big advantages is that it can start huge numbers of goroutines, but if the goroutine count is abnormally high, a goroutine leak has probably occurred, and the most direct consequence is that those goroutines pin a fair amount of memory. We can likewise use pprof to examine an abnormal goroutine count.
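
As a quick sanity check before reaching for pprof, runtime.NumGoroutine() reports the current number of goroutines; a minimal sketch of logging it periodically (this would sit somewhere like main, and the interval is arbitrary):

go func() {
    for range time.Tick(10 * time.Second) {
        // A count that keeps climbing and never drops back usually
        // means goroutines are leaking somewhere.
        log.Println("goroutines:", runtime.NumGoroutine())
    }
}()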

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/goroutine
Fetching profile over HTTP from http://localhost:6060/debug/pprof/goroutine
Saved profile in /home/lijunshi/pprof/pprof.ggg.goroutine.002.pb.gz
File: ggg
Type: goroutine
Time: Aug 1, 2021 at 9:56pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 104, 100% of 104 total
Showing top 10 nodes out of 28
      flat  flat%   sum%        cum   cum%
       102 98.08% 98.08%        102 98.08%  runtime.gopark
         1  0.96% 99.04%          1  0.96%  net/http.(*connReader).backgroundRead
         1  0.96%   100%          1  0.96%  runtime/pprof.writeRuntimeProfile
         0     0%   100%        100 96.15%  ggg/animal/canidae/wolf.(*Wolf).Drink.func1
         0     0%   100%          1  0.96%  internal/poll.(*FD).Accept
         0     0%   100%          1  0.96%  internal/poll.(*pollDesc).wait
         0     0%   100%          1  0.96%  internal/poll.(*pollDesc).waitRead
         0     0%   100%          1  0.96%  internal/poll.runtime_pollWait
         0     0%   100%          1  0.96%  main.main
         0     0%   100%          1  0.96%  main.main.func1
(pprof) list Drink.func1
Total: 104
ROUTINE ======================== ggg/animal/canidae/wolf.(*Wolf).Drink.func1 in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/canidae/wolf/wolf.go
         0        100 (flat, cum) 96.15% of Total
         .          .     29:
         .          .     30:func (w *Wolf) Drink() {
         .          .     31:   log.Println(w.Name(), "drink")
         .          .     32:   for i := 0; i < 10; i++ {
         .          .     33:           go func() {
         .        100     34:                   time.Sleep(30 * time.Second)
         .          .     35:           }()
         .          .     36:   }
         .          .     37:}
         .          .     38:
         .          .     39:func (w *Wolf) Shit() {

[screenshot: 截屏2021-08-01 下午11.11.17.png]

Here we can see that every call to Drink.func1 starts 10 goroutines, each of which waits 30 seconds before it exits, so goroutines pile up easily and leak. The code should be fixed, for example by shortening the wait; a sketch follows.
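
A sketch of the kind of fix the text suggests, based on the listing above:

func (w *Wolf) Drink() {
    log.Println(w.Name(), "drink")
    // The goroutines did nothing but sleep for 30 seconds, so ten of them
    // piled up on every call. Drop them, or make whatever background work
    // they stand for finish quickly:
    for i := 0; i < 10; i++ {
        go func() {
            time.Sleep(time.Second) // shortened from 30s
        }()
    }
}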

Tracking Down Lock Problems

If the program uses mutexes, deadlock and lock contention become possible and can leave goroutines blocked. pprof can help track down lock problems too (this relies on the runtime.SetMutexProfileFraction call we added in main).

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/mutex
Fetching profile over HTTP from http://localhost:6060/debug/pprof/mutex
Saved profile in /home/lijunshi/pprof/pprof.ggg.contentions.delay.001.pb.gz
File: ggg
Type: delay
Time: Aug 1, 2021 at 10:06pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 13s, 100% of 13s total
      flat  flat%   sum%        cum   cum%
       13s   100%   100%        13s   100%  sync.(*Mutex).Unlock
         0     0%   100%        13s   100%  ggg/animal/canidae/wolf.(*Wolf).Howl.func1
(pprof) list Howl.func1
Total: 13s
ROUTINE ======================== ggg/animal/canidae/wolf.(*Wolf).Howl.func1 in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/canidae/wolf/wolf.go
         0        13s (flat, cum)   100% of Total
         .          .     53:
         .          .     54:   m := &sync.Mutex{}
         .          .     55:   m.Lock()
         .          .     56:   go func() {
         .          .     57:           time.Sleep(time.Second)
         .        13s     58:           m.Unlock()
         .          .     59:   }()
         .          .     60:   m.Lock()
         .          .     61:}

[screenshot: 截屏2021-08-01 下午11.11.23.png]

We can see that the goroutine spawned in ggg/animal/canidae/wolf.(*Wolf).Howl.func1 holds the lock for a full second before releasing it, while Howl itself blocks waiting to take the same lock; this clearly causes lock contention and goroutine blocking. Removing this pointless hold-and-wait on the lock solves the problem; a sketch follows.
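
A sketch of the fix, based on the listing above (the log line is kept as in the original function):

func (w *Wolf) Howl() {
    log.Println(w.Name(), "howl")
    // The old code locked a mutex, spawned a goroutine that slept for a
    // second before unlocking it, and then blocked on a second Lock().
    // Nothing was actually being protected, so the whole dance is removed.
}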

Tracking Down Blocking

Locks are a common reason for a program to block, but blocking can also come from other waits, such as channel operations or system calls: a goroutine waiting for some condition or resource gives up the CPU and parks in a blocked state. So when a program seems stuck, we can use pprof's block profile to see what it is waiting on (this relies on the runtime.SetBlockProfileRate call we added in main).

lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/block
Fetching profile over HTTP from http://localhost:6060/debug/pprof/block
Saved profile in /home/lijunshi/pprof/pprof.ggg.contentions.delay.004.pb.gz
File: ggg
Type: delay
Time: Aug 1, 2021 at 10:13pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 18.01s, 100% of 18.01s total
Dropped 3 nodes (cum <= 0.09s)
      flat  flat%   sum%        cum   cum%
    18.01s   100%   100%     18.01s   100%  runtime.chanrecv1
         0     0%   100%     18.01s   100%  ggg/animal/felidae/cat.(*Cat).Live
         0     0%   100%     18.01s   100%  ggg/animal/felidae/cat.(*Cat).Pee
         0     0%   100%     18.01s   100%  main.main
         0     0%   100%     18.01s   100%  runtime.main
(pprof) list ggg/animal/felidae/cat.(*Cat).Live
parsing argument regexp ggg/animal/felidae/cat.(*Cat).Live: error parsing regexp: missing argument to repetition operator: `*`
(pprof) list Live
Total: 18.01s
ROUTINE ======================== ggg/animal/canidae/wolf.(*Wolf).Live in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/canidae/wolf/wolf.go
         0   142.11us (flat, cum) 0.00079% of Total
         .          .     18:   w.Eat()
         .          .     19:   w.Drink()
         .          .     20:   w.Shit()
         .          .     21:   w.Pee()
         .          .     22:   w.Run()
         .   142.11us     23:   w.Howl()
         .          .     24:}
         .          .     25:
         .          .     26:func (w *Wolf) Eat() {
         .          .     27:   log.Println(w.Name(), "eat")
         .          .     28:}
ROUTINE ======================== ggg/animal/felidae/cat.(*Cat).Live in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/felidae/cat/cat.go
         0     18.01s (flat, cum)   100% of Total
         .          .     14:
         .          .     15:func (c *Cat) Live() {
         .          .     16:   c.Eat()
         .          .     17:   c.Drink()
         .          .     18:   c.Shit()
         .     18.01s     19:   c.Pee()
         .          .     20:   c.Climb()
         .          .     21:   c.Sneak()
         .          .     22:}
         .          .     23:
         .          .     24:func (c *Cat) Eat() {
(pprof) list Pee
Total: 18.01s
ROUTINE ======================== ggg/animal/felidae/cat.(*Cat).Pee in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/felidae/cat/cat.go
         0     18.01s (flat, cum)   100% of Total
         .          .     34:}
         .          .     35:
         .          .     36:func (c *Cat) Pee() {
         .          .     37:   log.Println(c.Name(), "pee")
         .          .     38:
         .     18.01s     39:   <-time.After(time.Second)
         .          .     40:}
         .          .     41:
         .          .     42:func (c *Cat) Climb() {
         .          .     43:   log.Println(c.Name(), "climb")
         .          .     44:}

[screenshot: 截屏2021-08-01 下午11.11.31.png]

From the code and the call chain we can see that Pee() in animal/felidae/cat/cat.go blocks for a full second waiting to receive from the channel returned by time.After, and this blocking operation has already accumulated 18 seconds of waiting. Optimizing away this blocking wait reduces the program's overall blocked time; a sketch follows.
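
A sketch of the fix, based on the listing above:

func (c *Cat) Pee() {
    log.Println(c.Name(), "pee")
    // The artificial one-second wait on time.After is removed. If a real
    // delay were ever needed, time.Sleep(time.Second) would do the same
    // job without allocating a timer and a channel on every call.
}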