Go ships with pprof, an excellent built-in profiling tool: we can sample a running program to obtain a profile, then inspect and analyze that profile to find the system's performance bottlenecks. Common performance problems include:
- Excessive CPU load
- Excessive memory usage
- Long blocking times
- Overly frequent GC
- Goroutine leaks
What Is a Profile
"Profile" literally means a portrait; think of it as a snapshot of the system at a particular moment. Through this snapshot we can observe the system's state at that instant, such as CPU load and memory usage, and that information helps us analyze performance.
Go's standard library ships with profiling support. The aspects of a running application it covers are mainly the following (example commands for fetching each follow the list):
- CPU profile: reports the program's CPU usage by sampling, at a fixed frequency, where the program is spending time on the CPU
- Memory profile (heap profile): reports the program's memory usage
- Block profile: reports where goroutines block instead of running, which helps analyze and locate bottlenecks such as deadlocks
- Goroutine profile: reports goroutine usage: which goroutines exist and what their call relationships look like
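Once the program exposes pprof over HTTP (the setup is shown later in this article, using port 6060), each of these profiles maps to a standard endpoint registered by net/http/pprof:
go tool pprof http://localhost:6060/debug/pprof/profile   # CPU profile (samples for 30s by default)
go tool pprof http://localhost:6060/debug/pprof/heap      # heap profile (memory in use)
go tool pprof http://localhost:6060/debug/pprof/allocs    # all allocations since program start
go tool pprof http://localhost:6060/debug/pprof/block     # blocking events
go tool pprof http://localhost:6060/debug/pprof/goroutine # current goroutines
go tool pprof http://localhost:6060/debug/pprof/mutex     # mutex contention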
Necessary Preparation Before Profiling
I recommend practicing Go program analysis with pprof using the code from the GitHub repository below. The Go version used here is 1.13:
root@GIH-L-3453:~# go version
go version go1.13.8 linux/amd64
First, clone the project:
git clone https://github.com/wolfogre/go-pprof-practice.git
Then put the project under Go module management; here we name the module ggg:
go mod init ggg
Build the binary:
go build .
Start the program and confirm that logs are printed:
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ ./ggg
2021/08/01 17:32:43 dog.go:26: dog eat
2021/08/01 17:32:43 dog.go:30: dog drink
2021/08/01 17:32:43 dog.go:34: dog shit
2021/08/01 17:32:43 dog.go:38: dog pee
2021/08/01 17:32:43 dog.go:42: dog run
2021/08/01 17:32:43 dog.go:47: dog howl
2021/08/01 17:32:43 wolf.go:27: wolf eat
2021/08/01 17:32:43 wolf.go:31: wolf drink
2021/08/01 17:32:43 wolf.go:40: wolf shit
2021/08/01 17:32:43 wolf.go:44: wolf pee
2021/08/01 17:32:43 wolf.go:48: wolf run
2021/08/01 17:32:43 wolf.go:52: wolf howl
2021/08/01 17:32:44 cat.go:25: cat eat
2021/08/01 17:32:44 cat.go:29: cat drink
2021/08/01 17:32:44 cat.go:33: cat shit
2021/08/01 17:32:44 cat.go:37: cat pee
Checking the system with top, two problems stand out:
- the ggg process's memory usage is excessive: it already holds 2.1 GB;
- the ggg process's CPU usage is excessive: its CPU load sits around 50-70%.
top - 17:33:45 up 1:51, 0 users, load average: 0.38, 0.39, 0.42
Tasks: 13 total, 1 running, 12 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.4 us, 0.1 sy, 0.0 ni, 93.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 12655.4 total, 10100.0 free, 2219.4 used, 336.0 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 10449.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2562 lijunshi 20 0 2647292 2.1g 7160 S 51.2 16.9 0:30.06 ggg
1 root 20 0 896 576 512 S 0.0 0.0 0:00.01 init
7 root 20 0 896 84 20 S 0.0 0.0 0:00.00 init
9 root 20 0 2924 1976 1728 S 0.0 0.0 0:01.38 wslconnect
10 root 20 0 10036 4992 3320 S 0.0 0.0 0:00.03 bash
838 root 20 0 9972 3688 3268 S 0.0 0.0 0:00.00 su
839 lijunshi 20 0 10132 5536 3688 S 0.0 0.0 0:00.12 bash
1425 root 20 0 896 84 20 S 0.0 0.0 0:00.00 init
1427 root 20 0 2924 2108 1852 S 0.0 0.0 0:00.13 wslconnect
1428 root 20 0 10036 5084 3424 S 0.0 0.0 0:00.02 bash
1447 root 20 0 9972 3492 3076 S 0.0 0.0 0:00.00 su
1448 lijunshi 20 0 10132 5272 3452 S 0.0 0.0 0:00.07 bash
2571 lijunshi 20 0 10880 3704 3132 R 0.0 0.0 0:00.00 top
Before analyzing this misbehaving program with pprof, install graphviz, a graphing tool used to render the flame graphs and call graphs that make performance problems easier to see. On Ubuntu a single command installs it; for other systems, see https://graphviz.gitlab.io/download/
sudo apt install graphviz
Using pprof requires modifying the code (it is invasive), but only a few lines need to be added: import the package and, if needed, enable mutex and block sampling (an HTTP server must also be running, as in the full main function below):
import (
    ...
    _ "net/http/pprof" // register the pprof HTTP handlers
    ...
)
func main() {
    ...
    runtime.SetMutexProfileFraction(1) // enable mutex sampling; omit if you don't need this metric
    runtime.SetBlockProfileRate(1)     // enable blocking-event sampling; omit if you don't need this metric
    ...
}
For the project we are analyzing, writing the main function as follows enables pprof directly. Note that this program listens on port 6060.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof"
    "os"
    "runtime"
    "time"

    "github.com/wolfogre/go-pprof-practice/animal"
)

func main() {
    log.SetFlags(log.Lshortfile | log.LstdFlags)
    log.SetOutput(os.Stdout)

    runtime.GOMAXPROCS(1)
    runtime.SetMutexProfileFraction(1) // sample every mutex contention event
    runtime.SetBlockProfileRate(1)     // sample every blocking event

    go func() {
        // serve the handlers registered by net/http/pprof on :6060
        if err := http.ListenAndServe(":6060", nil); err != nil {
            log.Fatal(err)
        }
        os.Exit(0)
    }()

    for {
        for _, v := range animal.AllAnimals {
            v.Live()
        }
        time.Sleep(time.Second)
    }
}
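After starting the program, you can sanity-check that pprof is live; net/http/pprof also registers an index page, so the following should return an HTML listing of the available profiles:
curl http://localhost:6060/debug/pprof/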
The Analysis Process
Diagnosing Excessive CPU Load
A look at top shows the process at 75% CPU, so we need to find out why its CPU usage is so high.
top - 18:08:48 up 2:26, 0 users, load average: 0.00, 0.00, 0.03
Tasks: 13 total, 1 running, 12 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.4 us, 0.0 sy, 0.0 ni, 90.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 12655.4 total, 12225.7 free, 94.7 used, 334.9 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 12317.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2575 lijunshi 20 0 482620 12748 5876 S 75.1 0.1 0:02.31 ggg
1 root 20 0 896 576 512 S 0.0 0.0 0:00.01 init
7 root 20 0 896 84 20 S 0.0 0.0 0:00.00 init
9 root 20 0 2924 1976 1728 S 0.0 0.0 0:01.45 wslconnect
10 root 20 0 10036 4992 3320 S 0.0 0.0 0:00.03 bash
838 root 20 0 9972 3688 3268 S 0.0 0.0 0:00.00 su
839 lijunshi 20 0 10132 5536 3688 S 0.0 0.0 0:00.14 bash
1425 root 20 0 896 84 20 S 0.0 0.0 0:00.00 init
1427 root 20 0 2924 2108 1852 S 0.0 0.0 0:00.29 wslconnect
1428 root 20 0 10036 5084 3424 S 0.0 0.0 0:00.02 bash
1447 root 20 0 9972 3492 3076 S 0.0 0.0 0:00.00 su
1448 lijunshi 20 0 10132 5272 3452 S 0.0 0.0 0:00.07 bash
2571 lijunshi 20 0 10880 3704 3132 R 0.0 0.0 0:00.55 top
Analyze it with the go tool pprof command. Note that 6060 below is the port your process is listening on; keep it consistent with the port defined in main.
go tool pprof http://localhost:6060/debug/pprof/profile
Running this enters an interactive command-line mode: top shows CPU usage by function, and list prints the annotated source of a given function.
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/profile
Fetching profile over HTTP from http://localhost:6060/debug/pprof/profile
Saved profile in /home/lijunshi/pprof/pprof.ggg.samples.cpu.002.pb.gz
File: ggg
Type: cpu
Time: Aug 1, 2021 at 6:11pm (CST)
Duration: 30s, Total samples = 13.15s (43.83%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 13.14s, 99.92% of 13.15s total
Dropped 4 nodes (cum <= 0.07s)
flat flat% sum% cum cum%
13.14s 99.92% 99.92% 13.14s 99.92% github.com/wolfogre/go-pprof-practice/animal/felidae/tiger.(*Tiger).Eat
0 0% 99.92% 13.14s 99.92% github.com/wolfogre/go-pprof-practice/animal/felidae/tiger.(*Tiger).Live
0 0% 99.92% 13.14s 99.92% main.main
0 0% 99.92% 13.14s 99.92% runtime.main
(pprof) list Eat
Total: 13.15s
ROUTINE ======================== github.com/wolfogre/go-pprof-practice/animal/felidae/tiger.(*Tiger).Eat in /home/lijunshi/go/pkg/mod/github.com/wolfogre/go-pprof-practice@v0.0.0-20190402114113-8ce266a210ee/animal/felidae/tiger/tiger.go
13.14s 13.14s (flat, cum) 99.92% of Total
. . 19:}
. . 20:
. . 21:func (t *Tiger) Eat() {
. . 22: log.Println(t.Name(), "eat")
. . 23: loop := 10000000000
13.14s 13.14s 24: for i := 0; i < loop; i++ {
. . 25: // do nothing
. . 26: }
. . 27:}
. . 28:
. . 29:func (t *Tiger) Drink() {
Here, github.com/wolfogre/go-pprof-practice/animal/felidae/tiger.(*Tiger).Eat accounts for 99.92% of CPU time, and the listing shows that a useless empty loop is spinning the CPU; that is why the CPU load stays stubbornly high.
If graphviz was installed earlier, we can render these profiles in a web UI to see the bottleneck more intuitively:
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof -http=:8000 http://localhost:6060/debug/pprof/profile
Fetching profile over HTTP from http://localhost:6060/debug/pprof/profile
Saved profile in /home/lijunshi/pprof/pprof.ggg.samples.cpu.003.pb.gz
Serving web UI on http://localhost:8000
Open http://localhost:8000 in a browser.
First, look at the function call graph:
Then the flame graph:
You can also inspect the exact source code directly in the web UI.
Having found the problem, we comment out the offending code (a minimal sketch below), rebuild and rerun, and keep watching top: the CPU load has come down.
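A minimal sketch of the fix in tiger.go, based on the listing above, simply comments out the busy loop:
func (t *Tiger) Eat() {
    log.Println(t.Name(), "eat")
    // The empty loop below did nothing but spin the CPU; remove it.
    // loop := 10000000000
    // for i := 0; i < loop; i++ {
    // }
}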
top - 19:22:46 up 3:40, 0 users, load average: 0.07, 0.19, 0.30
Tasks: 18 total, 1 running, 17 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 12655.4 total, 10083.7 free, 2234.3 used, 337.3 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 10426.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 896 576 512 S 0.0 0.0 0:00.01 init
7 root 20 0 896 84 20 S 0.0 0.0 0:00.00 init
9 root 20 0 2924 1976 1728 S 0.0 0.0 0:02.18 wslconnect
10 root 20 0 10036 4992 3320 S 0.0 0.0 0:00.03 bash
838 root 20 0 9972 3688 3268 S 0.0 0.0 0:00.00 su
839 lijunshi 20 0 10132 5536 3688 S 0.0 0.0 0:00.21 bash
2683 root 20 0 896 84 20 S 0.0 0.0 0:00.00 init
2685 root 20 0 2924 2048 1796 S 0.0 0.0 0:00.07 wslconnect
2686 root 20 0 10036 5036 3380 S 0.0 0.0 0:00.02 bash
2705 root 20 0 9972 3748 3328 S 0.0 0.0 0:00.00 su
2706 lijunshi 20 0 10132 5316 3500 S 0.0 0.0 0:00.03 bash
3602 root 20 0 896 84 20 S 0.0 0.0 0:00.00 init
3604 root 20 0 2924 1940 1684 S 0.0 0.0 0:00.05 wslconnect
3605 root 20 0 10036 5124 3456 S 0.0 0.0 0:00.03 bash
4602 lijunshi 20 0 2721276 2.1g 7820 S 0.0 16.9 0:01.12 ggg
4611 lijunshi 20 0 483416 11816 7340 S 0.0 0.1 0:00.00 go
4617 lijunshi 20 0 485164 16100 7536 S 0.0 0.1 0:00.01 pprof
4636 root 20 0 10880 3788 3212 R 0.0 0.0 0:00.00 top
Diagnosing Excessive Memory Usage
The process still holds 2.1 GB of memory, which is clearly unreasonable, so we keep digging into the memory problem.
We use go tool pprof http://localhost:6060/debug/pprof/heap to inspect memory usage, then top and list to pin down the problem function.
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/lijunshi/pprof/pprof.ggg.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz
File: ggg
Type: inuse_space
Time: Aug 1, 2021 at 9:35pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 768MB, 100% of 768MB total
flat flat% sum% cum cum%
768MB 100% 100% 768MB 100% ggg/animal/muridae/mouse.(*Mouse).Steal
0 0% 100% 768MB 100% ggg/animal/muridae/mouse.(*Mouse).Live
0 0% 100% 768MB 100% main.main
0 0% 100% 768MB 100% runtime.main
(pprof) list Steal
Total: 768MB
ROUTINE ======================== ggg/animal/muridae/mouse.(*Mouse).Steal in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/muridae/mouse/mouse.go
768MB 768MB (flat, cum) 100% of Total
. . 46:func (m *Mouse) Steal() {
. . 47: log.Println(m.Name(), "steal")
. . 48:
. . 49: max := constant.Gi
. . 50: for len(m.buffer) * constant.Mi < max {
768MB 768MB 51: m.buffer = append(m.buffer, [constant.Mi]byte{})
. . 52: }
. . 53:}
(pprof)
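Note that the heap endpoint carries several sample types; the saved filename above lists them (alloc_objects, alloc_space, inuse_objects, inuse_space). pprof defaults to inuse_space, and you can pick a different one with a flag, for example:
go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap  # memory currently held (the default)
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap  # everything allocated since start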
Use go tool pprof -http=:8000 http://localhost:6060/debug/pprof/heap to get the visual charts.
The function ggg/animal/muridae/mouse.(*Mouse).Steal keeps allocating memory that it holds and never releases (768 MB at the moment of this sample, growing toward 1 GB); that is what produced the 2.1 GB RES we saw in top. To cut memory usage, we comment out the unreasonable logic in Steal (a minimal sketch below). After the change, top shows normal memory usage.
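A minimal sketch of the fix in mouse.go, again based on the listing above, drops the loop that grows m.buffer toward 1 GB:
func (m *Mouse) Steal() {
    log.Println(m.Name(), "steal")
    // The loop below grew m.buffer toward 1 GB and held it forever; remove it.
    // max := constant.Gi
    // for len(m.buffer)*constant.Mi < max {
    //     m.buffer = append(m.buffer, [constant.Mi]byte{})
    // }
}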
Diagnosing Overly Frequent GC
If Go's GC runs too often, performance suffers: because of the stop-the-world (STW) phases, every GC cycle pauses the program for a short time. While investigating the memory problem we saw the program frequently allocating useless memory, and intuitively that should make GC run more often.
Watching the GC logs confirms that GC is abnormally frequent. The command is GODEBUG=gctrace=1 ./ggg | grep gc
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ GODEBUG=gctrace=1 ./ggg | grep gc
gc 1 @0.114s 0%: 0.002+0.37+0.002 ms clock, 0.002+0.19/0.076/0+0.002 ms cpu, 16->16->0 MB, 17 MB goal, 1 P
scvg: 0 MB released
scvg: inuse: 16, idle: 46, sys: 63, released: 46, consumed: 16 (MB)
scvg: 1 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 47, consumed: 15 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 49, consumed: 13 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 51, consumed: 11 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 53, consumed: 9 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 55, consumed: 7 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 57, consumed: 5 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 59, consumed: 3 (MB)
gc 2 @3.126s 0%: 0.021+6.7+0.003 ms clock, 0.021+0.23/6.3/0+0.003 ms cpu, 16->16->0 MB, 17 MB goal, 1 P
scvg: inuse: 16, idle: 46, sys: 63, released: 46, consumed: 16 (MB)
scvg: 1 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 47, consumed: 15 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 49, consumed: 13 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 51, consumed: 11 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 53, consumed: 9 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 55, consumed: 7 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 57, consumed: 5 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 59, consumed: 3 (MB)
gc 3 @6.137s 0%: 0.006+0.32+0.003 ms clock, 0.006+0.10/0.11/0+0.003 ms cpu, 16->16->0 MB, 17 MB goal, 1 P
scvg: inuse: 16, idle: 46, sys: 63, released: 46, consumed: 16 (MB)
scvg: 1 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 48, consumed: 15 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 50, consumed: 13 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 52, consumed: 11 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 54, consumed: 9 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 56, consumed: 7 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 58, consumed: 5 (MB)
scvg: 2 MB released
scvg: inuse: 0, idle: 62, sys: 63, released: 60, consumed: 3 (MB)
gc 4 @9.142s 0%: 0.005+0.37+0.002 ms clock, 0.005+0.13/0.13/0+0.002 ms cpu, 16->16->0 MB, 17 MB goal, 1 P
Focus on lines like the one below: 16->16->0 MB means the heap was 16 MB when the GC cycle started, 16 MB when it ended, and 0 MB of it was still live afterwards; in other words, every GC cycle reclaims 16 MB. That is odd, so it is reasonable to suspect that some part of the program keeps allocating 16 MB at a time. (A field-by-field reading of the line follows it.)
gc 6 @15.151s 0%: 0.007+0.43+0.002 ms clock, 0.007+0.14/0.14/0+0.002 ms cpu, 16->16->0 MB, 17 MB goal, 1
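For reference, a reading of that line, following the gctrace format documented under GODEBUG in the runtime package:
- gc 6: the 6th GC cycle since the program started
- @15.151s: 15.151 seconds after program start
- 0%: percentage of total time spent in GC so far
- 0.007+0.43+0.002 ms clock: wall-clock time of the sweep-termination, concurrent mark/scan, and mark-termination phases
- 16->16->0 MB: heap size when GC started, heap size when GC ended, and live heap remaining afterwards
- 17 MB goal: the target heap size for this cycle
- 1 P: the number of processors used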
Use go tool pprof http://localhost:6060/debug/pprof/allocs to inspect the program's allocations, then top and list to locate the problem function.
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/allocs
Fetching profile over HTTP from http://localhost:6060/debug/pprof/allocs
Saved profile in /home/lijunshi/pprof/pprof.ggg.alloc_objects.alloc_space.inuse_objects.inuse_space.010.pb.gz
File: ggg
Type: alloc_space
Time: Aug 1, 2021 at 9:46pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.27GB, 100% of 1.27GB total
flat flat% sum% cum cum%
1.27GB 100% 100% 1.27GB 100% ggg/animal/canidae/dog.(*Dog).Run
0 0% 100% 1.27GB 100% ggg/animal/canidae/dog.(*Dog).Live
0 0% 100% 1.27GB 100% main.main
0 0% 100% 1.27GB 100% runtime.main
(pprof) list Run
Total: 1.27GB
ROUTINE ======================== ggg/animal/canidae/dog.(*Dog).Run in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/canidae/dog/dog.go
1.27GB 1.27GB (flat, cum) 100% of Total
. . 38: log.Println(d.Name(), "pee")
. . 39:}
. . 40:
. . 41:func (d *Dog) Run() {
. . 42: log.Println(d.Name(), "run")
1.27GB 1.27GB 43: _ = make([]byte, 16 * constant.Mi)
. . 44:}
. . 45:
. . 46:func (d *Dog) Howl() {
. . 47: log.Println(d.Name(), "howl")
. . 48:}
go tool pprof -http=:8000 http://localhost:6060/debug/pprof/allocs generates the web charts.
The data shows that the Run function in animal/canidae/dog/dog.go repeatedly allocates 16 MB without ever using it, and those allocations have accumulated to 1.27 GB. This is what drives the frequent GC, so the logic needs optimizing, for example by allocating a much smaller buffer. After the change below, GC frequency drops sharply: a small, fixed-size allocation that does not escape the function can be placed on the goroutine stack rather than the heap, so it adds no GC pressure at all.
func (d *Dog) Run() {
log.Println(d.Name(), "run")
_ = make([]byte, 16 * constant.Ki)
}
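If a buffer of that size were genuinely needed on every call, another option (a sketch, not part of the original repo; it assumes "sync" is added to the file's imports) would be reusing it through a sync.Pool so repeated calls stop churning the allocator:
var runBuf = sync.Pool{
    New: func() interface{} { return make([]byte, 16*constant.Mi) },
}

func (d *Dog) Run() {
    log.Println(d.Name(), "run")
    buf := runBuf.Get().([]byte) // reuse an existing buffer when one is available
    _ = buf                      // ... do real work with buf ...
    runBuf.Put(buf)              // hand it back instead of letting GC reclaim it
}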
Diagnosing Too Many Goroutines
One of Go's great strengths is that it can spawn huge numbers of goroutines, but an abnormally high goroutine count usually means a goroutine leak, and the most direct consequence is that those goroutines hold on to a significant amount of memory. We can again use pprof to investigate an abnormal goroutine count.
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/goroutine
Fetching profile over HTTP from http://localhost:6060/debug/pprof/goroutine
Saved profile in /home/lijunshi/pprof/pprof.ggg.goroutine.002.pb.gz
File: ggg
Type: goroutine
Time: Aug 1, 2021 at 9:56pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 104, 100% of 104 total
Showing top 10 nodes out of 28
flat flat% sum% cum cum%
102 98.08% 98.08% 102 98.08% runtime.gopark
1 0.96% 99.04% 1 0.96% net/http.(*connReader).backgroundRead
1 0.96% 100% 1 0.96% runtime/pprof.writeRuntimeProfile
0 0% 100% 100 96.15% ggg/animal/canidae/wolf.(*Wolf).Drink.func1
0 0% 100% 1 0.96% internal/poll.(*FD).Accept
0 0% 100% 1 0.96% internal/poll.(*pollDesc).wait
0 0% 100% 1 0.96% internal/poll.(*pollDesc).waitRead
0 0% 100% 1 0.96% internal/poll.runtime_pollWait
0 0% 100% 1 0.96% main.main
0 0% 100% 1 0.96% main.main.func1
(pprof) list Drink.func1
Total: 104
ROUTINE ======================== ggg/animal/canidae/wolf.(*Wolf).Drink.func1 in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/canidae/wolf/wolf.go
0 100 (flat, cum) 96.15% of Total
. . 29:
. . 30:func (w *Wolf) Drink() {
. . 31: log.Println(w.Name(), "drink")
. . 32: for i := 0; i < 10; i++ {
. . 33: go func() {
. 100 34: time.Sleep(30 * time.Second)
. . 35: }()
. . 36: }
. . 37:}
. . 38:
. . 39:func (w *Wolf) Shit() {
Here we can see that every call into Drink.func1 spawns 10 goroutines, each of which only exits after sleeping 30 seconds; since Drink runs constantly, goroutines pile up, which is effectively a leak. We fix this code, for example by shortening the wait (a sketch below).
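A sketch of one possible fix for wolf.go: shorten the sleep so the goroutines exit quickly, and wait for them so they can never accumulate (assumes "sync" is added to the file's imports):
func (w *Wolf) Drink() {
    log.Println(w.Name(), "drink")
    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            time.Sleep(time.Second) // was 30s; long sleeps let goroutines pile up
        }()
    }
    wg.Wait() // Drink no longer returns while its goroutines are still alive
}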
Diagnosing Lock Problems
If the program uses mutexes, deadlock and lock contention become possible and can block the process. pprof can help track down lock problems as well.
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/mutex
Fetching profile over HTTP from http://localhost:6060/debug/pprof/mutex
Saved profile in /home/lijunshi/pprof/pprof.ggg.contentions.delay.001.pb.gz
File: ggg
Type: delay
Time: Aug 1, 2021 at 10:06pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 13s, 100% of 13s total
flat flat% sum% cum cum%
13s 100% 100% 13s 100% sync.(*Mutex).Unlock
0 0% 100% 13s 100% ggg/animal/canidae/wolf.(*Wolf).Howl.func1
(pprof) list Howl.func1
Total: 13s
ROUTINE ======================== ggg/animal/canidae/wolf.(*Wolf).Howl.func1 in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/canidae/wolf/wolf.go
0 13s (flat, cum) 100% of Total
. . 53:
. . 54: m := &sync.Mutex{}
. . 55: m.Lock()
. . 56: go func() {
. . 57: time.Sleep(time.Second)
. 13s 58: m.Unlock()
. . 59: }()
. . 60: m.Lock()
. . 61:}
We can see that ggg/animal/canidae/wolf.(*Wolf).Howl.func1 only releases the lock after holding it for a full second, which clearly causes lock contention and blocks the waiting goroutine. Removing this pointless lock-and-wait solves the problem (a sketch below).
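A minimal sketch of the fix: the lock-and-sleep dance served no purpose, so simply delete it:
func (w *Wolf) Howl() {
    log.Println(w.Name(), "howl")
    // Removed: a mutex that was locked, unlocked one second later in a
    // background goroutine, and then locked again, blocking the caller for 1s.
}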
Diagnosing Program Blocking
Locks are a common cause of blocking, but a program can also block for other reasons, such as system calls or a goroutine giving up the CPU while it waits for some condition or resource. So when a program appears stuck, we can use pprof to see what it is waiting on.
lijunshi@GIH-L-3453:/mnt/c/Users/lijunshi/services/go-pprof-practice-master$ go tool pprof http://localhost:6060/debug/pprof/block
Fetching profile over HTTP from http://localhost:6060/debug/pprof/block
Saved profile in /home/lijunshi/pprof/pprof.ggg.contentions.delay.004.pb.gz
File: ggg
Type: delay
Time: Aug 1, 2021 at 10:13pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 18.01s, 100% of 18.01s total
Dropped 3 nodes (cum <= 0.09s)
flat flat% sum% cum cum%
18.01s 100% 100% 18.01s 100% runtime.chanrecv1
0 0% 100% 18.01s 100% ggg/animal/felidae/cat.(*Cat).Live
0 0% 100% 18.01s 100% ggg/animal/felidae/cat.(*Cat).Pee
0 0% 100% 18.01s 100% main.main
0 0% 100% 18.01s 100% runtime.main
(pprof) list ggg/animal/felidae/cat.(*Cat).Live
parsing argument regexp ggg/animal/felidae/cat.(*Cat).Live: error parsing regexp: missing argument to repetition operator: `*`
(pprof) list Live
Total: 18.01s
ROUTINE ======================== ggg/animal/canidae/wolf.(*Wolf).Live in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/canidae/wolf/wolf.go
0 142.11us (flat, cum) 0.00079% of Total
. . 18: w.Eat()
. . 19: w.Drink()
. . 20: w.Shit()
. . 21: w.Pee()
. . 22: w.Run()
. 142.11us 23: w.Howl()
. . 24:}
. . 25:
. . 26:func (w *Wolf) Eat() {
. . 27: log.Println(w.Name(), "eat")
. . 28:}
ROUTINE ======================== ggg/animal/felidae/cat.(*Cat).Live in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/felidae/cat/cat.go
0 18.01s (flat, cum) 100% of Total
. . 14:
. . 15:func (c *Cat) Live() {
. . 16: c.Eat()
. . 17: c.Drink()
. . 18: c.Shit()
. 18.01s 19: c.Pee()
. . 20: c.Climb()
. . 21: c.Sneak()
. . 22:}
. . 23:
. . 24:func (c *Cat) Eat() {
(pprof) list Pee
Total: 18.01s
ROUTINE ======================== ggg/animal/felidae/cat.(*Cat).Pee in /mnt/c/Users/lijunshi/services/go-pprof-practice-master/animal/felidae/cat/cat.go
0 18.01s (flat, cum) 100% of Total
. . 34:}
. . 35:
. . 36:func (c *Cat) Pee() {
. . 37: log.Println(c.Name(), "pee")
. . 38:
. 18.01s 39: <-time.After(time.Second)
. . 40:}
. . 41:
. . 42:func (c *Cat) Climb() {
. . 43: log.Println(c.Name(), "climb")
. . 44:}
Looking at the code and the call chain, Pee() in animal/felidae/cat/cat.go blocks for one second on every call, receiving from the channel that time.After only fills after a second; this blocking has already accumulated to 18 seconds. Optimizing this blocking operation away reduces the program's time spent blocked (a sketch below).
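A minimal sketch of the fix in cat.go: drop the artificial one-second wait (or, if a delay really is required, time.Sleep would at least not register as a channel-receive block in the block profile):
func (c *Cat) Pee() {
    log.Println(c.Name(), "pee")
    // Removed: <-time.After(time.Second), which blocked for 1s on a
    // channel receive every call.
}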