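Go Pitfalls: panic and recover

[About the author] Yi Letian (易乐天), Overseas Mall Team, Information Technology Department, Xiaomi

Preface

Since its release, Go has been known for high performance and strong concurrency support. Because the standard library ships an http package, even programmers who have only just picked up the language can write an HTTP server with little effort.

Everything has two sides, though. Any language has strengths it can be proud of, and it also hides a fair number of pitfalls. Newcomers who don't know about them fall in easily. The "Go Pitfalls" series opens with panic and recover in Go, and walks through the pitfalls the author has hit along with ways to fill them in.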

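Getting to know panic and recover

  • panic
    In English, panic means sudden fear or alarm. Taken literally, in Go it stands for an extremely serious problem, the kind programmers dread most: once it occurs and nothing recovers from it, the program terminates and exits. The built-in panic function is mainly used to raise an exception deliberately, much like the throw keyword in languages such as Java.
  • recover
    In English, recover means to restore or get back to normal. Taken literally, in Go it stands for bringing the program back from a serious error to a normal state. The built-in recover function is mainly used to catch an exception and return the program to normal execution, much like try ... catch in languages such as Java.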
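To make the two descriptions concrete, here is a minimal sketch of the pair working together. The safeDivide helper and its error text are hypothetical names chosen for illustration; the point is only that recover, called inside a deferred function, turns a panic into an ordinary return.

```go
package main

import (
	"errors"
	"fmt"
)

// safeDivide is a hypothetical helper: it converts a panic raised inside
// the function into an ordinary error value.
func safeDivide(a, b int) (result int, err error) {
	defer func() {
		// recover returns nil unless this goroutine is panicking.
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	if b == 0 {
		// Raise the exception deliberately, in the spirit of throw.
		panic(errors.New("division by zero"))
	}
	return a / b, nil
}

func main() {
	fmt.Println(safeDivide(6, 3)) // 2 <nil>
	fmt.Println(safeDivide(1, 0)) // 0 recovered: division by zero
}
```

Because the deferred function assigns to the named result err, the caller sees a normal error return rather than a crashed program.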

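The author spent six years doing C development on Linux. C has no notion of catching exceptions: no try ... catch, and no panic or recover. The underlying principle is the same everywhere, though. The difference between exceptions and the if error then return style shows up mainly in how far up the call stack control moves, as the figure below shows:

Go Pitfalls: panic and recover - Figure 1

With normal control flow, the call stack unwinds one frame at a time. Exception handling can be understood as a long-distance jump across the call stack. In C this is done with the two functions setjmp and longjmp.

Mechanisms such as try/catch, recover, and setjmp save the program's current state (mainly the CPU's stack pointer register sp and program counter pc; Go's recover relies on defer to maintain sp and pc) into memory shared with throw, panic, or longjmp. When an exception occurs, the saved sp and pc values are read back from that memory, the stack is reset directly to the position sp points to, and execution continues at the instruction pc points to, which brings the program from the exceptional state back to normal.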
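The long-distance jump can be observed from ordinary Go code. The sketch below uses hypothetical function names (level1, level2, level3) and a hypothetical events slice: a panic three frames deep skips the remaining statements of every frame on the way up, while the deferred functions of those frames still run.

```go
package main

import "fmt"

// events records the order in which code runs (illustrative only).
var events []string

func level3() {
	defer func() { events = append(events, "level3 defer") }()
	panic("boom")
}

func level2() {
	defer func() { events = append(events, "level2 defer") }()
	level3()
	events = append(events, "level2 after call") // skipped by the panic
}

func level1() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint("recovered in level1: ", r)
		}
	}()
	level2()
	return "level1 returned normally" // skipped by the panic
}

func main() {
	fmt.Println(level1()) // recovered in level1: boom
	fmt.Println(events)   // [level3 defer level2 defer]
}
```

With the if error then return style, level2 and level1 would each have to check and forward the error themselves; here control goes straight from level3 to the recover in level1.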

A deeper look at panic and recover

Source code

The source for panic and recover lives in src/runtime/panic.go in the Go source tree, in the functions gopanic and gorecover.

```go
// gopanic, in src/runtime/panic.go, line 454
// The implementation of the predeclared function panic.
func gopanic(e interface{}) {
	gp := getg()
	if gp.m.curg != gp {
		print("panic: ")
		printany(e)
		print("\n")
		throw("panic on system stack")
	}

	if gp.m.mallocing != 0 {
		print("panic: ")
		printany(e)
		print("\n")
		throw("panic during malloc")
	}
	if gp.m.preemptoff != "" {
		print("panic: ")
		printany(e)
		print("\n")
		print("preempt off reason: ")
		print(gp.m.preemptoff)
		print("\n")
		throw("panic during preemptoff")
	}
	if gp.m.locks != 0 {
		print("panic: ")
		printany(e)
		print("\n")
		throw("panic holding locks")
	}

	var p _panic
	p.arg = e
	p.link = gp._panic
	gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

	atomic.Xadd(&runningPanicDefers, 1)

	for {
		d := gp._defer
		if d == nil {
			break
		}

		// If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
		// take defer off list. The earlier panic or Goexit will not continue running.
		if d.started {
			if d._panic != nil {
				d._panic.aborted = true
			}
			d._panic = nil
			d.fn = nil
			gp._defer = d.link
			freedefer(d)
			continue
		}

		// Mark defer as started, but keep on list, so that traceback
		// can find and update the defer's argument frame if stack growth
		// or a garbage collection happens before reflectcall starts executing d.fn.
		d.started = true

		// Record the panic that is running the defer.
		// If there is a new panic during the deferred call, that panic
		// will find d in the list and will mark d._panic (this panic) aborted.
		d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

		p.argp = unsafe.Pointer(getargp(0))
		reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
		p.argp = nil

		// reflectcall did not panic. Remove d.
		if gp._defer != d {
			throw("bad defer entry in panic")
		}
		d._panic = nil
		d.fn = nil
		gp._defer = d.link

		// trigger shrinkage to test stack copy. See stack_test.go:TestStackPanic
		//GC()

		pc := d.pc
		sp := unsafe.Pointer(d.sp) // must be pointer so it gets adjusted during stack copy
		freedefer(d)
		if p.recovered {
			atomic.Xadd(&runningPanicDefers, -1)

			gp._panic = p.link
			// Aborted panics are marked but remain on the g.panic list.
			// Remove them from the list.
			for gp._panic != nil && gp._panic.aborted {
				gp._panic = gp._panic.link
			}
			if gp._panic == nil { // must be done with signal
				gp.sig = 0
			}
			// Pass information about recovering frame to recovery.
			gp.sigcode0 = uintptr(sp)
			gp.sigcode1 = pc
			mcall(recovery)
			throw("recovery failed") // mcall should not return
		}
	}

	// ran out of deferred calls - old-school panic now
	// Because it is unsafe to call arbitrary user code after freezing
	// the world, we call preprintpanics to invoke all necessary Error
	// and String methods to prepare the panic strings before startpanic.
	preprintpanics(gp._panic)

	fatalpanic(gp._panic) // should not return
	*(*int)(nil) = 0      // not reached
}

// gorecover, in src/runtime/panic.go, line 585
// The implementation of the predeclared function recover.
// Cannot split the stack because it needs to reliably
// find the stack segment of its caller.
//
// TODO(rsc): Once we commit to CopyStackAlways,
// this doesn't need to be nosplit.
//go:nosplit
func gorecover(argp uintptr) interface{} {
	// Must be in a function running as part of a deferred call during the panic.
	// Must be called from the topmost function of the call
	// (the function used in the defer statement).
	// p.argp is the argument pointer of that topmost deferred function call.
	// Compare against argp reported by caller.
	// If they match, the caller is the one who can recover.
	gp := getg()
	p := gp._panic
	if p != nil && !p.recovered && argp == uintptr(p.argp) {
		p.recovered = true
		return p.arg
	}
	return nil
}
```
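The comment in gorecover says recover must be called from the topmost function of the deferred call, and the argp comparison enforces it. A small sketch of the visible consequence, using the hypothetical names helper and nestedRecover: a recover call one frame deeper than the deferred function returns nil, while a direct call in the same deferred function gets the panic value.

```go
package main

import "fmt"

// helper calls recover one frame too deep: it is invoked by the deferred
// function instead of being the deferred function itself.
func helper() interface{} {
	return recover()
}

// nestedRecover reports what each of the two recover calls returned.
func nestedRecover() (inner, outer interface{}) {
	defer func() {
		inner = helper()  // nil: the caller of recover is not the deferred function
		outer = recover() // the panic value: called directly by the deferred function
	}()
	panic("boom")
}

func main() {
	inner, outer := nestedRecover()
	fmt.Printf("inner=%v outer=%v\n", inner, outer) // inner=<nil> outer=boom
}
```

Had the deferred function relied on helper alone, the panic would not have been recovered and the program would have crashed.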

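The code shows that panic works roughly as follows:

  • Get the g (that is, the goroutine) of the current caller.

  • Walk the defer functions registered on g and run them.

  • If a defer function calls recover and recover finds that a panic is in progress, the panic is marked as recovered.

  • While walking the defer list, if the panic has been marked as recovered, take that defer's sp and pc and store them in two status-code fields of g (sigcode0 and sigcode1).

  • Call runtime.mcall to switch to m->g0 and jump to the recovery function, passing the g obtained earlier as its argument.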
    The code for runtime.mcall is in src/runtime/asm_xxx.s in the Go source tree, where xxx is the platform, for example amd64. The code is as follows:

```
// src/runtime/asm_amd64.s, line 274

// func mcall(fn func(*g))
// Switch to m->g0's stack, call fn(g).
// Fn must never return. It should gogo(&g->sched)
// to keep running g.
TEXT runtime·mcall(SB), NOSPLIT, $0-8
	MOVQ	fn+0(FP), DI

	get_tls(CX)
	MOVQ	g(CX), AX	// save state in g->sched
	MOVQ	0(SP), BX	// caller's PC
	MOVQ	BX, (g_sched+gobuf_pc)(AX)
	LEAQ	fn+0(FP), BX	// caller's SP
	MOVQ	BX, (g_sched+gobuf_sp)(AX)
	MOVQ	AX, (g_sched+gobuf_g)(AX)
	MOVQ	BP, (g_sched+gobuf_bp)(AX)

	// switch to m->g0 & its stack, call fn
	MOVQ	g(CX), BX
	MOVQ	g_m(BX), BX
	MOVQ	m_g0(BX), SI
	CMPQ	SI, AX	// if g == m->g0 call badmcall
	JNE	3(PC)
	MOVQ	$runtime·badmcall(SB), AX
	JMP	AX
	MOVQ	SI, g(CX)	// g = m->g0
	MOVQ	(g_sched+gobuf_sp)(SI), SP	// sp = m->g0->sched.sp
	PUSHQ	AX
	MOVQ	DI, DX
	MOVQ	0(DI), DI
	CALL	DI
	POPQ	AX
	MOVQ	$runtime·badmcall2(SB), AX
	JMP	AX
	RET
```

    The switch to m->g0 is needed because the Go runtime has its own stack and its own goroutine, and recovery executes in the runtime environment, so control must first move to m->g0 before recovery can run.

  • The recovery function uses the two status-code fields in g to put the saved stack pointer sp and program counter pc back into the scheduler state, then calls gogo to reschedule g. This restores g to the point where recover was called, and the goroutine carries on executing.
    The code is as follows:

```go
// recovery, in src/runtime/panic.go, line 637

// Unwind the stack after a deferred function calls recover
// after a panic. Then arrange to continue running as though
// the caller of the deferred function returned normally.
func recovery(gp *g) {
	// Info about defer passed in G struct.
	sp := gp.sigcode0
	pc := gp.sigcode1

	// d's arguments need to be in the stack.
	if sp != 0 && (sp < gp.stack.lo || gp.stack.hi < sp) {
		print("recover: ", hex(sp), " not in [", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n")
		throw("bad recovery")
	}

	// Make the deferproc for this d return again,
	// this time returning 1. The calling function will
	// jump to the standard return epilogue.
	gp.sched.sp = sp
	gp.sched.pc = pc
	gp.sched.lr = 0
	gp.sched.ret = 1
	gogo(&gp.sched)
}
```

```
// src/runtime/asm_amd64.s, line 274

// func gogo(buf *gobuf)
// restore state from Gobuf; longjmp
TEXT runtime·gogo(SB), NOSPLIT, $16-8
	MOVQ	buf+0(FP), BX	// gobuf
	MOVQ	gobuf_g(BX), DX
	MOVQ	0(DX), CX	// make sure g != nil
	get_tls(CX)
	MOVQ	DX, g(CX)
	MOVQ	gobuf_sp(BX), SP	// restore SP
	MOVQ	gobuf_ret(BX), AX
	MOVQ	gobuf_ctxt(BX), DX
	MOVQ	gobuf_bp(BX), BP
	MOVQ	$0, gobuf_sp(BX)	// clear to help garbage collector
	MOVQ	$0, gobuf_ret(BX)
	MOVQ	$0, gobuf_ctxt(BX)
	MOVQ	$0, gobuf_bp(BX)
	MOVQ	gobuf_pc(BX), BX
	JMP	BX
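```

That is how Go handles exceptions under the hood. It boils down to three steps:

  • A defer function calls recover.
  • A panic is triggered, and control switches to the runtime environment to obtain the sp and pc of the g whose defer called recover.
  • Execution resumes with the logic that follows recover in the defer.

What are the pitfalls

As mentioned earlier, the panic function is mainly used to raise an exception deliberately. In business code, if resource initialization fails during startup, we can call panic ourselves to stop the program immediately. For newcomers this poses no problem and is easy to do.

Reality is harsher, however: the Go runtime itself calls panic in many places, and for newcomers unfamiliar with Go's internals that amounts to a field of deep holes. Without knowing where they are, it is impossible to write robust Go code.

The author lists them one by one below.

  • Array (slice) index out of range

This one is easy to understand: even in a statically typed language, an out-of-range index is a fatal error. The following code demonstrates it:

```go
package main

import (
	"fmt"
)

func foo() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println(err)
		}
	}()
	var bar = []int{1}
	fmt.Println(bar[1])
}

func main() {
	foo()
	fmt.Println("exit")
}
```

Output:

```
runtime error: index out of range
exit
```

Because the code uses recover, the program recovers and prints exit.
If the recover lines are commented out, the following log is printed:

```
panic: runtime error: index out of range

goroutine 1 [running]:
main.foo()
	/home/letian/work/go/src/test/test.go:14 +0x3e
main.main()
	/home/letian/work/go/src/test/test.go:18 +0x22
exit status 2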
```

  • Accessing an uninitialized or nil pointer

Anyone with C/C++ experience will find this easy to understand, but for newcomers who have never used pointers it is the most common kind of mistake.
The following code demonstrates it:

```go
package main

import (
	"fmt"
)

func foo() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println(err)
		}
	}()
	var bar *int
	fmt.Println(*bar)
}

func main() {
	foo()
	fmt.Println("exit")
}
```

Output:

```
runtime error: invalid memory address or nil pointer dereference
exit
```

If the recover lines are commented out, the output is:

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4869ff]

goroutine 1 [running]:
main.foo()
	/home/letian/work/go/src/test/test.go:14 +0x3f
main.main()
	/home/letian/work/go/src/test/test.go:18 +0x22
exit status 2
```

  • Sending on a chan that is already closed

This is another mistake that newcomers to chan make easily. The following code demonstrates it:

```go
package main

import (
	"fmt"
)

func foo() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println(err)
		}
	}()
	var bar = make(chan int, 1)
	close(bar)
	bar <- 1
}

func main() {
	foo()
	fmt.Println("exit")
}
```

Output:

```
send on closed channel
exit
```

If recover is commented out, the output is:

```
panic: send on closed channel

goroutine 1 [running]:
main.foo()
	/home/letian/work/go/src/test/test.go:15 +0x83
main.main()
	/home/letian/work/go/src/test/test.go:19 +0x22
exit status 2
```

The handling logic is in the chansend function in src/runtime/chan.go:

```go
// src/runtime/chan.go, line 269

/*
 * generic single channel send/recv
 * If block is not nil,
 * then the protocol will not
 * sleep but return if it could
 * not complete.
 *
 * sleep can wake up with g.param == nil
 * when a channel involved in the sleep has
 * been closed. it is easiest to loop and re-run
 * the operation; we'll see that it's now closed.
 */
func chansend(c *hchan, ep unsafe.Pointer, block bool, callerpc uintptr) bool {
	if c == nil {
		if !block {
			return false
		}
		gopark(nil, nil, waitReasonChanSendNilChan, traceEvGoStop, 2)
		throw("unreachable")
	}

	if debugChan {
		print("chansend: chan=", c, "\n")
	}

	if raceenabled {
		racereadpc(c.raceaddr(), callerpc, funcPC(chansend))
	}

	// Fast path: check for failed non-blocking operation without acquiring the lock.
	//
	// After observing that the channel is not closed, we observe that the channel is
	// not ready for sending. Each of these observations is a single word-sized read
	// (first c.closed and second c.recvq.first or c.qcount depending on kind of channel).
	// Because a closed channel cannot transition from 'ready for sending' to
	// 'not ready for sending', even if the channel is closed between the two observations,
	// they imply a moment between the two when the channel was both not yet closed
	// and not ready for sending. We behave as if we observed the channel at that moment,
	// and report that the send cannot proceed.
	//
	// It is okay if the reads are reordered here: if we observe that the channel is not
	// ready for sending and then observe that it is not closed, that implies that the
	// channel wasn't closed during the first observation.
	if !block && c.closed == 0 && ((c.dataqsiz == 0 && c.recvq.first == nil) ||
		(c.dataqsiz > 0 && c.qcount == c.dataqsiz)) {
		return false
	}

	var t0 int64
	if blockprofilerate > 0 {
		t0 = cputicks()
	}

	lock(&c.lock)

	if c.closed != 0 {
		unlock(&c.lock)
		panic(plainError("send on closed channel"))
	}

	if sg := c.recvq.dequeue(); sg != nil {
		// Found a waiting receiver. We pass the value we want to send
		// directly to the receiver, bypassing the channel buffer (if any).
		send(c, sg, ep, func() { unlock(&c.lock) }, 3)
		return true
	}

	if c.qcount < c.dataqsiz {
		// Space is available in the channel buffer. Enqueue the element to send.
		qp := chanbuf(c, c.sendx)
		if raceenabled {
			raceacquire(qp)
			racerelease(qp)
		}
		typedmemmove(c.elemtype, qp, ep)
		c.sendx++
		if c.sendx == c.dataqsiz {
			c.sendx = 0
		}
		c.qcount++
		unlock(&c.lock)
		return true
	}

	if !block {
		unlock(&c.lock)
		return false
	}

	// Block on the channel. Some receiver will complete our operation for us.
	gp := getg()
	mysg := acquireSudog()
	mysg.releasetime = 0
	if t0 != 0 {
		mysg.releasetime = -1
	}
	// No stack splits between assigning elem and enqueuing mysg
	// on gp.waiting where copystack can find it.
	mysg.elem = ep
	mysg.waitlink = nil
	mysg.g = gp
	mysg.isSelect = false
	mysg.c = c
	gp.waiting = mysg
	gp.param = nil
	c.sendq.enqueue(mysg)
	goparkunlock(&c.lock, waitReasonChanSend, traceEvGoBlockSend, 3)
	// Ensure the value being sent is kept alive until the
	// receiver copies it out. The sudog has a pointer to the
	// stack object, but sudogs aren't considered as roots of the
	// stack tracer.
	KeepAlive(ep)

	// someone woke us up.
	if mysg != gp.waiting {
		throw("G waiting list is corrupted")
	}
	gp.waiting = nil
	if gp.param == nil {
		if c.closed == 0 {
			throw("chansend: spurious wakeup")
		}
		panic(plainError("send on closed channel"))
	}
	gp.param = nil
	if mysg.releasetime > 0 {
		blockevent(mysg.releasetime-t0, 2)
	}
	mysg.c = nil
	releaseSudog(mysg)
	return true
}
```
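For comparison, the sketch below shows how three basic operations behave on a channel that is already closed. The try helper is a hypothetical name introduced for illustration; it runs an operation and reports the panic value as text. Receiving from a closed channel does not panic, while sending on it or closing it a second time does.

```go
package main

import "fmt"

// try runs op and returns the panic message, or "no panic" if op completed.
func try(op func()) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	op()
	return "no panic"
}

func main() {
	ch := make(chan int, 1)
	ch <- 7
	close(ch)

	fmt.Println(try(func() { <-ch }))      // no panic
	fmt.Println(try(func() { ch <- 1 }))   // send on closed channel
	fmt.Println(try(func() { close(ch) })) // close of closed channel
}
```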

  • Concurrent reads and writes on the same map

For those new to concurrent programming, reading and writing a map concurrently is another easy problem to run into. The following code demonstrates it:

```go
package main

import (
	"fmt"
)

func foo() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println(err)
		}
	}()
	var bar = make(map[int]int)
	go func() {
		defer func() {
			if err := recover(); err != nil {
				fmt.Println(err)
			}
		}()
		for {
			_ = bar[1]
		}
	}()
	for {
		bar[1] = 1
	}
}

func main() {
	foo()
	fmt.Println("exit")
}
```

Output:

```
fatal error: concurrent map read and map write

goroutine 5 [running]:
runtime.throw(0x4bd8b0, 0x21)
	/home/letian/.gvm/gos/go1.12/src/runtime/panic.go:617 +0x72 fp=0xc00004c780 sp=0xc00004c750 pc=0x427f22
runtime.mapaccess1_fast64(0x49eaa0, 0xc000088180, 0x1, 0xc0000260d8)
	/home/letian/.gvm/gos/go1.12/src/runtime/map_fast64.go:21 +0x1a8 fp=0xc00004c7a8 sp=0xc00004c780 pc=0x40eb58
main.foo.func2(0xc000088180)
	/home/letian/work/go/src/test/test.go:21 +0x5c fp=0xc00004c7d8 sp=0xc00004c7a8 pc=0x48708c
runtime.goexit()
	/home/letian/.gvm/gos/go1.12/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00004c7e0 sp=0xc00004c7d8 pc=0x450e51
created by main.foo
	/home/letian/work/go/src/test/test.go:14 +0x68

goroutine 1 [runnable]:
main.foo()
	/home/letian/work/go/src/test/test.go:25 +0x8b
main.main()
	/home/letian/work/go/src/test/test.go:30 +0x22
exit status 2
```

Attentive readers will notice that the log never shows the exit that the program prints at the end; the call stacks are dumped directly instead. Looking at src/runtime/map.go turns up these lines:

```go
if h.flags&hashWriting != 0 {
	throw("concurrent map read and map write")
}
```

Unlike the cases above, an exception raised by the runtime through throw cannot be caught with recover in business code, which makes it the most deadly of all. Any place that reads and writes a map concurrently should therefore protect the map with a lock.
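One way to apply that advice is sketched below with sync.RWMutex from the standard library. SafeMap, NewSafeMap, Get, and Set are hypothetical names chosen for this example, not part of any library; other designs are possible.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeMap is a hypothetical wrapper that guards a map with a read-write lock.
type SafeMap struct {
	mu sync.RWMutex
	m  map[int]int
}

func NewSafeMap() *SafeMap {
	return &SafeMap{m: make(map[int]int)}
}

// Get takes the read lock, so several readers may proceed at once.
func (s *SafeMap) Get(k int) (int, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

// Set takes the write lock, which excludes all readers and other writers.
func (s *SafeMap) Set(k, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

func main() {
	sm := NewSafeMap()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer
		defer wg.Done()
		for i := 0; i < 10000; i++ {
			sm.Set(1, i)
		}
	}()
	go func() { // reader
		defer wg.Done()
		for i := 0; i < 10000; i++ {
			sm.Get(1)
		}
	}()
	wg.Wait()
	v, _ := sm.Get(1)
	fmt.Println(v) // 9999
}
```

Since every access goes through the lock, the runtime never observes a read while the write flag is set, and the fatal error is not triggered.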

  • Type assertions

It is also easy to step into a pitfall when using a type assertion to convert an interface, and even people who have used interface for a while tend to overlook it. The following code demonstrates it:

```go
package main

import (
	"fmt"
)

func foo() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println(err)
		}
	}()
	var i interface{} = "abc"
	_ = i.([]string)
}

func main() {
	foo()
	fmt.Println("exit")
}
```

Output:

```
interface conversion: interface {} is string, not []string
exit
```

The source is in src/runtime/iface.go, in these two functions:

```go
// panicdottypeE is called when doing an e.(T) conversion and the conversion fails.
// have = the dynamic type we have.
// want = the static type we're trying to convert to.
// iface = the static type we're converting from.
func panicdottypeE(have, want, iface *_type) {
	panic(&TypeAssertionError{iface, have, want, ""})
}

// panicdottypeI is called when doing an i.(T) conversion and the conversion fails.
// Same args as panicdottypeE, but "have" is the dynamic itab we have.
func panicdottypeI(have *itab, want, iface *_type) {
	var t *_type
	if have != nil {
		t = have._type
	}
	panicdottypeE(t, want, iface)
}
```
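For contrast, a type assertion also has a two-value form that reports failure through a boolean instead of panicking. A brief sketch follows; the describe function is a hypothetical example.

```go
package main

import "fmt"

// describe uses the two-value ("comma ok") form of the type assertion,
// which sets ok to false on a mismatch instead of panicking.
func describe(i interface{}) string {
	if s, ok := i.([]string); ok {
		return fmt.Sprintf("slice of %d strings", len(s))
	}
	if s, ok := i.(string); ok {
		return "string " + s
	}
	return fmt.Sprintf("unexpected type %T", i)
}

func main() {
	fmt.Println(describe("abc"))              // string abc
	fmt.Println(describe([]string{"a", "b"})) // slice of 2 strings
	fmt.Println(describe(42))                 // unexpected type int
}
```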

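More panics

The cases above are only the panic scenarios most often met in basic syntax. The Go standard library uses panic in many more places; search the source for panic( to find the call sites, so that standard library functions don't catch you out later.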

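For reasons of space, this article does not cover techniques for filling in these pitfalls; later articles will cover them one by one. Thanks for reading!

Coming next

Go Pitfalls: channel and goroutine.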