24. Performance

找到瓶颈
- 获取执行时间
- 检查汇编代码
使用 declare 表达式
优化通用函数
- 使用静态分配

很多 Common Lisp 解释器会将源代码编译成汇编语言，因此，Common Lisp 的性能要比其他解释型语言要好很多。

然而，我们有时想要程序跑得更快一些。本章将介绍一些发挥 CPU 极限的技术。

找到瓶颈

获取执行时间

[time](http://www.lispworks.com/documentation/lw51/CLHS/Body/m_time.htm) 宏对找到瓶颈很有用。它接受一个列表，执行然后将其所消耗的时间信息输出到 [*trace-output*](http://www.lispworks.com/documentation/lw71/CLHS/Body/v_debug_.htm#STtrace-outputST)，如下所示：

* (defun collect (start end)
    "Collect numbers [start, end] as list."
    (loop for i from start to end
          collect i))
* (time (collect 1 10))
Evaluation took:
  0.000 seconds of real time
  0.000001 seconds of total run time (0.000001 user, 0.000000 system)
  100.00% CPU
  3,800 processor cycles
  0 bytes consed

通过 time 宏，就可以很简单的找出程序的哪一部分消耗的时间过多。

注意，这里的时间信息并不保证在真正的生产环境一样准确。本章只是用来演示如何进行性能优化。

检查汇编代码

函数 [disassemble](http://www.lispworks.com/documentation/lw60/CLHS/Body/f_disass.htm) 接受一个函数，然后将该函数编译后的代码输出到*标准输出（standard-output）*，如：

* (defun plus (a b)
    (+ a b))
PLUS
* (disassemble 'plus)
; disassembly for PLUS
; Size: 37 bytes. Origin: #x52B8063B
; 3B:       498B5D60         MOV RBX, [R13+96]                ; no-arg-parsing entry point
                                                              ; thread.binding-stack-pointer
; 3F:       48895DF8         MOV [RBP-8], RBX
; 43:       498BD0           MOV RDX, R8
; 46:       488BFE           MOV RDI, RSI
; 49:       FF1425B0001052   CALL QWORD PTR [#x521000B0]      ; GENERIC-+
; 50:       488B75E8         MOV RSI, [RBP-24]
; 54:       4C8B45F0         MOV R8, [RBP-16]
; 58:       488BE5           MOV RSP, RBP
; 5B:       F8               CLC
; 5C:       5D               POP RBP
; 5D:       C3               RET
; 5E:       CC0F             BREAK 15                         ; Invalid argument count trap

上述代码是在 SBCL 中执行的。在其他的编译器中，如 CLISP， disassemble 的返回结果可能有所不同：

* (defun plus (a b)
    (+ a b))
PLUS
* (disassemble 'plus)
Disassembly of function PLUS
2 required arguments
0 optional arguments
No rest parameter
No keyword parameters
4 byte-code instructions:
0     (LOAD&PUSH 2)
1     (LOAD&PUSH 2)
2     (CALLSR 2 55)                       ; +
5     (SKIP&RET 3)
NIL

发生以上的情况是因为 SBCL 将 Lisp 代码编译成机器码，而 CLISP 则不是。

使用 `declare` 表达式

declare 表达式可以对编译器进行多种优化。注意这些优化都是与解释器无关的。一些解释其如 SBCL 支持这个功能，可以阅读其文档获取详细信息。这里只是介绍一些 CLHS 中提到的基础方法。

一般来说， decalre 表达式只会出现在特定格式的头部，如果上下文允许的话，也可以放在文档字符串后。此外，声明表达式将被限制为特定的格式。在这，将介绍一些与性能调优相关的特性。

谨记一点，本章节中介绍的的优化技巧与 Lisp 的解释器有很大的联系。在使用 declare 之前记得查看相应的文档。

速度及安全

Lisp 可以通过 [optimize](http://www.lispworks.com/documentation/lw71/CLHS/Body/d_optimi.htm) 来指定编译器优化质量的属性。每个质量属性可以赋值为 0 到 3，0 表示完全不重要，3表示很重要。

质量中最值得关注的就是安全和速度

默认情况下，Lisp 更注重代码的安全性，而不是速度。但是可以通过调整其权重来实现更好的优化。

* (defun max-original (a b)
    (max a b))
MAX-ORIGINAL
* (disassemble 'max-original)
; disassembly for MAX-ORIGINAL
; Size: 144 bytes. Origin: #x52D450EF
; 7A7:       8D46F1           lea eax, [rsi-15]               ; no-arg-parsing entry point
; 7AA:       A801             test al, 1
; 7AC:       750E             jne L0
; 7AE:       3C0A             cmp al, 10
; 7B0:       740A             jeq L0
; 7B2:       A80F             test al, 15
; 7B4:       7576             jne L5
; 7B6:       807EF11D         cmp byte ptr [rsi-15], 29
; 7BA:       7770             jnbe L5
; 7BC: L0:   8D43F1           lea eax, [rbx-15]
; 7BF:       A801             test al, 1
; 7C1:       750E             jne L1
; 7C3:       3C0A             cmp al, 10
; 7C5:       740A             jeq L1
; 7C7:       A80F             test al, 15
; 7C9:       755A             jne L4
; 7CB:       807BF11D         cmp byte ptr [rbx-15], 29
; 7CF:       7754             jnbe L4
; 7D1: L1:   488BD3           mov rdx, rbx
; 7D4:       488BFE           mov rdi, rsi
; 7D7:       B9C1030020       mov ecx, 536871873              ; generic->
; 7DC:       FFD1             call rcx
; 7DE:       488B75F0         mov rsi, [rbp-16]
; 7E2:       488B5DF8         mov rbx, [rbp-8]
; 7E6:       7E09             jle L3
; 7E8:       488BD3           mov rdx, rbx
; 7EB: L2:   488BE5           mov rsp, rbp
; 7EE:       F8               clc
; 7EF:       5D               pop rbp
; 7F0:       C3               ret
; 7F1: L3:   4C8BCB           mov r9, rbx
; 7F4:       4C894DE8         mov [rbp-24], r9
; 7F8:       4C8BC6           mov r8, rsi
; 7FB:       4C8945E0         mov [rbp-32], r8
; 7FF:       488BD3           mov rdx, rbx
; 802:       488BFE           mov rdi, rsi
; 805:       B929040020       mov ecx, 536871977              ; generic-=
; 80A:       FFD1             call rcx
; 80C:       4C8B45E0         mov r8, [rbp-32]
; 810:       4C8B4DE8         mov r9, [rbp-24]
; 814:       488B75F0         mov rsi, [rbp-16]
; 818:       488B5DF8         mov rbx, [rbp-8]
; 81C:       498BD0           mov rdx, r8
; 81F:       490F44D1         cmoveq rdx, r9
; 823:       EBC6             jmp L2
; 825: L4:   CC0A             break 10                        ; error trap
; 827:       04               byte #X04
; 828:       13               byte #X13                       ; OBJECT-NOT-REAL-ERROR
; 829:       FE9B01           byte #XFE, #X9B, #X01           ; RBX
; 82C: L5:   CC0A             break 10                        ; error trap
; 82E:       04               byte #X04
; 82F:       13               byte #X13                       ; OBJECT-NOT-REAL-ERROR
; 830:       FE1B03           byte #XFE, #X1B, #X03           ; RSI
; 833:       CC0A             break 10                        ; error trap
; 835:       02               byte #X02
; 836:       19               byte #X19                       ; INVALID-ARG-COUNT-ERROR
; 837:       9A               byte #X9A                       ; RCX
* (defun max-with-speed-3 (a b)
    (declare (optimize (speed 3) (safety 0)))
    (max a b))
MAX-WITH-SPEED-3
* (disassemble 'max-with-speed-3)
; disassembly for MAX-WITH-SPEED-3
; Size: 92 bytes. Origin: #x52D452C3
; 3B:       48895DE0         mov [rbp-32], rbx                ; no-arg-parsing entry point
; 3F:       488945E8         mov [rbp-24], rax
; 43:       488BD0           mov rdx, rax
; 46:       488BFB           mov rdi, rbx
; 49:       B9C1030020       mov ecx, 536871873               ; generic->
; 4E:       FFD1             call rcx
; 50:       488B45E8         mov rax, [rbp-24]
; 54:       488B5DE0         mov rbx, [rbp-32]
; 58:       7E0C             jle L1
; 5A:       4C8BC0           mov r8, rax
; 5D: L0:   498BD0           mov rdx, r8
; 60:       488BE5           mov rsp, rbp
; 63:       F8               clc
; 64:       5D               pop rbp
; 65:       C3               ret
; 66: L1:   488945E8         mov [rbp-24], rax
; 6A:       488BF0           mov rsi, rax
; 6D:       488975F0         mov [rbp-16], rsi
; 71:       4C8BC3           mov r8, rbx
; 74:       4C8945F8         mov [rbp-8], r8
; 78:       488BD0           mov rdx, rax
; 7B:       488BFB           mov rdi, rbx
; 7E:       B929040020       mov ecx, 536871977               ; generic-=
; 83:       FFD1             call rcx
; 85:       488B45E8         mov rax, [rbp-24]
; 89:       488B75F0         mov rsi, [rbp-16]
; 8D:       4C8B45F8         mov r8, [rbp-8]
; 91:       4C0F44C6         cmoveq r8, rsi
; 95:       EBC6             jmp L0

如上所示，生成的汇编代码要短很多。编译器可以进行优化。然而，可以通过声明类型做的更好。

类型提示

在[第16章类型][zh-cn/16.type.md] 中有提到，Lisp 有个强大的类型系统。可以通过给定类型来减少编译器生成代码的体积。

* (defun max-with-type (a b)
    (declare (optimize (speed 3) (safety 0)))
    (declare (type integer a b))
    (max a b))
MAX-WITH-TYPE
* (disassemble 'max-with-type)
; disassembly for MAX-WITH-TYPE
; Size: 42 bytes. Origin: #x52D48A23
; 1B:       488BF7           mov rsi, rdi                     ; no-arg-parsing entry point
; 1E:       488975F0         mov [rbp-16], rsi
; 22:       488BD8           mov rbx, rax
; 25:       48895DF8         mov [rbp-8], rbx
; 29:       488BD0           mov rdx, rax
; 2C:       B98C030020       mov ecx, 536871820               ; generic-<
; 31:       FFD1             call rcx
; 33:       488B75F0         mov rsi, [rbp-16]
; 37:       488B5DF8         mov rbx, [rbp-8]
; 3B:       480F4CDE         cmovl rbx, rsi
; 3F:       488BD3           mov rdx, rbx
; 42:       488BE5           mov rsp, rbp
; 45:       F8               clc
; 46:       5D               pop rbp
; 47:       C3               ret

上面生成的汇编代码缩减为原来的 1/3。那么运行速度呢？

* (time (dotimes (i 10000) (max-original 100 200)))
Evaluation took:
  0.000 seconds of real time
  0.000107 seconds of total run time (0.000088 user, 0.000019 system)
  100.00% CPU
  361,088 processor cycles
  0 bytes consed
* (time (dotimes (i 10000) (max-with-type 100 200)))
Evaluation took:
  0.000 seconds of real time
  0.000044 seconds of total run time (0.000036 user, 0.000008 system)
  100.00% CPU
  146,960 processor cycles
  0 bytes consed

可以看到，通过指定类型，代码的执行要快很多！

等等……如果定义了错误的类型会怎么样呢？答案示情况而定。

比如说，SBCL 对待类型定义有其特殊的方法，叫 sbcl-type。根据不同的安全等级来进行不同等级的类型检查。如果安全登记设置为 0，那么 SBCL 不会进行类型检查。因此错误的类型声明将造成很严重的后果。

声明函数类型

另一个有用的声明是 ftype，ftype 会在函数参数类型和返回值的类型之间建立一种关系。如果传入的参数类型与定义的类型一致，返回值类型和定义的类型保持一致。基于这点，函数中可以有多个 ftype 声明。每次函数调用时，ftype 会约束参数的类型，其格式如下：

 (declare (ftype (function (arg1 arg2 ...) return-value) function-name1))

但函数返回 nil 时，它返回的类型是 null。该声明本身并没有对参数的类型做限制，只有在参数类型是指定的类型是才有效——否则不会产生任何，且声明也不会生效。比如说，下面的声明中，如果 square 函数的参数是修正数（fixnum），函数的返回值也会是修正数。

(declaim (ftype (function (fixnum) fixnum) square))
(defun square (x) (* x x))

如果提供的参数没有声明为 修正数 类型，就不会进行优化：

(defun do-some-arithmetic (x)
  (the fixnum (+ x (square x))))

现在，来试着优化速度吧。下面的代码中编译器将声明存在类型不确定：

(defun do-some-arithmetic (x)
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (the fixnum (+ x (square x))))
; compiling (DEFUN DO-SOME-ARITHMETIC ...)
; file: /tmp/slimeRzDh1R
 in: DEFUN DO-SOME-ARITHMETIC
;     (+ TEST-FRAMEWORK::X (TEST-FRAMEWORK::SQUARE TEST-FRAMEWORK::X))
;
; note: forced to do GENERIC-+ (cost 10)
;       unable to do inline fixnum arithmetic (cost 2) because:
;       The first argument is a NUMBER, not a FIXNUM.
;       unable to do inline (signed-byte 64) arithmetic (cost 5) because:
;       The first argument is a NUMBER, not a (SIGNED-BYTE 64).
;       etc.
;
; compilation unit finished
;   printed 1 note
      (disassemble 'do-some-arithmetic)
; disassembly for DO-SOME-ARITHMETIC
; Size: 53 bytes. Origin: #x52CD1D1A
; 1A:       488945F8         MOV [RBP-8], RAX                 ; no-arg-parsing entry point
; 1E:       488BD0           MOV RDX, RAX
; 21:       4883EC10         SUB RSP, 16
; 25:       B902000000       MOV ECX, 2
; 2A:       48892C24         MOV [RSP], RBP
; 2E:       488BEC           MOV RBP, RSP
; 31:       E8C2737CFD       CALL #x504990F8                  ; #<FDEFN SQUARE>
; 36:       480F42E3         CMOVB RSP, RBX
; 3A:       488B45F8         MOV RAX, [RBP-8]
; 3E:       488BFA           MOV RDI, RDX
; 41:       488BD0           MOV RDX, RAX
; 44:       E807EE42FF       CALL #x52100B50                  ; GENERIC-+
; 49:       488BE5           MOV RSP, RBP
; 4C:       F8               CLC
; 4D:       5D               POP RBP
; 4E:       C3               RET
NIL

现在可以给 x 声明个类型了，所以编译器可以假设 (square x) 表达式是个修正数（fixnum），然后使用修正运算符 +：

(defun do-some-arithmetic (x)
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (declare (type fixnum x))
  (the fixnum (+ x (square x))))
       (disassemble 'do-some-arithmetic)
; disassembly for DO-SOME-ARITHMETIC
; Size: 48 bytes. Origin: #x52C084DA
; 4DA:       488945F8         MOV [RBP-8], RAX                ; no-arg-parsing entry point
; 4DE:       4883EC10         SUB RSP, 16
; 4E2:       488BD0           MOV RDX, RAX
; 4E5:       B902000000       MOV ECX, 2
; 4EA:       48892C24         MOV [RSP], RBP
; 4EE:       488BEC           MOV RBP, RSP
; 4F1:       E8020C89FD       CALL #x504990F8                 ; #<FDEFN SQUARE>
; 4F6:       480F42E3         CMOVB RSP, RBX
; 4FA:       488B45F8         MOV RAX, [RBP-8]
; 4FE:       4801D0           ADD RAX, RDX
; 501:       488BD0           MOV RDX, RAX
; 504:       488BE5           MOV RSP, RBP
; 507:       F8               CLC
; 508:       5D               POP RBP
; 509:       C3               RET
NIL

内联代码

[inline](http://www.lispworks.com/documentation/lw51/CLHS/Body/d_inline.htm) 的作用是将函数调用替换成函数的主体，如果编译器支持的话。这种做法会降低函数调用的消耗，但会增加代码的体积。最适合使用 inline 的应该是那些使用频繁且简短的函数。下面的代码段展示了何如使用及禁用内联代码。

;; The globally defined function DISPATCH should be open-coded,
;; if the implementation supports inlining, unless a NOTINLINE
;; declaration overrides this effect.
(declaim (inline dispatch))
(defun dispatch (x) (funcall (get (car x) 'dispatch) x))
;; Here is an example where inlining would be encouraged.
;; Because function DISPATCH was defined as INLINE, the code
;; inlining will be encouraged by default.
(defun use-dispatch-inline-by-default ()
  (dispatch (read-command)))
;; Here is an example where inlining would be prohibited.
;; The NOTINLINE here only affects this function.
(defun use-dispatch-with-declare-notinline  ()
  (declare (notinline dispatch))
  (dispatch (read-command)))
;; Here is an example where inlining would be prohibited.
;; The NOTINLINE here affects all following code.
(declaim (notinline dispatch))
(defun use-dispatch-with-declaim-noinline ()
  (dispatch (read-command)))
;; Inlining would be encouraged because you specified it.
;; The INLINE here only affects this function.
(defun use-dispatch-with-inline ()
  (declare (inline dispatch))
  (dispatch (read-command)))

注意，当内联函数改变时，所有调用这个内联函数的函数都需要重新编译。

优化通用函数

使用静态分配

在开发时期，通用函数提供了极大的便利和灵活性。然而，灵活性所带来是开销：通用函数要比普通函数慢的多。尤其是在不需要灵活性的时候，其带来的性能消耗将会称为负担。

[inlined-generic-function](https://github.com/guicho271828/inlined-generic-function) 包中有将通用函数转化成静态分配的函数，将分配的开销移到编译的时间中去。只需要将通用函数定义成 inlined-generic-function 类型的函数。

注意事项

这个包被声明为实验性的，因此在正规的软件产品中不推荐使用。否则后果自负！

* (defgeneric plus (a b)
    (:generic-function-class inlined-generic-function))
#<INLINED-GENERIC-FUNCTION HELLO::PLUS (2)>
* (defmethod plus ((a fixnum) (b fixnum))
    (+ a b))
#<INLINED-METHOD HELLO::PLUS (FIXNUM FIXNUM) {10056D7513}>
* (defun func-using-plus (a b)
    (plus a b))
FUNC-USING-PLUS
* (defun func-using-plus-inline (a b)
    (declare (inline plus))
    (plus a b))
FUNC-USING-PLUS-INLINE
* (time
   (dotimes (i 100000)
     (func-using-plus 100 200)))
Evaluation took:
  0.018 seconds of real time
  0.017819 seconds of total run time (0.017800 user, 0.000019 system)
  100.00% CPU
  3 lambdas converted
  71,132,440 processor cycles
  6,586,240 bytes consed
* (time
   (dotimes (i 100000)
     (func-using-plus-inline 100 200)))
Evaluation took:
  0.001 seconds of real time
  0.000326 seconds of total run time (0.000326 user, 0.000000 system)
  0.00% CPU
  1,301,040 processor cycles
  0 bytes consed

默认情况下不启用内联，因为一旦内联，对方法所做的更改将不会生效。

可以在 [*features*](http://www.lispworks.com/documentation/lw71/CLHS/Body/v_featur.htm) 中添加 :inline-generic-function 标志来启用全局内联。

* (push :inline-generic-function *features*)
(:INLINE-GENERIC-FUNCTION :SLYNK :CLOSER-MOP :CL-FAD :BORDEAUX-THREADS
:THREAD-SUPPORT :CL-PPCRE ALEXANDRIA.0.DEV::SEQUENCE-EMPTYP :QUICKLISP
:QUICKLISP-SUPPORT-HTTPS :SB-BSD-SOCKETS-ADDRINFO :ASDF3.3 :ASDF3.2 :ASDF3.1
:ASDF3 :ASDF2 :ASDF :OS-UNIX :NON-BASE-CHARS-EXIST-P :ASDF-UNICODE :ROS.INIT
:X86-64 :64-BIT :64-BIT-REGISTERS :ALIEN-CALLBACKS :ANSI-CL :AVX2
:C-STACK-IS-CONTROL-STACK :CALL-SYMBOL :COMMON-LISP :COMPACT-INSTANCE-HEADER
:COMPARE-AND-SWAP-VOPS :CYCLE-COUNTER :ELF :FP-AND-PC-STANDARD-SAVE ..)

当该全局内联启用时，除非被声明为 notinline，否则其他所有可内联的通用函数函数都会变成内联函数。