c++ 性能优化 - 《计算机基础》

c++性能优化工具
常见代码的性能
- 对象作为函数参数 / 返回值
- 编译期计算
多线程并发

c++性能优化工具

计时器

c API
std API
instruction - rdtsc: The Time Stamp Counter (TSC) is a 64-bit register present on all x86 processors since the Pentium. It counts the number of CPU cycles since its reset. The instruction RDTSC returns the TSC in EDX:EAX. 该指令在多数环境下有封装好的函数

clock
std::chrono::steady_clock
rdtsc

在使用中需要防止优化

使用global变量, 避免编译器优化未使用的局部变量
使用锁作为内存屏障
__attribute__((noinline)) 防止内联

google-perftools

ubuntu21安装 / 低版本可能需要从源码编译:

sudo apt install google-perftools
# git clone https://github.com/gperftools/gperftools

使用

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so.0 CPUPROFILE=test.prof ./bin/vb_spectest ../tests/testsuite/
google-pprof --svg ./bin/vb_spectest test.prof > test.svg

gprof - gcc 原生自带

使用

# for make project
g++ -Og -g -pg ...
# for cmake project
cmake .. -DENABLE_SPECTEST=1 -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-pg -Og -g"
./可执行文件
gprof 可执行文件 gmon.out > gprof.out

快速查看反汇编代码网站

https://godbolt.org

函数签名反混淆

c++filt <反汇编中的函数名>

常见代码的性能

对象作为函数参数 / 返回值

// test.h
class Test {
  int vec[10];
public:
  Test();
  Test(Test const &t);
  Test(Test &&t);
  Test &operator=(Test const &t);
  Test &operator=(Test &&t);
  void nochange(int a) const;
  void change(int a);
  Test copy_and_change() const;
};
// test.cpp
__attribute__((noinline)) void foo_const(Test const &a) { a.nochange(1); }
__attribute__((noinline)) void foo_copy_use(Test a) { a.change(1); }
__attribute__((noinline)) void foo_change(Test &a) { a.change(1); }
void test() {
  Test a;
  foo_const(a);
  foo_copy_use(a);
  foo_copy_use(std::forward<Test>(a));
  foo_change(a);
}

在函数内对象不可变, 传递const ref相当于传指针

// g++ -S -O2 ./main.cpp -std=c++17 
__attribute__((noinline)) void foo_const(Test const &a) { a.nochange(1); }
foo_const(a);

 sub    x0, x29, #40                    ; =40
 bl    __Z9foo_constRK4Test            ; foo_const(Test const&)

在函数内需要操作传入对象的复制时, 直接使用对象可以同时处理左值右值
```c void foo2(Test a) { a.change(1); }

foo_copy_use(a); foo_copy_use(std::forward(a);

```assembly
; 传左值
    add    x0, sp, #48                     ; =48
    sub    x1, x29, #40                    ; =40
    bl    __ZN4TestC1ERKS_                                ; Test::Test(Test const&) 调用拷贝构造
    add    x0, sp, #48                     ; =48
    bl    __Z12foo_copy_use4Test                    ; foo_copy_use(Test)
; 传右值
  add    x0, sp, #8                      ; =8
    sub    x1, x29, #40                    ; =40
    bl    __ZN4TestC1EOS_                                    ; Test::Test(Test&&) 调用移动构造
    add    x0, sp, #8                      ; =8
    bl    __Z12foo_copy_use4Test

在函数内对象可变, 传递ref相当于传指针
```c attribute((noinline)) void foo_change(Test &a) { a.change(1); }

foo_change(a);

```assembly
    sub    x0, x29, #40                    ; =40
    bl    __Z10foo_changeR4Test

返回值优化
在完成ret调用后, 理论上应分别调用Test的移动构造和移动赋值. 但在arm64实际中汇编如下:
通过x8寄存器, 将返回值地址传入函数内部, 直接在对应地址构造
ref: cppreference: copy elision
ref: Microsoft Docs: Overview of ARM64 ABI conventions
```
Test b = ret(a);
b = ret(a);
```
```assembly Z3retRK4Test: ; @_Z3retRK4Test .cfi_startproc ; %bb.0: mov x0, x8 b ZN4TestC1Ev .cfi_endproc

; Test b = ret(a); add x8, sp, #48 ; =48 bl __Z4retRK4Test

; b = ret(a); add x8, sp, #8 ; =8 bl Z4retRK4Test add x0, sp, #48 ; =48 add x1, sp, #8 ; =8 bl ZN4TestaSEOS_

> For types greater than 16 bytes, the caller shall reserve a block of memory of sufficient size and alignment to hold the result. The address of the memory block shall be passed as an additional argument to the function in x8. The callee may modify the result memory block at any point during the execution of the subroutine. The callee isn't required to preserve the value stored in x8.
<a name="view"></a>
### view
c++17 `string_view` 仅保存`string`对象头指针和长度, 适合用于传参
C++20 `ranges::view` 更抽象的试图概念, 使用者需要保证view存续时间指向的数据一直存在
<a name="32dab895"></a>
### 为移动构造增加noexcept
当使用stl容器储存该对象时, 若对象的移动构造函数声明不抛出异常, 容器扩容时会调用移动构造函数. 若移动构造可能抛出异常, 移动部分后将无法保证`push_back`等函数的异常安全性(老数据被破坏), stl会选择使用拷贝构造函数
<a name="08954579"></a>
### 智能指针
`std::unique_ptr` 有着良好的性能, 可以认为和裸指针性能相同
`std::shared_ptr` 内部包含着原子量, 在多线程环境可能会导致一些额外的性能开销(锁总线)
<a name="cfe7eb2f"></a>
### 虚函数
多一次跳转. 大多数情况下对性能影响不大
<a name="mutex"></a>
### mutex
现代c++中普遍进行了优化, 普通mutex性能开销较低
<a name="atomic"></a>
### atomic
性能较mutex更高, 但推荐只在单一变量的同步中使用
<a name="40eac522"></a>
### 浮点数
编译器不会自动优化浮点数运算顺序(-O3不会, -Ofast中会), 需要手动合并
```c
double a;
a = 1.0 * a * 2.0;  // 不好
a = 1.0 * 2.0 * a;  // 好
a = a * 1.0 * 2.0;  // 不好
a = a * (1.0 * 2.0); // 好

编译期计算

constexpr: c++11, 14, 17, 20逐渐完善
模板元编程

多线程并发

编译器重排

mac m1 clang 无法复现

int a, b;
int main(void) {
  int c = (a + 1) * 10;
  a = b + 11;
  return c;
}

0000000100003f90 <_main>:
100003f90: 08 00 00 b0     adrp    x8, 0x100004000 <_main+0x4>
100003f94: 09 01 40 b9     ldr    w9, [x8]
100003f98: 4a 01 80 52     mov    w10, #10
100003f9c: 29 7d 0a 1b     mul    w9, w9, w10
100003fa0: 20 29 00 11     add    w0, w9, #10
100003fa4: 1f 20 03 d5     nop
100003fa8: e9 02 00 18     ldr    w9, 0x100004004 <_b>
100003fac: 29 2d 00 11     add    w9, w9, #11
100003fb0: 09 01 00 b9     str    w9, [x8]
100003fb4: c0 03 5f d6     ret

CPU乱序执行

ref: memory ordering

屏障/内存序

for c++ atomic

memory_order_relaxed: only this operation’s atomicity is guaranteed
memory_order_acquire: no reads or writes in the current thread can be reordered before this load. 保证在load完成后进行后续操作
memory_order_release: no reads or writes in the current thread can be reordered after this store. 保证store前所有操作已经完成
memory_order_acq_rel: No memory reads or writes in the current thread can be reordered before or after this store.
memory_order_seq_cst (default): all threads observe all modifications in the same order

ref: 对优化说不 - Linux中的Barrier