Different assemblers

MASM

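Microsoft Macro Assembler, the assembler developed by Microsoft, mainly used for development on Windows. The name MASM usually refers to the assembler for 16-bit/32-bit targets, while ML64 is the MASM assembler for 64-bit targets.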

NASM/YASM

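Netwide Assembler, a popular assembler on Linux. It is mainly used for Linux development but can also be used on Windows.
YASM is a complete rewrite of NASM under the BSD license and is designed to be compatible with NASM syntax.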

FASM

Flat Assembler, another assembler, available on both Windows and Linux, though it is not the default assembler on either platform.

Available registers

https://wiki.osdev.org/CPU_Registers_x86-64

The 64-bit versions of the ‘original’ x86 registers are named:

rax - register a extended
rbx - register b extended
rcx - register c extended
rdx - register d extended
rbp - register base pointer (start of stack)
rsp - register stack pointer (current location in stack, growing downwards)
rsi - register source index (source for data copies)
rdi - register destination index (destination for data copies)

The registers added for 64-bit mode are named:

r8 - register 8
r9 - register 9
r10 - register 10
r11 - register 11
r12 - register 12
r13 - register 13
r14 - register 14
r15 - register 15

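In addition, eax is the low 32 bits of rax, ax is the low 16 bits of rax, al is the low 8 bits, and ah is the upper 8 bits of the low 16 bits. The registers whose names start with r followed by a number use a different scheme: r8d is the low 32 bits, r8w the low 16 bits, and r8b the low 8 bits. These names are clearly more regular.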
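
The relationship between rax and its narrower views can be sketched in Python by masking and shifting a 64-bit value. This is only an illustration; the helper name `sub_registers` is made up for this example.

```python
MASK64 = (1 << 64) - 1

def sub_registers(rax):
    """Return the narrower views of a 64-bit register value."""
    rax &= MASK64
    return {
        "eax": rax & 0xFFFFFFFF,  # low 32 bits
        "ax": rax & 0xFFFF,       # low 16 bits
        "al": rax & 0xFF,         # low 8 bits
        "ah": (rax >> 8) & 0xFF,  # bits 8..15, the upper half of ax
    }

views = sub_registers(0x1122334455667788)
# views["eax"] == 0x55667788, views["ax"] == 0x7788
# views["al"] == 0x88, views["ah"] == 0x77
```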

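Function calling conventions

The calling convention described by Intel (the Microsoft x64 convention) differs from the one used on Linux, so keep this in mind when reading code.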

Intel C Style function calling：

For the Microsoft* x64 calling convention, the additional register space let fastcall be the only calling convention (under x86 there were many: stdcall, thiscall, fastcall, cdecl, etc.). The rules for interfacing with C/C++ style functions:

RCX, RDX, R8, R9 are used for integer and pointer arguments in that order left to right.
XMM0, 1, 2, and 3 are used for floating point arguments.
Additional arguments are pushed on the stack right to left, so that in memory they appear in left-to-right order at increasing addresses.
Parameters less than 64 bits long are not zero extended; the high bits contain garbage.
It is the caller’s responsibility to allocate 32 bytes of “shadow space” (for storing RCX, RDX, R8, and R9 if needed) before calling the function.
It is the caller’s responsibility to clean the stack after the call.
Integer return values (similar to x86) are returned in RAX if 64 bits or less.
Floating point return values are returned in XMM0.
Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller.
The stack is 16-byte aligned. The “call” instruction pushes an 8-byte return address, so all non-leaf functions must adjust the stack by a value of the form 16n+8 when allocating stack space.
Registers RAX, RCX, RDX, R8, R9, R10, and R11 are considered volatile and must be considered destroyed on function calls.
RBX, RBP, RDI, RSI, R12, R13, R14, and R15 must be saved in any function using them.
Note there is no calling convention for the floating point (and thus MMX) registers.
Further details (varargs, exception handling, stack unwinding) are at Microsoft’s site.
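
The 16n+8 rule and the 32-byte shadow space can be combined into a small calculation. The sketch below uses a hypothetical helper, `stack_adjustment`, and assumes a function that pushes no registers in its prologue and only needs room for its locals plus the shadow space of the functions it calls.

```python
SHADOW_SPACE = 32  # bytes the caller reserves for RCX, RDX, R8, R9

def stack_adjustment(local_bytes):
    """Smallest value of the form 16n + 8 that covers locals plus shadow space."""
    needed = local_bytes + SHADOW_SPACE
    n = (needed + 7) // 16  # ceil((needed - 8) / 16)
    return 16 * n + 8

# A function with no locals still subtracts 40 bytes from rsp:
# 32 bytes of shadow space plus 8 bytes to restore 16-byte alignment.
no_locals = stack_adjustment(0)
```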

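Linux ABI (System V AMD64):

Argument order:
Integers: rdi, rsi, rdx, rcx, r8, r9
Floating point: xmm0 to xmm7
Arguments that do not fit in registers are pushed onto the stack from right to left, so that they are read back in the natural left-to-right order
Return address: at [rsp] on entry to the callee; rsp must be 16-byte aligned before the call
Registers the callee must preserve: rbp, rbx, r12 to r15
The callee must also preserve the control bits of MXCSR and the x87 control word; x87 instructions are rarely used, so in practice there is usually nothing to do here
Return values:
Integers: rax or rdx:rax
Floating point: xmm0 or xmm1:xmm0 (a C float or double fits in xmm0, so xmm1 is not needed for ordinary scalar return values)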

https://cs.lmu.edu/~ray/notes/nasmtutorial/

From left to right, pass as many parameters as will fit in registers. The order in which registers are allocated, are:
- For integers and pointers, rdi, rsi, rdx, rcx, r8, r9.
- For floating-point (float, double), xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7.
Additional parameters are pushed on the stack, right to left, and are to be removed by the caller after the call.
After the parameters are pushed, the call instruction is made, so when the called function gets control, the return address is at [rsp], the first memory parameter is at [rsp+8], etc.
The stack pointer rsp must be aligned to a 16-byte boundary before making a call. Fine, but the process of making a call pushes the return address (8 bytes) on the stack, so when a function gets control, rsp is not aligned. You have to make that extra space yourself, by pushing something or subtracting 8 from rsp.
The only registers that the called function is required to preserve (the callee-save registers) are: rbp, rbx, r12, r13, r14, r15. All others are free to be changed by the called function.
The callee is also supposed to save the control bits of the MXCSR and the x87 control word, but x87 instructions are rare in 64-bit code so you probably don’t have to worry about this.
Integers are returned in rax or rdx:rax, and floating point values are returned in xmm0 or xmm1:xmm0.
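In addition, the Linux ABI defines the 128 bytes just below the address in rsp as the 'red zone'. A function may use this area freely for local variables without adjusting rsp.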
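
The register assignment rules of the Linux ABI can be sketched as a simple allocator. This is a simplified model that only distinguishes integer/pointer arguments from float/double arguments and ignores structs and other aggregate types; the function name `assign_arguments` is hypothetical.

```python
INT_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
FLOAT_REGS = ["xmm%d" % i for i in range(8)]

def assign_arguments(kinds):
    """Map each argument kind ('int' or 'float') to a register or 'stack'.

    Arguments are scanned left to right; each class draws from its own
    register pool, and anything left over goes to the stack.
    """
    int_free = list(INT_REGS)
    float_free = list(FLOAT_REGS)
    locations = []
    for kind in kinds:
        pool = int_free if kind == "int" else float_free
        locations.append(pool.pop(0) if pool else "stack")
    return locations

# f(long a, double b, char *c) would be passed in rdi, xmm0, rsi.
example = assign_arguments(["int", "float", "int"])
```

Note that the two pools are independent: a floating-point argument does not consume an integer register slot.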

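Addressing modes

$Imm is an immediate value; it involves neither a register access nor a memory access.
ra denotes a register operand; R[ra] is shorthand for the contents of register ra.
Imm(rb, ri, s) denotes M[Imm + R[rb] + R[ri] * s], where M[…] stands for the contents of memory at the given address.

Imm: immediate offset
rb: base register
ri: index register
s: scale factor (1, 2, 4, or 8)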
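
The address computation behind Imm(rb, ri, s) can be written out as a one-line formula. The sketch below takes the register contents as plain integers; `effective_address` is a name chosen for this illustration.

```python
def effective_address(imm, base, index, scale):
    """Compute Imm + R[rb] + R[ri] * s, wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (imm + base + index * scale) & ((1 << 64) - 1)

# 8(%rdi,%rsi,4) with %rdi = 0x1000 and %rsi = 3 refers to address 0x1014.
address = effective_address(8, 0x1000, 3, 4)
```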

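Condition code register

Suppose ADD has just computed t = a + b (SUB sets the flags analogously for a - b, with CF indicating a borrow). Then:
CF: (unsigned) t < (unsigned) a; the carry flag, set when the operation overflows as an unsigned computation
ZF: t == 0; the result is zero
SF: t < 0; the result is negative
OF: (a < 0 == b < 0) && (t < 0 != a < 0); a and b have the same sign but the result has a different sign, i.e. signed overflow occurred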
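
The four conditions can be checked directly by modeling a 64-bit ADD on unsigned bit patterns. This is a sketch of the formulas above rather than a full processor model; `add_flags` is a hypothetical helper.

```python
MASK64 = (1 << 64) - 1
SIGN_BIT = 1 << 63

def add_flags(a, b):
    """Return (t, flags) for a 64-bit ADD of two bit patterns."""
    a &= MASK64
    b &= MASK64
    t = (a + b) & MASK64
    a_neg = bool(a & SIGN_BIT)
    b_neg = bool(b & SIGN_BIT)
    t_neg = bool(t & SIGN_BIT)
    flags = {
        "CF": t < a,                                  # unsigned overflow
        "ZF": t == 0,                                 # result is zero
        "SF": t_neg,                                  # result is negative
        "OF": (a_neg == b_neg) and (t_neg != a_neg),  # signed overflow
    }
    return t, flags

# 0xFFFF...FFFF + 1 wraps to 0: carry and zero are set, but there is no
# signed overflow because this is -1 + 1 in two's complement.
wrapped, wrapped_flags = add_flags(MASK64, 1)
```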

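Instruction list

Data movement instructions
- MOV(b, w, l, q, absq)
  - params: S, D
  - result: D <- S
  - movl to a register also sets the upper 32 bits of that register to zero; movabsq loads a full 64-bit immediate into a register
- MOVZ(bw, bl, wl, bq, wq)
  - params: S, D
  - result: D <- ZeroExtend(S)
  - Moves data between operands of different sizes, filling the upper bits with 0. The source must be narrower than the destination. There is no movzlq because it would be equivalent to movl, which already zeroes the upper 32 bits
- MOVS(bw, bl, wl, bq, wq, lq)
  - Like MOVZ, but fills the upper bits with copies of the sign bit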
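
The difference between MOVZ and MOVS shows up only when the source has its top bit set. A minimal sketch, with helper names invented for this example:

```python
def zero_extend(value, from_bits):
    """MOVZ-style: keep the low from_bits bits and fill the rest with 0."""
    return value & ((1 << from_bits) - 1)

def sign_extend(value, from_bits, to_bits):
    """MOVS-style: replicate the source sign bit into the upper bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits  # reinterpret as a negative number
    return value & ((1 << to_bits) - 1)

# The byte 0x80 becomes 0x00000080 with movzbl but 0xFFFFFF80 with movsbl.
zero = zero_extend(0x80, 8)
sign = sign_extend(0x80, 8, 32)
```
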
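Stack instructions
- pushq
  - params: S
  - result: R[%rsp] <- R[%rsp] - 8; M[R[%rsp]] <- S
  - Pushes the operand onto the stack: the stack pointer is first moved down to make room, then S is written into the freed slot
- popq
  - params: D
  - result: D <- M[R[%rsp]]; R[%rsp] <- R[%rsp] + 8
  - Pops the value at the top of the stack into the destination D; the steps are the reverse of a push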
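
The two-step effect of pushq and popq can be modeled with a toy stack. In this sketch memory is a dictionary of 8-byte slots keyed by address, and the starting value of rsp is an arbitrary number chosen for illustration.

```python
class Stack:
    """Toy model of the stack: memory maps addresses to 8-byte values."""

    def __init__(self, rsp=0x7FFF0000):
        self.rsp = rsp
        self.memory = {}

    def pushq(self, value):
        self.rsp -= 8                  # R[%rsp] <- R[%rsp] - 8
        self.memory[self.rsp] = value  # M[R[%rsp]] <- S

    def popq(self):
        value = self.memory[self.rsp]  # D <- M[R[%rsp]]
        self.rsp += 8                  # R[%rsp] <- R[%rsp] + 8
        return value

stack = Stack(rsp=0x1000)
stack.pushq(1)
stack.pushq(2)   # rsp is now 0xFF0, and the stack has grown downwards
```
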
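Arithmetic operations
- leaq:
  - params: S, D
  - result: D <- &S
  - Computes the address according to the addressing-mode rules, but instead of loading the value stored at that address, writes the computed address itself to the destination
- INC, DEC, NEG, NOT
  - params: D
  - Increments or decrements the value at D by 1, negates it (NEG), or takes its bitwise complement (NOT), and writes the result back to D
- ADD, SUB, IMUL, XOR, OR, AND
  - params: S, D
  - Computes D + S, D - S, and so on, and writes the result to D
- SAL, SHL, SHR, SAR
  - params: k, D
  - Shifts the value at D left or right by k, arithmetically or logically, and writes the result to D. k is an immediate or the %cl register. SAL and SHL are the same operation; SHR fills with zeros, and SAR fills with copies of the sign bit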
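
The two right shifts differ only for values whose sign bit is set. The following sketch models them on 64-bit patterns; `shr` and `sar` here are plain Python helpers named after the instructions.

```python
MASK64 = (1 << 64) - 1

def shr(value, k):
    """Logical right shift: vacated high bits are filled with 0."""
    return (value & MASK64) >> k

def sar(value, k):
    """Arithmetic right shift: vacated high bits copy the sign bit."""
    value &= MASK64
    if value & (1 << 63):
        value -= 1 << 64  # reinterpret the pattern as a negative integer
    return (value >> k) & MASK64

minus_eight = -8 & MASK64
# sar keeps the sign, giving the bit pattern of -4;
# shr produces a large positive number instead.
arithmetic = sar(minus_eight, 1)
logical = shr(minus_eight, 1)
```
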
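Special arithmetic operations, which involve several implicitly used registers
- imulq, mulq
  - params: S
  - result: R[%rdx]:R[%rax] <- S * R[%rax]
  - The i prefix marks the signed version. Computes S * %rax; the product can be up to 128 bits wide, with the upper 64 bits stored in %rdx and the lower 64 bits in %rax
- cqto
  - no params
  - Fills %rdx with copies of the sign bit of %rax. The division instructions below take a 128-bit dividend in %rdx:%rax, so when the dividend is only 64 bits wide, its upper half (%rdx) must first be filled with zeros or with the sign bit. Typically xor %rdx, %rdx clears it for unsigned division, and cqto fills it with the sign bit for signed division
- idivq, divq
  - params: S
  - R[%rdx] <- R[%rdx]:R[%rax] mod S
  - R[%rax] <- R[%rdx]:R[%rax] / S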
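
The interplay of %rdx and %rax in these instructions can be sketched with Python's arbitrary-precision integers. The model below assumes that idivq rounds the quotient toward zero and does not model the fault raised on division by zero or quotient overflow; all function names are illustrative.

```python
MASK64 = (1 << 64) - 1

def to_signed(value, bits=64):
    """Interpret a bit pattern as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mulq(rax, s):
    """Unsigned multiply: returns (rdx, rax) holding the 128-bit product."""
    product = (rax & MASK64) * (s & MASK64)
    return product >> 64, product & MASK64

def cqto(rax):
    """Return the rdx value that fills all 64 bits with rax's sign bit."""
    return MASK64 if rax & (1 << 63) else 0

def idivq(rdx, rax, s):
    """Signed divide of rdx:rax by s: returns (quotient, remainder)."""
    dividend = to_signed((rdx << 64) | rax, 128)
    divisor = to_signed(s)
    quotient = abs(dividend) // abs(divisor)  # truncate toward zero
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    remainder = dividend - quotient * divisor
    return quotient & MASK64, remainder & MASK64

# -7 / 2: sign-extend rax into rdx with cqto, then divide.
minus_seven = -7 & MASK64
quotient, remainder = idivq(cqto(minus_seven), minus_seven, 2)
```

In this model the quotient is the bit pattern of -3 and the remainder that of -1, so the remainder takes the sign of the dividend.
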
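Comparison instructions, which exist only to set the condition codes
- CMP(b, w, l, q)
  - params: S1, S2
  - Sets the condition codes based on S2 - S1
- TEST(b, w, l, q)
  - params: S1, S2
  - Sets the condition codes based on S1 & S2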
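Setting a value from the condition codes
- set(e, z, ne, nz, s, ns, g, ge, nle, nl, l, le, nge, ng, a, nbe, ae, nb, b, nae, be, na)
  - e = equal, z = zero (e and z are equivalent), s = sign (sign bit set, i.e. negative)
  - g = greater, l = less (signed comparison)
  - a = above, b = below (unsigned comparison)
  - n = not, negating the condition that follows (so nl and ge are equivalent: "not less" is the same as "greater or equal")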
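
How the suffixes map onto the flags can be sketched by combining a model of cmpq with a table of conditions. The flag combinations below are the standard x86 definitions for a subset of the suffixes; the helper names `cmp_flags` and `setcc` are made up for this example.

```python
MASK64 = (1 << 64) - 1
SIGN_BIT = 1 << 63

def cmp_flags(s1, s2):
    """Flags for `cmpq S1, S2`, which are based on S2 - S1."""
    s1 &= MASK64
    s2 &= MASK64
    t = (s2 - s1) & MASK64
    s1_neg = bool(s1 & SIGN_BIT)
    s2_neg = bool(s2 & SIGN_BIT)
    t_neg = bool(t & SIGN_BIT)
    return {
        "CF": s2 < s1,                                   # unsigned borrow
        "ZF": t == 0,
        "SF": t_neg,
        "OF": (s1_neg != s2_neg) and (t_neg != s2_neg),  # signed overflow
    }

CONDITIONS = {
    "e": lambda f: f["ZF"],
    "ne": lambda f: not f["ZF"],
    "s": lambda f: f["SF"],
    "l": lambda f: f["SF"] != f["OF"],
    "ge": lambda f: f["SF"] == f["OF"],
    "le": lambda f: (f["SF"] != f["OF"]) or f["ZF"],
    "g": lambda f: (f["SF"] == f["OF"]) and not f["ZF"],
    "b": lambda f: f["CF"],
    "ae": lambda f: not f["CF"],
    "be": lambda f: f["CF"] or f["ZF"],
    "a": lambda f: not f["CF"] and not f["ZF"],
}

def setcc(cond, flags):
    """Return the byte (0 or 1) that set<cond> would write."""
    return int(CONDITIONS[cond](flags))

# Compare S2 = -1 against S1 = 1: "less" as signed, "above" as unsigned.
flags = cmp_flags(1, -1 & MASK64)
signed_less = setcc("l", flags)
unsigned_above = setcc("a", flags)
```
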
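Jump instructions, which jump to a given label; labels are replaced with actual addresses by the time linking is done
- jmp, je, jne, jz, jnz, js, jns, jg, jnle, jge, jnl, jl, jnge, jle, jng, ja, jnbe, jae, jnb, jb, jnae, jbe, jna
  - Like set, jmp has a family of variants keyed on the condition codes, using the same conditions as the set variants. The difference is that set writes the outcome of the test to its destination, whereas a conditional jump transfers control when the test is true
- jmp itself is an unconditional jump. Its target can be indirect (an address held in a register or in memory) or a label (a direct address, which is fixed once linking is done); the other variants can only jump to labels
- Target encoding
  - PC-relative: the target is encoded as an offset from the address of the next instruction. This is the common case because the encoding is compact: a short jump needs only one opcode byte and one offset byte
  - Absolute: the target address is encoded directly
- (You will sometimes see the instruction sequence repz; retq. The meaning of ret is clear. rep is normally a prefix for repeated string operations, but here repz has no effect. It is there to work around an implementation issue on AMD processors, whose branch prediction may not handle such a ret well, which would slow the code down. It has no other effect.)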
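Conditional move instructions
- cmov(e, z, ne, nz, s, ns, g, nle, ge, nl, l, nge, le, ng, a, nbe, ae, nb, b, nae, be, na)
- params: S, R; the source can be a register or a memory location, but the destination must be a register
- The suffixes have the same meaning as the conditions above
- A conditional move is often more efficient than a conditional jump, so for branches that have no side effects other than the assignment, the compiler may prefer a conditional move. Deciding whether a branch is free of side effects is not trivial, though, and a conditional move requires both candidate values to be computed beforehand, which can sometimes cost more than the jump would. These trade-offs belong to another subject, compiler design.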
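
The requirement that both candidate values be computed beforehand can be illustrated by writing the same function in two styles. Python has no conditional move, so this only mirrors the shape of the generated code; both function names are hypothetical.

```python
def absdiff_branch(x, y):
    """Branching style: only one of the two differences is computed."""
    if x < y:
        return y - x
    return x - y

def absdiff_cmov(x, y):
    """Conditional-move style: compute both candidates, then select one."""
    if_less = y - x     # value for the x < y case
    otherwise = x - y   # value for the other case
    return if_less if x < y else otherwise
```

The second form is only safe because neither subtraction has a side effect; if one candidate involved, say, dereferencing a possibly null pointer, it could not be computed unconditionally.
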
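Procedure call instructions
- call
  - params: the function name
  - effect: pushes rip (the return address, i.e. the address of the instruction after the call) onto the stack and jumps to the function
- ret
  - params: none
  - effect: pops the address at the top of the stack and jumps to it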
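Floating-point move instructions (SSE/AVX instructions)
- vmovss, vmovsd
  - params (source, destination): X, M / M, X
    - X is a 128-bit XMM register
    - M is a memory address; vmovss moves a single-precision value and vmovsd a double-precision value, so M must refer to 4 or 8 bytes respectively
- vmovaps
  - params (source, destination): X, X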
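Floating-point conversion instructions
- vcvtt(ss2si, sd2si, ss2siq, sd2siq)
  - ss is single precision, sd is double precision, and the q suffix denotes a quad-word (64-bit) integer
  - params (source, destination): X/M, R
  - Converts a floating-point value to an integer and stores it in the destination register; the value is truncated during the conversion
- vcvt(si2ss, si2sd, si2ssq, si2sdq)
  - params (source 1, source 2, destination): M/R, X, X
  - Converts an integer to a floating-point value. The second source only supplies the upper bits of the destination, so it is normally set to the same XMM register as the destination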
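
The effect of these conversions on values can be sketched in Python. The model assumes truncation toward zero for the vcvtt family and round-to-nearest when an integer is converted to single precision; out-of-range inputs are not modeled, and the helper names are invented for this example.

```python
import math
import struct

def cvtt_to_int(value):
    """Float to integer, truncating toward zero like the vcvtt family."""
    return math.trunc(value)

def to_single(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def cvtsi2ss(n):
    """Integer to single precision; large integers may lose low bits."""
    return to_single(float(n))

truncated_up = cvtt_to_int(-2.7)   # -2, not -3
lossy = cvtsi2ss(16777217)         # 2**24 + 1 is not representable as a float
```
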
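Floating-point arithmetic instructions (AVX instructions)
- vaddss, vsubss, vmulss, vdivss, vminss, vmaxss, sqrtss
  - These cover addition, subtraction, multiplication, division, minimum, maximum, and square root
  - All except sqrtss take three operands: S1, S2, D
  - D <- S2 op S1, or D <- sqrt(S1) for sqrtss
- vxorps, vandps
  - Bitwise operations on all 128 bits of the XMM registers; xor is exclusive or, and is bitwise and
  - params: S1, S2, D
- vucomiss, vucomisd
  - Compare floating-point values; like CMP, they set the condition codes based on S2 - S1
  - If either S1 or S2 is NaN, PF (the parity flag) is also set, and the comparison is considered to have failed (the operands are unordered)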
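
The flag settings of the floating-point comparison can be summarized as a small function. The sketch follows the operand order used above (the result is based on S2 compared with S1) and returns only the three flags that matter here; `ucomis_flags` is a hypothetical name.

```python
import math

def ucomis_flags(s1, s2):
    """Flags for `vucomisd S1, S2`, which compares S2 against S1."""
    if math.isnan(s1) or math.isnan(s2):
        return {"CF": 1, "ZF": 1, "PF": 1}  # unordered
    if s2 < s1:
        return {"CF": 1, "ZF": 0, "PF": 0}
    if s2 == s1:
        return {"CF": 0, "ZF": 1, "PF": 0}
    return {"CF": 0, "ZF": 0, "PF": 0}      # s2 > s1

unordered = ucomis_flags(float("nan"), 1.0)
```

Because the ordered outcomes are reported through CF and ZF, code following such a comparison uses the unsigned-style conditions (a, ae, b, be) and checks PF separately when NaN is possible.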
