本文翻译自Pitfalls of TSC usage,作者的网站(http://oliveryang.net)有很多值得一读的文章
1. 用户态中的延迟测量 Latency measurement in user space
在应用程序中开发性能敏感的代码时,开发者通常会测量相关代码的时延数据。这些代码有时会仅用于开发阶段修正错误、测量性能表现等,有些时候则保留在产品中提供性能追踪数据。
While user application developers are working on performance-sensitive code, one common requirement is to do latency/time measurement in their code. This kind of code could be temporary code for debug, test or profiling purposes, or permanent code that provides performance tracing data in production.
Linux内核为用户态程序提供了gettimeofday() 和 clock_gettime()两个高精度的时间测量系统调用。gettimeofday()精度在微秒级别,clock_gettime()在纳秒级别。然而,这两个系统调用本身也会带来额外的性能开销。
Linux kernel provides the gettimeofday() and clock_gettime() system calls for high-resolution time measurement in user applications. gettimeofday() has microsecond resolution, and clock_gettime() has nanosecond resolution. However, the major concern with these system calls is the additional performance cost of the calls themselves.
为了最小化这两个系统调用的性能开销,Linux内核提供了虚拟系统调用(virtual system calls,vsyscalls)和虚拟动态链接共享对象(Virtual Dynamically linked Shared Objects,VDSOs)两种机制避免用户态和内核态的切换开销。在x86平台,这两个系统调用能够借助于vsyscalls内核补丁,避免切换开销,但在其他平台上,它们仍然需要完成常规的系统调用流程。因此这一优化是依赖于硬件实现的。
In order to minimize the perf cost of the gettimeofday() and clock_gettime() system calls, Linux kernel uses the vsyscalls (virtual system calls) and VDSOs (Virtual Dynamically linked Shared Objects) mechanisms to avoid the cost of switching from user to kernel. On x86, gettimeofday() and clock_gettime() get better performance thanks to the vsyscalls kernel patch, by avoiding the context switch from user to kernel space. But some other architectures still need to follow the regular system call code path. This is really a hardware-dependent optimization.
2. 为什么使用TSC Why using TSC?
尽管基于vsyscalls 的 gettimeofday() and clock_gettime()实现会快于常规的系统调用,但它们有时还是不能满足一些延迟测试场景的需要。
Although the vsyscall implementations of gettimeofday() and clock_gettime() are faster than regular system calls, their perf cost is still too high to meet the latency measurement requirements of some perf-sensitive applications.
TSC(time stamp counter,时间戳计数器)是由x86处理器提供的高精度计数器,并且能够简便地通过rdtsc指令读取。在Linux中rdtsc指令可以直接在用户态执行,这意味着应用程序仅用一条指令就能够以快于vsyscall的速度获取到细粒度(纳秒级别)的时间戳。
The TSC (time stamp counter) provided by x86 processors is a high-resolution counter that can be read with a single instruction (RDTSC). On Linux this instruction could be executed from user space directly, that means user applications could use one single instruction to get a fine-grained timestamp (nanosecond level) with a much faster way than vsyscalls.
下面是用户态实现rdtsc接口的典型代码
The following code is a typical implementation of a rdtsc() API in a user-space application,
```c
static uint64_t rdtsc(void)
{
        uint64_t var;
        uint32_t hi, lo;

        __asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));

        var = ((uint64_t)hi << 32) | lo;
        return (var);
}
```
rdtsc的返回值是CPU周期数,通过简单计算就能够转换为纳秒
The result of rdtsc is a CPU cycle count, which can be converted to nanoseconds by a simple calculation.
ns = CPU cycles * (ns_per_sec / CPU freq)
为了获取到更好的结果,Linux内核使用了更复杂的方法。
The Linux kernel uses a more complex method to get better results,
```c
/*
 * Accelerators for sched_clock()
 * convert from cycles(64bits) => nanoseconds (64bits)
 * basic equation:
 *              ns = cycles / (freq / ns_per_sec)
 *              ns = cycles * (ns_per_sec / freq)
 *              ns = cycles * (10^9 / (cpu_khz * 10^3))
 *              ns = cycles * (10^6 / cpu_khz)
 *
 *      Then we use scaling math (suggested by george@mvista.com) to get:
 *              ns = cycles * (10^6 * SC / cpu_khz) / SC
 *              ns = cycles * cyc2ns_scale / SC
 *
 *      And since SC is a constant power of two, we can convert the div
 *  into a shift.
 *
 *  We can use khz divisor instead of mhz to keep a better precision, since
 *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
 *  (mathieu.desnoyers@polymtl.ca)
 *
 *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
 */
```
最终,延迟测量代码可以实现为
Finally, the code of latency measurement could be,
```c
start = rdtsc();

/* 在这里放置要测量的代码 */
/* put code you want to measure here */

end = rdtsc();
cycle = end - start;
latency = cycle_2_ns(cycle);
```
但是上面的实现存在一些问题,同时Linux内核并不鼓励这样使用。其原因在于,TSC机制并不是那么可靠,即使是Linux内核在处理它时也遇到了相当棘手的状况。
In fact, the above rdtsc implementation is problematic, and not encouraged by the Linux kernel. The major reason is that the TSC mechanism is rather unreliable, and even the Linux kernel has had a hard time handling it.
Linux不向用户直接提供rdtsc接口的原因也正是如此。当然,虽然x86平台提供了限制用户态使用rdtsc的功能,Linux也并没有做出这一限制。应用程序完全可以使用上述实现,不过这也意味着开发者必须对一些tsc相关缺陷导致的复杂情况做好准备。
That is why the Linux kernel does not provide a rdtsc API to user applications. However, Linux does not restrict the rdtsc instruction to privileged execution, although x86 supports that setting. That means there is nothing stopping a Linux application from reading the TSC directly with the above implementation, but such applications have to be prepared to handle some strange TSC behaviors due to some known pitfalls.
3. 已知的TSC缺陷 Known TSC pitfalls
3.1 不稳定的TSC硬件实现 TSC unstable hardware
3.1.1 CPU提供的TSC功能 CPU TSC capabilities
Intel CPU有三类TSC行为
Intel CPUs have 3 sort of TSC behaviors,
- 可变TSC Variant TSC
在第一代TSC中,TSC增加的速率会受到CPU频率变化的影响。早期的一些处理器使用了这种设计(从P4开始)
In the first generation of TSC, the TSC increment rate could be impacted by CPU frequency changes. This started with some very old processors (P4).
- 常数TSC Constant TSC
不论CPU频率是否变化,TSC都以常数速率增加。但当CPU进入休眠状态(deep C-state)时,TSC会停止计数。Nehalem之前的CPU使用了常数TSC,不过恒定TSC要优于这一设计
The TSC increments at a constant rate, even if the CPU frequency changes. But the TSC may stop when the CPU runs into a deep C-state. Constant TSC appeared before Nehalem, and is not as good as invariant TSC.
- 恒定TSC Invariant TSC
恒定TSC在CPU的所有状态下都以恒定速率增加。之后的架构都保持了这一行为。Nehalem及之后的CPU使用了恒定TSC。
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. Invariant TSC only appears on Nehalem-and-later Intel processors.
参考Intel 64 Architecture SDM Vol. 3A “17.12.1 Invariant TSC”.
See Intel 64 Architecture SDM Vol. 3A “17.12.1 Invariant TSC”.
根据CPU的不同,Linux提供了TSC相关标志位
Linux defines several CPU feature bits for these CPU differences,
- X86_FEATURE_TSC
CPU支持TSC
The TSC is available in the CPU.
- X86_FEATURE_CONSTANT_TSC
CPU支持常数TSC
The CPU has a constant TSC.
- X86_FEATURE_NONSTOP_TSC
TSC在休眠状态不停止
The TSC does not stop in deep C-states.
同时支持 X86_FEATURE_CONSTANT_TSC 和 X86_FEATURE_NONSTOP_TSC 表示CPU使用恒定TSC。详情可参考这里
The combination of the CONSTANT_TSC and NONSTOP_TSC flags indicates an invariant TSC. Please refer to this kernel patch for the implementation.
如果CPU不支持恒定TSC,在内核启用睿频(turbo boost)、频率调节(speed-step)和电源管理时会使TSC出现问题。
If the CPU has no “Invariant TSC” feature, it might cause TSC problems when the kernel enables P- or C-states: also known as turbo boost, speed-step, or CPU power management features.
举个例子,如果Linux内核没有探测到NONSTOP_TSC标志,在CPU进入省电(休眠)状态时,Intel的驱动(Intel idle driver)会将TSC标记为不稳定。
For example, if the NONSTOP_TSC feature is not detected by the Linux kernel, then when a CPU runs into a deep C-state for power saving, the Intel idle driver will mark the TSC with the unstable flag,
```c
if (((mwait_cstate + 1) > 2) &&
    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
        mark_tsc_unstable("TSC halts in idle"
                          " states deeper than C2");
```
ACPI CPU idle driver 有着类似的处理逻辑。
The ACPI CPU idle driver has the similar logic to check NONSTOP_TSC for deep C-state.
使用下面的命令可以查看CPU提供的标志位
Please use below command on Linux to check CPU capabilities,
$ cat /proc/cpuinfo | grep -E "constant_tsc|nonstop_tsc"
- X86_FEATURE_TSC_RELIABLE
一个合成的标志位,跳过TSC同步检查
A synthetic flag; when it is set, TSC sync checks are skipped.
CPU的特征位只能表明在单处理器系统(UP system)上TSC的稳定性。在对称多处理器系统(SMP system)中,并没有直接确保TSC稳定性的功能。TSC同步性检验是唯一的方法。
CPU feature bits can only indicate TSC stability on a UP system. For an SMP system, there is no explicit way to ensure TSC reliability; the TSC sync test is the only way to verify it.
然而,一些虚拟化方案确实提供了良好的TSC同步机制。为了处理TSC同步性测试的假阳性结果,VMWare为Linux内核增加了一个合成标志位 TSC_RELIABLE,用于跳过同步性测试。其他的一些内核组件也使用了这一标志位跳过TSC同步性测试。下面的命令可以用来查看这个新的标志位。
However, some virtualization solutions do provide a good TSC sync mechanism. In order to handle some false positive test results, VMware created a new synthetic TSC_RELIABLE feature bit in the Linux kernel to bypass TSC sync testing. This flag is also used by other kernel components to bypass TSC sync testing. The command below can be used to check this new synthetic CPU feature,
$ cat /proc/cpuinfo | grep "tsc_reliable"
如果我们能获取到这些标志位,那么我们就能够信赖这个平台上的TSC。但要记住,TSC处理代码中的缺陷仍有可能带来问题。
If this feature bit is set on the CPU, we should be able to trust the TSC source on this platform. But keep in mind, software bugs in TSC handling could still cause problems.
3.1.2 SMP系统中的TSC行为 TSC sync behaviors on SMP system
在UP系统中,多核心的TSC同步行为由处理器自身决定。但在SMP系统中,不同CPU之间的TSC同步是一个大问题。根据处理方式,可将SMP系统分为三类
On a UP system, TSC sync behavior among multiple cores is determined by the CPU's TSC capability. Whereas on an SMP system, TSC sync across multiple CPU sockets can be a big problem. There are 3 types of SMP systems,
- 没有同步机制 No sync mechanism
在早期SMP系统和多核处理器上,TSC并不会进行同步。这意味着如果程序在读取TSC后被OS切换到另一个处理器上,再次读取TSC时有可能会出现“时间回退”的现象。
On most older SMP and early multi-core machines, TSC was not synchronized between processors. Thus if an application were to read the TSC on one processor, then was moved by the OS to another processor, then read TSC again, it might appear that “time went backwards”.
可变TSC和常数TSC在SMP系统中会出现这种问题。
Both “Variant TSC” and “Constant TSC” CPUs on SMP machine have this problem.
- 对同一主板上的CPU进行同步 Sync among multiple CPU sockets on same main-board
在CPU支持恒定TSC后,多数SMP系统都会同步TSC。在引导阶段,所有使用统一RESET信号的CPU都会重置,并以相同速率增加
After CPUs started supporting “Invariant TSC”, most recent SMP systems can make sure the TSC stays synced among multiple CPUs. At boot time, all CPUs connected to the same RESET signal are reset together, and their TSCs increment at the same rate.
- 没有跨机柜、刀片或主板的同步机制 No sync mechanism cross multiple cabinets, blades, or main-boards.
根据主板或计算机制造商的设计,不同的CPU可能会连接到不同的时钟信号上。这种情况下TSC同步无法得到保证
Depending on board and computer manufacturer design, different CPUs from different boards may connect to different clock signal, which has no guarantee of TSC sync.
例如,在SGI UV系统上,不同刀片的TSC并没有同步。SGI还向Linux提供了禁用此类平台上TSC同步的补丁。
For example, on SGI UV systems, the TSC is not synchronized across blades. A patch provided by SGI tries to disable TSC clock source for this kind of platform.
即便CPU提供了恒定TSC功能,SMP系统仍然有可能无法保证TSC的可靠性。鉴于此,Linux不得不依赖于引导阶段或运行时的测试,而不仅仅依靠检测CPU的功能。TSC同步性测试代码曾经还包含了基于write_tsc的TSC值修正代码。实际上,Intel CPU提供了MSR寄存器用于修改TSC值。这也是write_tsc的工作原理。然而,通过同时在SMP中的每个CPU修改TSC来实现同步是十分困难的。因此,目前Linux仅仅会检测TSC同步性而不会修改其值。
Even if a CPU has the “Invariant TSC” feature, the SMP system still may not provide a reliable TSC. For this reason, the Linux kernel has to rely on boot time or runtime testing instead of just detecting CPU capabilities. The TSC sync test code used to fix up TSC values by calling write_tsc. Actually, Intel CPUs provide an MSR register which allows software to change the TSC value; this is how write_tsc works. However, it is difficult to issue per-CPU instructions on multiple CPUs at exactly the same time to bring their TSCs into sync. For this reason, the Linux kernel now just checks TSC sync and does not write to the TSC.
如果在引导阶段通过了同步性测试,下面的文件会指定TSC作为时钟源。
If TSC sync test passed during Linux kernel boot, following sysfs file would export tsc as current clock source,
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
3.1.3 非 Intel 平台 Non-intel platform
对于非 Intel 平台情况会有所不同。目前的Linux系统会将所有非Intel的SMP系统视为非同步TSC系统。具体可参考tsc.c中unsynchronized_tsc相关代码。LKML中有AMD的相关文档。
Non-Intel x86 platforms have a different story. The current Linux kernel treats all non-Intel SMP systems as non-synced TSC systems. See the unsynchronized_tsc code in tsc.c.
LKML also has the AMD documents.
3.1.4 CPU热插拔 CPU hotplug
CPU热插拔功能会增加一个可能未进行TSC同步的新CPU。理论上,系统软件、BIOS或SMM代码都能够完成同步操作。
CPU hotplug introduces a new CPU whose TSC may not be synchronized with the existing CPUs. In theory, system software, BIOS, or SMM code could do the TSC sync for CPU hotplug.
在Linux中,内核会检查TSC同步性,并有可能通过mark_tsc_unstable停止将TSC作为时钟源。Linux内核曾经包含基于write_tsc实现的TSC同步算法,但最近已经移除了相关代码。主要由于现在没有通过同时在SMP中的每个CPU修改TSC来实现同步的可靠机制。
In the Linux kernel CPU hotplug code path, it checks TSC sync and may disable the tsc clocksource by calling mark_tsc_unstable. The Linux kernel used to have a TSC sync algorithm using the write_tsc call, but recent Linux code removed that implementation, because there is no reliable software mechanism to make sure TSC values are exactly the same when issuing instructions to multiple CPUs at exactly the same time.
根据我的理解,因为这个原因,Linux在CPU热插拔时的TSC同步操作取决于硬件/固件行为。
For this reason, per my understanding, Linux TSC sync in the CPU hotplug scenario depends on hardware/firmware behaviors.
3.1.5 不同固件带来的复杂问题 Misc firmware problems
由于TSC是可写的,固件对TSC的修改可能在OS中引发TSC同步问题。
As the TSC value is writeable, firmware code could change the TSC value and cause TSC sync issues in the OS.
这篇LKML讨论中提及到,一些BIOS SMI(System Management Interruption,系统管理中断)处理程序会通过修改TSC值隐藏其执行时间
An LKML discussion mentioned that some BIOS SMI handlers try to hide their execution time by changing the TSC value.
另一个例子与固件的电源管理行为相关。正如我们之前提到的,具有恒定TSC的CPU能够避免状态改变导致的TSC速度变化。但是,一些x86固件实现会在同步时修改TSC值。这种情况下,固件视角的TSC工作正常,但对于软件却并不总是如此。Linux内核需要创建一些补丁处理ACPI挂起/继续操作,以此确保OS的TSC同步机制仍然有效。
Another example is related to firmware behaviors in power management handling. As we mentioned earlier, a CPU with the “Invariant TSC” feature avoids TSC rate changes during deep C-state transitions. However, some x86 firmware implementations change the TSC value in their TSC sync implementation. In this case, TSC sync works from the firmware's perspective, but can look broken from the software's perspective. The Linux kernel had to add a patch in the ACPI suspend/resume path to make sure TSC sync still works in the OS.
3.1.6 不同硬件缺陷引发的问题 Misc hardware errata
TSC同步功能高度依赖于主板制造商的硬件设计,例如时钟源的可靠性问题。我曾经遇到过不可靠时钟源导致的硬件缺陷。由于这一缺陷,Linux内核的TSC同步测试代码(tsc_sync.c中的check_tsc_sync_source)报告了错误信息并禁用了TSC时钟源。
TSC sync functionality is highly dependent on the board manufacturer's design. For example, clock source reliability issues: I once encountered a hardware erratum caused by an unreliable clock source. Due to the erratum, the Linux kernel TSC sync test code (check_tsc_sync_source in tsc_sync.c) reported error messages and disabled TSC as the clock source.
另一篇LKML讨论还提到多处理器系统中温度问题导致的时钟信号漂移,最终可能会导致Linux检测到TSC偏差(warp)问题。
Another LKML discussion also mentioned SMP TSC drift in the clock signals due to temperature problems. This could eventually cause Linux to detect TSC warp problems.
3.1.7 总结 Summary
对于x86平台制造商来说,支持TSC同步并不是一件容易的事。在Linux平台,我们能够通过下面的两个步骤检查平台支持的特性
TSC sync capability is not easy for x86 platform vendors to support. Under Linux, we can use the following two steps to check the platform capabilities,
- 检查CPU的特征位 Check CPU capability flags
例如,检查Linux下的/proc/cpuinfo或使用cpuid指令。具体可参考3.1.1.另外要注意VMWare的特殊标记位
For example, check /proc/cpuinfo under Linux or use the cpuid instruction. Please refer to section 3.1.1. Note that VMware guest VMs have a special flag.
- 检查当前的内核时钟源 Check current kernel clock source
具体请参考 3.1.2.
Please refer to section 3.1.2.
依赖TSC同步可靠性的应用程序可以通过以上两个步骤检查TSC的可靠性。但是,从TSC问题的根源出发,内核并不一定能发现全部的不可靠情形。例如,TSC在运行时也有可能会出现问题。这种情况下,Linux有可能会实时改变时钟源。
User applications that rely on TSC sync may do the above two-step check to confirm whether the TSC is reliable or not. However, per the root causes of TSC problems, the kernel may not be able to detect all unreliable cases. For example, it is still possible that the TSC clock develops a problem at runtime. In this case, Linux may switch the clock source from tsc to another on the fly.
3.2 软件使用TSC时的缺陷 Software TSC usage bugs
3.2.1 处理TSC时的溢出问题 Overflow issues in TSC calculation
rdtsc的返回值是CPU周期数,而使用延迟或时间点时需要的是常规的时间单位,例如纳秒和微秒。
The direct return value of rdtsc is a CPU cycle count, but latency or time requires a regular time unit: ns or us.
根据 Intel 64 Architecture SDM Vol. 3A 2-33,64位的TSC寄存器足够存储CPU周期数
In theory, the 64-bit TSC register is big enough for saving the CPU cycle count, per Intel 64 Architecture SDM Vol. 3A 2-33,
时间戳计数器是一个64位的模型特定(model-specific)计数器,在处理器重置时会被清零。如果不重置,当CPU工作在3GHz时,TSC会以约每年9.5x10^16次的速度增加。在这一时钟频率下,绕回周期超过190年。
这里的溢出问题主要源自cycle_2_ns 和 cycle_2_us 的具体实现:转换时需要将周期数与另一个大数相乘,这一乘法可能会导致溢出。
The time-stamp counter is a model-specific 64-bit counter that is reset to zero each time the processor is reset. If not reset, the counter will increment ~9.5 x 10^16 times per year when the processor is operating at a clock rate of 3GHz. At this clock frequency, it would take over 190 years for the counter to wrap around.
The overflow problem here is in implementations of cycle_2_ns or cycle_2_us, which need to multiply cycles by another big number; this multiplication may cause the overflow problem.
对于当前的Linux,溢出可能在以下场景中出现
Per the current Linux implementation, the overflow bug may happen if,
- Linux连续运行超过208天 The Linux OS has been running for more than 208 days
- kexec特性阻止Linux在重启阶段重置TSC The Linux OS reboot does not cause TSC reset due to kexec feature
- 硬件、固件缺陷使Linux在重启阶段没有重置TSC Some possible Hardware/Firmware bugs that cause no TSC reset during Linux OS reboot
linux在获取调度器时钟时也曾受到溢出的影响
The Linux kernel used to suffer from overflow bugs (patch for v3.2, patch for v3.4) when it tried to use the TSC as the scheduler clock.
为了避免溢出问题,Linux内核的cycle_2_ns变得比前文的情况更加复杂
In order to avoid overflow bugs, cycle_2_ns in the Linux kernel becomes more complex than what we referred to before,
```c
/*
 * ns = cycles * cyc2ns_scale / SC
 *
 * Although we may still have enough bits to store the value of ns,
 * in some cases, we may not have enough bits to store cycles * cyc2ns_scale,
 * leading to an incorrect result.
 *
 * To avoid this, we can decompose 'cycles' into quotient and remainder
 * of division by SC.  Then,
 *
 * ns = (quot * SC + rem) * cyc2ns_scale / SC
 *    = quot * cyc2ns_scale + (rem * cyc2ns_scale) / SC
 *
 *                      - sqazi@google.com
 */
```
一些应用程序可能会使用下面的公式,然而只要TSC计数超过2个小时这个公式就会发生溢出。
Unlike the Linux kernel, some user applications use the formula below, which can overflow once the TSC cycle count covers more than about 2 hours!
ns = cycles * 10^6 / cpu_khz
总而言之,使用rdtsc时要小心处理溢出问题
Anyway, be careful about overflow issues when you use the rdtsc value in a calculation.
3.2.2 CPU频率使用错误 Wrong CPU frequency usage
你可能已经注意到了,Linux在实现中使用了kHZ而不是GHZ/MHZ(具体请参考前面章节cycle_2_ns的实现)。其原因在于使用kHz能提供更好的精度。早期的内核使用MHz,后期的补丁修正了这一问题
You may have already noticed that the Linux kernel uses CPU kHz instead of GHz/MHz in its implementation; please see the cycle_2_ns implementation in the previous section. The major reason for using kHz is better precision. Old kernel code used MHz before, and this patch fixed the issue.
3.2.3 乱序执行 Out of order execution
如果要进行性能测试的代码非常短,最好使用LFENCE 或 RDTSCP 重新实现rdtsc结构。否则,CPU的乱序执行会对测量带来精度问题。
If the code you want to measure is a very small piece of code, the rdtsc function above might need to be re-implemented with LFENCE or RDTSCP. Otherwise, CPU out-of-order execution will introduce precision issues.
根据 Intel 64 Architecture SDM Vol. 2B 的描述
See the description in Intel 64 Architecture SDM Vol. 2B,
RDTSC不是一个顺序执行的指令。CPU并不会等待前面的指令全部执行完再读取TSC值,后续指令同样有可能在读取操作前执行。如果需要在前面指令都执行完后再读取TSC值,开发者可以使用RDTSCP指令(如果CPU支持)或依次使用LFENCE;RDTSC两个指令
The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC.
Linux内核有一个rdtsc_ordered的例子
The Linux kernel has an example of this: the rdtsc_ordered() implementation.
3.3 不同虚拟化平台对TSC的模拟 TSC emulation on different hypervisors
虚拟化技术为虚拟机OS的时间保持带来了很大的挑战。本节只会包含其中的一部分情况。在这些案例中,宿主OS能够检测到TSC时钟源,而运行于虚拟机中的软件依赖于TSC机制,同时会向虚拟CPU(vCPU)发送rdtsc指令。
Virtualization technology causes lots of challenges for guest OS timekeeping. This section only covers the cases where the host can detect the TSC clock source, while guest software might be TSC-sensitive and issue the rdtsc instruction to access the TSC register while running on a vCPU.
与使用物理机的情况相比,虚拟化会为TSC同步带来更大的挑战。例如,虚拟机在硬件或软件层面进行实时迁移时,有可能会引发TSC同步问题:
Compared with physical machines, virtualization introduces more challenges regarding TSC sync. For example, VM live migration may cause TSC sync problems if the source and target hosts differ at the hardware or software level,
- 平台差异(Intel vs AMD,可靠 vs 不可靠)Platform type differences (Intel vs AMD, reliable vs unreliable)
- CPU频率(TSC上升速率)CPU frequency (TSC increase rate)
- CPU引导阶段差异(不同的TSC初值)CPU boot time (TSC initial values)
- 虚拟化软件的版本差异 Hypervisor version differences
因此在不同虚拟化平台上TSC行为的差异有可能引发TSC同步问题。
So the behaviors of TSC sync on different hypervisors could cause the TSC sync problems.
3.3.1 基础的应对方法 Basic approaches
针对虚拟化软件的差别,可以采用下面的方法解决rdtsc和TSC同步的问题
Per hypervisors differences, the rdtsc instruction and TSC sync could be addressed with following approaches,
- 使用原生指令或直通方法 - 快速但有潜在错误 Native or pass-through - fast but potentially incorrect
虚拟化软件不对rdtsc进行模拟,rdtsc指令直接在物理CPU上执行。这种模式有着最佳的性能表现,但有可能给虚拟机中TSC敏感的软件带来同步问题。
No emulation by hypervisor. The instruction is directly executed on physical CPUs. This mode has faster performance but may cause the TSC sync problems to TSC sensitive applications in guest OS.
特别是在VM有可能会实时迁移到另一个物理机上的情况。现在并没有可行或合理的方法在不同机器上同步TSC值。
Especially, a VM could be live migrated to a different machine. It is not possible or reasonable to ensure TSC values stay synced among different machines.
- 模拟或使用陷入 - 正确但是慢 Emulated or Trap - correct but slow
- 完全虚拟化 Full virtualization
虚拟化软件会模拟TSC,相应的,rdtsc也不会在物理CPU上执行。这种模式会影响rdtsc的性能,但是会提供更好的TSC稳定性。Intel和AMD都支持为rdtsc模拟进行硬件加速的VMX和SVM
The hypervisor emulates the TSC, so rdtsc is not directly executed on physical CPUs. This mode degrades rdtsc performance, but provides reliability for TSC-sensitive applications. Intel and AMD CPUs support VMX and SVM, which allow hardware acceleration of rdtsc emulation.
- 半虚拟化 Para virtualization
为了优化rdtsc的性能,一些虚拟化软件提供了PVRDTSCP指令,使用半虚拟化改善性能。不过如果软件在虚拟机中直接使用rdtsc,这种方案就无法奏效了。
In order to optimize rdtsc performance, some hypervisors provide PVRDTSCP, which allows software in the VM to be paravirtualized (modified) for better performance. If user applications in the VM issue the rdtsc instruction directly, the paravirtualization solution cannot work.
- 混合 - 正确但有潜在性能问题 Hybrid - correct but potentially slow
混合算法会根据具体情况确保正确性
A hybrid algorithm ensures correctness based on the following factors,
- 对正确性的需要 The requirement of correctness
- 硬件提供的TSC能力 Hardware TSC capabilities
- 特殊的VM使用场景:保存/重置/迁移 Some special VM use case scenarios: VM is saved/restored/migrated
当使用物理机原生执行能够保证好的性能和正确性时,混合算法会采用原生执行。如果不能达到要求,混合算法就会采用全虚拟化或半虚拟化确保正确性
When native execution can deliver both good performance and correctness, rdtsc runs natively without emulation. If the hypervisor cannot use the native way, it uses full or para virtualization to ensure correctness.
3.3.2 一些虚拟化软件的具体实现 Implementations on various hypervisors
下面是一些虚拟化软件对TSC的具体支持信息
Below is detailed information about the TSC support on various hypervisors,
- VMware
ESX 4.x 和 3.x 并没有对虚拟CPU之间的TSC进行同步,但从 ESX 5.X开始,虚拟化软件会始终保持虚拟CPU之间的TSC同步。VMWare使用混合算法确保各种情况下的TSC同步性,包括当底层硬件不支持TSC同步时。当硬件有良好的TSC同步支持时,rdtsc指令模拟有良好的性能表现。如果不支持,这种模拟就会有较差的性能表现。
ESX 4.x and 3.x do not keep the TSC synced between vCPUs. But since ESX 5.x, the hypervisor always keeps the vCPU TSCs synced. VMware uses the hybrid algorithm to make sure the TSC stays synced even if the underlying hardware does not support TSC sync. For hardware with good TSC sync support, rdtsc emulation gets good performance; when the hardware cannot provide TSC sync support, TSC emulation is slower.
不过,VMWare的TSC模拟不能确保消除CPU间TSC的偏斜倾向。因此,Linux引导阶段的TSC同步性测试有可能会失败。
However, VMware TSC emulation cannot ensure that no marginal TSC skew happens between CPUs. For this reason, the Linux boot-time TSC sync check may fail.
正因如此,VMWare为Linux增加了一个标志位 TSC_RELIABLE 来跳过TSC同步性测试,Linux中的相关代码也针对这一问题提供了详细的注释
For this reason, in a Linux guest, VMware created a new synthetic TSC_RELIABLE feature bit to bypass Linux TSC sync testing. The VMware CPU detection code in Linux gives good comments about the TSC sync testing issues,
这个补丁合并进了 Linux 2.6.29
The patch got merged in Linux 2.6.29.
VMWare还提供了Timekeeping in VMware Virtual Machines用于讨论TSC模拟的相关问题。参考这个文档可以获得更详细的内容。
VMware also provides Timekeeping in VMware Virtual Machines to discuss TSC emulation issues. Please refer to this document for detailed information.

```c
/*
 * VMware hypervisor takes care of exporting a reliable TSC to the guest.
 * Still, due to timing difference when running on virtual cpus, the TSC can
 * be marked as unstable in some cases. For example, the TSC sync check at
 * bootup can fail due to a marginal offset between vcpus TSCs (though the
 * TSCs do not drift from each other). Also, the ACPI PM timer clocksource
 * is not suitable as a watchdog when running on a hypervisor because the
 * kernel may miss a wrap of the counter if the vcpu is descheduled for a
 * long time. To skip these checks at runtime we set these capability bits,
 * so that the kernel could just trust the hypervisor with providing a
 * reliable virtual TSC that is suitable for timekeeping.
 */
static void vmware_set_cpu_features(struct cpuinfo_x86 *c)
{
	set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
	set_cpu_cap(c, X86_FEATURE_TSC_RELIABLE);
}
```
- Hyper-V
Hyper-V没有提供TSC模拟,因而Hyper-V中的TSC是不可靠的。但问题是,Hyper-V中的Linux CPU驱动并不会报告这一问题。这意味着如果偶然通过了内核的TSC同步性测试,Linux仍然会使用TSC作为时钟源。在20天前,4.3-rc1内核补丁才禁用了在Hyper-V中使用TSC作为时钟源。
Hyper-V does not provide TSC emulation. For this reason, the TSC on Hyper-V is not reliable. But the problem is, the Hyper-V Linux CPU driver never reported the problem, which means the TSC clock source could still be used if it happened to pass the Linux kernel TSC sync test. Just 20 days ago, a Linux kernel 4.3-rc1 patch disabled the TSC clock source on Hyper-V Linux guests.
- KVM
在最新的Linux虚拟机OS中,KVM默认采用kvmclock驱动,其特性包括
On the latest Linux guest OS, KVM uses the kvmclock driver by default. Its characteristics are,
- 在可能的情况下尝试完美同步 Try for perfect synchronization where possible
- 使用TSC稳定技术 Use TSC stabilization techniques
- 没有频率补偿 No frequency compensation
- 没有TSC陷入,用户态的TSC是不完美的 No TSC trapping, user space rdtsc is imperfect
- 映射pvclock,并在VDSO/vsyscall的gettimeofday()中运行kvmclock Map pvclock and run kvmclock from VDSO/vsyscalls for gettimeofday()
kvmclock的不足在于,引入kvmclock并没有阻止程序使用rdtsc指令,也不能修复程序直接在用户态使用rdtsc带来的问题。事实上,解决这个问题的唯一办法就是TSC模拟/陷入。
The drawback of kvmclock is that user-space TSC reads still have the problem: user-space rdtsc cannot be fixed by kvmclock, and the only way to fix it is TSC emulation/trapping.
对于带有kvmclock驱动的早期系统,KVM虚拟机看起来会直接原生执行rdtsc
For the legacy OS with kvmclock driver, KVM guest seemed to run rdtsc natively without emulation.
然而,我并没看到KVM支持TSC陷入的代码。我仅仅找到了一个提供TSC陷入支持的早期补丁,而且这个补丁看起来没有合并到Linux的主干上。
However, I did not see that the KVM code supports TSC trapping. I only found an old kernel patch to support TSC trapping, but I have not found that it got merged into the Linux mainline.
基于上述原因,我认为对rdtsc非常敏感的程序在KVM中运行存在问题。KVM的文档包含了时间保持的相关内容,但并没有更多的实现细节。
For the above reasons, I think an rdtsc-sensitive application running in a KVM Linux guest would be problematic. KVM actually has kernel documentation about timekeeping, but the document does not have enough information about the KVM implementation.
- Xen
在4.0之前,Xen仅支持原生模式。Xen 4.0 提供了tsc_mode参数和4种模式,用于让管理员根据需要指定使用哪种模式。默认情况下Xen 4.0 使用混合模式。Xen的文档有关于TSC模拟的详细讨论。
Prior to Xen 4.0, it only supported native mode. Xen 4.0 provides the tsc_mode parameter, which allows administrators to switch between 4 modes per their requirements. By default, Xen 4.0 uses the hybrid mode. This Xen document gives a very detailed discussion about TSC emulation.
3.3.3 总结 Summary
VMWare和Xen看起来为TSC同步提供了最佳的解决方案。KVM的PV模拟方案没有解决在用户态直接使用rdtsc的问题。Hyper-V则没有提供TSC模拟的解决方案。所有的这些TSC同步方案都只能作为在Linux中继续使用TSC时钟源的权宜方案。在一些虚拟化软件提供的TSC同步中仍能够观察到细微的TSC偏斜,因而应用程序还是有可能在使用TSC测量时间时出现误差。
VMware and Xen seem to provide the best solutions for TSC sync. The KVM PV emulation never addresses the user-space rdtsc use case, and Hyper-V has no TSC sync solution. All these TSC sync solutions just keep the Linux kernel TSC clocksource working. A tiny TSC skew may still be observed in a VM even when TSC sync is supported by the hypervisor; thus applications may still get a wrong TSC duration for time measurement.
4. 结论 Conclusion
Linux内核会检测TSC同步问题,试图做到“TSC中性”(TSC-resilient)。但主要的问题在于用户直接使用rdtsc指令。目前并没有在用户态进行TSC同步的可靠机制,特别是在虚拟化环境中。下面是具体的实践建议:
The Linux kernel can detect TSC sync problems and tries to be “TSC-resilient”. The major problem is in user applications: there is no reliable TSC sync mechanism for user applications, especially in a virtualization environment. Here are the suggestions,
- 只要可能,就不要在用户态使用rdtsc指令 If possible, avoid using rdtsc in user applications.
并不是所有的硬件的TSC都是可靠的,这意味着TSC值可能不正确
Not all hardware and hypervisors are TSC-safe, which means the TSC may behave incorrectly.
使用TSC可能会带来不同x86平台或虚拟化平台上的迁移问题。借助于系统调用或虚拟系统调用能够改善软件的迁移性,特别是对于虚拟化平台来说
TSC usage may cause porting bugs across various x86 platforms or hypervisors. Leveraging syscalls or vsyscalls makes software portable, especially in virtualization environments.
- 如果你不得不使用,请让你的软件做到“TSC中性” If you have to use it, please make your application “TSC-resilient”.
在调试而不是生产或产品中使用rdtsc
Use it for debugging, but never use rdtsc in functional areas.
正如前文所述,Linux内核直到今天都在艰难地处理TSC带来的问题。如果可能的话,首先学习Linux的相关代码。性能测量和调试功能可能是仅有的适用场景,即便如此,也要做好处理各种边界情况和移植问题的准备。
As we mentioned above, the Linux kernel has had a hard time handling it even until today. If possible, learn from the Linux code first. Performance measurement and debug facilities might be the only suitable use cases, but be prepared to handle various corner cases and software porting problems.
一定要理解硬件、OS内核、虚拟化软件所带来的风险。编写“TSC中性”的代码,这能够让你的代码在TSC出错的情况下正常地工作
Understand the risks from the hardware, OS kernel, and hypervisors. Write a “TSC-resilient” application, which makes sure your application can still behave correctly when the TSC value is wrong.
