5.3. 剖析

以下各节的脚本展示了如何通过监控函数调用来剖析(profile)内核活动。

统计函数调用次数

本节展示如何统计30秒内某个内核函数调用次数。通过使用通配符,你可以用这个脚本同时统计多个内核函数。

functioncallcount.stp

  1. #! /usr/bin/env stap
  2. # The following line command will probe all the functions
  3. # in kernel's memory management code:
  4. #
  5. # stap functioncallcount.stp "*@mm/*.c"
  6. probe kernel.function(@1).call { # probe functions listed on commandline
  7. called[ppfunc()] <<< 1 # add a count efficiently
  8. }
  9. global called
  10. probe end {
  11. foreach (fn in called-) # Sort by call count (in decreasing order)
  12. # (fn+ in called) # Sort by function name
  13. printf("%s %d\n", fn, @count(called[fn]))
  14. exit()
  15. }

functioncallcount.stp接受内核函数名作为参数。你可以使用通配符,这样就能同时监控多个内核函数。 它的输出包括调用者的名字和取样时间内调用次数。下面是stap functioncallcount.stp "*@mm/*.c"的输出片段:

  1. [...]
  2. __vma_link 97
  3. __vma_link_file 66
  4. __vma_link_list 97
  5. __vma_link_rb 97
  6. __xchg 103
  7. add_page_to_active_list 102
  8. add_page_to_inactive_list 19
  9. add_to_page_cache 19
  10. add_to_page_cache_lru 7
  11. all_vm_events 6
  12. alloc_pages_node 4630
  13. alloc_slabmgmt 67
  14. anon_vma_alloc 62
  15. anon_vma_free 62
  16. anon_vma_lock 66
  17. anon_vma_prepare 98
  18. anon_vma_unlink 97
  19. anon_vma_unlock 66
  20. arch_get_unmapped_area_topdown 94
  21. arch_get_unmapped_exec_area 3
  22. arch_unmap_area_topdown 97
  23. atomic_add 2
  24. atomic_add_negative 97
  25. atomic_dec_and_test 5153
  26. atomic_inc 470
  27. atomic_inc_and_test 1
  28. [...]

追踪函数调用链

本节展示如何追踪函数调用链。

para-callgraph.stp

  1. #! /usr/bin/env stap
  2. function trace(entry_p, extra) {
  3. %( $# > 1 %? if (tid() in trace) %)
  4. printf("%s%s%s %s\n",
  5. thread_indent (entry_p),
  6. (entry_p>0?"->":"<-"),
  7. ppfunc (),
  8. extra)
  9. }
  10. %( $# > 1 %?
  11. global trace
  12. probe $2.call {
  13. trace[tid()] = 1
  14. }
  15. probe $2.return {
  16. delete trace[tid()]
  17. }
  18. %)
  19. probe $1.call { trace(1, $$parms) }
  20. probe $1.return { trace(-1, $$return) }

para-callgraph.stp接受两个命令行参数:

  1. 想要跟踪的函数($1
  2. 可选的触发函数。该函数可以在线程范围内启动/停止追踪。只要触发函数不退出,追踪就不会结束。 para-callgraph.stp使用了thread_indent();此外它的输出包括了时间戳、进程名,和$1所在的线程ID。关于thread_indent()的更多信息,请参考。。。。 (译注:这个脚本的编码风格小朋友们可不要学。前两个探针里的trace是数组,后两个探针里的trace是函数。另外$#表示参数的个数,写过shell的都明白。%( $# > 1 %? if (tid() in trace) %)是一个预处理三元表达式,见langref中的“5.8.1 Conditions”)

下面是stap para-callgraph.stp 'kernel.function("*@fs/*.c")' 'kernel.function("sys_read")'的输出片段:

  1. [...]
  2. 267 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
  3. 269 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
  4. 0 gnome-terminal(2921):->fput file=0xffff880111eebbc0
  5. 2 gnome-terminal(2921):<-fput
  6. 0 gnome-terminal(2921):->fget_light fd=0x3 fput_needed=0xffff88010544df54
  7. 3 gnome-terminal(2921):<-fget_light return=0xffff8801116ce980
  8. 0 gnome-terminal(2921):->vfs_read file=0xffff8801116ce980 buf=0xc86504 count=0x1000 pos=0xffff88010544df48
  9. 4 gnome-terminal(2921): ->rw_verify_area read_write=0x0 file=0xffff8801116ce980 ppos=0xffff88010544df48 count=0x1000
  10. 7 gnome-terminal(2921): <-rw_verify_area return=0x1000
  11. 12 gnome-terminal(2921): ->do_sync_read filp=0xffff8801116ce980 buf=0xc86504 len=0x1000 ppos=0xffff88010544df48
  12. 15 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
  13. 18 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
  14. 0 gnome-terminal(2921):->fput file=0xffff8801116ce980

统计给定线程在内核空间和用户空间上的耗时

本节展示如何统计给定线程花费在内核空间或用户空间上的运行时间。

thread-times.stp

  1. #! /usr/bin/env stap
  2. probe perf.sw.cpu_clock!, timer.profile {
  3. // NB: To avoid contention on SMP machines, no global scalars/arrays used,
  4. // only contention-free statistics aggregates.
  5. tid=tid(); e=execname()
  6. if (!user_mode())
  7. kticks[e,tid] <<< 1
  8. else
  9. uticks[e,tid] <<< 1
  10. ticks <<< 1
  11. tids[e,tid] <<< 1
  12. }
  13. global uticks%, kticks%, ticks
  14. global tids%
  15. probe timer.s(5), end {
  16. allticks = @count(ticks)
  17. printf ("%16s %5s %7s %7s (of %d ticks)\n",
  18. "comm", "tid", "%user", "%kernel", allticks)
  19. foreach ([e,tid] in tids- limit 20) {
  20. uscaled = @count(uticks[e,tid])*10000/allticks
  21. kscaled = @count(kticks[e,tid])*10000/allticks
  22. printf ("%16s %5d %3d.%02d%% %3d.%02d%%\n",
  23. e, tid, uscaled/100, uscaled%100, kscaled/100, kscaled%100)
  24. }
  25. printf("\n")
  26. delete uticks
  27. delete kticks
  28. delete ticks
  29. delete tids
  30. }

thread-time.stp列出5秒内花费CPU时间最多的20个进程,和这段时间CPU滴答(ticks)的总数。脚本的输出还包括每个进程CPU占用百分比,分别按内核空间和用户空间列出。 下面就是它的输出:

  1. tid %user %kernel (of 20002 ticks)
  2. 0 0.00% 87.88%
  3. 32169 5.24% 0.03%
  4. 9815 3.33% 0.36%
  5. 9859 0.95% 0.00%
  6. 3611 0.56% 0.12%
  7. 9861 0.62% 0.01%
  8. 11106 0.37% 0.02%
  9. 32167 0.08% 0.08%
  10. 3897 0.01% 0.08%
  11. 3800 0.03% 0.00%
  12. 2886 0.02% 0.00%
  13. 3243 0.00% 0.01%
  14. 3862 0.01% 0.00%
  15. 3782 0.00% 0.00%
  16. 21767 0.00% 0.00%
  17. 2522 0.00% 0.00%
  18. 3883 0.00% 0.00%
  19. 3775 0.00% 0.00%
  20. 3943 0.00% 0.00%
  21. 3873 0.00% 0.00%

监控应用轮询情况

本节展示如何监控应用的轮询(polling)情况。这将允许你跟踪多余的或过度的轮询,帮助锁定CPU使用或能源消耗需要改善的地方。

timeout.stp

  1. #! /usr/bin/env stap
  2. # Copyright (C) 2009 Red Hat, Inc.
  3. # Written by Ulrich Drepper <drepper@redhat.com>
  4. # Modified by William Cohen <wcohen@redhat.com>
  5. global process, timeout_count, to
  6. global poll_timeout, epoll_timeout, select_timeout, itimer_timeout
  7. global nanosleep_timeout, futex_timeout, signal_timeout
  8. probe syscall.poll, syscall.epoll_wait {
  9. if (timeout) to[pid()]=timeout
  10. }
  11. probe syscall.poll.return {
  12. if ($return == 0 && to[pid()] > 0 ) {
  13. poll_timeout[pid()]++
  14. timeout_count[pid()]++
  15. process[pid()] = execname()
  16. delete to[pid()]
  17. }
  18. }
  19. probe syscall.epoll_wait.return {
  20. if ($return == 0 && to[pid()] > 0 ) {
  21. epoll_timeout[pid()]++
  22. timeout_count[pid()]++
  23. process[p] = execname()
  24. delete to[pid()]
  25. }
  26. }
  27. probe syscall.select.return {
  28. if ($return == 0) {
  29. select_timeout[pid()]++
  30. timeout_count[pid()]++
  31. process[pid()] = execname()
  32. }
  33. }
  34. probe syscall.futex.return {
  35. if (errno_str($return) == "ETIMEDOUT") {
  36. futex_timeout[pid()]++
  37. timeout_count[pid()]++
  38. process[pid()] = execname()
  39. }
  40. }
  41. probe syscall.nanosleep.return {
  42. if ($return == 0) {
  43. nanosleep_timeout[pid()]++
  44. timeout_count[pid()]++
  45. process[pid()] = execname()
  46. }
  47. }
  48. probe kernel.function("it_real_fn") {
  49. itimer_timeout[pid()]++
  50. timeout_count[pid()]++
  51. process[pid()] = execname()
  52. }
  53. probe syscall.rt_sigtimedwait.return {
  54. if (errno_str($return) == "EAGAIN") {
  55. signal_timeout[pid()]++
  56. timeout_count[pid()]++
  57. process[pid()] = execname()
  58. }
  59. }
  60. probe syscall.exit {
  61. if (pid() in process) {
  62. delete process[pid()]
  63. delete timeout_count[pid()]
  64. delete poll_timeout[pid()]
  65. delete epoll_timeout[pid()]
  66. delete select_timeout[pid()]
  67. delete itimer_timeout[pid()]
  68. delete futex_timeout[pid()]
  69. delete nanosleep_timeout[pid()]
  70. delete signal_timeout[pid()]
  71. }
  72. }
  73. probe timer.s(1) {
  74. ansi_clear_screen()
  75. printf (" pid | poll select epoll itimer futex nanosle signal| process\n")
  76. foreach (p in timeout_count- limit 20) {
  77. printf ("%5d |%7d %7d %7d %7d %7d %7d %7d| %-.38s\n", p,
  78. poll_timeout[p], select_timeout[p],
  79. epoll_timeout[p], itimer_timeout[p],
  80. futex_timeout[p], nanosleep_timeout[p],
  81. signal_timeout[p], process[p])
  82. }
  83. }

timeout.stp跟踪下列系统调用,并仅当因为超时而退出该调用时记录次数:

  • poll
  • select
  • epoll
  • itimer
  • futex
  • nanosleep
  • signal
  1. uid | poll select epoll itimer futex nanosle signal| process
  2. 28937 | 148793 0 0 4727 37288 0 0| firefox
  3. 22945 | 0 56949 0 1 0 0 0| scim-bridge
  4. 0 | 0 0 0 36414 0 0 0| swapper
  5. 4275 | 23140 0 0 1 0 0 0| mixer_applet2
  6. 4191 | 0 14405 0 0 0 0 0| scim-launcher
  7. 22941 | 7908 1 0 62 0 0 0| gnome-terminal
  8. 4261 | 0 0 0 2 0 7622 0| escd
  9. 3695 | 0 0 0 0 0 7622 0| gdm-binary
  10. 3483 | 0 7206 0 0 0 0 0| dhcdbd
  11. 4189 | 6916 0 0 2 0 0 0| scim-panel-gtk
  12. 1863 | 5767 0 0 0 0 0 0| iscsid
  13. 2562 | 0 2881 0 1 0 1438 0| pcscd
  14. 4257 | 4255 0 0 1 0 0 0| gnome-power-man
  15. 4278 | 3876 0 0 60 0 0 0| multiload-apple
  16. 4083 | 0 1331 0 1728 0 0 0| Xorg
  17. 3921 | 1603 0 0 0 0 0 0| gam_server
  18. 4248 | 1591 0 0 0 0 0 0| nm-applet
  19. 3165 | 0 1441 0 0 0 0 0| xterm
  20. 29548 | 0 1440 0 0 0 0 0| httpd
  21. 1862 | 0 0 0 0 0 1438 0| iscsid

你可以通过修改最后一个探针(timer.s(1))来增大取样时间。timeout.stp的输出包括前20个轮询应用的名字和UID,连带每个应用调用每种轮询系统调用的累计次数。在上面的输出片段中,由于某个插件模块,firefox进行了过度的轮询。

监控最常调用的系统调用

上一节的timeout.stp通过监控系统调用的某个子集,帮助你找到过度轮询的应用。同样,如果你怀疑某些系统调用被过度地调用了,通过类似的监控,你也能把它们找出来。下面就通过topsys.stp实现这一点:

topsys.stp

  1. #! /usr/bin/env stap
  2. #
  3. # This script continuously lists the top 20 systemcalls in the interval
  4. # 5 seconds
  5. #
  6. global syscalls_count
  7. probe syscall.* {
  8. syscalls_count[name] <<< 1
  9. }
  10. function print_systop () {
  11. printf ("%25s %10s\n", "SYSCALL", "COUNT")
  12. foreach (syscall in syscalls_count- limit 20) {
  13. printf("%25s %10d\n", syscall, @count(syscalls_count[syscall]))
  14. }
  15. delete syscalls_count
  16. }
  17. probe timer.s(5) {
  18. print_systop ()
  19. printf("--------------------------------------------------------------\n")
  20. }

topsys.stp每5秒列出调用最多的20个系统调用。它也列出了这段时间内每个系统调用被调用的次数。下面是它的一个输出:

  1. --------------------------------------------------------------
  2. SYSCALL COUNT
  3. gettimeofday 1857
  4. read 1821
  5. ioctl 1568
  6. poll 1033
  7. close 638
  8. open 503
  9. select 455
  10. write 391
  11. writev 335
  12. futex 303
  13. recvmsg 251
  14. socket 137
  15. clock_gettime 124
  16. rt_sigprocmask 121
  17. sendto 120
  18. setitimer 106
  19. stat 90
  20. time 81
  21. sigreturn 72
  22. fstat 66
  23. --------------------------------------------------------------

找出每个进程的系统调用量

本节展示如何找出调用系统调用最多的进程。在上一节,我们谈到了如何找出调用最多的系统调用。而在上上节,我们也谈到了如何找出轮询最多的进程。通过监控每个进程的调用系统调用的次数,可以在调查轮询进程和其他滥用资源者时提供更多的数据。

syscalls_by_proc.stp

  1. #! /usr/bin/env stap
  2. # Copyright (C) 2006 IBM Corp.
  3. #
  4. # This file is part of systemtap, and is free software. You can
  5. # redistribute it and/or modify it under the terms of the GNU General
  6. # Public License (GPL); either version 2, or (at your option) any
  7. # later version.
  8. #
  9. # Print the system call count by process name in descending order.
  10. #
  11. global syscalls
  12. probe begin {
  13. print ("Collecting data... Type Ctrl-C to exit and display results\n")
  14. }
  15. probe nd_syscall.* {
  16. syscalls[execname()]++
  17. }
  18. probe end {
  19. printf ("%-10s %-s\n", "#SysCalls", "Process Name")
  20. foreach (proc in syscalls-)
  21. printf("%-10d %-s\n", syscalls[proc], proc)
  22. }

syscalls_by_proc.stp列出调用系统调用最多的20个进程。它也列出了这段时间内每个进程调用系统调用的数量。下面是它的输出:

  1. Collecting data... Type Ctrl-C to exit and display results
  2. #SysCalls Process Name
  3. 1577 multiload-apple
  4. 692 synergyc
  5. 408 pcscd
  6. 376 mixer_applet2
  7. 299 gnome-terminal
  8. 293 Xorg
  9. 206 scim-panel-gtk
  10. 95 gnome-power-man
  11. 90 artsd
  12. 85 dhcdbd
  13. 84 scim-bridge
  14. 78 gnome-screensav
  15. 66 scim-launcher
  16. [...]

要想输出进程PID而非进程名,改用下面的脚本:

syscalls_by_pid.stp

  1. #! /usr/bin/env stap
  2. # Copyright (C) 2006 IBM Corp.
  3. #
  4. # This file is part of systemtap, and is free software. You can
  5. # redistribute it and/or modify it under the terms of the GNU General
  6. # Public License (GPL); either version 2, or (at your option) any
  7. # later version.
  8. #
  9. # Print the system call count by process ID in descending order.
  10. #
  11. global syscalls
  12. probe begin {
  13. print ("Collecting data... Type Ctrl-C to exit and display results\n")
  14. }
  15. probe nd_syscall.* {
  16. syscalls[pid()]++
  17. }
  18. probe end {
  19. printf ("%-10s %-s\n", "#SysCalls", "PID")
  20. foreach (pid in syscalls-)
  21. printf("%-10d %-d\n", syscalls[pid], pid)
  22. }

正如在输出中提到的,你需要手动Ctrl-C退出程序来显示结果。你可以简单地添加一个timer.s()探针,让脚本在给定时间后自动退出。举个例子,要想让脚本5秒后退出,往里面添加下面的探针:

  1. probe timer.s(5)
  2. {
  3. exit()
  4. }