5.2. Disk

The scripts in the following sections show how to monitor disk and I/O activity.

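Summarizing Disk Read/Write Traffic

This section shows how to identify which processes perform the heaviest reads from and writes to disk.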

disktop.stp

  #!/usr/bin/env stap
  #
  # Copyright (C) 2007 Oracle Corp.
  #
  # Get the status of reading/writing disk every 5 seconds,
  # output top ten entries
  #
  # This is free software,GNU General Public License (GPL);
  # either version 2, or (at your option) any later version.
  #
  # Usage:
  # ./disktop.stp
  #
  global io_stat,device
  global read_bytes,write_bytes

  probe vfs.read.return {
    if ($return>0) {
      if (devname!="N/A") {/*skip read from cache*/
        io_stat[pid(),execname(),uid(),ppid(),"R"] += $return
        device[pid(),execname(),uid(),ppid(),"R"] = devname
        read_bytes += $return
      }
    }
  }

  probe vfs.write.return {
    if ($return>0) {
      if (devname!="N/A") { /*skip update cache*/
        io_stat[pid(),execname(),uid(),ppid(),"W"] += $return
        device[pid(),execname(),uid(),ppid(),"W"] = devname
        write_bytes += $return
      }
    }
  }

  probe timer.ms(5000) {
    /* skip non-read/write disk */
    if (read_bytes+write_bytes) {
      printf("\n%-25s, %-8s%4dKb/sec, %-7s%6dKb, %-7s%6dKb\n\n",
        ctime(gettimeofday_s()),
        "Average:", ((read_bytes+write_bytes)/1024)/5,
        "Read:",read_bytes/1024,
        "Write:",write_bytes/1024)
      /* print header */
      printf("%8s %8s %8s %25s %8s %4s %12s\n",
        "UID","PID","PPID","CMD","DEVICE","T","BYTES")
    }
    /* print top ten I/O */
    foreach ([process,cmd,userid,parent,action] in io_stat- limit 10)
      printf("%8d %8d %8d %25s %8s %4s %12d\n",
        userid,process,parent,cmd,
        device[process,cmd,userid,parent,action],
        action,io_stat[process,cmd,userid,parent,action])
    /* clear data */
    delete io_stat
    delete device
    read_bytes = 0
    write_bytes = 0
  }

  probe end{
    delete io_stat
    delete device
    delete read_bytes
    delete write_bytes
  }

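disktop.stp outputs the ten processes responsible for the heaviest reads from or writes to disk. For each process it reports the following data:

  • UID - the user ID of the process owner.
  • PID - the ID of the process.
  • PPID - the process ID of the process's parent.
  • CMD - the name of the process.
  • DEVICE - the name of the device the process reads from or writes to.
  • T - the type of action performed by the process; W is a write and R is a read.
  • BYTES - the amount of data read or written, in bytes.

disktop.stp prints the current time by passing the result of gettimeofday_s() to ctime(). gettimeofday_s() returns the number of seconds elapsed since the epoch (1 January 1970), and ctime() converts that count into a human-readable timestamp.

In this script, $return is a local variable that holds the number of bytes each read or write transferred through the virtual file system. $return can only be used in return probes (here, vfs.read.return and vfs.write.return).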

The following is sample output from the script in this section:

  [...]

  Mon Sep 29 03:38:28 2008 , Average: 19Kb/sec, Read: 7Kb, Write: 89Kb

       UID      PID     PPID                       CMD   DEVICE    T        BYTES
         0    26319    26294                   firefox     sda5    W        90229
         0     2758     2757           pam_timestamp_c     sda5    R         8064
         0     2885        1                     cupsd     sda5    W         1678

  Mon Sep 29 03:38:38 2008 , Average: 1Kb/sec, Read: 7Kb, Write: 1Kb

       UID      PID     PPID                       CMD   DEVICE    T        BYTES
         0     2758     2757           pam_timestamp_c     sda5    R         8064
         0     2885        1                     cupsd     sda5    W         1678
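The summary line can be checked by hand against the per-process rows. The following Python sketch is not part of disktop.stp; it replays the integer arithmetic of the timer probe on the three rows of the first sample interval, assuming those rows were the only traffic in that interval. The names rows, summarize, and top_entries are illustrative only.

```python
# Rows from the first sample interval above: (command, action, bytes).
rows = [
    ("firefox", "W", 90229),
    ("pam_timestamp_c", "R", 8064),
    ("cupsd", "W", 1678),
]

INTERVAL_S = 5  # disktop.stp reports every 5 seconds (timer.ms(5000))


def summarize(rows, interval_s=INTERVAL_S):
    """Mirror the integer arithmetic in disktop.stp's timer probe."""
    read_bytes = sum(b for _, action, b in rows if action == "R")
    write_bytes = sum(b for _, action, b in rows if action == "W")
    average = ((read_bytes + write_bytes) // 1024) // interval_s
    return average, read_bytes // 1024, write_bytes // 1024


def top_entries(rows, limit=10):
    """Mirror 'foreach (... in io_stat- limit 10)': largest byte counts first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]


average, read_kb, write_kb = summarize(rows)
print(f"Average: {average}Kb/sec, Read: {read_kb}Kb, Write: {write_kb}Kb")
for command, action, nbytes in top_entries(rows):
    print(f"{command:>25} {action:>4} {nbytes:>12}")
```

Because every division truncates, the read and write figures (7 and 89) do not add up to five times the average (19); the average is computed from the combined byte count before rounding.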

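Tracking I/O Time for Each File Read or Write

This section shows how to monitor the amount of time each process spends reading from or writing to any file. This helps you find files that take too long to load on a system.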

iotime.stp

  #! /usr/bin/env stap
  /*
   * Copyright (C) 2006-2007 Red Hat Inc.
   *
   * This copyrighted material is made available to anyone wishing to use,
   * modify, copy, or redistribute it subject to the terms and conditions
   * of the GNU General Public License v.2.
   *
   * You should have received a copy of the GNU General Public License
   * along with this program. If not, see <http://www.gnu.org/licenses/>.
   *
   * Print out the amount of time spent in the read and write systemcall
   * when each file opened by the process is closed. Note that the systemtap
   * script needs to be running before the open operations occur for
   * the script to record data.
   *
   * This script could be used to find out which files are slow to load
   * on a machine. e.g.
   *
   * stap iotime.stp -c 'firefox'
   *
   * Output format is:
   * timestamp pid (executable) info_type path ...
   *
   * 200283135 2573 (cupsd) access /etc/printcap read: 0 write: 7063
   * 200283143 2573 (cupsd) iotime /etc/printcap time: 69
   *
   */
  global start
  global time_io

  function timestamp:long() { return gettimeofday_us() - start }
  function proc:string() { return sprintf("%d (%s)", pid(), execname()) }

  probe begin { start = gettimeofday_us() }

  global filehandles, fileread, filewrite

  probe syscall.open.return {
    filename = user_string($filename)
    if ($return != -1) {
      filehandles[pid(), $return] = filename
    } else {
      printf("%d %s access %s fail\n", timestamp(), proc(), filename)
    }
  }

  probe syscall.read.return {
    p = pid()
    fd = $fd
    bytes = $return
    time = gettimeofday_us() - @entry(gettimeofday_us())
    if (bytes > 0)
      fileread[p, fd] += bytes
    time_io[p, fd] <<< time
  }

  probe syscall.write.return {
    p = pid()
    fd = $fd
    bytes = $return
    time = gettimeofday_us() - @entry(gettimeofday_us())
    if (bytes > 0)
      filewrite[p, fd] += bytes
    time_io[p, fd] <<< time
  }

  probe syscall.close {
    if ([pid(), $fd] in filehandles) {
      printf("%d %s access %s read: %d write: %d\n",
        timestamp(), proc(), filehandles[pid(), $fd],
        fileread[pid(), $fd], filewrite[pid(), $fd])
      if (@count(time_io[pid(), $fd]))
        printf("%d %s iotime %s time: %d\n", timestamp(), proc(),
          filehandles[pid(), $fd], @sum(time_io[pid(), $fd]))
    }
    delete fileread[pid(), $fd]
    delete filewrite[pid(), $fd]
    delete filehandles[pid(), $fd]
    delete time_io[pid(),$fd]
  }

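iotime.stp tracks every open, close, read, and write system call. For each file accessed, iotime.stp computes the time spent in read and write operations (in microseconds, since the script uses gettimeofday_us()) and the amount of data read or written (in bytes).

The local variable $count is also available in the read and write events (syscall.read and syscall.write), but $count holds the amount of data the system call asked to read or write. To get the amount of data actually read or written, use $return.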
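The difference between the requested and the actual size is easy to observe from user space. The sketch below is plain Python, not SystemTap: it asks read(2) for 4096 bytes from a pipe that holds only 5. In SystemTap terms, 4096 corresponds to $count and 5 to $return. The pipe is just a convenient stand-in for a file.

```python
import os

# Create a pipe and put 5 bytes into it.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"hello")
os.close(write_fd)

requested = 4096                    # what the caller asks for ($count)
data = os.read(read_fd, requested)  # what the call delivers ($return bytes)
os.close(read_fd)

actual = len(data)
print(f"requested={requested} actual={actual}")
```

A script that summed the requested size instead would overstate the traffic of any process that reads with a large buffer.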

Sample output from iotime.stp:

  [...]
  825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
  825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
  [...]
  117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
  117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
  [...]
  3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
  3973744 2886 (sendmail) iotime /proc/loadavg time: 11
  [...]

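The script in this section outputs the following data:

  • A timestamp, in microseconds since the script started.
  • The PID and name of the process.
  • An access or iotime flag.
  • The file accessed.

If a process read or wrote data, an access line and an iotime line appear as a pair. The access line records that the process accessed the file and ends with the amount of data read and written (in bytes). The iotime line shows the time spent reading and writing (in microseconds). An access line that is not followed by an iotime line means the process did not read or write any data.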
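To make the pairing concrete, here is a small post-processing sketch in Python. It is not part of iotime.stp; it parses lines in the format shown in the sample output and attaches each iotime line to the access line immediately before it. The helper name pair_records and the record layout are hypothetical, and the patterns assume paths without spaces.

```python
import re

# Lines copied from the sample output above.
SAMPLE = """\
825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11
"""

ACCESS = re.compile(r"^\d+ (\d+) \(([^)]*)\) access (\S+) read: (\d+) write: (\d+)$")
IOTIME = re.compile(r"^\d+ (\d+) \(([^)]*)\) iotime (\S+) time: (\d+)$")


def pair_records(text):
    """Merge each access line with the iotime line that follows it, if any."""
    records = []
    for line in text.splitlines():
        access = ACCESS.match(line)
        iotime = IOTIME.match(line)
        if access:
            pid, name, path, read, write = access.groups()
            records.append({"pid": int(pid), "name": name, "path": path,
                            "read": int(read), "write": int(write),
                            "time_us": None})
        elif iotime and records:
            pid, _name, path, time_us = iotime.groups()
            last = records[-1]
            # Only pair lines that refer to the same process and file.
            if (last["pid"], last["path"]) == (int(pid), path):
                last["time_us"] = int(time_us)
    return records


for record in pair_records(SAMPLE):
    print(record["name"], record["path"], record["read"], record["time_us"])
```

A record whose time_us stays None corresponds to an access line with no iotime line after it.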

Tracking Cumulative I/O

This section shows how to track the cumulative amount of I/O on the system.

traceio.stp

  #! /usr/bin/env stap
  # traceio.stp
  # Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo@redhat.com>
  # Copyright (C) 2009 Kai Meyer <kai@unixlords.com>
  # Fixed a bug that allows this to run longer
  # And added the humanreadable function
  #
  # This program is free software; you can redistribute it and/or modify
  # it under the terms of the GNU General Public License version 2 as
  # published by the Free Software Foundation.
  #
  global reads, writes, total_io

  probe vfs.read.return {
    if ($return > 0) {
      reads[pid(),execname()] += $return
      total_io[pid(),execname()] += $return
    }
  }

  probe vfs.write.return {
    if ($return > 0) {
      writes[pid(),execname()] += $return
      total_io[pid(),execname()] += $return
    }
  }

  function humanreadable(bytes) {
    if (bytes > 1024*1024*1024) {
      return sprintf("%d GiB", bytes/1024/1024/1024)
    } else if (bytes > 1024*1024) {
      return sprintf("%d MiB", bytes/1024/1024)
    } else if (bytes > 1024) {
      return sprintf("%d KiB", bytes/1024)
    } else {
      return sprintf("%d B", bytes)
    }
  }

  probe timer.s(1) {
    foreach([p,e] in total_io- limit 10)
      printf("%8d %15s r: %12s w: %12s\n",
        p, e, humanreadable(reads[p,e]),
        humanreadable(writes[p,e]))
    printf("\n")
    # Note we don't zero out reads, writes and total_io,
    # so the values are cumulative since the script started.
  }

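traceio.stp prints, once per second, the ten processes with the most cumulative I/O, together with the cumulative amount each of them has read and written. Like disktop.stp at the beginning of this section, this script uses the local variable $return to get the amount of data actually read or written.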

  [...]
  Xorg r: 583401 KiB w: 0 KiB
  floaters r: 96 KiB w: 7130 KiB
  multiload-apple r: 538 KiB w: 537 KiB
  sshd r: 71 KiB w: 72 KiB
  pam_timestamp_c r: 138 KiB w: 0 KiB
  staprun r: 51 KiB w: 51 KiB
  snmpd r: 46 KiB w: 0 KiB
  pcscd r: 28 KiB w: 0 KiB
  irqbalance r: 27 KiB w: 4 KiB
  cupsd r: 4 KiB w: 18 KiB

  Xorg r: 588140 KiB w: 0 KiB
  floaters r: 97 KiB w: 7143 KiB
  multiload-apple r: 543 KiB w: 542 KiB
  sshd r: 72 KiB w: 72 KiB
  pam_timestamp_c r: 138 KiB w: 0 KiB
  staprun r: 51 KiB w: 51 KiB
  snmpd r: 46 KiB w: 0 KiB
  pcscd r: 28 KiB w: 0 KiB
  irqbalance r: 27 KiB w: 4 KiB
  cupsd r: 4 KiB w: 18 KiB
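The unit selection in humanreadable() relies on strict comparisons and truncating integer division, which affects values at the boundaries. The following is a Python rendering of that function for experimentation; it is a sketch that mirrors the listing above, not code taken from SystemTap itself.

```python
def humanreadable(nbytes):
    """Python rendering of the humanreadable() function in traceio.stp."""
    if nbytes > 1024 * 1024 * 1024:
        return "%d GiB" % (nbytes // 1024 // 1024 // 1024)
    elif nbytes > 1024 * 1024:
        return "%d MiB" % (nbytes // 1024 // 1024)
    elif nbytes > 1024:
        return "%d KiB" % (nbytes // 1024)
    else:
        return "%d B" % nbytes


# Values chosen to sit on or near each threshold.
for value in (512, 1024, 1025, 1536 * 1024, 3 * 1024**3 + 1):
    print(f"{value:>12} -> {humanreadable(value)}")
```

Exactly 1024 bytes is still printed in bytes because the comparison is strict, and 1.5 MiB is printed as 1 MiB because the division truncates. By the same rule, a total of 583401 KiB would be printed as 569 MiB, so the sample output above (which also lacks the PID column) presumably comes from an earlier revision of the script.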

Monitoring I/O by Device

This section shows how to monitor I/O activity on a specific device.

traceio2.stp

  #! /usr/bin/env stap
  global device_of_interest

  probe begin {
    /* The following is not the most efficient way to do this.
       One could directly put the result of usrdev2kerndev()
       into device_of_interest. However, want to test out
       the other device functions */
    dev = usrdev2kerndev($1)
    device_of_interest = MKDEV(MAJOR(dev), MINOR(dev))
  }

  probe vfs.write, vfs.read
  {
    if (dev == device_of_interest)
      printf ("%s(%d) %s 0x%x\n",
        execname(), pid(), ppfunc(), dev)
  }

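traceio2.stp takes one argument: the whole device number. To get the device number of the device that holds a directory named directory, use stat -c "0x%D" directory.

usrdev2kerndev() converts the whole device number into the format the kernel understands. The script passes the result of usrdev2kerndev() through MAJOR() and MINOR() to obtain the major and minor device numbers, and then through MKDEV() to rebuild the corresponding kernel device number.

The output of traceio2.stp includes the name and PID of each process performing a read or write, the function it calls (vfs_read or vfs_write), and the kernel device number.

The following is the output of stap traceio2.stp 0x805, where 0x805 is the whole device number of /home. /home resides on /dev/sda5, which is the device we want to monitor.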

  [...]
  synergyc(3722) vfs_read 0x800005
  synergyc(3722) vfs_read 0x800005
  cupsd(2889) vfs_write 0x800005
  cupsd(2889) vfs_write 0x800005
  cupsd(2889) vfs_write 0x800005
  [...]
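The argument 0x805 and the printed value 0x800005 are the same device in two encodings. The Python sketch below reproduces the conversion under the assumption that the kernel reserves 20 bits for the minor number and that the user-space value follows the layout decoded by the kernel's new_decode_dev() helper. The function names echo the tapset functions, but these are illustrative re-implementations, not the tapset code.

```python
MINORBITS = 20                     # assumed: kernel dev_t keeps 20 minor bits
MINORMASK = (1 << MINORBITS) - 1


def mkdev(major, minor):
    """Combine major and minor into a kernel-style device number."""
    return (major << MINORBITS) | minor


def major(dev):
    """Extract the major number from a kernel-style device number."""
    return dev >> MINORBITS


def minor(dev):
    """Extract the minor number from a kernel-style device number."""
    return dev & MINORMASK


def usrdev2kerndev(dev):
    """Convert a user-space device number (as printed by stat %D)."""
    ma = (dev & 0xFFF00) >> 8
    mi = (dev & 0xFF) | ((dev >> 12) & 0xFFF00)
    return mkdev(ma, mi)


kernel_dev = usrdev2kerndev(0x805)
print(hex(kernel_dev), major(kernel_dev), minor(kernel_dev))
```

Under these assumptions, 0x805 (major 8, minor 5) becomes 0x800005, which matches the value in the sample output.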

Monitoring Reads and Writes to a File

This section shows how to monitor reads from and writes to a specific file in real time.

inodewatch.stp

  #! /usr/bin/env stap
  probe vfs.write, vfs.read
  {
    # dev and ino are defined by vfs.write and vfs.read
    if (dev == MKDEV($1,$2) # major/minor device
        && ino == $3)
      printf ("%s(%d) %s 0x%x/%u\n",
        execname(), pid(), ppfunc(), dev, ino)
  }

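inodewatch.stp takes the following arguments on the command line, in this order:

  1. The file's major device number.
  2. The file's minor device number.
  3. The file's inode number.

To get this information, use stat -c '%D %i' filename, where filename is an absolute path. For example, to monitor /etc/crontab, first run stat -c '%D %i' /etc/crontab. The output should look like this: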

  805 1078319

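805 is the device number in hexadecimal. The lowest two digits are the minor device number and the remaining digits are the major device number (this holds as long as the minor number is below 256). 1078319 is the inode number. To monitor /etc/crontab, run stap inodewatch.stp 0x8 0x05 1078319 (the 0x prefix marks a value as hexadecimal).

The output of this command contains the name and PID of each process that reads or writes the file, the function it calls (vfs_read or vfs_write), the device number (in hexadecimal), and the inode number. The following is the output of stap inodewatch.stp 0x8 0x05 1078319, captured when cat /etc/crontab was executed while the script was running: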

  cat(16437) vfs_read 0x800005/1078319
  cat(16437) vfs_read 0x800005/1078319
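The three arguments can also be derived programmatically. The following Python sketch uses os.stat() with os.major() and os.minor() to build the stap command line, and separately applies the two-digit rule described above to the text printed by stat. The helper names inodewatch_args and split_stat_dev are hypothetical, and the target path "/" is only a placeholder for the file you want to watch.

```python
import os


def inodewatch_args(path):
    """Return (major, minor, inode) for path, in the order inodewatch.stp expects."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev), st.st_ino


def split_stat_dev(hex_dev):
    """Split the %D field printed by stat using the two-digit rule.

    Only valid while the minor number fits in two hex digits (below 256).
    """
    value = int(hex_dev, 16)
    return value >> 8, value & 0xFF


target = "/"  # placeholder; the text above uses /etc/crontab
major, minor, inode = inodewatch_args(target)
print(f"stap inodewatch.stp {major:#x} {minor:#x} {inode}")
print(split_stat_dev("805"))
```

Using os.major() and os.minor() avoids the manual digit split, which matters on systems where minor numbers exceed 255.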

Monitoring Changes to File Attributes

This section shows how to monitor, in real time, changes to the attributes of a specific file.

inodewatch2.stp

  #! /usr/bin/env stap
  global ATTR_MODE = 1

  probe kernel.function("setattr_copy")!,
        kernel.function("generic_setattr")!,
        kernel.function("inode_setattr") {
    dev_nr = $inode->i_sb->s_dev
    inode_nr = $inode->i_ino
    if (dev_nr == MKDEV($1,$2) # major/minor device
        && inode_nr == $3
        && $attr->ia_valid & ATTR_MODE)
      printf ("%s(%d) %s 0x%x/%u %o %d\n",
        execname(), pid(), ppfunc(), dev_nr, inode_nr, $attr->ia_mode, uid())
  }

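Like the script in the previous section, inodewatch2.stp takes the target file's device number (major and minor) and inode number as arguments. You can obtain them with the method described in the previous section.

The output of inodewatch2.stp is also similar to that of inodewatch.stp, except that inodewatch2.stp additionally shows the attribute change (the new file mode, printed in octal) and the UID of the user responsible. The following is the output while monitoring /home/joe/bigfile, after user joe runs chmod 777 /home/joe/bigfile and chmod 666 /home/joe/bigfile: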

  chmod(17448) inode_setattr 0x800005/6011835 100777 500
  chmod(17449) inode_setattr 0x800005/6011835 100666 500
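The mode column (100777, 100666) is the full ia_mode value in octal, so it carries the file type as well as the permission bits. The Python sketch below decodes such a value with the standard stat module and shows the bit test the probe applies to ia_valid. The helper names describe_mode and mode_changed are illustrative.

```python
import stat

ATTR_MODE = 1  # same flag value as the global in inodewatch2.stp


def describe_mode(octal_text):
    """Decode an ia_mode value as printed by inodewatch2.stp (octal)."""
    mode = int(octal_text, 8)
    return {
        "regular_file": stat.S_ISREG(mode),
        "permissions": oct(stat.S_IMODE(mode)),
        "symbolic": stat.filemode(mode),
    }


def mode_changed(ia_valid):
    """True when the ATTR_MODE bit is set, as tested by the probe."""
    return bool(ia_valid & ATTR_MODE)


for printed in ("100777", "100666"):
    print(printed, describe_mode(printed))
```

The leading 100 marks a regular file; the trailing three digits are the permissions passed to chmod.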

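Periodically Print I/O Block Time

This section shows how to track the time each block I/O request spends waiting for completion. This helps you determine whether block I/O operations are queuing up at any given time.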

ioblktime.stp

  #! /usr/bin/env stap
  global req_time%[25000], etimes

  probe ioblock.request
  {
    req_time[$bio] = gettimeofday_us()
  }

  probe ioblock.end
  {
    t = gettimeofday_us()
    s = req_time[$bio]
    delete req_time[$bio]
    if (s) {
      etimes[devname, bio_rw_str(rw)] <<< t - s
    }
  }

  /* for time being delete things that get merged with others */
  probe kernel.trace("block_bio_frontmerge"),
        kernel.trace("block_bio_backmerge")
  {
    delete req_time[$bio]
  }

  probe timer.s(10), end {
    ansi_clear_screen()
    printf("%10s %3s %10s %10s %10s\n",
      "device", "rw", "total (us)", "count", "avg (us)")
    foreach ([dev,rw] in etimes - limit 20) {
      printf("%10s %3s %10d %10d %10d\n", dev, rw,
        @sum(etimes[dev,rw]), @count(etimes[dev,rw]), @avg(etimes[dev,rw]))
    }
    delete etimes
  }

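ioblktime.stp computes the average waiting time of block I/O on each device and refreshes the list every 10 seconds. To change the refresh rate, edit the interval in the line probe timer.s(10), end {.

Sometimes a device has so many outstanding block I/O operations that the script exceeds the default MAXMAPENTRIES. If you do not specify a size when declaring an array, SystemTap uses MAXMAPENTRIES as the maximum number of entries in that array. The default value is 2048, but you can set it with the stap option -DMAXMAPENTRIES=10000.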

      device  rw total (us)      count   avg (us)
         sda   W       9659          6       1609
        dm-0   W      20278          6       3379
        dm-0   R      20524          5       4104
         sda   R      19277          5       3855

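The output above shows the device name, the type of operation (rw), the total waiting time (total (us)), the number of operations (count), and the average waiting time (avg (us)). All times are in microseconds.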
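The bookkeeping behind these columns can be modelled outside the kernel. The Python sketch below replays a hypothetical event stream (the timestamps, bio IDs, and the tally and report helpers are invented for illustration) through the same logic as ioblktime.stp: record the request time per bio, drop merged bios, and on completion add the elapsed time to a per-device, per-direction aggregate. Rows are ordered by count, which is consistent with the ordering in the sample output above.

```python
from collections import defaultdict

# Hypothetical event stream: (timestamp_us, event, bio_id, device, rw).
events = [
    (100, "request", 1, "sda", "W"),
    (150, "request", 2, "sda", "R"),
    (400, "end", 1, "sda", "W"),
    (900, "end", 2, "sda", "R"),
    (950, "end", 3, "sda", "R"),     # no matching request: ignored
    (1000, "request", 4, "sda", "W"),
    (1100, "merge", 4, "sda", "W"),  # merged bio: tracking is dropped
    (1200, "request", 5, "sda", "W"),
    (1305, "end", 5, "sda", "W"),
]


def tally(events):
    """Collect waiting times per (device, rw), like etimes in the script."""
    req_time = {}               # bio id -> request timestamp
    etimes = defaultdict(list)  # (device, rw) -> list of waits in us
    for ts, event, bio, dev, rw in events:
        if event == "request":
            req_time[bio] = ts
        elif event == "merge":
            req_time.pop(bio, None)
        elif event == "end":
            start = req_time.pop(bio, None)
            if start is not None:
                etimes[(dev, rw)].append(ts - start)
    return etimes


def report(etimes):
    """Return (device, rw, total, count, avg) rows, highest count first."""
    rows = []
    for (dev, rw), waits in etimes.items():
        total, count = sum(waits), len(waits)
        rows.append((dev, rw, total, count, total // count))  # integer average
    return sorted(rows, key=lambda row: row[3], reverse=True)


for row in report(tally(events)):
    print("%10s %3s %10d %10d %10d" % row)
```

The average is an integer division of the total by the count, which is why the first sample row shows 1609 for 9659 us over 6 operations.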