https://stackoverflow.com/questions/24020420/find-out-the-cpu-time-and-memory-usage-of-a-slurm-job
https://slurm.schedmd.com/quickstart.html

MaxRSS is tracked per task:

MaxRSS: maximum resident set size of all tasks in the job.

job->step->task
https://slurm.schedmd.com/job_launch.html

sstat -j: reports MaxRSS while the job is still running.

https://slurm.schedmd.com/sstat.html

sstat --allsteps --format=AveCPU,AvePages,MaxRSS,AveRSS,JobID -j <jobid>

https://stackoverflow.com/questions/60779173/what-does-maximum-resident-set-size-mean#:~:text=Resident%20set%20size%20%28RSS%29%20means%2C%20roughly%2C%20the%20total,file%20but%20not%20currently%20loaded%20into%20physical%20memory

Resident set size (RSS) means, roughly, the total amount of physical memory assigned to a process at a given point in time. It does not count pages that have been swapped out, or that are mapped from a file but not currently loaded into physical memory.
"Maximum RSS" means the maximum of the RSS since the process's birth, i.e. the largest it has ever been. So this number tells you the largest amount of physical memory your process has ever been using at any one instant.
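As a quick local illustration (not Slurm-specific), Python's standard `resource` module exposes the same peak-RSS counter for the current process:

```python
import resource

# Peak resident set size of this process since it started.
# On Linux the value is reported in KiB (on macOS it is in bytes).
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"MaxRSS so far: {peak} KiB")
```

This is the same counter Slurm samples per task to produce the MaxRSS field.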

A quick test:

[root@head ~]# scontrol show job 1480
JobId=1480 JobName=Image_Classification_0427_test1
UserId=hpcadmin(1000) GroupId=hpcadmin(1000) MCS_label=N/A
Priority=4294901712 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=16:16:54 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2021-04-27T17:31:41 EligibleTime=2021-04-27T17:31:41
AccrueTime=2021-04-27T17:31:41
StartTime=2021-04-27T17:31:41 EndTime=2021-04-28T17:31:41 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-27T17:31:41
Partition=lico-queue AllocNode:Sid=head:2101957
ReqNodeList=(null) ExcNodeList=(null)
NodeList=lico-C1,licosecond
BatchHost=lico-C1
NumNodes=2 NumCPUs=4 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,node=2,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/hpcadmin/daiyi_test_20210126/train_dir/Image_Classification_0427_test1_107_202104270931.slurm
WorkDir=/home/hpcadmin/daiyi_test_20210126/train_dir
Comment=LICO-107
StdErr=/home/hpcadmin/daiyi_test_20210126/train_dir/slurm-1480.out
StdIn=/dev/null
StdOut=/home/hpcadmin/daiyi_test_20210126/train_dir/slurm-1480.out
Power=
MailUser=(null) MailType=NONE
[root@head ~]# sstat -j 1480 --allsteps --fields=MaxRSS,MaxRSSNode,MaxRSSTask,AveRSS,MaxPages,MaxPagesNode,TresUsageInAve,TresUsageInMax
MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode TRESUsageInAve TRESUsageInMax
---------- ---------- ---------- ---------- -------- ------------ -------------- --------------
60792K lico-C1 0 60792K 0 lico-C1 cpu=00:00:16,+ cpu=00:00:16,+
1315612K lico-C1 0 1308824K 188 lico-C1 cpu=00:02:00,+ cpu=00:02:00,+
1501632K licosecond 0 1461676K 185 licosecond cpu=00:01:53,+ cpu=00:01:53,+
7084064K lico-C1 0 6883420K 194 lico-C1 cpu=1-09:20:2+ cpu=1-09:20:2+
6304844K licosecond 0 4999336K 195 licosecond cpu=1-07:46:3+ cpu=1-07:46:3+
[root@head ~]#
[root@lico-C1 ~]# jobid=1480;a='';for i in `scontrol listpids $jobid|grep -v 'PID'|awk '{print $1}'`;do a=$a','$i;echo $a;done;a=${a:1};top -p $a -n 1
top - 10:48:53 up 12 days, 1:02, 2 users, load average: 4.89, 4.99, 4.25
Tasks: 14 total, 0 running, 14 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.2 us, 1.1 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 386393.2 total, 311122.4 free, 16072.3 used, 59198.6 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 367382.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1389988 hpcadmin 20 0 21.8g 6.5g 169664 S 200.0 1.7 2000:52 letrain.pyc
1389909 hpcadmin 20 0 12836 3280 2860 S 0.0 0.0 0:00.00 slurm_script
1389915 hpcadmin 20 0 387580 21840 8756 S 0.0 0.0 0:16.24 lico-dl-run
1389920 hpcadmin 20 0 255824 8004 4768 S 0.0 0.0 0:00.01 srun
1389922 hpcadmin 20 0 255824 7920 4728 S 0.0 0.0 0:00.01 srun
1389923 hpcadmin 20 0 255824 7836 4596 S 0.0 0.0 0:00.04 srun
1389925 hpcadmin 20 0 51068 996 0 S 0.0 0.0 0:00.00 srun
1389928 hpcadmin 20 0 255824 7916 4728 S 0.0 0.0 0:00.09 srun
1389932 hpcadmin 20 0 51068 1000 0 S 0.0 0.0 0:00.00 srun
1389942 hpcadmin 20 0 51068 1000 0 S 0.0 0.0 0:00.00 srun
1389943 hpcadmin 20 0 51068 1000 0 S 0.0 0.0 0:00.00 srun
1389960 hpcadmin 20 0 653488 18536 11596 S 0.0 0.0 0:00.08 starter-suid
1389987 hpcadmin 20 0 13.9g 1.2g 148664 S 0.0 0.3 2:00.19 letrain.pyc
1389961 hpcadmin 20 0 651824 20068 11120 S 0.0 0.0 0:00.07 starter-suid
[root@licosecond ~]# jobid=1480;a='';for i in `scontrol listpids $jobid|grep -v 'PID'|awk '{print $1}'`;do a=$a','$i;echo $a;done;a=${a:1};top -p $a -n 1
top - 10:49:04 up 9 days, 11:13, 1 user, load average: 5.72, 4.85, 3.46
Tasks: 4 total, 0 running, 4 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 0.5 sy, 0.0 ni, 97.1 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 386390.0 total, 357403.2 free, 11141.1 used, 17845.7 buff/cache
MiB Swap: 10239.7 total, 10239.7 free, 0.0 used. 372638.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
972275 hpcadmin 20 0 23.0g 4.4g 169260 S 200.0 1.2 1907:15 letrain.pyc
972248 hpcadmin 20 0 653488 20176 11192 S 0.0 0.0 0:00.07 starter-suid
972250 hpcadmin 20 0 651824 18252 11336 S 0.0 0.0 0:00.07 starter-suid
972282 hpcadmin 20 0 17.8g 1.4g 147964 S 0.0 0.4 1:53.73 letrain.pyc
  1. >>> (60792 + 1315612 + 1501632 + 7084064 +6304844) // 1024
  2. 15885
  3. >>> (386393.2 * 2.0 //100) + (386390 * 1.6 // 100)
  4. 13909.0
  5. >>>
[root@head slurm]# cat to_log.sh
while true
do
  job_id=17
  echo "##### "`date`" #####">>bb.log
  echo `sstat -a --format=JobID,MaxRss,AveRss -j $job_id`>>bb.log
  echo "###########">>bb.log
  echo `sacct --format=JobID,MaxRss,AveRss -j $job_id`>>bb.log
  echo " ">>bb.log
  echo " ">>bb.log
  echo " ">>bb.log
  echo " ">>bb.log
  sleep 10
done

sstat has data from the moment the job starts running until the job completes.
sacct only has data once the job has completed; the two hand off exactly.

##### Wed Apr 28 18:37:04 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ---------- 17.batch 33380K 33364K 17.0 934964K 869876K
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 17.0
##### Wed Apr 28 18:37:14 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ---------- 17.batch 33380K 33364K 17.0 934964K 909380K
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 17.0
##### Wed Apr 28 18:37:24 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ---------- 17.batch 33380K 33364K 17.0 934964K 841056K
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 17.0
##### Wed Apr 28 18:37:34 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K
##### Wed Apr 28 18:37:34 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K
##### Wed Apr 28 18:37:44 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K
##### Wed Apr 28 18:37:55 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K
##### Wed Apr 28 18:38:05 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K
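The handoff suggests a simple pattern for a monitoring script: query sstat first and fall back to sacct once sstat stops returning rows. A hedged sketch (the helper name and fallback logic are mine, not from Slurm):

```python
import subprocess

def job_maxrss(jobid):
    """Per-step 'JobID|MaxRSS' lines: sstat while the job runs,
    sacct once it has finished. Returns None if neither has data
    (or the Slurm client tools are not installed on this host)."""
    commands = [
        ["sstat", "--allsteps", "-n", "-P", "--format=JobID,MaxRSS", "-j", str(jobid)],
        ["sacct", "-n", "-P", "--format=JobID,MaxRSS", "-j", str(jobid)],
    ]
    for cmd in commands:
        try:
            out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
        except FileNotFoundError:  # Slurm tools not available here
            continue
        if out:
            return out.splitlines()
    return None
```

`-n` suppresses the header and `-P` makes the output pipe-delimited, so the lines are easy to split and log.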

seff: usage summary covering the whole job lifetime.

https://stackoverflow.com/questions/24020420/find-out-the-cpu-time-and-memory-usage-of-a-slurm-job

https://slurm.schedmd.com/sstat.html#lbAE

[root@head slurm]# seff 7

/usr/bin/perl: symbol lookup error: /usr/lib64/slurm/accounting_storage_slurmdbd.so: undefined symbol: running_in_slurmctld

scontrol show node

scontrol -d show job (the -d/--details flag adds detail)

[root@head-C1 ~]# scontrol show job 1373
JobId=1373 JobName=hx_tc_p_03_same_01
UserId=admin(1000) GroupId=admin(1000) MCS_label=N/A
Priority=4294901739 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:15:08 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2021-04-21T15:19:46 EligibleTime=2021-04-21T15:19:46
AccrueTime=2021-04-21T15:19:46
StartTime=2021-04-21T18:10:24 EndTime=2021-04-28T18:10:24 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-21T18:10:24
Partition=head-queue AllocNode:Sid=head:2698865
ReqNodeList=(null) ExcNodeList=(null)
NodeList=head-C1,licosecond
BatchHost=head-C1
NumNodes=2 NumCPUs=114 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=114,node=2,billing=114
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Power=
TresPerNode=gpu:1
MailUser=(null) MailType=NONE

sacct -j: data becomes available once the job has finished.

scontrol listpids

[root@head ~]# scontrol listpids
PID JOBID STEPID LOCALID GLOBALID
708542 7 batch 0 0
708556 7 batch - -
708559 7 batch - -
708562 7 batch - -
708574 7 0 0 0
708583 7 0 - -
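The PIDs from `scontrol listpids` can also be checked directly against /proc instead of running top; a minimal Linux-only sketch:

```python
import os

def rss_kib(pid):
    """Current resident set size of a PID in KiB, read from /proc (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # e.g. "VmRSS:   3280 kB"
    return 0

# Example: this process's own current RSS
print(rss_kib(os.getpid()))
```

Summing `rss_kib` over the listed PIDs gives the instantaneous per-node RSS of the job, which is what top's RES column shows.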
