https://stackoverflow.com/questions/24020420/find-out-the-cpu-time-and-memory-usage-of-a-slurm-job
https://slurm.schedmd.com/quickstart.html
MaxRSS is accounted per task.
MaxRSS: maximum resident set size of all tasks in the job.
Hierarchy: job -> step -> task
https://slurm.schedmd.com/job_launch.html
sstat -j <jobid> reports MaxRSS while the job is running:
sstat --allsteps --format=AveCPU,AvePages,MaxRSS,AveRSS,JobID -j <jobid>
Resident set size (RSS) means, roughly, the total amount of physical memory assigned to a process at a given point in time. It does not count pages that have been swapped out, or that are mapped from a file but not currently loaded into physical memory.
“Maximum RSS” means the maximum of the RSS since the process's birth, i.e. the largest it has ever been. So this number tells you the largest amount of physical memory your process has ever used at any one instant.
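The kernel tracks both numbers for every process and exposes them in /proc/<pid>/status: VmRSS is the current resident set size, and VmHWM is its peak (the "high-water mark"). A quick way to look at them for any PID, here the current shell:

# VmRSS = current RSS, VmHWM = peak RSS since the process started
grep -E 'VmHWM|VmRSS' /proc/$$/status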
Let's test this:
[root@head ~]# scontrol show job 1480
JobId=1480 JobName=Image_Classification_0427_test1
UserId=hpcadmin(1000) GroupId=hpcadmin(1000) MCS_label=N/A
Priority=4294901712 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=16:16:54 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2021-04-27T17:31:41 EligibleTime=2021-04-27T17:31:41
AccrueTime=2021-04-27T17:31:41
StartTime=2021-04-27T17:31:41 EndTime=2021-04-28T17:31:41 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-27T17:31:41
Partition=lico-queue AllocNode:Sid=head:2101957
ReqNodeList=(null) ExcNodeList=(null)
NodeList=lico-C1,licosecond
BatchHost=lico-C1
NumNodes=2 NumCPUs=4 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,node=2,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/hpcadmin/daiyi_test_20210126/train_dir/Image_Classification_0427_test1_107_202104270931.slurm
WorkDir=/home/hpcadmin/daiyi_test_20210126/train_dir
Comment=LICO-107
StdErr=/home/hpcadmin/daiyi_test_20210126/train_dir/slurm-1480.out
StdIn=/dev/null
StdOut=/home/hpcadmin/daiyi_test_20210126/train_dir/slurm-1480.out
Power=
MailUser=(null) MailType=NONE

[root@head ~]# sstat -j 1480 --allsteps --fields=MaxRSS,MaxRSSNode,MaxRSSTask,AveRSS,MaxPages,MaxPagesNode,TresUsageInAve,TresUsageInMax
    MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode TRESUsageInAve TRESUsageInMax
---------- ---------- ---------- ---------- -------- ------------ -------------- --------------
    60792K    lico-C1          0     60792K        0      lico-C1 cpu=00:00:16,+ cpu=00:00:16,+
  1315612K    lico-C1          0   1308824K      188      lico-C1 cpu=00:02:00,+ cpu=00:02:00,+
  1501632K licosecond          0   1461676K      185   licosecond cpu=00:01:53,+ cpu=00:01:53,+
  7084064K    lico-C1          0   6883420K      194      lico-C1 cpu=1-09:20:2+ cpu=1-09:20:2+
  6304844K licosecond          0   4999336K      195   licosecond cpu=1-07:46:3+ cpu=1-07:46:3+
[root@head ~]#

[root@lico-C1 ~]# jobid=1480;a='';for i in `scontrol listpids $jobid|grep -v 'PID'|awk '{print $1}'`;do a=$a','$i;echo $a;done;a=${a:1};top -p $a -n 1
top - 10:48:53 up 12 days, 1:02, 2 users, load average: 4.89, 4.99, 4.25
Tasks: 14 total, 0 running, 14 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.2 us, 1.1 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 386393.2 total, 311122.4 free, 16072.3 used, 59198.6 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 367382.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1389988 hpcadmin  20   0   21.8g   6.5g 169664 S 200.0  1.7   2000:52 letrain.pyc
1389909 hpcadmin  20   0   12836   3280   2860 S   0.0  0.0   0:00.00 slurm_script
1389915 hpcadmin  20   0  387580  21840   8756 S   0.0  0.0   0:16.24 lico-dl-run
1389920 hpcadmin  20   0  255824   8004   4768 S   0.0  0.0   0:00.01 srun
1389922 hpcadmin  20   0  255824   7920   4728 S   0.0  0.0   0:00.01 srun
1389923 hpcadmin  20   0  255824   7836   4596 S   0.0  0.0   0:00.04 srun
1389925 hpcadmin  20   0   51068    996      0 S   0.0  0.0   0:00.00 srun
1389928 hpcadmin  20   0  255824   7916   4728 S   0.0  0.0   0:00.09 srun
1389932 hpcadmin  20   0   51068   1000      0 S   0.0  0.0   0:00.00 srun
1389942 hpcadmin  20   0   51068   1000      0 S   0.0  0.0   0:00.00 srun
1389943 hpcadmin  20   0   51068   1000      0 S   0.0  0.0   0:00.00 srun
1389960 hpcadmin  20   0  653488  18536  11596 S   0.0  0.0   0:00.08 starter-suid
1389987 hpcadmin  20   0   13.9g   1.2g 148664 S   0.0  0.3   2:00.19 letrain.pyc
1389961 hpcadmin  20   0  651824  20068  11120 S   0.0  0.0   0:00.07 starter-suid

[root@licosecond ~]# jobid=1480;a='';for i in `scontrol listpids $jobid|grep -v 'PID'|awk '{print $1}'`;do a=$a','$i;echo $a;done;a=${a:1};top -p $a -n 1
top - 10:49:04 up 9 days, 11:13, 1 user, load average: 5.72, 4.85, 3.46
Tasks: 4 total, 0 running, 4 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 0.5 sy, 0.0 ni, 97.1 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 386390.0 total, 357403.2 free, 11141.1 used, 17845.7 buff/cache
MiB Swap: 10239.7 total, 10239.7 free, 0.0 used. 372638.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 972275 hpcadmin  20   0   23.0g   4.4g 169260 S 200.0  1.2   1907:15 letrain.pyc
 972248 hpcadmin  20   0  653488  20176  11192 S   0.0  0.0   0:00.07 starter-suid
 972250 hpcadmin  20   0  651824  18252  11336 S   0.0  0.0   0:00.07 starter-suid
 972282 hpcadmin  20   0   17.8g   1.4g 147964 S   0.0  0.4   1:53.73 letrain.pyc
Sanity check: summing the per-step MaxRSS values from sstat gives about 15.5 GB, which is in the same ballpark as what top reports through %MEM (roughly 2.0% of 386393.2 MiB on lico-C1 plus 1.6% of 386390 MiB on licosecond):

>>> (60792 + 1315612 + 1501632 + 7084064 + 6304844) // 1024
15885
>>> (386393.2 * 2.0 // 100) + (386390 * 1.6 // 100)
13909.0
>>>
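The same total can be pulled out of sstat directly. A minimal sketch, assuming every MaxRSS value carries a trailing "K" as in the output above:

# Sum MaxRSS over all steps of a job and print the total in MB
jobid=1480
sstat --allsteps --noheader --parsable2 --format=MaxRSS -j "$jobid" \
    | tr -d 'K' \
    | awk '{ sum += $1 } END { printf "total MaxRSS: %d MB\n", sum / 1024 }'

(Summing per-step maxima can overstate concurrent usage when the steps do not overlap in time.)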
[root@head slurm]# cat to_log.sh
while true
do
    job_id=17
    echo "##### "`date`" #####">>bb.log
    echo `sstat -a --format=JobID,MaxRss,AveRss -j $job_id`>>bb.log; echo "###########">>bb.log
    echo `sacct --format=JobID,MaxRss,AveRss -j $job_id`>>bb.log
    echo " ">>bb.log
    echo " ">>bb.log
    echo " ">>bb.log
    echo " ">>bb.log
    sleep 10
done
sstat has data from the moment the job starts running until the job finishes;
sacct only fills in these fields once the job has finished. The two hand off exactly.
##### Wed Apr 28 18:37:04 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ---------- 17.batch 33380K 33364K 17.0 934964K 869876K
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 17.0

##### Wed Apr 28 18:37:14 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ---------- 17.batch 33380K 33364K 17.0 934964K 909380K
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 17.0

##### Wed Apr 28 18:37:24 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ---------- 17.batch 33380K 33364K 17.0 934964K 841056K
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 17.0

##### Wed Apr 28 18:37:34 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K

##### Wed Apr 28 18:37:44 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K

##### Wed Apr 28 18:37:55 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K

##### Wed Apr 28 18:38:05 CST 2021 #####
JobID MaxRSS AveRSS ------------ ---------- ----------
###########
JobID MaxRSS AveRSS ------------ ---------- ---------- 17 17.batch 33380K 33380K 17.0 934964K 934964K
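That handoff suggests a simple wrapper: ask sstat first, and fall back to sacct once sstat comes back empty. A minimal sketch along the lines of to_log.sh above:

job_id=17
# While the job runs, sstat answers; after completion it prints nothing.
live=$(sstat -a --noheader --format=JobID,MaxRSS,AveRSS -j "$job_id")
if [ -n "$live" ]; then
    echo "$live"
else
    # Job finished: the same fields now come from the accounting database.
    sacct --noheader --format=JobID,MaxRSS,AveRSS -j "$job_id"
fi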
seff <jobid> summarizes usage over the job's whole lifetime:
https://stackoverflow.com/questions/24020420/find-out-the-cpu-time-and-memory-usage-of-a-slurm-job
https://slurm.schedmd.com/sstat.html#lbAE
[root@head slurm]# seff 7
/usr/bin/perl: symbol lookup error: /usr/lib64/slurm/accounting_storage_slurmdbd.so: undefined symbol: running_in_slurmctld
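Until the seff installation is repaired, roughly the same summary is available from sacct once the job has finished; CPU efficiency is TotalCPU measured against AllocCPUS * Elapsed:

# seff-style summary from the accounting database (finished jobs)
sacct -j 7 --format=JobID,AllocCPUS,Elapsed,TotalCPU,MaxRSS,State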
scontrol show node
scontrol show job (add -d / --details for extra detail)
[root@head-C1 ~]# scontrol show job 1373
JobId=1373 JobName=hx_tc_p_03_same_01
UserId=admin(1000) GroupId=admin(1000) MCS_label=N/A
Priority=4294901739 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:15:08 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2021-04-21T15:19:46 EligibleTime=2021-04-21T15:19:46
AccrueTime=2021-04-21T15:19:46
StartTime=2021-04-21T18:10:24 EndTime=2021-04-28T18:10:24 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-21T18:10:24
Partition=head-queue AllocNode:Sid=head:2698865
ReqNodeList=(null) ExcNodeList=(null)
NodeList=head-C1,licosecond
BatchHost=head-C1
NumNodes=2 NumCPUs=114 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=114,node=2,billing=114
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Power=
TresPerNode=gpu:1
MailUser=(null) MailType=NONE
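Because scontrol prints flat key=value pairs, single fields are easy to pick out in scripts; for example, to grab just the TRES and NodeList values from the job above:

# Split the key=value pairs onto their own lines, then filter
scontrol show job 1373 | tr ' ' '\n' | grep -E '^(TRES|NodeList)='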
sacct -j <jobid> reports the accounting data once the job has finished.
scontrol listpids
[root@head ~]# scontrol listpids
     PID   JOBID  STEPID  LOCALID GLOBALID
  708542       7   batch        0        0
  708556       7   batch        -        -
  708559       7   batch        -        -
  708562       7   batch        -        -
  708574       7       0        0        0
  708583       7       0        -        -
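The PID column combines nicely with ps to see each process's current RSS on the node, the same idea as the top -p loop earlier (assumes job 7 is still running on this node):

# Join the job's PIDs with commas and ask ps for their RSS (in KB)
pids=$(scontrol listpids 7 | awk 'NR > 1 { print $1 }' | paste -sd, -)
ps -o pid,rss,comm -p "$pids"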
