https://stackoverflow.com/questions/24020420/find-out-the-cpu-time-and-memory-usage-of-a-slurm-job
https://slurm.schedmd.com/quickstart.html
MaxRSS is tracked per task.
MaxRSS: Maximum resident set size of all tasks in job.
Hierarchy: job -> step -> task.
https://slurm.schedmd.com/job_launch.html
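For orientation, a minimal (hypothetical) batch script showing where steps come from: each srun inside the script becomes its own step (<jobid>.0, <jobid>.1, ...), and the script body itself runs as the implicit <jobid>.batch step; MaxRSS is then recorded per task within each step.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=2
# the script body itself is accounted as step <jobid>.batch
srun hostname     # step <jobid>.0
srun sleep 60     # step <jobid>.1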
sstat -j <jobid>: reports MaxRSS while the job is running.
sstat --allsteps --format=AveCPU,AvePages,MaxRSS,AveRSS,JobID -j <jobid>
Resident set size (RSS) means, roughly, the total amount of physical memory assigned to a process at a given point in time. It does not count pages that have been swapped out, or that are mapped from a file but not currently loaded into physical memory.
“Maximum RSS” means the maximum of the RSS since the process’s birth, i.e. the largest it has ever been. So this number tells you the largest amount of physical memory your process has ever been using at any one instant.
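The kernel exposes both numbers per process, so they can be checked outside Slurm too: VmRSS in /proc/<pid>/status is the current RSS, and VmHWM (the "high water mark") is the peak that MaxRSS corresponds to. A quick check against any PID, e.g. the letrain.pyc task from the top output further down:

pid=1389988    # example PID; pick any task PID from scontrol listpids
grep -E 'VmHWM|VmRSS' /proc/$pid/status
# VmHWM:  peak resident set size ("maximum RSS")
# VmRSS:  current resident set size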
Let's test it:
[root@head ~]# scontrol show job 1480
JobId=1480 JobName=Image_Classification_0427_test1
UserId=hpcadmin(1000) GroupId=hpcadmin(1000) MCS_label=N/A
Priority=4294901712 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=16:16:54 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2021-04-27T17:31:41 EligibleTime=2021-04-27T17:31:41
AccrueTime=2021-04-27T17:31:41
StartTime=2021-04-27T17:31:41 EndTime=2021-04-28T17:31:41 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-27T17:31:41
Partition=lico-queue AllocNode:Sid=head:2101957
ReqNodeList=(null) ExcNodeList=(null)
NodeList=lico-C1,licosecond
BatchHost=lico-C1
NumNodes=2 NumCPUs=4 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,node=2,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/hpcadmin/daiyi_test_20210126/train_dir/Image_Classification_0427_test1_107_202104270931.slurm
WorkDir=/home/hpcadmin/daiyi_test_20210126/train_dir
Comment=LICO-107
StdErr=/home/hpcadmin/daiyi_test_20210126/train_dir/slurm-1480.out
StdIn=/dev/null
StdOut=/home/hpcadmin/daiyi_test_20210126/train_dir/slurm-1480.out
Power=
MailUser=(null) MailType=NONE
[root@head ~]# sstat -j 1480 --allsteps --fields=MaxRSS,MaxRSSNode,MaxRSSTask,AveRSS,MaxPages,MaxPagesNode,TresUsageInAve,TresUsageInMax
    MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode TRESUsageInAve TRESUsageInMax
---------- ---------- ---------- ---------- -------- ------------ -------------- --------------
    60792K    lico-C1          0     60792K        0      lico-C1 cpu=00:00:16,+ cpu=00:00:16,+
  1315612K    lico-C1          0   1308824K      188      lico-C1 cpu=00:02:00,+ cpu=00:02:00,+
  1501632K licosecond          0   1461676K      185   licosecond cpu=00:01:53,+ cpu=00:01:53,+
  7084064K    lico-C1          0   6883420K      194      lico-C1 cpu=1-09:20:2+ cpu=1-09:20:2+
  6304844K licosecond          0   4999336K      195   licosecond cpu=1-07:46:3+ cpu=1-07:46:3+
[root@head ~]#
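To total those per-step peaks in one go, something like the following sketch works (it assumes sstat prints MaxRSS with a K suffix, as above; note the per-step maxima need not occur at the same instant, so the sum can overstate peak simultaneous usage):

jobid=1480
sstat -a -n -P --format=MaxRSS -j $jobid \
    | awk '{sub(/K$/, ""); kb += $1} END {printf "sum of step MaxRSS: %d MB\n", kb/1024}'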
[root@lico-C1 ~]# jobid=1480;a='';for i in `scontrol listpids $jobid|grep -v 'PID'|awk '{print $1}'`;do a=$a','$i;done;a=${a:1};top -p $a -n 1
top - 10:48:53 up 12 days, 1:02, 2 users, load average: 4.89, 4.99, 4.25
Tasks: 14 total, 0 running, 14 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.2 us, 1.1 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 386393.2 total, 311122.4 free, 16072.3 used, 59198.6 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 367382.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1389988 hpcadmin 20 0 21.8g 6.5g 169664 S 200.0 1.7 2000:52 letrain.pyc
1389909 hpcadmin 20 0 12836 3280 2860 S 0.0 0.0 0:00.00 slurm_script
1389915 hpcadmin 20 0 387580 21840 8756 S 0.0 0.0 0:16.24 lico-dl-run
1389920 hpcadmin 20 0 255824 8004 4768 S 0.0 0.0 0:00.01 srun
1389922 hpcadmin 20 0 255824 7920 4728 S 0.0 0.0 0:00.01 srun
1389923 hpcadmin 20 0 255824 7836 4596 S 0.0 0.0 0:00.04 srun
1389925 hpcadmin 20 0 51068 996 0 S 0.0 0.0 0:00.00 srun
1389928 hpcadmin 20 0 255824 7916 4728 S 0.0 0.0 0:00.09 srun
1389932 hpcadmin 20 0 51068 1000 0 S 0.0 0.0 0:00.00 srun
1389942 hpcadmin 20 0 51068 1000 0 S 0.0 0.0 0:00.00 srun
1389943 hpcadmin 20 0 51068 1000 0 S 0.0 0.0 0:00.00 srun
1389960 hpcadmin 20 0 653488 18536 11596 S 0.0 0.0 0:00.08 starter-suid
1389987 hpcadmin 20 0 13.9g 1.2g 148664 S 0.0 0.3 2:00.19 letrain.pyc
1389961 hpcadmin 20 0 651824 20068 11120 S 0.0 0.0 0:00.07 starter-suid
[root@licosecond ~]# jobid=1480;a='';for i in `scontrol listpids $jobid|grep -v 'PID'|awk '{print $1}'`;do a=$a','$i;done;a=${a:1};top -p $a -n 1
top - 10:49:04 up 9 days, 11:13, 1 user, load average: 5.72, 4.85, 3.46
Tasks: 4 total, 0 running, 4 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 0.5 sy, 0.0 ni, 97.1 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 386390.0 total, 357403.2 free, 11141.1 used, 17845.7 buff/cache
MiB Swap: 10239.7 total, 10239.7 free, 0.0 used. 372638.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
972275 hpcadmin 20 0 23.0g 4.4g 169260 S 200.0 1.2 1907:15 letrain.pyc
972248 hpcadmin 20 0 653488 20176 11192 S 0.0 0.0 0:00.07 starter-suid
972250 hpcadmin 20 0 651824 18252 11336 S 0.0 0.0 0:00.07 starter-suid
972282 hpcadmin 20 0 17.8g 1.4g 147964 S 0.0 0.4 1:53.73 letrain.pyc
Cross-check in a Python shell: sum the per-step MaxRSS values from sstat (KB -> MB), then estimate from top's %MEM column (letrain.pyc at 1.7% + 0.3% of 386393.2 MiB on lico-C1, and 1.2% + 0.4% of 386390.0 MiB on licosecond). The numbers land in the same ballpark; they differ because MaxRSS is a historical peak per step while top shows an instantaneous snapshot.
>>> (60792 + 1315612 + 1501632 + 7084064 + 6304844) // 1024   # sstat MaxRSS sum, in MB
15885
>>> (386393.2 * 2.0 // 100) + (386390.0 * 1.6 // 100)         # top %MEM estimate, in MB
13909.0
[root@head slurm]# cat to_log.sh
#!/bin/bash
# Poll the same job with sstat and sacct every 10 s, to see
# when each command starts and stops reporting RSS data.
job_id=17
while true
do
    echo "##### $(date) #####" >> bb.log
    sstat -a --format=JobID,MaxRSS,AveRSS -j $job_id >> bb.log
    echo "###########" >> bb.log
    sacct --format=JobID,MaxRSS,AveRSS -j $job_id >> bb.log
    printf '\n\n\n\n' >> bb.log
    sleep 10
done
sstat has data from the moment the job starts running until it completes; sacct only fills in the RSS fields once the job has finished, so the two dovetail exactly.
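That handoff suggests a small helper (hypothetical) that asks sstat first and falls back to sacct once sstat goes quiet:

job_rss() {
    local jobid=$1 out
    # while the job runs, sstat has live per-step data from slurmstepd
    out=$(sstat -a -n --format=JobID,MaxRSS,AveRSS -j "$jobid" 2>/dev/null)
    if [ -n "$out" ]; then
        echo "$out"
    else
        # after completion only the accounting database answers
        sacct -n --format=JobID,MaxRSS,AveRSS -j "$jobid"
    fi
}
job_rss 17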
##### Wed Apr 28 18:37:04 CST 2021 #####
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17.batch         33380K     33364K
17.0            934964K    869876K
###########
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17
17.batch
17.0

##### Wed Apr 28 18:37:14 CST 2021 #####
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17.batch         33380K     33364K
17.0            934964K    909380K
###########
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17
17.batch
17.0

##### Wed Apr 28 18:37:24 CST 2021 #####
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17.batch         33380K     33364K
17.0            934964K    841056K
###########
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17
17.batch
17.0

##### Wed Apr 28 18:37:34 CST 2021 #####
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
###########
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17
17.batch         33380K     33380K
17.0            934964K    934964K
##### Wed Apr 28 18:37:44 CST 2021 #####
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
###########
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17
17.batch         33380K     33380K
17.0            934964K    934964K

##### Wed Apr 28 18:37:55 CST 2021 #####
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
###########
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17
17.batch         33380K     33380K
17.0            934964K    934964K

##### Wed Apr 28 18:38:05 CST 2021 #####
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
###########
       JobID     MaxRSS     AveRSS
------------ ---------- ----------
17
17.batch         33380K     33380K
17.0            934964K    934964K
seff: summarizes CPU and memory usage over the whole lifetime of a (finished) job.
https://stackoverflow.com/questions/24020420/find-out-the-cpu-time-and-memory-usage-of-a-slurm-job
https://slurm.schedmd.com/sstat.html#lbAE
[root@head slurm]# seff 7
/usr/bin/perl: symbol lookup error: /usr/lib64/slurm/accounting_storage_slurmdbd.so: undefined symbol: running_in_slurmctld
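The undefined-symbol error above looks like a version mismatch between seff's Perl bindings and the accounting_storage_slurmdbd plugin. Until that is fixed, the raw fields seff summarizes can be pulled straight from sacct:

sacct -j 7 --format=JobID,Elapsed,TotalCPU,AllocCPUS,MaxRSS,ReqMem
# CPU efficiency    ~= TotalCPU / (Elapsed * AllocCPUS)
# memory efficiency ~= MaxRSS / ReqMem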
scontrol show node
scontrol -d show job <jobid> (-d/--details adds per-node allocation detail)
[root@head-C1 ~]# scontrol show job 1373
JobId=1373 JobName=hx_tc_p_03_same_01
UserId=admin(1000) GroupId=admin(1000) MCS_label=N/A
Priority=4294901739 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:15:08 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2021-04-21T15:19:46 EligibleTime=2021-04-21T15:19:46
AccrueTime=2021-04-21T15:19:46
StartTime=2021-04-21T18:10:24 EndTime=2021-04-28T18:10:24 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-21T18:10:24
Partition=head-queue AllocNode:Sid=head:2698865
ReqNodeList=(null) ExcNodeList=(null)
NodeList=head-C1,licosecond
BatchHost=head-C1
NumNodes=2 NumCPUs=114 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=114,node=2,billing=114
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Power=
TresPerNode=gpu:1
MailUser=(null) MailType=NONE
sacct -j <jobid>: accounting records; the RSS fields are only filled in after the job has finished.
scontrol listpids
[root@head ~]# scontrol listpids
PID JOBID STEPID LOCALID GLOBALID
708542 7 batch 0 0
708556 7 batch - -
708559 7 batch - -
708562 7 batch - -
708574 7 0 0 0
708583 7 0 - -
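Combining listpids with ps gives the instantaneous RSS of every process in the job on that node (a sketch; run it on the compute node where the tasks live):

jobid=7
pids=$(scontrol listpids $jobid | awk 'NR>1 {print $1}' | paste -sd,)
ps -o pid,rss,comm -p "$pids"    # RSS column is in KB
ps -o rss= -p "$pids" | awk '{kb += $1} END {printf "total RSS: %.1f MB\n", kb/1024}'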