I. Installing Slurm
Steps
- Log in to the server as root using PuTTY.
- Install the Slurm server packages:
yum install -y slurm-ohpc slurm-slurmctld-ohpc slurm-slurmdbd-ohpc
- Open the configuration file:
vi /etc/slurm.conf
- Press "i" to enter insert mode and add the configuration described under "Installing Slurm" in the Slurm configuration reference.
- Press "Esc", type :wq!, then press "Enter" to save and exit.
- Start the controller daemon:
systemctl start slurmctld
systemctl enable slurmctld
- Install the Slurm client package:
yum install -y slurm-slurmd-ohpc
- Edit /etc/slurm.conf on the client node in the same way, then start the client daemon:
systemctl enable slurmd
systemctl start slurmd
Reference: kunpenghpcs_instg.pdf
Link: https://support.huaweicloud.com/instg-kunpenghpcs/kunpengopenhpc_03_0012.html
II. Installing Slurm on CentOS
Official documentation: Slurm official documentation
Reference: CentOS 7 Slurm installation guide
Reference: Huawei Cloud - installing Slurm
Basics (rules of thumb)
- Packages named devel or libs generally need to be installed:
slurm-devel.x86_64
slurm-libs.x86_64
- pam in a package name means it is for authentication; optional:
slurm-pam_slurm.x86_64
- The bare package name usually contains the user-facing commands:
slurm.x86_64 --- normally only needed on machines users log in to and operate from, such as the management node
- Names ending in d are daemons:
slurm-slurmctld --- for the management node
slurm-slurmd.x86_64 --- for compute nodes
slurm-slurmdbd.x86_64 --- optional, for the management node; it records all job activity for auditing and billing
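The rules of thumb above can be sketched as a small role-to-package map; the lists below just restate this section's package names and should be trimmed to your own repo and versions:

```shell
#!/bin/sh
# Map a node role to the Slurm packages it needs, following the
# rules of thumb above. Adjust the lists for your own repo/versions.
slurm_packages_for_role() {
  case "$1" in
    manage)  echo "slurm slurm-slurmctld slurm-slurmdbd slurm-devel slurm-libs" ;;
    compute) echo "slurm-slurmd slurm-libs slurm-pmi-devel slurm-perlapi" ;;
    *)       echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# e.g. yum install -y $(slurm_packages_for_role compute)
slurm_packages_for_role manage
slurm_packages_for_role compute
```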
1. Disable the firewall and SELinux (safety first)
SELinux
Security-Enhanced Linux (SELinux) is a Linux kernel module and security subsystem. Its main job is to minimize the resources that service processes can access (the principle of least privilege).
Firewall
A firewall is a key safeguard for server security: it filters network traffic according to the communication patterns of the services behind it.
By what they protect, firewalls fall into host firewalls and network firewalls. A host firewall is software deployed on a single computer and protects that host alone. A network firewall is a device (or set of devices) placed between two networks and protects a whole network; it is usually deployed at the network boundary to enforce access control, dividing the network into trusted and untrusted zones and filtering inbound and outbound traffic to protect the trusted side.
vim /etc/selinux/config
# set
SELINUX=disabled
# stop the firewall and disable autostart
systemctl stop firewalld
systemctl disable firewalld
# reboot for the change to take effect
reboot
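The manual vim edit above can also be done non-interactively with sed. A sketch, run against a scratch copy so it is safe to try anywhere; on a real node point `conf` at /etc/selinux/config, run as root, and reboot afterwards:

```shell
#!/bin/sh
# Disable SELinux by rewriting the SELINUX= line in place.
# A scratch copy stands in for /etc/selinux/config here.
conf=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$conf"

sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$conf"

grep '^SELINUX=' "$conf"   # SELINUX=disabled
```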
2. Install the EPEL repo, axel, and yum-axelget
yum install epel-release -y
yum install axel yum-axelget -y
3. Install and configure NTP
yum install ntp -y
systemctl enable ntpd.service
ntpdate ntp.fudan.edu.cn
systemctl start ntpd
timedatectl set-timezone Asia/Shanghai
4. Install munge
- munge.key must be identical on every machine
- munge's UserID and GroupID must be identical on every machine
yum install munge munge-libs munge-devel -y
# Before the steps below, the management node should copy munge.key to all compute nodes
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl start munge
systemctl enable munge
# test the munge service
munge -n
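What actually matters here is that munge.key is byte-identical on every node. A sketch of that check, simulated with local stand-in files so it runs anywhere; on a real cluster compare `sha256sum /etc/munge/munge.key` output across nodes (e.g. over ssh):

```shell
#!/bin/sh
# Compare two copies of a key by checksum. The temp files are local
# stand-ins for /etc/munge/munge.key on two different nodes.
key_a=$(mktemp); key_b=$(mktemp)
head -c 1024 /dev/urandom > "$key_a"   # stand-in for the real 1 KiB key
cp "$key_a" "$key_b"

sum_a=$(sha256sum < "$key_a")
sum_b=$(sha256sum < "$key_b")
if [ "$sum_a" = "$sum_b" ]; then
  echo "keys match"
else
  echo "keys differ"
fi
```

On a running cluster, `munge -n | ssh compute-node02 unmunge` is the end-to-end test: decoding succeeds only when the key and the munge UID/GID match on both ends.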
5. Install the Slurm dependencies
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y --skip-broken
6. Install Slurm on the compute nodes
yum install slurm-pmi-devel -y
yum install slurm-slurmd -y
yum install slurm-libs -y
yum install slurm -y
yum install slurm-perlapi -y
7. Configure slurm.conf and hosts (edit the config on the manage node)
# Log in to an already-configured node, e.g. compute node 2, and use rsync to sync compute node 4
rsync -avP /etc/slurm/slurm.conf root@192.192.192.4:/etc/slurm/slurm.conf
rsync -avP /etc/hosts root@192.192.192.4:/etc/hosts
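With a dozen or more compute nodes, the per-node rsync calls are easier as a loop. A sketch that only prints the commands (drop the echo to actually sync; node names follow this document's compute-node01..12 convention):

```shell
#!/bin/sh
# Print the rsync commands for every compute node; remove the echo to
# run them for real. seq -w zero-pads so names match compute-node01..12.
for i in $(seq -w 1 12); do
  echo rsync -avP /etc/slurm/slurm.conf "root@compute-node$i:/etc/slurm/slurm.conf"
  echo rsync -avP /etc/hosts "root@compute-node$i:/etc/hosts"
done
```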
8. Start the slurmd service on the compute nodes
systemctl start slurmd.service
systemctl status slurmd.service
systemctl enable slurmd.service
9. Test Slurm
```
srun -N4 hostname
```
If the installation succeeded, the output looks like this:
```
compute-node05
compute-node04
compute-node03
compute-node02
```
### Supplement
#### slurm.conf contents
Pay particular attention to **ControlAddr and the last line**.
```
# slurm.conf file generated by configurator.
# See the slurm.conf man page for more information.
#
ControlMachine=manage-node
ControlAddr=192.192.192.254
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/true
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=pmix
#MpiParams=ports=#-
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=root
SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=compute-node[01-12] CPUs=40 State=UNKNOWN
PartitionName=debug Nodes=compute-node[01-12] Default=YES MaxTime=INFINITE State=UP
```
#### hosts contents
```
127.0.0.1 localhost
# Management Node
192.192.192.254 manage-node
# Infiniband
10.10.10.254 ib-manage-node
# Storage Node
192.192.192.35 pgx-storage-system
# Storage Infiniband
10.10.10.35 ib-pgx-storage-system
# Compute Node Management
192.192.192.1 compute-node01
192.192.192.2 compute-node02
192.192.192.3 compute-node03
192.192.192.4 compute-node04
192.192.192.5 compute-node05
192.192.192.6 compute-node06
192.192.192.7 compute-node07
192.192.192.8 compute-node08
192.192.192.9 compute-node09
192.192.192.10 compute-node10
192.192.192.11 compute-node11
192.192.192.12 compute-node12
192.192.192.13 compute-node13
192.192.192.14 compute-node14
192.192.192.15 compute-node15
192.192.192.16 compute-node16
192.192.192.17 compute-node17
# Compute Node Infiniband
10.10.10.1 ib-compute-node01
10.10.10.2 ib-compute-node02
10.10.10.3 ib-compute-node03
10.10.10.4 ib-compute-node04
10.10.10.5 ib-compute-node05
10.10.10.6 ib-compute-node06
10.10.10.7 ib-compute-node07
10.10.10.8 ib-compute-node08
10.10.10.9 ib-compute-node09
10.10.10.10 ib-compute-node10
10.10.10.11 ib-compute-node11
10.10.10.12 ib-compute-node12
10.10.10.13 ib-compute-node13
10.10.10.14 ib-compute-node14
10.10.10.15 ib-compute-node15
10.10.10.16 ib-compute-node16
10.10.10.17 ib-compute-node17
```
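The two settings flagged above (ControlAddr and the final NodeName/PartitionName lines) can be sanity-checked with grep. A self-contained sketch run against a local stand-in copy; on the management node point `conf` at /etc/slurm/slurm.conf instead:

```shell
#!/bin/sh
# Grep the settings that most often break a fresh cluster: where the
# controller lives, and which nodes/partition are defined.
conf=$(mktemp)
cat > "$conf" <<'EOF'
ControlMachine=manage-node
ControlAddr=192.192.192.254
NodeName=compute-node[01-12] CPUs=40 State=UNKNOWN
PartitionName=debug Nodes=compute-node[01-12] Default=YES MaxTime=INFINITE State=UP
EOF

grep -E '^(ControlMachine|ControlAddr)=' "$conf"
grep -E '^(NodeName|PartitionName)=' "$conf"
```

On a live system, `scontrol show config` reports the values the running slurmctld actually loaded, which is the more authoritative check.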