I. Slurm Installation

Procedure

1. Log in to the server as the root user (e.g. with PuTTY).
2. Install the Slurm server packages:

   yum install -y slurm-ohpc slurm-slurmctld-ohpc slurm-slurmdbd-ohpc

3. Edit the configuration file: run vi /etc/slurm.conf, press "i" to enter insert mode, add the required settings, then press "Esc", type :wq!, and press "Enter" to save and exit.
4. Start the server daemon:

   systemctl start slurmctld
   systemctl enable slurmctld

5. Install the Slurm client package:

   yum install -y slurm-slurmd-ohpc

6. Edit /etc/slurm.conf on the client in the same way (vi, "i", add the settings, "Esc", :wq!, "Enter"), then start the client daemon:

   systemctl enable slurmd
   systemctl start slurmd

7. For the settings themselves, see "Installing Slurm" in the Slurm configuration reference.
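
Once both daemons are up, a quick sanity check might look like this (assuming slurm.conf has been populated per the configuration reference; sinfo only answers once the controller has a valid config):

```bash
systemctl status slurmctld   # on the server
systemctl status slurmd      # on each client
sinfo                        # ask the controller for partition/node state
```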


Reference: kunpenghpcs_instg.pdf (https://support.huaweicloud.com/instg-kunpenghpcs/kunpengopenhpc_03_0012.html)

II. Installing Slurm on CentOS

Official docs: the Slurm official documentation
Reference: CentOS 7 Slurm installation
Reference: Huawei Cloud, installing Slurm

Basics (rules of thumb; a per-role install sketch follows the list)

1. Packages whose names contain devel or libs generally need to be installed:
   - slurm-devel.x86_64
   - slurm-libs.x86_64
2. Packages with pam in the name are for authentication and are optional:
   - slurm-pam_slurm.x86_64
3. The bare package name carries the user-facing commands, so it is usually only needed on machines users can log in to, such as the management node:
   - slurm.x86_64
4. Names ending in "d" are daemons:
   - slurm-slurmctld --- for the management node
   - slurm-slurmd.x86_64 --- for the compute nodes
   - slurm-slurmdbd.x86_64 --- optional, for the management node; records all jobs and operations, for auditing and billing
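
Putting these rules together, the per-role installs might look like the sketch below (package names are the ones listed above; exact names vary by repo, e.g. the OpenHPC builds in Part I carry an -ohpc suffix):

```bash
# Management node: user commands, controller daemon,
# optional accounting daemon, plus devel/libs
yum install -y slurm slurm-slurmctld slurm-slurmdbd slurm-devel slurm-libs

# Compute nodes: the node daemon and the shared libraries
yum install -y slurm-slurmd slurm-libs

# Optional: PAM module for authentication-related access control
yum install -y slurm-pam_slurm
```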

1. Disable the firewall and SELinux (security first)

SELinux

Security-Enhanced Linux (SELinux) is a Linux kernel module and one of Linux's security subsystems. Its main purpose is to minimize the resources that service processes can access (the principle of least privilege).

Firewall

A firewall is a key safeguard for server security: it filters network traffic according to the communication patterns of the services it protects. By protection target, firewalls divide into host firewalls and network firewalls. A host firewall is software deployed on a single computer to protect that host. A network firewall is a device, or set of devices, deployed between two networks to protect a whole network; it usually sits at the network boundary to strengthen access control, dividing the network into trusted and untrusted zones and filtering inbound and outbound traffic to protect the trusted side.

vim /etc/selinux/config
# set
SELINUX=disabled
# stop the firewall and disable it at boot
systemctl stop firewalld
systemctl disable firewalld
# reboot so the SELinux change takes effect
reboot
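
After the reboot, it is worth confirming both are really off:

```bash
getenforce                     # should print "Disabled"
systemctl is-active firewalld  # should print "inactive"
```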

2. Install the EPEL repo, axel, and yum-axelget

yum install epel-release -y 
yum install axel yum-axelget -y

3. Install and configure the NTP service

yum install ntp -y
systemctl enable ntpd.service
ntpdate ntp.fudan.edu.cn
systemctl start ntpd
timedatectl set-timezone Asia/Shanghai
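
To confirm time sync is working (unsynchronized clocks will later cause munge credential errors):

```bash
ntpq -p       # list NTP peers and sync status
timedatectl   # confirm the timezone is Asia/Shanghai
```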

4. Install munge

  1. The munge.key file must be identical on every machine.
  2. The UserID and GroupID of the munge user must be identical on every machine (a cross-node check is sketched after this block).
    yum install munge munge-libs munge-devel -y
    # Before the steps below, the manage node should copy munge.key to all compute nodes
    chown -R munge: /etc/munge/ /var/log/munge/
    chmod 0700 /etc/munge/ /var/log/munge/
    systemctl start munge
    systemctl enable munge
    # Test the munge service
    munge -n
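
Beyond the local munge -n test, a credential generated on one node should decode on every other node; if the keys, UIDs/GIDs, or clocks differ, unmunge reports an error. A quick cross-node check (compute-node02 is just an example host):

```bash
# Decode a locally generated credential
munge -n | unmunge

# Decode on a remote node -- fails if munge.key or UID/GID differ
munge -n | ssh compute-node02 unmunge

# Stress-test credential round-trips
remunge
```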
    

5. Install the Slurm dependency packages

    yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y --skip-broken
    

6. Install Slurm on the compute nodes

    yum install slurm-pmi-devel -y
    yum install slurm-slurmd -y
    yum install slurm-libs -y
    yum install slurm -y
    yum install slurm-perlapi -y
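
A quick way to confirm the packages actually landed:

```bash
rpm -qa | grep slurm   # list the installed Slurm packages
slurmd -V              # print the installed slurmd version
```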
    

7. Configure slurm.conf and hosts (edit the config on the manage node)

    # Log in to an already-configured node (e.g. compute node 2) and use rsync to sync compute node 4
    rsync -avP /etc/slurm/slurm.conf root@192.192.192.4:/etc/slurm/slurm.conf
    rsync -avP /etc/hosts  root@192.192.192.4:/etc/hosts
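
With 17 compute nodes, syncing one target at a time gets tedious; a small loop over the hostnames defined in the hosts file below does the same thing (assumes root SSH access to every node):

```bash
for i in $(seq -w 1 17); do
    rsync -avP /etc/slurm/slurm.conf root@compute-node$i:/etc/slurm/slurm.conf
    rsync -avP /etc/hosts            root@compute-node$i:/etc/hosts
done
```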
    

8. Start the slurmd service on the compute nodes

    systemctl start slurmd.service
    systemctl status slurmd.service
    systemctl enable slurmd.service
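
With slurmd running on the compute nodes (and slurmctld on the manage node), the nodes should register with the controller; from the manage node:

```bash
sinfo                              # partitions and node states
scontrol show node compute-node02  # details for a single node
```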
    

9. Test Slurm

```bash
srun -N4 hostname
```

If the installation succeeded, the output looks like this:

```
compute-node05
compute-node04
compute-node03
compute-node02
```

### Supplement

#### slurm.conf contents

Pay particular attention to **ControlAddr and the final NodeName/PartitionName lines**.

```
#
# See the slurm.conf man page for more information.
#
ControlMachine=manage-node
ControlAddr=192.192.192.254
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/true
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=pmix
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=root
SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=compute-node[01-12] CPUs=40 State=UNKNOWN
PartitionName=debug Nodes=compute-node[01-12] Default=YES MaxTime=INFINITE State=UP
```
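
The CPUs=40 in the NodeName line has to match the compute nodes' actual hardware. A quick way to get the right values is to ask slurmd itself on a compute node; it prints a ready-made NodeName line in slurm.conf syntax:

```bash
# Print this node's detected configuration in slurm.conf syntax
slurmd -C
```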

#### hosts contents

```
127.0.0.1 localhost

# Management Node
192.192.192.254 manage-node

# Infiniband
10.10.10.254 ib-manage-node

# Storage Node
192.192.192.35 pgx-storage-system

# Storage Infiniband
10.10.10.35 ib-pgx-storage-system

# Compute Node Management
192.192.192.1 compute-node01
192.192.192.2 compute-node02
192.192.192.3 compute-node03
192.192.192.4 compute-node04
192.192.192.5 compute-node05
192.192.192.6 compute-node06
192.192.192.7 compute-node07
192.192.192.8 compute-node08
192.192.192.9 compute-node09
192.192.192.10 compute-node10
192.192.192.11 compute-node11
192.192.192.12 compute-node12
192.192.192.13 compute-node13
192.192.192.14 compute-node14
192.192.192.15 compute-node15
192.192.192.16 compute-node16
192.192.192.17 compute-node17

# Compute Node Infiniband
10.10.10.1 ib-compute-node01
10.10.10.2 ib-compute-node02
10.10.10.3 ib-compute-node03
10.10.10.4 ib-compute-node04
10.10.10.5 ib-compute-node05
10.10.10.6 ib-compute-node06
10.10.10.7 ib-compute-node07
10.10.10.8 ib-compute-node08
10.10.10.9 ib-compute-node09
10.10.10.10 ib-compute-node10
10.10.10.11 ib-compute-node11
10.10.10.12 ib-compute-node12
10.10.10.13 ib-compute-node13
10.10.10.14 ib-compute-node14
10.10.10.15 ib-compute-node15
10.10.10.16 ib-compute-node16
10.10.10.17 ib-compute-node17
```
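
Whether a given node actually resolves these names from /etc/hosts can be checked with getent:

```bash
getent hosts compute-node04 ib-compute-node04   # should print the entries above
```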