SLURM Cluster Configuration on Azure (Part III)

This is the third part of the tutorial on installing and configuring SLURM on Azure (part I, part II). In this post we complete the process and show an example of running a task.

The Simple Linux Utility for Resource Management (SLURM) is an open-source workload manager used in many clusters around the world, for example at “Mare Nostrum”. It provides three key functions:

  • Resource management: constraints, limits and information.
  • Task monitoring.
  • Queue management.

SLURM is highly modular and supports many plugins. The setup we have chosen includes:

  • Authentication among nodes using munge.
  • Accounting and job completion monitoring using MySQL and the interface provided by slurmdbd.
  • The SLURM controller daemon (slurmctld) and the compute node daemon (slurmd).

We are going to start with Munge, which allows the nodes to authenticate each other.

Munge installation

We just have to run the following command to install Munge on each node (controller and compute nodes):

jjorge@XXX:~$ sudo apt-get install \ 
   libmunge-dev libmunge2 munge

After installing it, we should create a new key that will later be shared among the nodes. Several mechanisms can be used; one of them is the dd command, as follows on the controller node:

jjorge@controller:~$ sudo su

root@controller:# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
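
Alternatively, the munge package on Ubuntu ships a small helper that generates the key in place; the path below is where the Debian/Ubuntu packages usually install it, so check that it exists on your system before relying on it:

root@controller:# /usr/sbin/create-munge-key   # alternative to dd; script location may vary by distribution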

This generates the key that will be shared among the nodes. We also need to adjust the key’s permissions:

root@controller:# chown munge:munge /etc/munge/munge.key
root@controller:# chmod 400 /etc/munge/munge.key

Finally, we enable the daemon and start the service.

root@controller:# systemctl enable munge
root@controller:# systemctl start munge

We can check the correctness of this process with the following:

root@controller:# munge -n | unmunge | grep STATUS 
 STATUS:           Success (0)

After doing this, we should copy the key to every compute node, compute0 and compute1 in this case:

root@controller:# scp /etc/munge/munge.key \ 
 compute0:/etc/munge/munge.key
root@controller:# scp /etc/munge/munge.key \ 
 compute1:/etc/munge/munge.key

We have to change permissions accordingly on each node:

root@computeX:# chown munge:munge \ 
  /etc/munge/munge.key
root@computeX:# chmod 400 /etc/munge/munge.key
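
As an optional sanity check, we can verify that a credential created on the controller is accepted on a compute node; this remote test is described in the Munge documentation and assumes SSH access to the node (compute0 is just the example node used here):

jjorge@controller:~$ munge -n | ssh compute0 unmunge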

Database installation

The next step is to install and configure, on the controller node, the MySQL server that will store the SLURM accounting database. We need the following packages:

jjorge@controller:~$ sudo apt-get install ruby ruby-dev \
 python-dev libpam0g-dev libmariadb-client-lgpl-dev \
 libmysqlclient-dev 
jjorge@controller:~$ sudo apt-get install mysql-server
jjorge@controller:~$ sudo apt-get install libmysqld-dev \
 mariadb-server

During these steps a password may be requested; make sure you remember it, since it is the password for the database administrator. When the installation is complete, we can access the MySQL server with the following command (if a password is requested, just press Enter):

~$ sudo mysql -u root

Inside the MySQL client, we should run the following script:

create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = 
 password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 
 'slurm'@'localhost';
flush privileges;
exit

This creates the database “slurm_acct_db” and the user “slurm” with the password “slurmdbpass”. The “grant” statements give this user privileges over the database.
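
If we want to double-check the result, MySQL can list the privileges we have just granted; this is optional and only shown as a quick sanity check:

jjorge@controller:~$ sudo mysql -u root -e "SHOW GRANTS FOR 'slurm'@'localhost';"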

After configuring Munge and MySQL, from this point onwards we work on the controller node to configure the SLURM controller, and later the SLURM daemon on the rest of the nodes in the cluster.

Controller node

Now we are ready to install SLURM. Since we are using the “Ubuntu 16.04 Server” distribution and the packages required for SLURM are in its repositories, we just have to install them with apt-get:

jjorge@controller:~$ sudo apt-get install slurm \
  slurm-client slurm-wlm slurm-wlm-basic-plugins \
  slurmctld slurmd slurmdbd

We can find more information about SLURM in the following paths:

  • /usr/share/doc/slurmctld
  • /usr/share/doc/slurmd
  • /usr/share/doc/slurmdbd

SLURM requires some configuration files that are not created by the installation process. The first one is slurm.conf, which configures the node names, the authentication parameters and so on. An example that can be adapted is provided in the Appendix. The part that deserves most attention in this case is the section defining nodes and partitions:

# COMPUTE NODES
NodeName=compute[0-1] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=compute[0-1] Default=YES MaxTime=INFINITE State=UP

These lines define the compute nodes we have as well as the partitions. For this example, we have created a single partition called debug containing the two compute nodes.
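
If you are unsure which hardware values to put in the NodeName line, slurmd can print the configuration it detects on a node with the -C flag; the exact output depends on the SLURM version, so the second line below is only illustrative:

jjorge@compute0:~$ slurmd -C
NodeName=compute0 CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 ...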

Another required file is the one holding the database settings, slurmdbd.conf, which is also provided in the Appendix.

Both files should be stored as /etc/slurm-llnl/slurm.conf and /etc/slurm-llnl/slurmdbd.conf.
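
Since slurmdbd.conf contains the database password, it is advisable to restrict its permissions; the slurmdbd.conf man page recommends making the file readable only by the slurm user:

root@controller:# chown slurm:slurm /etc/slurm-llnl/slurmdbd.conf
root@controller:# chmod 600 /etc/slurm-llnl/slurmdbd.conf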

With these last steps we have configured all the files required on the controller node to use SLURM. Now, we have to register the new cluster in the database. First, we make sure that MySQL and slurmdbd are running:

jjorge@controller:~$ sudo /etc/init.d/mysql start
jjorge@controller:~$ sudo /etc/init.d/slurmdbd start
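
If we want both services to start automatically after a reboot, we can also enable them; the unit names below are the ones shipped by the Ubuntu packages, so adjust them if yours differ:

jjorge@controller:~$ sudo systemctl enable mysql
jjorge@controller:~$ sudo systemctl enable slurmdbd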

Then, we can access the database to create the required entities:

jjorge@controller:~$ mysql -u slurm -p slurm_acct_db

Regarding these entities, we are going to create the cluster, an account (or role), and a new user within that account who is also an administrator.

jjorge@controller:~$ sudo sacctmgr add cluster slurmcluster
jjorge@controller:~$ sudo sacctmgr -i add account \
 researcher Description="Researcher" \
 Organization="ResearchGroup"
jjorge@controller:~$ sudo sacctmgr -i create user \
 jjorge account=researcher adminlevel=Administrator \
 partition=debug

We can show the user’s information by means of the following command:

jjorge@controller:~$ sudo sacctmgr show user name=jjorge
      User   Def Acct     Admin 
---------- ---------- --------- 
    jjorge researcher Administ+

We can check the status of the slurmctld daemon with:

jjorge@controller:~$ sudo /etc/init.d/slurmctld status
 slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: active (running) since Tue 2018-01-02 15:36:09 UTC; 6 days ago
 Main PID: 1386 (slurmctld)
    Tasks: 9
   Memory: 9.2M
      CPU: 2min 9.986s
   CGroup: /system.slice/slurmctld.service
           └─1386 /usr/sbin/slurmctld

If it is not running, we start it:

jjorge@controller:~$ sudo /etc/init.d/slurmctld start
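
If we also want the controller daemon to start automatically at boot, we can enable its systemd unit (the slurmctld.service unit is visible in the status output above):

jjorge@controller:~$ sudo systemctl enable slurmctld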

Compute Node

Since Munge is already installed on every node and the key has been shared, we just have to install the slurmd daemon on the compute nodes.

jjorge@compute:~$ sudo apt-get install slurmd

We then copy the configuration file from the controller:

root@controller:/home/jjorge# scp \
 /etc/slurm-llnl/slurm.conf \
 compute0:/etc/slurm-llnl/slurm.conf 
root@controller:/home/jjorge# scp \
 /etc/slurm-llnl/slurm.conf \
 compute1:/etc/slurm-llnl/slurm.conf

And then we enable the service and reboot the nodes:

jjorge@computeX:~$ sudo systemctl enable slurmd
jjorge@computeX:~$ sudo systemctl start slurmd
jjorge@XXXX:~$ sudo reboot

After rebooting, it may take some time for the controller to detect the nodes. To force them back into service, we can run these commands on the controller:

jjorge@controller:~$ scontrol update NodeName=compute0 State=resume
jjorge@controller:~$ scontrol update NodeName=compute1 State=resume
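
If a node remains down or drained after this, sinfo can list the reason recorded by the controller for nodes in those states:

jjorge@controller:~$ sinfo -R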

After this, we should see both compute nodes as idle.

jjorge@controller:~$ sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle compute[0-1]

As a simple task, we could run the hostname command on both nodes:

jjorge@controller:~$ srun -N 2 hostname

compute0
compute1

Example task

In this section, we will run a task using the command sbatch. The script that will be launched is the following:

#!/bin/bash
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --job-name="Medium" 
cd /nfs/jjorge/stream
./memtransf > out_"$HOSTNAME".txt

Here memtransf is a program that measures memory transfer rates in MB/s for simple computational kernels coded in C, adapted from here. It is just an example of a task that takes longer than the simple commands above.

The launching command is the following:

jjorge@controller:/nfs/jjorge/stream$ sbatch medium.job
Submitted batch job 58

jjorge@controller:/nfs/jjorge/stream$ sbatch medium.job
Submitted batch job 59

The result of launching this script can be monitored, for example, with sacct:

jjorge@controller:/nfs/jjorge/stream$ sacct

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
          57   hostname      debug researcher          2  COMPLETED      0:0 
          58     Medium      debug researcher          1    RUNNING      0:0 
          59     Medium      debug researcher          1    RUNNING      0:0 
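
sacct also accepts a --format option to choose the columns shown; for example, the following selects the job identifier, name, elapsed time, node list and state (all standard sacct fields):

jjorge@controller:/nfs/jjorge/stream$ sacct --format=JobID,JobName,Elapsed,NodeList,State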

The command squeue also provides related information:

jjorge@controller:/nfs/jjorge/stream$ squeue 

JOBID PARTITION NAME USER ST TIME  NODES NODELIST(REASON)
58    debug   Medium   jjorge  R       0:02      1 compute0
59    debug   Medium   jjorge  R       0:01      1 compute1

Again, sinfo provides information about the status of the nodes when given some flags, as shown below:

jjorge@controller:/nfs/jjorge/stream$ sinfo -Nel

Tue Jan  9 10:22:50 2018

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK ..
compute0       1    debug*   allocated    1    1:1:1      1       ...
compute1       1    debug*   allocated    1    1:1:1      1       ...

When the tasks are finished, we can access the results from any node thanks to NFS, for example from the controller:

jjorge@controller:/nfs/jjorge/stream$ ls -l
total 92
-rw-rw-r-- 1 jjorge jjorge   198 Jan  9 10:18 medium.job
-rwxrwxr-x 1 jjorge jjorge 17912 Jan  9 10:18 memtransf
-rw-rw-r-- 1 jjorge jjorge  1764 Jan  9 10:23 out_comp0.txt
-rw-rw-r-- 1 jjorge jjorge  1764 Jan  9 10:23 out_comp1.txt
-rw-rw-r-- 1 jjorge jjorge     0 Jan  9 10:14 slurm-58.out
-rw-rw-r-- 1 jjorge jjorge     0 Jan  9 10:14 slurm-59.out

 

Conclusions

In this work we have deployed a cluster on Azure, Microsoft’s cloud platform. The cluster offers a network file system to share directories among the nodes, and job and resource management through SLURM, a state-of-the-art tool that is fully functional and extensible.

Regarding Azure’s functionality, we have explored the deployment of instances, the networking system underneath, template deployment, the powerful feature of creating images to deploy and contextualize VMs, and a brief interaction with the different scripting options: Azure CLI and PowerShell. Sadly, due to the limitations of a free account, we only used these scripting tools in a limited manner.

Appendix

The slurm.conf file.

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
ControlMachine=controller
#ControlAddr=
#BackupController=
#BackupAddr=
#AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=localhost
AccountingStorageLoc=slurm_acct_db
AccountingStoragePass=/var/run/munge/munge.socket.2
#AccountingStoragePort=
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=slurmdbpass
#JobCompPort=
JobCompType=jobcomp/slurmdbd
JobCompUser=slurm
#JobContainerPlugin=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
# COMPUTE NODES
NodeName=compute[0-1] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=compute[0-1] Default=YES MaxTime=INFINITE State=UP

And the slurmdbd.conf file:

# Example slurmdbd.conf file.
# See the slurmdbd.conf man page for more information.
#
#Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
#Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info 
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
StorageHost=localhost
#StoragePort=1234
StoragePass=slurmdbpass
StorageUser=slurm
StorageLoc=slurm_acct_db