SLURM Cluster Configuration on Azure (Part I)

I finally got some free time to share this project: the deployment of a workload manager to ease the management of my research group’s GPU cluster.

In this project, we are going to deploy an experimental cluster on Microsoft’s cloud platform, Azure. The requirements for this cluster are basically a queue-based job management system and shared directories that allow every node in the cluster to access the same files.

We will evaluate two tools covering these functionalities: the Network File System (NFS) protocol for sharing files and directories, and the job management system “Simple Linux Utility for Resource Management” (SLURM). This cloud deployment will be the workbench to practice with these tools before using them in a real environment.

This document aims to be a tutorial of the whole process, from the creation of the Virtual Machines (VMs) to the configuration of all the required elements, concluding with an example of a task running on the cluster. In the last part, we will also include an option to easily extend the size of the cluster. I’ll split this post into three different parts: Deployment, NFS installation and SLURM installation. Some steps could have been automated, but I did them manually to illustrate the whole process.

System architecture

First of all, we need to deploy the instances that we will require in Azure; the following section describes the steps for doing so. The structure of the system that we will be deploying throughout this tutorial is represented in the following figure:

System configuration: Different worker nodes, the controller and the NFS server.

Creating and deploying an instance on Azure

The information required to create the controller is summarized in the following screenshots:

The main window of the menu “Virtual machines”.
Pick the instance based on “Ubuntu Server 16.04 LTS”.
Parameter configuration for the controller node.
Menu to pick the instance’s size; pick the one named “A1”.
Establishing network settings and different options for monitoring.
EULA signing step; we can get the template for this VM by going to the templates’ window and clicking on “Download template and parameters”.

Finally, this image shows the last step that allows us to export the template containing all the required files to deploy similar instances later.

Click on “Download” to download the template; this will provide a compressed file with all the required files to deploy a new instance later.
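As an alternative to the portal wizard, a similar VM can also be created from the command line with the Azure CLI. This is only a sketch: the resource group name (slurm-rg) and the location are placeholders, and the portal steps above remain the reference procedure.

# Hypothetical example: create a resource group and the controller VM with the Azure CLI.
az group create --name slurm-rg --location westeurope
az vm create \
    --resource-group slurm-rg \
    --name controller \
    --image Canonical:UbuntuServer:16.04-LTS:latest \
    --size Standard_A1 \
    --admin-username jjorge \
    --generate-ssh-keys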

Deploying with templates

Now that we have the template, the rest of the nodes will be deployed using it. We click on the “+ New” button and type “Template deployment” so that this option shows up in the search results, then choose it from the drop-down menu. The following screenshot shows the interface that we get after clicking on it.

Clicking on the “+New” button leads to the search bar where we have to input “Template deployment” to get to this section.

After clicking on “Create”, we have to choose “Create a Linux Virtual Machine” in the interface that is represented in the image below:

As we are creating all Linux nodes, we have to click on “Create a Linux virtual machine”.

Now, we should load both template.json and parameters.json from the compressed file that we downloaded before. The following screenshots illustrate these steps.

This is the main menu to deploy from templates; we should click on “Edit template” to load our template and then “Edit parameters” to load the parameters that we will adjust later.
From the “template.zip” that we downloaded before, we pick the “template.json” file to load the template.
We pick the “parameters.json” after clicking on “Edit parameters” → “Load file”.

Once we have uploaded these files, we should change some parameters for each instance, e.g. the hostname, network settings, and so on. The next image shows the interface where we should change these values.

We should change the virtual machine name, the public IP name and the network interface name for each new node, using compute0, compute1 and nas respectively.
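As a side note, the exported template can also be deployed non-interactively with the Azure CLI, overriding the per-node values on the command line. This is just a sketch: the resource group name (slurm-rg) is the placeholder used earlier and the parameter name (virtualMachineName) depends on what the exported template actually defines, so check parameters.json first.

# Hypothetical example: deploy compute0 from the exported template files.
az deployment group create \
    --resource-group slurm-rg \
    --template-file template.json \
    --parameters @parameters.json \
    --parameters virtualMachineName=compute0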

With these steps, we should configure the following nodes (creating them with the first template):

  • controller: The main node that will manage the SLURM cluster (already deployed).
  • compute0 and compute1: Compute nodes in the cluster.
  • nasnode: The node that stores the shared files.

Due to restrictions on the number of nodes per region, we have created these four, the upper limit that the free account provides. After creating them, we should see the nodes up and running in the Azure interface, as shown below. Once this step is completed, we can move on to the configuration of the network.

Deployed VM ready to be configured.
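The same check can also be done from the CLI, listing each VM together with its power state and public IP (again assuming the placeholder resource group slurm-rg):

az vm list --resource-group slurm-rg --show-details --output table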

Setting up the communication

First of all, as the nodes should be able to reach each other by name, we have to include the name and IP address of every node (these can be checked in the “Network” tab in Azure) in the /etc/hosts file of each machine, with something similar to the following:

127.0.0.1 localhost
10.0.0.4 controller
10.0.0.5 compute0
10.0.0.6 compute1
10.0.0.8 nas
#---REST OF THE FILE
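A quick way to verify that the new entries are picked up is to resolve one of the names with getent, which uses the same name resolution path as the rest of the system:

user@controller:~$ getent hosts compute0
10.0.0.5        compute0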

We can check that they can reach each other by pinging from one node to another.

user@controller:~$ ping compute0
PING compute0 (10.0.0.5) 56(84) bytes of data.
64 bytes from compute0 (10.0.0.5): icmp_seq=1 ...
64 bytes from compute0 (10.0.0.5): icmp_seq=2 ...
64 bytes from compute0 (10.0.0.5): icmp_seq=3 ...

Configuring root user

We should set a password for root on all the nodes, then create a public key for the controller’s root user and share it among the nodes.

jjorge@controller:~$ sudo su
root@controller:# cd
root@controller:# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

Then, edit the file sshd_config:

root@controller:# sudo vi /etc/ssh/sshd_config

# ...Rest of the file...

# Replace this line:
# PermitRootLogin prohibit-password
# with this one:
PermitRootLogin yes

# ...Rest of the file...
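After saving the file, the SSH service has to be restarted on that node so the new setting takes effect (on Ubuntu 16.04 the service is called ssh):

root@controller:# systemctl restart ssh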

We have to do this on every node. After this, we generate the key pair on the controller and copy the public key to the other nodes.

root@controller:# ssh-keygen
root@controller:# cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
root@controller:# ssh nas "mkdir /root/.ssh"
root@controller:# scp /root/.ssh/id_rsa.pub nas:/root/.ssh/authorized_keys
root@controller:# ssh compute0 "mkdir /root/.ssh"
root@controller:# scp /root/.ssh/id_rsa.pub compute0:/root/.ssh/authorized_keys
# ...The same with the other nodes...
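To verify that the key-based access works, we can run a command on each node from the controller; it should complete without asking for a password:

root@controller:# ssh compute0 hostname
compute0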

After doing this, we are ready to install everything, starting with NFS, which will be covered in the next post.