SLURM Demo on AWS Ubuntu EC2 instance
29 May 2024
It is recommended to go through Introduction to SLURM before reading this post
Steps for Spinning up EC2 instance
- Either do it interactively via the AWS Console in the browser, or
- Use the AWS CLI to spin up the instance
- Use AWS IAM to create an Access Key and configure the AWS CLI using
aws configure
- Then create an EC2 instance and tag it using the following commands
aws ec2 run-instances --image-id ami-0cf2b4e024cdb6960 --count 1 --instance-type t3.small --key-name XXXXX --security-group-ids sg-XXXXXXXXXX --subnet-id subnet-XXXXXXXX
aws ec2 create-tags --resources i-aaabb3234dd --tags Key=Name,Value=slurm-01
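Before moving on, it can help to confirm the instance is actually running. This check is an extra step not in the original walkthrough, and the instance ID used here is the same illustrative one tagged above:
aws ec2 describe-instances --instance-ids i-aaabb3234dd --query 'Reservations[].Instances[].State.Name' --output text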
Final check: inspect the OS on the instance
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
Steps for Setting up Slurm on EC2 instance
Update GRUB to disable SELinux
sudo vim /etc/default/grub
sudo update-grub
sudo reboot
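The post does not show the edit itself. One common way to disable SELinux from the kernel command line, assuming that is the intent of the GRUB edit, is to add selinux=0 to GRUB_CMDLINE_LINUX in /etc/default/grub, so the line ends up looking something like this:
GRUB_CMDLINE_LINUX="selinux=0"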
Install the slurm packages
sudo apt update -y
sudo apt install slurmd slurmctld -y
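As an extra sanity check (not part of the original steps), you can confirm the daemons installed correctly and see which Slurm version the Ubuntu packages ship:
slurmd -V
slurmctld -V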
Commands to figure out the system's hardware => RAM and Cores (CPU)
Command to help with finding the number of cores
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
.........
.........
Another command that lists per-processor/core information
cat /proc/cpuinfo
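If only the count is needed, a shorter option works as well (an extra tip, not in the original post):
# prints the number of processing units available
nproc
# equivalently, count the processor entries in /proc/cpuinfo
grep -c ^processor /proc/cpuinfo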
Command to help with finding available memory (RAM)
$ sudo dmidecode --type memory
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.
Handle 0x0008, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Unknown
Maximum Capacity: 2 GB
Error Information Handle: Not Provided
Number Of Devices: 1
Handle 0x0009, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x0008
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2 GB
Form Factor: DIMM
Set: None
Locator: Not Specified
Bank Locator: Not Specified
Type: DDR4
Type Detail: Static Column Pseudo-static Synchronous Window DRAM
Speed: 2933 MT/s
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Rank: Unknown
Configured Memory Speed: Unknown
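dmidecode reports the installed DIMMs, while the value Slurm cares about later (RealMemory in slurm.conf) is closer to the usable memory the OS reports. Checking it with free is an additional step not in the original post:
free -m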
Create the slurm config file at /etc/slurm/slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=localcluster
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# COMPUTE NODES
NodeName=localhost CPUs=2 RealMemory=1910 State=UNKNOWN
PartitionName=localhost Nodes=ALL Default=YES MaxTime=INFINITE State=UP
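The CPUs and RealMemory values in the NodeName line should line up with the hardware found earlier. slurmd can print the node definition it detects on its own, which makes a convenient cross-check (an extra step, not in the original walkthrough):
sudo slurmd -C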
Restart the slurmctld, slurmd, and munge services
sudo systemctl start slurmctld && sudo systemctl start slurmd && sudo systemctl start munge
sudo scontrol update nodename=localhost state=idle
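Optionally (this is not covered in the original steps), enable the services so they come back automatically after a reboot:
sudo systemctl enable slurmctld slurmd munge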
Checking the resources using sinfo --partition localhost
$ sinfo --partition localhost
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
localhost* up infinite 1 idle localhost
Testing Job Submission
Create sample_submission.sh
#!/bin/bash
#SBATCH --job-name=sample_job
#SBATCH --partition=localhost
#SBATCH --time=10:00
#SBATCH --ntasks=1
echo "First sample job running on localhost."
echo "Job Done well .. Exiting!"
exit 0
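By default the job's stdout lands in slurm-<jobid>.out in the submission directory, as seen below. If a custom name is preferred, an optional directive like this one could be added to the script (not part of the original script):
# %j in the filename expands to the job ID
#SBATCH --output=sample_job_%j.out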
Submit Job and Check output
The 2 in the output indicates the job_id in the system
$ sbatch sample_submission.sh
Submitted batch job 2
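While a job is queued or running it can be watched with squeue. This is an extra check beyond the original post, and a job this short will likely have finished before the command runs:
squeue --partition=localhost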
Inspect the job
The output also includes the StdOut path, which holds the output from the job since we used echo statements to print to stdout
$ scontrol show jobid 2
JobId=2 JobName=sample_job
UserId=ubuntu(1000) GroupId=ubuntu(1000) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=(null)
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2024-05-30T01:20:22 EligibleTime=2024-05-30T01:20:22
AccrueTime=2024-05-30T01:20:22
StartTime=2024-05-30T01:20:23 EndTime=2024-05-30T01:20:23 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-05-30T01:20:23 Scheduler=Backfill
Partition=localhost AllocNode:Sid=ip-172-31-21-208:1046
ReqNodeList=(null) ExcNodeList=(null)
NodeList=localhost
BatchHost=localhost
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=953M,node=1,billing=1
AllocTRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/ubuntu/sample_submission.sh
WorkDir=/home/ubuntu
StdErr=/home/ubuntu/slurm-2.out
StdIn=/dev/null
StdOut=/home/ubuntu/slurm-2.out
Power=
Checking the output
$ cat /home/ubuntu/slurm-2.out
First sample job running on localhost.
Job Done well .. Exiting!
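As a final quick check (an extra step beyond the original walkthrough), a command can also be run interactively through Slurm with srun:
srun --partition=localhost hostname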