Running MPI applications on a Torque/Maui cluster.
MPI (the Message Passing Interface) is a set of libraries and tools that is commonly used to write parallel applications. MPI provides the framework and configuration to enable efficient communication between cooperating processes. From a user’s perspective, the mpiexec command is used to start a program that is parallelised with MPI. Since it is so widely used, MPI has many implementations and vendors such as Intel sell optimised MPI libraries. For our purposes we will be using openmpi. Install this with:
yum install -y openmpi openmpi-devel
We can use wcd as a demo application because it can be built with MPI support. This support is enabled using the –enable-mpi flag to configure. Recompile and reinstall it on your cluster head node, making sure to install to the same path as before:
cd wcd-express-0.6.3 make clean ./configure --prefix=/cluster/software/wcd-0.6.3 --enable-mpi make sudo make install
To run a MPI-enabled program in Torque you need to specify three things:
The mpiexec command isn’t in the path, but is available as an environment module so you need to add it with module add openmpi-x86_64.
The number of nodes and processors you want to use. This is specified using the -l nodes=X:ppn=Y syntax, where X is the number of nodes you want to use and Y is the number of processors you want to use on each node. For example, if you want to run on 2 machines and use 4 processors on each machine, you can use -l nodes=2:ppn=4. Note that you must ensure that you have enough resources available on your cluster; if you specify ppn=8 but there are only 4 processor cores available on the node, your job will be stuck in waiting state forever.
The mpiexec command line. Torque provides a hostfile that specifies which machines the job should run on. The location of this is passed to the script in the environment variable PBS_NODEFILE. This must be passed to the mpiexec command (with the -hostfile flag) and then MPI will connect to the specified machines (typically using ssh) to start the program in parallel. If the program that you’re running is found using environment modules, you need to tell mpiexec to export the PATH (and any other settings set by your environment module). This is done with the -x flag.
Here is a sample script,
runwcd-mpi.sh, that can be used to run wcd in parallel:
#!/bin/sh module add openmpi-x86_64 module add wcd mpiexec -x PATH -hostfile $PBS_NODEFILE wcd -c benchmark10000.seq
This would be started with e.g.:
qsub -l nodes=2:ppn=2 -d $(pwd) runwcd-mpi.sh
You can also set Torque settings inside the script by using #PBS lines. E.g.
#!/bin/sh #PBS -l nodes=2:ppn=2 #PBS -d /home/pvh/software/sources module add openmpi-x86_64 module add wcd mpiexec -x PATH -hostfile $PBS_NODEFILE wcd -c benchmark10000.seq
Then you can just submit the script with:
To see that your job is running use qstat and to see which nodes it is running on use qstat -f jobid where jobid is the ID of the running job. This is a sample use of these commands:
[pvh@sl1 sources]$ qsub runwcd-mpi.sh 32.sl1.sanbi.ac.za [pvh@sl1 sources]$ qstat Job ID Name User Time Use S Queue ------------------------- ---------------- --------------- -------- - ----- 32.sl1 runwcd-mpi.sh pvh 0 R batch [pvh@sl1 sources]$ qstat -f 32.sl1 Job Id: 32.sl1.sanbi.ac.za Job_Name = runwcd-mpi.sh Job_Owner = email@example.com job_state = R queue = batch server = sl1.sanbi.ac.za Checkpoint = u ctime = Wed Feb 11 07:31:54 2015 Error_Path = sl1.sanbi.ac.za:/home/pvh/software/sources/runwcd-mpi.sh.e32 exec_host = sl2.sanbi.ac.za/1+sl2.sanbi.ac.za/0+sl1.sanbi.ac.za/1+sl1.sanb i.ac.za/0 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = a mtime = Wed Feb 11 07:31:55 2015 Output_Path = sl1.sanbi.ac.za:/home/pvh/software/sources/runwcd-mpi.sh.o32 Priority = 0 qtime = Wed Feb 11 07:31:54 2015 Rerunable = True Resource_List.neednodes = 2:ppn=2 Resource_List.nodect = 2 Resource_List.nodes = 2:ppn=2 Resource_List.walltime = 01:00:00 substate = 42 Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=sl1.sanbi.ac.za, PBS_O_HOME=/home/pvh,PBS_O_LANG=en_ZA.UTF-8,PBS_O_LOGNAME=pvh, PBS_O_PATH=/usr/lib64/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bi n:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/pvh/bin, PBS_O_MAIL=/var/spool/mail/pvh,PBS_O_SHELL=/bin/bash,PBS_SERVER=sl1, PBS_O_INITDIR=/home/pvh/software/sources, PBS_O_WORKDIR=/home/pvh/software/sources euser = pvh egroup = pvh hashname = 32.sl1.sanbi.ac.za queue_rank = 3 queue_type = E etime = Wed Feb 11 07:31:54 2015 submit_args = runwcd-mpi.sh start_time = Wed Feb 11 07:31:55 2015 Walltime.Remaining = 3594 start_count = 1 fault_tolerant = False submit_host = sl1.sanbi.ac.za init_work_dir = /home/pvh/software/sources
If you look at the exec_host line in the output above you can see that the job is running on 2 CPUs of node sl2.sanbi.ac.za and two CPUs of sl1.sanbi.ac.za.