Running MPI applications on a Torque/Maui cluster.

MPI (the Message Passing Interface) is a set of libraries and tools that is commonly used to write parallel applications. MPI provides the framework for efficient communication between the cooperating processes of a parallel program. From a user’s perspective, the mpiexec command is used to start a program that has been parallelised with MPI. Since it is so widely used, MPI has many implementations, and vendors such as Intel sell optimised MPI libraries. For our purposes we will be using Open MPI. Install it with:

yum install -y openmpi openmpi-devel
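
Before involving the scheduler at all, it is worth a quick sanity check that mpiexec can start processes on the head node. The sketch below assumes the openmpi-x86_64 environment module name provided by the CentOS/Red Hat openmpi packages (environment modules are covered in point 1 further down):

# load the Open MPI environment module (name assumed from the CentOS/Red Hat packages)
module add openmpi-x86_64

# start two copies of a trivial command; both should print the head node's hostname
mpiexec -n 2 hostname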

We can use wcd as a demo application because it can be built with MPI support. This support is enabled using the --enable-mpi flag to configure. Recompile and reinstall it on your cluster head node, making sure to install to the same path as before:

cd wcd-express-0.6.3
make clean
./configure --prefix=/cluster/software/wcd-0.6.3 --enable-mpi
make
sudo make install
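
To confirm that the reinstalled binary really was built against Open MPI, one quick check is to inspect its shared library dependencies. The bin/wcd path under the install prefix is an assumption here; adjust it if your layout differs:

# the MPI-enabled binary should list the Open MPI libraries among its dependencies
ldd /cluster/software/wcd-0.6.3/bin/wcd | grep -i mpi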

To run an MPI-enabled program in Torque you need to specify three things:

  1. The mpiexec command isn’t in the default PATH, but is available as an environment module, so you need to add it with module add openmpi-x86_64.

  2. The number of nodes and processors you want to use. This is specified using the -l nodes=X:ppn=Y syntax, where X is the number of nodes you want to use and Y is the number of processors you want to use on each node. For example, if you want to run on 2 machines and use 4 processors on each machine, use -l nodes=2:ppn=4. Note that you must ensure that you have enough resources available on your cluster; if you specify ppn=8 but a node only has 4 processor cores, your job will be stuck in the queued (waiting) state forever.

  3. The mpiexec command line. Torque provides a hostfile that specifies which machines the job should run on. The location of this file is passed to the script in the environment variable PBS_NODEFILE. This must be passed to the mpiexec command (with the -hostfile flag), and MPI will then connect to the specified machines (typically using ssh) to start the program in parallel. If the program that you’re running is found using environment modules, you need to tell mpiexec to export the PATH (and any other settings set by your environment module). This is done with the -x flag. (A small sketch showing what the hostfile contains follows this list.)
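
If you are curious about what Torque actually writes into that hostfile, a throwaway job like the sketch below (the script name nodes.sh is just an example) prints it. Expect one line per allocated processor slot, so nodes=2:ppn=2 produces four lines in total:

#!/bin/sh

# show the raw hostfile: one line per processor slot allocated to this job
cat $PBS_NODEFILE

# summarise it as "slot count  hostname" per node
sort $PBS_NODEFILE | uniq -c

Submit it with, for example, qsub -l nodes=2:ppn=2 nodes.sh and look at the .o output file that Torque writes when the job finishes.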

Here is a sample script, runwcd-mpi.sh, that can be used to run wcd in parallel:

#!/bin/sh

# load the MPI tools and the wcd application from environment modules
module add openmpi-x86_64
module add wcd

# run wcd across the nodes listed in the Torque-generated hostfile,
# exporting PATH so the remote processes can find the wcd binary
mpiexec -x PATH -hostfile $PBS_NODEFILE wcd -c benchmark10000.seq

This would be submitted with, for example:

qsub -l nodes=2:ppn=2 -d $(pwd) runwcd-mpi.sh

You can also set Torque options inside the script itself by using #PBS lines. For example:

#!/bin/sh

#PBS -l nodes=2:ppn=2
#PBS -d /home/pvh/software/sources

module add openmpi-x86_64
module add wcd

mpiexec -x PATH -hostfile $PBS_NODEFILE wcd -c benchmark10000.seq

Then you can just submit the script with:

qsub runwcd-mpi.sh
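
Any other qsub option can be embedded in the same way. As a sketch, the directives below give the job a readable name, cap its walltime and merge the error stream into the output file; these are standard Torque/PBS options, and anything passed on the qsub command line takes precedence over the corresponding #PBS line in the script:

# give the job a recognisable name in qstat output
#PBS -N wcd-mpi

# request a maximum run time of one hour
#PBS -l walltime=01:00:00

# merge standard error into the standard output file
#PBS -j oe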

To see that your job is running, use qstat; to see which nodes it is running on, use qstat -f jobid, where jobid is the ID of the running job. Here is a sample use of these commands:

[pvh@sl1 sources]$ qsub runwcd-mpi.sh
32.sl1.sanbi.ac.za
[pvh@sl1 sources]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
32.sl1                     runwcd-mpi.sh    pvh                    0 R batch          
[pvh@sl1 sources]$ qstat -f 32.sl1
Job Id: 32.sl1.sanbi.ac.za
    Job_Name = runwcd-mpi.sh
    Job_Owner = pvh@sl1.sanbi.ac.za
    job_state = R
    queue = batch
    server = sl1.sanbi.ac.za
    Checkpoint = u
    ctime = Wed Feb 11 07:31:54 2015
    Error_Path = sl1.sanbi.ac.za:/home/pvh/software/sources/runwcd-mpi.sh.e32
    exec_host = sl2.sanbi.ac.za/1+sl2.sanbi.ac.za/0+sl1.sanbi.ac.za/1+sl1.sanb
    i.ac.za/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Feb 11 07:31:55 2015
    Output_Path = sl1.sanbi.ac.za:/home/pvh/software/sources/runwcd-mpi.sh.o32

    Priority = 0
    qtime = Wed Feb 11 07:31:54 2015
    Rerunable = True
    Resource_List.neednodes = 2:ppn=2
    Resource_List.nodect = 2
    Resource_List.nodes = 2:ppn=2
    Resource_List.walltime = 01:00:00
    substate = 42
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=sl1.sanbi.ac.za,
    PBS_O_HOME=/home/pvh,PBS_O_LANG=en_ZA.UTF-8,PBS_O_LOGNAME=pvh,
    PBS_O_PATH=/usr/lib64/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bi
    n:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/pvh/bin,
    PBS_O_MAIL=/var/spool/mail/pvh,PBS_O_SHELL=/bin/bash,PBS_SERVER=sl1,
    PBS_O_INITDIR=/home/pvh/software/sources,
    PBS_O_WORKDIR=/home/pvh/software/sources
    euser = pvh
    egroup = pvh
    hashname = 32.sl1.sanbi.ac.za
    queue_rank = 3
    queue_type = E
    etime = Wed Feb 11 07:31:54 2015
    submit_args = runwcd-mpi.sh
    start_time = Wed Feb 11 07:31:55 2015
    Walltime.Remaining = 3594
    start_count = 1
    fault_tolerant = False
    submit_host = sl1.sanbi.ac.za
    init_work_dir = /home/pvh/software/sources

If you look at the exec_host line in the output above, you can see that the job is running on two CPUs of node sl2.sanbi.ac.za and two CPUs of node sl1.sanbi.ac.za.
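
Point 2 above warned against requesting more processors per node than a node actually has. To check what is available before choosing a ppn value, the Torque pbsnodes command lists every node together with its state and processor count (the np field):

# list all nodes known to the Torque server, including np (processor slots) and state
pbsnodes -a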
