Creating a multi-machine cluster with Torque/Maui

Implementing a multi-machine cluster

Building a cluster should be approached in stages.

  1. Make shared storage available. This must serve two purposes: provide shared home directories for the users on the cluster and provide a shared space to install applications into. (The second use is only necessary if you’re dealing with applications that can’t be installed using a package manager like yum.) Typically shared storage is made available using NFS, so the first task is: choose an NFS server, export a shared directory from it, and mount this shared directory in a consistent location on all cluster nodes.
  2. Ensure that users of the cluster exist across the entire cluster. A user (e.g. worker) should exist on each cluster node (the login/submit node and all the worker nodes) with the same user ID. The -u option of useradd can be used to set the user ID of a newly created user. Cluster users need to be able to log in to each cluster node without a password or any other prompt. This typically means using ssh public key-based login and either disabling ssh host key checking or gathering the ssh host keys of all cluster nodes (using ssh-keyscan). The home directories of cluster users should be on shared storage; the location of home directories can be set with the -b (base dir) flag to useradd. Create at least one user that exists across the cluster and whose home directory is on shared storage. Verify that they can log in to each cluster machine, that they can create files on each machine, and that id shows the same user ID on each machine.
  3. Install Torque and Maui. In our example the roles of management node and login node are combined into a single head node. The pbs_server and maui daemons should run on the head node and the pbs_mom daemon should run on the worker nodes. Note that this means that the pbs_mom config file on each worker should refer to the head node, not to the worker itself. Also note that jobs should not run on the head node, so no pbs_mom daemon should run there. Install Torque and Maui and ensure that a user on the head node can submit a job that will run on the cluster.
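
The configuration these three stages call for can be sketched as a handful of shell functions that print the relevant lines rather than applying them. This is only an illustration: the hostnames (head, worker1), the /shared path, the subnet, the user ID 2001 and the np=4 core count are all assumptions, and the Torque file locations mentioned in the comments assume a default install under /var/spool/torque.

```shell
#!/bin/bash
# Sketch only: every value below is a hypothetical placeholder.
NFS_SERVER="head"            # assumed host exporting the shared directory
HEAD_NODE="head"             # assumed host running pbs_server and maui
SHARE="/shared"              # assumed shared directory, same path on all nodes
SUBNET="192.168.1.0/24"      # assumed cluster network
CLUSTER_UID=2001             # assumed user ID, identical on every node

# Stage 1: line for /etc/exports on the NFS server.
exports_line() {
    printf '%s %s(rw,sync)\n' "$SHARE" "$SUBNET"
}

# Stage 1: line for /etc/fstab on every other node.
fstab_line() {
    printf '%s:%s %s nfs defaults 0 0\n' "$NFS_SERVER" "$SHARE" "$SHARE"
}

# Stage 2: useradd command to run as root on every node.
# -u fixes the user ID, -b puts the home directory under the shared base dir.
useradd_command() {
    printf 'useradd -u %s -b %s/home -m %s\n' "$CLUSTER_UID" "$SHARE" "$1"
}

# Stage 3: line for the pbs_mom config on each worker
# (mom_priv/config under the Torque home); it names the head node.
mom_config_line() {
    printf '$pbsserver %s\n' "$HEAD_NODE"
}

# Stage 3: line for the pbs_server nodes file on the head node
# (server_priv/nodes under the Torque home); np is the core count.
nodes_file_line() {
    printf '%s np=%s\n' "$1" "$2"
}

exports_line
fstab_line
useradd_command worker
mom_config_line
nodes_file_line worker1 4
```

Run as an ordinary user, the script only prints the lines so they can be reviewed. To apply them you would, as root, append each line to the file named in its comment and run the printed useradd command on every node; after editing /etc/exports on the NFS server, exportfs -ra re-reads it.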

Notes on resources you find useful and problems you’ve encountered should be added to the shared notebook. An introduction to running jobs on a Torque/Maui cluster can be found here.

Scripts submitted to Torque can include options within them as #PBS directives. For example, the -d option specifies the working directory of a script, and it can be included within the script:

#!/bin/bash
#PBS -d /workspace/worker1/myproject
echo "I am working in $(pwd)"

(This would ensure that when you qsub this script its working directory is /workspace/worker1/myproject.)
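
As a slightly fuller sketch, the function below writes such a job script to a file with a few other common directives alongside -d. The job name, the resource request and the output file location are illustrative assumptions, not requirements of Torque.

```shell
#!/bin/bash
# Sketch: generate a Torque job script whose working directory is set with -d.
# The job name (hello) and the resource request are hypothetical examples.
make_job_script() {
    local workdir="$1" outfile="$2"
    cat > "$outfile" <<EOF
#!/bin/bash
#PBS -N hello
#PBS -l nodes=1:ppn=1
#PBS -j oe
#PBS -d $workdir
echo "I am working in \$(pwd) on \$(hostname)"
EOF
}

job_file="${TMPDIR:-/tmp}/hello_job.sh"
make_job_script /workspace/worker1/myproject "$job_file"

# Submit from the head node (needs a running Torque server):
#   qsub "$job_file"
# The same option can be given on the command line instead of in the script:
#   qsub -d /workspace/worker1/myproject "$job_file"
```

Here -N names the job, -l requests one core on one node, and -j oe merges standard error into standard output. Options given on the qsub command line take the same form as the #PBS lines in the script.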

In practice you will not want to configure a cluster manually; rather, you would script its configuration (using shell scripts, ssh, expect and ansible) or, even better, deploy the configuration with puppet. As an additional task, consider running the puppet master server on the head node, making the worker nodes puppet clients, and creating puppet manifests for cluster configuration such as NFS mount points, users, and shared ssh keys.
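
The simplest form of scripted configuration is a loop that runs the same command on every worker over ssh. The sketch below assumes three hypothetical worker hostnames and root login by ssh key; by default it only prints the commands it would run, and it is a starting point rather than a substitute for puppet or ansible.

```shell
#!/bin/bash
# Sketch: run one command on every worker node over ssh.
# The hostnames and the use of root@ are assumptions for illustration.
WORKERS="worker1 worker2 worker3"

# With DRY_RUN=1 (the default here) the ssh commands are printed, not executed.
on_all_workers() {
    local node
    for node in $WORKERS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh root@$node $*"
        else
            ssh "root@$node" "$@"
        fi
    done
}

# Example: check that the worker user has the same user ID everywhere.
on_all_workers id worker
```

Setting DRY_RUN=0 in the environment makes the function actually connect to each node, which requires the password-less ssh login described in stage 2.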
