GWU cluster (Corcoran Hall)

If you plan to run parallel jobs with openmpi or mvapich2, load the corresponding modules in your .cshrc file; for openmpi, for example, add module load openmpi/gnu. If you plan to use cuda v3.1, also load the cuda31 module. (cuda v3.2 is installed as well, but our codes are not yet compatible with it.)
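
For instance, the relevant lines in ~/.cshrc could look like the following sketch. It uses only the two module names mentioned above; drop the cuda line if you do not run gpu codes.

```csh
# ~/.cshrc fragment (sketch)
module load openmpi/gnu   # openmpi, used by the parallel jobs
module load cuda31        # cuda v3.1; v3.2 is not yet supported by our codes
```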

The scheduling system on the cluster is Sun Grid Engine v6.2u5 (sge). For infiniband we have installed OFED together with openmpi (v1.5) and mvapich2 (v1.5.1). We prefer openmpi because it is tightly integrated with the scheduler: the mpi processes are killed when you delete a job from the scheduler. A simple job script is given below:

#$ -S /bin/csh
#$ -cwd
#$ -j y
#$ -o job_output.log$JOB_ID
##$ -q all.q@@node
#$ -q all.q@@gpu
#$ -l gpu_count=1
#$ -pe openmpi 8
#$ -l h_rt=01:30:00

mpirun -n $NSLOTS --mca btl_openib_flags 1 test_dslash_multi_gpu < check.in >& output_np${NSLOTS}.log

If your job uses gpus, request them with -l gpu_count=1 (one gpu per process) and pass --mca btl_openib_flags 1 to mpirun so that openmpi and cuda work together. For cpu-only codes the flag is not necessary. Both the new and the old nodes are in the all.q queue: use all.q@@gpu to run on the new nodes or all.q@@node to run on the old ones.
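
For comparison, a cpu-only job on the old nodes might look like the sketch below. It drops the gpu_count request and the btl_openib_flags option and selects all.q@@node; the executable my_cpu_code and the input file input.in are placeholders, not real programs on the cluster.

```csh
#$ -S /bin/csh
#$ -cwd
#$ -j y
#$ -o job_output.log$JOB_ID
#$ -q all.q@@node
#$ -pe openmpi 8
#$ -l h_rt=01:30:00

# my_cpu_code and input.in are placeholders for your own program and input.
mpirun -n $NSLOTS my_cpu_code < input.in >& output_np${NSLOTS}.log
```

Assuming the script is saved as cpu_job.csh (a hypothetical name), it can be submitted with qsub cpu_job.csh, monitored with qstat, and removed with qdel followed by the job id.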

Queue admin

Disable/enable nodes

If you want to reboot a node while there are pending jobs in the queue, you should disable it from the queue first:

qmod -d \*@gpu05

This disables node gpu05 in all queues. The * is escaped so that the shell does not expand it. Once the node has rebooted, re-enable it with

qmod -e \*@gpu05
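
The disable/enable pair can also be wrapped in a small helper. The following is only a sketch in POSIX sh: the function name node_queue_state and the QMOD override (used for a dry run) are hypothetical and not part of sge. It quotes the pattern instead of escaping it, which equally keeps the shell from expanding *.

```sh
#!/bin/sh
# Hypothetical helper (not part of sge): disable or enable one node in all queues.
# Set QMOD="echo qmod" to print the command instead of running it (dry run).
node_queue_state() {
    action=$1   # "disable" or "enable"
    node=$2     # node name, e.g. gpu05
    if [ -z "$node" ]; then
        echo "usage: node_queue_state disable|enable NODE" >&2
        return 1
    fi
    case "$action" in
        disable) flag=-d ;;
        enable)  flag=-e ;;
        *) echo "usage: node_queue_state disable|enable NODE" >&2
           return 1 ;;
    esac
    # Quoting "*@$node" passes the literal * to qmod, like \*@gpu05 above.
    ${QMOD:-qmod} "$flag" "*@$node"
}
```

For example, run node_queue_state disable gpu05 before the reboot and node_queue_state enable gpu05 afterwards.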

Setting up parallel environments

To add, remove, or modify a parallel environment (pe), use qconf. For example, qconf -spl lists all available pe's and qconf -sp openmpi shows the configuration of the openmpi pe. To modify it, use qconf -mp openmpi. The important parameters are start_proc_args and stop_proc_args, which let you run setup before and cleanup after execution, and allocation_rule, which controls how slots are distributed across nodes and can be set to, for example, $round_robin or $fill_up.
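
As an illustration of these parameters, the sketch below writes a pe definition to a file, which an sge administrator could then load non-interactively with qconf -Mp (or qconf -Ap for a new pe) instead of editing it in qconf -mp. The file name and all values are assumptions modeled on a typical tightly integrated openmpi setup, not this cluster's actual configuration; note that sge names the post-execution hook stop_proc_args.

```sh
# Write an illustrative pe definition.
# The quoted 'EOF' keeps $round_robin literal instead of expanding it.
cat > openmpi_pe.conf << 'EOF'
pe_name            openmpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
EOF

# To apply it (as an sge admin), uncomment one of:
# qconf -Mp openmpi_pe.conf   # modify the existing openmpi pe
# qconf -Ap openmpi_pe.conf   # add it as a new pe
```

With $round_robin the slots of a job are handed out one host at a time across the available hosts, whereas $fill_up fills each host before moving on to the next.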
